A speech analysis/synthesis method aims at representing a speech waveform, produced by a person speaking, as a time sequence of parameters (see Figure below). From this sequence of parameters, the speech waveform can be reconstructed. Analysis/synthesis methods are cornerstones of many speech technologies (e.g. text-to-speech, telecommunications, voice restoration). For most applications, these methods need two key properties: (i) a high perceived quality of the speech sound, and (ii) a statistical characterization of the parameter sequence, which is necessary for the statistical approaches that have attracted great interest in speech technologies over the last decades. However, the current analysis/synthesis methods that provide a statistical characterization lack perceived quality. This is not a problem in applications designed for noisy environments (e.g. navigation devices, smartphone applications, announcements in train stations). It does, however, prohibit the use of statistical approaches in quiet environments, e.g. in the music, cinema and video game industries, where the listener is fully aware of every detail of the sound. This problem is mainly due to the lack of an accurate representation of the residual information and of its correlation with the spectral amplitude information. The primary goal of the HQSTS project is to create a high-quality analysis/synthesis method that will broaden the applications of statistical approaches in speech technologies to quiet environments, where high quality is an absolute necessity.
When we speak, we make the air around us vibrate. This vibration can be recorded, and we call the recording a waveform (top plot of the figure below, with the corresponding audio file above). As you can see in this plot, the vibration is strong in some parts and fairly quiet in many others.
The second plot from the top, called "F0", represents how high the voice is. This height is not only used to sing a melody! In speech, it also expresses the intonations that convey emotions in most Latin and Germanic languages, and it discriminates phonemes in many Asian languages. This height is usually called the fundamental frequency, abbreviated F0.
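To give a concrete idea of what an F0 curve is made of, here is a minimal Python sketch that estimates the F0 of one short frame of sound by autocorrelation. This is only an illustration of the concept, not the F0 estimator used in the project; the frame length, search range and test tone are arbitrary choices for the example.

```python
import numpy as np

def estimate_f0(x, fs, fmin=70.0, fmax=400.0):
    """Estimate the F0 of one voiced frame by autocorrelation (illustration only)."""
    x = x - np.mean(x)
    # Autocorrelation for non-negative lags
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    # Restrict the search to plausible voice periods
    lag_min = int(fs / fmax)
    lag_max = int(fs / fmin)
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return fs / lag

# A 40 ms synthetic "voiced" frame: a 150 Hz fundamental plus one harmonic
fs = 16000
t = np.arange(int(0.04 * fs)) / fs
x = np.sin(2 * np.pi * 150 * t) + 0.5 * np.sin(2 * np.pi * 300 * t)
print(estimate_f0(x, fs))  # ≈ 150 Hz
```

Running such an estimator on successive frames of a recording produces exactly the kind of F0 curve shown in the second plot.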
The third, colorful plot of the figure shows the spectral amplitude. The warmer the color, the stronger the frequency components of the sound (e.g. if the lower region of the spectral amplitude is warmer than the higher regions, the sound is likely to sound dark rather than bright).
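A spectral amplitude plot like this one can be computed by cutting the waveform into short overlapping frames and taking the magnitude of each frame's spectrum. The short Python sketch below shows the idea; the frame length, hop size and window are arbitrary example values, not the project's settings.

```python
import numpy as np

def spectral_amplitude(x, fs, frame_len=512, hop=128):
    """Magnitude spectrogram: one amplitude spectrum per windowed frame (illustration)."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * win
                       for i in range(n_frames)])
    # Rows: time frames; columns: frequency bins up to fs/2
    return np.abs(np.fft.rfft(frames, axis=1))

# One second of a 440 Hz tone: the spectrogram should peak near 440 Hz
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)
S = spectral_amplitude(x, fs)
peak_bin = S.mean(axis=0).argmax()
print(peak_bin * fs / 512)  # 437.5, the frequency bin closest to 440 Hz
```

Plotting this matrix with a warm-to-cold color map gives the kind of image shown in the third plot.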
Once the F0 and the spectral amplitude of a speech sound are given, only a residual remains, which determines the noisiness, crispiness and granularity of the sound. Whereas we have strong candidates for representing the F0 and the spectral amplitude mathematically (as you can see in the figure: a curve and a smooth surface, respectively), we do not have such a reliable representation for this residual.
During the HQSTS project, most of our research work focused on how to represent this element. We suggested a noise mask that tells us whether a time-frequency point is noisy or not. In the fourth and last plot of the figure below, the black regions represent noise and the white ones represent non-noisy regions. We rely on the time and frequency resolutions of this noise mask to model different extents of noise, crispiness and granularity. During the last 50 years, many techniques have been suggested to represent this residual. In this research, we tried to suggest a new one that offers high-quality synthesis (see below) while offering many technical advantages that are beyond the scope of this summary.
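As a loose illustration of what a binary noise mask is (and only that; this is not the mask-estimation algorithm developed in the project), the Python sketch below builds one over a small synthetic magnitude spectrogram: bins that fall well below the frame maximum are treated as noise-dominated, while the strong "harmonic" bins are not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic magnitude spectrogram: 10 time frames x 64 frequency bins.
# A weak noise floor everywhere, plus three strong bins standing in for harmonics.
S = 0.05 * rng.random((10, 64))
S[:, [5, 10, 15]] += 1.0

# Toy mask: True = noisy bin, defined as magnitude below 10% of the frame maximum
mask = S < 0.1 * S.max(axis=1, keepdims=True)

print(mask[:, 5].any())   # False: the harmonic bins are never marked noisy
print(mask.mean() > 0.9)  # True: the vast majority of bins are noise
```

Displayed as an image with noisy bins in black, such a matrix looks like the fourth plot of the figure.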
Through analysis of a waveform, the features, namely F0, spectral amplitude and noise mask, are extracted using signal processing algorithms. We then designed a synthesiser that makes use of these features to reconstruct the waveform. The sequence of sounds below shows the reconstruction, adding each element one after the other:
|F0 alone ...|
|... with spectral amplitude|
|... with noise mask|
|The original recording|
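The layered reconstruction above can be mimicked, very crudely, by a toy harmonic-plus-noise synthesis in Python. This is only a sketch under strong simplifying assumptions (a constant F0, a hand-made spectral envelope, and a single noise level standing in for the noise mask); it is not the HQSTS synthesiser.

```python
import numpy as np

def toy_synthesis(f0, amp_env, noise_level, fs=16000, dur=0.5, seed=0):
    """Toy harmonic-plus-noise reconstruction (illustration only):
    harmonics at multiples of f0, each scaled by a coarse spectral
    envelope, plus white noise standing in for the noise-mask regions."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(dur * fs)) / fs
    y = np.zeros_like(t)
    # Sum all harmonics that fit below the Nyquist frequency
    for k in range(1, int((fs / 2) // f0) + 1):
        y += amp_env(k * f0) * np.sin(2 * np.pi * k * f0 * t)
    y += noise_level * rng.standard_normal(len(t))
    return y / np.max(np.abs(y))  # normalize to avoid clipping

# Coarse spectral envelope: energy decaying with frequency (a "dark" sound)
env = lambda f: np.exp(-f / 1000.0)
y = toy_synthesis(120.0, env, noise_level=0.05)
```

Dropping the envelope reproduces the "F0 alone" step, and setting `noise_level` to zero reproduces the "with spectral amplitude" step, which is the spirit of the sound sequence above.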
So, why represent this waveform by those so-called features in order to reconstruct the exact same sound? Because the waveforms created by a person speaking can hardly be represented, as is, in computers for making artificial voices (this limitation has recently been lifted by Google's DeepMind team with WaveNet, but Google's synthesis still uses spectral features for its statistical synthesizers). Features, however, have interesting mathematical characteristics that make them very friendly for Deep Learning or Gaussian Mixture Models. By using an intermediate representation of the speech sounds, the features can be modelled, transformed and sculpted, like a graphic artist would manipulate the colors of a picture in a graphic editor.
These features, as well as the synthesiser, are obviously key elements of these technologies, since the audio quality fully depends on them. The features have to represent all of the sound characteristics that we can perceive, and they should not represent more than that; otherwise, the computer will start modelling and processing elements that we cannot even perceive.
Have a look at the research tasks if you want to know more!
This project is receiving funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 655764.