HQSTS

High-Quality voice model for STatistical parametric speech Synthesis

We describe here the various tasks we have been dealing with during the research project.

A brief voiced summary

From a voice production perspective, the voice can be decomposed into several main components. The first one is the vibration of the vocal folds, which modulates the air coming from the lungs, generating pulses of air. Most of this modulation is fairly regular, which generates a fundamental frequency (F0), and the waveform samples of these air pulses are quite predictable. Because of this determinism in the behavior of this modulated air, we call it the deterministic component. Not all of the voice's waveform is deterministic, of course, because a lot of noise is generated during speech. This noise might come from the irregular vibration of the vocal folds as well as from the turbulence of the air through the whole vocal apparatus. A whispered or breathy voice is created by increasing the turbulence of the air at the level of the vocal folds. A fricative (e.g. the /s/ of the word "simple") is created by air turbulence at the tip of the tongue.

In general, the modulation of the air by the vocal folds is fairly regular, so that a fundamental frequency (F0) is perceived as a melody. When this F0 drops very low, the vocal folds cannot hold a sustained and regular vibration and start "clapping" at irregular intervals, like a car engine that starts coughing at low gas. This irregular mode of phonation is called creaky voice (or vocal fry). From an audio analysis point of view, this phonation produces a quite troublesome waveform because it is just as deterministic as a modulated pulse of air, but the pulses happen irregularly. Many voice modelling techniques thus confuse these waveforms with noise, and the resulting syntheses sound hoarse.

Finally, some parts of speech are created neither by the vibration of the vocal folds nor by air turbulence. The plosives (e.g. the /p/ of "pie") are small bursts of noise with a very specific wave shape. All these voice components are not concatenated one after the other like puzzle pieces.
Each component merges with its predecessor and successor. These fuzzy transitions are called transients. Everything happens quite smoothly when an /i/ turns into an /o/, as in "Yo man". But when a plosive precedes three air pulses of a creaky voice segment that ends with another plosive, that's a whole different story. Finally, one might describe a myriad of other voice components; a research project being of limited duration (2 years here) with a clear focus on a single problem, we chose to address the components summarized above.
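The deterministic/noise decomposition described above can be illustrated with a toy numpy sketch (the function name and parameters are ours, purely for illustration, not part of the project's code): a regular pulse train at F0 stands for the deterministic component, and additive white noise stands for turbulence.

```python
import numpy as np

def synthesize_voiced(f0_hz=120.0, dur_s=0.5, fs=16000, noise_level=0.05):
    """Toy illustration of the deterministic/noise decomposition:
    a regular glottal pulse train at F0 (deterministic component)
    plus additive white noise (aspiration/turbulence)."""
    n = int(dur_s * fs)
    deterministic = np.zeros(n)
    period = int(fs / f0_hz)          # samples between glottal pulses
    deterministic[::period] = 1.0     # regular pulses -> perceived F0
    noise = noise_level * np.random.randn(n)
    return deterministic + noise

x = synthesize_voiced()
print(x.shape)  # (8000,)
```

Creaky voice would correspond to jittering the pulse positions irregularly, which is exactly what makes it hard to separate from noise.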

Deterministic component

The deterministic component is driven by a fundamental frequency (F0). The spectral amplitude envelope (3rd plot in the figure below) gives the shape of the glottal pulses. Despite suffering from averaging in statistical modelling, which is an issue common to all features, the F0 is fairly well estimated at the level of the waveform analysis (on a frame-by-frame basis, a frame being of ~25 ms duration). The F0 modelling suffers more from global issues related to prosody, which is out of the scope of this research. The spectral envelope, on the contrary, widely impacts the perceived quality at the level of the vocoder. As a preliminary work, we tried various spectral envelope estimation techniques, from very simple ones to more sophisticated ones. The main conclusion of this task is that the most simple and robust envelope is preferable for statistical modelling. Even though a sophisticated method might improve the overall precision of the spectral envelope estimate in many time-frequency regions, it usually produces a more complex statistical distribution for the models to capture, due to a larger number of critical errors. We therefore decided to focus on the next task, which consists of modelling noise, a component that currently needs more attention, while keeping in mind that the reconstruction of the deterministic component has to be preserved (one cannot improve the noise if it degrades the spectral amplitude).
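As an example of the "simple and robust" end of the estimation spectrum, a classic cepstral envelope can be computed in a few lines (this is a generic textbook method, sketched here for illustration; it is not necessarily the exact estimator used in the project):

```python
import numpy as np

def cepstral_envelope(frame, order=30):
    """Simple spectral amplitude envelope by cepstral liftering:
    keep only the first `order` cepstral coefficients, which
    smooths out the harmonic structure of the log spectrum."""
    spec = np.abs(np.fft.rfft(frame)) + 1e-12
    log_spec = np.log(spec)
    cep = np.fft.irfft(log_spec)           # real cepstrum (length = len(frame))
    cep[order:len(cep) - order] = 0.0      # low-pass lifter (keep symmetry)
    smooth_log = np.fft.rfft(cep).real
    return np.exp(smooth_log)

frame = np.random.randn(400)               # one ~25 ms frame at 16 kHz
env = cepstral_envelope(frame)
print(env.shape)  # (201,)
```

Its behavior is easy to characterize statistically, which is precisely what makes such simple estimators preferable for the modelling stage.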

Sustained fricatives and Breathiness in voiced segments

New phase and noise analysis tools have been developed recently. Specifically, the Phase Distortion Deviation (PDD) shows information that was lacking for developing a new noise model. We therefore took advantage of these new tools to focus on phase and noise modelling, suggesting a new analysis/synthesis process (i.e. a vocoder) for text-to-speech synthesis. Simplifications of models in speech processing often lead to less contextual processing (e.g. two different models for two different contexts, as for voiced and unvoiced segments) and thus fewer artefacts. With this perspective, we tried to address the fricatives the same way as the breathiness in voiced segments, by merging the two audio signal representations used for each component. In most current vocoders, voiced and unvoiced segments are dealt with either by different models or by similar models with discontinued parameter trajectories. In the PML vocoder developed during this project, the noise model is uniform in the sense that it models the voiced and unvoiced segments the exact same way, by turning time-frequency regions of deterministic content into noise (see bottom plot of the figure below). Contrary to traditional approaches, where strict limits are often drawn between voiced and unvoiced segments, the approach we took allows a completely independent change of noise state in any time-frequency region: the noise state is not forced to change from voiced to unvoiced all at once at a given time. It can smoothly appear in higher frequencies while leaving more space for determinism, as in a fading voice at the end of a sentence.
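The idea of a per-region noise state can be sketched as a soft time-frequency mask (a hypothetical illustration with made-up parameter names, not PML's actual mask estimation): unvoiced frames are fully noise, while voiced frames turn to noise only gradually above a cutoff frequency.

```python
import numpy as np

def noise_mask(voiced, n_bins=257, fs=16000, cutoff_hz=4000.0):
    """Hypothetical sketch of a time-frequency noise mask
    (1.0 = noise, 0.0 = deterministic). Unvoiced frames are all noise;
    voiced frames become noisy only above a cutoff, so the noise state
    varies per time-frequency region instead of switching all at once."""
    freqs = np.linspace(0.0, fs / 2.0, n_bins)
    mask = np.empty((len(voiced), n_bins))
    for t, v in enumerate(voiced):
        if v:
            # smooth sigmoid transition to noise above the cutoff
            mask[t] = 1.0 / (1.0 + np.exp(-(freqs - cutoff_hz) / 200.0))
        else:
            mask[t] = 1.0
    return mask

m = noise_mask([True, True, False])
print(m.shape)  # (3, 257)
```

Letting the cutoff drift downwards over time would reproduce the "fading voice" behavior: noise smoothly taking over the higher frequencies.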

[Figure with audio example: Waveform, F0, Spectral amplitude, Phase Distortion Deviation, Noise mask]
This property already existed in a work that we previously developed (the HMPD vocoder), but the signal model in the log domain used in PML happened to be of better quality than that of HMPD, as shown in PML's article. With PML's new signal model, we worked on a different way of combining the deterministic and noise components. In traditional speech modelling, the noise component is added in the linear domain:

$$S(t_i,\omega) = V(t_i,\omega) \cdot \big( e^{-j\omega t_i} \;\; \boldsymbol{+} \;\; N(t_i,\omega) \big)$$

with $$S(t_i,\omega)$$ the spectrum of the ith synthesized speech pulse and $$V(t_i,\omega)$$ the spectrum of the vocal tract resonances; the complex exponential $$e^{-j\omega t_i}$$ places the pulse on the time line and $$N(t_i,\omega)$$ is the spectrum of the noise source. Even though the traditional approach resembles the voice production mechanism (the air turbulence literally adds to the deterministic air pulses), it is not very convenient from the point of view of controlling the perceived characteristics of the voice, which is the goal targeted in speech synthesis. Indeed, with an addition in the linear domain, the amplitude spectrum is defined both by the ratio between the deterministic and noise sources and by the spectral amplitude envelope. With an addition in the log spectral domain, as in the PML vocoder, the amplitude spectrum is defined only by the spectral amplitude envelope; the remaining terms are purely imaginary and thus define the phase spectrum:

$$S(t_i,\omega) = V(t_i,\omega) \cdot e^{\big(-j\omega t_i \;\; \boldsymbol{+} \;\; j\angle N(t_i,\omega) \cdot M(t_i,\omega)\big)}$$

with $$M(t_i,\omega)$$ the noise mask. This drastically reduces the inter-dependence issues between parameters. This model has been shown to work very well in an analysis/resynthesis context (as shown in PML's article, Fig. 5) and improved the state of the art of parametric text-to-speech, as shown in PML's article.
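The key property of the log-domain combination can be checked numerically with a one-frame sketch of the two equations above (toy values; the function names are ours): in the log-domain model, the exponent stays purely imaginary, so the magnitude of the synthesized spectrum is exactly the envelope.

```python
import numpy as np

def combine_linear(V, N, t_i, omega):
    """Traditional model: noise added in the linear domain, so the
    amplitude spectrum depends on the deterministic/noise ratio."""
    return V * (np.exp(-1j * omega * t_i) + N)

def combine_log(V, N, M, t_i, omega):
    """PML-style model: the noise phase, weighted by the mask M, is
    added in the exponent, so the amplitude spectrum is V alone."""
    return V * np.exp(-1j * omega * t_i + 1j * np.angle(N) * M)

n_bins = 257
omega = np.linspace(0.0, np.pi, n_bins)    # normalized frequency axis
V = np.random.rand(n_bins) + 0.5           # toy spectral envelope
N = np.random.randn(n_bins) + 1j * np.random.randn(n_bins)
M = np.linspace(0.0, 1.0, n_bins)          # noise mask: noisier at high freq

S_log = combine_log(V, N, M, t_i=10.0, omega=omega)
print(np.allclose(np.abs(S_log), V))       # True: amplitude is the envelope
```

Running the same check on `combine_linear` fails: its amplitude spectrum mixes the envelope with the noise level, which is exactly the inter-dependence the log-domain model removes.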

Creaky voice

Creaky voice has been addressed in this project through a modification of the noise mask (bottom plot of the figure above) that is used in the PML vocoder. Detection of creakiness is realised implicitly through various 2D image processing operations on the pulseness properties of the signal. The PML noise mask is then modified according to this creaky voice detection (see figure below). The blue intervals indicate segments of creaky voice and the red interval shows a plosive.
[Figure: Waveform, Pulseness, Denoised pulseness, Thresholding and Closure, Noise mask correction]
Results of these experiments have been published as an extension of PML's analysis in PML's article. These results do not show a systematic improvement on the 6 voices tested, but show improvements on a sub-category of voice segments that are particularly creaky, i.e. it does improve the creaky voice segments as expected, but might also degrade some other segments.
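The 2D image-processing steps listed in the figure (denoising, thresholding, morphological closure) can be sketched with scipy on a toy pulseness map; this is a hypothetical reconstruction of the pipeline for illustration, with thresholds and structuring elements chosen arbitrarily.

```python
import numpy as np
from scipy.ndimage import binary_closing, median_filter

def creaky_mask_correction(pulseness, threshold=0.5):
    """Hypothetical sketch of the detection pipeline: denoise a 2D
    pulseness map, threshold it, then apply a morphological closure
    to bridge the small gaps between irregular creaky pulses. The
    resulting mask marks time-frequency regions to keep deterministic."""
    denoised = median_filter(pulseness, size=3)     # remove isolated spikes
    binary = denoised > threshold                   # thresholding
    closed = binary_closing(binary, structure=np.ones((3, 3)))  # closure
    return closed

pulseness = np.random.rand(50, 100)   # toy time-frequency pulseness map
mask = creaky_mask_correction(pulseness)
print(mask.shape, mask.dtype)
```

The closure step is what prevents the irregular creaky pulses from being fragmented into isolated blobs that would then be treated as noise.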

Plosives and Transients

Plosives and transients are both rapid variations of the speech signal along the time line. They are both properly reconstructed on a frame-by-frame basis, like the F0, and they both suffer tremendously from the averaging effects that appear during statistical modelling. Thus, in terms of vocoding technique, there was not much room for improvement.

In the past, the smooth behavior of the parameter trajectories was indirectly modelled by predicting the derivatives and static values of the parameters. Then, the Maximum Likelihood Parameter Generation (MLPG) algorithm was used to reconstruct a smooth trajectory. Its big advantage was to ensure continuity of the audio characteristics; its big disadvantage was to force continuity of the audio characteristics ... in other words, when continuity was necessary (e.g. between two vowels, like in a diphthong), it was realized, but, at the same time, the discontinuous characteristic of the plosives and transients was all scattered. With current Deep Neural Network (DNN) techniques, it is now possible to model the smooth trajectories of the parameters using Recurrent Neural Networks (RNN) (like Bidirectional Long Short-Term Memory (BLSTM)), which avoids the use of the MLPG algorithm.
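The static+delta parameterization that MLPG relies on is easy to sketch (a generic textbook construction, not code from the project): each frame is augmented with its first derivative, and MLPG later finds the trajectory most consistent with both. A BLSTM instead consumes the static features directly, so this augmentation step disappears.

```python
import numpy as np

def append_deltas(static):
    """Classic static+delta parameterization: augment each frame
    with its centered first difference (velocity), so that a model
    indirectly captures trajectory smoothness."""
    delta = np.zeros_like(static)
    delta[1:-1] = 0.5 * (static[2:] - static[:-2])   # centered difference
    return np.concatenate([static, delta], axis=1)

traj = np.random.randn(100, 60)   # 100 frames of 60-dim spectral features
augmented = append_deltas(traj)
print(augmented.shape)  # (100, 120)
```

Because MLPG must honor the predicted deltas everywhere, it smooths plosive boundaries just as much as diphthong transitions, which is the over-continuity problem described above.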

During this research project we studied and tested DNN techniques with the aim of improving the rapid time variations of plosives and transients. Recent progress in DNN reported in the literature showed that 2D convolutional layers are very efficient for modelling and generating images. We thus followed this lead and considered the spectral amplitude envelope as an image (see figure below).

To train a neural net model, the Least Square (LS) approach was commonly used up to a few years ago. The least square approach tells the neural net to predict the mean. In terms of statistical modelling, this implies a very crude simplification of the prediction of a value that obeys a probability distribution. Let's imagine a video game about darts. You have to face various opponents ... who systematically aim at the center, precisely. This game would be correct in the sense that, if we consider all sorts of dart players, the mean, the average point where they hit the dartboard, would indeed be the very center. Even though the game would be perfectly correct, it wouldn't be very playful for the gamer. The video game should not aim at the mean. It should generate hit points that look like those of a good dart player. In DNN, the Least Square (LS) solution predicts the mean. The Wasserstein Generative Adversarial Network (WGAN) predicts the spectral amplitude of a good speaker.
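The dart analogy can be made concrete with a toy numpy sketch (GAN training itself is replaced here by direct sampling from the data statistics, purely for illustration): the LS-optimal prediction collapses to the board center, while a generative model preserves the spread of real throws.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dart players: hit points scattered around the board center (0, 0).
hits = rng.normal(loc=0.0, scale=5.0, size=(10000, 2))

# The Least Square solution is the single value minimizing the mean
# squared error over all throws: it predicts the mean, i.e. the center.
ls_prediction = hits.mean(axis=0)

# A generative (GAN-like) model instead aims to sample from the same
# distribution as the data, preserving its spread.
generated = rng.normal(loc=ls_prediction, scale=hits.std(axis=0),
                       size=(10000, 2))

print(ls_prediction)                 # very close to the center (0, 0)
print(hits.std(), generated.std())   # comparable spreads
```

For spectral amplitudes, that spread is precisely the thin harmonic and noise structure that LS training averages away, producing the oversmoothing visible in the middle row of the figure below.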

The figure below first shows the original spectral amplitude as measured from an audio recording. The horizontal axis is time and the vertical axis is frequency. The second row shows the spectral amplitude generated by a BLSTM trained using the Least Square solution. The third row shows the spectral amplitude generated by 2D convolutional layers trained using WGAN. We can see that the thin structures in the mid and high frequencies in the bottom plot are more similar to those of the original, compared to the plot in the middle, which looks oversmoothed.
[Figure: spectral amplitudes — Original, BLSTM + LS, 2D Convolutional + WGAN]
[Audio examples: Original, PML Resynthesis, BLSTM + LS, 2D Convolutional + WGAN]

In the end, even though we were aiming at improving the plosives and transients, we can see that we also improved many aspects of the spectral amplitude.

The system used to generate these sounds has been published as open-source code on GitHub.com, called Percival.