High-Quality voice model for STatistical parametric speech Synthesis
Machine Intelligence Lab, University of Cambridge, UK.
Horizon 2020 - MSCA funding.
From a voice production perspective, the voice can be decomposed into a few main components. The first one is the vibration of the vocal folds, which modulates the air coming from the lungs and generates pulses of air. Most of this modulation is fairly regular, which produces a fundamental frequency (F0) that is perceived as a melody, and the waveform samples of these air pulses are quite predictable. Because of this determinism in the behavior of the modulated air, we call it the deterministic component. Not all of the voice's waveform is deterministic, of course, because a lot of noise is generated during speech. This noise might come from the irregular vibration of the vocal folds as well as from the turbulence of the air throughout the vocal apparatus. A whispered or breathy voice is created by increasing the air turbulence at the level of the vocal folds. A fricative (e.g. the /s/ of the word "simple") is created by air turbulence at the tip of the tongue. When F0 drops very low, the vocal folds cannot sustain a regular vibration and start "clapping" at irregular intervals, like a car engine that starts coughing at low throttle. This irregular mode of phonation is called creaky voice (or vocal fry). From an audio analysis point of view, this phonation produces a rather troublesome waveform: each pulse is just as deterministic as a regular modulated pulse of air, but the pulses occur irregularly. Many voice modelling techniques thus confuse these waveforms with noise, and the resulting syntheses sound hoarse. Finally, some parts of speech are created neither by the vibration of the vocal folds nor by air turbulence. The plosives (e.g. the /p/ of "pie") are small bursts of noise with a very specific waveform shape. These voice components are not simply concatenated one after the other like puzzle pieces.
Each component merges with its predecessor and successor. These fuzzy transitions are called transients. Everything happens quite smoothly when an /i/ turns into an /o/, as in "Yo man". But when a plosive precedes three air pulses of a creaky voice segment that ends with another plosive, that's a whole different story. Finally, one could describe a myriad of other voice components, but since a research project has a limited duration (2 years here) and a clear focus on a single problem, we chose to address the components summarized above.
The deterministic component is driven by a fundamental frequency (F0). The spectral amplitude envelope (3rd plot in the figure below) gives the shape of the glottal pulses. Although the F0 suffers from averaging in statistical modelling, an issue common to all features, it is fairly well estimated at the level of the waveform analysis (on a frame-by-frame basis, a frame being ~25 ms long). F0 modelling suffers more from global issues related to prosody, which is outside the scope of this research. By contrast, the spectral envelope strongly affects the perceived quality at the level of the vocoder. As preliminary work, we tried various spectral envelope estimation techniques, from very simple ones to more sophisticated ones. The main conclusion of this task is that the simplest and most robust envelope is preferable for statistical modelling. Even though a sophisticated method might improve the overall precision of the spectral envelope estimate in many time-frequency regions, its estimates usually follow a more complex statistical distribution, which is harder to model, due to a larger extent of critical errors. We therefore decided to focus on the next task, which consists of modelling noise, a component that currently needs more attention, while keeping in mind that the reconstruction of the deterministic component has to be preserved (improving the noise is of no use if it degrades the spectral amplitude).
New phase and noise analysis tools have been developed recently. Specifically, the Phase Distortion Deviation (PDD) shows information that was lacking for developing a new noise model. We therefore took advantage of these new tools to focus on phase and noise modelling and to propose a new analysis/synthesis process (i.e. a vocoder) for text-to-speech synthesis.
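To give an intuition of what a phase deviation measure captures, here is a minimal sketch of a circular standard deviation computed over a handful of phase values. It is only an illustration: the exact definition of the PDD used in the project is not given in this summary, and the function name `circular_std` and the example phase values are hypothetical.

```python
import cmath
import math

def circular_std(phases):
    """Circular standard deviation of a list of angles (in radians)."""
    # Mean resultant vector: average of the unit phasors e^{j*phase}.
    resultant = sum(cmath.exp(1j * p) for p in phases) / len(phases)
    r = min(abs(resultant), 1.0)  # guard against rounding slightly above 1
    if r == 0.0:
        return math.inf  # phasors cancel out exactly: no preferred direction
    return math.sqrt(-2.0 * math.log(r))

# Hypothetical phase values observed at one frequency over neighbouring frames.
stable = [0.50, 0.52, 0.49, 0.51, 0.50]    # nearly identical phases
scattered = [0.0, 2.1, -2.5, 1.0, -1.2]    # phases without any structure

low = circular_std(stable)
high = circular_std(scattered)
```

Under this assumption, a low deviation suggests a stable, deterministic region, whereas a high deviation suggests noise.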
Simplifying models in speech processing often leads to less context-dependent processing (i.e. it avoids using two different models for two different contexts, as for voiced and unvoiced segments) and thus to fewer artefacts. With this in mind, we tried to address fricatives the same way as breathiness in voiced segments, by merging the two audio signal representations used for each component. In most current vocoders, voiced and unvoiced segments are handled either with different models or with similar models whose parameter trajectories are discontinuous.
In the PML vocoder developed during this project, the noise model is uniform in the sense that it models the voiced and unvoiced segments the exact same way, by turning time-frequency regions of deterministic content into noise (see bottom plot of figure below).
Unlike traditional approaches, where strict limits are often used between voiced and unvoiced segments, the approach we took allows the noise state to change independently in any time-frequency region: the noise state is not forced to switch from voiced to unvoiced all at once at a given time. Noise can smoothly appear in the higher frequencies while leaving room for determinism, as in a voice fading at the end of a sentence.
[Figure: audio sample with five panels, from top to bottom: waveform, F0, spectral amplitude, Phase Distortion Deviation, noise mask.]
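The idea of a noise mask acting independently on each time-frequency region can be sketched as follows. This is not the PML implementation: it is a simplified illustration that assumes a binary mask, zero phase for deterministic bins and uniformly random phase for noisy bins. The function name `frame_spectra`, the envelope and the cutoff values are hypothetical.

```python
import numpy as np

def frame_spectra(amplitude, noise_mask, rng):
    """Build complex spectra: zero phase where the mask is 0, random phase where it is 1."""
    random_phase = rng.uniform(-np.pi, np.pi, size=amplitude.shape)
    phase = np.where(noise_mask > 0, random_phase, 0.0)
    return amplitude * np.exp(1j * phase)

rng = np.random.default_rng(0)
n_frames, n_bins = 4, 257

# Hypothetical spectral amplitude envelope, identical for every frame.
amplitude = np.tile(np.linspace(1.0, 0.1, n_bins), (n_frames, 1))

# Noise progressively invades lower frequencies, frame after frame,
# as in a voice fading at the end of a sentence.
cutoffs = [200, 150, 100, 50]
noise_mask = np.zeros((n_frames, n_bins))
for t, k in enumerate(cutoffs):
    noise_mask[t, k:] = 1.0

spectra = frame_spectra(amplitude, noise_mask, rng)
frames = np.fft.irfft(spectra, axis=-1)  # one time-domain frame per row
```

Because the mask is defined per bin and per frame, there is no single voiced/unvoiced switch: each row of `noise_mask` can have its own boundary, or several disjoint noisy regions.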
Creaky voice has been addressed in this project through a modification of the noise mask (bottom plot of the figure above) that is used in the PML vocoder. Creakiness is detected implicitly through various 2D image processing operations applied to the pulseness properties of the signal. The PML noise mask is then modified according to this creaky voice detection (see figure below). The blue intervals indicate segments of creaky voice and the red interval shows a plosive.
[Figure: five panels, from top to bottom: waveform, pulseness, denoised pulseness, thresholding and closure, noise mask correction.]
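The chain of steps named in the figure (denoising, thresholding, closure, mask correction) can be sketched as follows. This is a simplified illustration and not the project's code. It assumes that a pulseness measure is already available as a frames x bins array, that a median filter is used for denoising, and that the filtering and the morphological closing are applied along the time axis only. The threshold, the radius and the function names are hypothetical.

```python
import numpy as np

def shifted_stack(x, radius):
    """Stack copies of x shifted along time (axis 0) by -radius..radius, edge-padded."""
    padded = np.pad(x, ((radius, radius), (0, 0)), mode="edge")
    n = x.shape[0]
    return np.stack([padded[i:i + n] for i in range(2 * radius + 1)])

def correct_noise_mask(pulseness, noise_mask, threshold=0.5, radius=1):
    """Force the noise mask to 0 (deterministic) where pulses are detected."""
    # 1. Denoise: median filter along time removes isolated spikes.
    denoised = np.median(shifted_stack(pulseness, radius), axis=0)
    # 2. Threshold: keep the regions that look like pulses.
    detected = denoised > threshold
    # 3. Closure: dilation followed by erosion fills short gaps between pulses.
    dilated = shifted_stack(detected, radius).any(axis=0)
    closed = shifted_stack(dilated, radius).all(axis=0)
    # 4. Correction: detected regions are no longer treated as noise.
    return np.where(closed, 0.0, noise_mask), closed

# Hypothetical pulseness track: two pulse groups separated by a short gap,
# followed by an isolated spike that should be ignored.
track = np.array([0.1, 0.1, 0.9, 0.9, 0.9, 0.2, 0.2, 0.9,
                  0.9, 0.9, 0.1, 0.1, 0.9, 0.1, 0.1])
pulseness = np.column_stack([track, np.full(track.size, 0.1)])
noise_mask = np.ones_like(pulseness)  # everything initially flagged as noise

corrected, closed = correct_noise_mask(pulseness, noise_mask)
```

In this toy example, the gap between the two pulse groups is bridged by the closure, so the whole irregular segment is treated as deterministic, while the isolated spike is discarded by the median filter.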
Plosives and transients are both rapid variations of the speech signal along the time axis. They are both properly reconstructed on a frame-by-frame basis, like the F0, and they both suffer tremendously from the averaging effects that appear during statistical modelling. Thus, in terms of vocoding technique, there was not much room for improvement.
In the past, the smooth behavior of the parameter trajectories was modelled indirectly by predicting both the static values and the derivatives of the parameters. The Maximum Likelihood Parameter Generation (MLPG) algorithm was then used to reconstruct a smooth trajectory. Its big advantage was to ensure the continuity of the audio characteristics; its big disadvantage was to force that continuity. In other words, where continuity was necessary (e.g. between two vowels, as in a diphthong), it was achieved, but at the same time the discontinuous character of plosives and transients was smeared out. With current Deep Neural Network (DNN) techniques, it is now possible to model the smooth trajectories of the parameters using Recurrent Neural Networks (RNN), such as Bidirectional Long Short-Term Memory (BLSTM) networks, which avoids the use of the MLPG algorithm.
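For reference, the core of MLPG for a single parameter can be sketched as a weighted least-squares problem. This minimal version assumes diagonal variances and a delta feature defined as half the difference between the next and the previous frame; the window coefficients and the example values are assumptions, not taken from the project.

```python
import numpy as np

def mlpg(static_mean, delta_mean, static_var, delta_var):
    """Trajectory c maximising the likelihood of predicted static and delta features."""
    T = len(static_mean)
    W_static = np.eye(T)
    W_delta = np.zeros((T, T))
    for t in range(T):
        # delta_t = 0.5 * (c[t+1] - c[t-1]), with indices clamped at the edges
        W_delta[t, max(t - 1, 0)] -= 0.5
        W_delta[t, min(t + 1, T - 1)] += 0.5
    W = np.vstack([W_static, W_delta])
    mu = np.concatenate([static_mean, delta_mean])
    precision = 1.0 / np.concatenate([static_var, delta_var])
    # Solve (W' P W) c = W' P mu, where P is the diagonal precision matrix.
    A = W.T @ (precision[:, None] * W)
    b = W.T @ (precision * mu)
    return np.linalg.solve(A, b)

# Hypothetical abrupt onset (e.g. a plosive) predicted with flat deltas.
static_mean = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
delta_mean = np.zeros(6)

smooth = mlpg(static_mean, delta_mean, np.ones(6), np.full(6, 0.1))
loose = mlpg(static_mean, delta_mean, np.ones(6), np.full(6, 1e9))
```

With confident delta predictions (`smooth`), the abrupt step is spread over several frames, which is the smearing of plosives and transients described above. With uninformative deltas (`loose`), the trajectory simply follows the static means.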
During this research project we studied and tested DNN techniques with the aim of improving the rapid time variations of plosives and transients. Recent progress in DNNs reported in the literature showed that 2D convolutional layers are very effective for modelling and generating images. We thus followed this lead and considered the spectral amplitude envelope as an image (see figure below).
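Treating the spectral amplitude envelope as an image means that a 2D kernel slides over both time and frequency. The sketch below shows a single-channel forward pass with one hand-written kernel. It is only an illustration of the operation: the envelope values and the kernel are hypothetical, and a real convolutional layer learns many such kernels.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Single-channel 2D cross-correlation (the 'convolution' of deep learning), no padding."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hypothetical envelope of 6 frames x 8 frequency bins with an abrupt onset
# at frame 3, as a plosive would produce.
envelope = np.zeros((6, 8))
envelope[3:, :] = 1.0

# Kernel computing the difference between consecutive frames.
time_edge = np.array([[-1.0], [1.0]])
response = conv2d_valid(envelope, time_edge)
```

The response is non-zero only at the onset, which shows how such kernels can represent rapid time variations locally, in every frequency bin.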
To train a neural network model, the Least Squares (LS) approach was commonly used until a few years ago. The least squares approach tells the neural network to predict the mean. In terms of statistical modelling, this is a very crude simplification, since the value to predict obeys a probability distribution. Let's imagine a video game about darts. You have to face various opponents ... who all systematically hit the center, precisely. This game would be correct in the sense that, if we consider all sorts of dart players, the mean, the average point where they hit the dartboard, would indeed be the very center. Even though the game would be perfectly correct, it wouldn't be much fun for the gamer. The video game should not aim at the mean. It should generate hit points that look like those of a good dart player. In DNNs, the Least Squares (LS) solution predicts the mean, whereas the Wasserstein Generative Adversarial Network (WGAN) predicts the spectral amplitude of a good speaker.
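The darts analogy can be checked numerically. The sketch below assumes, purely for illustration, players whose hits land on a ring around the center of the board. The best least-squares prediction is the mean of the hits, which ends up at the center, where no player actually hits.

```python
import math
import random

def sample_hit(rng):
    """Hypothetical player: hits land on a ring of radius ~1 around the center."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    radius = rng.gauss(1.0, 0.05)
    return (radius * math.cos(angle), radius * math.sin(angle))

def mean_squared_error(point, hits):
    """Average squared distance between a fixed prediction and the observed hits."""
    return sum((x - point[0]) ** 2 + (y - point[1]) ** 2 for x, y in hits) / len(hits)

rng = random.Random(0)
hits = [sample_hit(rng) for _ in range(10000)]

# The constant prediction minimising the squared error is the mean of the hits.
mean_hit = (sum(x for x, _ in hits) / len(hits),
            sum(y for _, y in hits) / len(hits))

error_of_mean = mean_squared_error(mean_hit, hits)
error_of_real_hit = mean_squared_error(hits[0], hits)
closest_real_hit = min(math.hypot(x, y) for x, y in hits)
```

The mean achieves the lowest error, yet it lies near the center while every actual hit stays far from it. A generative model instead aims at producing samples that resemble actual hits.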
The figure below first shows the original spectral amplitude as measured from an audio recording. Time is on the horizontal axis and frequency on the vertical axis. The second row shows the spectral amplitude generated by a BLSTM trained using the Least Squares solution. The third row shows the spectral amplitude generated by 2D convolutional layers trained using WGAN. We can see that the thin structures in the mid and high frequencies in the bottom plot are more similar to those of the original than those in the middle plot, which looks oversmoothed.
[Figure: three spectral amplitude panels, from top to bottom: original, BLSTM + LS, 2D Convolutional + WGAN.]
[Audio samples: original, PML resynthesis, BLSTM + LS, 2D Convolutional + WGAN.]
In the end, even though we were aiming at improving the plosives and transients, we can see that we also improved many aspects of the spectral amplitude.
The system used to generate these sounds has been published as open-source code on GitHub.com under the name Percival.