Podcast Speech Synthesis using Tacotron2

Adam Jacobs
8 min read · Mar 26, 2021

Dear reader, welcome to my first technical Medium post! :) I’m excited to bring you practical, hands-on content with analysis for those who like reading about things they can actually use themselves, not just read about!

In this project, a speech synthesis voice was trained on Spotify podcasts with the text-to-speech deep learning model Tacotron2 (paper by Google, implemented by NVIDIA). This article describes how the training was done and the steps of a re-usable data pipeline that makes it more scalable (and that can be re-applied to other ML projects). All the code, the project report, and the results can be found here.

Why Speech Synthesis?

Speech synthesis is becoming more important in an age where conversational devices are increasingly common in our homes. Many speech synthesis systems rely on read speech data, because that is historically what has been available and because spontaneous speech data is difficult to utilize: it is highly variable and disfluent. When building speech and conversational systems, it is important to include features that humans can relate to; otherwise the result will not be perceived as natural. Therefore, in order to improve the naturalness of synthesized speech, this project created a text-to-speech (TTS) system built with the end-to-end deep learning model Tacotron2 and spontaneous speech data found in Spotify podcasts.

Podcast Data

The Spotify Podcasts Dataset collects data from around 100,000 episodes of different podcast shows uploaded to Spotify. All the audio files together sum up to around 50,000 hours of data, and the transcripts (generated by Google’s Cloud Speech-to-Text API, according to Spotify) contain more than 600 million words. These transcripts also contain speaker tags that show who is speaking, for example:

{“startTime”: “3s”, “endTime”: “3.300s”, “word”: “Hello,”, “speakerTag”: 1}

Data Pipeline Summary

In summary: first the relevant podcasts are selected, then they are segmented by breaths, transcripts are added to the segments, outliers are filtered out, and finally the data is fed into the model. See the visualization of the pipeline below. Each operation is implemented as a Guild AI command, which tracks your experiments for you. You can add, modify or remove any operations you want in the pipeline, making it re-usable and scalable.
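To give a feel for how the operations chain together, here is a minimal sketch of driving the pipeline with Guild’s CLI from Python. The operation names are hypothetical placeholders (the real ones live in the project’s guild.yml); only the guild run … -y command itself is Guild AI’s actual CLI.

import subprocess

# Hypothetical operation names -- the actual names are defined in the project's guild.yml.
PIPELINE = [
    "select-podcasts",    # pick shows with one speaker and enough hours
    "segment-breaths",    # split episodes at detected breaths
    "merge-transcripts",  # attach transcript text to each audio segment
    "filter-segments",    # drop acoustic outliers and too short/long clips
    "train-tacotron2",    # fine-tune the TTS model on the result
]

for op in PIPELINE:
    # `guild run <operation> -y` runs one tracked experiment without prompting.
    subprocess.run(["guild", "run", op, "-y"], check=True)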

Podcast Data Selection

The podcast data is downloaded using rclone from a remote server containing the data. This is encapsulated in a guild command and can be replaced with any other downloading commands or scripts for other data sources. The .ogg formatted audio files were then converted to .wav using the FFmpeg library.
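As a rough sketch of the conversion step, assuming FFmpeg is installed and the downloaded episodes sit in a local folder (the paths and the 22.05 kHz mono target are illustrative choices of mine, not necessarily the project’s exact settings):

import subprocess
from pathlib import Path

# Illustrative paths -- in the project this step is wrapped in a guild operation.
SRC_DIR = Path("data/raw_ogg")
DST_DIR = Path("data/wav")
DST_DIR.mkdir(parents=True, exist_ok=True)

for ogg_path in SRC_DIR.glob("*.ogg"):
    wav_path = DST_DIR / (ogg_path.stem + ".wav")
    # Convert to mono 22.05 kHz WAV, which is what Tacotron2 recipes typically expect.
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(ogg_path), "-ac", "1", "-ar", "22050", str(wav_path)],
        check=True,
    )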

Since the pre-trained models were trained on female voices the goal was to find a female spontaneous podcast voice to train with and at least 20h before any type of data filtering. This was done by only looking at podcast with at least 20h of episodes and only one speaker tag in transcripts. However, the speaker tags were not perfect so a few hours of manual labor was needed to listen to some of the podcast, and select candidates. This is the only manual step in the pipeline, and could be automated in the future.
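The automated part of this selection (before the manual listening) could look roughly like the sketch below. The folder layout and the flattened words list are assumptions made for illustration; the real transcripts are Google Cloud Speech-to-Text JSON files with a slightly different structure.

import json
from pathlib import Path

MIN_HOURS = 20  # required amount of audio before any filtering

def is_candidate(show_dir: Path) -> bool:
    """Keep shows with at least 20 h of episodes and a single speaker tag overall."""
    total_hours = 0.0
    speaker_tags = set()
    for transcript_path in show_dir.glob("*.json"):
        words = json.loads(transcript_path.read_text())["words"]
        speaker_tags.update(w["speakerTag"] for w in words if "speakerTag" in w)
        if words:
            # Episode duration approximated by the last word's end time, e.g. "3.300s".
            total_hours += float(words[-1]["endTime"].rstrip("s")) / 3600
    return total_hours >= MIN_HOURS and len(speaker_tags) == 1

candidates = [d for d in Path("data/shows").iterdir() if d.is_dir() and is_candidate(d)]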

Segmentation: Breathing Detection

From previous research we know that Tacotron2 is very sensitive to breaths and silence, so the data is segmented accordingly. A pretrained, speaker-dependent CNN-LSTM breath detector was used, which was trained on a female voice (Székely et al.). This is implemented as a guild command. The model classifies utterances into breath, clean speech, and noise by looking at mel spectrograms.

Note that ”noise” here mostly means another person speaking, since the model was trained on dialogue data to detect one specific voice. Some segments were short, so “noise” regions were merged into them to make them longer; since a solid amount of data was available, the later filtering step in the pipeline was expected to remove undesirable utterances, which it did.
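To make the segmentation logic concrete, here is a conceptual sketch. It assumes the detector’s output has already been turned into labelled time spans, which is my simplification for illustration, not the detector’s actual output format or the project’s exact implementation.

from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Span:
    label: str    # "breath", "clean" or "noise"
    start: float  # seconds
    end: float    # seconds

def segment_at_breaths(spans: list[Span], min_len: float = 3.0) -> list[tuple[float, float]]:
    """Cut at breaths; keep clean speech and pad too-short segments with adjacent "noise"."""
    segments, start, end = [], None, None
    for span in spans:
        if span.label == "breath":
            # A breath closes the current segment.
            if start is not None:
                segments.append((start, end))
                start = None
        elif span.label == "clean" or (start is not None and end - start < min_len):
            # Clean speech always extends the segment; noise only pads a too-short one.
            start = span.start if start is None else start
            end = span.end
    if start is not None:
        segments.append((start, end))
    return segments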

Merging Transcripts and Segments

After breath segmentation, the timestamps of the audio segments were matched with the relevant text in the transcripts. A word is included if its start or end time falls within the segment’s time interval. For example, the following word will be included in the segment below:

Word: {“startTime”: “3s”, “endTime”: “3.300s”, “word”: “Hello,”, “speakerTag”: 1}

Segment: {somesegment.wav, “startTime”: “3.05s”, “endTime”: “7.500s”}

This could be done less naively with, for example, a forced aligner, but the naive approach did not seem to affect the result in a considerably negative way.
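A minimal sketch of this overlap check, assuming the timestamp strings have the “3.300s” format shown above:

def parse_time(value: str) -> float:
    """Convert a transcript timestamp like "3.300s" into seconds."""
    return float(value.rstrip("s"))

def words_for_segment(words: list, seg_start: float, seg_end: float) -> str:
    """Collect words whose start or end time falls inside the segment interval."""
    kept = []
    for w in words:
        start, end = parse_time(w["startTime"]), parse_time(w["endTime"])
        if seg_start <= start <= seg_end or seg_start <= end <= seg_end:
            kept.append(w["word"])
    return " ".join(kept)

# The example above: "Hello," ends at 3.3 s, inside the 3.05-7.5 s segment, so it is kept.
transcript = [{"startTime": "3s", "endTime": "3.300s", "word": "Hello,", "speakerTag": 1}]
print(words_for_segment(transcript, 3.05, 7.5))  # -> "Hello,"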

Filtering Segments

Due to the variation in tone, intensity and speed of the voice during natural speech, the segments have very different characteristics, which may hinder the training process. We don’t want any outlier events in our voice data, like a big “comedic” laugh or the person talking louder than usual.

Since the acoustic features of a podcast have useful distributions, outliers can be removed by only including segments whose acoustic features fall within a reasonable threshold. So a segment is filtered out if it contains audio where the voice does not sound like it usually does, or if it is not speech at all! Below are examples of the distributions:

The filtering was done by removing segments that lie outside a threshold from the average pitch, intensity, energy, and speech rate. That is, we keep a segment if each of its measured acoustic features i lies within the interval μ_{i} ± α · σ_{i}, where μ_{i} and σ_{i} are the mean and standard deviation of feature i over all segments, and α is a parameter controlling the width of the threshold (the number of standard deviations away from the average).

Pitch, intensity, and energy can be calculated with any library for acoustic data. The speech rate was calculated as the number of phonemes divided by the segment length in seconds, using the G2P grapheme-to-phoneme conversion library.
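Here is a rough sketch of computing those four features per segment. Using librosa and g2p_en is my choice for illustration; the article only requires some acoustic analysis library plus a grapheme-to-phoneme converter, and the exact pitch/energy definitions used in the project may differ.

import librosa
import numpy as np
from g2p_en import G2p  # grapheme-to-phoneme conversion

g2p = G2p()

def segment_features(wav_path: str, text: str) -> dict:
    """Compute the four per-segment features used for outlier filtering."""
    y, sr = librosa.load(wav_path, sr=None)
    duration = len(y) / sr
    f0, _, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr)  # pitch track in Hz (NaN when unvoiced)
    rms = librosa.feature.rms(y=y)[0]                     # frame-level RMS
    phonemes = [p for p in g2p(text) if p.strip().isalnum()]  # drop spaces and punctuation
    return {
        "pitch": float(np.nanmean(f0)),                           # mean F0 over voiced frames
        "intensity": float(20 * np.log10(np.mean(rms) + 1e-9)),   # rough level in dB
        "energy": float(np.sum(y ** 2)),                          # total signal energy
        "speech_rate": len(phonemes) / duration,                  # phonemes per second
    }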

Multiple experiments were performed, with intervals spanning either 2 or 4 standard deviations in total, i.e. computing μ_{i} ± α · σ_{i} for α = 1, 2. Segments were also filtered out if their length fell outside 3–10 seconds: if the data is too short Tacotron2 will not learn much from it, and on longer utterances Tacotron2 ran out of memory.
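Putting the thresholding together, a minimal sketch of the filtering step could look like this (segment dicts carrying a duration plus the features from the previous sketch are an assumption of mine):

import numpy as np

FEATURES = ["pitch", "intensity", "energy", "speech_rate"]

def filter_segments(segments: list, alpha: float = 2.0,
                    min_len: float = 3.0, max_len: float = 10.0) -> list:
    """Keep 3-10 s segments whose features all lie within mu +/- alpha * sigma."""
    # Per-feature mean and standard deviation over all segments.
    stats = {
        f: (np.mean([s[f] for s in segments]), np.std([s[f] for s in segments]))
        for f in FEATURES
    }
    kept = []
    for s in segments:
        if not (min_len <= s["duration"] <= max_len):
            continue
        if all(abs(s[f] - stats[f][0]) <= alpha * stats[f][1] for f in FEATURES):
            kept.append(s)
    return kept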

Training

Tacotron2 is a sequence-to-sequence (encoder/decoder) network with attention that encodes a text sequence and decodes it into a mel spectrogram, which is an image-like representation of audio over a time interval. The spectrogram is then fed into a neural vocoder (WaveNet in the original paper, WaveGlow in NVIDIA’s implementation) which produces the audio waveform. So it goes from text -> spectrogram image -> audio data.

The training starts from a pre-trained model trained by NVIDIA on the LJ Speech dataset, which is then fine-tuned on the data from the previous steps of the pipeline. Two GeForce RTX 2080 GPUs were used for training, and each experiment took about 1.5 days to train.

Tacotron2 uses Docker for an easy setup. This project followed the instructions provided in its repository, but the following steps turned out to be necessary to train the model without running into software issues:

  • Use the Docker container named tacotr:ver1
  • Do not attempt to install any dependencies — they are already in the docker image!
  • Instead of using the WaveGlow model linked in that repository, use the one from PyTorch Hub (a quick inference sketch using those Hub models follows after this list).
  • If you run into issues, look at the GitHub issues!
  • Make sure the Docker image has access to the GPUs (which you need unless you want to train for a very long time!)
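For completeness, here is a short inference sketch roughly following NVIDIA’s PyTorch Hub example for Tacotron2 and WaveGlow (the Hub entry points are as documented there; a CUDA GPU is assumed):

import torch

# Load Tacotron2, WaveGlow and the text utilities from NVIDIA's PyTorch Hub page.
hub_repo = "NVIDIA/DeepLearningExamples:torchhub"
tacotron2 = torch.hub.load(hub_repo, "nvidia_tacotron2").eval().cuda()
waveglow = torch.hub.load(hub_repo, "nvidia_waveglow")
waveglow = waveglow.remove_weightnorm(waveglow).eval().cuda()  # as in the Hub example
utils = torch.hub.load(hub_repo, "nvidia_tts_utils")

text = "Welcome to my first technical Medium post!"
sequences, lengths = utils.prepare_input_sequence([text])

with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)  # text -> mel spectrogram
    audio = waveglow.infer(mel)                      # mel spectrogram -> waveform

# 22.05 kHz mono audio as a NumPy array, ready to be written to a .wav file.
waveform = audio[0].data.cpu().numpy()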

Most hyperparameters were kept at the default values provided in the file hparams.py. The only settings that were adjusted were distributed_run and fp16_run, both set to True. Surprisingly, no hyperparameters had to be tuned, so it is pretty easy to train as long as you have the hardware and data needed!

Limitations and Applications

It is important to write re-usable data pipelines so that you can run many experiments and also hand the pipeline over to other people who want to do similar work. The pipeline is automatic in all steps except the podcast selection, which could be improved with more exact speaker diarization, i.e. better speaker tags.

Furthermore, most of the work was about data pre-processing, and no hyperparameters were tuned during training. It seems that deep learning speech synthesis models are quite reproducible parameter-wise, and the most important aspect is having quality data.

The pre-processing steps in the data pipeline could easily be improved further by adding more guild operations. The filtering was quite general since it only looks at statistical outliers and did not specifically target other acoustic disturbances like music. Since the filtered-out segments were not investigated, it is uncertain whether the filtering managed to remove music; however, it does not seem to be reflected in the synthesized voice.

A cool application of this project is generating a voice for podcast content. That seems quite feasible, since a very natural-sounding and spontaneous voice was produced. With more state-of-the-art models, such as Flowtron, there can be more control over the prosody of the voice, e.g. how monotonous it should sound. However, this is also restricted by the voice of the podcast speaker. Another thing to think about is the danger of being able to reproduce anyone’s voice, but we will not discuss that here!

Final Thoughts

In conclusion, a natural and spontaneous-sounding voice was trained and synthesized from podcast speech, which can be heard here. Together with that, a re-usable framework for the data pipeline was built, where the data and the synthesis model can easily be replaced. The pre-processing could be more automatic in the selection step and more dynamic in the filtering, but it can easily be improved upon by adding new independent guild operations to the pipeline. See the repo for more info, to read the report, and to listen to the results.

And lastly, a big thanks to my project group colleagues Javier García San Vicente, Johannes Benedikt Wichtlhuber and Mikolaj Bochenski, and to our supervisor Éva Székely!

