How to add speaker tags to audio files for transcription

Guillaume Goyon, 16 June 2020

This blog post goes through the process of diarization, the task of adding speaker tags to an audio file for transcription. It briefly describes techniques for working with speaker vectors, then shows an easy way to perform diarization using our tool.

What is diarization (adding speaker tags)?

Adding speaker tags to a transcription or answering the question “who spoke when?” is a task named diarization.

This task is not as easy as it seems, because algorithms do not have nearly the same level of understanding of sound as we do. It involves finding the number of speakers and when each of them spoke, using only the sound wave signal.

Also, it is a necessary step in Automatic Speech Recognition systems, as it lets us organize the text transcription and have additional information about the audio.

At Amberscript, we analysed different approaches and integrated the best one in our product. In this post, you will find some elements on what the existing techniques are, followed by a short guide on how to add speaker tags using our tool.

Why is diarization a complicated task?

[Figure: diarization pipeline diagram]

Adding speaker tags is not easy, because it involves a lot of steps. Let’s quickly describe the usual pipeline.

First, you have to split the audio into segments of speech. That means removing the parts without speech and splitting the audio at speaker turns, so you end up with segments that each involve one speaker only.

After splitting, you must find a way to regroup the segments that belong to the same speaker under the same speaker tag. This task is itself split into several steps.

You must extract a speaker vector for each segment, then cluster the speaker vectors, and finally regroup the vectors in the same cluster under the same speaker tag. The difficulty of this task is the origin of the diarization challenge called DIHARD.

Now, on to the extraction of the said speaker vectors.

Automatic generation of speaker vectors

Usually, producing the speech segments is not the most complicated part. This step is called Speech Activity Detection (SAD) or Voice Activity Detection (VAD). It is usually done by applying a threshold to the activity of the signal at each moment in the audio.
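To make the thresholding idea concrete, here is a minimal sketch of an energy-based VAD in plain Python. It is only an illustration, not the detector used in our product: the function names, the 20 ms frame length and the threshold value are all assumptions, and real systems usually add smoothing or use a trained model.

```python
import math

def frame_energies(samples, frame_size):
    """Mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        energies.append(sum(x * x for x in frame) / frame_size)
    return energies

def detect_speech(samples, sample_rate, frame_ms=20, threshold=0.01):
    """Return (start_seconds, end_seconds) spans whose frame energy exceeds the threshold.

    frame_ms and threshold are illustrative defaults, not tuned values.
    """
    frame_size = int(sample_rate * frame_ms / 1000)
    energies = frame_energies(samples, frame_size)
    segments = []
    seg_start = None  # index of the first frame of the current active span
    for i, energy in enumerate(energies):
        is_active = energy > threshold
        if is_active and seg_start is None:
            seg_start = i
        elif not is_active and seg_start is not None:
            segments.append((seg_start * frame_size / sample_rate,
                             i * frame_size / sample_rate))
            seg_start = None
    if seg_start is not None:  # audio ended while still active
        segments.append((seg_start * frame_size / sample_rate,
                         len(energies) * frame_size / sample_rate))
    return segments

# Synthetic signal: 0.5 s of silence, 1 s of a 220 Hz tone, 0.5 s of silence.
rate = 8000
silence = [0.0] * (rate // 2)
tone = [0.5 * math.sin(2 * math.pi * 220 * n / rate) for n in range(rate)]
print(detect_speech(silence + tone + silence, rate))  # prints [(0.5, 1.5)]
```

Note that this only separates active audio from silence. It does not split at speaker turns, which requires a separate speaker change detection step.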

What is harder is the task of making speaker vectors out of the obtained segments. The table below lists different techniques to extract the speaker vector (also called a speaker embedding):

Name          Underlying technique
i-vectors     Statistical models
x-vectors     Time delay neural networks
d-vectors     Recurrent neural networks
ClusterGAN    Generative adversarial networks

The complete list would be much longer, but we can limit it to these techniques, which are the most common.

I-vectors are based on Hidden Markov Models and Gaussian Mixture Models: two statistical models used to estimate speaker changes and to derive speaker vectors from a set of known speakers. It is a legacy technique that can still be used.

X-vector and d-vector systems are based on neural networks trained to recognise a set of speakers. The activations of one of their hidden layers are then used as speaker vectors. These systems perform better, but require more training data and setup.

ClusterGAN takes this a step further and tries to transform an existing speaker vector into another one that contains better information, by using three neural networks competing against each other.

When this step is done, we end up with speaker vectors for each segment.

Clustering of the speaker vectors

After getting those speaker vectors, you need to cluster them. This means grouping together speaker vectors that are similar, hence likely to belong to the same speaker.

The issue with this step is that you do not necessarily know the number of speakers for a given file (or set of files), so you are not sure how many clusters you want to obtain. An algorithm can try to guess it, but may get it wrong.
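One common way to guess the number of speakers is to merge the closest clusters until they are too far apart, and to count what is left. The sketch below does this with average-linkage agglomerative clustering and cosine distance in plain Python. The toy vectors, the function names and the max_distance value of 0.3 are assumptions made for illustration; picking that threshold badly is exactly how the guess can go wrong.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def agglomerative_clusters(vectors, max_distance):
    """Merge the two closest clusters until none are closer than max_distance.

    Returns a list of clusters, each a list of indices into `vectors`.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None  # (distance, index_a, index_b) of the closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > max_distance:
            break  # remaining clusters are treated as different speakers
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy 3-dimensional "speaker vectors" (real embeddings have hundreds of dimensions).
vectors = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.9, 0.0],
    [0.0, 0.1, 1.0],
]
clusters = agglomerative_clusters(vectors, max_distance=0.3)
print(len(clusters), "speakers estimated:", clusters)
# prints: 3 speakers estimated: [[0, 1], [2, 3], [4]]
```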

Again, several algorithms exist and may be used to perform this task. The most common ones are listed in the table below:

Name       Underlying technique
K-means    Iterative clustering
PLDA       Statistical models
UIS-RNN    Recurrent neural network

PLDA refers to a scoring technique that is used inside another clustering algorithm. K-means is usually the standard way to go for clustering, but you have to define a distance between two speaker vectors, and PLDA scoring is actually more suitable for this case.
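As an illustration of K-means with an explicit distance, here is a small sketch in plain Python that assumes the number of speakers k is known. It uses cosine distance as a simple stand-in for the distance between speaker vectors; PLDA scoring itself requires a trained model and is not shown. The function names, the farthest-point initialisation and the toy vectors are assumptions made for this example.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def kmeans(vectors, k, iterations=20):
    """Return one cluster label (0..k-1) per vector."""
    # Farthest-point initialisation: deterministic, spreads the first centroids out.
    centroids = [vectors[0]]
    while len(centroids) < k:
        farthest = max(vectors,
                       key=lambda v: min(cosine_distance(v, c) for c in centroids))
        centroids.append(farthest)

    labels = None
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: cosine_distance(v, centroids[c]))
                      for v in vectors]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels

# Toy 2-dimensional "speaker vectors" for five segments and two speakers.
vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [0.95, 0.15]]
labels = kmeans(vectors, k=2)
print(labels)  # prints [0, 0, 1, 1, 0]
```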

UIS-RNN is a recent and very promising technique that allows online decoding, adding new speakers as they appear.
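To show what online decoding means, the sketch below processes speaker vectors one at a time and opens a new speaker whenever a vector is not similar enough to any speaker seen so far. This is a deliberately simple threshold rule, not UIS-RNN, which uses a trained recurrent network to make that decision. The function names and the min_similarity value of 0.7 are assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def online_diarize(vectors, min_similarity=0.7):
    """Label vectors in arrival order, creating new speakers as they appear."""
    speaker_sums = []  # running sum of the vectors assigned to each speaker
    labels = []
    for v in vectors:
        best_speaker, best_sim = None, min_similarity
        for idx, total in enumerate(speaker_sums):
            sim = cosine_similarity(v, total)
            if sim >= best_sim:
                best_speaker, best_sim = idx, sim
        if best_speaker is None:
            # No known speaker is similar enough: open a new one.
            speaker_sums.append(list(v))
            best_speaker = len(speaker_sums) - 1
        else:
            # Update the running sum (same direction as the running mean).
            speaker_sums[best_speaker] = [
                x + y for x, y in zip(speaker_sums[best_speaker], v)
            ]
        labels.append(best_speaker)
    return labels

# Toy stream of 3-dimensional "speaker vectors", one per segment.
stream = [
    [1.0, 0.1, 0.0],
    [0.1, 1.0, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.1, 1.0],
    [0.2, 0.9, 0.1],
]
print(online_diarize(stream))  # prints [0, 1, 0, 2, 1]
```

Unlike the offline algorithms, this never revisits an earlier decision, which is the trade-off of working online.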

After the clustering step, you can add the speaker tags to the segments that belong to the same cluster, so you end up with tags for each segment.
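The tagging step can be sketched as follows: each segment receives the tag of its cluster, and consecutive segments with the same tag are merged into one speaker turn. The data layout (segments as (start, end) pairs in seconds, one integer cluster label per segment) and the "Speaker N" naming are assumptions chosen for this example.

```python
def tag_segments(segments, labels):
    """Attach a speaker tag to each segment and merge consecutive same-speaker segments.

    segments: list of (start_seconds, end_seconds), in chronological order.
    labels:   one cluster index per segment, as produced by a clustering step.
    """
    turns = []
    for (start, end), label in zip(segments, labels):
        tag = f"Speaker {label + 1}"
        if turns and turns[-1]["speaker"] == tag:
            turns[-1]["end"] = end  # same speaker keeps talking: extend the turn
        else:
            turns.append({"speaker": tag, "start": start, "end": end})
    return turns

segments = [(0.0, 2.1), (2.4, 5.0), (5.2, 7.9), (8.0, 9.5)]
labels = [0, 0, 1, 0]
for turn in tag_segments(segments, labels):
    print(turn)
# {'speaker': 'Speaker 1', 'start': 0.0, 'end': 5.0}
# {'speaker': 'Speaker 2', 'start': 5.2, 'end': 7.9}
# {'speaker': 'Speaker 1', 'start': 8.0, 'end': 9.5}
```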

What is left after diarization for a full transcription?

When diarization is done, you still need to actually transcribe the file (which means getting the text out of the audio), but the technology behind this merits another post!

The output will then be a full transcription with the words of the audio file, plus the speaker associated with each part of the text.
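As a rough sketch of how the two results can be combined, the snippet below assigns each transcribed word to the speaker turn that contains the midpoint of the word, then groups the words into one line per turn. It assumes the transcription step provides word timestamps. The data layout, the function name and the midpoint rule are illustrative assumptions, not a description of our product's internals.

```python
def assign_speakers(words, turns):
    """Group timestamped words into lines of text, one per speaker turn.

    words: list of (text, start_seconds, end_seconds), in chronological order.
    turns: list of (speaker, start_seconds, end_seconds) from diarization.
    """
    def speaker_at(time):
        for speaker, start, end in turns:
            if start <= time < end:
                return speaker
        return "Unknown"  # the word falls outside every diarized turn

    lines = []  # list of (speaker, [words])
    for text, start, end in words:
        speaker = speaker_at((start + end) / 2)  # use the midpoint of the word
        if lines and lines[-1][0] == speaker:
            lines[-1][1].append(text)
        else:
            lines.append((speaker, [text]))
    return [f"{speaker}: {' '.join(texts)}" for speaker, texts in lines]

words = [
    ("Hello", 0.2, 0.6), ("there", 0.7, 1.0),
    ("Hi", 1.4, 1.6), ("how", 1.7, 1.9), ("are", 1.9, 2.0), ("you", 2.0, 2.3),
]
turns = [("Speaker 1", 0.0, 1.2), ("Speaker 2", 1.2, 3.0)]
for line in assign_speakers(words, turns):
    print(line)
# Speaker 1: Hello there
# Speaker 2: Hi how are you
```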

How to add speaker tags using the Amberscript tool

Now, on to the practical part: how can you add speaker tags without having to perform all the technical steps above?

You can simply head to our website and log in. When this is done, you will be able to upload a file and select the number of speakers (for better accuracy) and then let the algorithm run!

You do not have to worry about which technique to choose. After a few minutes, your file will be fully transcribed, and you can check in the editor if the speaker tags have been added correctly.

You can even correct any mistakes you find, and then download your transcript, ready for publication.

Conclusion

To conclude, there are a lot of diarization techniques available and the process is really complicated, but we built a tool using the best technique we found, so you can add speaker tags to your audio files and get the best transcription.