How Does Speech to Text Software Work?
Andrei Cucleschin, 11 July 2019
With the introduction of speech-to-text software like Alexa, Cortana, Siri and Google Assistant, voice recognition has started to replace typing as a way of interacting with our digital devices.
Recent technological developments in speech recognition have not only made our lives more convenient and our workflows more productive, but have also opened up opportunities that would have seemed miraculous not long ago.
Speech-to-text software has a wide variety of applications, and the list continues to grow every year. Healthcare, customer service, qualitative research, journalism – these are just some of the industries where voice-to-text conversion has already become a major game-changer.
Professionals in various fields need high-quality transcripts to do their work. The technology behind voice recognition advances at a fast pace, making it quicker, cheaper and more convenient than manual transcription.
Although the technology has not yet reached human performance, the software’s accuracy can get up to 95%. Transcription used to be time-consuming and labor-intensive, whereas now human involvement in the process is limited to making small adjustments.
P.S. – if that’s too much to read at the moment, feel free to skip to the summarizing infographic at the bottom of this page.
The core of an automatic transcription service is the automatic speech recognition system. In brief, such systems are composed of acoustic and linguistic components running on one or several computers.
The acoustic component is responsible for converting the audio in your file into a sequence of acoustic units – tiny sound samples. Have you ever seen the waveform of a sound? That is what we call analogue sound: the vibrations you create when you speak. These are converted into digital signals so that the software can analyze them. The acoustic units are then matched to existing “phonemes” – the sounds we use in our language to form meaningful expressions.
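As a rough illustration of the digitization and framing step described above, here is a minimal Python sketch. The sample rate, frame length and the sine-wave “signal” are all invented for illustration; a real engine works on actual recorded audio.

```python
import math

# Simulate an "analogue" signal: a 100 Hz tone sampled at 8 kHz.
SAMPLE_RATE = 8000   # samples per second (assumed value)
FRAME_MS = 25        # frame length in milliseconds (assumed value)

def digitize(duration_s):
    """Sample the continuous waveform into discrete amplitude values."""
    n = int(SAMPLE_RATE * duration_s)
    return [math.sin(2 * math.pi * 100 * t / SAMPLE_RATE) for t in range(n)]

def frame(samples, frame_ms=FRAME_MS):
    """Split the digital signal into short frames ("acoustic units")."""
    size = SAMPLE_RATE * frame_ms // 1000  # samples per frame
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]

signal = digitize(1.0)   # one second of audio -> 8000 samples
frames = frame(signal)   # 40 frames of 25 ms each
```

Each of these short frames would then be compared against the phoneme inventory of the language.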
Thereafter, the linguistic component is responsible for converting this sequence of acoustic units into words, phrases and paragraphs. Many words sound similar but mean entirely different things, such as “peace” and “piece”.
The linguistic component analyzes all the preceding words and their relationships to estimate which word is most likely to come next. Geeks call these “Hidden Markov Models”, and they are widely used in speech recognition software. This is how speech recognition engines are able to determine parts of speech and word endings (with varied success).
Example: “he listens to a podcast”. Even if the sound “s” in the word “listens” is barely pronounced, the linguistic component can still determine that the word should be spelled with an “s”, because it is preceded by “he”.
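The “he listens” example can be sketched as a toy bigram model in Python. The counts below are invented for illustration; a real language model is trained on vast text corpora.

```python
# Toy bigram counts (invented numbers, for illustration only):
# how often each word pair was seen in some hypothetical training text.
BIGRAM_COUNTS = {
    ("he", "listens"): 90,
    ("he", "listen"): 2,
    ("they", "listen"): 80,
    ("they", "listens"): 3,
}

def pick_word(previous, candidates):
    """Choose the candidate most likely to follow `previous`."""
    return max(candidates, key=lambda w: BIGRAM_COUNTS.get((previous, w), 0))

# The acoustic component is unsure between "listen" and "listens";
# the preceding word "he" tips the balance.
best = pick_word("he", ["listen", "listens"])
print(best)  # listens
```

The same lookup after “they” would return “listen” instead, which is exactly the kind of context sensitivity the linguistic component provides.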
You can easily test this in real life: like any other speech application, Google Translate has language models behind it. Voice type a random word that has several meanings, then supply the translator with context (by putting the word in a sentence) – it is very likely that you’ll see a more accurate transcription and translation.
Before you can use an automatic transcription service, these components must be trained to understand a specific language. Both the acoustic part of your content (how it is spoken and recorded) and the linguistic part (what is being said) are critical to the accuracy of the resulting transcription.
Here at Amberscript, we are constantly improving our acoustic and linguistic components in order to perfect our speech recognition engine.
There is also something called a “speaker model”. Speech recognition software can be either speaker-dependent or speaker-independent.
A speaker-dependent model is trained on one particular voice, such as the speech-to-text solution by Dragon. You can also train Siri, Google Assistant and Cortana to recognize only your own voice (in other words, you make the voice assistant speaker-dependent).
This usually results in higher accuracy for your particular use case, but it requires time to train the model to understand your voice. Furthermore, a speaker-dependent model is not flexible and cannot be used reliably in many settings, such as conferences.
You’ve probably guessed it – a speaker-independent model can recognize many different voices without any training. That’s what we currently use in our software at Amberscript.
No! There are many speech-to-text tools that serve different purposes. Some are designed for simple, repetitive tasks; others are incredibly advanced. Let’s look at the different levels of speech recognition.
1) Have you ever called a company and been asked by a voice bot to leave your phone number? That’s the simplest speech recognition tool. It works on pattern matching and has a limited vocabulary, but it does the job (in this case, understanding digits).
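A minimal sketch of such a pattern-matching recognizer, assuming each vocabulary word is stored as a hypothetical three-number “template” (real systems compare far richer acoustic features):

```python
# Hypothetical stored templates: one feature vector per vocabulary word.
# The numbers are invented for illustration.
TEMPLATES = {
    "zero": [0.1, 0.9, 0.2],
    "one":  [0.8, 0.1, 0.3],
    "two":  [0.4, 0.5, 0.9],
}

def recognize(features):
    """Return the vocabulary word whose template is closest (squared Euclidean)."""
    def dist(word):
        return sum((a - b) ** 2 for a, b in zip(TEMPLATES[word], features))
    return min(TEMPLATES, key=dist)

print(recognize([0.75, 0.15, 0.25]))  # one
```

Because the whole vocabulary is just a handful of templates, this approach is fast and robust for digits, but it cannot scale to open-ended speech.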
2) The next level of speech recognition involves statistical analysis and modelling (such as the Hidden Markov Models we touched upon in a previous section).
3) The ultimate level of speech recognition is based on artificial neural networks, which essentially give the engine the ability to learn and self-improve. Google’s and Microsoft’s engines, as well as ours, are powered by machine learning.
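To illustrate what “learning” means at this level, here is a minimal sketch of a single artificial neuron (a perceptron) adjusting its weights from labelled examples. It learns the logical AND function – a drastically simplified stand-in for the deep neural networks real engines use.

```python
# Training data: (bias, x1, x2) inputs and labels for logical AND.
DATA = [((1, 0, 0), 0), ((1, 0, 1), 0), ((1, 1, 0), 0), ((1, 1, 1), 1)]

def predict(w, x):
    """Fire (output 1) if the weighted sum of the inputs is positive."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def train(epochs=10):
    """Perceptron learning rule: nudge the weights towards each mistake."""
    w = [0, 0, 0]
    for _ in range(epochs):
        for x, target in DATA:
            err = target - predict(w, x)               # -1, 0 or +1
            w = [wi + err * xi for wi, xi in zip(w, x)]
    return w

w = train()
print([predict(w, x) for x, _ in DATA])  # [0, 0, 0, 1]
```

The key point is that nobody programmed the rule explicitly – the weights were learned from examples, which is the same principle (at a vastly larger scale) behind neural speech recognition.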
Although speech recognition technology has been advancing at a very fast pace over the last decade, speech-to-text software still faces a number of challenges. The current limitations of speech-to-text software are:
-Recording conditions. The performance of both human and automated transcription largely depends on the recording conditions. Voice recognition software still struggles to interpret speech in a noisy environment or when many people speak at the same time.
P.S. – check out our post on how to improve your audio quality and optimise the transcription of speech to text for practical tips that will boost the quality of your automatic transcription.
-Recognizing certain dialects and accents. Language is a complex structure, and everyone speaks in a slightly different way. A multitude of dialects and accents creates additional complexity for the model. However, this complexity can be managed by gathering many different kinds of data.
-Understanding homonyms. Homonyms are words that sound the same but differ in meaning and spelling – for example, “fare” (the price of a ticket) and “fair” (unprejudiced). Choosing the right option requires understanding the context. Although modern speech-to-text engines are powered by AI, interpreting a unique context correctly remains difficult for machines.
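A toy sketch of context-based homonym disambiguation in Python. The context-word sets are invented for illustration; real engines rely on statistical language models rather than hand-made word lists.

```python
# Hypothetical context words associated with each spelling (invented lists).
CONTEXT = {
    "fare": {"ticket", "bus", "train", "price", "pay"},
    "fair": {"unprejudiced", "just", "honest", "trial", "play"},
}

def disambiguate(sentence, candidates):
    """Pick the spelling whose context words overlap the sentence the most."""
    words = set(sentence.lower().split())
    return max(candidates, key=lambda c: len(CONTEXT[c] & words))

print(disambiguate("the bus ticket price", ["fare", "fair"]))  # fare
```

The surrounding words “bus”, “ticket” and “price” make “fare” the far more likely spelling, even though both words sound identical.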
Our engine is estimated to reach up to 95% accuracy – a level of quality previously unknown to the Dutch market. We would be more than happy to share where this performance comes from.
Let’s discuss the next major step forward for the entire industry: Natural Language Understanding (NLU). NLU is a branch of Artificial Intelligence that explores how machines can understand and interpret human language. It allows speech recognition technology not only to transcribe human language, but to actually understand the meaning behind it. In other words, adding NLU algorithms is like adding a brain to a speech-to-text converter.
NLU aims to tackle the toughest challenge of speech recognition: understanding and working with unique context.
-Machine translation. This is already being used in Skype: you speak in one language, and your voice is automatically transcribed into text in a different language. You can treat it as the next level of Google Translate. This alone has enormous potential – just imagine how much easier it becomes to communicate with people who don’t speak your language.
-Document summarization. We live in a world full of data – perhaps there is too much information out there. Imagine having an instant summary of an article, essay or email.
-Content categorization. Similar to the previous point, content can be broken down into distinct themes or topics. This feature is already implemented in search engines such as Google and YouTube.
-Sentiment analysis. This technique aims to identify human perceptions and opinions through a systematic analysis of blogs, reviews or even tweets. It is already used by many firms, particularly those that are active on social media.
Yes, we’re heading there! We don’t know whether we’ll end up in a world full of friendly robots or the one from The Matrix, but machines can already understand basic human emotions.
-Plagiarism detection. Simple plagiarism tools only check whether a piece of content is a direct copy. Advanced software like Turnitin can already detect whether the same content has been paraphrased, making plagiarism detection a lot more accurate.
These are just some of the many disciplines in which NLU (as a subset of Natural Language Processing) already plays a huge role.
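As a taste of how document summarization can work in its simplest form, here is a sketch of frequency-based extractive summarization: score each sentence by how often its words occur in the whole text, and keep the top scorers. Real NLU summarizers are far more sophisticated.

```python
import re
from collections import Counter

def summarize(text, n=1):
    """Return the n sentences with the highest total word-frequency score."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))
    return sorted(sentences, key=score, reverse=True)[:n]

example = ("Speech recognition converts speech to text. "
           "Cats sleep. "
           "Speech recognition software improves speech transcription greatly.")
print(summarize(example, 1))
```

The sentence packed with the text's most frequent words wins; a one-line extract like this is crude, but it shows the core idea behind extractive summarization.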
We’re currently integrating NLU algorithms into our systems to make our speech recognition software even smarter and applicable to an even wider range of use cases.
We hope you’re now a bit better acquainted with the fascinating field of speech recognition! Feel free to browse our blog for more interesting reads like this!