Help me understand Voice Recognition tech
I am interested in getting an app that would allow me to make notes via voice-to-text. I work in a field with HIPAA protections. I’m having trouble figuring out the nuances of privacy related to these apps.
First off, is this kind of software considered “AI”? How does it even recognize that a sound equals a word? Do they use LLM tech? Does the tech learn to recognize my voice better over time? Does it use my recordings to learn to understand others’ voices? Is this all a black box? How can I take precautions such that no one except me hears the things I transcribe?
This is just such confusing tech! It seems like it’s fairly old and common but the more I think about it in relation to current age AI, the more creeped out I get! And yet my doctor uses one regularly… I’ll be asking her about it too, don’t worry.
Thank you!
There is a lot to unpack in your post and this will be very long, sorry about that:
First off, what you are requesting is called “Automatic Speech Recognition”, or ASR for short, and the fundamental idea behind it is to receive a speech signal and convert it into a workable format. Usually this workable format means text or prompted tasks. Whether this is AI or not depends largely on how broadly you define AI. I wouldn’t classify it as AI because, at its core, it’s just statistical analysis. But AI can help fix errors, more on that later.
Classic ASR works on a Hidden Markov Model (HMM), a statistical model in which the next state depends only on the current state and not on the full history before it. In practice, it recognizes previously observed patterns. These patterns are taught to the model by a training process.
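To make the HMM idea a bit more concrete, here is a minimal sketch of Viterbi decoding, the standard way to find the most likely hidden-state sequence in an HMM. Everything in it is a toy assumption for illustration: the two states (`silence`, `speech`), the two observation symbols (`low`, `high` energy), and all probabilities are made up, and a real recognizer works with phoneme states and numeric feature vectors instead.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for the observations."""
    # best[t][s] = (log-probability of the best path ending in s at time t, previous state)
    best = [{s: (math.log(start_p[s] * emit_p[s][observations[0]]), None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Markov property: only the previous column matters, not the full history.
            prev = max(states, key=lambda p: best[-1][p][0] + math.log(trans_p[p][s]))
            score = best[-1][prev][0] + math.log(trans_p[prev][s]) + math.log(emit_p[s][obs])
            column[s] = (score, prev)
        best.append(column)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]

# Toy model (all numbers are illustrative assumptions, not trained values).
states = ["silence", "speech"]
start_p = {"silence": 0.8, "speech": 0.2}
trans_p = {
    "silence": {"silence": 0.7, "speech": 0.3},
    "speech": {"silence": 0.2, "speech": 0.8},
}
emit_p = {
    "silence": {"low": 0.9, "high": 0.1},
    "speech": {"low": 0.2, "high": 0.8},
}

path = viterbi(["low", "high", "high", "low"], states, start_p, trans_p, emit_p)
print(path)
```

The "training process" mentioned above is what would fill in `start_p`, `trans_p`, and `emit_p` from recorded data; here they are simply typed in by hand.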
The generalized process works like this:
- Cut the audio signal into small frames and analyze them according to a set of features like tonality, voicedness, and formants. This process is called feature extraction. It creates data vectors that contain information about the features of the raw signal.
- Load these features into a decoder. The decoder is an acoustic model that looks up the phonemes it recognized through the features in a dictionary and computes the most likely word in its dictionary. These results are retained, and sequences of words are compared against the decoder’s language model. What it recognizes and how well it recognizes signals is based on its own dictionary and the language model used afterwards.
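The framing and feature extraction step can be sketched roughly like this. It is only an illustration under stated assumptions: the function names are hypothetical, the "audio" is a synthetic 200 Hz tone, and the two features (short-time energy and zero-crossing rate) are simple stand-ins for the richer features a real system extracts.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping frames of frame_len, advancing by hop samples."""
    return [samples[i:i + frame_len] for i in range(0, len(samples) - frame_len + 1, hop)]

def extract_features(frame):
    """Return a [energy, zero_crossing_rate] feature vector for one frame."""
    energy = sum(x * x for x in frame) / len(frame)
    # Count sign changes between neighbouring samples (a rough voicedness cue).
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [energy, crossings / (len(frame) - 1)]

sample_rate = 8000  # assumed sample rate in Hz
# 0.1 seconds of a synthetic 200 Hz tone standing in for recorded speech.
signal = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(800)]

frames = frame_signal(signal, frame_len=200, hop=80)  # 25 ms frames, 10 ms hop
features = [extract_features(f) for f in frames]
print(len(frames), features[0])
```

Each entry in `features` is one of the data vectors described above; the decoder only ever sees these vectors, never the raw audio.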
Language models are essentially just presets that dictate what is accepted as a valid signal input. For an activation phrase, this would be a very simple grammar-based model that recognizes only the exact predefined token for the activation and rejects everything else. For general use, you can write a more adaptive grammar, or many different grammars at the same time, but you will still run into cases where the model rejects an input because it cannot find a grammar that matches the signal. This is called out-of-grammar (OOG) speech.
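A grammar-based check with out-of-grammar rejection could look something like the sketch below. The grammar format, the rule names, and the phrases are all hypothetical; it also assumes the decoder has already produced a list of words, which skips the hard part.

```python
# Hypothetical command grammar: each rule is a sequence of slots,
# and each slot is the set of words accepted at that position.
GRAMMAR = {
    "wake": [{"hey"}, {"assistant"}],
    "new_note": [{"start", "begin"}, {"new"}, {"note"}],
}

def match_grammar(words, grammar=GRAMMAR):
    """Return the name of the rule that matches words exactly, or None if out-of-grammar."""
    for name, slots in grammar.items():
        if len(words) == len(slots) and all(w in slot for w, slot in zip(words, slots)):
            return name
    return None  # out-of-grammar (OOG): no rule accepts this input

for phrase in ["hey assistant", "begin new note", "please write this down"]:
    print(phrase, "->", match_grammar(phrase.split()))
```

The last phrase is perfectly reasonable speech, but it is rejected because no rule covers it. That is the OOG case in miniature.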
To reduce OOG errors, you can train a statistical language model (SLM), which is basically a model built from a huge library of natural language data so it doesn’t rely on fixed grammars. A large language model (LLM) is like a very advanced SLM with a ridiculous amount of training data and trained, contextual connections between subjects. It’s called large because it requires an insane amount of data to function on even a very basic level. You can easily mix grammar-based and SLM approaches, so that you only need to use the SLM when an input is not recognized.
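For a feel of what the simplest kind of SLM does, here is a toy bigram model with add-one smoothing that picks the more plausible of two candidate transcriptions. The three-sentence corpus, the candidate sentences, and the function names are assumptions made up for illustration; a usable model is trained on vastly more text.

```python
import math
from collections import Counter

# Tiny made-up training corpus; a real SLM uses a huge library of text.
CORPUS = [
    "the patient reports mild pain",
    "the patient reports no pain",
    "the patient denies chest pain",
]

def train_bigrams(sentences):
    """Count how often each word follows each other word in the corpus."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab)

def log_prob(sentence, model):
    """Score a sentence; add-one smoothing keeps unseen word pairs from scoring zero."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )

model = train_bigrams(CORPUS)
# Two hypotheses that sound identical; the language model breaks the tie.
candidates = ["the patient reports know pain", "the patient reports no pain"]
best = max(candidates, key=lambda s: log_prob(s, model))
print(best)
```

Unlike a fixed grammar, this never rejects an input outright. It just assigns unfamiliar word sequences a lower score, which is why it copes better with free-form dictation.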
Source: Writing programs that recognize speech inputs and do tasks based upon them, like what your doctor probably has, was my last job until I quit. Whether we used a grammar-based approach or an SLM approach was entirely up to the specific use case. Purely grammar-based is more privacy-friendly because the computational work required is easily managed by most smartphones or other small portable devices and can easily be done offline. SLM solutions were generally not portable to handheld devices without relying on a cloud service doing the recognition (or at least not if you wanted an acceptable speed of input processing).
tl;dr If you want just plain speech-to-text where the program just writes down what it thinks you said and does not do any error correction, then you can do that offline (the language model my workplace used was from Dragon). If you want your assistant to “understand” what you were trying to say, you will require AI of some form, and those solutions are not very privacy-friendly.
That’s fascinating! Really cool explanation, thank you.
It sounds like Dragon has gotten a couple shoutouts. Haven’t heard of them before. I wouldn’t mind starting with some plain offline speech-to-text program. I suppose Samsung already has that feature built into the phone, but that leads us back to the privacy concerns.