Help me understand Voice Recognition tech

posted 5 months ago

I am interested in getting an app that would allow me to make notes via voice-to-text. I work in a field with HIPAA protections. I’m having trouble figuring out the nuances of privacy related to these apps.

First off, is this kind of software considered “AI”? How does it even recognize that a sound equals a word? Do they use LLM tech? Does the tech learn to recognize my voice better over time? Does it use my recordings to learn to understand others’ voices? Is this all a black box? How can I take precautions such that no one except me hears the things I transcribe?

This is just such confusing tech! It seems like it’s fairly old and common but the more I think about it in relation to current age AI, the more creeped out I get! And yet my doctor uses one regularly… I’ll be asking her about it too, don’t worry.

Thank you!

Diabolo96@lemmy.dbzer0.com

15 points

5 months ago

It’s AI and your voice won’t be used for training if you use a local model.

Use Whisper STT. It runs on your computer, so nothing leaves your machine. You can pick the model size based on how powerful your computer is: the bigger the model, the better it will be at transcribing.
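
For anyone curious what "runs on your computer" looks like in practice, here is a minimal sketch, assuming the `openai-whisper` Python package and ffmpeg are installed. The helper names (`segments_to_note`, `transcribe_locally`) and the timestamped note format are made up for illustration. The model weights are downloaded once on first use; after that the audio is processed locally.

```python
def format_timestamp(seconds: float) -> str:
    """Render a position in the recording as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def segments_to_note(segments) -> str:
    """Turn Whisper-style segments (dicts with 'start' and 'text') into a timestamped note."""
    lines = []
    for seg in segments:
        text = seg["text"].strip()
        if text:  # skip empty segments
            lines.append(f"[{format_timestamp(seg['start'])}] {text}")
    return "\n".join(lines)


def transcribe_locally(audio_path: str, model_name: str = "base.en") -> str:
    """Transcribe a recording on this machine and return a timestamped note."""
    # Imported lazily so the formatting helpers above work without the package.
    import whisper  # assumption: installed via `pip install openai-whisper`

    model = whisper.load_model(model_name)  # fetches the weights on first use only
    result = model.transcribe(audio_path)   # runs on the local CPU/GPU
    return segments_to_note(result["segments"])
```

Calling `transcribe_locally("note.wav")` (a hypothetical file name) would return the note as plain text that you can paste wherever you keep your records.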

TimewornTraveler@lemm.eeOP

0 points

5 months ago

That sounds interesting. I was hoping for something that I could use on a mobile app. I’m not sure what “adapting the model size” means so this might be more complicated than I’m looking for.

Diabolo96@lemmy.dbzer0.com

2 points

5 months ago

> I was hoping for something that I could use on a mobile app.

Record then transcribe later? But you can try https://whisper.ggerganov.com (this runs in your browser but nothing is sent, so it works even on your Android/iOS phone). The website owner is a trusted dev who made whisper.cpp and llama.cpp, the latter being the backbone of much of the local LLM ecosystem.

> I’m not sure what “adapting the model size” means so this might be more complicated than I’m looking for.

A bit of complexity is generally the price to pay for freedom from constant surveillance and data gathering. Plus, it’s actually super easy. A bigger model means better transcription quality, but the smaller ones are really good already. The base.en model is probably all you need anyway.
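
To make "adapting the model size" concrete, here is a rough sketch of picking a Whisper model from the memory you have available. The memory figures are approximate values as I recall them from the Whisper README, so treat them as assumptions, and `pick_model` is a made-up helper, not part of any library.

```python
# Approximate memory needed per model in GB, largest first.
# Assumption: rough figures recalled from the Whisper README, not exact.
MODEL_MEMORY_GB = [
    ("large", 10.0),
    ("medium", 5.0),
    ("small", 2.0),
    ("base", 1.0),
    ("tiny", 1.0),
]


def pick_model(available_gb: float, english_only: bool = True) -> str:
    """Return the largest model name that fits in the given memory."""
    for name, needed_gb in MODEL_MEMORY_GB:
        if available_gb >= needed_gb:
            break
    else:
        name = "tiny"  # nothing fits comfortably, fall back to the smallest
    # English-only variants exist for every size except "large".
    if english_only and name != "large":
        return name + ".en"
    return name


print(pick_model(1.5))  # a modest laptop or phone-class device
```

The trade-off is just speed and memory against accuracy, so starting with base.en and moving up only if the transcripts are not good enough is a reasonable approach.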

On PC, you can generally try any app from GitHub. Most of them are front ends for the same Whisper backend.

I found a few: https://whishper.net/ https://github.com/chidiwilliams/buzz

King_Bob_IV@startrek.website

6 points

5 months ago

I work for a Canadian EMR company and we deal with a couple of options for medical voice software. I know Dragon NaturallySpeaking has a medical offering that would likely meet any regulatory requirements. There are also some subscription-based ones, though I don’t know whether they have US versions. If you search for medical dictation software you should be able to find some options.

abhibeckert@lemmy.world

6 points

5 months ago

> I work in a field with HIPAA protections.

Definitely need to be careful then.

> is this kind of software considered “AI”?

Yes, the best voice recognition is based on AI.

Before AI, voice recognition existed but it was generally pretty shit and really struggled with accents, low-quality microphones, background noise, and people saying things that don’t strictly make sense. E.g. if you say “We’ll burn that bridge when we get to it,” a good AI might replace “burn” with the word “cross”. It will at least have the capability to do that; whether or not it does depends on your settings. Is “accuracy” about what someone said or what they actually meant? That’s configurable in the best systems.
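
One way to see why “accuracy” depends on what you compare against is word error rate (WER), a common metric for transcription quality. The sketch below is a plain edit-distance implementation written for illustration, not the scoring code of any particular product; it scores the same verbatim transcript against what was said and against what was meant.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the texts, divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word missing from transcript
                            curr[j - 1] + 1,      # extra word in transcript
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


spoken = "we'll burn that bridge when we get to it"
intended = "we'll cross that bridge when we get to it"
verbatim_transcript = spoken  # a system that writes down exactly what was said

print(word_error_rate(spoken, verbatim_transcript))    # scored against what was said
print(word_error_rate(intended, verbatim_transcript))  # scored against what was meant
```

The verbatim transcript is error-free by the first measure and one word off by the second, which is why the choice between the two behaviours is a setting rather than a bug.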

> Does the tech learn to recognize my voice better over time?

Old software did. These days systems work so well that it would just add cost with next to no benefit. Good speech recognition will understand your speech near-perfectly as long as your microphone is decent, and “learning” wouldn’t help much with that one potential problem area.

Some speech systems do learn in order to recognise/identify people. For example, a voice assistant might use it to figure out who “me” is in a command like “remind me to get milk when I get to the shops”. And a good transcription service will recognise different people talking in a single recording and provide an appropriately annotated transcript. That’s about the extent of “recognising” your voice; it doesn’t generally learn from you over time.

> Is this all a black box?

Kinda yeah. The researchers paid a huge number of people in third world countries to compare recordings to transcriptions, and make a “correct / incorrect” judgement call. Then fed all of that, and a whole bunch of other things (it’s believed every YouTube video ever uploaded might have been involved…) into a very complex model.

Tweaks are made but it’s just too much data (OpenAI says they used 680,000 hours of audio) to fully get your head around all of it. A bit like trying to understand how the human brain recognises speech — we have a broad idea but don’t really know.

> Does it use my recordings to learn to understand others’ voices? How can I take precautions such that no one except me hears the things I transcribe?

Check the privacy statement for the service. They might, for example, send your recordings to be assessed for accuracy by employees/subcontractors. AFAIK (not a lawyer) that would be a breach of HIPAA.

AFAIK some Apple speech recognition features are HIPAA compliant. Look that up to verify it, but in general iPhones and Macs have AI speech processing hardware on the device, allowing fully local processing. Not all features are done locally, though, and in some cases they may transmit “anonymised” (useless if you speak someone’s name…) speech to employees/contractors to improve the software. That can be disabled in settings.

Amazon and OpenAI do everything in the cloud but have fully HIPAA-compliant versions of their services (I assume those are not cheap…).

You could try open source models — I don’t know how good they are in practice.

iopq@lemmy.world

4 points

5 months ago

There are offline solutions that never transmit the data
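
If you want a quick sanity check that a Python-based transcriber really is offline, one rough approach is to make socket creation fail while it runs. This is only a sketch: `no_network` is a made-up helper, it only catches sockets opened through Python’s `socket` module in the same process (not native code or other processes), and it is a test aid rather than a security boundary.

```python
import socket
from contextlib import contextmanager


@contextmanager
def no_network():
    """Make any attempt to create a Python socket raise while the block runs."""
    original_socket = socket.socket

    def blocked(*args, **kwargs):
        raise RuntimeError("network access attempted during offline transcription")

    socket.socket = blocked
    try:
        yield
    finally:
        socket.socket = original_socket  # always restore normal networking


# Demonstration: opening a socket inside the block fails loudly.
with no_network():
    try:
        socket.socket()
    except RuntimeError as exc:
        print("blocked:", exc)
```

You would wrap the transcription call in `with no_network():` after the model has already been downloaded. For anything HIPAA-related, putting the device in airplane mode or using a firewall rule is the stronger check.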

shortwavesurfer@monero.town

3 points

5 months ago

FUTO Voice Input: https://app.futo.org/fdroid/repo

sugar_in_your_tea@sh.itjust.works

2 points

5 months ago

I haven’t tried it, but I really like the “local only” approach they’ve been using in other apps.

I just wish their licenses were more open (even AGPL would be fine). Here’s the source code in case someone is interested.
