
SeamlessM4T: Multimodal Model for Speech Translation

Posted 1 year ago by AsAnAILanguageModel@sh.itjust.works in machinelearning@kbin.social


Meta has released SeamlessM4T, a general-purpose multilingual speech and text model that is claimed to surpass OpenAI’s Whisper. It’s available on GitHub, and everything is free to use in non-commercial settings.

Model Features:

Automatic speech recognition for ~100 languages.
Speech-to-text translation for ~100 input/output languages.
Speech-to-speech translation for ~100 input languages and 35 output languages.
Text-to-text and text-to-speech translation for nearly 100 languages.

Dataset:

SeamlessAlign: Open multimodal translation dataset with 270,000 hours of speech and text alignments.

Technical Insights:

Uses a multilingual, multimodal text embedding space covering 200 languages.
Applies a teacher-student approach to extend this embedding space to the speech modality, which covers 36 languages.
Mining publicly available repositories yielded 443,000 hours of speech aligned with text and 29,000 hours of speech-to-speech alignments.
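
To make the mining idea concrete, here is a minimal sketch of how alignments can be pulled out of a shared embedding space. It is not Meta's pipeline: it assumes a speech encoder (the student) has already been trained to land near the text encoder's (the teacher's) embeddings, and it uses plain cosine similarity with a fixed threshold. The function names, the threshold value, and the toy vectors are all made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mine_pairs(speech_embs, text_embs, threshold=0.8):
    """For each speech embedding, keep its best-matching text embedding
    if the similarity clears the (hypothetical) threshold."""
    pairs = []
    if not text_embs:
        return pairs
    for i, s in enumerate(speech_embs):
        scores = [cosine(s, t) for t in text_embs]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs

# Toy 2-D embeddings standing in for encoder outputs (assumed, not real data).
speech = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]]
mined = mine_pairs(speech, text, threshold=0.9)
```

In this toy run the first two speech vectors each pair with their nearest text vector, while the third has no candidate above the threshold and is dropped. A real system would work on millions of segments and would need approximate nearest-neighbor search rather than this brute-force loop.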

Toxicity Filter:

The model identifies toxic words from speech inputs/outputs and filters unbalanced toxicity in training data.
The demo detects toxicity in both input and output. If toxicity is only detected in the output, a warning is included and the output is not shown.
Given how impaired llama2-chat has been by these kinds of filters, it’s unclear how useful these models will be in a general setting.
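
The demo behavior described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the word list is a placeholder, matching is a naive whole-word lookup, and the function names are hypothetical. The post does not say what happens when both input and output are flagged, so this sketch assumes the output is shown in that case.

```python
import re

# Placeholder lexicon; the real word lists are not given in the post.
TOXIC_WORDS = {"badword", "slur"}

def toxic_terms(text, lexicon=TOXIC_WORDS):
    """Return the sorted lexicon entries that appear as whole words in text."""
    tokens = re.findall(r"\w+", text.lower())
    return sorted(set(tokens) & set(lexicon))

def present_translation(source_text, translated_text):
    """Withhold the output only when toxicity appears in the output
    but not in the input (i.e. the model added it)."""
    in_hits = toxic_terms(source_text)
    out_hits = toxic_terms(translated_text)
    if out_hits and not in_hits:
        return {
            "output": None,
            "warning": "Toxicity detected in output only; translation withheld.",
        }
    # Assumption: clean output, or toxicity already present in the input,
    # is passed through unchanged.
    return {"output": translated_text, "warning": None}

added = present_translation("hello there", "hello badword")
carried = present_translation("a badword here", "un badword ici")
clean = present_translation("hello there", "bonjour")
```

Even this toy version shows the failure mode to worry about: any false positive in the lexicon on the output side silently discards an otherwise correct translation.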

