SeamlessM4T, a Multimodal AI Model for Speech and Text Translation (fb.com)
167 points by mchiang on Aug 22, 2023 | 35 comments


I gave it a spin a little bit ago. Per usual, the install docs didn't quite work out of the box; here's how I got it working: https://llm-tracker.info/books/howto-guides/page/speech-to-t...

One limitation that seems undocumented: the current code only supports relatively short clips, so it isn't suitable for long transcriptions:

> ValueError: The input sequence length must be less than or equal to the maximum sequence length (4096), but is 99945 instead.


Seems like you could easily write a little bash/Python script to chop up the recording, batch-process each chunk, then stitch the results together?
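
Something like this, perhaps (an untested sketch using pydub; transcribe_chunk is a hypothetical stand-in for whatever SeamlessM4T invocation you end up with):

  # Chop a long recording into chunks, run each through the model,
  # then stitch the transcripts back together. Needs pydub + ffmpeg.
  from pydub import AudioSegment

  CHUNK_MS = 20_000  # stay well under the model's max sequence length

  def transcribe_chunk(path):
      # Hypothetical placeholder: swap in your actual SeamlessM4T call.
      raise NotImplementedError

  def transcribe_long(path):
      audio = AudioSegment.from_file(path)
      texts = []
      for start in range(0, len(audio), CHUNK_MS):
          chunk = audio[start:start + CHUNK_MS]
          chunk_path = f"/tmp/chunk_{start}.wav"
          chunk.export(chunk_path, format="wav")
          texts.append(transcribe_chunk(chunk_path))
      return " ".join(texts)

Naive fixed-size splits can cut words in half, so splitting on silence (pydub's split_on_silence) would be a nicer variant.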


Probably, although you could more easily use WhisperX and get the same results twice as fast and without any additional scripting.


Does WhisperX do translation? The repo suggests it's only for transcription, timestamps, and diarization.


Yes, just use the "translate" option instead of "transcribe".
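
In plain openai-whisper it's the task parameter, and WhisperX exposes the same option. A minimal sketch with openai-whisper (the filename is just an example):

  # Whisper's "translate" task emits English text regardless of the
  # source language of the audio.
  import whisper

  model = whisper.load_model("large-v2")
  result = model.transcribe("speech_de.mp3", task="translate")
  print(result["text"])  # English translation of the German audio

Note that Whisper only translates into English, whereas SeamlessM4T handles many target languages.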



There is also a Hugging Face Space for some quick tests without downloading the model:

https://huggingface.co/spaces/facebook/seamless_m4t
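
If you'd rather script against the Space than click around, gradio_client can call it; the endpoint name and argument order below are guesses, so check the Space's "Use via API" panel for the real signature:

  # Query the hosted demo without downloading the model.
  # Endpoint name and arguments are assumptions; see the Space's
  # "Use via API" panel for the actual signature.
  from gradio_client import Client

  client = Client("facebook/seamless_m4t")
  result = client.predict(
      "T2TT (Text to Text translation)",  # task name (assumed label)
      "Hello, world!",                    # input text
      "English",                          # source language (assumed)
      "French",                           # target language (assumed)
      api_name="/run",                    # assumed endpoint
  )
  print(result)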


Will there be a whisper.cpp equivalent? Half the reason I love Whisper is how dead simple it is to get running. I will take somewhat lower accuracy for easier operation.

Edit: unless there is native speaker diarization. That would be a huge value add.


They have a smaller model[1] that might be amenable to Whisper-ization.

Much smaller language matrix though.

[1]: https://github.com/facebookresearch/seamless_communication/b...


I'm curious about this too. Lately I've been building an open source tool to make pulling and running models easier locally: https://github.com/jmorganca/ollama. Right now we work with the awesome llama.cpp project, but other model types have definitely come up; LLMs are a small slice of what's available on Hugging Face, for example.

It's especially interesting how you could combine different model types - e.g. translation + text completion (or image generation) – it could be a pretty powerful combination...
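
As a toy illustration of that kind of chaining (both helpers here are hypothetical placeholders for whatever backends you wire up: SeamlessM4T for speech translation, a local llama.cpp/Ollama model for completion):

  # Toy pipeline: translate foreign-language speech to English text,
  # then hand the English text to an LLM for further processing.
  def translate_speech(audio_path, tgt_lang="eng"):
      ...  # hypothetical: e.g. a SeamlessM4T speech-to-text-translation call

  def complete(prompt):
      ...  # hypothetical: e.g. a local llama.cpp / Ollama completion call

  def summarize_foreign_audio(audio_path):
      english = translate_speech(audio_path, tgt_lang="eng")
      return complete("Summarize the following transcript:\n\n" + english)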


All I want is llama-2-34b (seriously, what's taking so long on this specific model?), but this is interesting too, I guess.


Yet somehow, many here underestimated Meta's position in AI and proclaimed that Meta was dying, unimportant, and far behind in the AI race.

How dramatically things changed in one year, after all the exaggerated talk of Meta's collapse in 2022.

Not only are they in the lead in free ($0) AI models, they are also at the finish line in the AI race to zero.


Lol, they botched the first example, the one that translates “Our goal is to create a more connected world” into Vietnamese: there's a glaring typo at the end of the sentence, “hơ” instead of “hơn.” It also really messed up the pronunciation: it read “Chúng tôi” as “Chúng ta,” which are totally different words phonetically, and the pronunciation sounds distinctly unnatural throughout. So they botched both the translation and the pronunciation.

That’s so embarrassing, especially for something meant to show off how good their stuff is (although it's probably not the AI's fault); it just shows how sloppy their people are.

I know they have plenty of Vietnamese engineers there. Did the PR dept just throw this final version of the video out without reviewing with them?


SeamlessM4T-Medium: 1.2B params, 6.8 GB file size. Wondering how it compares to OpenAI's Whisper.


Go to the blog and skip to results: https://ai.meta.com/blog/seamless-m4t/


> For these tasks and languages, SeamlessM4T achieves state-of-the-art results for nearly 100 languages and multitask support across automatic speech recognition, speech-to-text, speech-to-speech, text-to-speech, and text-to-text translation—all in a single model. We also significantly improve performance for low and mid-resource languages supported and maintain strong performance on high-resource languages.

> To more accurately evaluate the system without depending on text-based metrics, we extended our text-less metric into BLASER 2.0, which now enables evaluation across speech and text units with similar accuracy compared to its predecessor. When tested for robustness, our system performs better against background noises and speaker variations in speech-to-text tasks (average improvements of 37% and 48%, respectively) compared to the current state-of-the-art model.

> SeamlessM4T also outperforms previous state-of-the-art competitors.


281M and 235M param models too.

https://github.com/facebookresearch/seamless_communication/b...

I don't really know how the metrics they list compare to Whisper. I'm very curious whether these are fast enough for realtime speech-to-text. I think Whisper technically could do it, but it was difficult to set up, or something like that?


The speech recognition in their demo is very, very bad (~60% accuracy in my empirical test, vs. 95% with whisper.cpp). The translation is also very inaccurate.

That said, I fully support open releases and look forward to future versions and improvements.


Disappointing license. Here's a useful thing, but be sure not to use it for the majority of use cases.


Meta is killing it with these open models. Not sure why Tamil is missing from the output languages, though.


Non-commercial, as per frickin' usual.


What's the license?


CC BY-NC 4.0


Importantly, non-commercial. Almost all of Facebook's stuff used to be Apache-licensed; this new stance is really shitty of them, and I hope it limits adoption. Deigning to allow others to play with models (and make improvements, give feedback, build an ecosystem) that only you can profit from is not good community behavior. I'd rather see them make it freemium or paid if that's their goal; this is the equivalent of a kid licking a cookie so the others can't eat it.


> Almost all of Facebook's stuff used to be Apache-licensed; this new stance is really shitty of them, and I hope it limits adoption

The AI research environment has changed from the earlier default-open publication norm. Unlike its competitors, FAIR is still releasing model weights instead of serving the models behind an API.

> this is the equivalent of a kid licking a cookie so the others can't eat it.

More like the other kid baking a cookie with the words "Free Cookie" on it so others can eat it if they are hungry, but can't sell it for money. It'd be foolish for FAIR to donate preconfigured homing missiles to OpenAI and others via one-way tech transfer.


> It'd be foolish for FAIR to donate preconfigured homing missiles to OpenAI and others via one-way tech transfer.

No, they could GPL it, and I don't think they're worried about competition taking the models anyway; there's nothing particularly special about the weights or training data, just the compute. I think part of it is pressure from AI "safety" hangers-on who pretend AI is so dangerous that access must be restricted, even though license terms only constrain those willing to abide by them. The other commercial reasons are harder to understand. With PyTorch they became the standard that everyone builds on; they could have done the same with their recent AI work, particularly LLaMA, but they chose this silly route.

Also, LLaMA has a more permissive license than this translation model, and is a more powerful model, so I don't really see the "homing missiles to OpenAI" angle.


> [...] I don't think they're worried about competition taking the models anyway, there's nothing particularly special about the weights or training data, just the compute

If that is the case, then why do you suppose most research outfits have stopped releasing model weights, or impose more restrictions when they do?

Using the GPL won't prevent the larger AI competitors from using your model's outputs to tune their non-public models to consistently beat yours, but a non-commercial clause does.

> Also, LLaMA has a more permissive license than this translation one, and is a more powerful model, so I don't really see the "homing missiles to open AI" angle.

LLaMA lags behind GPT-4, but SeamlessM4T is ahead of WhisperX in some ways.


True, LLaMA2 is more like "donating homing missiles to everyone except OpenAI, Google, and Apple."


I largely ignore the licenses on this stuff and use it for commercial purposes. If they want to sue me, let them. In fact, I dare them.

Their models are not copyrightable and I would not waste this opportunity to establish this fact.


I was trying to figure out what it means, and this is the summary from Bard, so take it with a grain of salt.

The CC BY-NC 4.0 license allows for the following uses of the licensed material:

* Reproduction: You can copy and distribute the licensed material in any medium or format.

* Distribution: You can distribute the licensed material to others.

* Public performance: You can perform the licensed material publicly.

* Public display: You can display the licensed material publicly.

* Modification: You can remix, transform, and build upon the licensed material.

* Derivative works: You can create derivative works based on the licensed material.

However, there are some restrictions on how you can use the licensed material under the CC BY-NC 4.0 license:

* Commercial use: You cannot use the licensed material for commercial purposes.

* Sublicensing: You cannot sublicense the licensed material.

* Moral rights: The licensor retains all moral rights in the licensed material.

Here are some examples of how the CC BY-NC 4.0 license can be used:

* A teacher can use a CC BY-NC 4.0 licensed image in a presentation for their class.

* A student can create a CC BY-NC 4.0 licensed remix of a song.

* A software developer can use a CC BY-NC 4.0 licensed library in their open source project.

* A photographer can share their photos on a CC BY-NC 4.0 licensed website.


Please do not post output from LLMs here. It is against the rules, and we have plenty of knowledgeable people to answer questions. We all have access to these chatbots if we want their answers.


To be fair to the OP, prefacing it as LLM output was the right choice, and although I do have access to these same tools, it was nice to get a description while knowing it might be a little off the mark; I'm further ahead than I was before.

To be fair to you, I agree. HN is probably the last place you want to post low-effort AI comments, regardless of how helpful they may be. Let's leave the AI comments for Reddit.


Where is this against the rules?



...'M4T', ahem, might mean slightly more than you think it does.



