Will there be a whispercpp equivalent? Half the reason I love whisper is how dead simple it is to get running. I will take somewhat lower accuracy for easier operation.
Edit: unless there is native speaker diarization. That would be a huge value add.
I'm curious about this too. Lately I've been building an open source tool to help make pulling and running models easier locally – https://github.com/jmorganca/ollama – right now we work with the awesome llama.cpp project, however, other model types have definitely come up. LLMs are a small section of what's available on huggingface, for example.
It's especially interesting how you could combine different model types - e.g. translation + text completion (or image generation) – it could be a pretty powerful combination...
Lol, they botched the first example - the one that translates “Our goal is to create a more connected world” into Vietnamese. It has a glaring typo at the end of the sentence: “hơn” instead of “hơ.” It also really messed up the pronunciation: it read “Chúng tôi” as “Chúng ta” - they are totally different words phonetically. The pronunciation also sounds unnatural, like it was produced by someone who is unwell. So they botched both the translation and the pronunciation.
That’s so embarrassing - especially for something meant to show off how good their stuff is (although it’s probably not the AI’s fault) - it just shows how sloppy their people are.
I know they have plenty of Vietnamese engineers there. Did the PR dept just throw this final version of the video out without reviewing with them?
For these tasks and languages, SeamlessM4T achieves state-of-the-art results for nearly 100 languages and multitask support across automatic speech recognition, speech-to-text, speech-to-speech, text-to-speech, and text-to-text translation—all in a single model. We also significantly improve performance for low and mid-resource languages supported and maintain strong performance on high-resource languages.
To more accurately evaluate the system without depending on text-based metrics, we extended our text-less metric into BLASER 2.0, which now enables evaluation across speech and text units with similar accuracy compared to its predecessor. When tested for robustness, our system performs better against background noises and speaker variations in speech-to-text tasks (average improvements of 37% and 48%, respectively) compared to the current state-of-the-art model.
SeamlessM4T also outperforms previous state-of-the-art competitors.
I don't really know how the metrics they list compare to Whisper's. I'm very curious whether these models are fast enough for real-time speech-to-text? I think Whisper technically could do it, but it was difficult to set up, or something like that?
Importantly, it's non-commercial. Almost all of Facebook's stuff used to be Apache; this new stance is really shitty of them, and I hope it limits adoption. Deigning to allow others to play with models (and make improvements, give feedback, build an ecosystem) that only you can profit from is not good community behavior. I'd rather see them make it freemium or paid if that's their goal; this is the equivalent of a kid licking a cookie so the others can't eat it.
> Almost all of Facebook's stuff used to be Apache; this new stance is really shitty of them, and I hope it limits adoption
The AI research environment has changed from the earlier default-open publication norms - unlike its competitors, FAIR is still releasing model weights instead of serving the models behind an API.
> this is the equivalent of a kid licking a cookie so the others can't eat it.
More like the other kid baking a cookie with the words "Free Cookie" on it so others can eat it if they are hungry, but can't sell it for money. It'd be foolish for FAIR to donate preconfigured homing missiles to OpenAI and others via one-way tech transfer.
> It'd be foolish for FAIR to donate preconfigured homing missiles to OpenAI and others via one-way tech transfer.
No, they could GPL it. And I don't think they're worried about competitors taking the models anyway; there's nothing particularly special about the weights or training data, just the compute. I think part of it is pressure from AI "safety" hangers-on who pretend that AI is dangerous, so only those who don't want to abide by license terms should have unfettered access. The other commercial reasons are harder to understand. With PyTorch they became the standard that everyone builds off of. They could have done that with their recent AI work, particularly LLaMA, but they chose this silly route.
Also, LLaMA has a more permissive license than this translation one, and is a more powerful model, so I don't really see the "homing missiles to open AI" angle.
> [...] I don't think they're worried about competition taking the models anyway, there's nothing particularly special about the weights or training data, just the compute
If that is the case, then what do you suppose is the reason most researcher outfits stopped releasing model weights, or offer more restrictions when they do?
Using the GPL won't prevent the larger AI competitors from using your model's outputs to tune their non-public models to consistently beat yours, but a non-commercial clause does.
> Also, LLaMA has a more permissive license than this translation one, and is a more powerful model, so I don't really see the "homing missiles to open AI" angle.
LLaMA lags GPT-4, but SeamlessM4T is ahead of WhisperX in some ways.
Please do not post output from LLMs here. It is against the rules and we have plenty of knowledgeable people to answer questions. We all have access to these chat bots if we want their answer.
To be fair to the OP, prefacing it as LLM output was the right choice. And although I do have access to these same tools, it was nice to get a description, know that it might be a little off the mark, and still end up further ahead than before.
To be fair to you, I agree. HN is probably the last place you want to post low-effort AI comments, regardless of how helpful they may be. Let’s leave the AI comments for Reddit.
One limitation that seems undocumented: the current code only supports relatively short clips, so it isn't suitable for long transcriptions:
> ValueError: The input sequence length must be less than or equal to the maximum sequence length (4096), but is 99945 instead.
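The usual workaround (not part of the official SeamlessM4T API, and the safe chunk duration below is an illustrative guess, not a documented value) is to split long audio into shorter segments and feed each one to the model separately. A minimal sketch of that chunking step, operating on a raw waveform as a list of samples:

```python
# Sketch: split a long waveform into overlapping chunks short enough for a
# model with a fixed maximum input length. The 20 s / 0.5 s values are
# illustrative assumptions, not documented SeamlessM4T limits.

def chunk_waveform(wave, sample_rate, max_seconds=20.0, overlap_seconds=0.5):
    """Return a list of waveform slices, each at most max_seconds long,
    with consecutive slices overlapping by overlap_seconds to avoid
    cutting words exactly at a boundary."""
    size = int(max_seconds * sample_rate)              # samples per chunk
    step = int((max_seconds - overlap_seconds) * sample_rate)
    chunks = []
    start = 0
    while start < len(wave):
        chunks.append(wave[start:start + size])
        start += step
    return chunks

# 60 s of (silent) audio at 16 kHz -> four chunks of <= 20 s each
wave = [0.0] * (60 * 16000)
chunks = chunk_waveform(wave, 16000)
```

Each chunk would then be transcribed or translated independently and the text outputs concatenated; the overlap means boundary words may appear twice, so some de-duplication is needed for clean output.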