I know this is kind of offtopic, but y'all _genuinely_ need to up your UX game. Lots of popups, strange blurred text in the background that makes me think the interface itself is a popup, a youtube URL entered into the text box that.. doesn't do anything (it looks like it should), strangely contrasting and miniature text.
Really not trying to be a jerk -- I think this is a neat project.
Congrats on the launch! The "answer format" parameter is a nice idea for advanced use cases.
If someone wants to compare, Universal Summarizer [1] can do real time summarization of audio/speech, with unlimited input token length and is free (for Kagi members or can be tried with a trial account). Just point it to the URL of the podcast/audio/speech file.
Its the seamless integration that counts. With the playground for example you can have it read and summarize a near 3 hour video https://www.youtube.com/watch?v=Se91Pn3xxSs
with a few clicks and decide if you really want to watch
Do you use Whisper for the transcript (which version? base?) and GPT-3.5-turbo for the language model? Do you provide a self-hosted solution for the companies that don't want their meetings going "on the cloud"? I do not mean to be dismissive of all your work, I know too well the devil is in the details, but what are the key advantages of using your solution over having a Python dev (or GPT-4) write a similar tool using Langchain + whisper + llama2 for example? Again, please do not take this as a cheap shot, I might not be the target audience but if I were to use such a tool I would like everything to run locally because of privacy/corporate spying concerns. Thanks!
EDIT: Also it is unclear if you support other languages than English. Whisper does, so in theory you should. There are companies out there where English is not the work language.
Their ASR model is Conformer trained on 1.1M hours, so the result should be better than Whisper.
From their pricing page, with ~ length of a meeting, input size 15000 tokens (60 minutes audio file), output size 2000 tokens (1500 words), LeMUR default, the price estimate is $0.353, which is I think a fairly good price.
This tool can save a lot of time for a secretary, even replace them. But I think sending your meeting data is still quite risky.
Not surprising though as at this level all these options are starting to be leveled by inconsistencies in manual groundtruth. Conformer alone also isn’t the most powerful architecture out there for speech. This is also slower than, say running a large k2 zipformer via onnx on cpu.
Also if you have a small shop at this point you can do all of this yourself with whisper large v2 on a single 16gb gpu via some tweaking of https://github.com/guillaumekln/faster-whisper and an OSS LLM.
Interesting stuff but I think margins in this space are getting ready to simply vanish.
I'd recommend just trying the Colab in my comment above to test out how quick you can do what you want with LeMUR versus building your own. Piping in 100 hours of audio into an LLM can be a lot of work compared to an API call, but it'll depend on what you are building
Nice! Could also use 2 API calls to openAI.
→ browser record audio
→ send audio to openAI for audio to text transcription
→ send transcribed text for completion
→ display results
→ https://attention1.gitlab.io/ai-interface (open code)
I know this is discouraged to complain about name clashes, but Lemur was also a serious brand in the music production space for over 20 years. First they developed hardware, then transitioned to apps, but now they are no more. For me, when you say "audio" and "lemur" it's the first thing I think of.
Bit of a misleading name. Between the local nature of Llama, Alpaca and Orca, one might expect LeMUR to be something you can download for yourself too. But nope, this is closer to the OpenAI "Pay As You Go" model a-la GPT3/4.
Really not trying to be a jerk -- I think this is a neat project.