Wednesday, February 26, 2025

ElevenLabs is launching its own speech-to-text model

ElevenLabs, an AI startup that just raised a $180 million mega funding round, has been known primarily for its audio generation prowess. The company has now taken a step in another technological direction by launching its first standalone speech-to-text model, called Scribe.

The startup, valued at $3.3 billion, has helped many other companies provide text-to-speech services through its vast library of voices. However, the company is now looking to get into speech detection and compete with the likes of Gladia, Speechmatics, AssemblyAI, Deepgram, and OpenAI’s Whisper models.

ElevenLabs’ Scribe model supports over 99 languages at launch. The company places over 25 languages in the model’s excellent accuracy category, where the word error rate is less than 5%. This list includes English (claimed accuracy rate of 97%), French, German, Hindi, Indonesian, Japanese, Kannada, Malayalam, Polish, Portuguese, Spanish, and Vietnamese. Other languages are ranked in different categories: high (5 to 10% word error rate), good (10 to 20% word error rate), and moderate (25 to 50% word error rate).
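For readers unfamiliar with the metric, word error rate is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model's output, divided by the number of reference words. The sketch below illustrates that standard definition only; the function name and sample sentences are hypothetical, and it is not a description of how ElevenLabs runs its own evaluations.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word and one dropped word
# against a six-word reference.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.1%}")
```

Under this definition, a transcript falls in the article's "excellent" band when the returned value is below 0.05.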

The company said that the model outperformed Google Gemini 2.0 Flash and Whisper Large V3 across multiple languages in FLEURS and Common Voice benchmark tests.

ElevenLabs had developed the speech-to-text component for its AI conversational agent platform, which was launched last year. However, this is the first time the company is releasing a standalone speech detection model. In a conversation with TechCrunch last month, CEO Mati Staniszewski talked about improving speech detection models.

“We want to understand what’s being said by you in a conversation better. We’re working on ways to move away from only generating content and understanding and transcribing speech,” Staniszewski said at the time. “Many people say that speech-to-text is a solved problem. But for many languages, it’s quite bad. We think we can build better speech detection models because we have in-house teams to annotate data and give us quick feedback.”

The model also offers smart speaker diarization to tell you who is speaking, word-level timestamps for accurate subtitles, and auto-tagging of sound events such as audience laughter. The startup is also providing a way for customers to directly transcribe video content to add subtitles or captions in its studio.

Scribe currently works only with pre-recorded audio formats, which means it isn’t yet effective for meeting transcriptions or voice note-taking. The company said it will release a low-latency real-time version of the model soon.

ElevenLabs is pricing Scribe at $0.40 per hour of transcribed audio. While the rate is competitive, some of its rivals currently offer a lower price for audio transcription, with some feature differentiation.
