Voice & Speech AI Companies
Voice and speech AI companies build systems for automatic speech recognition (ASR), text-to-speech (TTS), voice cloning, and real-time translation. Advances in neural codec language models have made synthetic voices nearly indistinguishable from human speech, enabling new applications in accessibility, entertainment, and enterprise communications.
Spotify AI
Stockholm, Sweden
Spotify leverages advanced machine learning models – including collaborative filtering and natural language processing – to power personalized music and podcast recommendations through features like Discover Weekly and Release Radar. Their AI capabilities extend to audio analysis for features such as the automated DJ experience and podcast transcription, as well as content moderation systems designed to ensure platform safety. With over 574 million monthly active users globally, Spotify’s AI-driven personalization is a key differentiator in the competitive streaming market and contributes significantly to user engagement and retention.
Uniphore
Chennai, India
Uniphore provides an enterprise-grade Business AI Cloud platform focused on bridging the gap between consumer and business AI applications. Their core technology centers on a composable and secure AI architecture encompassing data, knowledge, models, and agents, with a strong emphasis on speech analytics and conversational AI. Uniphore targets large enterprises seeking to deploy and manage AI solutions across their operations with a focus on data sovereignty and control.
Verbit
New York, United States
Verbit is a US-based AI company specializing in highly accurate transcription and captioning services. Their core technology, the Captivate™ ASR engine enhanced by Gen.V™ generative AI, delivers rapid, customizable transcripts with automated summarization and keyword extraction. Verbit primarily serves speech-intensive industries like legal and education, offering solutions to improve accessibility, enhance productivity, and derive actionable insights from audio and video content.
Epidemic Sound
Stockholm, Sweden
Epidemic Sound is a Swedish provider of royalty-free music and sound effects for content creators. Their platform utilizes AI-powered search and recommendation algorithms to facilitate efficient content matching and discovery within a vast library of audio assets. Targeting video creators, marketers, and podcasters, Epidemic Sound offers a subscription-based licensing model providing unrestricted usage rights for their audio content globally.
Dialpad
San Francisco, United States
Dialpad is a US-based provider of an all-in-one cloud communications platform integrating voice, video, messaging, and a contact center solution. Their core technology leverages real-time Voice AI to provide features like automated call transcription, agent coaching, and autonomous workflow execution for tasks like appointment scheduling and refunds. Dialpad targets businesses seeking to improve contact center performance and streamline communications across multiple channels, with a focus on security and integration with existing CRM and collaboration tools.
Suno
Cambridge, United States
Suno is a leading AI music generation platform with 100M+ users. It generates $200M in annual revenue and has raised $250M at a $2.45B valuation, with backing from NVIDIA.
Mobvoi
Beijing, China
Mobvoi is a Chinese technology company specializing in voice AI and intelligent wearables. Their core technology centers around a proprietary Chinese Natural Language Processing (NLP) engine powering voice assistants and features across their product line, most notably the TicWatch series of smartwatches. Mobvoi primarily targets the Chinese market with localized AI experiences, while also offering select wearables internationally with a focus on health and fitness tracking.
Observe.AI
San Francisco, United States
Observe.AI provides AI agents for enterprise contact centers, automating and improving customer interactions across voice channels. Their technology uses advanced speech recognition and natural language processing to accurately understand complex, real-world conversations – even with background noise and interruptions – and integrates with existing CRM and workflow systems. This enables businesses to automate call resolution, improve agent performance through AI-powered quality assurance, and achieve predictable outcomes in customer service operations.
Loom
San Francisco, United States
Loom is a video messaging platform that enables asynchronous communication through quick screen and camera recordings. Utilizing automatic speech recognition (ASR) technology, Loom provides searchable video transcripts and captions for improved accessibility and information retrieval. Primarily targeting professionals and teams, Loom streamlines communication and documentation workflows, offering a more efficient alternative to traditional email and meetings.
AISpeech
Suzhou, China
AISpeech is a leading specialized large-model conversational AI platform company in China, enabling intelligent connectivity and streamlined operations.
Suki AI
Redwood City, United States
Suki AI develops an ambient clinical intelligence platform that utilizes voice AI and natural language processing to automate clinical documentation workflows. Their technology captures and analyzes patient-physician conversations to generate comprehensive notes, orders, and instructions directly within existing Electronic Health Record (EHR) systems. Suki AI targets healthcare providers and organizations seeking to reduce administrative burden, alleviate physician burnout, and enhance revenue cycle management through streamlined documentation processes.
Cogito
Boston, United States
Cogito, now part of Verint, delivers real-time AI-powered coaching and performance analytics for contact centers. Their core technology utilizes proprietary AI models to analyze voice conversations, providing both customer experience (CX) and employee experience (EX) scoring during live calls. This enables targeted, in-the-moment guidance for agents, with a focus on improving key metrics like average handle time, customer satisfaction, and revenue generation for large enterprises in sectors like telecommunications and healthcare.
Suno
Cambridge, United States
Suno is a US-based generative AI company specializing in the creation of original music from text-based prompts. Their core technology utilizes AI models to compose full songs, including lyrics and instrumentation, allowing users to rapidly prototype and produce musical content. Suno targets a broad market including musicians, content creators, and hobbyists seeking accessible tools for music production and exploration, offering a platform for both creation and discovery.
Poly AI
London, United Kingdom
Poly AI develops conversational AI solutions for enterprise contact centers, enabling fully autonomous handling of customer voice calls. Their core technology focuses on delivering highly natural, multilingual voice interactions that replicate human agent conversations, distinguishing them through a customer-led approach to AI training. Poly AI targets businesses seeking to scale customer service while maintaining a high-quality, localized brand experience, particularly within the hospitality and service industries.
AssemblyAI
San Francisco, United States
AssemblyAI develops highly accurate speech-to-text APIs, alongside LeMUR, its framework for applying large language models to transcribed speech, and a suite of audio intelligence features like speaker diarization, entity detection, and topic detection. Their key innovation lies in offering low-latency, high-accuracy transcription optimized for real-time and asynchronous applications, alongside advanced features like content moderation and redaction. Serving a diverse market including contact centers, media companies, and research institutions, AssemblyAI processes millions of minutes of audio data monthly and is recognized for consistently achieving industry-leading Word Error Rates (WER) in independent evaluations.
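For readers unfamiliar with the metric, WER is conventionally defined as the number of word-level substitutions, deletions, and insertions needed to turn a system's transcript into the reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of that definition, not any vendor's evaluation code. The function name `word_error_rate` is hypothetical, and the normalization used here (lowercasing and whitespace splitting) is a simplifying assumption; real evaluations typically apply more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    # Simplifying assumption: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words (word-level Levenshtein distance).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

In the usage line, the hypothesis differs from the six-word reference by one substituted word, so the WER is 1/6. An accuracy figure such as "96%+" is commonly read as one minus the WER, although vendors do not always state which definition they use.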
Otter.ai
Mountain View, United States
Otter.ai develops AI-powered meeting solutions, most notably its Otter Meeting Agent platform, which provides real-time transcription, automated summaries, and AI-driven action item detection. The platform leverages advanced speech recognition and natural language processing to create searchable meeting records and facilitate collaboration, integrating with popular video conferencing tools like Zoom, Google Meet, and Microsoft Teams. Otter.ai currently serves a broad professional market, with reported user testimonials indicating significant time savings – up to 33% according to one VP of Sales at Aiden Technologies – and increased productivity for teams reliant on frequent meetings.
ElevenLabs
New York, United States
ElevenLabs specializes in realistic voice AI, offering a platform for text-to-speech generation, voice cloning, and voice agents powered by proprietary models. Their platform provides access to over 5,000 voices in 70+ languages, and recently expanded with the launch of the Iconic Marketplace featuring digitally recreated voices of prominent figures such as Matthew McConaughey and Sir Michael Caine. ElevenLabs targets content creators, developers, and businesses seeking to integrate high-quality, customizable voice solutions into applications ranging from audiobooks and gaming to virtual assistants and accessibility tools.
Descript
San Francisco, United States
Descript develops a cross-platform audio and video editing platform centered around speech-to-text technology, enabling users to edit media by directly manipulating transcripts. Key innovations include Overdub, a realistic voice synthesis tool allowing users to correct or add to recordings using AI-generated speech, and Studio Sound, which enhances audio clarity with a single click. Targeting podcasters, video creators, and marketing teams, Descript has gained traction for its unique transcript-based workflow and recently launched Underlord, an AI-powered video editor capable of generating and editing video content from text prompts.
Chorus.ai
San Francisco, United States
Chorus.ai, now integrated within ZoomInfo, delivers conversation intelligence software that analyzes sales calls and meetings. Their platform utilizes AI-powered speech and text analytics to identify key conversation patterns, coaching opportunities, and deal-critical insights. This technology primarily serves revenue-focused teams within B2B organizations to improve sales performance and forecasting accuracy.
Ambience Healthcare
San Francisco, United States
Ambience Healthcare provides an AI-powered platform that automates clinical documentation and coding for U.S. healthcare systems. Utilizing natural language processing and speech recognition, the platform generates structured data from patient encounters, reducing administrative burden on clinicians. Ambience targets health systems seeking to improve revenue cycle management, ensure compliance, and allow physicians to focus on patient care rather than documentation.
Parloa
Berlin, Germany
Parloa delivers a generative AI-powered platform for contact center automation, enabling enterprises to deploy and manage personalized “AI agents” that handle high-volume customer interactions. Their technology orchestrates the full AI agent lifecycle – from development to deployment and optimization – focusing on complex tasks like scheduling, refunds, and personalized recommendations. Parloa targets large enterprises seeking to improve customer loyalty and efficiency, and their platform is designed for high-stakes environments requiring precision and scalability in customer communication.
Deepgram
San Francisco, United States
Deepgram is a US-based provider of voice AI APIs for enterprise applications, offering unified speech-to-text, text-to-speech, and LLM orchestration. Their platform distinguishes itself through a single API designed to minimize complexity, latency, and cost compared to component-based solutions, and supports both real-time and batch processing with telephony integrations. Deepgram targets developers and businesses requiring highly accurate and scalable voice intelligence for applications like contact centers, voice assistants, and conversational AI systems.
Speechmatics
Cambridge, United Kingdom
Speechmatics is a UK-based technology company specializing in accurate, low-latency Automatic Speech Recognition (ASR) and speech-to-text solutions. Their core offering is a Speech API providing transcription, real-time translation, and text-to-speech capabilities, deployable on-device, on-premise, or in the cloud. Speechmatics targets enterprises requiring high-quality voice AI with a focus on data privacy, offering a non-logging standard deployment option.
Corti
Copenhagen, Denmark
Corti is a Danish AI infrastructure provider specializing in healthcare applications. Their core product is a highly accurate medical Automatic Speech Recognition (ASR) API that converts clinical conversations into structured data and documentation. Corti targets healthcare developers and providers seeking to rapidly build and deploy voice-enabled workflows – such as automated note-taking, report generation, and point-of-care support – without managing complex AI infrastructure.
Papercup
London, United Kingdom
Papercup provides AI-powered dubbing and voice-over solutions for video content, utilizing a patented technology stack trained on extensive licensed voice data. Their platform combines synthetic voices with human editorial post-editing to deliver natural-sounding, culturally nuanced audio localization. Papercup targets enterprise-level content creators and media companies seeking scalable and cost-effective methods to expand global reach without sacrificing audience engagement.
Infinitus Systems
San Francisco, United States
Infinitus Systems develops a voice AI platform that automates administrative and clinical phone calls for U.S. healthcare providers and payers. Their technology specifically addresses time-consuming tasks like prior authorization and routine patient communication, utilizing AI agents to handle calls without human intervention. This solution aims to reduce administrative burden, improve staff productivity, and ultimately enhance patient outcomes within the healthcare system.
Sanas
Palo Alto, United States
Sanas provides a real-time Speech AI platform specializing in accent and language translation for improved communication clarity. Their core technology modulates speech to neutralize accents and remove noise while preserving vocal characteristics, enabling natural-sounding conversations in over 25 languages. Sanas targets call centers and communication-heavy businesses seeking to enhance customer and employee experiences, reduce communication friction, and improve key performance indicators like CSAT and AHT.
Hume AI
New York, United States
Hume AI builds empathic AI that understands and responds to human emotional expressions. It provides APIs for emotion recognition in voice, face, and language.
aiOla
Tel Aviv, Israel
aiOla transforms frontline speech into structured, validated data for enterprise systems. Its voice-agentic workflows replace manual data entry.
Modulate
Cambridge, United States
Modulate is a US-based AI platform that analyzes live and recorded voice conversations to deliver real-time insights into content, intent, and emotional state. Their core technology decodes multi-dimensional voice signals – including deception, toxicity, and synthetic speech – to provide actionable alerts and APIs. Modulate targets businesses requiring enhanced fraud prevention, trust & safety measures, and customer experience improvements through proactive voice intelligence, serving sectors like gaming, contact centers, and online communities.
Decagon
San Francisco, United States
Decagon delivers AI-powered virtual agents for enterprise customer support, specializing in voice and chat channels. Their core technology focuses on customizable conversational AI with cross-channel memory, enabling personalized and connected customer interactions. Decagon targets companies seeking to significantly increase customer support deflection rates, scale operations to 24/7 availability, and improve key customer experience metrics like First Response Time and Customer Satisfaction.
Fano Labs
Hong Kong, Hong Kong
Fano Labs specializes in speech recognition and NLP for Asian languages, serving financial services and customer service industries.
PlayHT
San Francisco, United States
PlayHT is a US-based AI company specializing in realistic text-to-speech (TTS) and voice cloning technology delivered via API. Their platform offers over 200 AI voices in 40+ languages, focusing on low-latency synthesis for applications requiring natural-sounding, multi-speaker audio. PlayHT targets content creators and enterprises seeking to automate voiceovers and generate audio content at scale.
Cartesia
San Francisco, United States
Cartesia builds fast, realtime AI models for voice and speech. Their Sonic model enables sub-100ms latency text-to-speech for conversational AI.
Fixie.ai
Seattle, United States
Fixie.ai develops the Ultravox platform, enabling developers to build and deploy AI agents powered by a next-generation, open-source Speech Language Model (SLM). Ultravox focuses on natural speech understanding to facilitate more human-like conversational AI experiences. The company targets businesses seeking to integrate scalable voice AI capabilities into their applications and workflows.
LiveKit
San Francisco, United States
LiveKit is an open-source platform for building realtime audio and video applications. It powers voice AI agents with ultra-low-latency infrastructure.
PlayAI
San Francisco, United States
PlayAI develops voice cloning and text-to-speech technology. Their platform creates custom AI voice models from audio samples, enabling natural-sounding speech synthesis for content creators and businesses.
Krisp
San Francisco, United States
Krisp develops AI-powered tools to enhance the quality and productivity of virtual meetings. Their core product is an AI Meeting Assistant that combines industry-leading noise cancellation with automated transcription, summarization, and accent conversion. Krisp targets professionals and teams seeking to improve communication clarity and efficiency in remote and hybrid work environments by automating key meeting tasks.
Bland AI
San Francisco, United States
Bland AI provides enterprises with AI-powered phone agents capable of handling both inbound and outbound calls using natural language processing. Their core technology centers on customizable voice models trained on client-provided recordings and transcriptions, offering a branded conversational experience. Targeting businesses across verticals like finance, healthcare, and logistics, Bland AI differentiates itself through on-premise data security and seamless integration capabilities for automating customer support, sales, and operational communications.
Voiceitt
Tel Aviv, Israel
Voiceitt develops AI-powered speech recognition technology specifically designed to understand non-standard speech patterns, including those resulting from speech impairments, accents, or aging-related conditions. Their core product is a customizable API and software solution leveraging a proprietary database of atypical speech and advanced machine learning. Voiceitt primarily serves individuals with speech disabilities, as well as accessibility applications for accented speakers and those in the Deaf community, enabling greater communication independence and access to voice-controlled technologies.
Rinna
Tokyo, Japan
Rinna is a Japanese AI company specializing in conversational AI and virtual character development. Their core technology centers around creating highly realistic AI personalities capable of natural language interactions, initially demonstrated through integrations with LINE and evolving into AI-powered virtual YouTubers (AITubers). Rinna targets businesses and entertainment sectors seeking to leverage advanced AI for customer engagement, content creation, and immersive digital experiences, with a strong focus on the Japanese market.
Podcastle
Wilmington, United States
Podcastle is a US-based software company offering an all-in-one platform for video and podcast creation directly within a web browser. Their core technology centers on AI-powered tools for audio and video editing, including features for noise reduction, automatic editing, and AI voice generation. Podcastle targets long-form content creators seeking a streamlined, browser-based solution for recording, editing, and distributing professional-quality audio and video content.
Resemble AI
San Francisco, United States
Resemble AI develops a generative AI platform specializing in voice and audio technology, offering products like real-time voice cloning via their Chatterbox model, and audio editing tools like Edit. Their key innovations include DETECT-3B Omni, a multi-modal deepfake detection model consistently ranked among the industry’s most robust, alongside PerTh, an AI-powered watermarking solution for content provenance. Resemble AI serves enterprise and government clients – including Fortune 500 companies – with solutions for content creation, security, and speaker verification, and is trusted by over 3 million teams worldwide.
Vapi
San Francisco, United States
Vapi provides a platform for developers to build and deploy configurable voice AI agents. Their core technology is a comprehensive API enabling advanced conversational AI functionality for phone-based applications. Vapi targets a broad market ranging from startups to large enterprises seeking to automate phone operations and create scalable voice AI products.
SoapBox Labs
Dublin, Ireland
SoapBox Labs develops voice AI specifically designed for children, enabling speech recognition in educational apps with child privacy protection.
Udio
New York, United States
Udio is a US-based generative AI company specializing in music creation. Their platform utilizes text-to-music AI technology, enabling users to generate complete songs from simple text prompts. Udio targets musicians, content creators, and hobbyists seeking rapid prototyping or royalty-free music generation capabilities.
Speechly
Helsinki, Finland
Speechly is a Finnish company specializing in real-time Automatic Speech Recognition (ASR) technology delivered via a streaming API. Their core product is a cloud-based ASR engine optimized for low-latency transcription and understanding, particularly in demanding applications like real-time communication and interactive voice response systems. Speechly targets developers building voice-enabled applications requiring high accuracy and speed, offering a developer-friendly alternative to traditional, batch-oriented speech-to-text solutions.
Murf AI
San Francisco, United States
Murf AI develops a text-to-speech (TTS) platform offering over 200 AI voices across 20+ languages, powering realistic voiceovers for video content, presentations, and marketing materials. Their core technology leverages advanced neural network architectures to generate highly natural-sounding speech, and they provide both a user-friendly AI Voice Generator and robust Text-to-Speech APIs & SDKs for developers. Murf AI serves a broad market including content creators, educators, and businesses seeking scalable voice solutions, and is recognized for its speed and efficiency in building voice agents.
Speechify
Los Angeles, United States
Speechify develops a text-to-speech (TTS) platform leveraging advanced AI voice synthesis to convert digital text – including documents, web pages, and ebooks – into natural-sounding audio. Their core product is a cross-platform application offering both audio reading and voice typing capabilities. With over 55 million users, Speechify primarily targets individuals seeking enhanced accessibility, learning support, and increased productivity through hands-free information consumption.
Recall.ai
San Francisco, United States
Recall.ai develops APIs and SDKs for extracting high-fidelity audio, transcripts, and metadata from video conferencing platforms. Their core technology focuses on isolating individual speaker audio streams within meetings to improve transcript accuracy and recording quality. The company targets developers building applications requiring detailed meeting intelligence, and differentiates itself through superior audio separation compared to standard screen recording solutions.
Amper Music
New York, United States
Amper Music, a Shutterstock company, provides an AI-powered music composition platform that generates original, royalty-free tracks. Utilizing generative algorithms and machine learning, Amper enables content creators – including video producers, advertisers, and game developers – to quickly and affordably produce customized music tailored to specific moods, styles, and lengths. This solution streamlines the music licensing process and offers a cost-effective alternative to traditional music sourcing.
Retell AI
San Francisco, United States
Retell AI provides a platform for businesses to build and deploy AI-powered voice agents for automating phone calls. Their technology leverages real-time knowledge base synchronization and natural language processing to handle customer interactions, including navigating IVR systems, scheduling appointments, and facilitating warm transfers to live agents. Retell AI targets companies seeking to improve call center efficiency and customer service through scalable, automated phone solutions, as demonstrated by deployments with companies like Everise.
Speechki
San Francisco, United States
Speechki is a text-to-speech platform offering 500+ AI voices in 77 languages. Backed by Greycroft and Alchemist, they enable content creators to convert text to natural-sounding audio at scale.
Lelapa AI
Johannesburg, South Africa
Lelapa AI develops Natural Language Processing (NLP) technology specifically for African languages, originating from the Masakhane research community. Their core product, the Vulavula API, provides resource-efficient speech-to-text and transcription services for real-time call processing and analysis. Lelapa AI targets businesses operating in African markets seeking to improve customer experience, ensure compliance, and gain actionable insights from multilingual customer interactions.
Rev.com
Austin, United States
Rev.com provides AI-powered transcription and captioning services, specializing in solutions for the legal industry. Their core offering is a 96%+ accurate AI transcription engine designed for high-volume processing of legal evidence like depositions, police reports, and bodycam footage, supplemented by a network of 14,000+ human transcriptionists for 99%+ accuracy when required. Rev targets law firms and legal professionals by offering tools for evidence review, timeline creation, and secure transcript management directly within their platform.
iFlytek
Hefei, China
iFlytek develops advanced AI-powered language solutions, including its core Jieli speech recognition platform and translation tools supporting over 60 languages. The company’s innovations center on deep learning models for accurate speech-to-text, text-to-speech, and machine translation, demonstrated in products like their real-time transcription services for meetings and content creation. As China’s leading provider in this space, iFlytek increasingly focuses on international expansion and serves sectors including education, digital marketing, and professional communication.
Nuance Communications
Burlington, United States
Nuance Communications, now a Microsoft company, develops AI-powered solutions for clinical and administrative healthcare documentation. Their core technology centers on speech recognition and natural language processing applied to create tools like Dragon Medical One, which automates clinical documentation and enhances radiology reporting. Nuance primarily serves healthcare providers and aims to improve clinician productivity, reduce administrative burden, and enhance patient care through AI-driven workflows.
Cosito
Boston, United States
Cosito is an MIT-founded startup building AI-powered microphones that let frontline teams log data by voice, with no physical forms and no typing.
Emotech
London, United Kingdom
Emotech develops multimodal AI solutions focused on enhancing customer and user interactions, with key products including a multilingual speech platform and customizable generative AI avatars. Their technology specializes in realistic AI-driven speech synthesis – notably offering Arabic chatbots with dialect support – and a unique AI-powered pronunciation assessment tool for language learning. Emotech targets businesses seeking to improve customer service, create immersive digital experiences, and innovate in areas like education and gaming, demonstrated by claims of a 30% boost in customer satisfaction for early adopters.
Endel
Berlin, Germany
Endel is a German technology company developing AI-powered generative audio environments designed to improve cognitive performance and wellbeing. Their core product utilizes a patented algorithm that creates personalized soundscapes adapting in real-time to user-specific data like time of day, weather, and biometrics. Endel targets individuals seeking to enhance focus, reduce stress, and improve sleep quality through scientifically-backed auditory experiences.
Sonantic
London, United Kingdom
Sonantic develops realistic, emotionally-expressive AI voices for digital media. Their core technology utilizes a proprietary neural network trained on human performance data to generate nuanced vocal performances from text. Acquired by Spotify, Sonantic primarily serves the gaming, animation, and audiobook industries, offering a solution for scalable and high-quality voice acting.
SoundHound
Santa Clara, United States
SoundHound AI develops and licenses voice AI technologies that enable conversational interfaces for a variety of industries, including automotive, retail, and finance. Their core offering is a fully independent voice AI platform capable of handling over 10 billion conversations annually, focusing on agentic AI solutions that automate complex tasks. SoundHound differentiates itself by offering a complete, customizable voice AI solution – rather than relying on cloud-based assistants – allowing businesses to own the entire interaction and maximize ROI through cost reduction and revenue generation.
Speak AI
Toronto, Canada
Speak AI is a Canadian company specializing in AI-powered transcription and analysis of audio and video data. Their core product utilizes Automatic Speech Recognition (ASR) and Natural Language Processing (NLP) to convert media into searchable, transcribed text and extract key insights. Speak AI primarily serves researchers and businesses needing to efficiently process and analyze qualitative data from interviews, meetings, and other spoken content.
Whisper (OpenAI)
San Francisco, United States
OpenAI’s Whisper is an open-source automatic speech recognition (ASR) system trained on a massive, diverse 680,000-hour dataset of multilingual speech. Utilizing a Transformer-based encoder-decoder architecture, Whisper excels in robustness to accents and background noise, offering both multilingual transcription and speech translation into English. This technology targets developers seeking to integrate highly accurate and versatile speech-to-text functionality into a wide range of applications, particularly where diverse audio conditions or multilingual support are critical.
iZotope
Cambridge, United States
iZotope develops advanced audio processing software leveraging machine learning for tasks like mixing, mastering, and dialogue editing. Their core technology centers on neural networks trained on vast datasets of professionally produced audio to deliver intelligent assistance and automated solutions for common audio challenges. Targeting audio engineers, musicians, and post-production professionals, iZotope provides tools that streamline workflows and enhance sonic quality with data-driven precision.
RingCentral
Belmont, United States
RingCentral provides a unified cloud communications platform integrating voice, video, messaging, and contact center solutions. Their core AI technology focuses on real-time conversation intelligence and automation within these communication channels, offering features like call transcription, sentiment analysis, and automated workflows. RingCentral targets businesses of all sizes seeking to improve agent productivity, enhance customer experiences, and gain actionable insights from their communications data.
Teachable Machine
Mountain View, United States
Teachable Machine is a web-based platform developed by Google that enables users to rapidly create machine learning models using a no-code interface. The platform focuses on image, audio, and pose-based recognition, allowing individuals to train custom models directly within their browser. Primarily targeting educators, artists, and hobbyists, Teachable Machine lowers the barrier to entry for machine learning by eliminating the need for programming expertise and facilitating quick prototyping for integration into web applications and creative projects.
Acoustic.ai
Copenhagen, Denmark
Acoustic.ai develops voice AI solutions for automotive and consumer electronics, focusing on noise cancellation and voice enhancement.
AIVA
Luxembourg City, Luxembourg
AIVA is a Luxembourg-based company specializing in AI-driven music composition. Their core technology is a generative AI model capable of autonomously composing original soundtracks across a variety of genres and styles. AIVA targets content creators in film, gaming, and advertising seeking royalty-free, customizable music solutions, offering an alternative to traditional music licensing and composition.
Fish Audio
Shanghai, China
Fish Audio offers studio-grade AI text-to-speech and instant voice cloning with 1,000+ voices in 70+ languages. Their open-source models have gained significant developer adoption.
Fliki
Bengaluru, India
Fliki combines text-to-speech with AI video creation. Their platform converts text and blog posts into videos with lifelike AI voices, serving content creators and marketing teams worldwide.
Fireflies.ai
San Francisco, United States
Fireflies.ai develops an AI-powered meeting assistant that automatically transcribes, summarizes, and analyzes conversational data across various video conferencing platforms. Their core technology centers on speech-to-text conversion and natural language processing to identify speakers and extract key insights from meetings. Fireflies.ai targets professional teams seeking to improve meeting productivity and knowledge management through searchable conversation archives.
Rev AI
San Francisco, United States
Rev AI provides a speech-to-text API for automated transcription and speech recognition. Their core technology centers on an AI model trained on a large, diverse dataset and designed for high accuracy across varied audio qualities and accents. They target developers and businesses requiring scalable, programmatic transcription solutions for applications like voice search, media monitoring, and accessibility services.
Boomy
Berkeley, United States
Boomy develops a platform that lets users create original music tracks with artificial intelligence. Their core technology uses generative AI models (specifically, a combination of diffusion and transformer models) to compose music across various genres based on user-defined parameters. Aimed at both amateur musicians and content creators, Boomy also lets users distribute AI-generated compositions commercially and potentially earn royalties from them.
Soundraw
Tokyo, Japan
Soundraw develops an AI-powered music generation platform that provides royalty-free music for content creators. Their core technology uses an AI model trained in-house to compose original instrumentals, with granular customization through a stem-based mixer. Soundraw addresses the need for legally safe, customizable background music, so users can monetize their content without copyright concerns.
Fathom
San Francisco, United States
Fathom develops AI-powered note-taking and meeting assistants designed for professional use. Their core product uses large language models to automatically generate summaries, action items, and searchable transcripts from virtual meetings across platforms like Zoom, Google Meet, and Microsoft Teams. Fathom has gained traction among knowledge workers and teams seeking to improve meeting productivity and information retention, and it integrates with tools like Slack and Notion.
LMNT
San Francisco, United States
LMNT develops real-time text-to-speech (TTS) technology focused on low-latency, high-fidelity voice generation. Their core product is a voice cloning and streaming API that lets developers create custom AI voices for applications requiring conversational interfaces. LMNT targets game developers, virtual assistant creators, and interactive application builders. The company was founded by a team with prior experience at Google and emphasizes scalability for production deployments.
Neets.ai
Copenhagen, Denmark
Neets.ai develops real-time, low-latency text-to-speech (TTS) APIs for developers. Their core technology focuses on highly customizable and expressive neural voice cloning and generation, enabling the creation of unique synthetic voices from limited audio data. Neets.ai targets game developers, metaverse platforms, and interactive voice response (IVR) systems, and launched a public beta program for their voice API following seed funding in late 2023.
Frequently Asked Questions
- What is speech AI used for?
- Speech AI powers voice assistants (Siri, Alexa), live meeting transcription, voice cloning for content creators, multilingual customer service, accessibility tools for deaf and hard-of-hearing users, and dubbing for video localisation.
- How realistic is AI-generated voice?
- Modern TTS systems achieve Mean Opinion Scores (MOS) comparable to those of human speech. Leading systems from ElevenLabs, Play.ht, and LMNT can clone a voice from seconds of audio.
- Who are the top voice AI companies?
- Top companies include Spotify AI, Uniphore, and Verbit, alongside ElevenLabs, Deepgram, AssemblyAI, and Whisper (OpenAI).
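
The Mean Opinion Score mentioned in the FAQ above is the arithmetic mean of listener ratings, conventionally given on a 1-to-5 scale. The Python sketch below shows how such a score might be computed and compared between a synthetic clip and a human recording. The function name, the sample ratings, and the normal-approximation confidence interval are illustrative assumptions, not the evaluation procedure of any company listed here.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mos, ci95_half_width) for listener ratings on a 1-5 scale.

    The interval uses a normal approximation (z = 1.96), which is an
    assumption of this sketch; small listening panels may call for a
    t-distribution instead.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")

    mos = statistics.fmean(ratings)
    if len(ratings) < 2:
        # A single rating gives no estimate of spread.
        return mos, float("nan")

    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings from eight listeners (not real evaluation data).
synthetic_ratings = [4, 5, 4, 4, 5, 3, 4, 5]
human_ratings = [5, 4, 5, 4, 5, 4, 4, 5]

synthetic_mos, synthetic_ci = mean_opinion_score(synthetic_ratings)
human_mos, human_ci = mean_opinion_score(human_ratings)

print(f"synthetic: {synthetic_mos:.2f} +/- {synthetic_ci:.2f}")
print(f"human:     {human_mos:.2f} +/- {human_ci:.2f}")
```

When the two confidence intervals overlap, listeners in the test did not reliably separate synthetic from human speech, which is roughly what "comparable to human speech" means in practice.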