PhasaTek Labs

Edge STT in 2025: Combining Proprietary Speech-Recognition with Open-Source Libraries

Speech-to-text (STT) technology today largely combines proprietary and open-source solutions to meet the demands of low-latency, offline, and real-time transcription. The current approach mixes well-optimized models designed for embedded systems with flexible libraries that let developers adapt the pipeline to their specific needs.

Google Cloud’s Speech-to-Text On Device is engineered for embedded platforms where network connectivity is limited or unavailable. The solution runs entirely on the device, achieving low latency through on-device processing. Models are kept under 1 GB to minimize resource consumption while maintaining high transcription accuracy. Additional features include built-in voice activity detection and the ability to adapt the model to specialized vocabularies. The model is proprietary and not fully publicly accessible; however, some Google Cloud customers can request access.

RealtimeSTT offers real-time transcription through an open-source library that processes audio directly from the microphone. It employs voice activity detection by combining WebRTCVAD for initial signal detection with SileroVAD for improved accuracy. The system supports GPU acceleration through backends such as Faster_Whisper, which significantly improves transcription speed. It also incorporates wake word detection using tools such as Porcupine or OpenWakeWord and exposes a Python API that handles callbacks and multiprocessing.
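As a rough illustration of that Python API, the sketch below transcribes microphone audio in a loop; the constructor parameters (model, language) are assumptions drawn from the library's documented options and may need adjusting per version and hardware.

# Minimal RealtimeSTT sketch: continuously transcribe microphone input.
# Assumes `pip install RealtimeSTT`; model and language are illustrative.
from RealtimeSTT import AudioToTextRecorder

def on_text(text):
    # Called with each finalized transcription segment.
    print("Transcribed:", text)

if __name__ == "__main__":
    # A small faster-whisper checkpoint keeps latency low on modest hardware.
    recorder = AudioToTextRecorder(model="tiny", language="en")
    print("Listening... speak into the microphone.")
    while True:
        # Blocks until voice activity ends, then hands the transcript to on_text.
        recorder.text(on_text)

The __main__ guard matters here because the library spawns worker processes; wake word engines such as Porcupine or OpenWakeWord are enabled through additional constructor arguments documented in the project README.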

For Linux users, nerd-dictation serves as a minimalist solution. Implemented as a single Python script with minimal dependencies, it uses the VOSK-API to perform speech recognition without requiring a continuously running background process. Dictation is started and stopped manually through command-line invocations, and the system supports customization through a user configuration file that allows text manipulation, such as converting spoken numbers to digits. Compatibility with various audio recording utilities makes it a flexible option for desktop environments.
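The text-manipulation hook is itself a small Python file; a minimal sketch of such a configuration is shown below, with the replacement rule being purely illustrative. Dictation is typically driven by running nerd-dictation begin (optionally pointing at a local VOSK model) and nerd-dictation end from a shell or a hotkey binding.

# ~/.config/nerd-dictation/nerd-dictation.py
# nerd-dictation calls this hook with the recognized text and types
# whatever string it returns. The substitution below is illustrative.
def nerd_dictation_process(text):
    # Turn the spoken word "comma" into punctuation.
    return text.replace(" comma", ",")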

Two major speech recognition engines form the backbone of many STT systems. VOSK is a popular offline toolkit that supports multiple languages and is designed for efficient operation on mobile and embedded platforms. It provides bindings for Python, Java, and other languages, making it accessible for a wide range of applications. In contrast, Whisper, developed by OpenAI, is a neural-network-based engine known for its robustness across diverse accents and languages. Whisper can run offline with GPU acceleration, offering high-quality transcription even in challenging audio conditions.
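For a sense of what integration looks like at this level, the sketch below transcribes a prerecorded WAV file offline with VOSK's Python bindings; the model path and the assumption of 16-bit mono PCM audio are illustrative.

# Offline transcription of a mono 16-bit PCM WAV file with VOSK.
# Assumes `pip install vosk` and a downloaded model unpacked into "model/".
import json
import wave

from vosk import Model, KaldiRecognizer

wf = wave.open("speech.wav", "rb")               # 16-bit mono PCM expected
model = Model("model")                           # path to the VOSK model directory
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if not data:
        break
    rec.AcceptWaveform(data)                     # feed audio incrementally

print(json.loads(rec.FinalResult())["text"])     # final transcript

An equivalent Whisper run is heavier but simpler to call: whisper.load_model("base").transcribe("speech.wav") returns a result dictionary whose "text" field holds the transcript.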

These recognition engines are often integrated into higher-level STT libraries. For instance, nerd-dictation relies on the VOSK-API for processing audio, while RealtimeSTT can utilize GPU-accelerated models such as those based on Faster_Whisper. This modular integration allows developers to mix and match components based on the requirements of their applications, enabling both real-time interaction and offline processing.
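As a sketch of the GPU-accelerated path mentioned above, faster-whisper can also be driven directly; the model size, device, and compute type below are assumptions to be tuned per deployment, and CPU-only machines would use device="cpu".

# Sketch: offline transcription with faster-whisper (CTranslate2 backend).
# Assumes `pip install faster-whisper` and an available CUDA GPU.
from faster_whisper import WhisperModel

# "small" is an illustrative trade-off between speed and accuracy.
model = WhisperModel("small", device="cuda", compute_type="float16")

# vad_filter trims silence before decoding, similar in spirit to the
# VAD front ends used by the real-time libraries above.
segments, info = model.transcribe("speech.wav", vad_filter=True)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")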

Despite significant progress, several technical challenges remain. Many systems are optimized for clear, single-speaker audio, and overlapping speech continues to pose problems because mixed audio signals complicate the separation of individual voices. Diarized transcription, which involves determining who spoke and when, is still difficult to implement accurately in real time due to variations in accents, voice modulation, and background noise. Additionally, automatic language detection struggles in environments where speakers switch languages rapidly or use strong regional accents.

Efforts are ongoing to address these challenges, with continued research into advanced neural architectures and integrated systems that combine speech separation, diarization, and language detection. The interplay between proprietary solutions and open-source projects is likely to drive further improvements in both the performance and flexibility of STT systems.

PhasaTek Labs is keenly interested in these models and will be conducting further research into the various speeds and performance metrics they can achieve in different settings. We plan to publish our findings soon, and we look forward to sharing more detailed insights with the community.

 

Google Cloud Speech-to-Text On Device Documentation

RealtimeSTT on GitHub

nerd-dictation on GitHub

Special thanks to the reddit.com/r/speechtech community.