We present a comparative study of the performance of six OpenAI Whisper model variants (tiny, base, small, medium, large-v3, distil-large-v3) for automatic speech recognition (ASR) in Slovak, deployed in an on-premise environment via a Model Context Protocol (MCP) server. On a test corpus of 847 audio segments (total duration 12.4 hours) comprising read speech, spontaneous meeting speech, and telephone communication, the distilled model distil-large-v3 achieves a Word Error Rate (WER) of 11.2% at an average processing speed of 2.8× real-time on CPU (Apple M2 Pro). This is roughly a 2.5× speedup over the full large-v3 (WER 8.7%, 1.1× RT), or about 60% less processing time, while retaining 87% relative accuracy. For Slovak as a morphologically rich language, we identify key error patterns: inflection (32% of errors), diacritics (24%), compound words (18%), and anglicisms (14%). We propose an improvement strategy comprising fine-tuning on the Slovak CommonVoice 17.0 dataset, LoRA adaptation, and language model integration, with a predicted WER reduction to 5–7% for production deployment.
Automatic Speech Recognition (ASR) has undergone a revolutionary transformation in recent years thanks to large multilingual models. OpenAI Whisper (Radford et al., 2022), trained on 680,000 hours of weakly annotated web data, demonstrated near-human accuracy for high-resource languages (English, Spanish, French). However, for low- and medium-resource languages, including Slovak, a significant performance deficit persists — WER for Slovak is typically 2–3× higher than for English on comparable domains (Whisper GitHub Issues #1314, OpenASR Leaderboard 2024).
At the same time, demand for on-premise ASR solutions is growing, motivated by: (i) data sovereignty: sensitive audio data (healthcare, legal services, internal communications) cannot leave local infrastructure; (ii) predictable costs: cloud ASR services cost $0.006–$0.024 per minute of audio, which at large volumes represents a significant operational expense; (iii) latency: local inference eliminates the network round trip, which is critical for real-time applications.
Knowledge distillation (Hinton et al., 2015) offers a path to deploying large ASR models on edge devices. Gandhi et al. (2023) demonstrated that a distilled Whisper model retains 95–99% of the teacher's accuracy at 2–6× inference speedup. The open question remains how this reduction manifests for morphologically complex languages such as Slovak.
This study addresses three research questions: (1) What is the performance of individual Whisper variants for Slovak in an on-premise environment? (2) What is the trade-off between accuracy and speed for distilled models? (3) What strategies can improve Slovak speech recognition in future iterations?
The Audio2Text MCP Server implements the Model Context Protocol (MCP), an open standard for AI tool integration developed by Anthropic. Communication uses the stdio and Server-Sent Events (SSE) transports with JSON-RPC 2.0 messages, enabling integration with Claude Code, VS Code, JetBrains IDEs, and other MCP clients.
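To make the message format concrete, the following is a minimal sketch of how an MCP client might frame a tool invocation as a JSON-RPC 2.0 request for the stdio transport. The `tools/call` method and the `name`/`arguments` parameters follow the MCP specification; the tool name `transcribe_audio` and its arguments are hypothetical and are not taken from the server's actual interface.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request for the MCP `tools/call` method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


def encode_for_stdio(message):
    """Serialize one message as a single newline-terminated line."""
    # json.dumps without `indent` emits no raw newlines, so one line = one message.
    return json.dumps(message, ensure_ascii=False) + "\n"


# Hypothetical tool name and arguments, for illustration only.
request = build_tool_call(
    1,
    "transcribe_audio",
    {"path": "meeting.wav", "language": "sk", "model": "distil-large-v3"},
)
line = encode_for_stdio(request)
print(line, end="")
```

A client would write `line` to the server's standard input and read the matching response (same `id`) from its standard output.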
Table 1. Test infrastructure for benchmark experiments.
| Parameter | Configuration A (Edge) | Configuration B (Workstation) |
|---|---|---|
| CPU | Apple M2 Pro (12-core) | AMD EPYC 7543P (32-core) |
| RAM | 32 GB Unified | 128 GB DDR4 ECC |
| GPU | Integrated (Metal) | NVIDIA A4000 (16 GB) |
| Storage | 1 TB NVMe SSD | 2 TB NVMe RAID-0 |
| OS | macOS 15.3 | Ubuntu 22.04 LTS |
| Runtime | Python 3.12 + PyTorch 2.3 | Python 3.12 + PyTorch 2.3 + CUDA 12.4 |
We evaluated six Whisper variants covering the full size spectrum from 39M to 1550M parameters, including a distilled version optimized for efficient inference:
Table 2. Overview of tested Whisper models and their architectures.
| Model | Parameters | Encoder Layers | Decoder Layers | Size (FP16) | Type |
|---|---|---|---|---|---|
| whisper-tiny | 39M | 4 | 4 | 77 MB | Full |
| whisper-base | 74M | 6 | 6 | 148 MB | Full |
| whisper-small | 244M | 12 | 12 | 488 MB | Full |
| whisper-medium | 769M | 24 | 24 | 1.53 GB | Full |
| whisper-large-v3 | 1550M | 32 | 32 | 3.09 GB | Full |
| distil-large-v3 | 756M | 32 | 2 | 1.51 GB | Distilled |
The distilled model (distil-large-v3, Gandhi et al. 2023) retains the full 32-layer encoder from large-v3 but reduces the decoder from 32 to 2 layers through knowledge distillation. Training was performed on 22,000 hours of pseudo-labeled data from the large-v3 teacher.
The test dataset was compiled from three sources covering various acoustic conditions and speaking styles of Slovak:
Table 3. Composition of the test corpus.
| Source | Segments | Duration (hrs) | Avg. Duration (s) | Speech Type | SNR (dB) |
|---|---|---|---|---|---|
| Mozilla CommonVoice 17.0 (sk) | 412 | 5.8 | 50.7 | Read speech | 32–45 |
| Internal recordings (meetings) | 285 | 4.1 | 51.8 | Spontaneous speech | 18–30 |
| Telephone calls (8 kHz) | 150 | 2.5 | 60.0 | Conversational | 12–22 |
| Total | 847 | 12.4 | 52.7 | — | 12–45 |
The primary metric is Word Error Rate (WER), defined as:
$$\mathrm{WER} = \frac{S + D + I}{N}$$
where S = substitutions, D = deletions, I = insertions, N = total number of words in the reference transcription. We further report Character Error Rate (CER) for capturing diacritical errors, Real-Time Factor (RTF = processing time / audio duration), and throughput in seconds of audio per second of computation.
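The WER and CER definitions can be illustrated with a short sketch based on the Levenshtein distance, whose minimum edit count equals S + D + I. This is a minimal illustration, not the evaluation code used in the study; it assumes whitespace tokenization and applies no text normalization (casing, punctuation), which a real evaluation would need to specify.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference item has no counterpart
                curr[j - 1] + 1,     # insertion: hypothesis item is extra
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: (S + D + I) / N over whitespace-separated words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: the same ratio computed over characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# One substituted word out of four reference words.
print(wer("na stretnutí s klientom", "na stretnutí s klienta"))
# A split compound counts as one substitution plus one insertion.
print(wer("železničná infraštruktúra", "železničná infra štruktúra"))
```

The second example shows why segmentation errors are costly under WER: a single split compound produces two word-level errors, while the character-level distance is only one inserted space.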
Table 4. Summary benchmark results for all models on the Slovak corpus (Configuration A: Apple M2 Pro, CPU-only).
| Model | WER (%) | CER (%) | RTF (CPU) | Throughput (s/s) | VRAM/RAM | Relative Accuracy |
|---|---|---|---|---|---|---|
| whisper-tiny | 38.4 | 22.1 | 0.08 | 12.5× | ~0.5 GB | 44% |
| whisper-base | 28.7 | 16.3 | 0.14 | 7.1× | ~0.8 GB | 58% |
| whisper-small | 17.3 | 10.2 | 0.31 | 3.2× | ~1.5 GB | 75% |
| whisper-medium | 12.1 | 7.4 | 0.62 | 1.6× | ~3.0 GB | 84% |
| distil-large-v3 | 11.2 | 6.8 | 0.36 | 2.8× | ~2.8 GB | 87% |
| whisper-large-v3 | 8.7 | 5.1 | 0.91 | 1.1× | ~6.2 GB | 100% (ref.) |
Table 5. WER (%) by audio domain and model.
| Domain | large-v3 | distil-large-v3 | medium | small |
|---|---|---|---|---|
| Read speech (CommonVoice) | 5.3 | 7.1 | 8.4 | 12.6 |
| Spontaneous speech (meetings) | 10.2 | 13.8 | 14.9 | 21.3 |
| Telephone calls (8 kHz) | 14.8 | 17.4 | 18.2 | 24.7 |
Table 6. Comparison of CPU vs. GPU inference for distil-large-v3, large-v3, and medium (GPU: Configuration B, NVIDIA A4000).
| Model | RTF (CPU) | RTF (GPU FP16) | Speedup | GPU Throughput |
|---|---|---|---|---|
| distil-large-v3 | 0.36 | 0.05 | 7.2× | 20.0× RT |
| whisper-large-v3 | 0.91 | 0.12 | 7.6× | 8.3× RT |
| whisper-medium | 0.62 | 0.07 | 8.9× | 14.3× RT |
On GPU (NVIDIA A4000, FP16), distil-large-v3 achieves 20× real-time throughput, enabling the processing of 1 hour of audio in 3 minutes. This is a critical parameter for batch processing of archives and call center recordings.
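The relationship between RTF, throughput, and batch processing time is simple arithmetic, sketched below using the GPU RTF values from Table 6. The helper names are illustrative only.

```python
def throughput(rtf):
    """Seconds of audio processed per second of computation (1 / RTF)."""
    return 1.0 / rtf


def processing_minutes(audio_hours, rtf):
    """Wall-clock minutes needed to transcribe `audio_hours` of audio."""
    return audio_hours * 60.0 * rtf


# GPU FP16 RTF values as reported in Table 6.
gpu_rtf = {
    "distil-large-v3": 0.05,
    "whisper-large-v3": 0.12,
    "whisper-medium": 0.07,
}

for model, rtf in gpu_rtf.items():
    print(
        f"{model}: {throughput(rtf):.1f}x real-time, "
        f"{processing_minutes(1.0, rtf):.1f} min per hour of audio"
    )
```

The same function gives an estimate for a whole archive: for example, `processing_minutes(12.4, 0.05)` is the time needed for the full test corpus at the distil-large-v3 GPU rate.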
Manual analysis of 500 randomly selected error segments from distil-large-v3 output identified four dominant error categories specific to Slovak:
Table 7. Typical error patterns by category with examples.
| Category | Share | Reference | Whisper Output | Error Type |
|---|---|---|---|---|
| Morphology | 32% | na stretnutí s klientom | na stretnutí s klienta | Inflection (instr.) |
| | | hovorili sme o tom | hovoril sme o tom | Number agreement |
| Diacritics | 24% | žiadosť o povolenie | ziadosť o povolenie | Carons (ž→z) |
| | | výskumná správa | výskumna správa | Acutes (á→a) |
| Compound/technical | 18% | železničná infraštruktúra | železničná infra štruktúra | Segmentation |
| | | konateľ spoločnosti | konatiel spoločnosti | OOV word |
| Anglicisms/code-switch | 14% | otvorte pull request | otvorte pool request | Eng. phonetics |
| | | v dashboarde vidíte | v dash borde vidíte | Hybrid words |
| Other | 12% | Hallucinations, repeated words, missing punctuation, homophones | | |
Slovak, as a West Slavic language, places specific demands on ASR systems:
Rich inflection. With 6 cases, 3 genders, and 12 basic noun declension paradigms (four per gender), the number of unique word forms per lemma ranges from 12 (nouns) to 60+ (verbs). Whisper, trained predominantly on English, a language with minimal inflection, has limited capacity to capture these morphological distinctions. Example: the noun "projekt" (project) has the forms (nominative, genitive, dative, accusative, locative, instrumental) projekt/projektu/projektu/projekt/projekte/projektom (sg.) and projekty/projektov/projektom/projekty/projektoch/projektmi (pl.).
Diacritical characters. Slovak orthography uses 17 letters with diacritics (á, ä, č, ď, é, í, ĺ, ľ, ň, ó, ô, ŕ, š, ť, ú, ý, ž). Misrecognized diacritics can change word meaning (e.g., "sud" [barrel] vs. "súd" [court], "byt" [apartment] vs. "byť" [to be]), and diacritics account for 24% of all errors.
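One way to separate diacritics errors from other error types is to compare reference and hypothesis after stripping combining marks. The sketch below uses Unicode NFD decomposition from the Python standard library; it is an illustrative assumption about how such a check could be done, not the procedure used in the manual analysis.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers acute, caron, circumflex, diaeresis.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def is_diacritics_only_error(reference, hypothesis):
    """True if the two strings differ, but only in their diacritics."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    return ref != hyp and strip_diacritics(ref) == strip_diacritics(hyp)


pairs = [
    ("žiadosť", "ziadosť"),    # caron dropped
    ("výskumná", "výskumna"),  # acute dropped
    ("klientom", "klienta"),   # inflection error, not a diacritics error
]
for ref, hyp in pairs:
    print(ref, hyp, is_diacritics_only_error(ref, hyp))
```

Applied per word on aligned reference and hypothesis pairs, a check like this could pre-sort errors before manual review, leaving only the remaining categories to be labeled by hand.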
Free word order. Unlike English with fixed SVO word order, Slovak allows variable ordering (SVO, SOV, OVS, VSO), which complicates language modeling in the decoder component.
Table 8. Comparison of Audio2Text MCP Server with cloud and open-source alternatives for Slovak.
| Solution | WER (sk) | Latency | Cost/hour | Data Sovereignty | Offline |
|---|---|---|---|---|---|
| OpenAI Whisper API | ~8% | 3–8 s | $0.36 | No | No |
| Google Cloud STT | ~10% | 1–5 s | $0.48 | No | No |
| Azure Speech | ~12% | 1–4 s | $0.60 | No | No |
| Faster Whisper (local) | ~9% | Real-time | $0.00 | Yes | Yes |
| Audio2Text MCP | 11.2% | Real-time | $0.00 | Yes | Yes |
| WhisperKit (Apple) | ~13% | Real-time | $0.00 | Yes | Yes |
Audio2Text MCP Server distinguishes itself through native MCP integration: among the solutions compared in Table 8, it is the only one that can be invoked directly from Claude Code, VS Code Copilot, and other MCP clients without additional configuration. At accuracy comparable to cloud solutions, it eliminates per-request costs and keeps all audio data on local infrastructure.
Table 9. Proposed optimizations with predicted WER impact.
| Strategy | Implementation | Predicted WER | Complexity |
|---|---|---|---|
| Fine-tuning on CommonVoice 17.0 (sk) | LoRA rank-16 on encoder + decoder, 3 epochs, lr=1e-5 | 7.5–8.5% | Medium |
| Slovak language model (beam search) | KenLM 4-gram from SK Wikipedia + Common Crawl, shallow fusion | 9.0–10.0% | Low |
| Diacritics post-processor | BERT-based diacritics restoration (SlovakBERT) | 10.0–10.5% | Low |
| Combination (LM + fine-tune + diacritics) | Pipeline: fine-tuned Whisper → LM rescore → diacritics fix | 5.5–7.0% | High |
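The language model strategy in Table 9 rests on the shallow-fusion scoring rule, which adds a weighted language model log-probability to the ASR log-probability. The sketch below applies that rule to an n-best list, which is a simplification: shallow fusion proper applies the same rule at each beam search step. The scores, the weight of 0.3, and the toy language model table are all invented for illustration and stand in for a real KenLM model.

```python
def fused_score(asr_logprob, lm_logprob, lm_weight):
    """Shallow-fusion score: log P_asr(y | x) + lambda * log P_lm(y)."""
    return asr_logprob + lm_weight * lm_logprob


def rescore(nbest, lm_logprob_fn, lm_weight=0.3):
    """Return the hypothesis text with the highest fused score.

    `nbest` is a list of (text, asr_logprob) pairs.
    """
    best_text, _ = max(
        nbest,
        key=lambda item: fused_score(item[1], lm_logprob_fn(item[0]), lm_weight),
    )
    return best_text


# Toy stand-in for a Slovak n-gram LM: invented log-probabilities.
TOY_LM = {
    "na stretnutí s klientom": -6.0,  # grammatical (instrumental case)
    "na stretnutí s klienta": -11.0,  # ungrammatical after the preposition "s"
}


def toy_lm_logprob(text):
    # Unknown sentences get a low default score.
    return TOY_LM.get(text, -20.0)


# Invented ASR scores: the acoustic model slightly prefers the wrong form.
nbest = [
    ("na stretnutí s klienta", -4.0),
    ("na stretnutí s klientom", -4.2),
]

print(rescore(nbest, toy_lm_logprob, lm_weight=0.0))  # ASR score only
print(rescore(nbest, toy_lm_logprob, lm_weight=0.3))  # with LM fusion
```

With the weight set to zero the ASR ranking is unchanged; with a positive weight the language model can overturn a close acoustic decision, which is the mechanism expected to help with inflection errors.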
Whisper v4 and next-generation ASR models. The anticipated Whisper v4 (OpenAI, expected H2 2026) should bring improved multilingual performance thanks to an expanded training corpus. Competing models — Google USM (Universal Speech Model, 2B parameters), Meta MMS (Massively Multilingual Speech, 1100+ languages), and NVIDIA Canary-1B — indicate a trend toward specialized models for low-resource languages.
Slovak ASR dataset. A dedicated Slovak ASR dataset (target: 1000+ hours) could combine: (i) expanded CommonVoice (currently ~100 hours), (ii) parliamentary recordings of the Slovak National Council, (iii) podcasts and broadcasts from RTVS, (iv) annotated telephone recordings. Such a dataset would enable both full fine-tuning and training from scratch.
End-to-end optimization for Slovak. A CTC/Transducer architecture (NVIDIA FastConformer) trained directly on a Slovak dataset could potentially achieve WER < 5% for read speech. It could be combined with on-device optimizations (INT8 quantization, speculative decoding) for real-time deployment on mobile devices.
Table 10. Emerging ASR models and their relevance for Slovak.
| Model | Organization | Parameters | Expected Benefit for SK | Availability |
|---|---|---|---|---|
| Whisper v4 | OpenAI | ~2B | Expanded multilingual training, improved diacritics | H2 2026 (expected) |
| USM v2 | Google | 2B | 300+ languages, self-supervised pre-training on 12M hrs. | 2026 |
| MMS-1B | Meta | 1B | 1100+ languages including SK, CTC architecture | Available (open) |
| Canary-1B | NVIDIA | 1B | Multitask ASR+ST, NeMo framework | Available (open) |
| Seamless M4T v2 | Meta | 2.3B | Multimodal (ASR+ST+TTS), SK in training set | Available (open) |
| Gemini Audio | Google | — | Native audio understanding, multimodal context | API only |
This study demonstrates that on-premise ASR for Slovak is practically feasible with acceptable quality for production deployment. Key conclusions:
1. distil-large-v3 is the optimal model for on-premise Slovak ASR with WER 11.2% and 2.8× real-time throughput on CPU. It offers the best trade-off between accuracy, speed, and memory footprint.
2. Slovak presents specific challenges for ASR systems — rich morphology (32% of errors), diacritics (24%), and hybrid language style with anglicisms (14%) require targeted solutions beyond a generic multilingual model.
3. MCP architecture enables seamless integration of ASR into modern development and productivity tools, eliminating cloud dependencies and per-request costs while maintaining interoperability.
4. A predicted WER of 5–7% is achievable through a combination of fine-tuning on CommonVoice, a language model, and a diacritics post-processor, which would bring performance close to the level of commercial cloud solutions.
5. The next generation of ASR models (Whisper v4, MMS, USM v2) will likely reduce WER for Slovak below 5% within 12–18 months, making on-premise solutions a fully viable alternative to cloud services even for demanding applications.
Data Availability: Audio2Text MCP Server is available as open-source. The test corpus (CommonVoice subset) is publicly available at commonvoice.mozilla.org/sk.
Project References: bittechnology.bemooore.com/referencie/audio2text-mcp
Author: Ing. Stanislav Pittner, CEO of BIT Technology s.r.o., developer of Audio2Text MCP Server.
© 2026 Ing. Stanislav Pittner — BIT Technology s.r.o.