Audio2Text MCP Server — Technical Report TR-2026-002
BIT Technology Research Series • AI Infrastructure

On-Premise Speech-to-Text for Slovak: A Comparative Study of Distilled and Full Whisper Models in MCP Server Architecture

Ing. Stanislav Pittner
BIT Technology s.r.o., Trstínska cesta 9, 917 01 Trnava, Slovakia
Published: April 12, 2026 • Test Period: January – March 2026 • Version 1.0
DOI: 10.5281/bittechnology.2026.tr002 (preprint)

Abstract

We present a comparative study of six OpenAI Whisper model variants (tiny, base, small, medium, large-v3, distil-large-v3) for automatic speech recognition (ASR) in Slovak, deployed on-premise via a Model Context Protocol (MCP) server. On a test corpus of 847 audio segments (total duration 12.4 hours) comprising spontaneous speech, professional recordings, and telephone communication, the distilled model distil-large-v3 achieves a Word Error Rate (WER) of 11.2% at an average processing speed of 2.8× real-time on CPU (Apple M2 Pro). This is about 2.5× faster than the full large-v3 (WER 8.7%, 1.1× RT), a roughly 60% reduction in processing time, while retaining 87% relative accuracy. For Slovak as a morphologically rich language, we identify key error patterns: inflection (32% of errors), diacritics (24%), compound words (18%), and anglicisms (14%). We propose an improvement strategy comprising fine-tuning on the Slovak CommonVoice 17.0 dataset, LoRA adaptation, and language model integration, with a predicted WER reduction to 5–7% for production deployment.

Keywords: ASR, Whisper, knowledge distillation, Slovak, MCP, on-premise, Word Error Rate, speech-to-text, edge AI

1. Introduction

Automatic Speech Recognition (ASR) has improved markedly in recent years thanks to large multilingual models. OpenAI Whisper (Radford et al., 2022), trained on 680,000 hours of weakly annotated web data, demonstrated near-human accuracy for high-resource languages (English, Spanish, French). For low- and medium-resource languages, including Slovak, a significant performance deficit persists: WER for Slovak is typically 2–3× higher than for English on comparable domains (Whisper GitHub Issues #1314, OpenASR Leaderboard 2024).

At the same time, demand for on-premise ASR solutions is growing, motivated by: (i) data sovereignty, since sensitive audio data (healthcare, legal services, internal communications) cannot leave local infrastructure; (ii) predictable costs, since cloud ASR services charge $0.006–$0.024 per minute of audio ($0.36–$1.44 per hour), a significant operational expense at large volumes; (iii) latency, since local inference eliminates the network round trip, which is critical for real-time applications.

Knowledge distillation (Hinton et al., 2015) offers a path to deploying large ASR models on edge devices. Gandhi et al. (2023) demonstrated that a distilled Whisper model retains 95–99% of the teacher's accuracy at 2–6× inference speedup. The open question remains how this reduction manifests for morphologically complex languages such as Slovak.

This study addresses three research questions: (1) What is the performance of individual Whisper variants for Slovak in an on-premise environment? (2) What is the trade-off between accuracy and speed for distilled models? (3) What strategies can improve Slovak speech recognition in future iterations?

2. System Architecture

2.1 Audio2Text MCP Server

The server implements the Model Context Protocol (MCP), an open standard for AI tool integration developed by Anthropic. Communication uses JSON-RPC 2.0 messages over the stdio and Server-Sent Events (SSE) transports, enabling integration with Claude Code, VS Code, JetBrains IDEs, and other MCP clients.
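To make the transport concrete, the following minimal sketch builds the kind of JSON-RPC 2.0 message an MCP client might send to invoke a transcription tool. The `tools/call` method and its `name`/`arguments` parameters follow the MCP specification; the tool name `transcribe` and its arguments (`path`, `language`, `format`) are hypothetical, since the report does not list the server's tool schema.

```python
import json

def build_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request for MCP's tools/call method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

# Hypothetical tool name and arguments, for illustration only.
request = build_tool_call(
    1, "transcribe",
    {"path": "meeting.wav", "language": "sk", "format": "srt"},
)

# Over the stdio transport, each message is written as one line of JSON.
line = json.dumps(request, ensure_ascii=False)
print(line)
```

The request `id` lets the client match the server's response to this call, which matters once several transcriptions are in flight.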

Audio2Text MCP Server: Processing Pipeline
Audio Input (MP3/WAV/M4A/FLAC) → FFmpeg (resampling to 16 kHz) → VAD Filter (Silero VAD) → Whisper (inference engine) → Post-processing (punctuation + formatting) → MCP Output (JSON/SRT/VTT)
stdio / SSE transport • JSON-RPC 2.0 • Content-addressable cache
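Two of the pipeline stages, resampling and the content-addressable cache, can be sketched as follows. This is an illustration under stated assumptions, not the server's actual code: the function names, the choice of mono 16-bit PCM output, and the inclusion of model and language in the cache key are all assumptions. Only the 16 kHz target rate and the existence of a content-addressable cache come from the report.

```python
import hashlib

def ffmpeg_resample_args(src, dst, sample_rate=16000):
    """Argument list for converting an input file to 16 kHz mono PCM WAV."""
    return [
        "ffmpeg", "-y",            # overwrite the output without prompting
        "-i", src,                 # input file (MP3/WAV/M4A/FLAC)
        "-ar", str(sample_rate),   # resample to the target rate
        "-ac", "1",                # downmix to mono (assumed)
        "-c:a", "pcm_s16le",       # 16-bit little-endian PCM (assumed)
        dst,
    ]

def cache_key(audio_bytes, model, language):
    """Content-addressable key: same audio and settings give the same key."""
    h = hashlib.sha256()
    h.update(audio_bytes)
    h.update(f"|{model}|{language}".encode("utf-8"))
    return h.hexdigest()

args = ffmpeg_resample_args("input.mp3", "input_16k.wav")
key = cache_key(b"fake audio bytes", "distil-large-v3", "sk")
print(" ".join(args))
print(key[:16])
```

The argument list can be passed to `subprocess.run(args, check=True)` on a machine with FFmpeg installed. Hashing the audio content rather than the file name means a renamed or re-uploaded file still hits the cache.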

2.2 Hardware Environment

Table 1. Test infrastructure for benchmark experiments.

| Parameter | Configuration A (Edge) | Configuration B (Workstation) |
| CPU | Apple M2 Pro (12-core) | AMD EPYC 7543P (32-core) |
| RAM | 32 GB Unified | 128 GB DDR4 ECC |
| GPU | Integrated (Metal) | NVIDIA A4000 (16 GB) |
| Storage | 1 TB NVMe SSD | 2 TB NVMe RAID-0 |
| OS | macOS 15.3 | Ubuntu 22.04 LTS |
| Runtime | Python 3.12 + PyTorch 2.3 | Python 3.12 + PyTorch 2.3 + CUDA 12.4 |

3. Models and Methods

3.1 Tested Models

We evaluated six Whisper variants covering the full size spectrum from 39M to 1550M parameters, including a distilled version optimized for efficient inference:

Table 2. Overview of tested Whisper models and their architectures.

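| Model | Parameters | Encoder Layers | Decoder Layers | Size (FP16) | Type |
| whisper-tiny | 39M | 4 | 4 | 77 MB | Full |
| whisper-base | 74M | 6 | 6 | 148 MB | Full |
| whisper-small | 244M | 12 | 12 | 488 MB | Full |
| whisper-medium | 769M | 24 | 24 | 1.53 GB | Full |
| whisper-large-v3 | 1550M | 32 | 32 | 3.09 GB | Full |
| distil-large-v3 | 756M | 32 | 2 | 1.51 GB | Distilled |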

The distilled model (distil-large-v3, Gandhi et al. 2023) retains the full 32-layer encoder from large-v3 but reduces the decoder from 32 to 2 layers through knowledge distillation. Training was performed on 22,000 hours of pseudo-labeled data from the large-v3 teacher.
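The FP16 sizes in Table 2 follow closely from the parameter counts at two bytes per parameter. The short check below illustrates this; it assumes decimal units (1 MB = 10^6 bytes, so 1.53 GB is entered as 1530 MB) and ignores any file-format overhead.

```python
# Parameter counts (millions) and reported FP16 sizes (MB) from Table 2.
MODELS = {
    "whisper-tiny": (39, 77),
    "whisper-base": (74, 148),
    "whisper-small": (244, 488),
    "whisper-medium": (769, 1530),
    "whisper-large-v3": (1550, 3090),
    "distil-large-v3": (756, 1510),
}

BYTES_PER_FP16 = 2

def fp16_size_mb(params_millions):
    """Estimated weight size in MB (10^6 bytes), ignoring file overhead."""
    return params_millions * BYTES_PER_FP16

for name, (params_m, reported_mb) in MODELS.items():
    est = fp16_size_mb(params_m)
    print(f"{name:18s} estimated {est:5d} MB, reported {reported_mb:5d} MB")
```

The estimate also shows why the distilled model is only about half the size of large-v3: removing 30 of the 32 decoder layers roughly halves the parameter count while the encoder stays intact.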

3.2 Test Corpus

The test dataset was compiled from three sources covering various acoustic conditions and speaking styles of Slovak:

Table 3. Composition of the test corpus.

| Source | Segments | Duration (hrs) | Avg. Duration (s) | Speech Type | SNR (dB) |
| Mozilla CommonVoice 17.0 (sk) | 412 | 5.8 | 50.7 | Read speech | 32–45 |
| Internal recordings (meetings) | 285 | 4.1 | 51.8 | Spontaneous speech | 18–30 |
| Telephone calls (8 kHz) | 150 | 2.5 | 60.0 | Conversational | 12–22 |
| Total | 847 | 12.4 | 52.7 | - | 12–45 |
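The average durations in Table 3 can be reproduced from the segment counts and total hours. The sketch below is only a consistency check on the published figures; the dictionary keys are shortened labels chosen here.

```python
# (segments, duration in hours) per source, from Table 3.
CORPUS = {
    "CommonVoice 17.0 (sk)": (412, 5.8),
    "Internal recordings": (285, 4.1),
    "Telephone calls": (150, 2.5),
}

def avg_duration_s(segments, hours):
    """Average segment length in seconds."""
    return hours * 3600 / segments

total_segments = sum(s for s, _ in CORPUS.values())
total_hours = sum(h for _, h in CORPUS.values())

for name, (segments, hours) in CORPUS.items():
    print(f"{name:24s} {avg_duration_s(segments, hours):5.1f} s")
print(f"{'Total':24s} {avg_duration_s(total_segments, total_hours):5.1f} s "
      f"({total_segments} segments, {total_hours:.1f} h)")
```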

3.3 Evaluation Metrics

The primary metric is Word Error Rate (WER), defined as:

WER = (S + D + I) / N × 100 % (1)

where S = substitutions, D = deletions, I = insertions, N = total number of words in the reference transcription. We further report Character Error Rate (CER) for capturing diacritical errors, Real-Time Factor (RTF = processing time / audio duration), and throughput in seconds of audio per second of computation.
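The metrics can be sketched in a few lines of Python. This is a minimal illustration of Equation (1) using word-level Levenshtein distance, not the evaluation code used in the study; it applies no text normalization (casing, punctuation), which a real evaluation would need to define. The example pair is taken from Table 7.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions,
    deletions, and insertions each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word Error Rate in percent: (S + D + I) / N * 100."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character Error Rate in percent, computed over characters."""
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)

def rtf(processing_seconds, audio_seconds):
    """Real-Time Factor: values below 1 mean faster than real time."""
    return processing_seconds / audio_seconds

ref = "žiadosť o povolenie"
hyp = "ziadosť o povolenie"
print(f"WER = {wer(ref, hyp):.1f} %")
print(f"CER = {cer(ref, hyp):.1f} %")
print(f"RTF = {rtf(36.0, 100.0):.2f}")
```

A single missing caron makes one word in three wrong, so WER is far higher than CER for this pair. This is why CER is reported alongside WER for diacritical errors.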

4. Results

4.1 Overall Model Performance

Figure 1. Word Error Rate (WER, %) for individual Whisper models on the Slovak test corpus (n = 847 segments). Lower value = better accuracy. distil-large-v3 achieves a slightly lower WER than the medium model (11.2% vs. 12.1%) at about 1.7× lower inference time (RTF 0.36 vs. 0.62, Table 4).

Table 4. Summary benchmark results for all models on the Slovak corpus (Configuration A: Apple M2 Pro, CPU-only).

| Model | WER (%) | CER (%) | RTF (CPU) | Throughput (s/s) | VRAM/RAM | Relative Accuracy |
| whisper-tiny | 38.4 | 22.1 | 0.08 | 12.5× | ~0.5 GB | 44% |
| whisper-base | 28.7 | 16.3 | 0.14 | 7.1× | ~0.8 GB | 58% |
| whisper-small | 17.3 | 10.2 | 0.31 | 3.2× | ~1.5 GB | 75% |
| whisper-medium | 12.1 | 7.4 | 0.62 | 1.6× | ~3.0 GB | 84% |
| distil-large-v3 | 11.2 | 6.8 | 0.36 | 2.8× | ~2.8 GB | 87% |
| whisper-large-v3 | 8.7 | 5.1 | 0.91 | 1.1× | ~6.2 GB | 100% (ref.) |
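One way to read the accuracy/speed trade-off in Table 4 is to ask which models are not beaten on both WER and RTF by any other model. The sketch below computes this Pareto front. The Pareto framing is an illustration added here, not an analysis from the study.

```python
# (WER %, RTF on CPU) from Table 4; lower is better on both axes.
RESULTS = {
    "whisper-tiny": (38.4, 0.08),
    "whisper-base": (28.7, 0.14),
    "whisper-small": (17.3, 0.31),
    "whisper-medium": (12.1, 0.62),
    "distil-large-v3": (11.2, 0.36),
    "whisper-large-v3": (8.7, 0.91),
}

def dominates(a, b):
    """True if a is at least as good as b on both axes and not identical."""
    return a[0] <= b[0] and a[1] <= b[1] and a != b

def pareto_front(results):
    """Models that no other model beats on both WER and RTF."""
    return [name for name, point in results.items()
            if not any(dominates(other, point) for other in results.values())]

front = pareto_front(RESULTS)
print(front)
for name, (_, r) in RESULTS.items():
    print(f"{name:18s} throughput = {1 / r:4.1f}x real time")
```

On these numbers only whisper-medium drops out: distil-large-v3 has both a lower WER and a lower RTF. The throughput column of Table 4 is simply the reciprocal of RTF.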

Key Finding: distil-large-v3 as the Optimal Model

4.2 Performance by Domain

Figure 2. WER (%) by audio domain for the three best-performing models. Telephone calls (8 kHz) consistently exhibit the highest WER across models, which correlates with lower SNR and limited frequency bandwidth.

Table 5. WER (%) by audio domain and model.

| Domain | large-v3 | distil-large-v3 | medium | small |
| Read speech (CommonVoice) | 5.3 | 7.1 | 8.4 | 12.6 |
| Spontaneous speech (meetings) | 10.2 | 13.8 | 14.9 | 21.3 |
| Telephone calls (8 kHz) | 14.8 | 17.4 | 18.2 | 24.7 |
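As a rough cross-check, the per-domain figures can be combined into an overall estimate. An exact aggregate would weight each domain by its reference word count, which the report does not give, so the sketch below uses audio duration from Table 3 as a proxy. That proxy is an assumption, and the results land slightly above the overall WER values in Table 4.

```python
# Per-domain WER (%) from Table 5 and durations (hours) from Table 3.
DURATION_H = {"read": 5.8, "spontaneous": 4.1, "telephone": 2.5}
DOMAIN_WER = {
    "large-v3": {"read": 5.3, "spontaneous": 10.2, "telephone": 14.8},
    "distil-large-v3": {"read": 7.1, "spontaneous": 13.8, "telephone": 17.4},
    "medium": {"read": 8.4, "spontaneous": 14.9, "telephone": 18.2},
    "small": {"read": 12.6, "spontaneous": 21.3, "telephone": 24.7},
}

def weighted_wer(per_domain, weights):
    """Weighted mean of per-domain WER; weights need not sum to 1."""
    total = sum(weights.values())
    return sum(per_domain[d] * w for d, w in weights.items()) / total

for model, per_domain in DOMAIN_WER.items():
    print(f"{model:16s} {weighted_wer(per_domain, DURATION_H):5.2f} %")
```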

4.3 GPU Acceleration

Table 6. Comparison of CPU vs. GPU inference for distil-large-v3, large-v3, and medium (Configuration B: NVIDIA A4000).

| Model | RTF (CPU) | RTF (GPU FP16) | Speedup | GPU Throughput |
| distil-large-v3 | 0.36 | 0.05 | 7.2× | 20.0× RT |
| whisper-large-v3 | 0.91 | 0.12 | 7.6× | 8.3× RT |
| whisper-medium | 0.62 | 0.07 | 8.9× | 14.3× RT |

On GPU (NVIDIA A4000, FP16), distil-large-v3 achieves 20× real-time throughput, enabling the processing of 1 hour of audio in 3 minutes. This is a critical parameter for batch processing of archives and call center recordings.
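The derived columns of Table 6 and the "1 hour in 3 minutes" figure follow directly from the RTF values, as this short sketch shows. The helper name `processing_minutes` is chosen here for illustration.

```python
# (RTF on CPU, RTF on GPU FP16) from Table 6.
RTF = {
    "distil-large-v3": (0.36, 0.05),
    "whisper-large-v3": (0.91, 0.12),
    "whisper-medium": (0.62, 0.07),
}

def processing_minutes(audio_hours, rtf):
    """Wall-clock minutes needed to transcribe the given amount of audio."""
    return audio_hours * 60 * rtf

for model, (cpu, gpu) in RTF.items():
    print(f"{model:18s} speedup {cpu / gpu:4.1f}x, "
          f"throughput {1 / gpu:4.1f}x RT, "
          f"1 h of audio in {processing_minutes(1, gpu):4.1f} min")
```

The same helper can size a batch job. For example, `processing_minutes(500, 0.05)` estimates the time for a 500-hour archive on this GPU, assuming the measured RTF holds across the whole archive.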

5. Error Analysis for Slovak

5.1 Categorization of Error Patterns

Manual analysis of 500 randomly selected error segments from distil-large-v3 output identified four dominant error categories specific to Slovak:

Figure 3. Distribution of error categories for Slovak (distil-large-v3, n = 500 analyzed errors). Morphological errors (inflection, conjugation) represent the largest category, followed by diacritical errors.

Table 7. Typical error patterns by category with examples.

| Category | Share | Reference | Whisper Output | Error Type |
| Morphology | 32% | na stretnutí s klientom | na stretnutí s klienta | Inflection (instr.) |
|  |  | hovorili sme o tom | hovoril sme o tom | Number agreement |
| Diacritics | 24% | žiadosť o povolenie | ziadosť o povolenie | Carons (ž→z) |
|  |  | výskumná správa | výskumna správa | Acutes (á→a) |
| Compound/technical | 18% | železničná infraštruktúra | železničná infra štruktúra | Segmentation |
|  |  | konateľ spoločnosti | konatiel spoločnosti | OOV word |
| Anglicisms/code-switch | 14% | otvorte pull request | otvorte pool request | Eng. phonetics |
|  |  | v dashboarde vidíte | v dash borde vidíte | Hybrid words |
| Other | 12% | Hallucinations, repeated words, missing punctuation, homophones |
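The categorization in this study was done manually. Two of the categories can nevertheless be detected mechanically, which may help pre-sort errors before manual review. The heuristic below is a sketch under that assumption: it labels a pair as a diacritics error if the texts match once diacritics are stripped, and as a segmentation error if they match once spaces are removed. Morphological errors and anglicisms need linguistic resources and fall into "other" here.

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks, e.g. 'ž' -> 'z', 'á' -> 'a'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def classify_error(reference, hypothesis):
    """Rough heuristic label for a (reference, hypothesis) phrase pair."""
    if reference == hypothesis:
        return "correct"
    if strip_diacritics(reference) == strip_diacritics(hypothesis):
        return "diacritics"
    if reference.replace(" ", "") == hypothesis.replace(" ", ""):
        return "segmentation"
    return "other"

# Example pairs taken from Table 7.
pairs = [
    ("žiadosť o povolenie", "ziadosť o povolenie"),
    ("výskumná správa", "výskumna správa"),
    ("železničná infraštruktúra", "železničná infra štruktúra"),
    ("na stretnutí s klientom", "na stretnutí s klienta"),
]
for ref, hyp in pairs:
    print(f"{classify_error(ref, hyp):12s} {ref!r} -> {hyp!r}")
```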

5.2 Language-Specific Factors

Slovak, as a West Slavic language, places specific demands on ASR systems:

Rich inflection. With 6 cases, 3 genders, and 12 basic noun declension paradigms, the number of unique word forms per lemma ranges from 12 (nouns) to 60+ (verbs). Whisper, trained predominantly on English, a language with minimal inflection, has limited capacity to capture these morphological nuances. Example: the word "projekt" (project) has forms projekt/projektu/projektu/projekt/projekte/projektom (sg.) and projekty/projektov/projektom/projekty/projektoch/projektmi (pl.).
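The "projekt" paradigm also shows why inflection is hard to get right from audio alone: several case slots share one surface form, and the differences between forms are confined to short endings. The sketch below tabulates this, assuming the forms are listed in the order nominative, genitive, dative, accusative, locative, instrumental.

```python
CASES = ["nom", "gen", "dat", "acc", "loc", "ins"]

# Paradigm of "projekt" as listed in the paragraph above.
SINGULAR = ["projekt", "projektu", "projektu", "projekt", "projekte", "projektom"]
PLURAL = ["projekty", "projektov", "projektom", "projekty", "projektoch", "projektmi"]

def slots_by_form(singular, plural):
    """Map each surface form to the (number, case) slots it fills."""
    slots = {}
    for number, forms in (("sg", singular), ("pl", plural)):
        for case, form in zip(CASES, forms):
            slots.setdefault(form, []).append(f"{number}.{case}")
    return slots

forms = slots_by_form(SINGULAR, PLURAL)
print(f"{len(SINGULAR) + len(PLURAL)} slots, {len(forms)} distinct forms")
for form, slots in forms.items():
    print(f"{form:12s} {', '.join(slots)}")
```

For this noun the 12 slots collapse to 8 distinct forms, and "projektom" serves as both instrumental singular and dative plural. Choosing the right ending therefore depends on sentence context, not just acoustics.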

Diacritical characters. Slovak orthography uses 17 letters with diacritics (á, ä, č, ď, é, í, ĺ, ľ, ň, ó, ô, ŕ, š, ť, ú, ý, ž). A misrecognized diacritic can change word meaning (e.g., "sud" [barrel] vs. "súd" [court], "byt" [apartment] vs. "byť" [to be]); diacritics account for 24% of all errors.

Free word order. Unlike English with fixed SVO word order, Slovak allows variable ordering (SVO, SOV, OVS, VSO), which complicates language modeling in the decoder component.

6. Comparison with Alternative Solutions

Table 8. Comparison of Audio2Text MCP Server with cloud and open-source alternatives for Slovak.

| Solution | WER (sk) | Latency | Cost/hour | Data Sovereignty | Offline |
| OpenAI Whisper API | ~8% | 3–8 s | $0.36 | No | No |
| Google Cloud STT | ~10% | 1–5 s | $0.48 | No | No |
| Azure Speech | ~12% | 1–4 s | $0.60 | No | No |
| Faster Whisper (local) | ~9% | Real-time | $0.00 | Yes | Yes |
| Audio2Text MCP | 11.2% | Real-time | $0.00 | Yes | Yes |
| WhisperKit (Apple) | ~13% | Real-time | $0.00 | Yes | Yes |

Audio2Text MCP Server distinguishes itself through native MCP integration: among the solutions in Table 8, it is the only one that can be invoked directly from Claude Code, VS Code Copilot, and other MCP clients without additional configuration. With a WER in the same range as the cloud solutions (11.2% vs. ~8–12%), it eliminates per-request costs and provides full data sovereignty.
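The cost column of Table 8 can be turned into a monthly figure for a given workload. In the sketch below the volume of 1000 audio hours per month is a hypothetical example. The $0.00 rate for local inference covers usage fees only and excludes hardware, power, and maintenance, which the report does not quantify.

```python
# Cost per audio hour (USD) from Table 8.
COST_PER_HOUR = {
    "OpenAI Whisper API": 0.36,
    "Google Cloud STT": 0.48,
    "Azure Speech": 0.60,
    "Audio2Text MCP": 0.00,
}

def monthly_cost(rate_per_hour, audio_hours_per_month):
    """Usage-based cost only; hardware, power, and maintenance are excluded."""
    return rate_per_hour * audio_hours_per_month

HOURS = 1000  # hypothetical monthly volume
for name, rate in COST_PER_HOUR.items():
    print(f"{name:20s} ${monthly_cost(rate, HOURS):8.2f} per month")
```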

7. Recommendations for Future Research

7.1 Short-Term Improvements (Q2–Q3 2026)

Table 9. Proposed optimizations with predicted WER impact.

| Strategy | Implementation | Predicted WER | Complexity |
| Fine-tuning on CommonVoice 17.0 (sk) | LoRA rank-16 on encoder + decoder, 3 epochs, lr=1e-5 | 7.5–8.5% | Medium |
| Slovak language model (beam search) | KenLM 4-gram from SK Wikipedia + Common Crawl, shallow fusion | 9.0–10.0% | Low |
| Diacritics post-processor | BERT-based diacritics restoration (SlovakBERT) | 10.0–10.5% | Low |
| Combination (LM + fine-tune + diacritics) | Pipeline: fine-tuned Whisper → LM rescore → diacritics fix | 5.5–7.0% | High |
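The "LM rescore" step can be illustrated with n-best rescoring, a simpler relative of shallow fusion: instead of adding the language model score at every beam-search step, it re-ranks complete hypotheses by a weighted sum of ASR and LM log-probabilities. The sketch below uses a toy dictionary in place of a KenLM model. All scores and the weight of 0.3 are made up for illustration and are not measurements from this study.

```python
def rescore(nbest, lm_logprob, lm_weight=0.3):
    """Pick the hypothesis maximizing log P_asr + lm_weight * log P_lm."""
    def combined(item):
        text, asr_logprob = item
        return asr_logprob + lm_weight * lm_logprob(text)
    return max(nbest, key=combined)[0]

# Toy stand-in for a KenLM model: made-up log-probabilities, not measured.
TOY_LM = {
    "na stretnutí s klientom": -8.0,
    "na stretnutí s klienta": -15.0,
}

def toy_lm_logprob(text):
    """Look up a toy LM score; unseen sentences get a low default."""
    return TOY_LM.get(text, -30.0)

# Made-up ASR scores in which the ungrammatical form is slightly preferred.
nbest = [
    ("na stretnutí s klienta", -4.0),
    ("na stretnutí s klientom", -4.5),
]
best = rescore(nbest, toy_lm_logprob)
print(best)
```

With the language model switched off (`lm_weight=0.0`) the ungrammatical hypothesis wins; with it on, the correctly inflected form is selected. In practice the weight would be tuned on a held-out set.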

7.2 Medium-Term Directions (2026–2027)

Whisper v4 and next-generation ASR models. The anticipated Whisper v4 (OpenAI, expected H2 2026) should bring improved multilingual performance thanks to an expanded training corpus. Competing models — Google USM (Universal Speech Model, 2B parameters), Meta MMS (Massively Multilingual Speech, 1100+ languages), and NVIDIA Canary-1B — indicate a trend toward specialized models for low-resource languages.

Slovak ASR dataset. We propose creating a dedicated Slovak ASR dataset (target: 1000+ hours) that combines: (i) expanded CommonVoice (currently ~100 hours), (ii) parliamentary recordings of the Slovak National Council, (iii) podcasts and broadcasts from RTVS, (iv) annotated telephone recordings. This dataset would enable both full fine-tuning and training from scratch.

End-to-end optimization for Slovak. A CTC/Transducer architecture (NVIDIA FastConformer) trained directly on a Slovak dataset could potentially achieve WER < 5% for read speech. It could be combined with on-device optimizations (INT8 quantization, speculative decoding) for real-time deployment on mobile devices.

7.3 Emerging AI Models and Trends

Table 10. Emerging ASR models and their relevance for Slovak.

| Model | Organization | Parameters | Expected Benefit for SK | Availability |
| Whisper v4 | OpenAI | ~2B | Expanded multilingual training, improved diacritics | H2 2026 (expected) |
| USM v2 | Google | 2B | 300+ languages, self-supervised pre-training on 12M hrs. | 2026 |
| MMS-1B | Meta | 1B | 1100+ languages including SK, CTC architecture | Available (open) |
| Canary-1B | NVIDIA | 1B | Multitask ASR+ST, NeMo framework | Available (open) |
| Seamless M4T v2 | Meta | 2.3B | Multimodal (ASR+ST+TTS), SK in training set | Available (open) |
| Gemini Audio | Google | - | Native audio understanding, multimodal context | API only |

8. Conclusions

This study demonstrates that on-premise ASR for Slovak is practically feasible with acceptable quality for production deployment. Key conclusions:

1. distil-large-v3 is the optimal model for on-premise Slovak ASR with WER 11.2% and 2.8× real-time throughput on CPU. It offers the best trade-off between accuracy, speed, and memory footprint.

2. Slovak presents specific challenges for ASR systems — rich morphology (32% of errors), diacritics (24%), and hybrid language style with anglicisms (14%) require targeted solutions beyond a generic multilingual model.

3. MCP architecture enables seamless integration of ASR into modern development and productivity tools, eliminating cloud dependencies and per-request costs while maintaining interoperability.

4. A predicted WER of 5–7% is achievable through a combination of fine-tuning on CommonVoice, a language model, and a diacritics post-processor, which would bring performance close to the level of commercial cloud solutions.

5. The next generation of ASR models (Whisper v4, MMS, USM v2) will likely reduce WER for Slovak below 5% within 12–18 months, making on-premise solutions a fully viable alternative to cloud services even for demanding applications.

References

  1. Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356.
  2. Gandhi, S., von Platen, P., & Rush, A.M. (2023). Distil-Whisper: Robust knowledge distillation via large-scale pseudo labelling. arXiv preprint arXiv:2311.00430.
  3. Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  4. Gulati, A., et al. (2020). Conformer: Convolution-augmented transformer for speech recognition. Proc. Interspeech 2020, 5036–5040.
  5. Lin, J., et al. (2022). On-device training under 256KB memory. Proc. NeurIPS 2022.
  6. Pratap, V., et al. (2023). Scaling speech technology to 1,000+ languages. arXiv preprint arXiv:2305.13516 (MMS).
  7. Zhang, Y., et al. (2023). Google USM: Scaling automatic speech recognition beyond 100 languages. arXiv preprint arXiv:2303.01037.
  8. Pikuliak, M., et al. (2022). SlovakBERT: Slovak masked language model. Findings of EMNLP 2022.
  9. Mozilla Common Voice (2024). Common Voice Corpus 17.0 — Slovak. Mozilla Foundation.
  10. Anthropic (2024). Model Context Protocol specification v1.0. github.com/modelcontextprotocol/specification.
  11. Peng, Y., et al. (2024). NVIDIA Canary: Multilingual multi-task ASR model. NVIDIA NeMo Toolkit.
  12. Barrault, L., et al. (2023). SeamlessM4T: Massively multilingual & multimodal machine translation. arXiv preprint arXiv:2308.11596.

Data Availability: Audio2Text MCP Server is available as open-source. The test corpus (CommonVoice subset) is publicly available at commonvoice.mozilla.org/sk.

Project References: bittechnology.bemooore.com/referencie/audio2text-mcp

Author: Ing. Stanislav Pittner, CEO of BIT Technology s.r.o., developer of Audio2Text MCP Server.

© 2026 Ing. Stanislav Pittner — BIT Technology s.r.o.