BIT Technology Research Series • AI Infrastructure

On-Premise Speech-to-Text for Slovak: A Comparative Study of Distilled and Full Whisper Models in MCP Server Architecture

Ing. Stanislav Pittner

BIT Technology s.r.o., Trstínska cesta 9, 917 01 Trnava, Slovakia

Published: April 12, 2026 • Test Period: January–March 2026 • Version 1.0

DOI: 10.5281/bittechnology.2026.tr002 (preprint)

Abstract

We present a comparative study of six Whisper model variants (tiny, base, small, medium, large-v3, distil-large-v3) for automatic speech recognition (ASR) in Slovak, deployed on-premise via a Model Context Protocol (MCP) server. On a test corpus of 847 audio segments (total duration 12.4 hours) comprising read speech, spontaneous meeting recordings, and telephone calls, the distilled model distil-large-v3 achieves a Word Error Rate (WER) of 11.2% at an average processing speed of 2.8× real-time on CPU (Apple M2 Pro). This is about 2.5× faster than the full large-v3 (WER 8.7%, 1.1× real-time), a roughly 60% reduction in processing time, while retaining 87% relative accuracy. For Slovak as a morphologically rich language, we identify key error patterns: inflection (32% of errors), diacritics (24%), compound words (18%), and anglicisms (14%). We propose an improvement strategy comprising LoRA fine-tuning on the Slovak CommonVoice 17.0 dataset, language model integration, and a diacritics post-processor, with a predicted WER reduction to 5.5–7.0% for production deployment.

Keywords: ASR, Whisper, knowledge distillation, Slovak, MCP, on-premise, Word Error Rate, speech-to-text, edge AI

1. Introduction

Automatic Speech Recognition (ASR) has changed substantially in recent years with the arrival of large multilingual models. OpenAI Whisper (Radford et al., 2022), trained on 680,000 hours of weakly annotated web data, demonstrated near-human accuracy for high-resource languages (English, Spanish, French). For low- and medium-resource languages, including Slovak, a significant performance deficit persists: WER for Slovak is typically 2–3× higher than for English on comparable domains (Whisper GitHub Issues #1314, OpenASR Leaderboard 2024).

At the same time, demand for on-premise ASR solutions is growing, motivated by: (i) data sovereignty: sensitive audio data (healthcare, legal services, internal communications) cannot leave local infrastructure; (ii) predictable costs: cloud ASR services cost $0.006–$0.024 per minute, which at large volumes becomes a significant operational expense; (iii) latency: local inference eliminates the network round trip, which is critical for real-time applications.

Knowledge distillation (Hinton et al., 2015) offers a path to deploying large ASR models on edge devices. Gandhi et al. (2023) demonstrated that a distilled Whisper model retains 95–99% of the teacher's accuracy at 2–6× inference speedup. The open question remains how this reduction manifests for morphologically complex languages such as Slovak.

This study addresses three research questions: (1) What is the performance of individual Whisper variants for Slovak in an on-premise environment? (2) What is the trade-off between accuracy and speed for distilled models? (3) What strategies can improve Slovak speech recognition in future iterations?

2. System Architecture

2.1 Audio2Text MCP Server

The server implements the Model Context Protocol (MCP), an open standard for AI tool integration developed by Anthropic. Communication uses JSON-RPC 2.0 messages over the stdio and Server-Sent Events (SSE) transports, enabling integration with Claude Code, VS Code, JetBrains IDEs, and other MCP clients.
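To make the message format concrete, the following is a minimal sketch of how a client might build a JSON-RPC 2.0 request that invokes a tool through MCP's `tools/call` method. The tool name `transcribe_audio` and its argument names are hypothetical; the report does not list the tools the server exposes.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Return a JSON-RPC 2.0 request that invokes an MCP tool via tools/call."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical tool name and argument names, for illustration only.
request = build_tool_call(
    1,
    "transcribe_audio",
    {"path": "meeting.wav", "language": "sk", "format": "srt"},
)

# On the stdio transport each message travels as a single line of JSON.
line = json.dumps(request, ensure_ascii=False)
print(line)
```

The server would answer with a JSON-RPC response carrying the same `id`, which lets a client match results to requests when several transcriptions are in flight.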

Audio2Text MCP Server: Processing Pipeline

Audio Input (MP3/WAV/M4A/FLAC) → FFmpeg (resampling to 16 kHz) → VAD Filter (Silero VAD) → Whisper (inference engine) → Post-processing (punctuation + formatting) → MCP Output (JSON/SRT/VTT)

stdio / SSE transport • JSON-RPC 2.0 • Content-addressable cache
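The pipeline summary mentions a content-addressable cache but does not describe how entries are keyed. One plausible scheme, sketched here under that assumption, hashes the audio bytes together with the decoding settings so that identical requests can reuse an earlier transcript. The function name and the choice of settings are illustrative, not taken from the server.

```python
import hashlib


def cache_key(audio_bytes: bytes, model: str, language: str = "sk") -> str:
    """Derive a cache key from the audio content and the decoding settings."""
    digest = hashlib.sha256()
    digest.update(audio_bytes)
    # Mix in the settings so the same audio transcribed with a different
    # model or language gets its own cache entry.
    digest.update(f"|{model}|{language}".encode("utf-8"))
    return digest.hexdigest()


sample = b"RIFF....WAVEfmt "  # placeholder bytes standing in for a real file
key_a = cache_key(sample, "distil-large-v3")
key_b = cache_key(sample, "large-v3")
print(key_a != key_b)
```

Because the key depends only on content and settings, renaming or moving a file does not invalidate its cached transcript.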

2.2 Hardware Environment

Table 1. Test infrastructure for benchmark experiments.

Parameter	Configuration A (Edge)	Configuration B (Workstation)
CPU	Apple M2 Pro (12-core)	AMD EPYC 7543P (32-core)
RAM	32 GB Unified	128 GB DDR4 ECC
GPU	Integrated (Metal)	NVIDIA A4000 (16 GB)
Storage	1 TB NVMe SSD	2 TB NVMe RAID-0
OS	macOS 15.3	Ubuntu 22.04 LTS
Runtime	Python 3.12 + PyTorch 2.3	Python 3.12 + PyTorch 2.3 + CUDA 12.4

3. Models and Methods

3.1 Tested Models

We evaluated six Whisper variants covering the full size spectrum from 39M to 1550M parameters, including a distilled version optimized for efficient inference:

Table 2. Overview of tested Whisper models and their architectures.

Model	Parameters	Encoder Layers	Decoder Layers	Size (FP16)	Type
whisper-tiny	39M	4	4	77 MB	Full
whisper-base	74M	6	6	148 MB	Full
whisper-small	244M	12	12	488 MB	Full
whisper-medium	769M	24	24	1.53 GB	Full
whisper-large-v3	1550M	32	32	3.09 GB	Full
distil-large-v3	756M	32	2	1.51 GB	Distilled

The distilled model (distil-large-v3, Gandhi et al. 2023) retains the full 32-layer encoder from large-v3 but reduces the decoder from 32 to 2 layers through knowledge distillation. Training was performed on 22,000 hours of pseudo-labeled data from the large-v3 teacher.
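The FP16 sizes in Table 2 follow from the parameter counts at 2 bytes per parameter. The short check below reproduces them to within a few percent; small differences are expected because the parameter counts are rounded. Reading the table's MB as 10^6 bytes is an assumption.

```python
def fp16_size_mb(params_millions: float) -> float:
    """Approximate FP16 checkpoint size in MB (10^6 bytes), 2 bytes per parameter."""
    return params_millions * 2  # millions of parameters x 2 bytes = MB


# Parameter counts (in millions) as listed in Table 2.
PARAMS_M = {
    "whisper-tiny": 39,
    "whisper-base": 74,
    "whisper-small": 244,
    "whisper-medium": 769,
    "whisper-large-v3": 1550,
    "distil-large-v3": 756,
}

for name, params in PARAMS_M.items():
    print(f"{name}: ~{fp16_size_mb(params):.0f} MB")
```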

3.2 Test Corpus

The test dataset was compiled from three sources covering various acoustic conditions and speaking styles of Slovak:

Table 3. Composition of the test corpus.

Source	Segments	Duration (hrs)	Avg. Duration (s)	Speech Type	SNR (dB)
Mozilla CommonVoice 17.0 (sk)	412	5.8	50.7	Read speech	32–45
Internal recordings (meetings)	285	4.1	51.8	Spontaneous speech	18–30
Telephone calls (8 kHz)	150	2.5	60.0	Conversational	12–22
Total	847	12.4	52.7	—	12–45

3.3 Evaluation Metrics

The primary metric is Word Error Rate (WER), defined as:

WER = (S + D + I) / N × 100 % (1)

where S = substitutions, D = deletions, I = insertions, N = total number of words in the reference transcription. We further report Character Error Rate (CER) for capturing diacritical errors, Real-Time Factor (RTF = processing time / audio duration), and throughput in seconds of audio per second of computation.
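The metrics above can be sketched in a few lines of Python. This is a minimal illustration using a standard Levenshtein alignment, not the evaluation script used in the study; text normalization (casing, punctuation) is omitted and would need to match the study's conventions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (S + D + I)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference item missing in output
                curr[j - 1] + 1,     # insertion: extra item in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate in percent, per Equation (1)."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate in percent: the same formula over characters."""
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-Time Factor: processing time divided by audio duration."""
    return processing_seconds / audio_seconds


# One wrong word out of four in the reference.
print(wer("na stretnutí s klientom", "na stretnutí s klienta"))  # 25.0
```

For a corpus-level figure, the usual practice is to sum the edit distances over all segments and divide by the total reference word count rather than averaging per-segment percentages.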

4. Results

4.1 Overall Model Performance

Figure 1. Word Error Rate (WER, %) for individual Whisper models on the Slovak test corpus (n = 847 segments). Lower value = better accuracy. distil-large-v3 achieves a WER comparable to the medium model at roughly 1.7× lower inference time (RTF 0.36 vs. 0.62).

Table 4. Summary benchmark results for all models on the Slovak corpus (Configuration A: Apple M2 Pro, CPU-only).

Model	WER (%)	CER (%)	RTF (CPU)	Throughput (s/s)	VRAM/RAM	Relative Accuracy
whisper-tiny	38.4	22.1	0.08	12.5×	~0.5 GB	44%
whisper-base	28.7	16.3	0.14	7.1×	~0.8 GB	58%
whisper-small	17.3	10.2	0.31	3.2×	~1.5 GB	75%
whisper-medium	12.1	7.4	0.62	1.6×	~3.0 GB	84%
distil-large-v3	11.2	6.8	0.36	2.8×	~2.8 GB	87%
whisper-large-v3	8.7	5.1	0.91	1.1×	~6.2 GB	100% (ref.)

Key Finding: distil-large-v3 as the Optimal Model
WER 11.2%: only 2.5 percentage points higher than the full large-v3 (8.7%)
2.8× real-time on CPU: suitable for both real-time and batch processing
Less than half the memory footprint of large-v3 (2.8 GB vs. 6.2 GB)
Best accuracy/speed trade-off among the tested models

4.2 Performance by Domain

Figure 2. WER (%) by audio domain for the three best-performing models. Telephone calls (8 kHz) consistently exhibit the highest WER across models, which correlates with lower SNR and limited frequency bandwidth.

Table 5. WER (%) by audio domain and model.

Domain	large-v3	distil-large-v3	medium	small
Read speech (CommonVoice)	5.3	7.1	8.4	12.6
Spontaneous speech (meetings)	10.2	13.8	14.9	21.3
Telephone calls (8 kHz)	14.8	17.4	18.2	24.7
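As a consistency check, the per-domain figures in Table 5 can be aggregated into a single corpus-level estimate. The sketch below weights each domain by its audio duration from Table 3. This is an assumption: an exact aggregate would weight by reference word count, which the report does not give. With these inputs the estimates land within about one percentage point of the overall WER values in Table 4.

```python
# Audio duration per domain in hours (Table 3).
DURATION_H = {"read": 5.8, "meetings": 4.1, "telephone": 2.5}

# WER (%) per domain and model (Table 5).
DOMAIN_WER = {
    "large-v3": {"read": 5.3, "meetings": 10.2, "telephone": 14.8},
    "distil-large-v3": {"read": 7.1, "meetings": 13.8, "telephone": 17.4},
    "medium": {"read": 8.4, "meetings": 14.9, "telephone": 18.2},
    "small": {"read": 12.6, "meetings": 21.3, "telephone": 24.7},
}


def weighted_wer(per_domain, weights):
    """Weighted mean of per-domain WER values (duration as a proxy for word count)."""
    total = sum(weights.values())
    return sum(per_domain[d] * weights[d] for d in weights) / total


for model, per_domain in DOMAIN_WER.items():
    print(f"{model}: ~{weighted_wer(per_domain, DURATION_H):.1f}%")
```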

4.3 GPU Acceleration

Table 6. Comparison of CPU and GPU inference for distil-large-v3, large-v3, and medium. GPU values were measured on Configuration B (NVIDIA A4000, FP16); the RTF (CPU) column matches the values in Table 4.

Model	RTF (CPU)	RTF (GPU FP16)	Speedup	GPU Throughput
distil-large-v3	0.36	0.05	7.2×	20.0× RT
whisper-large-v3	0.91	0.12	7.6×	8.3× RT
whisper-medium	0.62	0.07	8.9×	14.3× RT

On GPU (NVIDIA A4000, FP16), distil-large-v3 achieves 20× real-time throughput, enabling the processing of 1 hour of audio in 3 minutes. This is a critical parameter for batch processing of archives and call center recordings.
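The relationships between RTF, throughput, and batch processing time used in Tables 4 and 6 are simple conversions. The helper names below are illustrative; the RTF values are taken from the tables.

```python
def throughput(rtf: float) -> float:
    """Seconds of audio processed per second of computation (1 / RTF)."""
    return 1.0 / rtf


def processing_minutes(audio_hours: float, rtf: float) -> float:
    """Wall-clock minutes needed to transcribe the given amount of audio."""
    return audio_hours * 60.0 * rtf


def speedup(rtf_baseline: float, rtf_new: float) -> float:
    """How many times faster the new configuration is than the baseline."""
    return rtf_baseline / rtf_new


# distil-large-v3 on GPU (RTF 0.05): one hour of audio.
print(f"{processing_minutes(1.0, 0.05):.1f} min")
# distil-large-v3 vs. large-v3 on CPU (RTF 0.36 vs. 0.91).
print(f"{speedup(0.91, 0.36):.2f}x faster")
# distil-large-v3 on CPU expressed as throughput.
print(f"{throughput(0.36):.1f}x real-time")
```

The same functions give a quick estimate for archive jobs: multiply the archive length in hours by 60 and by the measured RTF of the target hardware.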

5. Error Analysis for Slovak

5.1 Categorization of Error Patterns

Manual analysis of 500 randomly selected error segments from distil-large-v3 output identified four dominant error categories specific to Slovak:

Figure 3. Distribution of error categories for Slovak (distil-large-v3, n = 500 analyzed errors). Morphological errors (inflection, conjugation) represent the largest category, followed by diacritical errors.

Table 7. Typical error patterns by category with examples.

Category	Share	Reference	Whisper Output	Error Type
Morphology	32%	na stretnutí s klientom	na stretnutí s klienta	Inflection (instr.)
Morphology	32%	hovorili sme o tom	hovoril sme o tom	Number agreement
Diacritics	24%	žiadosť o povolenie	ziadosť o povolenie	Carons (ž→z)
Diacritics	24%	výskumná správa	výskumna správa	Acutes (á→a)
Compound/technical	18%	železničná infraštruktúra	železničná infra štruktúra	Segmentation
Compound/technical	18%	konateľ spoločnosti	konatiel spoločnosti	OOV word
Anglicisms/code-switch	14%	otvorte pull request	otvorte pool request	Eng. phonetics
Anglicisms/code-switch	14%	v dashboarde vidíte	v dash borde vidíte	Hybrid words
Other	12%	—	—	Hallucinations, repeated words, missing punctuation, homophones
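The categorization in this study was done manually. A simple heuristic can nevertheless pre-sort reference/output pairs before human review: if two strings become identical after removing combining marks, the error is diacritics-only, and if they become identical after removing spaces, it is a segmentation error. The sketch below is an assumed pre-sorting aid, not the procedure used for Table 7; morphological and anglicism errors still fall into "other" and need manual labeling.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (ž -> z, á -> a)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def presort_error(reference: str, hypothesis: str) -> str:
    """Assign a coarse error label to a reference/output pair."""
    if reference == hypothesis:
        return "correct"
    if strip_diacritics(reference) == strip_diacritics(hypothesis):
        return "diacritics"
    if reference.replace(" ", "") == hypothesis.replace(" ", ""):
        return "segmentation"
    return "other"


# Pairs taken from Table 7.
pairs = [
    ("žiadosť o povolenie", "ziadosť o povolenie"),
    ("železničná infraštruktúra", "železničná infra štruktúra"),
    ("na stretnutí s klientom", "na stretnutí s klienta"),
]
for ref, hyp in pairs:
    print(presort_error(ref, hyp), "|", ref, "->", hyp)
```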

5.2 Language-Specific Factors

Slovak, as a West Slavic language, places specific demands on ASR systems:

Rich inflection. With 6 cases, 3 genders, and 12 basic noun declension paradigms (four per gender), the number of unique word forms per lemma ranges from 12 (nouns) to 60+ (verbs). Whisper, trained predominantly on English, a language with minimal inflection, has limited capacity to capture these morphological nuances. Example: the word "projekt" (project) has forms projekt/projektu/projektu/projekt/projekte/projektom (sg.) and projekty/projektov/projektom/projekty/projektoch/projektmi (pl.).

Diacritical characters. Slovak orthography uses 17 letters with diacritics (á, ä, č, ď, é, í, ĺ, ľ, ň, ó, ô, ŕ, š, ť, ú, ý, ž). Incorrect recognition of diacritics changes word meaning (e.g., "sud" [barrel] vs. "súd" [court], "byt" [apartment] vs. "byť" [to be]) and accounts for 24% of all errors.

Free word order. Unlike English with fixed SVO word order, Slovak allows variable ordering (SVO, SOV, OVS, VSO), which complicates language modeling in the decoder component.

6. Comparison with Alternative Solutions

Table 8. Comparison of Audio2Text MCP Server with cloud and open-source alternatives for Slovak.

Solution	WER (sk)	Latency	Cost/hour	Data Sovereignty	Offline
OpenAI Whisper API	~8%	3–8 s	$0.36	No	No
Google Cloud STT	~10%	1–5 s	$0.48	No	No
Azure Speech	~12%	1–4 s	$0.60	No	No
Faster Whisper (local)	~9%	Real-time	$0.00	Yes	Yes
Audio2Text MCP	11.2%	Real-time	$0.00	Yes	Yes
WhisperKit (Apple)	~13%	Real-time	$0.00	Yes	Yes

Audio2Text MCP Server distinguishes itself through native MCP integration — it is the only implementation enabling direct invocation from Claude Code, VS Code Copilot, and other MCP clients without additional configuration. At comparable accuracy to cloud solutions, it eliminates per-request costs and guarantees full data sovereignty.

7. Recommendations for Future Research

7.1 Short-Term Improvements (Q2–Q3 2026)

Table 9. Proposed optimizations with predicted WER impact.

Strategy	Implementation	Predicted WER	Complexity
Fine-tuning on CommonVoice 17.0 (sk)	LoRA rank-16 on encoder + decoder, 3 epochs, lr=1e-5	7.5–8.5%	Medium
Slovak language model (beam search)	KenLM 4-gram from SK Wikipedia + Common Crawl, shallow fusion	9.0–10.0%	Low
Diacritics post-processor	BERT-based diacritics restoration (SlovakBERT)	10.0–10.5%	Low
Combination (LM + fine-tune + diacritics)	Pipeline: fine-tuned Whisper → LM rescore → diacritics fix	5.5–7.0%	High
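The language model strategy in Table 9 combines the ASR score with an external LM score as score(y) = log P_ASR(y | x) + λ · log P_LM(y). True shallow fusion applies this at every beam-search step; the sketch below uses the simpler n-best rescoring variant, which applies the same formula to complete hypotheses. The log-probabilities and the weight λ = 0.3 are made-up illustrative values, not measurements from this study.

```python
def rescore_nbest(hypotheses, lm_weight=0.3):
    """Rank hypotheses by ASR log-prob plus weighted LM log-prob, best first.

    hypotheses: list of (text, asr_logprob, lm_logprob) tuples.
    """
    scored = [
        (asr_lp + lm_weight * lm_lp, text)
        for text, asr_lp, lm_lp in hypotheses
    ]
    return [text for _, text in sorted(scored, reverse=True)]


# Illustrative scores: the acoustic model slightly prefers the wrong
# inflection, while the language model strongly prefers the correct one.
nbest = [
    ("na stretnutí s klienta", -4.1, -14.0),
    ("na stretnutí s klientom", -4.3, -9.5),
]
print(rescore_nbest(nbest)[0])
print(rescore_nbest(nbest, lm_weight=0.0)[0])
```

In practice the LM scores would come from the n-gram model itself, and both scores must use the same logarithm base before they are combined; the weight is typically tuned on a held-out set.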

7.2 Medium-Term Directions (2026–2027)

Whisper v4 and next-generation ASR models. The anticipated Whisper v4 (OpenAI, expected H2 2026) should bring improved multilingual performance thanks to an expanded training corpus. Competing models, including Google USM (Universal Speech Model, 2B parameters), Meta MMS (Massively Multilingual Speech, 1100+ languages), and NVIDIA Canary-1B, indicate a trend toward broader multilingual coverage that reaches low-resource languages.

Slovak ASR dataset. We propose creating a dedicated Slovak ASR dataset (target: 1000+ hours) that combines: (i) expanded CommonVoice (currently ~100 hours), (ii) parliamentary recordings of the Slovak National Council, (iii) podcasts and broadcasts from RTVS, (iv) annotated telephone recordings. This dataset would enable both full fine-tuning and training from scratch.

End-to-end optimization for Slovak. A CTC/Transducer architecture (NVIDIA FastConformer) trained directly on a Slovak dataset could potentially achieve WER < 5% for read speech. It could be combined with on-device optimizations (INT8 quantization, speculative decoding) for real-time deployment on mobile devices.

7.3 Emerging AI Models and Trends

Table 10. Emerging ASR models and their relevance for Slovak.

Model	Organization	Parameters	Expected Benefit for SK	Availability
Whisper v4	OpenAI	~2B	Expanded multilingual training, improved diacritics	H2 2026 (expected)
USM v2	Google	2B	300+ languages, self-supervised pre-training on 12M hrs.	2026
MMS-1B	Meta	1B	1100+ languages including SK, CTC architecture	Available (open)
Canary-1B	NVIDIA	1B	Multitask ASR+ST, NeMo framework	Available (open)
Seamless M4T v2	Meta	2.3B	Multimodal (ASR+ST+TTS), SK in training set	Available (open)
Gemini Audio	Google	—	Native audio understanding, multimodal context	API only

8. Conclusions

This study demonstrates that on-premise ASR for Slovak is practically feasible with acceptable quality for production deployment. Key conclusions:

1. distil-large-v3 is the optimal model for on-premise Slovak ASR with WER 11.2% and 2.8× real-time throughput on CPU. It offers the best trade-off between accuracy, speed, and memory footprint.

2. Slovak presents specific challenges for ASR systems — rich morphology (32% of errors), diacritics (24%), and hybrid language style with anglicisms (14%) require targeted solutions beyond a generic multilingual model.

3. MCP architecture enables seamless integration of ASR into modern development and productivity tools, eliminating cloud dependencies and per-request costs while maintaining interoperability.

4. A predicted WER of 5.5–7.0% (Table 9) is achievable through a combination of fine-tuning on CommonVoice, a language model, and a diacritics post-processor, which would bring performance close to the level of commercial cloud solutions.

5. The next generation of ASR models (Whisper v4, MMS, USM v2) will likely reduce WER for Slovak below 5% within 12–18 months, making on-premise solutions a fully viable alternative to cloud services even for demanding applications.

References

Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356.
Gandhi, S., von Platen, P., & Rush, A.M. (2023). Distil-Whisper: Robust knowledge distillation via large-scale pseudo labelling. arXiv preprint arXiv:2311.00430.
Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
Gulati, A., et al. (2020). Conformer: Convolution-augmented transformer for speech recognition. Proc. Interspeech 2020, 5036–5040.
Lin, J., et al. (2022). On-device training under 256KB memory. Proc. NeurIPS 2022.
Pratap, V., et al. (2023). Scaling speech technology to 1,000+ languages. arXiv preprint arXiv:2305.13516 (MMS).
Zhang, Y., et al. (2023). Google USM: Scaling automatic speech recognition beyond 100 languages. arXiv preprint arXiv:2303.01037.
Pikuliak, M., et al. (2022). SlovakBERT: Slovak masked language model. Findings of EMNLP 2022.
Mozilla Common Voice (2024). Common Voice Corpus 17.0 — Slovak. Mozilla Foundation.
Anthropic (2024). Model Context Protocol specification v1.0. github.com/modelcontextprotocol/specification.
Peng, Y., et al. (2024). NVIDIA Canary: Multilingual multi-task ASR model. NVIDIA NeMo Toolkit.
Barrault, L., et al. (2023). SeamlessM4T: Massively multilingual & multimodal machine translation. arXiv preprint arXiv:2308.11596.

Data Availability: Audio2Text MCP Server is available as open source. The CommonVoice portion of the test corpus is publicly available at commonvoice.mozilla.org/sk.

Project References: bittechnology.bemooore.com/referencie/audio2text-mcp

Author: Ing. Stanislav Pittner, CEO of BIT Technology s.r.o., developer of Audio2Text MCP Server.