BIT Technology Research Series • Digital Health

eHealthAI: AI Platform for Aggregation and Interpretation of Health Data from Apple Watch/iPhone with Evidence-Based Recommendations Powered by a RAG Pipeline

Ing. Stanislav Pittner

BIT Technology s.r.o., Trstínska cesta 9, 917 01 Trnava, Slovakia

Published: October 15, 2025 • Testing period: June – September 2025 • Version 1.0

DOI: 10.5281/bittechnology.2025.tr003 (preprint)

Abstract

We present the eHealthAI platform, which aggregates biometric data from Apple HealthKit (HRV, SpO2, heart rate, sleep stages, VO2max, and others) and provides contextual health interpretations through a Retrieval-Augmented Generation (RAG) pipeline indexing 50,000+ medical publications from PubMed, Cochrane Library, and WHO Guidelines. On a cohort of 127 users over a 3-month evaluation period, the platform achieves a response accuracy of 85.4 % (share of responses rated ≥ 4/5 by a certified physician), and the RAG pipeline reduces the LLM hallucination rate from 23.1 % to 4.7 % relative to a non-RAG baseline. Anomaly detection in biometric patterns operates with a latency below 5 minutes. The system aggregates 15+ types of biometric data and generates personalized health reports with direct citations from primary scientific sources. Key limitations: the platform is not a certified medical device and does not provide diagnostic conclusions.

Keywords: digital health, Apple HealthKit, RAG, LLM, PubMed, wearable, HRV, biometrics, personalized health, evidence-based medicine

1. Introduction

Consumer wearable devices — particularly Apple Watch and iPhone — generate an unprecedented volume of biometric data. Apple Watch Series 9 continuously monitors heart rate, heart rate variability (HRV), blood oxygen saturation (SpO2), ECG, wrist temperature, sleep stages, and physical activity. These data are stored in Apple HealthKit; however, the native Apple Health app provides only basic summary statistics without expert medical interpretation.

Perez et al. (NEJM 2019) demonstrated the clinical relevance of wearable data in detecting atrial fibrillation with a positive predictive value of 84 %. Topol (Nature Medicine 2019) identified AI-driven interpretation of biometric data as a key enabler for preventive medicine. Nevertheless, a significant gap exists between collected data and their clinically relevant interpretation — users see numbers but do not understand their medical context.

Retrieval-Augmented Generation (RAG, Lewis et al. NeurIPS 2020) offers a solution to this problem: by combining an LLM with a retrieval system over medical literature, it is possible to generate contextual responses with direct citations from scientific sources, thereby significantly reducing the hallucination rate typical of standalone LLMs.

This study addresses three research questions: (1) What accuracy does the RAG pipeline over PubMed/Cochrane achieve in interpreting biometric data? (2) What is the impact of RAG on hallucination reduction compared to a non-RAG LLM? (3) What is the user satisfaction and clinical relevance of generated recommendations?

2. System Architecture

2.1 Platform Overview

eHealthAI: RAG Health Intelligence Pipeline

Apple Watch (HealthKit API) → iCloud Sync (E2E Encryption) → ETL Pipeline (Normalization) → RAG Engine (Vector DB + Retrieval) → LLM Synthesis (Citation Generation) → Health Report (PDF + Dashboard)

FastAPI backend • PubMed + Cochrane + WHO indexed • 50,000+ publications

2.2 RAG Pipeline

The RAG engine indexes medical literature in three phases: (i) Ingestion: downloading and parsing abstracts from the PubMed API (36M+ articles), Cochrane systematic reviews (8,000+), and WHO guidelines; (ii) Embedding: splitting each document into chunks of 512 tokens with a 50-token overlap and generating a 384-dimensional vector for each chunk with all-MiniLM-L6-v2; (iii) Retrieval + Reranking: semantic search via ChromaDB, followed by cross-encoder reranking (ms-marco-MiniLM-L-6-v2) of the top-k = 10 documents. The final ranking score combines both signals:
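The chunking step in phase (ii) can be sketched as a sliding window. This is a minimal illustration, not the platform's code: the function name `chunk_tokens` is hypothetical, and whitespace splitting stands in for the embedding model's real tokenizer, which the report does not specify.

```python
from typing import List


def chunk_tokens(tokens: List[str], size: int = 512, overlap: int = 50) -> List[List[str]]:
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap  # with the defaults, each window starts 462 tokens later
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Hypothetical 1,000-token abstract; whitespace splitting is an assumption.
abstract = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_tokens(abstract.split())
print([len(c) for c in chunks])  # [512, 512, 76]
```

The overlap means that a sentence falling on a chunk boundary appears intact in at least one chunk, at the cost of embedding some tokens twice.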

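Score_final = α · sim_cosine(q, d) + (1 − α) · score_rerank(q, d), α = 0.4 (1)

where q is the user query, d is a candidate document chunk, sim_cosine is the cosine similarity of their embeddings, and score_rerank is the cross-encoder relevance score.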
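As a worked illustration of Eq. (1), the sketch below fuses the two scores and keeps the top k candidates. The function names and the candidate values are hypothetical. The report does not say how the two scores are scaled, so the example assumes both have already been mapped to the range [0, 1].

```python
from typing import List, Tuple

ALPHA = 0.4  # weight of the cosine term, as stated in Eq. (1)


def final_score(sim_cosine: float, score_rerank: float, alpha: float = ALPHA) -> float:
    """Weighted combination of embedding similarity and reranker score."""
    return alpha * sim_cosine + (1.0 - alpha) * score_rerank


def rank(candidates: List[Tuple[str, float, float]], k: int = 10) -> List[Tuple[str, float]]:
    """Score (doc_id, sim_cosine, score_rerank) triples and return the k best."""
    scored = [(doc_id, final_score(sim, rr)) for doc_id, sim, rr in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical candidates; scores assumed to lie in [0, 1].
candidates = [("a", 0.9, 0.2), ("b", 0.6, 0.8), ("c", 0.3, 0.5)]
for doc_id, score in rank(candidates):
    print(doc_id, round(score, 2))
```

With α = 0.4 the reranker carries more weight than the embedding similarity, so document "b" outranks "a" despite its lower cosine score.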

3. Data and Methods

3.1 Biometric Data

Table 1. Types of biometric data collected from Apple HealthKit and their clinical relevance.

Biometric Parameter	Device	Frequency	Unit	Clinical Relevance
Heart rate (HR)	Apple Watch	Continuous	bpm	Cardiovascular health
Heart rate variability (HRV)	Apple Watch	During sleep	ms (SDNN)	Autonomic regulation, stress
Blood oxygen saturation (SpO2)	Apple Watch	Periodic	%	Respiratory function
Sleep stages	Apple Watch	Daily	min (REM/Core/Deep)	Sleep quality, recovery
VO2max	Apple Watch	During activity	mL/kg/min	Cardiorespiratory fitness
Wrist temperature	Apple Watch	During sleep	°C (deviation)	Cycle, infections
Steps + distance	iPhone/Watch	Continuous	steps / km	Physical activity
Respiratory rate	Apple Watch	During sleep	breaths/min	Respiratory health

3.2 Knowledge Base

Table 2. Medical literature sources indexed in the RAG pipeline.

Source	Type	Number of Documents	Update Frequency	Coverage
PubMed	Abstracts + metadata	36,000,000+	Daily	Complete biomedicine
Cochrane Library	Systematic reviews	8,500+	Monthly	Evidence-based medicine
WHO Guidelines	Clinical guidelines	420+	Quarterly	Global health standards
NICE Guidelines (UK)	Clinical guidelines	310+	Quarterly	UK/EU clinical practice
Indexed in RAG	Embedded chunks	52,400	—	Relevant to wearable health

3.3 Evaluation Methodology

The evaluation was conducted on a cohort of 127 volunteers (age 24–67, 58 % male) over 3 months. Each participant used an Apple Watch and the eHealthAI platform daily. Generated health reports were assessed by a certified internist on a 5-point scale (1 = incorrect, 5 = clinically accurate). A total of 1,840 responses were evaluated. Response accuracy is reported as the share of responses rated ≥ 4.
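The accuracy metric used throughout the results (share of ratings ≥ 4/5) and the mean score can be computed as in the following sketch. The function names and the ten example ratings are hypothetical and are not taken from the study data.

```python
from typing import Sequence


def accuracy_at_threshold(ratings: Sequence[int], threshold: int = 4) -> float:
    """Percentage of physician ratings at or above `threshold` on the 1-5 scale."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    return 100.0 * sum(r >= threshold for r in ratings) / len(ratings)


def mean_score(ratings: Sequence[int]) -> float:
    """Arithmetic mean of the 1-5 ratings."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    return sum(ratings) / len(ratings)


# Hypothetical ratings for ten responses.
ratings = [5, 4, 4, 3, 5, 2, 4, 5, 4, 1]
print(accuracy_at_threshold(ratings))  # 70.0
print(mean_score(ratings))             # 3.7
```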

4. Results

4.1 Response Accuracy by Medical Domain

Figure 1. AI response accuracy (% of ratings ≥ 4/5 by a certified physician) by medical domain (n = 1,840 responses). Fitness (91 %) and cardiology (89 %) achieve the highest accuracy, which we attribute to the rich evidence base for these domains in PubMed.

Table 3. Comparison of RAG vs. non-RAG LLM on identical questions (n = 400 paired comparisons).

Metric	eHealthAI (RAG)	LLM without RAG	Improvement
Response accuracy (≥ 4/5)	85.4%	64.2%	+21.2 pp
Hallucination rate	4.7%	23.1%	−18.4 pp
Citation accuracy	91.3%	N/A	—
Factual correctness	92.8%	71.4%	+21.4 pp
Mean score (1–5)	4.21	3.14	+1.07
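The Improvement column in Table 3 reports absolute differences in percentage points (pp). The sketch below reproduces those differences from the table values and also derives the relative change, which the table does not report. The function names are illustrative.

```python
def pp_change(rag: float, baseline: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(rag - baseline, 1)


def relative_change(rag: float, baseline: float) -> float:
    """Change relative to the baseline value, in percent."""
    return 100.0 * (rag - baseline) / baseline


# (RAG, non-RAG) pairs copied from Table 3.
table3 = {
    "Response accuracy": (85.4, 64.2),
    "Hallucination rate": (4.7, 23.1),
    "Factual correctness": (92.8, 71.4),
}
for metric, (rag, base) in table3.items():
    print(f"{metric}: {pp_change(rag, base):+.1f} pp, {relative_change(rag, base):+.1f} % relative")
```

For the hallucination rate, the 18.4 pp drop corresponds to a relative reduction of roughly 80 % against the non-RAG baseline.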

4.2 Anomaly Detection

Figure 2. Mean anomaly detection latency by biometric type (seconds). All types fall within the 5-minute SLA target.

Table 4. User study results (n = 127, 3 months).

Metric	Value	Benchmark
Daily usage rate	78%	Typical health app: 30–40%
Net Promoter Score (NPS)	+62	Health tech average: +35
Reported behavior changes	67%	—
Anomaly accuracy (physician)	89.2%	—
Average time in app	4.2 min/day	Apple Health: 1.1 min/day
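For readers unfamiliar with the Net Promoter Score in Table 4, the standard definition is the percentage of promoters (answers of 9 or 10 on a 0–10 scale) minus the percentage of detractors (answers of 0 to 6). The sketch below applies that definition to hypothetical survey answers. The report does not describe how its own survey was administered.

```python
from typing import Sequence


def net_promoter_score(answers: Sequence[int]) -> float:
    """NPS = % promoters (9-10) minus % detractors (0-6) on a 0-10 scale."""
    if not answers:
        raise ValueError("answers must not be empty")
    promoters = sum(a >= 9 for a in answers)
    detractors = sum(a <= 6 for a in answers)
    return 100.0 * (promoters - detractors) / len(answers)


# Hypothetical answers from ten respondents.
answers = [10, 9, 9, 8, 7, 6, 3, 10, 9, 10]
print(net_promoter_score(answers))  # 40.0
```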

4.3 HRV as a Biomarker

Heart rate variability (HRV, measured as SDNN during sleep) proved to be the most informative parameter for overall health assessment. Correlation analysis on our cohort confirmed a significant relationship between HRV and subjectively reported sleep quality (r = 0.68, p < 0.001), stress load (r = −0.54, p < 0.001), and physical performance (r = 0.47, p < 0.01), which is consistent with findings by Shaffer & Ginsberg (2017).
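The coefficients above are Pearson correlations. A minimal sketch of the computation follows. The function name and the sample values are hypothetical, and the report does not state which software was used for the analysis.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (spread_x * spread_y)


# Hypothetical nightly SDNN values (ms) and self-reported sleep quality (1-10).
sdnn_ms = [32.0, 45.0, 51.0, 38.0, 60.0, 41.0]
sleep_quality = [4, 6, 8, 5, 7, 7]
print(round(pearson_r(sdnn_ms, sleep_quality), 2))
```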

5. Comparison with Alternatives

Table 5. Comparison of eHealthAI with existing solutions for health data interpretation.

Solution	AI Interpretation	Scientific Citations	HealthKit Integration	Personalization	Price
eHealthAI	RAG + LLM	52,400 sources	Full (15+ types)	High	Platform
Apple Health (native)	Basic	No	Full	Low	Free
Whoop	Strain/Recovery	No	Partial	Medium	$30/mo
Oura Ring	Readiness score	No	Partial	Medium	$6/mo
ChatGPT + manual data	LLM (no RAG)	Inaccurate	No	Ad hoc	$20/mo

6. Limitations

Regulatory status. eHealthAI is not a certified medical device under MDR (EU 2017/745) and does not provide diagnostic conclusions. All recommendations are informational and include advice to consult a physician.

Literature bias. PubMed and Cochrane have an inherent bias toward Anglo-Saxon populations. Recommendations may not fully reflect the specifics of Central European populations.

Cohort size. 127 participants are sufficient for a pilot study but not for definitive clinical conclusions. A randomized controlled trial would require n > 500.

7. Recommendations for Future Research

Table 6. Proposed improvements with predicted impact.

Strategy	Implementation	Predicted Impact	Complexity
Fine-tuned medical LLM	LoRA adaptation on clinical Q&A datasets	Accuracy +8–12 pp	Medium
FHIR integration	HL7 FHIR R4 for clinical data import	Clinical interoperability	Medium
Multi-language RAG	SK/CZ/DE PubMed articles + local guidelines	Regional relevance	Low
Physician dashboard	Web portal for physicians with patient view	Clinical adoption	High
FDA/CE compliance pathway	Clinical validation for MDR Class IIa	Regulatory certification	High

Emerging medical LLMs such as Med-PaLM 2 (Google, 86.5 % on MedQA), BioMistral (open-source, 7B), Clinical Camel (fine-tuned LLaMA on USMLE), and GPT-4 Medical (OpenAI) represent the next generation of models and have the potential to raise medical response accuracy above 90 %.

8. Conclusions

1. The RAG pipeline reduces the hallucination rate from 23.1 % to 4.7 % (−18.4 pp), which is critical for medical applications where inaccurate information can lead to health harm.

2. Accuracy of 85.4 % assessed by a certified physician confirms the clinical relevance of AI interpretation of wearable data, particularly in the domains of cardiology (89 %) and fitness (91 %).

3. HRV was the most informative wearable biomarker in our cohort, correlating with self-reported sleep quality (r = 0.68), stress load (r = −0.54), and physical performance (r = 0.47), consistent with findings by Shaffer & Ginsberg (2017).

4. A daily usage rate of 78 % (vs. 30–40 % for typical health apps) and an NPS of +62 indicate strong product-market fit for AI-interpreted health data.

5. The platform is an informational tool, not a diagnostic device — the next step requires clinical validation and a regulatory pathway for MDR certification.

References

Perez, M.V., et al. (2019). Large-scale assessment of a smartwatch to identify atrial fibrillation. New England Journal of Medicine, 381(20), 1909–1917.
Topol, E.J. (2019). High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine, 25(1), 44–56.
Lewis, P., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Proc. NeurIPS 2020.
Shaffer, F. & Ginsberg, J.P. (2017). An overview of heart rate variability metrics and norms. Frontiers in Public Health, 5, 258.
WHO (2019). WHO guideline: recommendations on digital interventions for health system strengthening. Geneva: WHO.
Singhal, K., et al. (2023). Large language models encode clinical knowledge. Nature, 620, 172–180 (Med-PaLM).
Rajpurkar, P., et al. (2022). AI in health and medicine. Nature Medicine, 28(1), 31–38.
Labrique, A.B., et al. (2018). WHO digital health guidelines: a milestone for global health. NPJ Digital Medicine, 1, 11.
Thirunavukarasu, A.J., et al. (2023). Large language models in medicine. Nature Medicine, 29, 1930–1940.
Gao, Y., et al. (2024). Retrieval-augmented generation for large language models: a survey. arXiv preprint arXiv:2312.10997.
Apple Inc. (2024). HealthKit Framework Documentation. developer.apple.com/healthkit.
Consortium for Service Innovation (2023). KCS v6 Practices Guide. serviceinnovation.org.

Availability: The eHealthAI platform is available at healthai.bemooore.com.

Project reference: bittechnology.bemooore.com/referencie/ehealth-ai

Author: Ing. Stanislav Pittner, Managing Director, BIT Technology s.r.o.