eHealthAI — Technical Report TR-2025-003
BIT Technology Research Series • Digital Health

eHealthAI: AI Platform for Aggregation and Interpretation of Health Data from Apple Watch/iPhone with Evidence-Based Recommendations Powered by a RAG Pipeline

Ing. Stanislav Pittner
BIT Technology s.r.o., Trstínska cesta 9, 917 01 Trnava, Slovakia
Published: October 15, 2025 • Testing period: June – September 2025 • Version 1.0
DOI: 10.5281/bittechnology.2025.tr003 (preprint)

Abstract

We present the eHealthAI platform, which aggregates biometric data from Apple HealthKit (HRV, SpO2, heart rate, sleep stages, VO2max, and others) and provides contextual health interpretations through a Retrieval-Augmented Generation (RAG) pipeline indexing 50 000+ medical publications from PubMed, Cochrane Library, and WHO Guidelines. On a cohort of 127 users over a 3-month evaluation period, the platform achieves response accuracy of 85.4 % (assessed by a certified physician), while the RAG pipeline reduces LLM hallucination rate from 23.1 % to 4.7 % compared to the non-RAG baseline. Anomaly detection in biometric patterns operates with latency < 5 minutes. The system aggregates 15+ types of biometric data and generates personalized health reports with direct citations from primary scientific sources. We identify key limitations: the platform is not a certified medical device and does not provide diagnostic conclusions.

Keywords: digital health, Apple HealthKit, RAG, LLM, PubMed, wearable, HRV, biometrics, personalized health, evidence-based medicine

1. Introduction

Consumer wearable devices, particularly Apple Watch and iPhone, generate an unprecedented volume of biometric data. Apple Watch Series 9 records heart rate, heart rate variability (HRV), blood oxygen saturation (SpO2), ECG, wrist temperature, sleep stages, and physical activity, some continuously and others periodically or on demand. These data are stored in Apple HealthKit; however, the native Apple Health app provides only basic summary statistics without expert medical interpretation.

Perez et al. (NEJM 2019) demonstrated the clinical relevance of wearable data in detecting atrial fibrillation with a positive predictive value of 84 %. Topol (Nature Medicine 2019) identified AI-driven interpretation of biometric data as a key enabler for preventive medicine. Nevertheless, a significant gap exists between collected data and their clinically relevant interpretation — users see numbers but do not understand their medical context.

Retrieval-Augmented Generation (RAG, Lewis et al. NeurIPS 2020) offers a solution to this problem: by combining an LLM with a retrieval system over medical literature, it is possible to generate contextual responses with direct citations from scientific sources, thereby significantly reducing the hallucination rate typical of standalone LLMs.

This study addresses three research questions: (1) What accuracy does the RAG pipeline over PubMed/Cochrane achieve in interpreting biometric data? (2) How much does RAG reduce hallucinations compared to a non-RAG LLM? (3) How satisfied are users, and how clinically relevant are the generated recommendations?

2. System Architecture

2.1 Platform Overview

[Architecture diagram] eHealthAI: RAG Health Intelligence Pipeline. Stages: Apple Watch (HealthKit API) → iCloud Sync (E2E encryption) → ETL Pipeline (normalization) → RAG Engine (vector DB + retrieval) → LLM Synthesis (citation generation) → Health Report (PDF + dashboard). FastAPI backend • PubMed + Cochrane + WHO indexed • 50,000+ publications

2.2 RAG Pipeline

The RAG engine indexes medical literature in three phases: (i) Ingestion — downloading and parsing abstracts from PubMed API (36M+ articles), Cochrane systematic reviews (8 000+), and WHO guidelines; (ii) Embedding — generating vector representations using all-MiniLM-L6-v2 (384-dim) with a chunking strategy of 512 tokens and 50-token overlap; (iii) Retrieval + Reranking — semantic search via ChromaDB with cross-encoder reranking (ms-marco-MiniLM-L-6-v2) for top-k=10 documents.
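The chunking step in phase (ii) can be illustrated with a minimal sketch. It assumes a sliding window of 512 tokens that advances by 462 tokens, so that consecutive chunks share 50 tokens. The function name `chunk_tokens` and the placeholder token list are hypothetical; the platform's actual tokenizer and chunking code are not described in this report.

```python
from typing import List


def chunk_tokens(tokens: List[str], size: int = 512, overlap: int = 50) -> List[List[str]]:
    """Split a token sequence into windows of `size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # each new chunk starts `step` tokens after the previous one
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the sequence
    return chunks


# Placeholder tokens stand in for the output of the embedding model's tokenizer (assumption).
tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk, at the cost of some duplicated text in the index.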

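$$\mathrm{Score}_{\mathrm{final}} = \alpha \cdot \mathrm{sim}_{\mathrm{cosine}}(q, d) + (1 - \alpha) \cdot \mathrm{score}_{\mathrm{rerank}}(q, d), \qquad \alpha = 0.4 \tag{1}$$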
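Equation (1) is a weighted sum of the two relevance signals. The sketch below applies it to three hypothetical candidates and keeps the top-k results. It assumes that both the cosine similarity and the reranker score have already been scaled to a comparable range such as [0, 1]; the report does not state how the cross-encoder output is normalized, and the document IDs and scores are illustrative only.

```python
from typing import List, Tuple

ALPHA = 0.4  # weight of the cosine-similarity term in Eq. (1)


def final_score(sim_cosine: float, score_rerank: float, alpha: float = ALPHA) -> float:
    """Combine retrieval similarity and reranker score as in Eq. (1)."""
    return alpha * sim_cosine + (1.0 - alpha) * score_rerank


def rank(candidates: List[Tuple[str, float, float]], k: int = 10) -> List[Tuple[str, float]]:
    """Score (doc_id, sim_cosine, score_rerank) triples and return the k best, highest first."""
    scored = [(doc_id, final_score(sim, rr)) for doc_id, sim, rr in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical candidates: (document id, cosine similarity, reranker score).
candidates = [("doc_a", 0.82, 0.40), ("doc_b", 0.70, 0.90), ("doc_c", 0.50, 0.55)]
ranked = rank(candidates)
print(ranked)
```

With α = 0.4 the reranker carries 60 % of the weight, so `doc_b` overtakes `doc_a` even though `doc_a` has the higher cosine similarity.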

3. Data and Methods

3.1 Biometric Data

Table 1. Types of biometric data collected from Apple HealthKit and their clinical relevance.

| Biometric Parameter | Device | Frequency | Unit | Clinical Relevance |
|---|---|---|---|---|
| Heart rate (HR) | Apple Watch | Continuous | bpm | Cardiovascular health |
| Heart rate variability (HRV) | Apple Watch | During sleep | ms (SDNN) | Autonomic regulation, stress |
| Blood oxygen saturation (SpO2) | Apple Watch | Periodic | % | Respiratory function |
| Sleep stages | Apple Watch | Daily | min (REM/Core/Deep) | Sleep quality, recovery |
| VO2max | Apple Watch | During activity | mL/kg/min | Cardiorespiratory fitness |
| Wrist temperature | Apple Watch | During sleep | °C (deviation) | Cycle, infections |
| Steps + distance | iPhone/Watch | Continuous | steps / km | Physical activity |
| Respiratory rate | Apple Watch | During sleep | breaths/min | Respiratory health |
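As an illustration of the normalization step, the following sketch maps a few HealthKit quantity type identifiers to the units listed in Table 1. The raw record layout (`type`, `value`, `start`), the `Sample` class, and the scale factors are assumptions made for this example, not the platform's actual ETL schema. In particular, the sketch assumes that oxygen saturation arrives as a fraction between 0 and 1.

```python
from dataclasses import dataclass

# HealthKit quantity type identifiers mapped to (unit, scale factor).
# The scale factors are illustrative assumptions about the incoming data.
UNIT_MAP = {
    "HKQuantityTypeIdentifierHeartRate": ("bpm", 1.0),
    "HKQuantityTypeIdentifierHeartRateVariabilitySDNN": ("ms", 1.0),
    "HKQuantityTypeIdentifierOxygenSaturation": ("%", 100.0),  # assumes a 0-1 fraction on input
    "HKQuantityTypeIdentifierVO2Max": ("mL/kg/min", 1.0),
    "HKQuantityTypeIdentifierRespiratoryRate": ("breaths/min", 1.0),
}


@dataclass(frozen=True)
class Sample:
    metric: str
    value: float
    unit: str
    start: str  # ISO 8601 timestamp, passed through unchanged


def normalize(raw: dict) -> Sample:
    """Convert one raw record (hypothetical layout) into a unit-consistent Sample."""
    unit, scale = UNIT_MAP[raw["type"]]
    return Sample(metric=raw["type"], value=raw["value"] * scale, unit=unit, start=raw["start"])


sample = normalize({
    "type": "HKQuantityTypeIdentifierOxygenSaturation",
    "value": 0.97,
    "start": "2025-07-01T02:00:00Z",
})
print(sample)
```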

3.2 Knowledge Base

Table 2. Medical literature sources indexed in the RAG pipeline.

| Source | Type | Number of Documents | Update Frequency | Coverage |
|---|---|---|---|---|
| PubMed | Abstracts + metadata | 36,000,000+ | Daily | Complete biomedicine |
| Cochrane Library | Systematic reviews | 8,500+ | Monthly | Evidence-based medicine |
| WHO Guidelines | Clinical guidelines | 420+ | Quarterly | Global health standards |
| NICE Guidelines (UK) | Clinical guidelines | 310+ | Quarterly | UK/EU clinical practice |
| Indexed in RAG | Embedded chunks | 52,400 | | Relevant to wearable health |

3.3 Evaluation Methodology

The evaluation was conducted on a cohort of 127 volunteers (age 24–67, 58 % male) over 3 months. Each participant used an Apple Watch and the eHealthAI platform daily. Generated health reports were assessed by a certified internist on a 5-point scale (1=incorrect, 5=clinically accurate). A total of 1 840 responses were evaluated.
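The accuracy figures reported in Section 4 are the share of responses rated at least 4 on this 5-point scale. A minimal sketch of that computation follows; the ratings list is hypothetical and serves only to show the calculation.

```python
from typing import Sequence


def accuracy(ratings: Sequence[int], threshold: int = 4) -> float:
    """Percentage of physician ratings at or above `threshold` on the 1-5 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    return 100.0 * sum(r >= threshold for r in ratings) / len(ratings)


def mean_score(ratings: Sequence[int]) -> float:
    """Arithmetic mean of the ratings."""
    return sum(ratings) / len(ratings)


# Hypothetical ratings for ten responses (not study data).
ratings = [5, 4, 3, 4, 2, 5, 4, 4, 1, 5]
print(accuracy(ratings), mean_score(ratings))
```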

4. Results

4.1 Response Accuracy by Medical Domain

Figure 1. AI response accuracy (% of ratings ≥ 4/5 by a certified physician) by medical domain (n = 1,840 responses). Fitness and cardiology achieve the highest accuracy due to a rich evidence base in PubMed.

Table 3. Comparison of RAG vs. non-RAG LLM on identical questions (n = 400 paired comparisons).

| Metric | eHealthAI (RAG) | LLM without RAG | Improvement |
|---|---|---|---|
| Response accuracy (≥ 4/5) | 85.4% | 64.2% | +21.2 pp |
| Hallucination rate | 4.7% | 23.1% | −18.4 pp |
| Citation accuracy | 91.3% | N/A | |
| Factual correctness | 92.8% | 71.4% | +21.4 pp |
| Mean score (1–5) | 4.21 | 3.14 | +1.07 |
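The Improvement column in Table 3 reports absolute differences in percentage points. The short sketch below recomputes those differences from the table values and also derives the relative change, which the table does not list. The helper names are illustrative.

```python
# Values copied from Table 3 (percentages).
rag = {"accuracy": 85.4, "hallucination": 4.7, "factual": 92.8}
baseline = {"accuracy": 64.2, "hallucination": 23.1, "factual": 71.4}


def pp_change(new: float, old: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(new - old, 1)


def relative_change(new: float, old: float) -> float:
    """Change relative to the baseline value, in percent, rounded to one decimal."""
    return round(100.0 * (new - old) / old, 1)


for metric in rag:
    print(metric, pp_change(rag[metric], baseline[metric]),
          relative_change(rag[metric], baseline[metric]))
```

For the hallucination rate, the drop of 18.4 percentage points corresponds to a relative reduction of roughly 80 % against the non-RAG baseline.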

4.2 Anomaly Detection

Figure 2. Mean anomaly detection latency by biometric type (seconds). All types fall within the 5-minute SLA target.

Table 4. User study results (n = 127, 3 months).

| Metric | Value | Benchmark |
|---|---|---|
| Daily usage rate | 78% | Typical health app: 30–40% |
| Net Promoter Score (NPS) | +62 | Health tech average: +35 |
| Reported behavior changes | 67% | |
| Anomaly accuracy (physician) | 89.2% | |
| Average time in app | 4.2 min/day | Apple Health: 1.1 min/day |

4.3 HRV as a Biomarker

Heart rate variability (HRV, measured as SDNN during sleep) proved to be the most informative parameter for overall health assessment. Correlation analysis on our cohort confirmed a significant relationship between HRV and subjectively reported sleep quality (r = 0.68, p < 0.001), stress load (r = −0.54, p < 0.001), and physical performance (r = 0.47, p < 0.01), which is consistent with findings by Shaffer & Ginsberg (2017).
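For readers unfamiliar with the two quantities used here, the sketch below computes SDNN (the standard deviation of normal-to-normal beat intervals) and a Pearson correlation coefficient. All numbers are hypothetical and are not taken from the study cohort. The sample standard deviation is used for SDNN, which is an assumption, since the report does not say which estimator was applied.

```python
import math
from statistics import stdev
from typing import Sequence


def sdnn(nn_intervals_ms: Sequence[float]) -> float:
    """SDNN in ms: sample standard deviation of NN intervals (assumed estimator)."""
    return stdev(nn_intervals_ms)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical NN intervals (ms) from one night.
print(sdnn([800, 810, 790, 820, 780]))

# Hypothetical nightly SDNN values and self-reported sleep quality scores.
hrv = [32.0, 45.0, 51.0, 38.0, 60.0]
sleep_quality = [2.0, 3.0, 4.0, 3.0, 5.0]
print(pearson_r(hrv, sleep_quality))
```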

5. Comparison with Alternatives

Table 5. Comparison of eHealthAI with existing solutions for health data interpretation.

| Solution | AI Interpretation | Scientific Citations | HealthKit Integration | Personalization | Price |
|---|---|---|---|---|---|
| eHealthAI | RAG + LLM | 52,400 sources | Full (15+ types) | High | Platform |
| Apple Health (native) | Basic | No | Full | Low | Free |
| Whoop | Strain/Recovery | No | Partial | Medium | $30/mo |
| Oura Ring | Readiness score | No | Partial | Medium | $6/mo |
| ChatGPT + manual data | LLM (no RAG) | Inaccurate | No | Ad hoc | $20/mo |

6. Limitations

Regulatory status. eHealthAI is not a certified medical device under MDR (EU 2017/745) and does not provide diagnostic conclusions. All recommendations are informational only and include a prompt to consult a physician.

Literature bias. PubMed and Cochrane have an inherent bias toward Anglo-Saxon populations. Recommendations may not fully reflect the specifics of Central European populations.

Cohort size. A cohort of 127 participants is sufficient for a pilot study but not for definitive clinical conclusions. A randomized controlled trial would require n > 500.

7. Recommendations for Future Research

Table 6. Proposed improvements with predicted impact.

| Strategy | Implementation | Predicted Impact | Complexity |
|---|---|---|---|
| Fine-tuned medical LLM | LoRA adaptation on clinical Q&A datasets | Accuracy +8–12 pp | Medium |
| FHIR integration | HL7 FHIR R4 for clinical data import | Clinical interoperability | Medium |
| Multi-language RAG | SK/CZ/DE PubMed articles + local guidelines | Regional relevance | Low |
| Physician dashboard | Web portal for physicians with patient view | Clinical adoption | High |
| FDA/CE compliance pathway | Clinical validation for MDR Class IIa | Regulatory certification | High |

Emerging medical LLMs: Med-PaLM 2 (Google, 86.5 % on MedQA), BioMistral (open-source, 7B), Clinical Camel (fine-tuned LLaMA on USMLE), and GPT-4 Medical (OpenAI) represent the next generation of models with the potential to raise medical response accuracy above 90 %.

8. Conclusions

1. The RAG pipeline reduces the hallucination rate from 23.1 % to 4.7 % (−18.4 pp), which is critical for medical applications, where inaccurate information can harm users' health.

2. Accuracy of 85.4 % assessed by a certified physician confirms the clinical relevance of AI interpretation of wearable data, particularly in the domains of cardiology (89 %) and fitness (91 %).

3. HRV was the most informative wearable biomarker in this cohort, correlating significantly with sleep quality, stress load, and physical performance, which is consistent with Shaffer & Ginsberg (2017).

4. User engagement of 78 % daily (vs. 30–40 % for typical health apps) and NPS +62 indicate strong product-market fit for AI-interpreted health data.

5. The platform is an informational tool, not a diagnostic device — the next step requires clinical validation and a regulatory pathway for MDR certification.

References

  1. Perez, M.V., et al. (2019). Large-scale assessment of a smartwatch to identify atrial fibrillation. New England Journal of Medicine, 381(20), 1909–1917.
  2. Topol, E.J. (2019). High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine, 25(1), 44–56.
  3. Lewis, P., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Proc. NeurIPS 2020.
  4. Shaffer, F. & Ginsberg, J.P. (2017). An overview of heart rate variability metrics and norms. Frontiers in Public Health, 5, 258.
  5. WHO (2019). WHO guideline: recommendations on digital interventions for health system strengthening. Geneva: WHO.
  6. Singhal, K., et al. (2023). Large language models encode clinical knowledge. Nature, 620, 172–180 (Med-PaLM).
  7. Rajpurkar, P., et al. (2022). AI in health and medicine. Nature Medicine, 28(1), 31–38.
  8. Labrique, A.B., et al. (2018). WHO digital health guidelines: a milestone for global health. NPJ Digital Medicine, 1, 11.
  9. Thirunavukarasu, A.J., et al. (2023). Large language models in medicine. Nature Medicine, 29, 1930–1940.
  10. Gao, Y., et al. (2024). Retrieval-augmented generation for large language models: a survey. arXiv preprint arXiv:2312.10997.
  11. Apple Inc. (2024). HealthKit Framework Documentation. developer.apple.com/healthkit.
  12. Consortium for Service Innovation (2023). KCS v6 Practices Guide. serviceinnovation.org.

Availability: The eHealthAI platform is available at healthai.bemooore.com.

Project reference: bittechnology.bemooore.com/referencie/ehealth-ai

Author: Ing. Stanislav Pittner, Managing Director, BIT Technology s.r.o.

© 2025 Ing. Stanislav Pittner — BIT Technology s.r.o.