We present the eHealthAI platform, which aggregates 15+ types of biometric data from Apple HealthKit (HRV, SpO2, heart rate, sleep stages, VO2max, and others) and provides contextual health interpretations through a Retrieval-Augmented Generation (RAG) pipeline indexing 50 000+ medical publications from PubMed, Cochrane Library, and WHO Guidelines. On a cohort of 127 users over a 3-month evaluation period, the platform achieves a response accuracy of 85.4 % (assessed by a certified physician), and the RAG pipeline reduces the LLM hallucination rate from 23.1 % to 4.7 % compared to the non-RAG baseline. Anomaly detection in biometric patterns operates with a latency under 5 minutes. The system generates personalized health reports with direct citations from primary scientific sources. We identify key limitations: the platform is not a certified medical device and does not provide diagnostic conclusions.
Consumer wearable devices, particularly Apple Watch and iPhone, generate an unprecedented volume of biometric data. Apple Watch Series 9 records heart rate, heart rate variability (HRV), blood oxygen saturation (SpO2), ECG, wrist temperature, sleep stages, and physical activity; some of these are sampled continuously, others periodically or only during sleep (Table 1). These data are stored in Apple HealthKit; however, the native Apple Health app provides only basic summary statistics without expert medical interpretation.
Perez et al. (NEJM 2019) demonstrated the clinical relevance of wearable data in detecting atrial fibrillation with a positive predictive value of 84 %. Topol (Nature Medicine 2019) identified AI-driven interpretation of biometric data as a key enabler for preventive medicine. Nevertheless, a significant gap exists between collected data and their clinically relevant interpretation — users see numbers but do not understand their medical context.
Retrieval-Augmented Generation (RAG, Lewis et al. NeurIPS 2020) offers a solution to this problem: by combining an LLM with a retrieval system over medical literature, it is possible to generate contextual responses with direct citations from scientific sources, thereby significantly reducing the hallucination rate typical of standalone LLMs.
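To make the citation mechanism concrete, the following is a minimal sketch of how retrieved passages might be numbered and placed into an LLM prompt so that the answer can cite them. The `Passage` structure, the `build_prompt` function, the prompt wording, and the placeholder identifiers are all assumptions made for illustration; the paper does not describe the platform's actual prompt format.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    """A retrieved literature snippet (hypothetical structure, not the platform's schema)."""
    source: str   # e.g. "PubMed" or "Cochrane"
    ref_id: str   # document identifier; the values below are placeholders
    text: str


def build_prompt(question: str, passages: list[Passage]) -> str:
    """Number each passage so the model can cite it as [1], [2], ..."""
    context = "\n".join(
        f"[{i}] ({p.source} {p.ref_id}) {p.text}"
        for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer using only the numbered sources below and cite them as [n].\n"
        "If the sources do not cover the question, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )


# Placeholder passages; real ones would come from the retrieval step.
passages = [
    Passage("PubMed", "PMID-0000001", "Placeholder abstract text about HRV."),
    Passage("Cochrane", "CD-000001", "Placeholder review text about sleep."),
]
prompt = build_prompt("What does a low overnight HRV indicate?", passages)
print(prompt)
```

Restricting the model to numbered sources is one common way to make each claim traceable to a document, which is the property the hallucination comparison relies on.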
This study addresses three research questions: (1) What accuracy does the RAG pipeline over PubMed/Cochrane achieve in interpreting biometric data? (2) What is the impact of RAG on hallucination reduction compared to a non-RAG LLM? (3) What is the user satisfaction and clinical relevance of generated recommendations?
The RAG engine builds its literature index in three phases: (i) Ingestion: downloading and parsing abstracts from the PubMed API (36M+ articles), Cochrane systematic reviews (8 500+), and WHO and NICE guidelines; (ii) Embedding: generating vector representations with all-MiniLM-L6-v2 (384-dim), using 512-token chunks with a 50-token overlap; (iii) Retrieval + Reranking: semantic search via ChromaDB with cross-encoder reranking (ms-marco-MiniLM-L-6-v2) for the top-k = 10 documents. Of these source corpora, only the 52,400 chunks relevant to wearable health are embedded (Table 2).
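The chunking and two-stage retrieval described above can be sketched in plain Python. This is an illustration under stated assumptions, not the platform's code: integer lists stand in for tokenizer output, 2-dimensional vectors stand in for the 384-dimensional embeddings, the `rerank_score` callable stands in for the cross-encoder, and no ChromaDB calls are made. The ordering (vector search first, reranking of the top-k candidates second) is also an assumption about how the two stages are combined.

```python
import math


def chunk_tokens(tokens, size=512, overlap=50):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(query_vec, index, rerank_score, k=10):
    """Stage 1: keep the k nearest chunks by cosine similarity.
    Stage 2: reorder those candidates by a (hypothetical) reranker score."""
    candidates = sorted(
        index, key=lambda item: cosine(query_vec, item[1]), reverse=True
    )[:k]
    return sorted(candidates, key=lambda item: rerank_score(item[0]), reverse=True)


# Chunking demo: 1000 stand-in tokens.
chunks = chunk_tokens(list(range(1000)))
print([len(c) for c in chunks])

# Retrieval demo with toy vectors and a toy reranker.
index = [("doc_a", [1.0, 0.0]), ("doc_b", [0.8, 0.6]), ("doc_c", [0.0, 1.0])]
toy_scores = {"doc_a": 0.2, "doc_b": 0.9}
ranked = retrieve_then_rerank(
    [1.0, 0.0], index, lambda doc_id: toy_scores.get(doc_id, 0.0), k=2
)
print([doc_id for doc_id, _ in ranked])
```

One point worth checking in a real deployment: the published model card for all-MiniLM-L6-v2 states that input longer than 256 word pieces is truncated by default, so 512-token chunks may be only partly represented in the embedding unless that limit is changed.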
Table 1. Types of biometric data collected from Apple HealthKit and their clinical relevance.
| Biometric Parameter | Device | Frequency | Unit | Clinical Relevance |
|---|---|---|---|---|
| Heart rate (HR) | Apple Watch | Continuous | bpm | Cardiovascular health |
| Heart rate variability (HRV) | Apple Watch | During sleep | ms (SDNN) | Autonomic regulation, stress |
| Blood oxygen saturation (SpO2) | Apple Watch | Periodic | % | Respiratory function |
| Sleep stages | Apple Watch | Daily | min (REM/Core/Deep) | Sleep quality, recovery |
| VO2max | Apple Watch | During activity | mL/kg/min | Cardiorespiratory fitness |
| Wrist temperature | Apple Watch | During sleep | °C (deviation) | Cycle, infections |
| Steps + distance | iPhone/Watch | Continuous | steps / km | Physical activity |
| Respiratory rate | Apple Watch | During sleep | breaths/min | Respiratory health |
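As an illustration of how the parameters in Table 1 could be represented on the backend, the sketch below pairs each one with the raw HealthKit type identifier string that Apple documents for it. The `BiometricType` class, its field names, and the choice to split "Steps + distance" into two entries are assumptions for this example; the paper does not describe the platform's own schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BiometricType:
    """One row of Table 1 (hypothetical backend representation)."""
    name: str
    healthkit_id: str  # raw HealthKit type identifier string
    unit: str
    frequency: str


BIOMETRIC_TYPES = [
    BiometricType("Heart rate", "HKQuantityTypeIdentifierHeartRate", "bpm", "continuous"),
    BiometricType("HRV (SDNN)", "HKQuantityTypeIdentifierHeartRateVariabilitySDNN", "ms", "during sleep"),
    BiometricType("SpO2", "HKQuantityTypeIdentifierOxygenSaturation", "%", "periodic"),
    # Sleep stages are a category type in HealthKit, not a quantity type.
    BiometricType("Sleep stages", "HKCategoryTypeIdentifierSleepAnalysis", "min", "daily"),
    BiometricType("VO2max", "HKQuantityTypeIdentifierVO2Max", "mL/kg/min", "during activity"),
    BiometricType("Wrist temperature", "HKQuantityTypeIdentifierAppleSleepingWristTemperature", "°C", "during sleep"),
    BiometricType("Steps", "HKQuantityTypeIdentifierStepCount", "steps", "continuous"),
    BiometricType("Distance", "HKQuantityTypeIdentifierDistanceWalkingRunning", "km", "continuous"),
    BiometricType("Respiratory rate", "HKQuantityTypeIdentifierRespiratoryRate", "breaths/min", "during sleep"),
]

# Lookup by display name, e.g. to attach the right unit to an incoming sample.
BY_NAME = {t.name: t for t in BIOMETRIC_TYPES}
print(BY_NAME["VO2max"].unit)
```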
Table 2. Medical literature sources indexed in the RAG pipeline.
| Source | Type | Number of Documents | Update Frequency | Coverage |
|---|---|---|---|---|
| PubMed | Abstracts + metadata | 36,000,000+ | Daily | Complete biomedicine |
| Cochrane Library | Systematic reviews | 8,500+ | Monthly | Evidence-based medicine |
| WHO Guidelines | Clinical guidelines | 420+ | Quarterly | Global health standards |
| NICE Guidelines (UK) | Clinical guidelines | 310+ | Quarterly | UK/EU clinical practice |
| Indexed in RAG | Embedded chunks | 52,400 | — | Relevant to wearable health |
The evaluation was conducted on a cohort of 127 volunteers (age 24–67, 58 % male) over 3 months. Each participant used an Apple Watch and the eHealthAI platform daily. Generated health reports were assessed by a certified internist on a 5-point scale (1=incorrect, 5=clinically accurate). A total of 1 840 responses were evaluated.
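The physician ratings can be reduced to the two summary figures reported in Table 3: the share of responses scored at least 4 out of 5 (labelled "Response accuracy (≥ 4/5)" there) and the mean score. The sketch below shows that computation; the function name and the ten example scores are made up for illustration and are not study data.

```python
def summarize_scores(scores, threshold=4):
    """Return count, share of scores >= threshold, and mean for 1-5 ratings."""
    if not scores:
        raise ValueError("no scores to summarize")
    if any(s not in (1, 2, 3, 4, 5) for s in scores):
        raise ValueError("scores must be integers from 1 to 5")
    accurate = sum(1 for s in scores if s >= threshold)
    return {
        "n": len(scores),
        "accuracy": accurate / len(scores),
        "mean": sum(scores) / len(scores),
    }


# Illustrative ratings only, not taken from the evaluation.
example_scores = [5, 4, 4, 3, 5, 2, 4, 5, 4, 1]
summary = summarize_scores(example_scores)
print(f"accuracy = {summary['accuracy']:.1%}, mean = {summary['mean']:.2f}")
```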
Table 3. Comparison of RAG vs. non-RAG LLM on identical questions (n = 400 paired comparisons).
| Metric | eHealthAI (RAG) | LLM without RAG | Improvement |
|---|---|---|---|
| Response accuracy (≥ 4/5) | 85.4% | 64.2% | +21.2 pp |
| Hallucination rate | 4.7% | 23.1% | −18.4 pp |
| Citation accuracy | 91.3% | N/A | — |
| Factual correctness | 92.8% | 71.4% | +21.4 pp |
| Mean score (1–5) | 4.21 | 3.14 | +1.07 |
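The "Improvement" column in Table 3 reports absolute differences in percentage points (pp). A relative change against the baseline is a useful complement, because a drop of 18.4 pp from 23.1 % means something different than the same drop from a higher starting point. The short sketch below recomputes both from the table values; the `compare` helper is a name introduced here for illustration.

```python
def compare(rag: float, baseline: float) -> tuple[float, float]:
    """Return (absolute change in pp, relative change in % of the baseline)."""
    diff_pp = rag - baseline
    relative_pct = diff_pp / baseline * 100
    return round(diff_pp, 1), round(relative_pct, 1)


# (RAG, non-RAG) percentages as reported in Table 3.
TABLE_3 = {
    "Response accuracy": (85.4, 64.2),
    "Hallucination rate": (4.7, 23.1),
    "Factual correctness": (92.8, 71.4),
}

results = {metric: compare(rag, base) for metric, (rag, base) in TABLE_3.items()}
for metric, (diff_pp, rel) in results.items():
    print(f"{metric}: {diff_pp:+.1f} pp ({rel:+.1f} % relative to baseline)")
```

Read this way, the hallucination rate falls by roughly 80 % relative to the non-RAG baseline, while response accuracy rises by about a third.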
Table 4. User study results (n = 127, 3 months).
| Metric | Value | Benchmark |
|---|---|---|
| Daily usage rate | 78% | Typical health app: 30–40% |
| Net Promoter Score (NPS) | +62 | Health tech average: +35 |
| Reported behavior changes | 67% | — |
| Anomaly detection accuracy (physician-assessed) | 89.2% | — |
| Average time in app | 4.2 min/day | Apple Health: 1.1 min/day |
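For readers unfamiliar with the Net Promoter Score in Table 4, it is conventionally computed from 0-10 "how likely are you to recommend" ratings as the percentage of promoters (9-10) minus the percentage of detractors (0-6). The sketch below applies that standard definition; the example ratings are invented, and the paper does not state how its survey was administered.

```python
def net_promoter_score(ratings):
    """NPS = % promoters (9-10) minus % detractors (0-6), as a whole number."""
    if not ratings:
        raise ValueError("no ratings supplied")
    if any(not 0 <= r <= 10 for r in ratings):
        raise ValueError("ratings must be between 0 and 10")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return round(100 * (promoters - detractors) / len(ratings))


# Illustrative survey answers only, not study data.
example_ratings = [10, 9, 9, 8, 7, 10, 6, 9, 3, 10]
nps = net_promoter_score(example_ratings)
print(f"NPS = {nps:+d}")
```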
Heart rate variability (HRV, measured as SDNN during sleep) proved to be the most informative parameter for overall health assessment. Correlation analysis on our cohort confirmed a significant relationship between HRV and subjectively reported sleep quality (r = 0.68, p < 0.001), stress load (r = −0.54, p < 0.001), and physical performance (r = 0.47, p < 0.01), which is consistent with findings by Shaffer & Ginsberg (2017).
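The reported r values are Pearson correlation coefficients. As a reminder of what is being computed, the sketch below implements the coefficient directly from its definition (covariance divided by the product of the standard deviations). The six paired observations are invented for illustration and do not come from the cohort; the significance tests reported above are not reproduced here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Invented example: overnight SDNN (ms) vs. self-rated sleep quality (1-5).
hrv_ms = [32, 41, 45, 52, 58, 63]
sleep_quality = [2, 3, 3, 4, 4, 5]
r = pearson_r(hrv_ms, sleep_quality)
print(f"r = {r:.2f}")
```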
Table 5. Comparison of eHealthAI with existing solutions for health data interpretation.
| Solution | AI Interpretation | Scientific Citations | HealthKit Integration | Personalization | Price |
|---|---|---|---|---|---|
| eHealthAI | RAG + LLM | Yes (52,400 indexed chunks) | Full (15+ types) | High | Platform |
| Apple Health (native) | Basic | No | Full | Low | Free |
| Whoop | Strain/Recovery | No | Partial | Medium | $30/mo |
| Oura Ring | Readiness score | No | Partial | Medium | $6/mo |
| ChatGPT + manual data | LLM (no RAG) | Inaccurate | No | Ad hoc | $20/mo |
Regulatory status. eHealthAI is not a certified medical device under MDR (EU 2017/745) and does not provide diagnostic conclusions. All recommendations are informational and advise the user to consult a physician.
Literature bias. PubMed and Cochrane have an inherent bias toward Anglo-Saxon populations. Recommendations may not fully reflect the specifics of Central European populations.
Cohort size. 127 participants are sufficient for a pilot study but not for definitive clinical conclusions. A randomized controlled trial would require n > 500.
Table 6. Proposed improvements with predicted impact.
| Strategy | Implementation | Predicted Impact | Complexity |
|---|---|---|---|
| Fine-tuned medical LLM | LoRA adaptation on clinical Q&A datasets | Accuracy +8–12 pp | Medium |
| FHIR integration | HL7 FHIR R4 for clinical data import | Clinical interoperability | Medium |
| Multi-language RAG | SK/CZ/DE PubMed articles + local guidelines | Regional relevance | Low |
| Physician dashboard | Web portal for physicians with patient view | Clinical adoption | High |
| FDA/CE compliance pathway | Clinical validation for MDR Class IIa | Regulatory certification | High |
Emerging medical LLMs such as Med-PaLM 2 (Google, 86.5 % on MedQA), BioMistral (open-source, 7B), Clinical Camel (fine-tuned LLaMA on USMLE), and GPT-4 Medical (OpenAI) represent the next generation of models with the potential to raise medical response accuracy above 90 %.
1. The RAG pipeline reduces the hallucination rate from 23.1 % to 4.7 % (−18.4 pp), which is critical for medical applications where inaccurate information can cause harm.
2. Accuracy of 85.4 % assessed by a certified physician confirms the clinical relevance of AI interpretation of wearable data, particularly in the domains of cardiology (89 %) and fitness (91 %).
3. HRV was the most informative wearable biomarker in our cohort, correlating with self-reported sleep quality (r = 0.68), stress load (r = −0.54), and physical performance (r = 0.47), consistent with findings by Shaffer & Ginsberg (2017).
4. User engagement of 78 % daily (vs. 30–40 % for typical health apps) and NPS +62 indicate strong product-market fit for AI-interpreted health data.
5. The platform is an informational tool, not a diagnostic device — the next step requires clinical validation and a regulatory pathway for MDR certification.
Availability: The eHealthAI platform is available at healthai.bemooore.com.
Project reference: bittechnology.bemooore.com/referencie/ehealth-ai
Author: Ing. Stanislav Pittner, Managing Director, BIT Technology s.r.o.
© 2025 Ing. Stanislav Pittner — BIT Technology s.r.o.