Electronic Health Records Based Prediction of Future Incidence of Alzheimer’s Disease Using Machine Learning
Background: Predicting the future incidence of Alzheimer’s disease (AD) may facilitate intervention strategies that delay disease onset. Existing AD risk prediction models require collection of biospecimens (genetic, CSF, or blood samples), cognitive testing, or brain imaging. In contrast, electronic health records (EHR) provide an opportunity to build a completely automated risk prediction model based on individuals’ history of health and healthcare. We tested machine learning models to predict future incidence of AD using administrative EHR in individuals aged 65 or older.
Methods: We obtained de-identified EHR from Korean elders older than 65 years (N=40,736), collected between 2002 and 2012 in the Korean National Health Insurance Service database system. Consisting of the Participant Insurance Eligibility database, the Healthcare Utilization database, and the Health Screening database, these EHR contain 4,894 unique clinical features, including ICD-9/10 codes, medication codes, laboratory values, history of personal and family illness, and socio-demographics. Our event of interest was new incidence of AD, defined from the EHR using AD diagnostic codes and prescriptions of anti-dementia medication. Two definitions were considered: a more stringent one requiring both a diagnosis and dementia medication (n=614; “definite AD”) and a more liberal one requiring only diagnostic codes (n=2,026; “probable AD”). We trained and validated random forest, support vector machine, and logistic regression models to predict incident AD in the subsequent 1, 2, 3, and 4 years using the EHR available since 2002. The length of the EHR used in the models ranged from 1,571 to 2,239 days. Data were randomly split into training (60%), validation (20%), and test (20%) sets; the reported AUC values are based on the test set and therefore represent true out-of-sample prediction.
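The random 60/20/20 split and the test-set AUC described above can be illustrated with a minimal sketch. This is not the study's code: the function names `split_indices` and `roc_auc`, the seed, and the toy labels and scores are all assumptions made for illustration. The AUC is computed by pairwise comparison, which is simple but quadratic in the number of subjects.

```python
import random

def split_indices(n, seed=0, fractions=(0.6, 0.2, 0.2)):
    """Shuffle range(n) and cut it into train/validation/test index lists.

    Hypothetical helper: the test set receives whatever remains after the
    training and validation sets are cut, so sizes always sum to n.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def roc_auc(labels, scores):
    """AUC as the probability that a random case outscores a random control.

    Ties between a case and a control count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one control")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Split the cohort size reported in the abstract (N=40,736).
train, val, test = split_indices(40736)

# Toy labels and scores (made up) to show the AUC calculation.
toy_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

In this sketch a model would be fit on `train`, tuned on `val`, and scored once on `test`, so that the AUC reflects subjects the model never saw.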
Results: Average duration of EHR was 1,936 days in AD and 2,694 days in controls. For predicting future incidence of AD using the “definite AD” outcome, the machine learning models performed best for 1-year prediction (AUC 0.781), with AUCs of 0.739, 0.686, and 0.662 for 2-, 3-, and 4-year prediction, respectively. Using the “probable AD” outcome, the models again performed best for 1-year prediction (AUC 0.730), with AUCs of 0.645, 0.575, and 0.602 for 2-, 3-, and 4-year prediction, respectively. Important clinical features selected in logistic regression included hemoglobin level (b=-0.902), age (b=0.689), urine protein level (b=0.303), prescription of Lodopin (an antipsychotic drug) (b=0.303), and prescription of Nicametate Citrate (a vasodilator) (b=-0.297).
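To help read the logistic regression coefficients above, the sketch below converts each reported b into an odds ratio, exp(b). It assumes the b values are log-odds coefficients. The abstract does not state whether features were standardized, so each odds ratio applies per one unit of whatever scale the feature was entered on. The dictionary keys are shorthand labels chosen for illustration.

```python
import math

# Coefficients as reported in the Results; the labels are shorthand.
coefficients = {
    "hemoglobin level": -0.902,
    "age": 0.689,
    "urine protein level": 0.303,
    "Lodopin prescription": 0.303,
    "Nicametate Citrate prescription": -0.297,
}

def odds_ratio(b):
    """Convert a log-odds coefficient into an odds ratio."""
    return math.exp(b)

odds_ratios = {name: odds_ratio(b) for name, b in coefficients.items()}

for name, value in odds_ratios.items():
    direction = "higher" if value > 1 else "lower"
    print(f"{name}: odds ratio {value:.2f} ({direction} odds of incident AD)")
```

Under that assumption, the hemoglobin coefficient corresponds to an odds ratio of about 0.41 and the age coefficient to about 1.99, so negative coefficients mark features associated with lower odds of incident AD and positive ones with higher odds.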
Conclusion: This study demonstrates that EHR can detect risk for incident AD. This approach could enable risk-specific stratification of elders for better-targeted clinical trials.