OBJECTIVE
The diagnosis of Gaucher disease (GD) presents a major challenge due to the high variability and low specificity of its clinical characteristics, along with limited physician awareness of the disease's early symptoms. Early and accurate diagnosis is important to enable effective treatment decisions, prevent unnecessary testing, and facilitate genetic counseling. This study aimed to develop a machine learning (ML) model for GD screening and early diagnosis based on real-world clinical data from the Maccabi Healthcare Services electronic database, which contains 20 years of longitudinal data on approximately 2.6 million patients.

STUDY DESIGN AND SETTING
We screened the Maccabi Healthcare Services database for patients with GD between January 1998 and May 2022. Eligible controls were matched by year of birth, sex, and socioeconomic status in a 1:13 ratio. The data were partitioned into training (75%) and test (25%) sets, and models were trained to predict GD using features obtained from medical and laboratory records. Model performance was evaluated using the area under the receiver operating characteristic curve and the area under the precision-recall curve.

RESULTS
We identified 264 patients with confirmed GD, to whom we matched 3,429 controls. The best-performing model, which included known GD signs and symptoms, previously unknown clinical features, and administrative codes, achieved on the test set an area under the receiver operating characteristic curve of 0.95 ± 0.03 and an area under the precision-recall curve of 0.80 ± 0.08. It identified GD a median of 2.78 years earlier than the clinical diagnosis (25th-75th percentile: 1.29-4.53 years).

CONCLUSION
Using an ML approach on real-world data led to excellent discrimination between GD patients and controls, with the ability to detect GD significantly earlier than the time of actual diagnosis. Hence, this approach might be useful as a screening tool for GD and lead to earlier diagnosis and treatment.
Furthermore, advanced ML analytics may highlight previously unrecognized features associated with GD, including clinical diagnoses and health-seeking behaviors.

PLAIN LANGUAGE SUMMARY
Diagnosing Gaucher disease is difficult, which often leads to late or incorrect diagnoses. As a result, patients may undergo unnecessary tests and treatments and experience health deterioration despite the availability of medications for Gaucher disease. In this study, we used electronic health data to develop machine learning models for early diagnosis of Gaucher disease type 1. Our models, which included known Gaucher disease signs and symptoms, previously unknown clinical features, and administrative codes, significantly outperformed other models and expert opinions, detecting type 1 Gaucher disease on average 3 years before the actual diagnosis. Our models also revealed new features linked to type 1 Gaucher disease, including specific diagnoses and patterns in patients' healthcare-seeking behaviors. We believe that machine learning can be a valuable tool for patients with rare diseases.
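
ILLUSTRATIVE SKETCH OF THE EVALUATION PROTOCOL
The 75%/25% partition and the two performance metrics named under Study Design and Setting can be illustrated with a minimal, standard-library Python sketch. It is not the study's code: the stratified split, the function names, and the synthetic scores are assumptions made for illustration only, and no model is trained. Only the cohort sizes (264 cases, 3,429 controls) and the split fraction are taken from the abstract.

```python
import random


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split indices into train/test, keeping the case:control ratio in both.

    Stratification is an assumption; the abstract only states a 75%/25% split.
    """
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def auroc(y_true, scores):
    """Probability that a random case outscores a random control (ties = 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve.

    Assumes at least one case and no tied scores.
    """
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at each recalled case
    return total / hits


# Cohort sizes from the abstract; scores are synthetic stand-ins for model output.
labels = [1] * 264 + [0] * 3429
rng = random.Random(42)
scores = [rng.gauss(1.5 if y == 1 else 0.0, 1.0) for y in labels]

train_idx, test_idx = stratified_split(labels, test_fraction=0.25)
y_test = [labels[i] for i in test_idx]
s_test = [scores[i] for i in test_idx]
print("test-set size:", len(test_idx))
print("AUROC:", auroc(y_test, s_test))
print("AUPRC (average precision):", average_precision(y_test, s_test))
```

With roughly 13 controls per case, a classifier with no discriminative ability would have an expected area under the precision-recall curve close to the case prevalence (264 / 3,693, about 0.07), which is why that metric is informative alongside the area under the receiver operating characteristic curve in an imbalanced cohort.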