Abstract

Introduction
Although evidence shows that agreement between human scorers and some automatic staging systems is generally comparable to agreement among human scorers, automated scoring is rarely used in clinical practice, even though it offers time savings and consistency. We propose a paradigm for testing digital systems that reveals their true accuracy relative to highly experienced academic scorers. As an example of a digital method to be tested, we used Michele Sleep Scoring (hereafter, Digital).

Methods
Seventy PSGs were scored by 6 experienced technologists from 3 academic centers. Manual staging results were compared with Digital's staging results epoch by epoch. For each PSG we carried out 6 cycles of comparisons. Each cycle consisted of two steps: one comparing a single scorer (the tested scorer) with the five remaining scorers (the judges), and one comparing Digital, as the tested scorer, with the same 5 judges. A type 1 error was assigned when all judges disagreed with the tested scorer but the judges did not agree unanimously among themselves. A type 2 error was assigned when all judges disagreed with the tested scorer and agreed unanimously on the stage. For each PSG, the number of epochs with type 1 and type 2 errors was counted for each scorer (n=6 scorers) and for Digital. Results of all 70 PSGs were pooled, and the percentages of type 1 and type 2 errors are reported for all scorers and for Digital.

Results
Seventy PSGs (females aged 51.1 ± 4.2 years) were evaluated. Average times in the different sleep stages (manual scoring) were 43±18, 244±47, 30±21, and 81±25 minutes for stages N1, N2, N3, and REM, respectively. TST was 398±52 minutes, and sleep efficiency was 84±8%. A total of 65,053 epochs were scored by each scorer and by Digital. The average percentage of type 1 errors across all epochs was 6.4% (0-33.2) for scorers vs. 7.8% (1.68-26.6) for Digital. The average percentage of type 2 errors across all epochs was 3.9% (0-28.6) for scorers vs. 4.3% (0-17.3) for Digital.

Conclusion
This study provides an objective way of testing the accuracy of automated scoring systems and supports evidence that the accuracy of Michele Sleep Scoring is comparable to that of manual scoring.

Support (If Any)
None
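
The error definitions given in Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`count_errors`, `run_cycles`), the stage labels, and the data layout (one list of stage labels per scorer, aligned by epoch) are assumptions made only for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def count_errors(tested: Sequence[str],
                 judges: Sequence[Sequence[str]]) -> Tuple[int, int]:
    """Count (type 1, type 2) error epochs for one tested scorer.

    An epoch is an error only when every judge disagrees with the
    tested scorer. It is type 2 if the judges are unanimous, and
    type 1 if the judges also disagree among themselves.
    """
    type1 = type2 = 0
    for epoch, stage in enumerate(tested):
        judge_stages = [judge[epoch] for judge in judges]
        if stage in judge_stages:
            continue  # at least one judge agrees: not an error
        if len(set(judge_stages)) == 1:
            type2 += 1  # judges unanimous against the tested scorer
        else:
            type1 += 1  # judges split among themselves
    return type1, type2


def run_cycles(scorers: Dict[str, List[str]],
               digital: List[str]) -> Dict[str, Dict[str, Tuple[int, int]]]:
    """Run one cycle per human scorer for a single PSG (hypothetical layout).

    In each cycle the named scorer is tested against the remaining
    scorers (the judges), then Digital is tested against the same judges.
    """
    results = {}
    for name, tested in scorers.items():
        judges = [s for other, s in scorers.items() if other != name]
        results[name] = {
            "scorer": count_errors(tested, judges),
            "digital": count_errors(digital, judges),
        }
    return results


# Toy example with two judges and four epochs (made-up labels).
tested = ["N2", "N2", "REM", "W"]
judges = [["N2", "N1", "N2", "N1"],
          ["N1", "N1", "N2", "N2"]]
print(count_errors(tested, judges))  # (1, 2)
```

In the toy example, epoch 1 is not an error because one judge agrees, epochs 2 and 3 are type 2 errors because the judges are unanimous, and epoch 4 is a type 1 error because the judges are split. Pooling across PSGs would then sum these counts and divide by the total number of epochs.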