The case of the missing eights. An object lesson in data quality assurance

Stellman, Steven D.

Data analysis is an integral part of the training of epidemiologists, but the computer-based data management and quality control (QC) procedures by which raw data are prepared for analysis are often overlooked. Cancer Prevention Study 2 (CPS-2) is a cohort study of 1.2 million American men and women begun by the American Cancer Society in 1982. During data preparation for a study of diet and cancer, it was found that the distribution of the number of missing items out of 28 possible foods was monotonic, as expected, except that no individuals were missing exactly 8 or 18 items. These anomalous “holes” in the distribution were traced to a programming error within a section of QC code that confused a zero with the letter O. One lesson learned is that simple frequency tabulations to identify missing, out of range, or miscoded individual data items, as well as more complex assessment of permissible combinations of multiple items, should be supplemented by content-sensitive tests.
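The kind of check that exposes such holes can be sketched in a few lines of Python. This is an illustrative sketch only, not the CPS-2 QC code: the record layout (one list of 28 values per person, with `None` marking a missing item), the function names, and the synthetic data are all assumptions made for the example.

```python
from collections import Counter

N_FOODS = 28  # number of food items, as stated in the abstract


def missing_count_distribution(records, n_items=N_FOODS):
    """Return a list whose index k holds the number of records missing exactly k items."""
    tally = Counter(sum(1 for value in rec if value is None) for rec in records)
    return [tally.get(k, 0) for k in range(n_items + 1)]


def find_holes(distribution):
    """Return every k with zero records that lies between the first and last nonzero k."""
    nonzero = [k for k, n in enumerate(distribution) if n > 0]
    if not nonzero:
        return []
    first, last = nonzero[0], nonzero[-1]
    return [k for k in range(first, last + 1) if distribution[k] == 0]


def make_record(n_missing, n_items=N_FOODS):
    """Build a hypothetical record with n_missing items set to None."""
    return [None] * n_missing + [1] * (n_items - n_missing)


# Synthetic data (hypothetical): counts fall steadily as k grows,
# but nobody is missing exactly 8 or 18 items.
records = []
for k in range(0, 21):
    if k in (8, 18):
        continue
    records.extend(make_record(k) for _ in range(21 - k))

distribution = missing_count_distribution(records)
holes = find_holes(distribution)
print(holes)  # [8, 18]
```

A tabulation of each food item on its own would not reveal this pattern, because the gap only appears in a quantity derived across all 28 items.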


American Journal of Epidemiology

September 4, 2019