Natural Language Processing Improves Reliable Identification of COVID-19 Compared to Diagnostic Codes Alone.

Publication date: Jul 30, 2025

Observational COVID-19 studies often rely on diagnostic codes, but the accuracy of these codes and their potential for differential misclassification across patient subgroups are unclear. In this proof-of-concept study, we examined age, race, and ethnicity as predictors of differential misclassification by comparing the classification accuracy of diagnostic codes with that of classifiers based on natural language processing (NLP) of clinical notes. We assessed differential misclassification in two primary care-based samples from the American Family Cohort: first, a cohort of 5,000 patients whose COVID-19 status was assessed by physicians based on notes; and second, 21,659 patients (out of 1,560,564) who received COVID-specific antivirals. Using annotated note data, we trained and tested three NLP classifiers (tree-based, recurrent neural network, and transformer-based). Approximately 63% of likely COVID-19 patients in the two samples had a documented ICD-10 code for COVID-19. Sensitivity was highest among younger patients (68.6% for
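To make the subgroup comparison concrete, the following is a minimal sketch of how the sensitivity of a diagnostic code could be computed per subgroup against a reference standard. It is not the study's code: the function name `sensitivity_by_group`, the record layout, and the toy records are all hypothetical and invented purely for illustration, and they do not reproduce any result from the study.

```python
from collections import defaultdict


def sensitivity_by_group(records):
    """Return the per-group sensitivity of a diagnostic code.

    records: iterable of (group, has_covid, has_icd_code) tuples, where
    has_covid is the reference-standard label and has_icd_code says
    whether a COVID-19 diagnostic code was documented.
    (Hypothetical layout, chosen only for this sketch.)
    """
    true_pos = defaultdict(int)   # reference-positive AND coded
    positives = defaultdict(int)  # all reference-positive patients
    for group, has_covid, has_icd_code in records:
        if not has_covid:
            continue  # sensitivity only considers reference-positive patients
        positives[group] += 1
        if has_icd_code:
            true_pos[group] += 1
    # sensitivity = TP / (TP + FN), computed separately for each group
    return {g: true_pos[g] / positives[g] for g in positives}


# Invented toy data, NOT taken from the study.
toy_records = [
    ("younger", True, True),
    ("younger", True, True),
    ("younger", True, False),
    ("younger", False, False),
    ("older", True, True),
    ("older", True, False),
    ("older", True, False),
    ("older", True, False),
]

result = sensitivity_by_group(toy_records)
print(result)
```

A gap in sensitivity between groups, as in this toy data, is what differential misclassification by a diagnostic code would look like. The same function could be applied to an NLP classifier's predictions in place of `has_icd_code` to compare the two approaches.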

Concepts: Physicians; Race; Reliable
Keywords: cohort identification; COVID-19; natural language processing; sample sizes

Semantics

Type: disease; Source: MeSH; Name: COVID-19

