Publication date: Dec 09, 2025
Moderate to severe Long COVID is estimated to affect as many as 10% of SARS-CoV-2 infected individuals, representing a chronic condition with a substantial public health burden. An expansive literature has identified over 200 persistent symptoms associated with a history of SARS-CoV-2 infection, yet there is still no clear consensus on a syndrome definition. Long COVID thus represents a “negative-unlabeled” outcome: those without prior infection must be Long COVID “negative,” but those with prior infection have unknown or “unlabeled” Long COVID status. Despite this lack of a gold standard definition or biomarker, developing and evaluating an approach to characterizing Long COVID is a critical first step for future studies of risk and resiliency factors, mechanisms of disease, and interventions for both treatment and prevention. We recently applied a strategy for defining a numeric Long COVID research index (LCRI) using Lasso-penalized logistic regression, leveraging information on history of SARS-CoV-2 infection as a pseudo-label. In the current manuscript we formalize and evaluate this approach in a simulation framework for the occurrence of infection, Long COVID onset, and symptomatology. We evaluate its performance in selecting symptoms associated with Long COVID and in distinguishing individuals with Long COVID, in the presence of symptom correlations and demographic confounders. We compare the LCRI method to a simpler index defined by counting Long COVID symptoms, and assess both methods in a reanalysis of data on participants enrolled in the Adult Cohort of the Researching COVID to Enhance Recovery (RECOVER) study. Simulation results demonstrate that the Lasso-penalized LCRI methodology appropriately selects symptoms associated with Long COVID, and that the LCRI has high discriminatory power to distinguish Long COVID, outperforming symptom count. This performance was robust to correlation between symptoms, and weighting methods are shown to successfully address potential confounding by demographic characteristics. Analysis of RECOVER data showed the LCRI outperforming symptom count by misclassifying fewer uninfected individuals as having Long COVID. As the LCRI is increasingly used to characterize Long COVID in research settings, this paper represents an important step in understanding its operating characteristics and developing general methodology for settings with negative-unlabeled data.
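The pseudo-label idea described in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' implementation: the sample size, infection rate, Long COVID rate, symptom probabilities, and penalty strength are all hypothetical values chosen for illustration, and the hand-written proximal gradient solver stands in for a standard Lasso logistic regression routine. It fits an L1-penalized logistic regression of infection status on symptoms, forms an index as the coefficient-weighted sum of symptoms (an assumption about how such an index might be built), and compares it with a plain symptom count. Demographic confounding and weighting are not modeled.

```python
import math
import random

random.seed(0)
N, P, SIGNAL = 2000, 8, 3  # hypothetical: 8 symptoms, the first 3 raised by Long COVID

def simulate():
    """Simulate infection, latent Long COVID status, and binary symptoms."""
    rows = []
    for _ in range(N):
        infected = random.random() < 0.7            # hypothetical infection rate
        lc = infected and random.random() < 0.3     # latent; never shown to the model
        x = [int(random.random() < (0.6 if (lc and j < SIGNAL) else 0.1))
             for j in range(P)]
        rows.append((x, int(infected), int(lc)))
    return rows

def fit_lasso_logistic(X, y, lam=0.01, step=1.0, iters=300):
    """Proximal gradient descent on mean logistic loss + lam * ||w||_1.

    The intercept b is not penalized. Features are binary, so each row is
    stored as the list of indices of the symptoms that are present.
    """
    n = len(X)
    active = [[j for j, v in enumerate(row) if v] for row in X]
    w, b = [0.0] * P, 0.0
    for _ in range(iters):
        gw, gb = [0.0] * P, 0.0
        for idx, yi in zip(active, y):
            z = b + sum(w[j] for j in idx)
            r = 1.0 / (1.0 + math.exp(-z)) - yi     # predicted probability minus label
            gb += r
            for j in idx:
                gw[j] += r
        b -= step * gb / n
        for j in range(P):
            u = w[j] - step * gw[j] / n
            # Soft-thresholding: small coefficients are set exactly to zero.
            w[j] = math.copysign(max(abs(u) - step * lam, 0.0), u)
    return w, b

def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

data = simulate()
X = [x for x, _, _ in data]
infected = [i for _, i, _ in data]
long_covid = [l for _, _, l in data]

# Infection history is the pseudo-label; the latent Long COVID status is used only for scoring.
w, b = fit_lasso_logistic(X, infected)
index = [sum(wj * xj for wj, xj in zip(w, x)) for x in X]
count = [sum(x) for x in X]

print("coefficients:", [round(v, 2) for v in w])
print("AUC, weighted index:", round(auc(index, long_covid), 3))
print("AUC, symptom count: ", round(auc(count, long_covid), 3))
```

Because the model only ever sees infection status, any symptom that is equally common in infected and uninfected participants has little to contribute to the fit, and the L1 penalty tends to shrink its coefficient toward zero. That is the mechanism by which a pseudo-label can still select symptoms tied to the unobserved outcome.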
| Concepts | Keywords |
|---|---|
| Biomarker | COVID-19 |
| Covid | Feature selection |
| Disease | Long COVID |
| Lasso | Negative-unlabeled data |
| Penalized regression | |
| SARS-CoV-2 | |
Semantics
| Type | Source | Name |
|---|---|---|
| disease | MESH | Long COVID |
| disease | MESH | chronic condition |
| disease | MESH | SARS-CoV-2 infection |
| pathway | REACTOME | SARS-CoV-2 Infection |
| disease | MESH | syndrome |
| disease | MESH | infection |
| drug | DRUGBANK | Gold |
| pathway | REACTOME | Reproduction |
| disease | MESH | included |
| disease | MESH | char |
| drug | DRUGBANK | Creatinolfosfate |
| drug | DRUGBANK | L-Valine |
| disease | MESH | fac |
| drug | DRUGBANK | Aspartame |
| drug | DRUGBANK | Methylergometrine |
| disease | MESH | asymptomatic infection |
| drug | DRUGBANK | Saquinavir |
| drug | DRUGBANK | Esomeprazole |
| drug | DRUGBANK | Dextromethorphan |
| disease | MESH | chronic cough |
| disease | MESH | brain fog |
| drug | DRUGBANK | Dimercaprol |
| disease | MESH | abnormal movements |
| disease | MESH | dizziness |
| disease | MESH | chest pain |
| disease | MESH | cough |
| disease | MESH | face |
| disease | MESH | viral infection |
| disease | MESH | tick bite |
| drug | DRUGBANK | Coenzyme M |
| disease | MESH | ald |
| disease | MESH | severe acute respiratory syndrome |
| disease | MESH | tic |
| disease | MESH | Hair loss |
| disease | MESH | Fatigue |