Penalized regression with negative-unlabeled data: an approach to developing a Long COVID research index.

Publication date: Dec 09, 2025

Moderate to severe Long COVID is estimated to affect as many as 10% of SARS-CoV-2 infected individuals, representing a chronic condition with a substantial public health burden. An expansive literature has identified over 200 persistent symptoms associated with a history of SARS-CoV-2 infection; yet there is still no clear consensus on a syndrome definition. Long COVID thus represents a “negative-unlabeled” outcome, where those without prior infection must be Long COVID “negative” but those with prior infection have unknown or “unlabeled” Long COVID status. Despite this lack of a gold standard definition or biomarker, developing and evaluating an approach to characterizing Long COVID is a critical first step in future studies of risk and resiliency factors, mechanisms of disease, and interventions for both treatment and prevention. We recently applied a strategy for defining a numeric Long COVID research index (LCRI) using Lasso-penalized logistic regression, leveraging information on history of SARS-CoV-2 infection as a pseudo-label. In the current manuscript we formalize and evaluate this approach in a simulation framework for the occurrence of infection, Long COVID onset, and symptomatology. We evaluate its performance in selecting symptoms associated with Long COVID and in distinguishing individuals with Long COVID, in the presence of symptom correlations and demographic confounders. We compare the LCRI method to a simpler index defined by counting Long COVID symptoms, and assess both methods in a reanalysis of data on participants enrolled in the Adult Cohort of the Researching COVID to Enhance Recovery (RECOVER) study. Simulation results demonstrate that the Lasso-penalized LCRI methodology appropriately selects symptoms associated with Long COVID, and that the LCRI has high discriminatory power to distinguish Long COVID, outperforming symptom count.
This performance was robust to correlation between symptoms, and weighting methods are shown to successfully address potential confounding by demographic characteristics. Analysis of RECOVER data showed the LCRI outperforming symptom count by misclassifying fewer uninfected individuals as having Long COVID. As the LCRI is increasingly used to characterize Long COVID in research settings, this paper represents an important step in understanding its operating characteristics and developing general methodology for settings with negative-unlabeled data.
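
To make the negative-unlabeled idea concrete, here is a minimal, hypothetical sketch in pure Python. It is not the authors' simulation or the published LCRI procedure. Every number in it is an assumption chosen for illustration: the sample size, the infection and Long COVID probabilities, the three "true" and five "noise" symptoms, and the penalty `lam`. The L1-penalized logistic regression is fitted with a simple proximal gradient loop rather than any particular statistical package, the index is taken to be the plain sum of selected coefficients, and no demographic weighting is included.

```python
import math
import random

random.seed(0)
N, N_TRUE, N_NOISE = 1500, 3, 5  # hypothetical sizes, not from the paper

def simulate_person():
    """Return (symptoms, infected, long_covid) under assumed probabilities."""
    infected = random.random() < 0.5
    long_covid = infected and random.random() < 0.3
    # "True" symptoms are more likely with Long COVID; "noise" symptoms are not.
    symptoms = [int(random.random() < (0.8 if long_covid else 0.1))
                for _ in range(N_TRUE)]
    symptoms += [int(random.random() < 0.2) for _ in range(N_NOISE)]
    return symptoms, int(infected), int(long_covid)

people = [simulate_person() for _ in range(N)]
X = [p[0] for p in people]
infected = [p[1] for p in people]      # observed pseudo-label
long_covid = [p[2] for p in people]    # known only because this is simulated

def soft_threshold(z, t):
    """Proximal operator of the L1 penalty."""
    return math.copysign(max(abs(z) - t, 0.0), z)

def fit_lasso_logistic(X, y, lam=0.02, step=1.0, n_iter=200):
    """L1-penalized logistic regression via proximal gradient descent."""
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        g0, grad = 0.0, [0.0] * p
        for row, yi in zip(X, y):
            eta = b0 + sum(b * x for b, x in zip(beta, row))
            resid = 1.0 / (1.0 + math.exp(-eta)) - yi
            g0 += resid
            for j, x in enumerate(row):
                if x:
                    grad[j] += resid
        b0 -= step * g0 / n  # the intercept is not penalized
        beta = [soft_threshold(b - step * g / n, step * lam)
                for b, g in zip(beta, grad)]
    return b0, beta

def auc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Fit against infection history, then score everyone with the fitted weights.
_, coefs = fit_lasso_logistic(X, infected)
index = [sum(b * x for b, x in zip(coefs, row)) for row in X]
count = [sum(row) for row in X]
auc_index = auc(index, long_covid)
auc_count = auc(count, long_covid)
print("coefficients:", [round(b, 3) for b in coefs])
print("AUC index:", round(auc_index, 3), "AUC count:", round(auc_count, 3))
```

The model never sees the true Long COVID status. It is trained only to separate infected from uninfected individuals, and the resulting weights are then evaluated against the simulated truth. In this toy setup the noise symptoms dilute the plain symptom count but receive little or no weight in the penalized index.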

Concepts: Biomarker, Covid, Disease, Lasso

Keywords: COVID-19, Feature selection, Long COVID, Negative-unlabeled data, Penalized regression, SARS-CoV-2

Semantics

Type Source Name
disease MESH Long COVID
disease MESH chronic condition
disease MESH SARS-CoV-2 infection
pathway REACTOME SARS-CoV-2 Infection
disease MESH syndrome
disease MESH infection
drug DRUGBANK Gold
pathway REACTOME Reproduction
disease MESH included
disease MESH char
drug DRUGBANK Creatinolfosfate
drug DRUGBANK L-Valine
disease MESH fac
drug DRUGBANK Aspartame
drug DRUGBANK Methylergometrine
disease MESH asymptomatic infection
drug DRUGBANK Saquinavir
drug DRUGBANK Esomeprazole
drug DRUGBANK Dextromethorphan
disease MESH chronic cough
disease MESH brain fog
drug DRUGBANK Dimercaprol
disease MESH abnormal movements
disease MESH dizziness
disease MESH chest pain
disease MESH cough
disease MESH face
disease MESH viral infection
disease MESH tick bite
drug DRUGBANK Coenzyme M
disease MESH ald
disease MESH severe acute respiratory syndrome
disease MESH tic
disease MESH Hair loss
disease MESH Fatigue
