Reliability-enhanced data cleaning in biomedical machine learning using inductive conformal prediction.

Publication date: Feb 13, 2025

Accurately labeling large datasets is important for biomedical machine learning but remains challenging, and modern data augmentation methods may introduce noise into the training data, which can degrade model performance. Existing approaches to noisy training data typically rely on strict modeling assumptions, classification models, and well-curated datasets. To address these limitations, we propose a novel reliability-based training-data-cleaning method employing inductive conformal prediction (ICP). This method uses a small set of well-curated training data and leverages ICP-calculated reliability metrics to selectively correct mislabeled data and outliers within vast quantities of noisy training data. Its efficacy is validated across three classification tasks with distinct modalities: filtering drug-induced-liver-injury (DILI) literature using free-text titles and abstracts, predicting ICU admission of COVID-19 patients from CT radiomics and electronic health records, and subtyping breast cancer using RNA-sequencing data. Varying levels of noise were introduced into the training labels via label permutation. Our training-data-cleaning method significantly enhanced downstream classification performance (paired t-tests, p ≤ 0.05 across 30 random train/test partitions): significant accuracy enhancement in 86 out of 96 DILI experiments (up to an 11.4% relative increase, from 0.812 to 0.905), significant AUROC and AUPRC enhancements in all 48 COVID-19 experiments (up to a 23.8% relative increase, from 0.597 to 0.739, for AUROC, and a 69.8% relative increase, from 0.183 to 0.311, for AUPRC), and significant accuracy and macro-average F1-score improvements in 47 out of 48 RNA-sequencing experiments (up to a 74.6% relative increase, from 0.351 to 0.613, for accuracy, and an 89.0% relative increase, from 0.267 to 0.505, for F1-score). The improvement can be both statistically and clinically significant for information retrieval, disease diagnosis, and prognosis.
The method offers the potential to substantially boost classification performance in biomedical machine learning tasks without requiring an excessive volume of well-curated training data or the strong data-distribution and modeling assumptions required by existing semi-supervised learning methods.
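
To make the idea of ICP-based label cleaning concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the nonconformity measure, the reliability metrics, or the correction rules, so everything here is an assumption for illustration: the nonconformity score (1 minus the model's predicted probability), the threshold `eps`, and the helper names `nonconformity`, `icp_p_values`, and `clean_label` are all hypothetical. The p-value formula itself is the standard inductive conformal one, computed against nonconformity scores from a small well-curated calibration set.

```python
from typing import Dict, Optional, Sequence, Tuple


def nonconformity(probs: Dict[str, float], label: str) -> float:
    """Assumed score: 1 minus the underlying model's probability for `label`."""
    return 1.0 - probs[label]


def icp_p_values(cal_scores: Sequence[float],
                 probs: Dict[str, float]) -> Dict[str, float]:
    """Standard ICP p-value for every candidate label.

    p(y) = (#{calibration scores >= score of (x, y)} + 1) / (n + 1)
    """
    n = len(cal_scores)
    p_values = {}
    for label in probs:
        score = nonconformity(probs, label)
        at_least_as_strange = sum(1 for s in cal_scores if s >= score)
        p_values[label] = (at_least_as_strange + 1) / (n + 1)
    return p_values


def clean_label(cal_scores: Sequence[float],
                probs: Dict[str, float],
                given_label: str,
                eps: float = 0.2) -> Tuple[Optional[str], Dict[str, float]]:
    """Hypothetical cleaning rule (an assumption, not taken from the paper).

    - If no label has p-value above eps, treat the sample as an outlier (None).
    - Else if the given label's p-value is at most eps, relabel to the
      label with the highest p-value.
    - Otherwise keep the given label.
    """
    p_values = icp_p_values(cal_scores, probs)
    best = max(p_values, key=p_values.get)
    if p_values[best] <= eps:
        return None, p_values
    if p_values[given_label] <= eps:
        return best, p_values
    return given_label, p_values


# Nonconformity scores of the true labels on a small well-curated
# calibration set (made-up numbers for illustration).
cal_scores = [0.05, 0.12, 0.18, 0.22, 0.27, 0.31, 0.36, 0.42, 0.48]

# A noisy sample labeled "neg" that the model confidently calls "pos".
relabeled, _ = clean_label(cal_scores, {"pos": 0.9, "neg": 0.1}, "neg")

# A sample whose given label conforms to the calibration data.
kept, _ = clean_label(cal_scores, {"pos": 0.8, "neg": 0.2}, "pos")

# A sample for which neither label conforms.
outlier, _ = clean_label(cal_scores, {"pos": 0.5, "neg": 0.5}, "pos")

print(relabeled, kept, outlier)
```

In this sketch the first sample is relabeled, the second keeps its label, and the third is flagged as an outlier. A practical variant would likely compute calibration scores per class and choose `eps` on held-out data; both choices are left open here.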


Concepts: Biomedical, Deteriorate, F1, Radiomics
Keywords: Biomedical, Classification, Cleaning, Conformal, Curated, Data, Enhanced, Experiments, Increase, Inductive, Learning, Noise, Reliability, Significant, Training

Semantics

Type Source Name
disease MESH COVID-19
disease MESH breast cancer
pathway KEGG Breast cancer
disease IDO history
disease IDO process
pathway REACTOME Reproduction
disease MESH drug induced liver injury
drug DRUGBANK Trestolone
drug DRUGBANK Coenzyme M
drug DRUGBANK Sulpiride
disease MESH misdiagnoses
disease MESH uncertainty
disease IDO quality
drug DRUGBANK Cysteamine
disease IDO nucleic acid
disease MESH coronary heart disease
disease MESH hypertension
disease MESH COPD
disease MESH liver disease
disease MESH chronic kidney disease
disease MESH carcinoma
disease MESH dyspnea
disease IDO blood
drug DRUGBANK Prothrombin
drug DRUGBANK Potassium
drug DRUGBANK Creatinine
disease MESH infection
disease MESH pneumonia
disease IDO algorithm
drug DRUGBANK Tretamine
drug DRUGBANK Isoxaflutole
drug DRUGBANK Spinosad
disease MESH neurological disorders
drug DRUGBANK Esomeprazole
drug DRUGBANK Flunarizine
disease MESH sepsis
disease MESH cardiovascular diseases
disease MESH glioblastoma
disease MESH gastric cancer
pathway KEGG Gastric cancer
disease MESH tumor

