Identifying clusters of people with Multiple Long-Term Conditions using Large Language Models: a population-based study

Publication date: Feb 15, 2025

Identifying clusters of people with similar patterns of Multiple Long-Term Conditions (MLTC) could help healthcare services to tailor management for each group. Large Language Models (LLMs) can utilise complex longitudinal electronic health records (EHRs), which may enable deeper insights into patterns of disease. Here, we develop a pipeline, incorporating an LLM, to generate gender-specific clusters using clinical codes recorded in EHRs. In this population-based study, we used EHRs from individuals aged ≥50 years from the Clinical Practice Research Datalink in the UK. Longitudinal sequences of medical histories, including diagnoses, diagnostic tests and medications, were used to pre-train an LLM based on DeBERTa. The LLM, called EHR-DeBERTa, includes embedding layers for age at diagnosis, calendar year of diagnosis, gender, and visit number, with a diagnosis vocabulary of 3776 tokens covering the entire ICD-10 hierarchy. We fine-tuned EHR-DeBERTa using contrastive learning and generated patient embeddings for all individuals. A bootstrapping clustering pipeline was applied separately for females and males, and the resulting gender-specific patient clusters were characterised by disease prevalence, ethnicity and deprivation. A total of 5,846,480 patients were included. We identified fifteen clusters in females and seventeen clusters in males, grouped into five categories: i) low disease burden; ii) mental health; iii) cardiometabolic diseases; iv) respiratory diseases; and v) mixed diseases. Cardiometabolic and mental health conditions showed the strongest separation across clusters. People in low disease burden and mental health clusters were younger, whereas those in cardiometabolic clusters were older, with females in cardiometabolic clusters older than their male counterparts. Using an LLM applied to longitudinal EHRs, we generated interpretable, gender-specific clusters of diseases that provide insights into disease patterns. Extending these methods in future to incorporate clinical outcomes could enable identification of high-risk patients and support precision-medicine approaches for managing MLTC.
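The final step described above, characterising gender-specific clusters by disease prevalence, can be illustrated with a minimal sketch. This is not the authors' code: the function name `cluster_prevalence`, the record keys (`gender`, `cluster`, `conditions`) and the toy records are all assumptions made for illustration, and the cluster labels are taken as already assigned by some upstream clustering step.

```python
from collections import Counter, defaultdict


def cluster_prevalence(patients):
    """Return {(gender, cluster): {condition: prevalence}}.

    `patients` is an iterable of dicts with hypothetical keys
    'gender', 'cluster' and 'conditions' (a set of condition names).
    Prevalence is the share of patients in that gender-specific
    cluster who have the condition recorded at least once.
    """
    sizes = Counter()               # patients per (gender, cluster)
    counts = defaultdict(Counter)   # condition counts per (gender, cluster)
    for p in patients:
        key = (p["gender"], p["cluster"])
        sizes[key] += 1
        # set() guards against a condition being listed twice for one patient
        counts[key].update(set(p["conditions"]))
    return {
        key: {cond: n / sizes[key] for cond, n in counts[key].items()}
        for key in sizes
    }


# Toy records, invented for illustration only (not study data).
toy_patients = [
    {"gender": "F", "cluster": 0, "conditions": {"hypertension", "obesity"}},
    {"gender": "F", "cluster": 0, "conditions": {"hypertension"}},
    {"gender": "F", "cluster": 1, "conditions": {"depression", "anxiety"}},
    {"gender": "M", "cluster": 0, "conditions": {"COPD", "asthma"}},
    {"gender": "M", "cluster": 0, "conditions": set()},
]

prevalence = cluster_prevalence(toy_patients)
for key in sorted(prevalence):
    print(key, sorted(prevalence[key].items()))
```

Keying the result on the (gender, cluster) pair keeps female and male clusters separate, mirroring the abstract's statement that clustering was run separately by gender; cluster 0 in females and cluster 0 in males are therefore unrelated groups.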

Concepts: E1003514e1003514, Hospitalised, Mix3, Python

Keywords: Burden, Cardiometabolic, Cluster, Clusters, Conditions, Diseases, Females, Individuals, Low, Medrxiv, Mental, Multiple, Patient, Preprint, Prevalence

Semantics

Type Source Name
disease MESH respiratory diseases
disease MESH Cancer
disease MESH chronic conditions
drug DRUGBANK Gold
drug DRUGBANK Imidacloprid
drug DRUGBANK Hydroxyethyl Starch
disease MESH death
disease IDO history
disease MESH morbidities
disease IDO process
disease MESH metabolic diseases
disease MESH hypertension
disease MESH obesity
drug DRUGBANK Ethanol
disease MESH asthma
pathway KEGG Asthma
disease MESH gout
disease MESH dementia
disease MESH heart failure
disease MESH multiple sclerosis
disease MESH eating disorders
disease MESH cardiovascular risk factors
disease MESH anxiety
disease MESH depression
disease MESH osteoarthritis
disease MESH osteoporosis
disease MESH hearing impairment
disease MESH melanoma
pathway KEGG Melanoma
disease MESH cardiovascular diseases
disease MESH COPD
disease MESH Parkinson’s disease
disease MESH schizophrenia
disease MESH bipolar disorder
disease MESH liver disease
disease MESH infectious diseases
disease MESH connective tissue disease
disease MESH bronchiectasis
disease MESH Ischemic heart disease
disease MESH arrhythmia
disease MESH epilepsy
pathway REACTOME Infectious disease
disease IDO infectious disease
disease MESH Chronic Kidney Disease
disease IDO intervention
disease MESH abnormalities
disease MESH psychotic disorders
disease MESH infection
disease MESH inflammation
disease MESH cognitive decline
disease MESH lifestyle
disease MESH missed diagnosis
disease IDO blood
disease MESH emergency
drug DRUGBANK Troleandomycin
