PandoGen: Generating complete instances of future SARS-CoV-2 sequences using Deep Learning.

Publication date: Jan 19, 2024

One of the challenges in a viral pandemic is the emergence of novel variants with different phenotypical characteristics. An ability to forecast future viral individuals at the sequence level enables advance preparation by characterizing the sequences and closing vulnerabilities in current preventative and therapeutic methods. In this article, we explore, in the context of a viral pandemic, the problem of using protein language models to generate complete instances of undiscovered viral protein sequences that have a high likelihood of being discovered in the future. Current approaches to training these models fit model parameters to a known sequence set, which does not suit pandemic forecasting because future sequences differ from known sequences in some respects. To address this, we develop a novel method, called PandoGen, to train protein language models for the pandemic protein forecasting task. PandoGen combines techniques such as synthetic data generation, conditional sequence generation, and reward-based learning, enabling the model to forecast future sequences with a high propensity to spread. Applying our method to modeling the SARS-CoV-2 Spike protein sequence, we find empirically that our model forecasts twice as many novel sequences with five times the case counts compared to a model that is 30× larger. Our method forecasts unseen lineages months in advance, whereas models 4× and 30× larger forecast almost no new lineages. When trained on data available up to a month before the onset of important Variants of Concern, our method consistently forecasts sequences belonging to those variants within tight sequence budgets.
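The two headline quantities in the abstract, the number of novel sequences forecast and their case counts within a sequence budget, can be illustrated with a small scoring routine. This is only a sketch under stated assumptions, not the authors' evaluation code: the function name `score_forecast`, its parameters, and the reading of "budget" as a cap on the number of distinct generated sequences considered are all hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def score_forecast(
    generated: Iterable[str],
    known: Set[str],
    future_case_counts: Dict[str, int],
    budget: int,
) -> Tuple[int, int]:
    """Score a generator's output against sequences reported later.

    Hypothetical helper: returns (number of novel sequences that were
    later observed, total case count of those sequences), considering
    only the first `budget` distinct generated sequences.
    """
    seen: Set[str] = set()
    hits = 0
    cases = 0
    for seq in generated:
        if len(seen) >= budget:
            break  # sequence budget exhausted
        if seq in seen:
            continue  # duplicates do not consume budget
        seen.add(seq)
        # A hit is absent from the training data but observed afterwards.
        if seq not in known and seq in future_case_counts:
            hits += 1
            cases += future_case_counts[seq]
    return hits, cases
```

Under these assumptions, two models would be compared by calling the function on each model's generated list with the same `known`, `future_case_counts`, and `budget`.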

Concepts: Forecast, Models, Pandemic, Train, Viral
Keywords: Complete, Forecast, Forecasts, Future, Generating, Instances, Learning, Models, Pandemic, Pandogen, Sars, Sequence, Sequences, Variants, Viral

Semantics

Type Source Name
disease VO time
disease IDO history
disease IDO process
pathway REACTOME Reproduction
disease VO effectiveness
disease MESH SARS CoV 2 infection
drug DRUGBANK Succimer
disease IDO host
disease VO effective
drug DRUGBANK Coenzyme M
disease MESH point mutations
disease IDO quality
drug DRUGBANK Aspartame
drug DRUGBANK Hexadecanal
drug DRUGBANK Flunarizine
drug DRUGBANK Amino acids
disease IDO algorithm
disease VO population
drug DRUGBANK Tropicamide
drug DRUGBANK Ranitidine
drug DRUGBANK L-Arginine
drug DRUGBANK Tretamine
drug DRUGBANK Pentaerythritol tetranitrate
disease VO efficiency
disease VO document
drug DRUGBANK Spinosad
disease MESH co infection
disease MESH premature stop codons
drug DRUGBANK Trestolone
disease VO device
disease VO Gap
drug DRUGBANK Huperzine B
drug DRUGBANK Ramipril
drug DRUGBANK L-Glutamine
drug DRUGBANK Methionine
drug DRUGBANK Ademetionine
disease IDO cell
disease VO report
drug DRUGBANK Stavudine
disease IDO infectivity
drug DRUGBANK (S)-Des-Me-Ampa
drug DRUGBANK Angiotensin II
drug DRUGBANK Dimethyl sulfone
disease IDO assay
drug DRUGBANK Chlorhexadol
