New Foundation Model Just Dropped: This Time Focused on Gene Regulation
While transcriptional regulation underpins the diversity of biological and pathological processes, our understanding of this process remains surprisingly incomplete. Cell-specific transcriptional profiles arise from myriad protein-protein and protein-DNA interactions taking place in a background of differing epigenetic conditions. The reported clustering of transcription factor (TF) binding motifs (Vierstra et al.) highlights the homology of DNA-binding domains and the low combinatorial variability regarding regulatory interactions; however, our understanding of transcription regulation remains limited to specific cell types. Furthermore, we still do not know if combinatorial interactions of TFs determine cell-specific gene expression profiles. Overall, we appear to understand little regarding the critical process of transcriptional regulation.

Developing fine-tuned prediction methods based on sequence data and trained on specific human cell types (Zhou et al., Kelly 2020, and Zhang et al.) represented a first step towards an improved understanding of transcriptional regulation. More recently, the introduction of foundation models – machine/deep learning models trained on vast datasets for application to a range of use cases – has fostered generalizability and improved utility (OpenAI GPT-4 technical report and Lin et al.). Recent single-cell foundation models (Theodoris et al., Cui et al., and Hao et al.) encode transcriptomic profiles within single models to enable downstream tasks. Will the report of a foundation model describing how transcription emerges from the chromatin landscape represent the next evolutionary step?
Understanding relationships between chromatin and transcriptional output remains tricky, given that approaches generally require separate epigenetic and transcriptomic assays in distinct cell samples and the subsequent reintegration of these diverse datasets. Parallel analysis of individual cells for RNA expression and DNA from targeted tagmentation by sequencing or "Paired-Tag" from Epigenome Technologies generates joint epigenetic and gene expression profiles at single-cell resolution and detects histone modifications and RNA transcripts in individual nuclei with efficiencies similar to single-nucleus RNA-seq/ChIP-seq assays.
Paired-Tag technology can enable researchers to advance our understanding of transcriptional regulation and improve disease management.
Researchers from the laboratories of Xi Fu, Raul Rabadan (Columbia University), and Eric P. Xing (MBZUAI/Carnegie Mellon University) recognized that current computational models of transcriptional regulation lacked the generalizability required to support extrapolation to "unseen" cell types and conditions; such generalizability would aid the appreciation of gene regulation at a deeper level and perhaps extend to identifying new targets in a variety of diseases/disorders. In their recent Nature study, the authors introduce GET – the general expression transformer – as an interpretable foundation model for transcriptional regulation based on the integration of genomic sequence information and chromatin accessibility data from over 200 fetal and adult human cell types (Fu, Mo, and Buendia et al.). In part one of this series, Epigenome Technologies offers a brief overview of the development of GET before describing the practical applications of this exciting technological advance.
Developing and Benchmarking the General Expression Transformer

GET focuses on characterizing a local genomic region containing regulatory elements by understanding cell-specific TF binding and chromatin accessibility, which provides a proxy for PolII-driven gene expression within each element (approximated from RNA-seq data)
The method of encoding each gene as a structured matrix represents a distinctive innovation; GET defines a gene by the local regulatory landscape instead of assigning each gene a unique identifier
Specifically, a matrix with dimensions of 200 rows × 283 columns represents each gene
The rows represent a series of contiguous windows or "bins" spanning a genomic region around the transcription start site; each row represents a localized region of chromatin accessibility
The columns aggregate TF motif binding scores into 282 clusters with an additional channel that captures underlying chromatin accessibility
An individual matrix entry quantifies the likelihood of a TF binding event (or the accessibility level) at that precise bin
A high value at a particular row and column may indicate robust predicted binding for a specific TF motif within a given region of chromatin accessibility, while lower scores suggest weaker binding/less accessibility
This detailed representation enables GET to "define" a gene by a spatially resolved profile of regulatory features – a property leveraged for predicting reporter assays
A regulatory element-specific architecture supports a self-supervised pre-training step that allows GET to learn how regions and features interact across cell types
Masking out random regulatory elements trains GET to predict TF binding/chromatin accessibility and subsequent PolII activity/gene expression
A portion (typically up to 50%) of the matrix entries is masked out using a learnable token, with GET aiming to impute (that is, accurately predict) the missing values based on the context provided by surrounding entries
Imputing missing data forces the model to learn local correlations and dependencies within the matrix
For instance, the model can capture relationships when specific TF motifs co-occur in neighboring bins or when specific accessibility patterns predict binding events in adjacent regions
The design of GET supports the use of chromatin accessibility data without paired gene expression data, which increases the diversity of regulatory information available for training
GET pre-training employs single-cell assay for transposase-accessible chromatin with sequencing (scATAC-seq) data from 213 human cell types from the fetus and adult (Zhang et al., Joung et al., and Domcke et al.)
Fine-tuning utilizes gene expression data for 153 cell types from multiomics or separate single-cell RNA-seq analyses (Cao et al. and The Tabula Sapiens Consortium)
An assessment of GET's ability to accurately predict gene expression in unseen cell types reveals a high correlation between GET's predicted and the observed expression values for a "left-out" cell type (in this case, astrocytes)
GET significantly outperforms related technologies, including more straightforward machine learning approaches trained in the same manner on the same data
The authors note the elevated level of generalizability, as GET functions on adult cell types when trained solely on fetal data and across distinct sequencing platforms and other experimental assays
An examination of zero-shot prediction capacity (recognizing/categorizing without observing examples beforehand) of gene expression-driving regulatory elements in unseen cell types reveals GET's ability to accurately predict the outcomes of a lentivirus-based massively parallel reporter assay in lymphoblast cells
In this case, GET pre-training employed chromatin accessibility and gene expression data from lymphoblast cells, but GET was not exposed to reporter assay data
Benchmarking against the deep learning architecture known as "Enformer" (Avsec et al.) in lymphoblast cells reveals that GET makes more accurate predictions and scales better, with predicted regulatory elements displaying meaningful enrichment of histone modifications and TF binding sites
Enformer, which was trained on TF binding, histone modification, gene expression, and chromatin accessibility data, functioned better in certain instances (enhancers and repressed regions); however, GET displays significant advantages in terms of computational cost
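To make the gene representation and masked pre-training objective described above more concrete, the following is a minimal Python sketch, not the authors' implementation. It builds a toy 200 × 283 matrix (282 motif-cluster columns plus one accessibility channel), masks half of the regions, and scores a reconstruction only on the masked rows. The function names, the random stand-in scores, the all-zero mask token (learnable in a real model), the choice to mask whole rows rather than individual entries, and the mean-squared-error loss are all illustrative assumptions.

```python
import random

N_REGIONS = 200          # rows: regions ("bins") around the transcription start site
N_MOTIF_CLUSTERS = 282   # columns: TF motif clusters
N_FEATURES = N_MOTIF_CLUSTERS + 1  # plus one chromatin accessibility channel


def make_toy_gene_matrix(rng):
    """Return a 200 x 283 matrix of random stand-in scores in [0, 1).

    Real inputs would be motif binding scores and accessibility values;
    random numbers are used here purely for illustration.
    """
    return [[rng.random() for _ in range(N_FEATURES)] for _ in range(N_REGIONS)]


def mask_regions(matrix, mask_token, mask_fraction, rng):
    """Replace a random subset of rows with a copy of mask_token.

    Returns the masked matrix and the sorted indices of the masked rows.
    Masking whole rows is an assumption; masking single entries would
    only change how the indices are drawn.
    """
    n_masked = int(len(matrix) * mask_fraction)
    masked_rows = sorted(rng.sample(range(len(matrix)), n_masked))
    masked_set = set(masked_rows)
    masked = [
        list(mask_token) if i in masked_set else list(row)
        for i, row in enumerate(matrix)
    ]
    return masked, masked_rows


def reconstruction_loss(predicted, target, masked_rows):
    """Mean squared error computed over the masked rows only."""
    total = 0.0
    count = 0
    for i in masked_rows:
        for p, t in zip(predicted[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


rng = random.Random(0)
gene = make_toy_gene_matrix(rng)

# Stand-in for the learnable mask token (a trainable vector in a real model).
mask_token = [0.0] * N_FEATURES
masked_gene, masked_rows = mask_regions(gene, mask_token, 0.5, rng)

# A trivial "model" that returns its masked input scores poorly on the
# masked rows, while a perfect reconstruction scores exactly zero.
baseline_loss = reconstruction_loss(masked_gene, gene, masked_rows)
perfect_loss = reconstruction_loss(gene, gene, masked_rows)

print(len(gene), len(gene[0]), len(masked_rows))  # 200 283 100
```

Restricting the loss to masked rows means a model cannot lower it by copying visible values, so any improvement has to come from the context supplied by the unmasked regions and features.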
Can Chromatin Accessibility Alone Suffice When Predicting Gene Expression?

While GET leverages chromatin accessibility as a primary proxy for active regulatory regions, this approach prompts a pivotal question: can chromatin accessibility data explain gene expression levels, or do we require additional epigenetic information, such as histone modification profiles? Said modifications function directly by recruiting transcriptional activators or indirectly through chromatin remodeling complexes, suggesting that accessibility lays the groundwork for gene activation, but the full orchestration of transcription depends on a complex interplay of structural/chemical signals.
General Expression Transformer – The Highlights
Overall, the authors report how GET provides experimental-level accuracy in gene expression prediction in seen and unseen cell types employing chromatin accessibility data and sequence information, displays adaptability across sequencing platforms and assays, offers zero-shot prediction of reporter assay readouts, and outperforms previous state-of-the-art models when identifying cis-regulatory elements.
Future studies may be supported by robust single-cell datasets provided by Paired-Tag – an analytical platform that creates joint epigenetic and gene expression profiles at single-cell resolution and detects histone modifications and RNA transcripts in individual nuclei. The Bing Ren lab developed Paired-Tag, and Epigenome Technologies offers optimized Paired-Tag kits and services to epigenetics researchers under an exclusive license from the Ludwig Institute for Cancer Research.
See Nature, January 2025 for more on the development of GET as an interpretable foundation model for transcriptional regulation, and stay tuned to Twitter, Bluesky, and LinkedIn to keep up to date with all the new epigenetics studies and the Epigenome Technologies website for part 2. In the meantime, check out our Products and Services pages to see how Epigenome Technologies can elevate your research today.