Model Card for BlueSTARR

The BlueSTARR model predicts regulatory activation (or suppression) from genomic sequences of 300, 600, or 1,000 bases. It is trained on RNA-over-DNA count ratios from STARR-seq experiments, a type of high-throughput reporter assay.

Model Details

Model Description

Trained BlueSTARR models take as input DNA sequences of length 300, 600, or 1,000 bases, depending on what the particular model was trained on (see directory structure). The model predicts a STARR-seq activation value (corresponding to the RNA-over-DNA count ratio). By mutating a base in the sequence (typically at or near the center) and comparing the predictions for the reference versus the variant sequence, the model can be used for non-coding variant effect prediction.
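The reference-versus-variant comparison described above can be sketched as follows. This is a minimal illustration, not the repository's actual inference code: `score` is a hypothetical stand-in for a trained model's activation prediction, and `toy_score` is a dummy scoring function used only to make the example runnable.

```python
import math

def predict_variant_effect(score, seq, pos, alt):
    """Compare model predictions for the reference vs. variant sequence.

    `score` is a stand-in for a trained model's activation prediction
    (a callable mapping a DNA string to a positive float); `pos` is 0-based.
    Returns the effect as a log2 ratio of variant over reference activity.
    """
    alt_seq = seq[:pos] + alt + seq[pos + 1:]
    return math.log2(score(alt_seq) / score(seq))

# Dummy scoring function for illustration only: GC fraction plus a floor.
def toy_score(seq):
    return 0.1 + sum(b in "GC" for b in seq) / len(seq)

# Mutating the center base and comparing predictions:
effect = predict_variant_effect(toy_score, "ATGCATGC", 4, "G")
```

A positive value indicates the variant is predicted to increase regulatory activity relative to the reference; zero indicates no predicted effect.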

  • Developed by: Revathy Venukuttan, Alexander Thomson, Yutian Chen, Boyao Li, Hilmar Lapp, William H. Majoros
  • License: MIT

Model Sources

  • Code: BlueSTARR
  • Paper: Revathy Venukuttan, Richard Doty, Alexander Thomson, Yutian Chen, Boyao Li, Yuncheng Duan, Alejandro Barrera, Katherine Dura, Kuei-Yueh Ko, Hilmar Lapp, Timothy E Reddy, Andrew S Allen, William H Majoros (2026). Modeling gene regulatory perturbations via deep learning from high-throughput reporter assays. bioRxiv 2026.03.27.714770; doi: https://doi.org/10.64898/2026.03.27.714770

Out-of-Scope Use

Because BlueSTARR is trained on STARR-seq data, it is unsuitable for predicting the effect of coding variants.

Training Details

Training Data

The K562 and A549 STARR-seq data used for training are available from ENCODE.

Preprocessing

The data-preprocessing steps from FASTQ or BigWig files are enumerated at the BlueSTARR code repository.

The initial preprocessing steps (before downsampling) resulted in 19 million records (sequence, DNA counts, RNA counts) for the K562 STARR-seq data, 6 million for the A549/DMSO STARR-seq data, and 7 million for the A549/DEX STARR-seq data.

To evaluate the effect of training set size, the K562 data were downsampled to 0.8M, 1M, 1.3M, 1.5M, and 1.55M training examples. Benchmarking against the Kircher et al. (2019) MPRA data showed performance plateauing after 1.5M training examples, so 1.55M was selected as the training set size for downsampling all other datasets. Using this sampling strategy, each final dataset was split into 1.55M training examples, 0.5M validation examples, and 0.5M test examples.
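The downsample-and-split step can be sketched as a simple random partition. This is an illustrative sketch, not the repository's preprocessing code; the function and its parameters are hypothetical.

```python
import random

def split_dataset(records, n_train, n_val, n_test, seed=42):
    """Randomly partition records into disjoint train/val/test subsets,
    discarding any surplus records (i.e., downsampling)."""
    assert n_train + n_val + n_test <= len(records)
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

# At full scale this would be n_train=1_550_000 and n_val=n_test=500_000;
# small numbers are used here for illustration.
train, val, test = split_dataset(list(range(100)), 60, 20, 20)
```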

Training Procedure

Each model configuration was trained with 10-30 independent training runs to account for stochasticity in model weight initialization and training batches. The models posted here are those with the minimum loss on the validation set; sustained failure to improve validation loss (as controlled by the patience hyperparameter) served as the early-stopping criterion.

Evaluation

Predictive accuracy was assessed in two ways: quantitative accuracy of predicted effect sizes, and binary classification accuracy for variants labeled as functional or non-functional.

Quantitative effect size accuracy

Quantitative effect size prediction accuracy was evaluated using root mean squared error (RMSE) and correlation between predicted and observed effect sizes on held-out STARR-seq test data for both K562- and A549-based models. For each sequence, the model predicts a quantitative effect size, which was compared against the naive estimator of effect size (the ratio of mean RNA counts to mean DNA counts across replicates) calculated from the experimental data.
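The naive estimator and the two evaluation metrics above can be written in a few lines. This is a minimal sketch of the quantities described, not the actual evaluation code:

```python
from math import sqrt
from statistics import mean

def naive_effect_size(rna_counts, dna_counts):
    """Naive estimator: ratio of mean RNA to mean DNA counts across replicates."""
    return mean(rna_counts) / mean(dna_counts)

def rmse(pred, obs):
    """Root mean squared error between predicted and observed effect sizes."""
    return sqrt(mean((p - o) ** 2 for p, o in zip(pred, obs)))

def pearson(x, y):
    """Pearson correlation between predicted and observed effect sizes."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x)
                      * sum((b - my) ** 2 for b in y))
```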

Classification accuracy

For classification accuracy, model performance was assessed using unseen variants from saturation mutagenesis MPRA data reported by Kircher et al. (2019). Note that these include measurements across multiple cell types, many of which differ from the A549 and K562 cell lines on which the model was trained, thus effectively testing the model out-of-distribution. MPRA variants were labeled as functional (positive) or non-functional (negative) based on experimentally measured regulatory activity, following the criteria described by Kircher et al. Variants were classified as positive if they exhibited a statistically significant effect (p ≤ 1×10⁻⁵), and as negative if they were not statistically significant (p > 1×10⁻⁵) and had an absolute effect size of at most 0.05 (|log2FC| ≤ 0.05). Variants with a minimum tag count of less than 10 were excluded (as suggested by Kircher et al.).
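The labeling criteria above translate directly into a small decision function. This sketch follows the thresholds stated in the text; the function name and return convention are illustrative.

```python
def mpra_label(p_value, log2fc, min_tag_count):
    """Assign MPRA variant labels per the Kircher et al. (2019) criteria
    as described above. Returns 'positive', 'negative', or None (excluded)."""
    if min_tag_count < 10:
        return None  # insufficient tag count: excluded from evaluation
    if p_value <= 1e-5:
        return "positive"  # statistically significant regulatory effect
    if abs(log2fc) <= 0.05:
        return "negative"  # non-significant and near-zero effect size
    return None  # neither clearly functional nor clearly non-functional
```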

For each variant, the variant effect prediction was obtained as the ratio of the BlueSTARR-predicted regulatory activity of the sequence with the alternate allele over that predicted for the sequence with the reference allele (or their difference on the log scale). Using the absolute predicted variant effect as the ranking score, model performance was quantified with the area under the receiver operating characteristic curve (AUROC), computed by sweeping a decision threshold over the scores and comparing the resulting labels against the MPRA-assigned binary labels.
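AUROC as used here is equivalent to the probability that a randomly chosen positive variant is ranked above a randomly chosen negative one. A minimal pairwise (Mann-Whitney-style) implementation, shown for illustration rather than as the actual evaluation code:

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison: the fraction of (positive, negative)
    pairs in which the positive example scores higher, counting ties as 0.5.
    `scores` here would be the absolute predicted variant effects."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

This O(n²) form is fine for illustration; rank-based implementations (e.g., `sklearn.metrics.roc_auc_score`) compute the same quantity efficiently.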

Technical Specifications

Model Architecture

BlueSTARR was originally inspired by applying the DeepSTARR model to human STARR-seq data. To permit easy modification of the overall neural network architecture, the implementation was extended to build the model from a simple configuration file, allowing for rapid comparison of different model architectures.

The default architecture is a convolutional neural network (CNN) that accepts one-hot encoded 300 bp DNA sequences as input. The default model comprises five one-dimensional convolutional layers with 1024, 512, 256, 128, and 64 filters, and kernel sizes of 8, 16, 32, 64, and 128, respectively. This choice was made to maintain a large receptive field without using pooling after each layer (in contrast to DeepSTARR). Each convolutional layer is followed by batch normalization and ReLU activation, and a dropout layer is applied before each convolution. All layers use "same" padding, with no dilation, no residual connections, and no intermediate pooling. The outputs of the final convolutional layer are aggregated by global average pooling and connected directly to a single output neuron for prediction.
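The claim that this stack achieves a large receptive field without pooling can be verified with a short calculation (stride 1 and no dilation, as stated above):

```python
def receptive_field(kernel_sizes, strides=None, dilations=None):
    """Receptive field of a stack of 1-D convolutions.
    With stride 1 and no dilation this reduces to 1 + sum(k - 1)."""
    strides = strides or [1] * len(kernel_sizes)
    dilations = dilations or [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, s, d in zip(kernel_sizes, strides, dilations):
        rf += d * (k - 1) * jump
        jump *= s
    return rf

# The default stack: kernels 8, 16, 32, 64, 128, stride 1, no dilation.
rf = receptive_field([8, 16, 32, 64, 128])
```

With stride 1 and no dilation the five layers yield 1 + 7 + 15 + 31 + 63 + 127 = 244, so the top-layer units see 244 of the 300 input positions without any pooling.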

The model was trained using the Adam optimizer with a learning rate of 0.002, a mean-squared error (MSE) loss function, batch size of 128, and an early stopping “patience” value of 10. All of the foregoing hyperparameters can be easily changed via the model configuration text file.
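The early-stopping behavior governed by the patience hyperparameter can be sketched as a generic training loop. This is a framework-agnostic illustration, not the repository's training code; `step` is a hypothetical stand-in for one epoch of training.

```python
def train_with_early_stopping(step, max_epochs=1000, patience=10):
    """Generic early-stopping loop: stop after `patience` consecutive epochs
    without improvement in validation loss, keeping the best state seen.
    `step` is a stand-in for one training epoch; it returns
    (validation_loss, model_state)."""
    best_loss, best_state, stale = float("inf"), None, 0
    for epoch in range(max_epochs):
        val_loss, state = step(epoch)
        if val_loss < best_loss:
            best_loss, best_state, stale = val_loss, state, 0
        else:
            stale += 1
            if stale >= patience:
                break  # patience exhausted: stop training
    return best_loss, best_state

# Illustration: validation loss improves for three epochs, then plateaus.
losses = [5.0, 4.0, 3.0] + [3.0] * 12
best_loss, best_epoch = train_with_early_stopping(
    lambda e: (losses[e], e), max_epochs=len(losses), patience=10)
```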

Variations of the above-mentioned architecture were explored to evaluate different receptive fields, sequence lengths, number of layers, inclusion of attention mechanism, and a custom loss function based on negative log-likelihood (NLL).

Compute Infrastructure

Training was performed on nodes provided by the Duke Compute Cluster using consumer-grade NVIDIA RTX A5000 and RTX A6000 GPUs. A single GPU was used for each training run. Most training runs completed in under 24 hours.

Software

Code to train BlueSTARR models and to perform inference with them can be found in the BlueSTARR repository on GitHub.

Citation

If you use our model in your work, please cite the associated paper.

BibTeX:

@article {Venukuttan2026.03,
    author = {Venukuttan, Revathy and Doty, Richard and Thomson, Alexander and Chen, Yutian and Li, Boyao and Duan, Yuncheng and Barrera, Alejandro and Dura, Katherine and Ko, Kuei-Yueh and Lapp, Hilmar and Reddy, Timothy E and Allen, Andrew S and Majoros, William H},
    title = {Modeling gene regulatory perturbations via deep learning from high-throughput reporter assays},
    elocation-id = {2026.03.27.714770},
    year = {2026},
    doi = {10.64898/2026.03.27.714770},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2026/03/31/2026.03.27.714770},
    eprint = {https://www.biorxiv.org/content/early/2026/03/31/2026.03.27.714770.full.pdf},
    journal = {bioRxiv}
}

Acknowledgements

This work was supported by the National Institute of General Medical Sciences (NIGMS) of the National Institutes of Health (NIH) under award number 1R35-GM150404 to W.H.M., and National Human Genome Research Institute (NHGRI) of NIH 5U01-HG011967, and NHGRI 5UM1-HG012053. Content is solely the responsibility of the authors.

Model Card Authors

Hilmar Lapp, William H. Majoros

Model Card Contact

William H. Majoros
