Evaluating multimodal GRNs

Evaluating multimodal GRNs#

Evaluating a multimodal gene regulatory network (GRN) requires two ingredients: a multiome dataset, comprising paired single-nucleus RNA-seq and ATAC-seq measurements, and a GRN inferred from that dataset. Together, these allow systematic assessment of whether the inferred regulatory relationships are supported by independent lines of evidence.

gretapy organizes evaluation into four complementary metric categories:

  • Genomic: assesses whether GRN elements (TF binding sites, CREs, CRE-to-gene links) are supported by genome-wide measurements.

  • Predictive: tests whether the GRN topology can predict molecular readouts, such as gene expression from TF activity or the presence of active gene sets.

  • Literature: checks whether the TFs, TF-TF interactions, and TF-gene pairs in the GRN are supported by curated, experimentally validated evidence from the scientific literature.

  • Mechanistic: evaluates whether the GRN can reproduce functional outcomes, including TF activity in perturbation experiments, gene expression changes after TF knockouts, and cell-type-specific steady states.

To support evaluation out of the box, gretapy provides 12 already-processed multimodal datasets spanning diverse human and mouse tissues and cell types. Each metric category can draw from multiple independent databases (e.g. ChIP-Atlas, ReMap2022, and UniBind for TF binding) or involve multiple subtasks (e.g. gene expression and chromatin accessibility prediction within the Omics metric), resulting in 25 distinct evaluation tasks across datasets. This multi-task design ensures that strong performance reflects consistency across diverse and orthogonal criteria, offering a robust and holistic view of GRN quality.

For each evaluation task, performance is quantified by defining true positives (TP), false positives (FP), and false negatives (FN) (Table 1). Precision and recall are then computed and summarized into an F0.1F_{0.1} score, which weights precision ten times more than recall (β = 0.1), reflecting the principle that in GRN evaluation, predicting a small set of high-confidence edges is more informative than recovering many noisy ones.

Table 1. Definition of evaluation tasks. TP: true positive; FP: false positive; FN: false negative.

Category

Task

TP

FP

FN

Genomic

TF Binding

TF-CRE in GRN and in binding collection

TF-CRE in GRN but not in collection

TF-CRE not in GRN but in collection

Genomic

CREs

CRE in GRN and in annotation

CRE in GRN but not in annotation

CRE not in GRN but in annotation

Genomic

CRE to Gene

CRE-Gene in GRN and in annotation

CRE-Gene in GRN but not in annotation

CRE-Gene not in GRN but in annotation

Predictive

Omics

Omic feature in GRN with significant predicted–observed correlation

Omic feature in GRN with insignificant correlation

Omic feature not in GRN but in dataset

Predictive

Gene Sets

Gene set in GRN and predicted present in omics data

Gene set in GRN but not in omics data

Gene set not in GRN but in omics data

Literature

TF Markers

TF in GRN and in marker collection

TF in GRN but not in collection

TF not in GRN but in collection

Literature

TF Pairs

TF-TF in GRN and in interactions collection

TF-TF in GRN but not in collection

TF-TF not in GRN but in collection

Literature

Reference GRN

TF-Gene in GRN and in reference collection

TF-Gene in GRN but not in collection

TF-Gene not in GRN but in collection

Mechanistic

TF Scoring

TF in GRN with significant activity score for perturbation

TF in GRN with insignificant activity score

TF not in GRN but in perturbation experiment

Mechanistic

Perturbation Forecasting

TF targets in GRN with significant predicted–observed correlation

TF targets in GRN with insignificant correlation

TF not in GRN but in perturbation experiment

Mechanistic

Steady State Simulation

Steady state matching one cell type

Steady state matching multiple or no cell types

Cell type not generated but present in dataset

Several evaluation tasks are made context-specific by filtering their reference databases using tissue- or cell-type-relevant terms. For example, when evaluating a dataset derived from heart tissue, only TF binding peaks annotated to cardiac cell types are retained; similarly, for brain datasets, only neuronal and glial annotations are used. This filtering ensures that evaluation reflects whether methods recover regulatory interactions relevant to the biological system under study, rather than generic genome-wide associations.

In this vignette, we begin by evaluating a toy GRN against a toy multiome dataset designed to simulate a PBMC-like tissue comprising B cells, T cells, and monocytes, with TF-gene interactions and marker genes grounded in prior biological knowledge. We walk through individual evaluation metrics step by step. We then demonstrate how to run the full benchmark across all 12 datasets and 25 tasks, and how to visualize and compare results against the precomputed scores of the 20 methods evaluated in the original GRETA manuscript.

On a single dataset#

Let’s start by importing the toy dataset, which contains a multiome dataset as a MuData object, and a GRN that has been “inferred” from this data as a pandas dataframe.

Note

It is a requirement that features in the GRN, like CREs and Genes, are available as features in the multiome dataset, else it will throw an error.

import gretapy as gt
import pandas as pd

mdata, grn = gt.ds.toy()
mdata
MuData object with n_obs × n_vars = 60 × 93
  obs:	'celltype'
  2 modalities
    rna:	60 x 30
    atac:	60 x 63
grn.head()
source cre target score
0 PAX5 chr16-28926950-28927450 CD19 0.85
1 PAX5 chr16-28936000-28936500 CD19 0.72
2 PAX5 chr11-60443809-60444309 MS4A1 0.78
3 PAX5 chr19-41827233-41827733 CD79A 0.72
4 PAX5 chr18-63073346-63073846 BCL2 0.68

For this vignette, we use the toy dataset to keep computations fast. In practice, gretapy provides 12 pre-processed multiome datasets spanning diverse human and mouse tissues and cell types, which can be listed with gt.show_datasets().

gt.show_datasets()
organism name pubmed geo
0 hg38 Brain 38491288 GSE193688
1 hg38 Breast 39122969 None
2 hg38 Embryo 37369347 GSE218314
3 hg38 Eye 36277849 GSE196235
4 hg38 Heart 37438528 None
5 hg38 Kidney 37333123 None
6 hg38 Lung 39266564 GSE241468
7 hg38 PBMC None None
8 hg38 Pituitary 35263594 GSE178454
9 hg38 Repr. Fibroblasts 37873116 GSE242419
10 hg38 Skin 36121124 GSE207335
11 mm10 Palate 38280850 GSE218576

These can then be easily downloaded with gt.ds.read_dts, for example:

mdata = gt.ds.read_dts(organism='hg38', dts_name='PBMC')

Note

Any data gretapy downloads, it will be available at ./gretapy_data

Next, we check which metrics are available for the given organism:

metrics = gt.show_metrics(organism='hg38')
metrics
category metric db
0 Genomic CRE to Gene eQTL Catalogue
1 Genomic CREs ENCODE Blacklist
2 Genomic CREs ENCODE CREs
3 Genomic CREs GWAS Catalog
4 Genomic CREs Promoters
5 Genomic CREs Zhang21
6 Genomic CREs phastCons
7 Genomic TF Binding ChIP-Atlas
8 Genomic TF Binding ReMap 2022
9 Genomic TF Binding UniBind
10 Literature Reference GRN CollecTRI
11 Literature TF Markers HPA
12 Literature TF Markers TF-Marker
13 Literature TF Pairs Europe PMC
14 Literature TF Pairs IntAct
15 Mechanistic Perturbation Forecasting KnockTF (forecasting)
16 Mechanistic Steady State Simulation Boolean rules
17 Mechanistic TF Scoring KnockTF (scoring)
18 Predictive Gene Sets Hallmarks
19 Predictive Gene Sets KEGG
20 Predictive Gene Sets PROGENy
21 Predictive Gene Sets Reactome
22 Predictive Omics CRE ~ TFs
23 Predictive Omics Gene ~ CREs
24 Predictive Omics Gene ~ TFs

As mentioned in the introduction, some metrics filter their reference databases using tissue- or cell-type-relevant terms to ensure context-specific evaluation. Since our toy dataset simulates a PBMC-like tissue, we reuse the terms associated with the PBMC dataset. These terms are passed to gt.tl.eval_grn_dataset and used to restrict, for example, TF binding peaks to those annotated in relevant blood cell types.

Note

When working with a custom multiome dataset not available in gretapy you will need to define which terms to filter for. You can check the full list of terms across evaluation databases by running:

gt.show_terms(organism='hg38')

Here’s how to retrieve them:

terms = gt.get_terms('PBMC')

To evaluate a GRN against a dataset, use the gt.tl.eval_grn_dataset function. The metrics argument controls which metrics to run: you can pass a specific database or subtask name (e.g. TF-Marker), a metric name (e.g. TF Markers), or a metric category (e.g. Literature). If set to None, all available metrics are run. Below is an example evaluating the TF-Marker database from the TF Markers task in the Literature category:

df = gt.tl.eval_grn_dataset(
    organism='hg38',
    grn=grn,
    dataset=mdata,
    terms=terms,
    metrics='TF-Marker'
)
df
class task db precision recall f01
0 Literature TF Markers TF-Marker 0.666667 0.444444 0.663383

The output reports the category, task, database or subtask name, along with precision, recall, and F0.1F_{0.1} score. Let’s try another example, the ChIP-Atlas database from the TF Binding task within the Genomic category:

df = gt.tl.eval_grn_dataset(
    organism='hg38',
    grn=grn,
    dataset=mdata,
    terms=terms,
    metrics='ChIP-Atlas'
)
df
class task db precision recall f01
0 Genomic TF Binding ChIP-Atlas 0.057692 0.038961 0.057419

Of course, we can also run multiple databases or metrics in a single call. Here we repeat the same two evaluations together:

df = gt.tl.eval_grn_dataset(
    organism='hg38',
    grn=grn,
    dataset=mdata,
    terms=terms,
    metrics=['TF-Marker', 'ChIP-Atlas']
)
df
class task db precision recall f01
0 Genomic TF Binding ChIP-Atlas 0.057692 0.038961 0.057419
1 Literature TF Markers TF-Marker 0.666667 0.444444 0.663383

On multiple datasets#

To run the evaluation across multiple datasets and metrics, one could loop the gt.tl.eval_grn_dataset function, or use the gt.tl.benchmark function directly instead.

This function accepts as input a dictionary of GRNs called grns. It’s format should be:

grn_dict = {
  "MethodA": {
    "hg38": {
      "PBMC": pd.DataFrame,
      ...
    }
  }
}

For each method, for each organism, for each dataset have a GRN as a pandas dataframe. Then simply run:

df = gt.tl.benchmark(grns=grn_dict)

Note

This line can take from 1 to 2 hours to run depending on your internet connection and number of CPUs available.

It will also download 40 GB total space of datasets and evaluation databases, so make sure you have enough space on disk.

If you have a custom GRN inference method that you would like to benchmark across all gretapy datasets, here’s the logic you should follow:

import gretapy as gt

# Get available datasets
lst_dts = gt.show_datasets()[['organism', 'name']].values

for org, dts in lst_dts:
    # Retrieve individual dataset
    mdata = gt.ds.read_dts(organism=org, dts_name=dts, verbose=True)
    # Run custom GRN method
    grn = ...  # use mdata as input
    # Save
    grn_dict['CollecTRI'][org][dts] = grn

# Run benchmark
mth_res = gt.tl.benchmark(grns=grn_dict)

Once it runs it should look like this example result for an imaginary method:

mth_res = gt.ds.read_imaginary_metrics()
mth_res
class task db organism dataset name precision recall f01
0 Predictive Gene Sets Hallmarks hg38 Eye ImaginaryMethod 0.200000 0.250000 0.200397
1 Predictive Gene Sets Hallmarks hg38 Kidney ImaginaryMethod 0.181818 0.800000 0.183220
2 Predictive Gene Sets Hallmarks hg38 Breast ImaginaryMethod 0.217391 0.909091 0.219041
3 Predictive Gene Sets Hallmarks hg38 Skin ImaginaryMethod 0.571429 0.320000 0.567018
4 Predictive Gene Sets Hallmarks hg38 Brain ImaginaryMethod 0.071429 0.500000 0.072040
... ... ... ... ... ... ... ... ... ...
286 Literature TF Pairs Europe PMC hg38 Embryo ImaginaryMethod 0.014936 0.173477 0.015073
287 Literature TF Pairs Europe PMC hg38 Heart ImaginaryMethod 0.056566 0.058639 0.056585
288 Literature TF Pairs Europe PMC hg38 PBMC ImaginaryMethod 0.057269 0.016414 0.055891
289 Literature TF Pairs Europe PMC hg38 Pituitary ImaginaryMethod 0.127566 0.182965 0.127950
290 Literature TF Pairs Europe PMC hg38 Lung ImaginaryMethod 0.259259 0.014768 0.222747

291 rows × 9 columns

We can then concatenate the results from our imaginary method with the original results from the gretapy manuscript.

old_res = gt.ds.read_metrics()
old_res
class task db organism dataset name precision recall f01
0 Predictive Gene Sets Hallmarks hg38 Eye CellOracle 0.500000 0.750000 0.501656
1 Predictive Gene Sets Hallmarks hg38 Eye CollecTRI 0.080000 1.000000 0.080735
2 Predictive Gene Sets Hallmarks hg38 Eye CREMA 0.181818 0.500000 0.182971
3 Predictive Gene Sets Hallmarks hg38 Eye Dictys 0.187500 0.750000 0.188903
4 Predictive Gene Sets Hallmarks hg38 Eye DIRECT-NET 0.111111 0.500000 0.111973
... ... ... ... ... ... ... ... ... ...
5565 Literature TF Pairs Europe PMC hg38 Lung SCENIC 0.119403 0.008439 0.105649
5566 Literature TF Pairs Europe PMC hg38 Lung SCENIC+ 0.061911 0.048523 0.061743
5567 Literature TF Pairs Europe PMC hg38 Lung scGPT 0.043919 0.013713 0.042982
5568 Literature TF Pairs Europe PMC hg38 Lung scMTNI 0.000000 0.000000 0.000000
5569 Literature TF Pairs Europe PMC hg38 Lung Spearman 0.014317 0.210970 0.014451

5570 rows × 9 columns

res = pd.concat([mth_res, old_res])
res
class task db organism dataset name precision recall f01
0 Predictive Gene Sets Hallmarks hg38 Eye ImaginaryMethod 0.200000 0.250000 0.200397
1 Predictive Gene Sets Hallmarks hg38 Kidney ImaginaryMethod 0.181818 0.800000 0.183220
2 Predictive Gene Sets Hallmarks hg38 Breast ImaginaryMethod 0.217391 0.909091 0.219041
3 Predictive Gene Sets Hallmarks hg38 Skin ImaginaryMethod 0.571429 0.320000 0.567018
4 Predictive Gene Sets Hallmarks hg38 Brain ImaginaryMethod 0.071429 0.500000 0.072040
... ... ... ... ... ... ... ... ... ...
5565 Literature TF Pairs Europe PMC hg38 Lung SCENIC 0.119403 0.008439 0.105649
5566 Literature TF Pairs Europe PMC hg38 Lung SCENIC+ 0.061911 0.048523 0.061743
5567 Literature TF Pairs Europe PMC hg38 Lung scGPT 0.043919 0.013713 0.042982
5568 Literature TF Pairs Europe PMC hg38 Lung scMTNI 0.000000 0.000000 0.000000
5569 Literature TF Pairs Europe PMC hg38 Lung Spearman 0.014317 0.210970 0.014451

5861 rows × 9 columns

We can then visualize the updated results in the summary heatmap plot at the metric class level:

gt.pl.ranking(res, level='class', figsize=(5, 5))
../_images/ce65843b9052a502e0b00fe7022cf1d69facc4423584df11135a1ae855e7b5d9.png

This heatmap color indicates mean F0.1F_{0.1} scores, and the numbers indicate the ranking inside each metric class (lower is better).

Alternatively, we can visualize the results at finer resolution, grouped by individual evaluation task, to see which specific databases or subtasks drive the differences between methods:

gt.pl.ranking(res, level='task')
../_images/ef012f674c1019f91fc41380888691a421d6c64b8f9c2a6d7b78fc02c9cf9708.png

In this case, ImaginaryMethod does not manage to outperform the top-performing methods in the benchmark. To understand where a method falls short, inspecting the task-level plot can help identify which metric categories drive its ranking.