xcore vignette
migdal
xcore_vignette.Rmd
Introduction
xcore
package provides a framework for transcription
factor (TF) activity modeling based on their molecular signatures and
user’s gene expression data. Additionally, xcoredata
package provides a collection of pre-processed TF molecular signatures
constructed from ChiP-seq experiments available in ReMap2020 or ChIP-Atlas databases.
xcore
use ridge regression to model changes in
expression as a linear combination of molecular signatures and find
their unknown activities. Obtained, estimates can be further tested for
significance to select molecular signatures with the highest predicted
effect on the observed expression changes.
Gene expression modeling in context of rinderpest infection
library("xcore")
Here we will use a subset of 293SLAM rinderpest infection dataset
from FANTOM5.
This subset contains expression counts for 0, 12 and 24 hours post
infection samples and only for a subset of FANTOM5 promoters. We can
find this example data shipped together with xcore
package.
This is a simple count matrix, with columns corresponding to samples and rows corresponding to FANTOM5 promoters.
## 00hr_rep1 00hr_rep2 00hr_rep3
## hg19::chr1:10003372..10003465,-;hg_10258.1 52 46 57
## hg19::chr1:10003486..10003551,+;hg_541.1 39 42 27
## hg19::chr1:100111580..100111773,+;hg_4181.1 1 0 2
## hg19::chr1:100232177..100232198,-;hg_13495.1 15 9 26
## hg19::chr1:100315613..100315691,+;hg_4187.1 95 109 110
## hg19::chr1:100435545..100435597,+;hg_4201.1 141 129 101
## 12hr_rep1 12hr_rep2 12hr_rep3
## hg19::chr1:10003372..10003465,-;hg_10258.1 54 50 53
## hg19::chr1:10003486..10003551,+;hg_541.1 35 30 40
## hg19::chr1:100111580..100111773,+;hg_4181.1 1 2 3
## hg19::chr1:100232177..100232198,-;hg_13495.1 20 16 13
## hg19::chr1:100315613..100315691,+;hg_4187.1 112 94 103
## hg19::chr1:100435545..100435597,+;hg_4201.1 132 106 125
## 24hr_rep1 24hr_rep2 24hr_rep3
## hg19::chr1:10003372..10003465,-;hg_10258.1 11 12 12
## hg19::chr1:10003486..10003551,+;hg_541.1 22 34 50
## hg19::chr1:100111580..100111773,+;hg_4181.1 0 1 1
## hg19::chr1:100232177..100232198,-;hg_13495.1 7 13 10
## hg19::chr1:100315613..100315691,+;hg_4187.1 43 74 89
## hg19::chr1:100435545..100435597,+;hg_4201.1 84 100 121
First we need to construct a design matrix describing our experiment design.
design <- matrix(
data = c(1, 0, 0,
1, 0, 0,
1, 0, 0,
0, 1, 0,
0, 1, 0,
0, 1, 0,
0, 0, 1,
0, 0, 1,
0, 0, 1),
ncol = 3,
nrow = 9,
byrow = TRUE,
dimnames = list(
c(
"00hr_rep1",
"00hr_rep2",
"00hr_rep3",
"12hr_rep1",
"12hr_rep2",
"12hr_rep3",
"24hr_rep1",
"24hr_rep2",
"24hr_rep3"
),
c("00hr", "12hr", "24hr")
)
)
print(design)
## 00hr 12hr 24hr
## 00hr_rep1 1 0 0
## 00hr_rep2 1 0 0
## 00hr_rep3 1 0 0
## 12hr_rep1 0 1 0
## 12hr_rep2 0 1 0
## 12hr_rep3 0 1 0
## 24hr_rep1 0 0 1
## 24hr_rep2 0 0 1
## 24hr_rep3 0 0 1
Next, we need to preprocess the counts using
prepareCountsForRegression
function. Here, CAGE expression
tags for each sample are filtered for lowly expressed promoters,
normalized for the library size and transformed into counts per million
(CPM). Finally, CPM are log2 transformed with addition of pseudo count
1. Moreover, we designate the base level samples, 0 hours after
treatment in our example, from which basal expression level is
calculated. This basal level will be used as a reference when modeling
the expression changes.
mae <- prepareCountsForRegression(counts = rinderpest_mini,
design = design,
base_lvl = "00hr")
xcore
models the expression as a function of molecular
signatures. Such signatures can be constructed eg. from the known
transcription factor binding sites (see Constructing molecular
signatures section). Here, we will take advantage of pre-computed
molecular signatures found in the xcoredata
package.
Particularly, molecular signatures constructed from ReMap2020 against
FANTOM5 annotation.
Molecular signatures can be conveniently accessed using the ExperimentHub interface.
library("ExperimentHub")
eh <- ExperimentHub()
query(eh, "xcoredata")
## ExperimentHub with 8 records
## # snapshotDate(): 2022-10-31
## # $dataprovider: ReMap, RIKEN, NA, ChIP-Atlas
## # $species: Homo sapiens, NA
## # $rdataclass: dgCMatrix, data.table, character, GRanges
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## # rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["EH7298"]]'
##
## title
## EH7298 | chip_atlas_meta
## EH7299 | remap_meta
## EH7300 | chip_atlas_promoters_f5
## EH7301 | remap_promoters_f5
## EH7302 | promoters_f5
## EH7303 | promoters_f5_core
## EH7699 | entrez2fantom
## EH7700 | symbol2fantom
remap_promoters_f5 <- eh[["EH7301"]]
Molecular signature is a simple binary matrix indicating if a transcription factor binding site was found or not found in promoter vicinity.
print(remap_promoters_f5[3:6, 3:6])
## 4 x 4 sparse Matrix of class "dgCMatrix"
## MLLT1.GM12878.ENCSR552XSN
## hg19::chr1:10003372..10003465,-;hg_10258.1 1
## hg19::chr1:10003486..10003551,+;hg_541.1 1
## hg19::chr1:100110807..100110818,+;hg_4172.1 1
## hg19::chr1:100110851..100110860,+;hg_4173.1 1
## ZBTB10.HEK293.ENCSR004PLU
## hg19::chr1:10003372..10003465,-;hg_10258.1 1
## hg19::chr1:10003486..10003551,+;hg_541.1 1
## hg19::chr1:100110807..100110818,+;hg_4172.1 .
## hg19::chr1:100110851..100110860,+;hg_4173.1 .
## ZFP3.HEK293.ENCSR134QIE
## hg19::chr1:10003372..10003465,-;hg_10258.1 .
## hg19::chr1:10003486..10003551,+;hg_541.1 .
## hg19::chr1:100110807..100110818,+;hg_4172.1 .
## hg19::chr1:100110851..100110860,+;hg_4173.1 .
## ZNF2.HEK293.ENCSR011CKE
## hg19::chr1:10003372..10003465,-;hg_10258.1 1
## hg19::chr1:10003486..10003551,+;hg_541.1 1
## hg19::chr1:100110807..100110818,+;hg_4172.1 .
## hg19::chr1:100110851..100110860,+;hg_4173.1 .
Computational resources consideration
Running xcore
using extensive ReMap2020 or ChIP-Atlas
molecular signatures, can be quite time and memory consuming. As an
example, modeling the example 293SLAM rinderpest infection dataset with
ReMap2020 signatures matrix took around 10 minutes to compute and used
up to 8 GB RAM on Intel(R) Xeon(R) CPU E5-2680 v3 using 2 cores. This
unfortunately exceeds the resources available for Bioconductor
vignettes. This being said, we will further proceed by taking only a
subset of ReMap2020 signatures such that we fit into the time limits.
However, for running xcore
in a normal setting the whole
molecular signatures matrix should be used!
# here we subset ReMap2020 molecular signatures matrix for the purpose of the
# vignette only! In a normal setting the whole molecular signatures matrix should
# be used!
set.seed(432123)
i <- sample(x = seq_len(ncol(remap_promoters_f5)), size = 100, replace = FALSE)
remap_promoters_f5 <- remap_promoters_f5[, i]
To add signatures to our MultiAssayExperiment
object we
can use addSignatures
function. As you add your signatures
remember to give them unique names.
mae <- addSignatures(mae, remap = remap_promoters_f5)
When we examine newly added signatures, we can see that some of them does not overlap any of the promoters. On the other side, depending on the signatures quality, we could expect to see signatures that overlap all of the promoters.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 99.5 1300.5 2499.3 4148.0 9561.0
While, we do not provide detailed guidelines to signatures filtering,
filtering out at least signatures overlapping no or all promoters is
mandatory. Here, we filter signatures that overlap less than 5% or more
than 95% of promoters using filterSignatures
function.
mae <- filterSignatures(mae, min = 0.05, max = 0.95)
At this stage we can investigate mae
to see how many
signatures are available after intersecting with counts and filtering.
number of columns in signatures corresponds to number of molecular
signatures which will be used to build a model.
print(mae)
## A MultiAssayExperiment object of 3 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 3:
## [1] U: matrix with 10388 rows and 1 columns
## [2] Y: matrix with 10388 rows and 6 columns
## [3] remap: dgCMatrix with 10388 rows and 63 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
Finally, we can run our expression modeling using
modelGeneExpression
function.
modelGeneExpression
can take advantage of parallelization
to speed up the calculations. To use it we need to register a parallel
backend.
Parallel computing
While there are many parallel backends to choose from, internally
xcore
uses foreach
to implement parallel computing. Having this in mind we should use a
backend supported by foreach
.
Here we are using doParallel
backend, together with BiocParallel
package providing unified interface across different OS. Those packages
can be installed with:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("BiocParallel")
install.packages("doParallel")
Our modeling will inherently contain some level of randomness due to using cross validation, here we set the seed so that our results can be replicated.
This step is time consuming as we need to train a separate model for
each of our samples. To speed up the calculations we will lower number
of folds used in cross-validation procedure from 10 to 5 using
nfolds
argument, which is internally passed to
cv.glmnet
function. Moreover here we only use a subset of
the ReMap2020 molecular signatures matrix.
# register parallel backend
doParallel::registerDoParallel(2L)
BiocParallel::register(BiocParallel::DoparParam(), default = TRUE)
# set seed
set.seed(314159265)
res <- modelGeneExpression(
mae = mae,
xnames = "remap",
nfolds = 5)
Output returned by modelGeneExpression
is a list with
following elements:
-
regression_models
- a list holdingcv.glmnet
objects for each sample. -
pvalues
- list ofdata.frame
s for each sample, each holding signatures estimates, their estimated standard errors and p-values. -
zscore_avg
- amatrix
holding replicate average molecular signatures Z-scores with columns corresponding to groups in the design. -
coef_avg
- amatrix
holding replicate average molecular signatures activities with columns corresponding to groups in the design. -
results
- adata.frame
holding replicate average molecular signatures and overall molecular signatures Z-score calculated over all groups in the design.
results
is likely of most interest. As you can see below
this data.frame
holds replicate averaged molecular
signatures activities, as well as overall Z-score which can be used to
rank signatures.
head(res$results$remap)
## name 12hr 24hr z_score
## 1 MYC.NCI-H2171.GSE36354 -0.12113152 -0.20342963 52.31218
## 2 E2F4.lymphoblastoid.GSE21488 0.04555700 0.20390516 40.62562
## 3 ATF3.K-562.ENCSR632DCH 0.03149811 0.18768057 34.85842
## 4 NELFA.HeLa_Flavo-0-H2O2.GSE100742 -0.03869524 -0.14910961 24.26357
## 5 MXD4.Hep-G2.ENCSR441KFW -0.07112564 -0.07581347 18.57259
## 6 MYC.HFF_OHT.GSE65544 0.03285668 -0.06889608 17.78755
To visualize the result we can plot a heatmap of molecular signatures activities for the top identified molecular signatures.
top_signatures <- head(res$results$remap, 10)
library(pheatmap)
pheatmap::pheatmap(
mat = top_signatures[, c("12hr", "24hr")],
scale = "none",
labels_row = top_signatures$name,
cluster_cols = FALSE,
color = colorRampPalette(c("blue", "white", "red"))(15),
breaks = seq(from = -0.1, to = 0.1, length.out = 16),
main = "SLAM293 molecular signatures activities"
)
As we only used a random subset of ReMap2020 molecular signatures the obtained result most likely has no biological meaning. Nonetheless, some general comment on results interpretation can be made. The top-scoring transcription factors can be hypothesized to be associated with observed changes in gene expression. Moreover, the predicted activities indicate if promoters targeted by a specific signature tend to be downregulated or upregulated.
Constructing molecular signatures
Constructing molecular signatures is as simple as intersecting
promoters annotation with transcription factors binding sites (TFBS).
For example, here we use FANTOM5’ promoters annotation found in the
xcoredata
package and predicted CTCF binding sites
available in CTCF
AnnotationHub
’s resource.
Promoter annotations and TFBS should be stored as a
GRanges
object. Moreover, it is required that the
promoter/TF name is held under the name
column. Next we can
construct a molecular signatures matrix using
getInteractionMatrix
function, where we can specify the
ext
parameter that will control how much the promoters are
extended in both directions before intersecting with TFBS.
# obtain promoters annotation; *promoter name must be stored under column 'name'
promoters_f5 <- eh[["EH7303"]]
head(promoters_f5)
## GRanges object with 6 ranges and 5 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr1 9943315-9943407 - | hg19::chr1:10003372... 102331
## [2] chr1 9943429-9943493 + | hg19::chr1:10003486... 69018
## [3] chr1 99646025-99646217 + | hg19::chr1:100111580.. 87706
## [4] chr1 99766622-99766642 - | hg19::chr1:100232177.. 25838
## [5] chr1 99850058-99850135 + | hg19::chr1:100315613.. 92180
## [6] chr1 99969990-99970041 + | hg19::chr1:100435545.. 180010
## gene_type_gencode ENTREZID SYMBOL
## <factor> <character> <character>
## [1] protein_coding 84328 LZIC
## [2] protein_coding 64802 NMNAT1
## [3] protein_coding 54873 PALMD
## [4] protein_coding 391059 FRRS1
## [5] protein_coding 178 AGL
## [6] protein_coding 23443 SLC35A3
## -------
## seqinfo: 25 sequences (1 circular) from hg38 genome
# obtain predicted CTCF binding for hg38;
# *TF/motif name must be stored under column 'name'
library("AnnotationHub")
ah <- AnnotationHub()
CTCF_hg38 <- ah[["AH104724"]]
CTCF_hg38$name <- "CTCF"
head(CTCF_hg38)
## GRanges object with 6 ranges and 3 metadata columns:
## seqnames ranges strand | 5PrimeGene 3PrimeGene name
## <Rle> <IRanges> <Rle> | <character> <character> <character>
## [1] chr1 100038106-100038125 * | SLC35A3 HIAT1 CTCF
## [2] chr1 100868553-100868572 * | - EXTL2 CTCF
## [3] chr1 100921277-100921296 * | EXTL2 SLC30A7 CTCF
## [4] chr1 101204233-101204252 * | - EDG1 CTCF
## [5] chr1 101288599-101288618 * | EDG1 - CTCF
## [6] chr1 101357558-101357577 * | - - CTCF
## -------
## seqinfo: 24 sequences from hg38 genome
# construct molecular signatures matrix
molecular_signature <- getInteractionMatrix(
a = promoters_f5,
b = CTCF_hg38,
ext = 500
)
head(molecular_signature)
## 6 x 1 sparse Matrix of class "dgCMatrix"
## CTCF
## hg19::chr1:10003372..10003465,-;hg_10258.1 .
## hg19::chr1:10003486..10003551,+;hg_541.1 .
## hg19::chr1:100111580..100111773,+;hg_4181.1 .
## hg19::chr1:100232177..100232198,-;hg_13495.1 .
## hg19::chr1:100315613..100315691,+;hg_4187.1 .
## hg19::chr1:100435545..100435597,+;hg_4201.1 .
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] GenomicRanges_1.50.2 GenomeInfoDb_1.34.9 IRanges_2.32.0
## [4] S4Vectors_0.36.2 pheatmap_1.0.12 glmnet_4.1-7
## [7] Matrix_1.4-1 xcoredata_1.2.0 ExperimentHub_2.6.0
## [10] AnnotationHub_3.6.0 BiocFileCache_2.6.1 dbplyr_2.3.3
## [13] BiocGenerics_0.44.0 xcore_1.1.5 BiocStyle_2.26.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 matrixStats_1.0.0
## [3] fs_1.6.3 bit64_4.0.5
## [5] RColorBrewer_1.1-3 doParallel_1.0.17
## [7] filelock_1.0.2 httr_1.4.6
## [9] rprojroot_2.0.3 tools_4.2.0
## [11] bslib_0.5.0 utf8_1.2.3
## [13] R6_2.5.1 colorspace_2.1-0
## [15] DBI_1.1.3 withr_2.5.0
## [17] tidyselect_1.2.0 bit_4.0.5
## [19] curl_5.0.1 compiler_4.2.0
## [21] textshaping_0.3.6 cli_3.6.1
## [23] Biobase_2.58.0 desc_1.4.2
## [25] DelayedArray_0.24.0 bookdown_0.34
## [27] sass_0.4.7 scales_1.2.1
## [29] rappdirs_0.3.3 pkgdown_2.0.7
## [31] systemfonts_1.0.4 stringr_1.5.0
## [33] digest_0.6.33 rmarkdown_2.23
## [35] XVector_0.38.0 pkgconfig_2.0.3
## [37] htmltools_0.5.5 MatrixGenerics_1.10.0
## [39] highr_0.10 fastmap_1.1.1
## [41] limma_3.54.2 rlang_1.1.1
## [43] RSQLite_2.3.1 shiny_1.7.4.1
## [45] shape_1.4.6 jquerylib_0.1.4
## [47] generics_0.1.3 jsonlite_1.8.7
## [49] BiocParallel_1.32.6 dplyr_1.1.2
## [51] RCurl_1.98-1.12 magrittr_2.0.3
## [53] GenomeInfoDbData_1.2.9 munsell_0.5.0
## [55] Rcpp_1.0.11 fansi_1.0.4
## [57] lifecycle_1.0.3 stringi_1.7.12
## [59] yaml_2.3.7 edgeR_3.40.2
## [61] SummarizedExperiment_1.28.0 zlibbioc_1.44.0
## [63] grid_4.2.0 blob_1.2.4
## [65] parallel_4.2.0 promises_1.2.0.1
## [67] crayon_1.5.2 lattice_0.20-45
## [69] Biostrings_2.66.0 splines_4.2.0
## [71] KEGGREST_1.38.0 locfit_1.5-9.8
## [73] knitr_1.43 pillar_1.9.0
## [75] codetools_0.2-18 glue_1.6.2
## [77] BiocVersion_3.16.0 evaluate_0.21
## [79] BiocManager_1.30.21.1 MultiAssayExperiment_1.24.0
## [81] png_0.1-8 vctrs_0.6.3
## [83] httpuv_1.6.11 foreach_1.5.2
## [85] gtable_0.3.3 purrr_1.0.1
## [87] cachem_1.0.8 BiocBaseUtils_1.0.0
## [89] xfun_0.39 mime_0.12
## [91] xtable_1.8-4 later_1.3.1
## [93] ragg_1.2.5 survival_3.3-1
## [95] tibble_3.2.1 iterators_1.0.14
## [97] AnnotationDbi_1.60.2 memoise_2.0.1
## [99] ellipsis_0.3.2 interactiveDisplayBase_1.36.0