Here powsimR method will be demonstrated clearly and hope that this document can help you.
Before simulating datasets, it is important to estimate some essential parameters from a real dataset in order to make the simulated data more real. If you do not have a single-cell transcriptomics count matrix now, you can use the data collected in simmethods package by simmethods:data
command.
library(simmethods)
library(SingleCellExperiment)
# Load data
ref_data <- simmethods::data
powsimR provides some choices for users to select suitable parameters according to different types of data, platforms, normalization methods, distributions and so on.
RNAseq
“bulk” or “singlecell” (default).Protocol
Options are “UMI” (default) (e.g. 10X Genomics, CEL-seq2) or “Read” (e.g. Smart-seq2).Distribution
“NB” (default) for negative binomial or “ZINB” for zero-inflated negative binomial distribution fitting.Normalisation
“TMM” (default), “MR”, “PosCounts”, “UQ”, “scran”, “Linnorm”, “SCnorm”, “Census”, “depth”, “none”.estimate_result <- powsimR_estimation(
ref_data = ref_data,
other_prior = list(RNAseq = "singlecell",
Protocol = "UMI",
Normalisation = "scran"),
verbose = TRUE,
seed = 111)
# Warning: replacing previous import 'DECENT::lrTest' by 'MAST::lrTest' when
# loading 'powsimR'
# Warning: replacing previous import 'penalized::predict' by 'stats::predict' when
# loading 'powsimR'
# Warning: replacing previous import 'zinbwave::glmWeightedF' by
# 'zingeR::glmWeightedF' when loading 'powsimR'
# Estimating parameters using estimateParam function
powsimR also provides an another choice to estimate parameters (not neccessary) via spike-ins. If users want to use this, make sure that the reference data must contain ERCC spike-in counts. In addtion, users must set dilution.factor and volume information by other_prior = list(dilution.factor = xxx, volume = xxx)
.
rownames(ref_data)[grep(pattern = "^ERCC", x = rownames(ref_data))]
# [1] "ERCC-00002" "ERCC-00003" "ERCC-00004" "ERCC-00009" "ERCC-00014"
# [6] "ERCC-00019" "ERCC-00022" "ERCC-00025" "ERCC-00034" "ERCC-00035"
# [11] "ERCC-00042" "ERCC-00043" "ERCC-00044" "ERCC-00046" "ERCC-00051"
# [16] "ERCC-00053" "ERCC-00054" "ERCC-00059" "ERCC-00060" "ERCC-00062"
# [21] "ERCC-00069" "ERCC-00071" "ERCC-00074" "ERCC-00076" "ERCC-00078"
# [26] "ERCC-00079" "ERCC-00084" "ERCC-00092" "ERCC-00095" "ERCC-00096"
# [31] "ERCC-00099" "ERCC-00108" "ERCC-00111" "ERCC-00112" "ERCC-00113"
# [36] "ERCC-00116" "ERCC-00130" "ERCC-00131" "ERCC-00136" "ERCC-00144"
# [41] "ERCC-00145" "ERCC-00148" "ERCC-00154" "ERCC-00157" "ERCC-00160"
# [46] "ERCC-00162" "ERCC-00163" "ERCC-00165" "ERCC-00170" "ERCC-00171"
Make sure there are ERCC names in reference data and users must input the dilution.factor and volume (microliter) to determine the concentration of ERCC molecules.
estimate_result <- powsimR_estimation(
ref_data = ref_data,
other_prior = list(RNAseq = "singlecell",
Protocol = "UMI",
Normalisation = "scran",
dilution.factor = 50000,
volume = 1),
verbose = TRUE,
seed = 111)
# Estimating parameters using estimateParam function
After estimating parameter from a real dataset, we will simulate a dataset based on the learned parameters with different scenarios.
powsimR provides some choices for users to select suitable parameters according to different normalization methods, and methods for differential expressed analysis.
Normalisation
“TMM” (default), “MR”, “PosCounts”, “UQ”, “scran”, “Linnorm”, “sctransform”, “SCnorm”, “Census”, “depth”.DEmethod
“T-Test”, “edgeR-LRT”, “edgeR-QL”, “edgeR-zingeR”, “edgeR-ZINB-WaVE”, “limma-voom”, “limma-trend” (default), “DESeq2”, “DESeq2-zingeR”, “DESeq2-ZINB-WaVE”, “ROTS”, “baySeq”, “NOISeq”, “EBSeq”, “MAST”, “BPSC”, “scDD”, “DECENT”.The reference data contains 160 cells and 4000 genes, if we simulate datasets with default parameters and then we will obtain a new data which has the same size as the reference data. In addtion, the simulated dataset will have one group and one batch of cells.
simulate_result <- simmethods::powsimR_simulation(
parameters = estimate_result[["estimate_result"]],
return_format = "SCE",
verbose = TRUE,
seed = 111)
# nCells: 160
# nGenes: 4000
# nGroups: 2
# de.prob: 0.1
# fc.group: 2
# nBatches: 1
#
# Setup Seed: 111
# You have chosen to simulate the expression of 4000 genes, which will be randomly drawn with replacement from the observed expression of 4000 genes.
# Simulating datasets using powsimR
# limma-trend is developed for bulk RNA-seq experiments.
# Preparing output arrays.
#
# SIMULATION NUMBER 1
# Generating gene expression.
# Generating spike-in expression.
# 80 vs. 80
# Applying TMM normalisation
# Applying limma-trend for DE analysis on raw count data.
# Saving raw simulated counts.
# Estimating moments of raw count data.
SCE_result <- simulate_result[["simulate_result"]]
dim(SCE_result)
# [1] 4000 160
head(colData(SCE_result))
# DataFrame with 6 rows and 2 columns
# cell_name group
# <character> <character>
# Cell1 Cell1 Group1
# Cell2 Cell2 Group1
# Cell3 Cell3 Group1
# Cell4 Cell4 Group1
# Cell5 Cell5 Group1
# Cell6 Cell6 Group1
Time consuming:
simulate_result$simulate_detection$Elapsed_Time_sec
# [1] 0.982
Here, we simulate a new dataset with 500 cells and 1000 genes:
simulate_result <- simmethods::powsimR_simulation(
parameters = estimate_result[["estimate_result"]],
other_prior = list(nCells = 500,
nGenes = 1000),
return_format = "list",
verbose = TRUE,
seed = 111)
# nCells: 500
# nGenes: 1000
# nGroups: 2
# de.prob: 0.1
# fc.group: 2
# nBatches: 1
#
# Setup Seed: 111
# You have chosen to simulate the expression of 1000 genes, which will be randomly drawn without replacement from the observed expression of 4000 genes.
# Simulating datasets using powsimR
# limma-trend is developed for bulk RNA-seq experiments.
# Preparing output arrays.
#
# SIMULATION NUMBER 1
# Generating gene expression.
# Generating spike-in expression.
# 250 vs. 250
# Applying TMM normalisation
# Applying limma-trend for DE analysis on raw count data.
# Saving raw simulated counts.
# Estimating moments of raw count data.
result <- simulate_result[["simulate_result"]][["count_data"]]
dim(result)
# [1] 1000 500
In powsimR, we can not set nGroups
directly and should set prob.group
instead. For example, if we want to simulate 2 groups, we can type other_prior = list(prob.group = c(0.5, 0.5))
. Note that the sum of prob.group
numeric vector must equal to 1, so we can also set prob.group = c(0.3, 0.7)
.
In addtion, if we want to simulate three or more groups, we should obey the rules:
prob.group
vector must always be 2 when using powsinR.prob.group
numeric vector must equal to 1.For demonstration, we will simulate two groups using the learned parameters. (20% DEGs and 4 fold change)
simulate_result <- simmethods::powsimR_simulation(
parameters = estimate_result[["estimate_result"]],
return_format = "list",
other_prior = list(nCells = 500,
nGenes = 1000,
prob.group = c(0.3, 0.7),
de.prob = 0.2,
fc.group = 4),
seed = 111
)
# nCells: 500
# nGenes: 1000
# nGroups: 2
# de.prob: 0.2
# fc.group: 4
# nBatches: 1
#
# You have chosen to simulate the expression of 1000 genes, which will be randomly drawn without replacement from the observed expression of 4000 genes.
result <- simulate_result[["simulate_result"]][["count_data"]]
dim(result)
# [1] 1000 500
## cell information
cell_info <- simulate_result[["simulate_result"]][["col_meta"]]
table(cell_info$group)
#
# Group1 Group2
# 150 350
## gene information
gene_info <- simulate_result[["simulate_result"]][["row_meta"]]
### the proportion of DEGs
table(gene_info$de_gene)[2]/nrow(result) ## de.prob = 0.2
# yes
# 0.2
In powsimR, we can not set nBatches
directly and should set prob.batch
instead. For example, if we want to simulate 2 batches, we can type other_prior = list(prob.batch = c(0.5, 0.5))
. Note that the sum of prob.batch
numeric vector represents the total number of cells and the length of the vector equals to the number of batches.
In addtion, if we want to simulate three or more batches, we should obey the rules:
prob.batch
vector must always equal to the number of batchesprob.batch
numeric vector must equal to 1.For demonstration, we will simulate two batches using the learned parameters. (2 fold change)
simulate_result <- simmethods::powsimR_simulation(
parameters = estimate_result[["estimate_result"]],
other_prior = list(prob.batch = c(0.4, 0.6),
fc.batch = 2),
return_format = "list",
verbose = TRUE,
seed = 111)
# nCells: 160
# nGenes: 4000
# nGroups: 2
# de.prob: 0.1
# fc.group: 2
# nBatches: 2
# fc.batch: 2
# Setup Seed: 111
# You have chosen to simulate the expression of 4000 genes, which will be randomly drawn with replacement from the observed expression of 4000 genes.
# Simulating datasets using powsimR
# limma-trend is developed for bulk RNA-seq experiments.
# Preparing output arrays.
#
# SIMULATION NUMBER 1
# Generating gene expression.
# Generating spike-in expression.
# 32 vs. 32
# Applying TMM normalisation
# Applying limma-trend for DE analysis on raw count data.
# Saving raw simulated counts.
# Estimating moments of raw count data.
# 48 vs. 48
# Applying TMM normalisation
# Applying limma-trend for DE analysis on raw count data.
# Saving raw simulated counts.
# Estimating moments of raw count data.
result <- simulate_result[["simulate_result"]][["count_data"]]
dim(result)
# [1] 4000 160
## cell information
cell_info <- simulate_result[["simulate_result"]][["col_meta"]]
table(cell_info$batch)
#
# Batch1 Batch2
# 64 96
As mentioned before, we can set prob.group
and prob.batch
to determine the number of groups and batches and we can also set de.prob
to specify the proportion of DEGs. Here, we simulate a dataset with following settings:
simulate_result <- simmethods::powsimR_simulation(
parameters = estimate_result[["estimate_result"]],
return_format = "list",
other_prior = list(nCells = 1000,
nGenes = 2000,
de.prob = 0.2,
prob.group = c(0.4, 0.6),
prob.batch = c(0.5, 0.5)),
seed = 111
)
# nCells: 1000
# nGenes: 2000
# nGroups: 2
# de.prob: 0.2
# fc.group: 2
# nBatches: 2
# fc.batch: 2
# You have chosen to simulate the expression of 2000 genes, which will be randomly drawn without replacement from the observed expression of 4000 genes.
result <- simulate_result[["simulate_result"]][["count_data"]]
dim(result)
# [1] 2000 1000
## cell information
cell_info <- simulate_result[["simulate_result"]][["col_meta"]]
table(cell_info$batch)
#
# Batch1 Batch2
# 500 500
table(cell_info$group)
#
# Group1 Group2
# 400 600
## gene information
gene_info <- simulate_result[["simulate_result"]][["row_meta"]]
### proportion of DEGs
table(gene_info$de_gene)[2]/nrow(result)
# yes
# 0.2