powsimR

Here powsimR method will be demonstrated clearly and hope that this document can help you.

Estimating parameters from a real dataset

Before simulating datasets, it is important to estimate some essential parameters from a real dataset in order to make the simulated data more real. If you do not have a single-cell transcriptomics count matrix now, you can use the data collected in simmethods package by simmethods:data command.

library(simmethods)
library(SingleCellExperiment)
# Load data
ref_data <- simmethods::data

powsimR provides some choices for users to select suitable parameters according to different types of data, platforms, normalization methods, distributions and so on.

  1. RNAseq “bulk” or “singlecell” (default).
  2. Protocol Options are “UMI” (default) (e.g. 10X Genomics, CEL-seq2) or “Read” (e.g. Smart-seq2).
  3. Distribution “NB” (default) for negative binomial or “ZINB” for zero-inflated negative binomial distribution fitting.
  4. Normalisation “TMM” (default), “MR”, “PosCounts”, “UQ”, “scran”, “Linnorm”, “SCnorm”, “Census”, “depth”, “none”.

Estimate parameters without ERCC spike-in

estimate_result <- powsimR_estimation(
  ref_data = ref_data,
  other_prior = list(RNAseq = "singlecell",
                     Protocol = "UMI",
                     Normalisation = "scran"),
  verbose = TRUE,
  seed = 111)
# Warning: replacing previous import 'DECENT::lrTest' by 'MAST::lrTest' when
# loading 'powsimR'
# Warning: replacing previous import 'penalized::predict' by 'stats::predict' when
# loading 'powsimR'
# Warning: replacing previous import 'zinbwave::glmWeightedF' by
# 'zingeR::glmWeightedF' when loading 'powsimR'
# Estimating parameters using estimateParam function

Estimate parameters with ERCC spike-in

powsimR also provides an another choice to estimate parameters (not neccessary) via spike-ins. If users want to use this, make sure that the reference data must contain ERCC spike-in counts. In addtion, users must set dilution.factor and volume information by other_prior = list(dilution.factor = xxx, volume = xxx).

rownames(ref_data)[grep(pattern = "^ERCC", x = rownames(ref_data))]
#  [1] "ERCC-00002" "ERCC-00003" "ERCC-00004" "ERCC-00009" "ERCC-00014"
#  [6] "ERCC-00019" "ERCC-00022" "ERCC-00025" "ERCC-00034" "ERCC-00035"
# [11] "ERCC-00042" "ERCC-00043" "ERCC-00044" "ERCC-00046" "ERCC-00051"
# [16] "ERCC-00053" "ERCC-00054" "ERCC-00059" "ERCC-00060" "ERCC-00062"
# [21] "ERCC-00069" "ERCC-00071" "ERCC-00074" "ERCC-00076" "ERCC-00078"
# [26] "ERCC-00079" "ERCC-00084" "ERCC-00092" "ERCC-00095" "ERCC-00096"
# [31] "ERCC-00099" "ERCC-00108" "ERCC-00111" "ERCC-00112" "ERCC-00113"
# [36] "ERCC-00116" "ERCC-00130" "ERCC-00131" "ERCC-00136" "ERCC-00144"
# [41] "ERCC-00145" "ERCC-00148" "ERCC-00154" "ERCC-00157" "ERCC-00160"
# [46] "ERCC-00162" "ERCC-00163" "ERCC-00165" "ERCC-00170" "ERCC-00171"

Make sure there are ERCC names in reference data and users must input the dilution.factor and volume (microliter) to determine the concentration of ERCC molecules.

estimate_result <- powsimR_estimation(
  ref_data = ref_data,
  other_prior = list(RNAseq = "singlecell",
                     Protocol = "UMI",
                     Normalisation = "scran",
                     dilution.factor = 50000,
                     volume = 1),
  verbose = TRUE,
  seed = 111)
# Estimating parameters using estimateParam function

Simulating datasets using powsimR

After estimating parameter from a real dataset, we will simulate a dataset based on the learned parameters with different scenarios.

  1. Datasets with default parameters
  2. Determin the number of cells and genes
  3. Simulate two or more groups
  4. Simulate two or more batches
  5. Simulate more groups and batches simutaniously

powsimR provides some choices for users to select suitable parameters according to different normalization methods, and methods for differential expressed analysis.

  1. Normalisation “TMM” (default), “MR”, “PosCounts”, “UQ”, “scran”, “Linnorm”, “sctransform”, “SCnorm”, “Census”, “depth”.
  2. DEmethod “T-Test”, “edgeR-LRT”, “edgeR-QL”, “edgeR-zingeR”, “edgeR-ZINB-WaVE”, “limma-voom”, “limma-trend” (default), “DESeq2”, “DESeq2-zingeR”, “DESeq2-ZINB-WaVE”, “ROTS”, “baySeq”, “NOISeq”, “EBSeq”, “MAST”, “BPSC”, “scDD”, “DECENT”.

Datasets with default parameters

The reference data contains 160 cells and 4000 genes, if we simulate datasets with default parameters and then we will obtain a new data which has the same size as the reference data. In addtion, the simulated dataset will have one group and one batch of cells.

simulate_result <- simmethods::powsimR_simulation(
  parameters = estimate_result[["estimate_result"]],
  return_format = "SCE",
  verbose = TRUE,
  seed = 111)
# nCells: 160
# nGenes: 4000
# nGroups: 2
# de.prob: 0.1
# fc.group: 2
# nBatches: 1
# 
# Setup Seed: 111
# You have chosen to simulate the expression of 4000 genes, which will be randomly drawn with replacement from the observed expression of 4000 genes.
# Simulating datasets using powsimR
# limma-trend is developed for bulk RNA-seq experiments.
# Preparing output arrays.
# 
#   SIMULATION   NUMBER   1
# Generating gene expression.
# Generating spike-in expression.
# 80 vs. 80
# Applying TMM normalisation
# Applying limma-trend for DE analysis on raw count data.
# Saving raw simulated counts.
# Estimating moments of raw count data.
SCE_result <- simulate_result[["simulate_result"]]
dim(SCE_result)
# [1] 4000  160
head(colData(SCE_result))
# DataFrame with 6 rows and 2 columns
#         cell_name       group
#       <character> <character>
# Cell1       Cell1      Group1
# Cell2       Cell2      Group1
# Cell3       Cell3      Group1
# Cell4       Cell4      Group1
# Cell5       Cell5      Group1
# Cell6       Cell6      Group1

Time consuming:

simulate_result$simulate_detection$Elapsed_Time_sec
# [1] 0.982

Determin the number of cells and genes

Here, we simulate a new dataset with 500 cells and 1000 genes:

simulate_result <- simmethods::powsimR_simulation(
  parameters = estimate_result[["estimate_result"]],
  other_prior = list(nCells = 500,
                     nGenes = 1000),
  return_format = "list",
  verbose = TRUE,
  seed = 111)
# nCells: 500
# nGenes: 1000
# nGroups: 2
# de.prob: 0.1
# fc.group: 2
# nBatches: 1
# 
# Setup Seed: 111
# You have chosen to simulate the expression of 1000 genes, which will be randomly drawn without replacement from the observed expression of 4000 genes.
# Simulating datasets using powsimR
# limma-trend is developed for bulk RNA-seq experiments.
# Preparing output arrays.
# 
#   SIMULATION   NUMBER   1
# Generating gene expression.
# Generating spike-in expression.
# 250 vs. 250
# Applying TMM normalisation
# Applying limma-trend for DE analysis on raw count data.
# Saving raw simulated counts.
# Estimating moments of raw count data.
result <- simulate_result[["simulate_result"]][["count_data"]]
dim(result)
# [1] 1000  500

Simulate two or more groups

In powsimR, we can not set nGroups directly and should set prob.group instead. For example, if we want to simulate 2 groups, we can type other_prior = list(prob.group = c(0.5, 0.5)). Note that the sum of prob.group numeric vector must equal to 1, so we can also set prob.group = c(0.3, 0.7).

In addtion, if we want to simulate three or more groups, we should obey the rules:

  • The length of prob.group vector must always be 2 when using powsinR.
  • The sum of prob.group numeric vector must equal to 1.

For demonstration, we will simulate two groups using the learned parameters. (20% DEGs and 4 fold change)

simulate_result <- simmethods::powsimR_simulation(
  parameters = estimate_result[["estimate_result"]],
  return_format = "list",
  other_prior = list(nCells = 500,
                     nGenes = 1000,
                     prob.group = c(0.3, 0.7),
                     de.prob = 0.2,
                     fc.group = 4),
  seed = 111
)
# nCells: 500
# nGenes: 1000
# nGroups: 2
# de.prob: 0.2
# fc.group: 4
# nBatches: 1
# 
# You have chosen to simulate the expression of 1000 genes, which will be randomly drawn without replacement from the observed expression of 4000 genes.
result <- simulate_result[["simulate_result"]][["count_data"]]
dim(result)
# [1] 1000  500
## cell information
cell_info <- simulate_result[["simulate_result"]][["col_meta"]]
table(cell_info$group)
# 
# Group1 Group2 
#    150    350
## gene information
gene_info <- simulate_result[["simulate_result"]][["row_meta"]]
### the proportion of DEGs
table(gene_info$de_gene)[2]/nrow(result) ## de.prob = 0.2
# yes 
# 0.2

Simulate two or more batches

In powsimR, we can not set nBatches directly and should set prob.batch instead. For example, if we want to simulate 2 batches, we can type other_prior = list(prob.batch = c(0.5, 0.5)). Note that the sum of prob.batch numeric vector represents the total number of cells and the length of the vector equals to the number of batches.

In addtion, if we want to simulate three or more batches, we should obey the rules:

  • The length of prob.batch vector must always equal to the number of batches
  • The sum of prob.batch numeric vector must equal to 1.

For demonstration, we will simulate two batches using the learned parameters. (2 fold change)

simulate_result <- simmethods::powsimR_simulation(
  parameters = estimate_result[["estimate_result"]],
  other_prior = list(prob.batch = c(0.4, 0.6),
                     fc.batch = 2),
  return_format = "list",
  verbose = TRUE,
  seed = 111)
# nCells: 160
# nGenes: 4000
# nGroups: 2
# de.prob: 0.1
# fc.group: 2
# nBatches: 2
# fc.batch: 2
# Setup Seed: 111
# You have chosen to simulate the expression of 4000 genes, which will be randomly drawn with replacement from the observed expression of 4000 genes.
# Simulating datasets using powsimR
# limma-trend is developed for bulk RNA-seq experiments.
# Preparing output arrays.
# 
#   SIMULATION   NUMBER   1
# Generating gene expression.
# Generating spike-in expression.
# 32 vs. 32
# Applying TMM normalisation
# Applying limma-trend for DE analysis on raw count data.
# Saving raw simulated counts.
# Estimating moments of raw count data.
# 48 vs. 48
# Applying TMM normalisation
# Applying limma-trend for DE analysis on raw count data.
# Saving raw simulated counts.
# Estimating moments of raw count data.
result <- simulate_result[["simulate_result"]][["count_data"]]
dim(result)
# [1] 4000  160
## cell information
cell_info <- simulate_result[["simulate_result"]][["col_meta"]]
table(cell_info$batch)
# 
# Batch1 Batch2 
#     64     96

Simulate more groups and batches simutaniously

As mentioned before, we can set prob.group and prob.batch to determine the number of groups and batches and we can also set de.prob to specify the proportion of DEGs. Here, we simulate a dataset with following settings:

  • 1000 cells
  • 2000 genes
  • two groups with 0.2 proportion of DEGs
  • two batches
simulate_result <- simmethods::powsimR_simulation(
  parameters = estimate_result[["estimate_result"]],
  return_format = "list",
  other_prior = list(nCells = 1000,
                     nGenes = 2000,
                     de.prob = 0.2,
                     prob.group = c(0.4, 0.6),
                     prob.batch = c(0.5, 0.5)),
  seed = 111
)
# nCells: 1000
# nGenes: 2000
# nGroups: 2
# de.prob: 0.2
# fc.group: 2
# nBatches: 2
# fc.batch: 2
# You have chosen to simulate the expression of 2000 genes, which will be randomly drawn without replacement from the observed expression of 4000 genes.
result <- simulate_result[["simulate_result"]][["count_data"]]
dim(result)
# [1] 2000 1000
## cell information
cell_info <- simulate_result[["simulate_result"]][["col_meta"]]
table(cell_info$batch)
# 
# Batch1 Batch2 
#    500    500
table(cell_info$group)
# 
# Group1 Group2 
#    400    600
## gene information
gene_info <- simulate_result[["simulate_result"]][["row_meta"]]
### proportion of DEGs
table(gene_info$de_gene)[2]/nrow(result)
# yes 
# 0.2