zingeR

This document demonstrates how to use the zingeR method, from parameter estimation to simulation.

Estimating parameters from a real dataset

Before simulating datasets, it is important to estimate some essential parameters from a real dataset so that the simulated data are more realistic. If you do not have a single-cell transcriptomics count matrix at hand, you can use the data bundled with the simmethods package via simmethods::data.

When you use zingeR to estimate parameters from a real dataset, you must supply a numeric vector that specifies the group or plate each cell comes from, for example other_prior = list(group.condition = the numeric vector).

library(simmethods)
library(SingleCellExperiment)
# Load data
ref_data <- simmethods::data
group_condition <- simmethods::group_condition
## group_condition must be a numeric vector.
other_prior <- list(group.condition = as.numeric(group_condition))
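Because the estimation step expects one numeric label per cell, it can help to validate the vector before passing it on. The following is a minimal sketch on toy data. check_group_condition is a hypothetical helper, not part of simmethods, and it assumes that cells are the columns of the count matrix, in line with the dimensions printed later in this document.

```r
# Hypothetical helper (not part of simmethods): checks that the group
# labels are numeric and that there is exactly one label per cell.
# Assumption: cells are the columns of the count matrix.
check_group_condition <- function(counts, group_condition) {
  stopifnot(
    is.matrix(counts),
    is.numeric(group_condition),
    length(group_condition) == ncol(counts),
    !anyNA(group_condition)
  )
  invisible(TRUE)
}

# Toy data: 5 genes x 4 cells, two cells per group
toy_counts <- matrix(rpois(20, lambda = 5), nrow = 5, ncol = 4)
# Character or factor labels can be converted to numeric codes first
toy_groups <- as.numeric(factor(c("A", "A", "B", "B")))

check_group_condition(toy_counts, toy_groups)
```

With real data, the same check could be run on ref_data and as.numeric(group_condition) before building other_prior.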

Use the simmethods::zingeR_estimation function to run the estimation step.

estimate_result <- simmethods::zingeR_estimation(ref_data = ref_data,
                                                 other_prior = other_prior,
                                                 verbose = TRUE,
                                                 seed = 10)
# Estimating parameters using zingeR

Simulating datasets using zingeR

After estimating parameters from a real dataset, we simulate datasets based on the learned parameters under different scenarios.

  1. Datasets with default parameters
  2. Determine the number of cells and genes
  3. Simulate two groups

Datasets with default parameters

The reference data contains 160 cells and 4000 genes. If we simulate with default parameters, we obtain a new dataset of the same size as the reference data. Note that the log below reports nGroups: 2, because zingeR can only simulate two groups.

simulate_result <- simmethods::zingeR_simulation(
  ref_data = ref_data,
  other_prior = other_prior,
  parameters = estimate_result[["estimate_result"]],
  return_format = "SCE",
  seed = 111
)
# nCells: 160
# nGenes: 4000
# nGroups: 2
# prob.group: 0.1
# fc.group: 2
# Loading required package: edgeR
# Loading required package: limma
# 
# Attaching package: 'limma'
# The following object is masked from 'package:BiocGenerics':
# 
#     plotMA
# 
# Attaching package: 'edgeR'
# The following object is masked from 'package:SingleCellExperiment':
# 
#     cpm
# Preparing dataset. Using existing parameters.
# Sampling.
# Calculating differential expression.
# Simulating data.
SCE_result <- simulate_result[["simulate_result"]]
dim(SCE_result)
# [1] 4000  160
head(colData(SCE_result))
# DataFrame with 6 rows and 1 column
#         cell_name
#       <character>
# Cell1       Cell1
# Cell2       Cell2
# Cell3       Cell3
# Cell4       Cell4
# Cell5       Cell5
# Cell6       Cell6
head(rowData(SCE_result))
# DataFrame with 6 rows and 3 columns
#         gene_name     de_gene     de_fc
#       <character> <character> <numeric>
# Gene1       Gene1          no         0
# Gene2       Gene2          no         0
# Gene3       Gene3          no         0
# Gene4       Gene4          no         0
# Gene5       Gene5          no         0
# Gene6       Gene6          no         0

Determine the number of cells and genes

In zingeR, the number of cells and genes can only be set to values higher than those of the reference data. Here, we simulate a new dataset with 1000 cells and 5000 genes:

simulate_result <- simmethods::zingeR_simulation(
  ref_data = ref_data,
  other_prior = list(group.condition = as.numeric(group_condition),
                     nCells = 1000,
                     nGenes = 5000),
  parameters = estimate_result[["estimate_result"]],
  return_format = "list",
  seed = 111
)
# nCells: 1000
# nGenes: 5000
# nGroups: 2
# prob.group: 0.1
# fc.group: 2
# Preparing dataset. Using existing parameters.
# Sampling.
# Calculating differential expression.
# Simulating data.
result <- simulate_result[["simulate_result"]][["count_data"]]
dim(result)
# [1] 5000 1000
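The size constraint described in this section can be checked before calling the simulation. This is only a sketch: check_sim_size is a hypothetical helper, not a simmethods function, and it assumes that sizes equal to the reference (the default behaviour) are acceptable in addition to larger ones. Dimensions are given as genes first, cells second, matching the dim() output above.

```r
# Hypothetical helper (not part of simmethods): returns TRUE when the
# requested size is not smaller than the reference in either dimension.
# ref_dim is c(number of genes, number of cells), as returned by dim().
check_sim_size <- function(ref_dim, nCells, nGenes) {
  ref_genes <- ref_dim[1]
  ref_cells <- ref_dim[2]
  nCells >= ref_cells && nGenes >= ref_genes
}

# Reference size used in this document: 4000 genes x 160 cells
ref_dim <- c(4000, 160)

ok_request  <- check_sim_size(ref_dim, nCells = 1000, nGenes = 5000)  # TRUE
bad_request <- check_sim_size(ref_dim, nCells = 100,  nGenes = 5000)  # FALSE
```

With real data, dim(ref_data) could be passed as ref_dim.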

Simulate two groups

In zingeR, we can only simulate two groups. Note that zingeR does not return cell group information.

For demonstration, we simulate two groups using the learned parameters. Setting de.prob = 0.2 simulates 20% of the genes as DEGs; the call below also sets fc.group = 4.

simulate_result <- simmethods::zingeR_simulation(
  ref_data = ref_data,
  other_prior = list(group.condition = as.numeric(group_condition),
                     nCells = 1000,
                     nGenes = 5000,
                     de.prob = 0.2,
                     fc.group = 4),
  parameters = estimate_result[["estimate_result"]],
  return_format = "list",
  seed = 111
)
# nCells: 1000
# nGenes: 5000
# nGroups: 2
# prob.group: 0.2
# fc.group: 4
# Preparing dataset. Using existing parameters.
# Sampling.
# Calculating differential expression.
# Simulating data.

Because zingeR does not return cell group information, only the count matrix and the gene metadata are extracted here.

result <- simulate_result[["simulate_result"]][["count_data"]]
dim(result)
# [1] 5000 1000
## gene information
gene_info <- simulate_result[["simulate_result"]][["row_meta"]]
### the proportion of DEGs
table(gene_info$de_gene)[2]/nrow(result) ## de.prob = 0.2
# yes 
# 0.2
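Indexing the table by position, as in table(...)[2], relies on both "no" and "yes" being present. A slightly more defensive way to compute the same proportion is sketched here on a toy data frame. The toy table is a stand-in for row_meta, and its gene_name column is an assumption taken from the rowData output shown earlier.

```r
# Toy stand-in for the row_meta table: 10 genes, 2 of them flagged as DEGs.
# The gene_name column is assumed from the rowData output shown earlier.
gene_info <- data.frame(
  gene_name = paste0("Gene", 1:10),
  de_gene   = c("yes", "no", "no", "no", "yes", "no", "no", "no", "no", "no"),
  stringsAsFactors = FALSE
)

# Proportion of DEGs; yields 0 instead of NA when no gene is flagged "yes"
de_prop <- mean(gene_info$de_gene == "yes")
de_prop
# [1] 0.2

# Names of the genes flagged as differentially expressed
de_genes <- gene_info$gene_name[gene_info$de_gene == "yes"]
```

The same two lines can be applied to the gene_info object extracted from the simulation result above.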