Here hierarchicell method will be demonstrated clearly and hope that this document can help you.
Before simulating datasets, it is important to estimate some essential parameters from a real dataset in order to make the simulated data more real.
library(simmethods)
library(SingleCellExperiment)
# Load data
set.seed(111)
ref_data <- SingleCellExperiment::counts(scater::mockSCE())
estimate_result <- simmethods::hierarchicell_estimation(
ref_data = ref_data,
other_prior = NULL,
verbose = T,
seed = 10
)
# Filtering user input
# Genes and cells have been filtered, ready for estimating parameters
# You do not set the type of the data (Raw or Norm), we will set 'Raw' by default
# Estimating parameters using hierarchicell
# Normalizing ...
# Removing highly correlated genes
# Computing sample means, dropout rates, and dispersion ...
# Computing final data summaries ...
After estimating parameter from a real dataset, we will simulate a dataset based on the learned parameters with different scenarios.
The reference data contains 200 cells and 2000 genes, if we simulate datasets with default parameters and then we will obtain a new data which has the same size as the reference data.
hierarchicell can only simulate two groups of cells.
simulate_result <- simmethods::hierarchicell_simulation(
parameters = estimate_result[["estimate_result"]],
other_prior = NULL,
return_format = "SCE",
seed = 111
)
# nCells: 200
# nGenes: 2000
# nGroups: 2
# fc.group: 2
# Computing simulation parameters ...
# -------------------------------------------------------
# Distribution of grand means is a gamma
# with shape: 783 and rate: 0.15
# -------------------------------------------------------
# Distribution for gene-wise dropout is a gamma
# with shape: 45767.5 and rate: 48102.92
# -------------------------------------------------------
# Function for dropout SD is:
# DropoutStD = 0.07 + 0.78*DropOut + 15.41*(DropOut**2)
# -------------------------------------------------------
# Function for inter-individual SD is:
# InterStDev = 0 + 0.86*GrandMean)
# -------------------------------------------------------
# Function for dispersion is:
# exp(-10.58 + 2103.53/IntraMean)
# -------------------------------------------------------
# Simulating cells ...
# -------------------------------------------------------
# Simulating expression values ...
# -------------------------------------------------------
# All done!
SCE_result <- simulate_result[["simulate_result"]]
dim(SCE_result)
# [1] 2000 200
Some cells in the result may contain NA values across all genes due to the failing of GLM fitting.
col_data <- as.data.frame(SingleCellExperiment::colData(SCE_result))
table(col_data$group)
#
# Group1 Group2
# 102 96
The hierarchicell method is not stable and usually causes failed simulation
In hierarchicell, we can set nCells
and nGenes
to specify the number of cells and genes.
Here, we simulate a new dataset with 1000 cells and 1000 genes:
simulate_result <- simmethods::hierarchicell_simulation(
parameters = estimate_result[["estimate_result"]],
return_format = "list",
other_prior = list(nCells = 1000,
nGenes = 1000),
seed = 111
)
# nCells: 1000
# nGenes: 1000
# nGroups: 2
# fc.group: 2
# Computing simulation parameters ...
# -------------------------------------------------------
# Distribution of grand means is a gamma
# with shape: 783 and rate: 0.15
# -------------------------------------------------------
# Distribution for gene-wise dropout is a gamma
# with shape: 45767.5 and rate: 48102.92
# -------------------------------------------------------
# Function for dropout SD is:
# DropoutStD = 0.07 + 0.78*DropOut + 15.41*(DropOut**2)
# -------------------------------------------------------
# Function for inter-individual SD is:
# InterStDev = 0 + 0.86*GrandMean)
# -------------------------------------------------------
# Function for dispersion is:
# exp(-10.58 + 2103.53/IntraMean)
# -------------------------------------------------------
# Simulating cells ...
# -------------------------------------------------------
# Simulating expression values ...
# -------------------------------------------------------
# All done!
result <- simulate_result[["simulate_result"]][["count_data"]]
dim(result)
# [1] 1000 1000