hierarchicell

Here hierarchicell method will be demonstrated clearly and hope that this document can help you.

Estimating parameters from a real dataset

Before simulating datasets, it is important to estimate some essential parameters from a real dataset in order to make the simulated data more real.

library(simmethods)
library(SingleCellExperiment)
# Load data
set.seed(111)
ref_data <- SingleCellExperiment::counts(scater::mockSCE())
estimate_result <- simmethods::hierarchicell_estimation(
  ref_data = ref_data,
  other_prior = NULL,
  verbose = T,
  seed = 10
)
# Filtering user input
# Genes and cells have been filtered, ready for estimating parameters
# You do not set the type of the data (Raw or Norm), we will set 'Raw' by default
# Estimating parameters using hierarchicell
# Normalizing ...
# Removing highly correlated genes
# Computing sample means, dropout rates, and dispersion ...
# Computing final data summaries ...

Simulating datasets using hierarchicell

After estimating parameter from a real dataset, we will simulate a dataset based on the learned parameters with different scenarios.

  1. Datasets with default parameters
  2. Determin the number of cells and genes

Datasets with default parameters

The reference data contains 200 cells and 2000 genes, if we simulate datasets with default parameters and then we will obtain a new data which has the same size as the reference data.

hierarchicell can only simulate two groups of cells.

simulate_result <- simmethods::hierarchicell_simulation(
  parameters = estimate_result[["estimate_result"]],
  other_prior = NULL,
  return_format = "SCE",
  seed = 111
)
# nCells: 200
# nGenes: 2000
# nGroups: 2
# fc.group: 2
# Computing simulation parameters ...
# -------------------------------------------------------
# Distribution of grand means is a gamma
# with shape: 783 and rate: 0.15
# -------------------------------------------------------
# Distribution for gene-wise dropout is a gamma 
#  with shape: 45767.5 and rate: 48102.92
# -------------------------------------------------------
# Function for dropout SD is:
# DropoutStD = 0.07 + 0.78*DropOut + 15.41*(DropOut**2)
# -------------------------------------------------------
# Function for inter-individual SD is:
# InterStDev = 0 + 0.86*GrandMean)
# -------------------------------------------------------
# Function for dispersion is:
#  exp(-10.58 + 2103.53/IntraMean)
# -------------------------------------------------------
# Simulating cells ...
# -------------------------------------------------------
# Simulating expression values ...
# -------------------------------------------------------
# All done!
SCE_result <- simulate_result[["simulate_result"]]
dim(SCE_result)
# [1] 2000  200

Some cells in the result may contain NA values across all genes due to the failing of GLM fitting.

col_data <- as.data.frame(SingleCellExperiment::colData(SCE_result))
table(col_data$group)
# 
# Group1 Group2 
#    102     96

The hierarchicell method is not stable and usually causes failed simulation

Determin the number of cells and genes

In hierarchicell, we can set nCells and nGenes to specify the number of cells and genes.

Here, we simulate a new dataset with 1000 cells and 1000 genes:

simulate_result <- simmethods::hierarchicell_simulation(
  parameters = estimate_result[["estimate_result"]],
  return_format = "list",
  other_prior = list(nCells = 1000,
                     nGenes = 1000),
  seed = 111
)
# nCells: 1000
# nGenes: 1000
# nGroups: 2
# fc.group: 2
# Computing simulation parameters ...
# -------------------------------------------------------
# Distribution of grand means is a gamma
# with shape: 783 and rate: 0.15
# -------------------------------------------------------
# Distribution for gene-wise dropout is a gamma 
#  with shape: 45767.5 and rate: 48102.92
# -------------------------------------------------------
# Function for dropout SD is:
# DropoutStD = 0.07 + 0.78*DropOut + 15.41*(DropOut**2)
# -------------------------------------------------------
# Function for inter-individual SD is:
# InterStDev = 0 + 0.86*GrandMean)
# -------------------------------------------------------
# Function for dispersion is:
#  exp(-10.58 + 2103.53/IntraMean)
# -------------------------------------------------------
# Simulating cells ...
# -------------------------------------------------------
# Simulating expression values ...
# -------------------------------------------------------
# All done!
result <- simulate_result[["simulate_result"]][["count_data"]]
dim(result)
# [1] 1000 1000