SparseDC

Here SparseDC method will be demonstrated clearly and hope that this document can help you.

Estimating parameters from a real dataset

Before simulating datasets, it is important to estimate some essential parameters from a real dataset in order to make the simulated data more real.

library(simmethods)
library(SingleCellExperiment)
# Load data
ref_data <- SingleCellExperiment::counts(scater::mockSCE())
dim(ref_data)
# [1] 2000  200
set.seed(111)
group_condition <- sample(1:2, ncol(ref_data), replace = TRUE)

When you use SparseDC to estimate parameters from a real dataset, you must input a numeric vector to specify the groups or plates that each cell comes from, like other_prior = list(group.condition = the numeric vector).

Note that SparseDC defines 2 clusters presented in the dataset by default and users can input other number if the estimation step failed through nclusters parameter.

estimate_result <- simmethods::SparseDC_estimation(
  ref_data = ref_data,
  other_prior = list(group.condition = group_condition,
                     nclusters = 2),
  verbose = T,
  seed = 10
)
# Estimating parameters using SparseDC

SparseDC is not stable, and some datasets can not be estimated due to the failing estimation.

Simulating datasets using SparseDC

After estimating parameter from a real dataset, we will simulate a dataset based on the learned parameters with different scenarios.

  1. Datasets with default parameters
  2. Determin the number of cells and genes
  3. Simulate two or more groups

Datasets with default parameters

The reference data contains 200 cells and 2000 genes, if we simulate datasets with default parameters and then we will obtain a new data which has the same size as the reference data. In addtion, the simulated dataset will have one group of cells.

simulate_result <- simmethods::SparseDC_simulation(
  parameters = estimate_result[["estimate_result"]],
  return_format = "SCE",
  seed = 111
)
# nCells: 200
# nGenes: 2000
SCE_result <- simulate_result[["simulate_result"]]
dim(SCE_result)
# [1] 2000  200

Determin the number of cells and genes

In SparseDC, we can set nCells and nGenes to specify the number of cells and genes.

Here, we simulate a new dataset with 1000 cells and 1000 genes:

Note that SparseDC defines 2 clusters in the estimation step by default and the number of cells is defined as nCells * nclusters.

simulate_result <- simmethods::SparseDC_simulation(
  parameters = estimate_result[["estimate_result"]],
  return_format = "list",
  other_prior = list(nCells = 500,
                     nGenes = 1000),
  seed = 111
)
# nCells: 1000
# nGenes: 1000
result <- simulate_result[["simulate_result"]][["count_data"]]
dim(result)
# [1] 1000 1000

Simulate two or more groups

In SparseDC, the number of groups is determined by the group information (group.condition parameter) used in estimation step. If the group information contains two cell states, and the simulated dataset will contain two groups of cells.

The demonstrations above use two groups of cells in the eatimation step, and the result will hold two groups.

## cell information
cell_info <- simulate_result[["simulate_result"]][["col_meta"]]
table(cell_info$group)
# 
# Group1 Group2 
#    500    500