dyngen

This document demonstrates how to use the dyngen method, from estimating parameters to simulating datasets. We hope it helps you get started.

Estimating parameters from a real dataset

Before simulating datasets, it is important to estimate some essential parameters from a real dataset so that the simulated data are more realistic.

library(simmethods)
# Load data (downloaded from https://zenodo.org/record/8251596/files/data82_cellbench-SC1_luyitian.rds?download=1)
data <- readRDS("../../../../preprocessed_data/data82_cellbench-SC1_luyitian.rds")
ref_data <- t(as.matrix(data$data$counts))
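The transpose in the last line suggests that the stored counts have cells in rows and that the reference matrix should have genes in rows and cells in columns. This is an assumption drawn from the code above, not from the package documentation. A minimal sketch with a made-up toy matrix (toy_counts and toy_ref are illustrative names) shows the orientation check:

```r
# Toy stand-in for data$data$counts: 5 cells (rows) x 3 genes (columns).
# The values are random and purely illustrative.
set.seed(1)
toy_counts <- matrix(
  rpois(15, lambda = 5),
  nrow = 5,
  dimnames = list(paste0("cell", 1:5), paste0("gene", 1:3))
)

# Transpose so that genes are in rows and cells are in columns,
# mirroring ref_data <- t(as.matrix(data$data$counts)).
toy_ref <- t(as.matrix(toy_counts))

# Quick sanity checks before estimation: orientation and count-like values.
dim(toy_ref)             # rows = genes, columns = cells
head(rownames(toy_ref))  # should look like gene identifiers
stopifnot(
  is.numeric(toy_ref),
  all(toy_ref >= 0),
  all(toy_ref == round(toy_ref))
)
```

Running the same checks on ref_data before estimation can catch a forgotten or doubled transpose early.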

Default estimation

estimate_result <- simmethods::dyngen_estimation(
  ref_data = ref_data,
  other_prior = NULL,
  verbose = TRUE,
  seed = 111
)
# Performing k-means and determin the best number of clusters...
# Add grouping to data...
# Estimating parameters using dyngen
# Executing 'slingshot' on '20230924_105246__data_wrapper__p1tJGIOUko'
# With parameters: list(cluster_method = "pam", ndim = 20L, shrink = 1L, reweight = TRUE,     reassign = TRUE, thresh = 0.001, maxit = 10L, stretch = 2L,     smoother = "smooth.spline", shrink.method = "cosine")
# inputs: expression
# priors :
# Using full covariance matrix

Information of cell groups

If cell group labels are available, you can pass them through other_prior when estimating the parameters. Judging from the log output, the k-means clustering step is then skipped.

## cell groups
group_condition <- as.numeric(data$data_info$group_condition)
estimate_result <- simmethods::dyngen_estimation(
  ref_data = ref_data,
  other_prior = list(group.condition = group_condition),
  verbose = TRUE,
  seed = 111
)
# Add grouping to data...
# Estimating parameters using dyngen
# Executing 'slingshot' on '20230924_105250__data_wrapper__lC7s9gqYTq'
# With parameters: list(cluster_method = "pam", ndim = 20L, shrink = 1L, reweight = TRUE,     reassign = TRUE, thresh = 0.001, maxit = 10L, stretch = 2L,     smoother = "smooth.spline", shrink.method = "cosine")
# inputs: expression
# priors :
# Using full covariance matrix
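One detail worth checking is how the group labels are converted to numbers. The call above uses as.numeric(), which works when the labels are stored as a factor or as numbers, but returns NA for character strings. The sketch below uses hypothetical labels (groupA, groupB, groupC) rather than the real contents of data$data_info$group_condition, which are not shown here:

```r
# Hypothetical character labels, one per cell.
labels_chr <- c("groupA", "groupB", "groupC", "groupA", "groupC")

# as.numeric() on non-numeric strings yields NA (with a warning) ...
bad <- suppressWarnings(as.numeric(labels_chr))
anyNA(bad)

# ... whereas converting through factor() yields integer codes 1..k.
group_condition <- as.numeric(factor(labels_chr))
group_condition

# There should be one label per cell, i.e. per column of the reference
# matrix. Here n_cells stands in for ncol(ref_data).
n_cells <- 5
stopifnot(length(group_condition) == n_cells, !anyNA(group_condition))
```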

Simulating datasets using dyngen

After estimating parameters from a real dataset, we simulate datasets based on the learned parameters in two scenarios:

  1. Datasets with default parameters
  2. Determine the number of cells and genes

Datasets with default parameters

The reference data contains 157 cells and 1770 genes. If we simulate with default parameters, the size of the new dataset follows the reference data; the run below reports 154 cells and 1770 genes.

simulate_result <- simmethods::dyngen_simulation(
  parameters = estimate_result[["estimate_result"]],
  return_format = "SCE",
  seed = 111
)
# nCells: 154
# nGenes: 1770
# Generating TF network
# Sampling feature network from real network
# Generating kinetics for 1770 features
# Generating formulae
# Generating gold standard mod changes
# Precompiling reactions for gold standard
# Running gold simulations
# 
  |                                                  | 0 % elapsed=00s   
  |========                                          | 14% elapsed=00s, remaining~01s
  |===============                                   | 29% elapsed=00s, remaining~01s
  |======================                            | 43% elapsed=00s, remaining~01s
  |=============================                     | 57% elapsed=01s, remaining~00s
  |====================================              | 71% elapsed=01s, remaining~00s
  |===========================================       | 86% elapsed=01s, remaining~00s
  |==================================================| 100% elapsed=01s, remaining~00s
# Precompiling reactions for simulations
# Running 1 simulations
# Mapping simulations to gold standard
# Performing dimred
# Simulating experiment
# Wrapping dataset as list
# as(<dgeMatrix>, "dgCMatrix") is deprecated since Matrix 1.5-0; do as(., "CsparseMatrix") instead
SCE_result <- simulate_result[["simulate_result"]]
dim(SCE_result)
# [1] 1770  154

Determine the number of cells and genes

In dyngen, the nCells and nGenes entries of other_prior specify the number of cells and genes to simulate. Here, we simulate a new dataset with 100 cells and 100 genes:

simulate_result <- simmethods::dyngen_simulation(
  parameters = estimate_result[["estimate_result"]],
  return_format = "list",
  other_prior = list(nCells = 100,
                     nGenes = 100),
  seed = 111
)
# nCells: 100
# nGenes: 100
# Generating TF network
# Sampling feature network from real network
# Generating kinetics for 100 features
# Generating formulae
# Generating gold standard mod changes
# Precompiling reactions for gold standard
# Running gold simulations
# 
  |                                                  | 0 % elapsed=00s   
  |========                                          | 14% elapsed=00s, remaining~01s
  |===============                                   | 29% elapsed=00s, remaining~01s
  |======================                            | 43% elapsed=00s, remaining~01s
  |=============================                     | 57% elapsed=01s, remaining~00s
  |====================================              | 71% elapsed=01s, remaining~00s
  |===========================================       | 86% elapsed=01s, remaining~00s
  |==================================================| 100% elapsed=01s, remaining~00s
# Precompiling reactions for simulations
# Running 1 simulations
# Mapping simulations to gold standard
# Performing dimred
# Simulating experiment
# Wrapping dataset as list
result <- simulate_result[["simulate_result"]][["count_data"]]
dim(result)
# [1] 100 100
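To simulate several sizes, the other_prior lists can be built programmatically. The sketch below only constructs the lists. The sizes are arbitrary examples and the names sizes and priors are illustrative. The simulation call is left as a comment because it needs simmethods and the estimated parameters from above:

```r
# All combinations of the requested numbers of cells and genes
# (arbitrary example values).
sizes <- expand.grid(nCells = c(100, 500), nGenes = c(100, 1000))

# One other_prior list per row of the grid.
priors <- lapply(seq_len(nrow(sizes)), function(i) {
  list(nCells = sizes$nCells[i], nGenes = sizes$nGenes[i])
})

str(priors[[1]])

# Each element could then be passed to the call shown above, e.g.:
# simulate_result <- simmethods::dyngen_simulation(
#   parameters = estimate_result[["estimate_result"]],
#   return_format = "list",
#   other_prior = priors[[1]],
#   seed = 111
# )
```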

dyngen may need a large amount of memory when simulating new datasets, so keep an eye on the computational resources in use.
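As a rough lower bound, the memory needed to hold the final count matrix in dense form can be estimated from its dimensions. The helper dense_matrix_bytes below is a hypothetical illustration, not part of simmethods or dyngen. It says nothing about the intermediate objects created during the simulation, which may be considerably larger:

```r
# Bytes needed to store a dense numeric matrix (8 bytes per double value),
# ignoring the small fixed overhead of the R object header.
dense_matrix_bytes <- function(n_genes, n_cells, bytes_per_value = 8) {
  n_genes * n_cells * bytes_per_value
}

# Size of a 1770 x 154 matrix, in MiB.
dense_matrix_bytes(1770, 154) / 1024^2

# Compare the estimate with what R reports for an actual matrix.
m <- matrix(0, nrow = 100, ncol = 100)
dense_matrix_bytes(100, 100)
print(object.size(m), units = "Kb")
```

Calling gc() before and after a simulation prints R's current memory usage and can give a coarse picture of how much is in use.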