This document demonstrates how to use the dyngen method, and we hope it is helpful.
Before simulating datasets, it is important to estimate some essential parameters from a real dataset so that the simulated data are more realistic.
library(simmethods)
# Load data (downloaded from https://zenodo.org/record/8251596/files/data82_cellbench-SC1_luyitian.rds?download=1)
data <- readRDS("../../../../preprocessed_data/data82_cellbench-SC1_luyitian.rds")
ref_data <- t(as.matrix(data$data$counts))
estimate_result <- simmethods::dyngen_estimation(
  ref_data = ref_data,
  other_prior = NULL,
  verbose = TRUE,
  seed = 111
)
# Performing k-means and determin the best number of clusters...
# Add grouping to data...
# Estimating parameters using dyngen
# Executing 'slingshot' on '20230924_105246__data_wrapper__p1tJGIOUko'
# With parameters: list(cluster_method = "pam", ndim = 20L, shrink = 1L, reweight = TRUE, reassign = TRUE, thresh = 0.001, maxit = 10L, stretch = 2L, smoother = "smooth.spline", shrink.method = "cosine")
# inputs: expression
# priors :
# Using full covariance matrix
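The call to t() when building ref_data suggests that the stored counts have cells in rows, and that the estimation step expects genes in rows and cells in columns. The sketch below illustrates that orientation check on a small hypothetical matrix; counts_cells_by_genes and toy_ref_data are illustrative names, not objects from the CellBench file.

```r
# Hypothetical toy counts stored as cells x genes (not the CellBench data)
set.seed(111)
counts_cells_by_genes <- matrix(
  rpois(5 * 3, lambda = 4),
  nrow = 5, ncol = 3,
  dimnames = list(paste0("cell", 1:5), paste0("gene", 1:3))
)

# Transpose so that genes are rows and cells are columns
toy_ref_data <- t(counts_cells_by_genes)

# Inspect the shape and the row names before passing the matrix on
dim(toy_ref_data)
rownames(toy_ref_data)
```

Checking dim() and rownames() on your own ref_data before estimation is a cheap way to catch a matrix that was transposed twice or not at all.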
If cell group information is available, you can pass it to the estimation step instead of letting the groups be inferred by k-means clustering.
## cell groups
group_condition <- as.numeric(data$data_info$group_condition)
estimate_result <- simmethods::dyngen_estimation(
  ref_data = ref_data,
  other_prior = list(group.condition = group_condition),
  verbose = TRUE,
  seed = 111
)
# Add grouping to data...
# Estimating parameters using dyngen
# Executing 'slingshot' on '20230924_105250__data_wrapper__lC7s9gqYTq'
# With parameters: list(cluster_method = "pam", ndim = 20L, shrink = 1L, reweight = TRUE, reassign = TRUE, thresh = 0.001, maxit = 10L, stretch = 2L, smoother = "smooth.spline", shrink.method = "cosine")
# inputs: expression
# priors :
# Using full covariance matrix
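The example above converts the stored group labels with as.numeric(). If your labels are character strings rather than a factor or numbers, as.numeric() alone returns NA, so a conversion through factor() is one option. The sketch below uses hypothetical labels; that the method expects exactly one numeric label per cell is an assumption here.

```r
# Hypothetical character labels, one per cell
labels <- c("A", "B", "C", "A", "C")

# Go through a factor so that each distinct label maps to an integer code
group_condition <- as.numeric(factor(labels))
group_condition

# Basic checks: one label per cell (assumed requirement) and no missing values
n_cells <- 5
stopifnot(length(group_condition) == n_cells, !anyNA(group_condition))

# Number of cells per group
table(group_condition)
```

The integer codes follow the sorted order of the factor levels, so keep the original labels if you need to map the groups back later.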
After estimating the parameters from a real dataset, we simulate datasets from the learned parameters under different scenarios.
The reference data contains 157 cells and 1770 genes. If we simulate with the default parameters, we expect a new dataset of about the same size as the reference data (the run below reports 154 cells and 1770 genes).
simulate_result <- simmethods::dyngen_simulation(
  parameters = estimate_result[["estimate_result"]],
  return_format = "SCE",
  seed = 111
)
# nCells: 154
# nGenes: 1770
# Generating TF network
# Sampling feature network from real network
# Generating kinetics for 1770 features
# Generating formulae
# Generating gold standard mod changes
# Precompiling reactions for gold standard
# Running gold simulations
#
| | 0 % elapsed=00s
|======== | 14% elapsed=00s, remaining~01s
|=============== | 29% elapsed=00s, remaining~01s
|====================== | 43% elapsed=00s, remaining~01s
|============================= | 57% elapsed=01s, remaining~00s
|==================================== | 71% elapsed=01s, remaining~00s
|=========================================== | 86% elapsed=01s, remaining~00s
|==================================================| 100% elapsed=01s, remaining~00s
# Precompiling reactions for simulations
# Running 1 simulations
# Mapping simulations to gold standard
# Performing dimred
# Simulating experiment
# Wrapping dataset as list
# as(<dgeMatrix>, "dgCMatrix") is deprecated since Matrix 1.5-0; do as(., "CsparseMatrix") instead
SCE_result <- simulate_result[["simulate_result"]]
dim(SCE_result)
# [1] 1770 154
In dyngen, we can set the nCells and nGenes parameters to specify the number of cells and genes to simulate.
Here, we simulate a new dataset with 100 cells and 100 genes:
simulate_result <- simmethods::dyngen_simulation(
  parameters = estimate_result[["estimate_result"]],
  return_format = "list",
  other_prior = list(nCells = 100,
                     nGenes = 100),
  seed = 111
)
# nCells: 100
# nGenes: 100
# Generating TF network
# Sampling feature network from real network
# Generating kinetics for 100 features
# Generating formulae
# Generating gold standard mod changes
# Precompiling reactions for gold standard
# Running gold simulations
#
| | 0 % elapsed=00s
|======== | 14% elapsed=00s, remaining~01s
|=============== | 29% elapsed=00s, remaining~01s
|====================== | 43% elapsed=00s, remaining~01s
|============================= | 57% elapsed=01s, remaining~00s
|==================================== | 71% elapsed=01s, remaining~00s
|=========================================== | 86% elapsed=01s, remaining~00s
|==================================================| 100% elapsed=01s, remaining~00s
# Precompiling reactions for simulations
# Running 1 simulations
# Mapping simulations to gold standard
# Performing dimred
# Simulating experiment
# Wrapping dataset as list
result <- simulate_result[["simulate_result"]][["count_data"]]
dim(result)
# [1] 100 100
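Because the simulated matrix here is 100 by 100, dim() alone does not reveal which axis holds the genes. The sketch below shows a few sanity checks on a hypothetical toy matrix; toy_counts and its dimnames are illustrative, and whether count_data carries dimnames is an assumption to verify on your own output.

```r
# Hypothetical 100 x 100 count matrix standing in for the simulated counts
set.seed(111)
toy_counts <- matrix(
  rpois(100 * 100, lambda = 2),
  nrow = 100, ncol = 100,
  dimnames = list(paste0("gene", 1:100), paste0("cell", 1:100))
)

# Counts should be non-negative whole numbers
is_count_like <- all(toy_counts >= 0) && all(toy_counts == round(toy_counts))
is_count_like

# Dimnames, when present, show which axis holds genes and which holds cells
head(rownames(toy_counts), 3)
head(colnames(toy_counts), 3)

# Fraction of zero entries, a quick look at sparsity
zero_fraction <- mean(toy_counts == 0)
zero_fraction
```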
dyngen may need a large amount of memory when simulating new datasets, so users should keep an eye on the computational resources in use.
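As a rough point of reference, the size of a dense numeric matrix with the reference dimensions can be estimated in base R. This sketch only measures one R object and is not a measurement of dyngen's internal memory use, which is likely to be considerably higher during simulation; n_genes, n_cells, and m are illustrative names.

```r
# Dimensions of the reference data described above
n_genes <- 1770
n_cells <- 157

# A dense double matrix needs about 8 bytes per entry
approx_mb <- n_genes * n_cells * 8 / 1024^2
approx_mb

# Compare with the size R reports for an actual matrix of that shape
m <- matrix(0, nrow = n_genes, ncol = n_cells)
format(object.size(m), units = "MB")
```

If memory is tight, starting with smaller nCells and nGenes values and scaling up while watching the system monitor is one cautious approach.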