scMultiSim

Here scMultiSim method will be demonstrated clearly and hope that this document can help you.

Estimating parameters from a real dataset

Before simulating datasets, it is important to estimate some essential parameters from a real dataset in order to make the simulated data more real.

library(simmethods)
library(SingleCellExperiment)
# Load data
ref_data <- simmethods::data
estimate_result <- simmethods::scMultiSim_estimation(
  ref_data = ref_data,
  verbose = T,
  seed = 10
)
# Estimating parameters using scMultiSim
# Loading required package: amap
# Computing nearest neighbor graph
# Computing SNN
# Your data has 3 groups

Users can also input the group information of cells:

group <- as.numeric(simmethods::group_condition)
estimate_result <- simmethods::scMultiSim_estimation(
  ref_data = ref_data,
  other_prior = list(group.condition = group),
  verbose = T,
  seed = 10
)
# Estimating parameters using scMultiSim

The estimation result contains an object with phylo data structure:

plot(estimate_result[["estimate_result"]][["phylo"]])

Simulating datasets using scMultiSim

  1. Datasets with default parameters
  2. Determin the number of cells and genes
  3. Simulate groups
  4. Simulate batches
  5. Simulate cellular differential trajectory

Datasets with default parameters

The reference data contains 160 cells and 4000 genes, if we simulate datasets with default parameters and then we will obtain a new data which has the same size as the reference data.

simulate_result <- simmethods::scMultiSim_simulation(
  parameters = estimate_result$estimate_result,
  other_prior = NULL,
  return_format = "SCE",
  seed = 111
)
# nCells: 160
# nGenes: 4000
# nGroups: 2
# nBatches: 1
# Time spent: 0.17 mins
# Adding experimental noise...
# 50..100..150..Using atac_counts
# Time spent: 0.54 mins
SCE_result <- simulate_result[["simulate_result"]]
dim(SCE_result)
# [1] 4000  160
head(colData(SCE_result))
# DataFrame with 6 rows and 2 columns
#         cell_name       group
#       <character> <character>
# cell1       cell1      group1
# cell2       cell2      group1
# cell3       cell3      group1
# cell4       cell4      group2
# cell5       cell5      group2
# cell6       cell6      group2

Determin the number of cells and genes

simulate_result <- simmethods::scMultiSim_simulation(
  parameters = estimate_result$estimate_result,
  other_prior = list(nCells = 500,
                     nGenes = 1000),
  return_format = "SCE",
  seed = 111
)
# nCells: 500
# nGenes: 1000
# nGroups: 2
# nBatches: 1
# Time spent: 0.15 mins
# Adding experimental noise...
# 50..100..150..200..250..300..350..400..450..500..Using atac_counts
# Time spent: 0.45 mins
SCE_result <- simulate_result[["simulate_result"]]
dim(SCE_result)
# [1] 1000  500

Simulate groups

In scMultiSim, we can not set nGroups directly and should set prob.group instead. For example, if we want to simulate 2 groups, we can type other_prior = list(prob.group = c(0.5, 0.5)). Note that the sum of prob.group numeric vector must equal to 1, so we can also set prob.group = c(0.3, 0.7).

The group number should be equal to that used or detected in the estimation step, otherwise the error may occur.

In addtion, if we want to simulate three or more groups, we should obey the rules:

  • The length of prob.group vector must always equal to the number of groups.
  • The sum of prob.group numeric vector must equal to 1.

For demonstration, we will simulate two groups using the learned parameters.

simulate_result <- simmethods::scMultiSim_simulation(
  parameters = estimate_result$estimate_result,
  other_prior = list(prob.group = c(0.4, 0.6)),
  return_format = "list",
  seed = 111
)
# nCells: 160
# nGenes: 4000
# nGroups: 2
# nBatches: 1
# Time spent: 0.18 mins
# Adding experimental noise...
# 50..100..150..Using atac_counts
# Time spent: 0.56 mins
col_data <- simulate_result[["simulate_result"]][["col_meta"]]
table(col_data$group)
# 
# group1 group2 
#     64     96

Simulate batches

In scMultiSim, we can set nBatches directly to simulate cell batches.

For demonstration, we will simulate two batches.

simulate_result <- simmethods::scMultiSim_simulation(
  parameters = estimate_result$estimate_result,
  other_prior = list(nBatches = 2),
  return_format = "list",
  seed = 111
)
# nCells: 160
# nGenes: 4000
# nGroups: 2
# nBatches: 2
# Time spent: 0.17 mins
# Adding experimental noise...
# 50..100..150..Using atac_counts
# Time spent: 0.55 mins
# Adding batch effects...
## cell information
col_data <- simulate_result[["simulate_result"]][["col_meta"]]
table(col_data$batch)
# 
# Batch1 Batch2 
#     82     78

Simulate cellular differential trajectory

The parameter estimation step is the same as that demonstrated above. If you want to simulate datasets with trajectory, you should specify the parameter traj in other_prior to TRUE.

simulate_result <- simmethods::scMultiSim_simulation(
  parameters = estimate_result$estimate_result,
  other_prior = list(traj = TRUE),
  return_format = "list",
  seed = 111
)
# Simulating datasets with trajectory.../n
# nCells: 160
# nGenes: 4000
# nGroups: 2
# nBatches: 1
# Time spent: 0.16 mins
# Adding experimental noise...
# 50..100..150..Using atac_counts
# Time spent: 0.99 mins

Before visualization, Make sure that you have already installed several R packages:

if(!requireNamespace("dynwrap", quietly = TRUE)){install.packages("dynwrap")}
if(!requireNamespace("dyndimred", quietly = TRUE)){install.packages("dyndimred")}
if(!requireNamespace("dynplot", quietly = TRUE)){install.packages("dynplot")}
if(!requireNamespace("tislingshot", quietly = TRUE)){devtools::install_github("dynverse/ti_slingshot/package/")}

First we should wrap the data into a standard object:

data <- simulate_result$simulate_result$count_data
dyn_object <- dynwrap::wrap_expression(counts = t(data),
                                       expression = log2(t(data) + 1))

Next, we infer the trajectory using SlingShot which has been proved to be the most best method to do this:

model <- dynwrap::infer_trajectory(dataset = dyn_object,
                                   method = tislingshot::ti_slingshot(),
                                   parameters = NULL,
                                   give_priors = NULL,
                                   seed = 111,
                                   verbose = TRUE)
# Executing 'slingshot' on '20230903_172409__data_wrapper__Wad2hj4sH8'
# With parameters: list(cluster_method = "pam", ndim = 20L, shrink = 1L, reweight = TRUE,     reassign = TRUE, thresh = 0.001, maxit = 10L, stretch = 2L,     smoother = "smooth.spline", shrink.method = "cosine")
# inputs: expression
# priors :
# Using full covariance matrix

Finally, we can plot the trajectory after performing dimensionality reduction:

dimred <- dyndimred::dimred_umap(dyn_object$expression)
dynplot::plot_dimred(model, dimred = dimred)
# Coloring by milestone
# Using milestone_percentages from trajectory

For more details about trajectory inference and visualization, please check dynverse.