scDesign3

Here scDesign3 method will be demonstrated clearly and hope that this document can help you.

Estimating parameters from a real dataset

Before simulating datasets, it is important to estimate some essential parameters from a real dataset in order to make the simulated data more real. If you do not have a single-cell transcriptomics count matrix now, you can use the data collected in simmethods package by simmethods:data command.

library(simmethods)
library(SingleCellExperiment)
# Load data
ref_data <- simmethods::data

Default estimation

estimate_result <- simmethods::scDesign3_estimation(
  ref_data = ref_data,
  verbose = TRUE,
  seed = 111
)
# Estimating parameters using scDesign3
# Convert Residuals to Multivariate Gaussian
# Converting End
# Copula group A starts

Information of cell groups

If the information of cell groups is available, you can use another way to estimate the parameters.

## cell groups
group_condition <- as.numeric(simmethods::group_condition)
estimate_result <- simmethods::scDesign3_estimation(
  ref_data = ref_data,
  other_prior = list(group.condition = group_condition),
  verbose = TRUE,
  seed = 111
)
# Estimating parameters using scDesign3
# Convert Residuals to Multivariate Gaussian
# Converting End
# Copula group Group1 starts
# Copula group Group2 starts

Simulating datasets using scDesign3

After estimating parameter from a real dataset, we will simulate a dataset based on the learned parameters with different scenarios.

  1. Datasets with default parameters
  2. Simulate two or more groups
  3. Simulate two or more batches
  4. Simulate data with cellular differentiation trajectory
  5. Simulate spatial transcriptome data

Datasets with default parameters

The reference data contains 160 cells and 4000 genes, if we simulate datasets with default parameters and then we will obtain a new data which has the same size as the reference data.

simulate_result <- simmethods::scDesign3_simulation(
  parameters = estimate_result[["estimate_result"]],
  return_format = "SCE",
  seed = 111
)
# nCells: 160
# nGenes: 4000
# nGroups: 2
# nBatches: 1
# Use Copula to sample a multivariate quantile matrix
# Sample Copula group Group1 starts
# Sample Copula group Group2 starts

We will get two or groups if information of cell groups is used in estimation step.

SCE_result <- simulate_result[["simulate_result"]]
dim(SCE_result)
# [1] 4000  160
table(colData(SCE_result)$group)
# 
# Group1 Group2 
#     80     80

We can not set the cell or gene number in scDesign3 unless a data frame which contains covariates of targeted simulated data from construct_data function is provided through new_covariate parameter.

Simulate two or more groups

In scDesign3, we can not set nGroups directly and should specify the group labels of cells in the estimation step in scDesign3_estimation function.

We randomly assign four group labels to cells.

set.seed(666)
estimate_result <- simmethods::scDesign3_estimation(
  ref_data = ref_data,
  other_prior = list(group.condition = sample(1:4, ncol(ref_data), replace = TRUE)),
  verbose = TRUE,
  seed = 111
)
# Estimating parameters using scDesign3
# Convert Residuals to Multivariate Gaussian
# Converting End
# Copula group Group2 starts
# Copula group Group4 starts
# Copula group Group3 starts
# Copula group Group1 starts
simulate_result <- simmethods::scDesign3_simulation(
  parameters = estimate_result[["estimate_result"]],
  return_format = "SCE",
  seed = 111
)
# nCells: 160
# nGenes: 4000
# nGroups: 4
# nBatches: 1
# Use Copula to sample a multivariate quantile matrix
# Sample Copula group Group2 starts
# Sample Copula group Group4 starts
# Sample Copula group Group3 starts
# Sample Copula group Group1 starts

Check cell groups

SCE_result <- simulate_result[["simulate_result"]]
table(colData(SCE_result)$group)
# 
# Group1 Group2 Group3 Group4 
#     29     43     43     45

Simulate two or more batches

In scDesign3, we can not set nBatches directly and should the batch labels of cells in the estimation step in scDesign3_estimation function.

If you custom the simulated cell number which is not equal to that of the real data, the batch information for simulated cells is not returned.

Simulate datasets with cellular differentiation trajectory

Before modeling from the reference data, we should construct dynwrap::expression data using dynwrap::wrap_expression function and perform trajectory inference using Slingshot using tislingshot::ti_slingshot or other methods.

if(!requireNamespace("tislingshot", quietly = TRUE)){
  devtools::install_github("dynverse/ti_slingshot/package/")
}
dyn_data <- dynwrap::wrap_expression(
  counts = t(ref_data),
  expression = log2(t(ref_data) + 1)
)
dyn_data <- dynwrap::add_grouping(dyn_data, group_condition)
dyn_model <- dynwrap::infer_trajectory(dyn_data, tislingshot::ti_slingshot())
# Using full covariance matrix

The traj parameter must be TURE and dynwrap_data is needed when you want to estimate parameters from trajectory data.

estimate_result <- simmethods::scDesign3_estimation(
  ref_data = ref_data,
  other_prior = list(traj = TRUE,
                     group.condition = group_condition,
                     dynwrap_data = dyn_model),
  verbose = TRUE,
  seed = 111
)
# Constructing lineages for the data...
# Estimating parameters using scDesign3
# Convert Residuals to Multivariate Gaussian
# Converting End
# Copula group Group1 starts
# Copula group Group2 starts

Next, we can simulate trajectory data using scDesign3.

simulate_result <- simmethods::scDesign3_simulation(
  parameters = estimate_result[["estimate_result"]],
  return_format = "list",
  seed = 111
)
# nCells: 160
# nGenes: 4000
# nGroups: 2
# nBatches: 1
# Use Copula to sample a multivariate quantile matrix
# Sample Copula group Group1 starts
# Sample Copula group Group2 starts

Before visualization, make sure that you have already installed several R packages:

if(!requireNamespace("dyndimred", quietly = TRUE)){install.packages("dyndimred")}
if(!requireNamespace("dynplot", quietly = TRUE)){install.packages("dynplot")}

First we should wrap the data into a standard object:

dyn_object <- dynwrap::wrap_expression(counts = t(simulate_result$simulate_result$count_data),
                                       expression = log2(t(simulate_result$simulate_result$count_data) + 1))

Next, we infer the trajectory using SlingShot which has been proved to be the most best method to do this:

model <- dynwrap::infer_trajectory(dataset = dyn_object,
                                   method = tislingshot::ti_slingshot(),
                                   parameters = NULL,
                                   give_priors = NULL,
                                   seed = 666,
                                   verbose = TRUE)
# Executing 'slingshot' on '20230911_114342__data_wrapper__3AFJQcaDVH'
# With parameters: list(cluster_method = "pam", ndim = 20L, shrink = 1L, reweight = TRUE,     reassign = TRUE, thresh = 0.001, maxit = 10L, stretch = 2L,     smoother = "smooth.spline", shrink.method = "cosine")
# inputs: expression
# priors :
# Using full covariance matrix

Finally, we can plot the trajectory after performing dimensionality reduction:

dimred <- dyndimred::dimred_umap(dyn_object$expression)
dynplot::plot_dimred(model, dimred = dimred)
# Coloring by milestone
# Using milestone_percentages from trajectory

For more details about trajectory inference and visualization, please check dynverse.

Simulate spatial transcriptome data

scDesign3 can also simulate spatial transcriptome data. Besides the gene expression profile, users should also provide the spatial coordinates of each cell (spot). The reference data can be downloaded here.

# Load data (downloaded from https://zenodo.org/record/8251596/files/data118_spatial_OV.rds?download=1)
data <- readRDS("../../../../preprocessed_data/data118_spatial_OV.rds")
ref_data <- t(as.matrix(data$data$counts))

In addition, we can set the spatial coordinates by spatial.x and spatial.y parameters.

other_prior <- list(spatial.x = data$data_info$spatial_coordinate$x,
                    spatial.y = data$data_info$spatial_coordinate$y,
                    group.condition = as.numeric(as.factor(data$data_info$cluster_info)))

Execute the parameter estimation:

estimate_result <- simmethods::scDesign3_estimation(
  ref_data = ref_data,
  other_prior = other_prior,
  verbose = T,
  seed = 10
)
# Estimating parameters using scDesign3
# Convert Residuals to Multivariate Gaussian
# Converting End
# Copula group Group2 starts
# Copula group Group1 starts

Simulate spatial transcriptome data:

simulate_result <- simmethods::scDesign3_simulation(
  parameters = estimate_result[["estimate_result"]],
  return_format = "list",
  seed = 111
)
# nCells: 3492
# nGenes: 1056
# nGroups: 2
# nBatches: 1
# Use Copula to sample a multivariate quantile matrix
# Sample Copula group Group2 starts
# Sample Copula group Group1 starts

Notably, the simulated cells can match the real ones, so that the spatial coordinates are the same as the previously input ones.

Visualize the spatial spots:

library(ggplot2)
location <- simulate_result$simulate_result$col_meta
p <- ggplot(location, aes(x = x, y = y))+
  geom_point(aes(color = group))+
  theme(panel.grid = element_blank(),
        axis.title = element_blank(),
        axis.text = element_blank(),
        legend.position = "bottom")
p