SPARSim

Here SPARSim method will be demonstrated clearly and hope that this document can help you.

Estimating parameters from a real dataset

Before simulating datasets, it is important to estimate some essential parameters from a real dataset in order to make the simulated data more real. If you do not have a single-cell transcriptomics count matrix now, you can use the data collected in simmethods package by simmethods:data command.

library(simmethods)
library(SingleCellExperiment)
# Load data
ref_data <- simmethods::data

Estimation without cell group information

In SPARSim, users can estimate parameters from real data only using gene expression matrix.

Using simmethods::SPARSim_estimation command to execute the estimation step.

estimate_result <- simmethods::SPARSim_estimation(
  ref_data = ref_data,
  other_prior = NULL,
  verbose = TRUE,
  seed = 111
)
# Estimating parameters using SPARSim
# [1] "Experimental condition 1"
# [1] "...estimating gene intensity"
# [1] "...estimating gene variability"
# [1] "...estimating library size"
# [1] "...creating SPARSim simulation parameter"

Estimation with cell group information

In addition, you can also input a numeric vector to specify the groups or plates that each cell comes from, like other_prior = list(group.condition = the numeric vector).

group <- as.numeric(simmethods::group_condition)
estimate_result <- simmethods::SPARSim_estimation(
  ref_data = ref_data,
  other_prior = list(group.condition = group),
  verbose = TRUE,
  seed = 111
)
# Estimating parameters using SPARSim
# [1] "Experimental condition 1"
# [1] "...estimating gene intensity"
# [1] "...estimating gene variability"
# [1] "...estimating library size"
# [1] "...creating SPARSim simulation parameter"
# [1] "Experimental condition 2"
# [1] "...estimating gene intensity"
# [1] "...estimating gene variability"
# [1] "...estimating library size"
# [1] "...creating SPARSim simulation parameter"

Estimation with spike-in genes

In SPARSim, ERCC spike-in genes can be used to estimate data parameters from the real data. In this case, the gene matrix must contain spike-in gene counts and hold the right ERCC spike-in gene names. Note that users must also input the dilution factor and volume (microliter) which the experiment used.

group <- as.numeric(simmethods::group_condition)
other_prior <- list(group.condition = group,
                    dilution.factor = 50000,
                    volume = 0.01)
estimate_result <- simmethods::SPARSim_estimation(
  ref_data = ref_data,
  other_prior = other_prior,
  verbose = TRUE,
  seed = 111
)
# Estimating parameters using SPARSim
# [1] "Experimental condition 1"
# [1] "...estimating gene intensity"
# [1] "...estimating gene variability"
# [1] "...estimating library size"
# [1] "...creating SPARSim simulation parameter"
# [1] "Experimental condition 2"
# [1] "...estimating gene intensity"
# [1] "...estimating gene variability"
# [1] "...estimating library size"
# [1] "...creating SPARSim simulation parameter"

Check spike-in parameters

spikein_params <- estimate_result[["estimate_result"]][["SPARSim_spikein_parameter"]]
str(spikein_params)
# List of 4
#  $ spikein_set       :List of 1
#   ..$ spikein:List of 4
#   .. ..$ mix_name   : chr "spikein"
#   .. ..$ abundance  : Named num [1:50] 3613.2 903.3 225.8 112.9 56.5 ...
#   .. .. ..- attr(*, "names")= chr [1:50] "spikein_1" "spikein_2" "spikein_3" "spikein_4" ...
#   .. ..$ variability: Named num [1:50] 0 0 0 0 0 0 0 0 0 0 ...
#   .. .. ..- attr(*, "names")= chr [1:50] "spikein_1" "spikein_2" "spikein_3" "spikein_4" ...
#   .. ..$ ids        : chr [1:50] "spikein_1" "spikein_2" "spikein_3" "spikein_4" ...
#  $ spikein_sample    : chr [1:160] "spikein" "spikein" "spikein" "spikein" ...
#  $ spikein_proportion: num 0.05
#  $ spikein_ids       : chr [1:50] "spikein_1" "spikein_2" "spikein_3" "spikein_4" ...

Simulating datasets using SPARSim

After estimating parameter from a real dataset, we will simulate a dataset based on the learned parameters with different scenarios.

  1. Datasets with default parameters
  2. Simulate groups
  3. Simulate batches

Datasets with default parameters

The reference data contains 160 cells and 4000 genes, if we simulate datasets with default parameters and then we will obtain a new data which has the same size as the reference data.

The simulated dataset will have one group of cells as no group information is used in estimation step.

estimate_result <- simmethods::SPARSim_estimation(
  ref_data = ref_data,
  other_prior = NULL,
  verbose = TRUE,
  seed = 111
)
# Estimating parameters using SPARSim
# [1] "Experimental condition 1"
# [1] "...estimating gene intensity"
# [1] "...estimating gene variability"
# [1] "...estimating library size"
# [1] "...creating SPARSim simulation parameter"
simulate_result <- simmethods::SPARSim_simulation(
  parameters = estimate_result[["estimate_result"]],
  return_format = "SCE",
  seed = 111
)
# nCells: 160
# nGenes: 4000
# nGroups: 1
# fc.group: 0
# de.prob: 0
# nBatches: 0
# Number of experimental conditions:  1 
# Number of genes:  4000 
# Number of cells:  160 
# Setting gene expression intensity...  
# Setting gene expression variability ...  
# Simulating biological variability ...  
# Simulating technical variability ...
SCE_result <- simulate_result[["simulate_result"]]
dim(SCE_result)
# [1] 4000  160
head(colData(SCE_result))
# DataFrame with 6 rows and 2 columns
#         cell_name       group
#       <character> <character>
# Cell1       Cell1      Group1
# Cell2       Cell2      Group1
# Cell3       Cell3      Group1
# Cell4       Cell4      Group1
# Cell5       Cell5      Group1
# Cell6       Cell6      Group1
head(rowData(SCE_result))
# DataFrame with 6 rows and 1 column
#         gene_name
#       <character>
# Gene1       Gene1
# Gene2       Gene2
# Gene3       Gene3
# Gene4       Gene4
# Gene5       Gene5
# Gene6       Gene6

Simulate groups

The number of groups simulated by SPARSim is determined by the group information used in estimation step. If the group information of two cell states is provided, then the simulated dataset will contain two groups. SPARSim also provides other parameters related to DEGs such as the proportion of DEGs (de.prob) and the fold change of DGEs (fc.group).

Estimating parameters using group information

group <- as.numeric(simmethods::group_condition)
estimate_result <- simmethods::SPARSim_estimation(
  ref_data = ref_data,
  other_prior = list(group.condition = group),
  verbose = TRUE,
  seed = 111
)
# Estimating parameters using SPARSim
# [1] "Experimental condition 1"
# [1] "...estimating gene intensity"
# [1] "...estimating gene variability"
# [1] "...estimating library size"
# [1] "...creating SPARSim simulation parameter"
# [1] "Experimental condition 2"
# [1] "...estimating gene intensity"
# [1] "...estimating gene variability"
# [1] "...estimating library size"
# [1] "...creating SPARSim simulation parameter"

For demonstration, we will simulate two groups using the learned parameters. We can set de.prob = 0.2 to simulate 20% genes as DEGs.

simulate_result <- simmethods::SPARSim_simulation(
  other_prior = list(de.prob = 0.2,
                     fc.group = 4),
  parameters = estimate_result[["estimate_result"]],
  return_format = "list",
  seed = 111
)
# nCells: 160
# nGenes: 4000
# nGroups: 2
# fc.group: 4
# de.prob: 0.2
# nBatches: 0
# Number of experimental conditions:  2 
# Number of genes:  4000 
# Number of cells:  160 
# Setting gene expression intensity...  
# Setting gene expression variability ...  
# Simulating biological variability ...  
# Simulating technical variability ...
result <- simulate_result[["simulate_result"]][["count_data"]]
dim(result)
# [1] 4000  160
## gene information
gene_info <- simulate_result[["simulate_result"]][["row_meta"]]
### the proportion of DEGs
table(gene_info$de_gene)[2]/nrow(result) ## de.prob = 0.2
# yes 
# 0.2

Simulate batch

Users can simulate batches when batch.condition parameter is activated and just input the numeric vectors that specify the batch labels of cells.

For demonstration, we will simulate three batches using the learned parameters.

simulate_result <- simmethods::SPARSim_simulation(
  other_prior = list(de.prob = 0.2,
                     fc.group = 4,
                     batch.condition = sample(1:3, 160, replace = TRUE)),
  parameters = estimate_result[["estimate_result"]],
  return_format = "list",
  seed = 111
)
# nCells: 160
# nGenes: 4000
# nGroups: 2
# fc.group: 4
# de.prob: 0.2
# nBatches: 3
# Number of experimental conditions:  2 
# Number of genes:  4000 
# Number of cells:  160 
# Setting gene expression intensity...  
# Setting gene expression variability ...  
# Simulating batch effects ...  
# Simulating biological variability ...  
# Simulating technical variability ...
## cell information
col_data <- simulate_result[["simulate_result"]][["col_meta"]]
table(col_data$group)
# 
# Group1 Group2 
#     80     80
table(col_data$batch)
# 
# Batch1 Batch2 Batch3 
#     63     43     54