Here SPARSim method will be demonstrated clearly and hope that this document can help you.
Before simulating datasets, it is important to estimate some essential parameters from a real dataset in order to make the simulated data more real. If you do not have a single-cell transcriptomics count matrix now, you can use the data collected in simmethods package by simmethods:data
command.
library(simmethods)
library(SingleCellExperiment)
# Load data
ref_data <- simmethods::data
In SPARSim, users can estimate parameters from real data only using gene expression matrix.
Using simmethods::SPARSim_estimation
command to execute the estimation step.
estimate_result <- simmethods::SPARSim_estimation(
ref_data = ref_data,
other_prior = NULL,
verbose = TRUE,
seed = 111
)
# Estimating parameters using SPARSim
# [1] "Experimental condition 1"
# [1] "...estimating gene intensity"
# [1] "...estimating gene variability"
# [1] "...estimating library size"
# [1] "...creating SPARSim simulation parameter"
In addition, you can also input a numeric vector to specify the groups or plates that each cell comes from, like other_prior = list(group.condition = the numeric vector)
.
group <- as.numeric(simmethods::group_condition)
estimate_result <- simmethods::SPARSim_estimation(
ref_data = ref_data,
other_prior = list(group.condition = group),
verbose = TRUE,
seed = 111
)
# Estimating parameters using SPARSim
# [1] "Experimental condition 1"
# [1] "...estimating gene intensity"
# [1] "...estimating gene variability"
# [1] "...estimating library size"
# [1] "...creating SPARSim simulation parameter"
# [1] "Experimental condition 2"
# [1] "...estimating gene intensity"
# [1] "...estimating gene variability"
# [1] "...estimating library size"
# [1] "...creating SPARSim simulation parameter"
In SPARSim, ERCC spike-in genes can be used to estimate data parameters from the real data. In this case, the gene matrix must contain spike-in gene counts and hold the right ERCC spike-in gene names. Note that users must also input the dilution factor and volume (microliter) which the experiment used.
group <- as.numeric(simmethods::group_condition)
other_prior <- list(group.condition = group,
dilution.factor = 50000,
volume = 0.01)
estimate_result <- simmethods::SPARSim_estimation(
ref_data = ref_data,
other_prior = other_prior,
verbose = TRUE,
seed = 111
)
# Estimating parameters using SPARSim
# [1] "Experimental condition 1"
# [1] "...estimating gene intensity"
# [1] "...estimating gene variability"
# [1] "...estimating library size"
# [1] "...creating SPARSim simulation parameter"
# [1] "Experimental condition 2"
# [1] "...estimating gene intensity"
# [1] "...estimating gene variability"
# [1] "...estimating library size"
# [1] "...creating SPARSim simulation parameter"
Check spike-in parameters
spikein_params <- estimate_result[["estimate_result"]][["SPARSim_spikein_parameter"]]
str(spikein_params)
# List of 4
# $ spikein_set :List of 1
# ..$ spikein:List of 4
# .. ..$ mix_name : chr "spikein"
# .. ..$ abundance : Named num [1:50] 3613.2 903.3 225.8 112.9 56.5 ...
# .. .. ..- attr(*, "names")= chr [1:50] "spikein_1" "spikein_2" "spikein_3" "spikein_4" ...
# .. ..$ variability: Named num [1:50] 0 0 0 0 0 0 0 0 0 0 ...
# .. .. ..- attr(*, "names")= chr [1:50] "spikein_1" "spikein_2" "spikein_3" "spikein_4" ...
# .. ..$ ids : chr [1:50] "spikein_1" "spikein_2" "spikein_3" "spikein_4" ...
# $ spikein_sample : chr [1:160] "spikein" "spikein" "spikein" "spikein" ...
# $ spikein_proportion: num 0.05
# $ spikein_ids : chr [1:50] "spikein_1" "spikein_2" "spikein_3" "spikein_4" ...
After estimating parameter from a real dataset, we will simulate a dataset based on the learned parameters with different scenarios.
The reference data contains 160 cells and 4000 genes, if we simulate datasets with default parameters and then we will obtain a new data which has the same size as the reference data.
The simulated dataset will have one group of cells as no group information is used in estimation step.
estimate_result <- simmethods::SPARSim_estimation(
ref_data = ref_data,
other_prior = NULL,
verbose = TRUE,
seed = 111
)
# Estimating parameters using SPARSim
# [1] "Experimental condition 1"
# [1] "...estimating gene intensity"
# [1] "...estimating gene variability"
# [1] "...estimating library size"
# [1] "...creating SPARSim simulation parameter"
simulate_result <- simmethods::SPARSim_simulation(
parameters = estimate_result[["estimate_result"]],
return_format = "SCE",
seed = 111
)
# nCells: 160
# nGenes: 4000
# nGroups: 1
# fc.group: 0
# de.prob: 0
# nBatches: 0
# Number of experimental conditions: 1
# Number of genes: 4000
# Number of cells: 160
# Setting gene expression intensity...
# Setting gene expression variability ...
# Simulating biological variability ...
# Simulating technical variability ...
SCE_result <- simulate_result[["simulate_result"]]
dim(SCE_result)
# [1] 4000 160
head(colData(SCE_result))
# DataFrame with 6 rows and 2 columns
# cell_name group
# <character> <character>
# Cell1 Cell1 Group1
# Cell2 Cell2 Group1
# Cell3 Cell3 Group1
# Cell4 Cell4 Group1
# Cell5 Cell5 Group1
# Cell6 Cell6 Group1
head(rowData(SCE_result))
# DataFrame with 6 rows and 1 column
# gene_name
# <character>
# Gene1 Gene1
# Gene2 Gene2
# Gene3 Gene3
# Gene4 Gene4
# Gene5 Gene5
# Gene6 Gene6
The number of groups simulated by SPARSim is determined by the group information used in estimation step. If the group information of two cell states is provided, then the simulated dataset will contain two groups. SPARSim also provides other parameters related to DEGs such as the proportion of DEGs (de.prob
) and the fold change of DGEs (fc.group
).
Estimating parameters using group information
group <- as.numeric(simmethods::group_condition)
estimate_result <- simmethods::SPARSim_estimation(
ref_data = ref_data,
other_prior = list(group.condition = group),
verbose = TRUE,
seed = 111
)
# Estimating parameters using SPARSim
# [1] "Experimental condition 1"
# [1] "...estimating gene intensity"
# [1] "...estimating gene variability"
# [1] "...estimating library size"
# [1] "...creating SPARSim simulation parameter"
# [1] "Experimental condition 2"
# [1] "...estimating gene intensity"
# [1] "...estimating gene variability"
# [1] "...estimating library size"
# [1] "...creating SPARSim simulation parameter"
For demonstration, we will simulate two groups using the learned parameters. We can set de.prob = 0.2
to simulate 20% genes as DEGs.
simulate_result <- simmethods::SPARSim_simulation(
other_prior = list(de.prob = 0.2,
fc.group = 4),
parameters = estimate_result[["estimate_result"]],
return_format = "list",
seed = 111
)
# nCells: 160
# nGenes: 4000
# nGroups: 2
# fc.group: 4
# de.prob: 0.2
# nBatches: 0
# Number of experimental conditions: 2
# Number of genes: 4000
# Number of cells: 160
# Setting gene expression intensity...
# Setting gene expression variability ...
# Simulating biological variability ...
# Simulating technical variability ...
result <- simulate_result[["simulate_result"]][["count_data"]]
dim(result)
# [1] 4000 160
## gene information
gene_info <- simulate_result[["simulate_result"]][["row_meta"]]
### the proportion of DEGs
table(gene_info$de_gene)[2]/nrow(result) ## de.prob = 0.2
# yes
# 0.2
Users can simulate batches when batch.condition
parameter is activated and just input the numeric vectors that specify the batch labels of cells.
For demonstration, we will simulate three batches using the learned parameters.
simulate_result <- simmethods::SPARSim_simulation(
other_prior = list(de.prob = 0.2,
fc.group = 4,
batch.condition = sample(1:3, 160, replace = TRUE)),
parameters = estimate_result[["estimate_result"]],
return_format = "list",
seed = 111
)
# nCells: 160
# nGenes: 4000
# nGroups: 2
# fc.group: 4
# de.prob: 0.2
# nBatches: 3
# Number of experimental conditions: 2
# Number of genes: 4000
# Number of cells: 160
# Setting gene expression intensity...
# Setting gene expression variability ...
# Simulating batch effects ...
# Simulating biological variability ...
# Simulating technical variability ...
## cell information
col_data <- simulate_result[["simulate_result"]][["col_meta"]]
table(col_data$group)
#
# Group1 Group2
# 80 80
table(col_data$batch)
#
# Batch1 Batch2 Batch3
# 63 43 54