Data Simulation

Simulate Datasets From Local R
Simulate Datasets From Docker in R
Simulate Datasets By Simpipe Package
- Generate Multiple Datasets For Every Estimation Result

We have already seen how to estimate parameters from one or more real datasets and obtain the estimation results. In this chapter, we demonstrate how to simulate single-cell transcriptomics data from those results, focusing on the parameters that are most often customized for different use cases.

For the demonstrations, we use the Splat method because it supports all of the functionalities and parameters that we want to introduce.

Load the packages first:

library(simmethods)
library(simpipe)

Simulate Datasets From Local R

Step1: Prepare Estimation Results

Load data and perform estimation:

ref_data <- simmethods::data
estimation_result <- simmethods::Splat_estimation(
  ref_data = ref_data,
  verbose = TRUE,
  seed = 666
)
# Estimating parameters using Splat

Step2: Check Available Parameters

Next, check the optional parameters that control the size of the simulated datasets, the proportion of DEGs, the number of cell batches, and whether the data contain a cellular trajectory. This shows which parameters you need to set to meet your simulation requirements.

help(Splat_simulation)

## Details
# In addtion to simulate datasets with default parameters, users want to simulate other kinds of datasets, e.g. a counts matrix with 2 or more cell groups. In Splat, you can set extra parameters to simulate datasets.
# 
# The customed parameters you can set are below:
# 
# nCells. In Splat, you can not set nCells directly and should set batchCells instead. For example, if you want to simulate 1000 cells, you can type other_prior = list(batchCells = 1000). If you type other_prior = list(batchCells = c(500, 500)), the simulated data will have two batches.
# 
# nGenes. You can directly set other_prior = list(nGenes = 5000) to simulate 5000 genes.
# 
# nGroups. You can not directly set other_prior = list(nGroups = 3) to simulate 3 groups. Instead, you should set other_prior = list(prob.group = c(0.2, 0.3, 0.5)) where the sum of group probabilities must equal to 1.
# 
# de.prob. You can directly set other_prior = list(de.prob = 0.2) to simulate DEGs that account for 20 percent of all genes.
# 
# prob.group. You can directly set other_prior = list(prob.group = c(0.2, 0.3, 0.5)) to assign three proportions of cell groups. Note that the number of groups always equals to the length of the vector.
# 
# nBatches. You can not directly set other_prior = list(nBatches = 3) to simulate 3 batches. Instead, you should set other_prior = list(batchCells = c(500, 500, 500)) to reach the goal and the total cells are 1500.
# 
# If users want to simulate datasets for trajectory inference, just set other_prior = list(paths = TRUE). Simulating trajectory datasets can also specify the parameters of group and batch. See Examples.

These parameters fall into four classes, which correspond to the four main functionalities of the Splat method:

parameters for cell groups
parameters for DEGs
parameters for batches
parameters for cellular differentiation trajectory

Step 3 below describes each of these use cases in detail.
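The mapping rules in the help text (no direct nCells, nGroups or nBatches; use batchCells and prob.group instead) can be collected into a small helper. This is only a sketch in base R: `make_other_prior` and its argument names are hypothetical and not part of simmethods. It just builds the list that would be passed as other_prior.

```r
# Hypothetical helper (not part of simmethods): translate intuitive settings
# into the other_prior entries described in the help text above.
make_other_prior <- function(cells_per_batch, n_genes, group_prop = 1,
                             de_prob = NULL, paths = FALSE) {
  stopifnot(all(cells_per_batch > 0), n_genes > 0)
  # The help text requires the group probabilities to sum to 1.
  if (!isTRUE(all.equal(sum(group_prop), 1))) {
    stop("group proportions must sum to 1")
  }
  # One entry per batch; a single number means a single batch.
  prior <- list(batchCells = cells_per_batch, nGenes = n_genes)
  # The number of groups is implied by the length of prob.group.
  if (length(group_prop) > 1) prior$prob.group <- group_prop
  if (!is.null(de_prob)) {
    stopifnot(de_prob >= 0, de_prob <= 1)
    prior$de.prob <- de_prob
  }
  if (paths) prior$paths <- TRUE
  prior
}

prior <- make_other_prior(cells_per_batch = c(500, 500, 500),
                          n_genes = 5000,
                          group_prop = c(0.2, 0.3, 0.5),
                          de_prob = 0.2)
str(prior)
sum(prior$batchCells)      # total number of cells
length(prior$batchCells)   # number of batches
length(prior$prob.group)   # number of groups
```

The resulting list could then be supplied as the other_prior argument of the simulation calls shown in Step 3.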

Step3: Simulation

Task1: The Number of Cells and Genes

The first use case is generating datasets with different numbers of cells and genes. From the Splat documentation, we know that the batchCells parameter controls the number of cells and nGenes controls the number of genes.

Simulate 1000 cells and 5000 genes:

data_1000_5000 <- simmethods::Splat_simulation(
  parameters = estimation_result$estimate_result,
  other_prior = list(batchCells = 1000,
                     nGenes = 5000),
  return_format = "Seurat",
  verbose = TRUE,
  seed = 666
)
# nCells: 1000
# nGenes: 5000
# nGroups: 1
# de.prob: 0.1
# nBatches: 1
# Simulating datasets using Splat
# Getting parameters...
# Creating simulation object...
# Simulating library sizes...
# Simulating gene means...
# Simulating BCV...
# Simulating counts...
# Simulating dropout (if needed)...
# Sparsifying assays...
# Automatically converting to sparse matrices, threshold = 0.95
# Skipping 'BatchCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BaseCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BCV': estimated sparse size 1.5 * dense matrix
# Skipping 'CellMeans': estimated sparse size 1.49 * dense matrix
# Skipping 'TrueCounts': estimated sparse size 1.6 * dense matrix
# Skipping 'counts': estimated sparse size 1.6 * dense matrix
# Done!

data_1000_5000$simulate_result
# An object of class Seurat 
# 5000 features across 1000 samples within 1 assay 
# Active assay: originalexp (5000 features, 0 variable features)

Simulate 10000 cells and 20000 genes:

data_10000_20000 <- simmethods::Splat_simulation(
  parameters = estimation_result$estimate_result,
  other_prior = list(batchCells = 10000,
                     nGenes = 20000),
  return_format = "Seurat",
  verbose = TRUE,
  seed = 666
)
# nCells: 10000
# nGenes: 20000
# nGroups: 1
# de.prob: 0.1
# nBatches: 1
# Simulating datasets using Splat
# Getting parameters...
# Creating simulation object...
# Simulating library sizes...
# Simulating gene means...
# Simulating BCV...
# Simulating counts...
# Simulating dropout (if needed)...
# Sparsifying assays...
# Automatically converting to sparse matrices, threshold = 0.95
# Skipping 'BatchCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BaseCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BCV': estimated sparse size 1.5 * dense matrix
# Skipping 'CellMeans': estimated sparse size 1.47 * dense matrix
# Skipping 'TrueCounts': estimated sparse size 1.06 * dense matrix
# Skipping 'counts': estimated sparse size 1.06 * dense matrix
# Done!

Check the number of cells and genes:

data_10000_20000$simulate_result
# An object of class Seurat 
# 20000 features across 10000 samples within 1 assay 
# Active assay: originalexp (20000 features, 0 variable features)

Check the execution time:

data_10000_20000$simulate_detection$Elapsed_Time_sec
# [1] 44.628
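Execution time and memory both grow with the requested matrix size. As a rough back-of-the-envelope sketch (not something simmethods reports), the base R helper below estimates the size of one dense numeric matrix, assuming 8 bytes per value as for R doubles. `dense_gb` is a hypothetical name, and the real footprint depends on how many assays are stored and on their storage type.

```r
# Hypothetical helper: approximate size in GiB of one dense genes x cells
# matrix, assuming 8 bytes per value (R's double type).
dense_gb <- function(n_genes, n_cells, bytes_per_value = 8) {
  n_genes * n_cells * bytes_per_value / 1024^3
}

small <- dense_gb(5000, 1000)     # the 1000-cell, 5000-gene example
large <- dense_gb(20000, 10000)   # the 10000-cell, 20000-gene example
round(c(small = small, large = large), 3)
```

The larger dataset has 40 times as many entries as the smaller one, so a single dense assay is about 40 times larger.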

Task2: Cell Groups

To simulate two groups of cells with the Splat method, use the prob.group parameter to specify the proportion of cells in each group. The length of the prob.group vector defines the number of groups.

Simulate two groups (4:6):

data_4_6 <- simmethods::Splat_simulation(
  parameters = estimation_result$estimate_result,
  other_prior = list(batchCells = 1000,
                     nGenes = 5000,
                     prob.group = c(0.4, 0.6)),
  return_format = "Seurat",
  verbose = TRUE,
  seed = 666
)
# nCells: 1000
# nGenes: 5000
# nGroups: 2
# de.prob: 0.1
# nBatches: 1
# Simulating datasets using Splat
# Getting parameters...
# Creating simulation object...
# Simulating library sizes...
# Simulating gene means...
# Simulating group DE...
# Simulating cell means...
# Simulating BCV...
# Simulating counts...
# Simulating dropout (if needed)...
# Sparsifying assays...
# Automatically converting to sparse matrices, threshold = 0.95
# Skipping 'BatchCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BaseCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BCV': estimated sparse size 1.5 * dense matrix
# Skipping 'CellMeans': estimated sparse size 1.49 * dense matrix
# Skipping 'TrueCounts': estimated sparse size 1.6 * dense matrix
# Skipping 'counts': estimated sparse size 1.6 * dense matrix
# Done!

Check the group labels of the cells:

table(data_4_6$simulate_result$group)
# 
# Group1 Group2 
#    407    593

Simulate five groups (1:1:2:3:3):

data_11233 <- simmethods::Splat_simulation(
  parameters = estimation_result$estimate_result,
  other_prior = list(batchCells = 1000,
                     nGenes = 5000,
                     prob.group = c(0.1, 0.1, 0.2, 0.3, 0.3)),
  return_format = "Seurat",
  verbose = TRUE,
  seed = 666
)
# nCells: 1000
# nGenes: 5000
# nGroups: 5
# de.prob: 0.1
# nBatches: 1
# Simulating datasets using Splat
# Getting parameters...
# Creating simulation object...
# Simulating library sizes...
# Simulating gene means...
# Simulating group DE...
# Simulating cell means...
# Simulating BCV...
# Simulating counts...
# Simulating dropout (if needed)...
# Sparsifying assays...
# Automatically converting to sparse matrices, threshold = 0.95
# Skipping 'BatchCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BaseCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BCV': estimated sparse size 1.5 * dense matrix
# Skipping 'CellMeans': estimated sparse size 1.49 * dense matrix
# Skipping 'TrueCounts': estimated sparse size 1.6 * dense matrix
# Skipping 'counts': estimated sparse size 1.6 * dense matrix
# Done!

Check the group labels of the cells:

table(data_11233$simulate_result$group)
# 
# Group1 Group2 Group3 Group4 Group5 
#     95    106    206    290    303
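The counts are close to the requested 1:1:2:3:3 split but not identical to it, presumably because cells are assigned to groups by sampling with the given probabilities rather than by exact partitioning. The short base R sketch below quantifies the deviation using the counts copied from the output above. With the live object, `prop.table(table(data_11233$simulate_result$group))` should give the same realized proportions.

```r
# Group counts copied from the table() output of the five-group example.
observed <- c(Group1 = 95, Group2 = 106, Group3 = 206,
              Group4 = 290, Group5 = 303)
# Proportions requested via prob.group.
requested <- c(0.1, 0.1, 0.2, 0.3, 0.3)

# Realized proportions and their deviation from the requested ones.
realized <- observed / sum(observed)
deviation <- round(realized - requested, 3)
deviation
```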

Task3: Differentially Expressed Genes

Users can also set the proportion of DEGs in the Splat method via the de.prob parameter, which ranges from 0 to 1.

Here we set de.prob to 0.2 to simulate 20% DEGs in two cell groups.

simulated_data <- simmethods::Splat_simulation(
  parameters = estimation_result$estimate_result,
  other_prior = list(batchCells = 1000,
                     nGenes = 5000,
                     prob.group = c(0.4, 0.6),
                     de.prob = 0.2),
  return_format = "list",
  verbose = TRUE,
  seed = 666
)
# nCells: 1000
# nGenes: 5000
# nGroups: 2
# de.prob: 0.2
# nBatches: 1
# Simulating datasets using Splat
# Getting parameters...
# Creating simulation object...
# Simulating library sizes...
# Simulating gene means...
# Simulating group DE...
# Simulating cell means...
# Simulating BCV...
# Simulating counts...
# Simulating dropout (if needed)...
# Sparsifying assays...
# Automatically converting to sparse matrices, threshold = 0.95
# Skipping 'BatchCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BaseCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BCV': estimated sparse size 1.5 * dense matrix
# Skipping 'CellMeans': estimated sparse size 1.49 * dense matrix
# Skipping 'TrueCounts': estimated sparse size 1.6 * dense matrix
# Skipping 'counts': estimated sparse size 1.6 * dense matrix
# Done!

Check the group labels of the cells:

table(simulated_data$simulate_result$col_meta$group)
# 
# Group1 Group2 
#    407    593

Check the proportion of DEGs:

row_meta <- simulated_data$simulate_result$row_meta
table(row_meta$de_gene == "yes")/length(row_meta$de_gene)
# 
#  FALSE   TRUE 
# 0.8068 0.1932

Next, we simulate another dataset that contains more than two groups (4 groups and 40% DEGs):

simulated_data <- simmethods::Splat_simulation(
  parameters = estimation_result$estimate_result,
  other_prior = list(batchCells = 1000,
                     nGenes = 5000,
                     prob.group = c(0.2, 0.2, 0.3, 0.3),
                     de.prob = 0.4),
  return_format = "list",
  verbose = TRUE,
  seed = 666
)
# nCells: 1000
# nGenes: 5000
# nGroups: 4
# de.prob: 0.4
# nBatches: 1
# Simulating datasets using Splat
# Getting parameters...
# Creating simulation object...
# Simulating library sizes...
# Simulating gene means...
# Simulating group DE...
# Simulating cell means...
# Simulating BCV...
# Simulating counts...
# Simulating dropout (if needed)...
# Sparsifying assays...
# Automatically converting to sparse matrices, threshold = 0.95
# Skipping 'BatchCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BaseCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BCV': estimated sparse size 1.5 * dense matrix
# Skipping 'CellMeans': estimated sparse size 1.49 * dense matrix
# Skipping 'TrueCounts': estimated sparse size 1.6 * dense matrix
# Skipping 'counts': estimated sparse size 1.6 * dense matrix
# Done!

Check the group labels of the cells:

table(simulated_data$simulate_result$col_meta$group)
# 
# Group1 Group2 Group3 Group4 
#    201    206    290    303

Check the proportion of DEGs:

row_meta <- simulated_data$simulate_result$row_meta
table(row_meta$de_gene == "yes")/length(row_meta$de_gene)
# 
#  FALSE   TRUE 
# 0.6568 0.3432

Note that the Splat method, unlike scDesign and SPARSim, lets us identify the DEGs between any pair of groups. For example, to get the DEGs between Group1 and Group2, extract the DE factors of the two groups from the gene metadata:

gene_meta <- simulated_data$simulate_result$row_meta
DEFactor1 <- gene_meta$DEFacGroup1
DEFactor2 <- gene_meta$DEFacGroup2

Then divide one DE factor by the other:

DEFactor <- DEFactor1/DEFactor2

Genes whose DEFactor is not equal to 1 are defined as the DEGs between Group1 and Group2:

table(DEFactor != 1)
# 
# FALSE  TRUE 
#  4034   966
DEGs_group1_group2 <- rownames(gene_meta)[DEFactor != 1]
DEGs_group1_group2[1:10]
#  [1] "Gene1"  "Gene4"  "Gene7"  "Gene11" "Gene12" "Gene15" "Gene16" "Gene17"
#  [9] "Gene36" "Gene45"
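The same ratio test can be repeated for every pair of groups. The sketch below is a base R illustration on a made-up toy table that mimics the DEFacGroup columns; `pairwise_degs` is a hypothetical helper, not a simmethods function. With real output, the gene metadata extracted above would be passed in instead of the toy data frame.

```r
# Toy gene metadata mimicking the DEFacGroup* columns; values are made up.
toy_meta <- data.frame(
  DEFacGroup1 = c(1, 2.0, 1.0, 1),
  DEFacGroup2 = c(1, 1.0, 0.5, 1),
  DEFacGroup3 = c(1, 2.0, 1.0, 3),
  row.names = paste0("Gene", 1:4)
)

# Hypothetical helper: DEGs for every pair of groups, defined as genes whose
# DE factor ratio between the two groups is not equal to 1.
pairwise_degs <- function(meta) {
  fac_cols <- grep("^DEFacGroup", colnames(meta), value = TRUE)
  pairs <- combn(fac_cols, 2, simplify = FALSE)
  degs <- lapply(pairs, function(p) {
    rownames(meta)[meta[[p[1]]] / meta[[p[2]]] != 1]
  })
  # Name each element after the two groups it compares.
  names(degs) <- vapply(pairs, function(p) {
    paste(sub("^DEFac", "", p), collapse = "_vs_")
  }, character(1))
  degs
}

degs <- pairwise_degs(toy_meta)
degs
```

Each element of the returned list holds the gene names that differ between one pair of groups, for example `degs$Group1_vs_Group2`.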

scDesign and SPARSim cannot return the DEGs between pairs of groups when the number of cell groups is greater than 2. When a simulated dataset contains only two groups, however, the DEGs they return are valid.

Task4: Cell Batches

Simulating cell batches is another important use case in benchmarking and method development studies.

In Splat and many other methods, users can specify the number of cell batches and the number of cells in each batch via the batchCells parameter. Here, we simulate 3 batches with 1000, 2000, and 3000 cells, respectively.

simulated_data <- simmethods::Splat_simulation(
  parameters = estimation_result$estimate_result,
  other_prior = list(batchCells = c(1000, 2000, 3000),
                     nGenes = 5000),
  return_format = "list",
  verbose = TRUE,
  seed = 666
)
# nCells: 6000
# nGenes: 5000
# nGroups: 1
# de.prob: 0.1
# nBatches: 3
# Simulating datasets using Splat
# Getting parameters...
# Creating simulation object...
# Simulating library sizes...
# Simulating gene means...
# Simulating batch effects...
# Simulating BCV...
# Simulating counts...
# Simulating dropout (if needed)...
# Sparsifying assays...
# Automatically converting to sparse matrices, threshold = 0.95
# Skipping 'BatchCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BaseCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BCV': estimated sparse size 1.5 * dense matrix
# Skipping 'CellMeans': estimated sparse size 1.49 * dense matrix
# Skipping 'TrueCounts': estimated sparse size 1.59 * dense matrix
# Skipping 'counts': estimated sparse size 1.59 * dense matrix
# Done!

Check the batches:

table(simulated_data$simulate_result$col_meta$batch)
# 
# Batch1 Batch2 Batch3 
#   1000   2000   3000

Task5: Cellular Trajectory

Simulating data with a cellular differentiation trajectory is another use case of the Splat method. To do so, simply set the paths parameter to TRUE.

simulated_data <- simmethods::Splat_simulation(
  parameters = estimation_result$estimate_result,
  other_prior = list(batchCells = 1000,
                     prob.group = c(0.3, 0.2, 0.5),
                     nGenes = 5000,
                     paths = TRUE),
  return_format = "SingleCellExperiment",
  verbose = TRUE,
  seed = 666
)
# nCells: 1000
# nGenes: 5000
# nGroups: 3
# de.prob: 0.1
# nBatches: 1
# Simulating datasets using Splat
# Simulating trajectory datasets by Splat
# Getting parameters...
# Creating simulation object...
# Simulating library sizes...
# Simulating gene means...
# Simulating path endpoints...
# Simulating path steps...
# Simulating BCV...
# Simulating counts...
# Simulating dropout (if needed)...
# Sparsifying assays...
# Automatically converting to sparse matrices, threshold = 0.95
# Skipping 'BatchCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BaseCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BCV': estimated sparse size 1.5 * dense matrix
# Skipping 'CellMeans': estimated sparse size 1.49 * dense matrix
# Skipping 'TrueCounts': estimated sparse size 1.6 * dense matrix
# Skipping 'counts': estimated sparse size 1.6 * dense matrix
# Done!

library(scater)
# Loading required package: scuttle
# Loading required package: ggplot2
sim.paths <- logNormCounts(simulated_data$simulate_result)
sim.paths <- runPCA(sim.paths)
plotPCA(sim.paths, colour_by = "group")

To set other trajectory-related parameters in the Splat method, see the official vignettes of the Splatter package, its website, or the help page:

help(splatSimulate, package = "splatter")

Here, we add only two extra parameters, path.nSteps and path.skew:

simulated_data <- simmethods::Splat_simulation(
  parameters = estimation_result$estimate_result,
  other_prior = list(batchCells = 1000,
                     prob.group = c(0.3, 0.2, 0.5),
                     nGenes = 5000,
                     paths = TRUE,
                     path.nSteps = 20,
                     path.skew = 0.1),
  return_format = "SingleCellExperiment",
  verbose = TRUE,
  seed = 666
)
# nCells: 1000
# nGenes: 5000
# nGroups: 3
# de.prob: 0.1
# nBatches: 1
# Simulating datasets using Splat
# Simulating trajectory datasets by Splat
# Getting parameters...
# Creating simulation object...
# Simulating library sizes...
# Simulating gene means...
# Simulating path endpoints...
# Simulating path steps...
# Simulating BCV...
# Simulating counts...
# Simulating dropout (if needed)...
# Sparsifying assays...
# Automatically converting to sparse matrices, threshold = 0.95
# Skipping 'BatchCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BaseCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BCV': estimated sparse size 1.5 * dense matrix
# Skipping 'CellMeans': estimated sparse size 1.49 * dense matrix
# Skipping 'TrueCounts': estimated sparse size 1.6 * dense matrix
# Skipping 'counts': estimated sparse size 1.6 * dense matrix
# Done!

library(scater)
sim.paths <- logNormCounts(simulated_data$simulate_result)
sim.paths <- runPCA(sim.paths)
plotPCA(sim.paths, colour_by = "group")

Simulate Datasets From Docker in R

In this part, we demonstrate how to simulate datasets using Docker from R. Make sure that Docker is installed on your device.

First, start Docker and check the installation:

library(simpipe2docker)
test_docker_installation(detailed = TRUE)
# ✔ Docker is installed
# ✔ Docker daemon is running
# ✔ Docker is at correct version (>1.0): 1.41
# ✔ Docker is in linux mode
# ✔ Docker can pull images
# ✔ Docker can run image
# ✔ Docker can mount temporary volumes
# ✔ Docker test successful -----------------------------------------------------------------
# [1] TRUE
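If the check above fails, a quick base R pre-check can show whether a docker executable is visible to the R session at all. This is a minimal sketch that is independent of simpipe2docker. It only looks for the binary on the PATH and does not verify that the Docker daemon is running.

```r
# Look for a `docker` executable on the PATH of the current R session.
docker_path <- Sys.which("docker")
docker_found <- unname(nzchar(docker_path))

if (docker_found) {
  message("docker found at: ", docker_path)
} else {
  message("docker not found on PATH; install Docker or fix the PATH first")
}
```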

Estimate parameters in Docker:

estimation_result <- simpipe2docker::estimate_parameters_container(
  ref_data = ref_data,
  method = "Splat",
  verbose = TRUE,
  seed = 666
)
# Learning parameters from data 1
# Running /usr/local/bin/docker run --name \
#   20230807_112948__container__uxBxg1JNLM -e 'TMPDIR=/tmp2' --workdir \
#   /home/admin/ -v \
#   '/var/folders/1l/xmc98tgx0m37wxtbtwnl6h7c0000gn/T//RtmpMrBHAW:/home/admin/docker_path' \
#   -v \
#   '/tmp/folders/1l/xmc98tgx0m37wxtbtwnl6h7c0000gn/T//RtmpMrBHAW/file8d9326fd785/tmp:/tmp2' \
#   duohongrui/simpipe
# WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
# Estimating parameters using Splat
# Output is saved to  /var/folders/1l/xmc98tgx0m37wxtbtwnl6h7c0000gn/T//RtmpMrBHAW 
# Attempting to read output into R

Simulate new datasets from Docker:

## simulate 1000 cells and 1000 genes
simulated_data <- simpipe2docker::simulate_datasets_container(
  parameters = estimation_result,
  other_prior = list(batchCells = 1000,
                     nGenes = 1000),
  return_format = "SingleCellExperiment",
  verbose = TRUE,
  seed = 666
)
# Simulating dataset 1
# Running /usr/local/bin/docker run --name \
#   20230807_113135__container__NigapuTAlX -e 'TMPDIR=/tmp2' --workdir \
#   /home/admin/ -v \
#   '/var/folders/1l/xmc98tgx0m37wxtbtwnl6h7c0000gn/T//RtmpMrBHAW:/home/admin/docker_path' \
#   -v \
#   '/tmp/folders/1l/xmc98tgx0m37wxtbtwnl6h7c0000gn/T//RtmpMrBHAW/file8d913c1e7cb/tmp:/tmp2' \
#   duohongrui/simpipe
# WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
# Registered S3 method overwritten by 'SeuratDisk':
#   method            from  
#   as.sparse.H5Group Seurat
# nCells: 1000 
# nGenes: 1000 
# nGroups: 1 
# de.prob: 0.1 
# nBatches: 1 
# Simulating datasets using Splat
# Getting parameters...
# Creating simulation object...
# Simulating library sizes...
# Simulating gene means...
# Simulating BCV...
# Simulating counts...
# Simulating dropout (if needed)...
# Sparsifying assays...
# Automatically converting to sparse matrices, threshold = 0.95
# Skipping 'BatchCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BaseCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BCV': estimated sparse size 1.5 * dense matrix
# Skipping 'CellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'TrueCounts': estimated sparse size 2.21 * dense matrix
# Skipping 'counts': estimated sparse size 2.21 * dense matrix
# Done!
# Output is saved to  /var/folders/1l/xmc98tgx0m37wxtbtwnl6h7c0000gn/T//RtmpMrBHAW 
# Attempting to read output into R
simulated_data$refdata_Splat_1$simulate_result
# class: SingleCellExperiment 
# dim: 1000 1000 
# metadata(1): Params
# assays(6): BatchCellMeans BaseCellMeans ... TrueCounts counts
# rownames(1000): Gene1 Gene2 ... Gene999 Gene1000
# rowData names(4): Gene BaseGeneMean OutlierFactor GeneMean
# colnames(1000): Cell1 Cell2 ... Cell999 Cell1000
# colData names(3): Cell Batch ExpLibSize
# reducedDimNames(0):
# mainExpName: NULL
# altExpNames(0):

## simulate 1000 cells and 1000 genes (two groups and 40% DEGs)
simulated_data <- simpipe2docker::simulate_datasets_container(
  parameters = estimation_result,
  other_prior = list(batchCells = 1000,
                     nGenes = 1000,
                     prob.group = c(0.4, 0.6),
                     de.prob = 0.4),
  return_format = "list",
  verbose = TRUE,
  seed = 666
)
# Simulating dataset 1
# Running /usr/local/bin/docker run --name \
#   20230807_113237__container__4YsDYtfOI7 -e 'TMPDIR=/tmp2' --workdir \
#   /home/admin/ -v \
#   '/var/folders/1l/xmc98tgx0m37wxtbtwnl6h7c0000gn/T//RtmpMrBHAW:/home/admin/docker_path' \
#   -v \
#   '/tmp/folders/1l/xmc98tgx0m37wxtbtwnl6h7c0000gn/T//RtmpMrBHAW/file8d938ac09a/tmp:/tmp2' \
#   duohongrui/simpipe
# WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
# Registered S3 method overwritten by 'SeuratDisk':
#   method            from  
#   as.sparse.H5Group Seurat
# nCells: 1000 
# nGenes: 1000 
# nGroups: 1 
# de.prob: 0.4 
# nBatches: 1 
# Simulating datasets using Splat
# Getting parameters...
# Creating simulation object...
# Simulating library sizes...
# Simulating gene means...
# Simulating BCV...
# Simulating counts...
# Simulating dropout (if needed)...
# Sparsifying assays...
# Automatically converting to sparse matrices, threshold = 0.95
# Skipping 'BatchCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BaseCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BCV': estimated sparse size 1.5 * dense matrix
# Skipping 'CellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'TrueCounts': estimated sparse size 2.21 * dense matrix
# Skipping 'counts': estimated sparse size 2.21 * dense matrix
# Done!
# Output is saved to  /var/folders/1l/xmc98tgx0m37wxtbtwnl6h7c0000gn/T//RtmpMrBHAW 
# Attempting to read output into R

Simulate Datasets By Simpipe Package

Building on the simmethods package, the simpipe package provides additional useful functions. Users can estimate parameters from multiple real datasets using multiple methods, and can also simulate multiple new datasets at once. In this part, we introduce some helpful functions in the simpipe package.

First, use simpipe to estimate parameters from two real datasets:

## prepare a list of data
data <- list(data1 = ref_data,
             data2 = ref_data)

estimation_result <- simpipe::estimate_parameters(
  ref_data = data,
  method = "Splat",
  verbose = TRUE,
  seed = 666
)
# Estimating parameters using Splat
# Estimating parameters using Splat

Generate Multiple Datasets For Every Estimation Result

For every estimation result, we can generate multiple datasets by setting the n parameter of the simulate_datasets function:

simulated_data <- simpipe::simulate_datasets(
  parameters = estimation_result,
  other_prior = list(batchCells = 1000,
                     nGenes = 1000),
  n = 3,
  return_format = "list",
  verbose = TRUE,
  seed = 666
)
# The length of seeds is not identical to the time(s) that every method will be executed 
# The seed will be set as: 100 200 300 when performing every method
# Simulating dataset 1
# nCells: 1000
# nGenes: 1000
# nGroups: 1
# de.prob: 0.1
# nBatches: 1
# Simulating datasets using Splat
# Getting parameters...
# Creating simulation object...
# Simulating library sizes...
# Simulating gene means...
# Simulating BCV...
# Simulating counts...
# Simulating dropout (if needed)...
# Sparsifying assays...
# Automatically converting to sparse matrices, threshold = 0.95
# Skipping 'BatchCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BaseCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BCV': estimated sparse size 1.5 * dense matrix
# Skipping 'CellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'TrueCounts': estimated sparse size 2.23 * dense matrix
# Skipping 'counts': estimated sparse size 2.23 * dense matrix
# Done!
# Simulating dataset 2
# nCells: 1000
# nGenes: 1000
# nGroups: 1
# de.prob: 0.1
# nBatches: 1
# Simulating datasets using Splat
# Getting parameters...
# Creating simulation object...
# Simulating library sizes...
# Simulating gene means...
# Simulating BCV...
# Simulating counts...
# Simulating dropout (if needed)...
# Sparsifying assays...
# Automatically converting to sparse matrices, threshold = 0.95
# Skipping 'BatchCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BaseCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BCV': estimated sparse size 1.5 * dense matrix
# Skipping 'CellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'TrueCounts': estimated sparse size 2.25 * dense matrix
# Skipping 'counts': estimated sparse size 2.25 * dense matrix
# Done!
# Simulating dataset 3
# nCells: 1000
# nGenes: 1000
# nGroups: 1
# de.prob: 0.1
# nBatches: 1
# Simulating datasets using Splat
# Getting parameters...
# Creating simulation object...
# Simulating library sizes...
# Simulating gene means...
# Simulating BCV...
# Simulating counts...
# Simulating dropout (if needed)...
# Sparsifying assays...
# Automatically converting to sparse matrices, threshold = 0.95
# Skipping 'BatchCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BaseCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BCV': estimated sparse size 1.5 * dense matrix
# Skipping 'CellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'TrueCounts': estimated sparse size 2.21 * dense matrix
# Skipping 'counts': estimated sparse size 2.21 * dense matrix
# Done!
# Simulating dataset 4
# nCells: 1000
# nGenes: 1000
# nGroups: 1
# de.prob: 0.1
# nBatches: 1
# Simulating datasets using Splat
# Getting parameters...
# Creating simulation object...
# Simulating library sizes...
# Simulating gene means...
# Simulating BCV...
# Simulating counts...
# Simulating dropout (if needed)...
# Sparsifying assays...
# Automatically converting to sparse matrices, threshold = 0.95
# Skipping 'BatchCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BaseCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BCV': estimated sparse size 1.5 * dense matrix
# Skipping 'CellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'TrueCounts': estimated sparse size 2.23 * dense matrix
# Skipping 'counts': estimated sparse size 2.23 * dense matrix
# Done!
# Simulating dataset 5
# nCells: 1000
# nGenes: 1000
# nGroups: 1
# de.prob: 0.1
# nBatches: 1
# Simulating datasets using Splat
# Getting parameters...
# Creating simulation object...
# Simulating library sizes...
# Simulating gene means...
# Simulating BCV...
# Simulating counts...
# Simulating dropout (if needed)...
# Sparsifying assays...
# Automatically converting to sparse matrices, threshold = 0.95
# Skipping 'BatchCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BaseCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BCV': estimated sparse size 1.5 * dense matrix
# Skipping 'CellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'TrueCounts': estimated sparse size 2.25 * dense matrix
# Skipping 'counts': estimated sparse size 2.25 * dense matrix
# Done!
# Simulating dataset 6
# nCells: 1000
# nGenes: 1000
# nGroups: 1
# de.prob: 0.1
# nBatches: 1
# Simulating datasets using Splat
# Getting parameters...
# Creating simulation object...
# Simulating library sizes...
# Simulating gene means...
# Simulating BCV...
# Simulating counts...
# Simulating dropout (if needed)...
# Sparsifying assays...
# Automatically converting to sparse matrices, threshold = 0.95
# Skipping 'BatchCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BaseCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BCV': estimated sparse size 1.5 * dense matrix
# Skipping 'CellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'TrueCounts': estimated sparse size 2.21 * dense matrix
# Skipping 'counts': estimated sparse size 2.21 * dense matrix
# Done!

We can also pass a seed vector whose length equals n:

simulated_data <- simpipe::simulate_datasets(
  parameters = estimation_result,
  other_prior = list(batchCells = 1000,
                     nGenes = 1000),
  n = 3,
  return_format = "list",
  verbose = TRUE,
  seed = c(666, 888, 999)
)
# Simulating dataset 1
# nCells: 1000
# nGenes: 1000
# nGroups: 1
# de.prob: 0.1
# nBatches: 1
# Simulating datasets using Splat
# Getting parameters...
# Creating simulation object...
# Simulating library sizes...
# Simulating gene means...
# Simulating BCV...
# Simulating counts...
# Simulating dropout (if needed)...
# Sparsifying assays...
# Automatically converting to sparse matrices, threshold = 0.95
# Skipping 'BatchCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BaseCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BCV': estimated sparse size 1.5 * dense matrix
# Skipping 'CellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'TrueCounts': estimated sparse size 2.21 * dense matrix
# Skipping 'counts': estimated sparse size 2.21 * dense matrix
# Done!
# Simulating dataset 2
# nCells: 1000
# nGenes: 1000
# nGroups: 1
# de.prob: 0.1
# nBatches: 1
# Simulating datasets using Splat
# Getting parameters...
# Creating simulation object...
# Simulating library sizes...
# Simulating gene means...
# Simulating BCV...
# Simulating counts...
# Simulating dropout (if needed)...
# Sparsifying assays...
# Automatically converting to sparse matrices, threshold = 0.95
# Skipping 'BatchCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BaseCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BCV': estimated sparse size 1.5 * dense matrix
# Skipping 'CellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'TrueCounts': estimated sparse size 2.21 * dense matrix
# Skipping 'counts': estimated sparse size 2.21 * dense matrix
# Done!
# Simulating dataset 3
# nCells: 1000
# nGenes: 1000
# nGroups: 1
# de.prob: 0.1
# nBatches: 1
# Simulating datasets using Splat
# Getting parameters...
# Creating simulation object...
# Simulating library sizes...
# Simulating gene means...
# Simulating BCV...
# Simulating counts...
# Simulating dropout (if needed)...
# Sparsifying assays...
# Automatically converting to sparse matrices, threshold = 0.95
# Skipping 'BatchCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BaseCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BCV': estimated sparse size 1.5 * dense matrix
# Skipping 'CellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'TrueCounts': estimated sparse size 2.17 * dense matrix
# Skipping 'counts': estimated sparse size 2.17 * dense matrix
# Done!
# Simulating dataset 4
# nCells: 1000
# nGenes: 1000
# nGroups: 1
# de.prob: 0.1
# nBatches: 1
# Simulating datasets using Splat
# Getting parameters...
# Creating simulation object...
# Simulating library sizes...
# Simulating gene means...
# Simulating BCV...
# Simulating counts...
# Simulating dropout (if needed)...
# Sparsifying assays...
# Automatically converting to sparse matrices, threshold = 0.95
# Skipping 'BatchCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BaseCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BCV': estimated sparse size 1.5 * dense matrix
# Skipping 'CellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'TrueCounts': estimated sparse size 2.21 * dense matrix
# Skipping 'counts': estimated sparse size 2.21 * dense matrix
# Done!
# Simulating dataset 5
# nCells: 1000
# nGenes: 1000
# nGroups: 1
# de.prob: 0.1
# nBatches: 1
# Simulating datasets using Splat
# Getting parameters...
# Creating simulation object...
# Simulating library sizes...
# Simulating gene means...
# Simulating BCV...
# Simulating counts...
# Simulating dropout (if needed)...
# Sparsifying assays...
# Automatically converting to sparse matrices, threshold = 0.95
# Skipping 'BatchCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BaseCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BCV': estimated sparse size 1.5 * dense matrix
# Skipping 'CellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'TrueCounts': estimated sparse size 2.21 * dense matrix
# Skipping 'counts': estimated sparse size 2.21 * dense matrix
# Done!
# Simulating dataset 6
# nCells: 1000
# nGenes: 1000
# nGroups: 1
# de.prob: 0.1
# nBatches: 1
# Simulating datasets using Splat
# Getting parameters...
# Creating simulation object...
# Simulating library sizes...
# Simulating gene means...
# Simulating BCV...
# Simulating counts...
# Simulating dropout (if needed)...
# Sparsifying assays...
# Automatically converting to sparse matrices, threshold = 0.95
# Skipping 'BatchCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BaseCellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'BCV': estimated sparse size 1.5 * dense matrix
# Skipping 'CellMeans': estimated sparse size 1.5 * dense matrix
# Skipping 'TrueCounts': estimated sparse size 2.17 * dense matrix
# Skipping 'counts': estimated sparse size 2.17 * dense matrix
# Done!
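Both runs report six datasets because each of the two estimation results is simulated n = 3 times. The sparse-size values in the last log repeat with a period of three (2.21, 2.21, 2.17, then the same again), which is consistent with the three seeds being reused for each estimation result. The base R sketch below lays out that run plan. The ordering is an assumption inferred from the log, not a documented guarantee of simpipe, and the `plan` table is purely illustrative.

```r
# Names of the two estimation results and the seeds used above.
est_names <- c("data1", "data2")
seeds <- c(666, 888, 999)   # one seed per repetition (n = 3)

# Assumed run plan: every estimation result is simulated once per seed.
# expand.grid varies its first argument fastest, so seeds cycle within
# each estimation result.
plan <- expand.grid(seed = seeds, estimation = est_names,
                    stringsAsFactors = FALSE)
plan$dataset <- seq_len(nrow(plan))
plan
```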