> References > Vignettes > SCRIP

SCRIP

Estimating parameters from a real dataset
Simulating datasets using SCRIP

Here SCRIP method will be demonstrated clearly and hope that this document can help you.

Estimating parameters from a real dataset

Before simulating datasets, it is important to estimate some essential parameters from a real dataset in order to make the simulated data more real. If you do not have a single-cell transcriptomics count matrix now, you can use the data collected in simmethods package by simmethods:data command.

library(simmethods)
library(SingleCellExperiment)
# Load data
ref_data <- simmethods::data
dim(ref_data)
# [1] 4000  160

Using simmethods::SCRIP_estimation command to execute the estimation step.

estimate_result <- simmethods::SCRIP_estimation(ref_data = ref_data,
                                                verbose = T,
                                                seed = 10)
# Estimating parameters using SCRIP

Simulating datasets using SCRIP

After estimating parameter from a real dataset, we will simulate a dataset based on the learned parameters with different scenarios.

Datasets with default parameters
Different modes in SCRIP
Determin the number of cells and genes
Simulate two or more groups
Simulate two or more batches
Simulate more groups and batches simutaniously

Datasets with default parameters

The reference data contains 160 cells and 4000 genes, if we simulate datasets with default parameters and then we will obtain a new data which has the same size as the reference data. In addtion, the simulated dataset will have one group and one batch of cells.

simulate_result <- simmethods::SCRIP_simulation(
  parameters = estimate_result[["estimate_result"]],
  ref_data = ref_data,
  return_format = "SCE",
  seed = 111
)
# nCells: 160
# nGenes: 4000
# nGroups: 1
# de.prob: 0.1
# nBatches: 1

The reference data must be input when simulating new datasets.

SCE_result <- simulate_result[["simulate_result"]]
dim(SCE_result)
# [1] 4000  160

head(colData(SCE_result))
# DataFrame with 6 rows and 3 columns
#         cell_name       batch       group
#       <character> <character> <character>
# Cell1       Cell1      Batch1      Group1
# Cell2       Cell2      Batch1      Group1
# Cell3       Cell3      Batch1      Group1
# Cell4       Cell4      Batch1      Group1
# Cell5       Cell5      Batch1      Group1
# Cell6       Cell6      Batch1      Group1

Time consuming:

simulate_result$simulate_detection$Elapsed_Time_sec
# [1] 19.768

Different modes in SCRIP

SCRIP contains five different simulation modes, and you can specify which mode do you use (default is GP-trendedBCV):

GP-trendedBCV
GP-commonBCV
BGP-commonBCV
BP
BGP-trendedBCV

simulate_result <- simmethods::SCRIP_simulation(parameters = estimate_result[["estimate_result"]],
                                                ref_data = ref_data,
                                                other_prior = list(mode = "BP"),
                                                return_format = "list",
                                                verbose = TRUE,
                                                seed = 111)
# nCells: 160
# nGenes: 4000
# nGroups: 1
# de.prob: 0.1
# nBatches: 1
# Simulating datasets using SCRIP

simulate_result <- simmethods::SCRIP_simulation(parameters = estimate_result[["estimate_result"]],
                                                ref_data = ref_data,
                                                other_prior = list(mode = "BGP-commonBCV"),
                                                return_format = "list",
                                                verbose = TRUE,
                                                seed = 111)
# nCells: 160
# nGenes: 4000
# nGroups: 1
# de.prob: 0.1
# nBatches: 1
# Simulating datasets using SCRIP

Determin the number of cells and genes

In SCRIP, we can not set nCells directly and should set batchCells instead. For example, if we want to simulate 500 cells, we can type other_prior = list(batchCells = 500). For genes, we can just set nGenes. Here, we simulate a new dataset with 500 cells and 1000 genes:

simulate_result <- simmethods::SCRIP_simulation(
  parameters = estimate_result[["estimate_result"]],
  ref_data = ref_data,
  return_format = "list",
  other_prior = list(batchCells = 500,
                     nGenes = 1000),
  seed = 111
)
# nCells: 500
# nGenes: 1000
# nGroups: 1
# de.prob: 0.1
# nBatches: 1

result <- simulate_result[["simulate_result"]][["count_data"]]
dim(result)
# [1] 1000  500

Simulate two or more groups

In SCRIP, we can not set nGroups directly and should set prob.group instead. For example, if we want to simulate 2 groups, we can type other_prior = list(prob.group = c(0.5, 0.5)). Note that the sum of prob.group numeric vector must equal to 1, so we can also set prob.group = c(0.3, 0.7).

In addtion, if we want to simulate three or more groups, we should obey the rules:

The length of prob.group vector must always equal to the number of groups.
The sum of prob.group numeric vector must equal to 1.

For demonstration, we will simulate three groups using the learned parameters.

simulate_result <- simmethods::SCRIP_simulation(
  parameters = estimate_result[["estimate_result"]],
  ref_data = ref_data,
  return_format = "list",
  other_prior = list(batchCells = 500,
                     nGenes = 1000,
                     prob.group = c(0.1, 0.3, 0.6)),
  seed = 111
)
# nCells: 500
# nGenes: 1000
# nGroups: 3
# de.prob: 0.1
# nBatches: 1

result <- simulate_result[["simulate_result"]][["count_data"]]
dim(result)
# [1] 1000  500

## cell information
cell_info <- simulate_result[["simulate_result"]][["col_meta"]]
table(cell_info$group)
# 
# Group1 Group2 Group3 
#     46    156    298

## gene information
gene_info <- simulate_result[["simulate_result"]][["row_meta"]]
### the proportion of DEGs
table(gene_info$de_gene)[2]/nrow(result) ## de.prob = 0.1
# yes 
# 0.1

We can see that the proportion of differential expressed genes is 0.1 (equals to the default). Next, if we want to know the fold change between two groups, we will do division with the groups that we are interested in.

## fc between group2 and group1
fc_group1_to_group2 <- gene_info$DEFacGroup2/gene_info$DEFacGroup1
## fc between group3 and group1
fc_group1_to_group3 <- gene_info$DEFacGroup3/gene_info$DEFacGroup1
## fc between group3 and group2
fc_group2_to_group3 <- gene_info$DEFacGroup3/gene_info$DEFacGroup2

Simulate two or more batches

In SCRIP, we can not set nBatches directly and should set batchCells instead. For example, if we want to simulate 2 batches, we can type other_prior = list(batchCells = c(250, 250)). Note that the sum of batchCells numeric vector represents the total number of cells and the length of the vector equals to the number of batches.

In addtion, if we want to simulate three or more batches, we should obey the rules:

The length of batchCells vector always equals to the number of batches.
The sum of batchCells numeric vector represents the total number of cells.

For demonstration, we will simulate three batches using the learned parameters.

simulate_result <- simmethods::SCRIP_simulation(
  parameters = estimate_result[["estimate_result"]],
  ref_data = ref_data,
  return_format = "list",
  other_prior = list(batchCells = c(200, 300),
                     nGenes = 1000),
  seed = 111
)
# nCells: 500
# nGenes: 1000
# nGroups: 1
# de.prob: 0.1
# nBatches: 2

result <- simulate_result[["simulate_result"]][["count_data"]]
dim(result)
# [1] 1000  500

## cell information
cell_info <- simulate_result[["simulate_result"]][["col_meta"]]
table(cell_info$batch)
# 
# Batch1 Batch2 
#    200    300

Simulate more groups and batches simutaniously

As mentioned before, we can set prob.group and batchCells to determine the number of groups and batches and we can also set de.prob to specify the proportion of DEGs. Here, we simulate a dataset with following settings:

1000 cells
2000 genes
three groups with 0.2 proportion of DEGs
two batches

simulate_result <- simmethods::SCRIP_simulation(
  parameters = estimate_result[["estimate_result"]],
  ref_data = ref_data,
  return_format = "list",
  other_prior = list(batchCells = c(500, 500),
                     nGenes = 2000,
                     de.prob = 0.2,
                     prob.group = c(0.2, 0.3, 0.5)),
  seed = 111
)
# nCells: 1000
# nGenes: 2000
# nGroups: 3
# de.prob: 0.2
# nBatches: 2

result <- simulate_result[["simulate_result"]][["count_data"]]
dim(result)
# [1] 2000 1000

## cell information
cell_info <- simulate_result[["simulate_result"]][["col_meta"]]
table(cell_info$batch)
# 
# Batch1 Batch2 
#    500    500
table(cell_info$group)
# 
# Group1 Group2 Group3 
#    186    321    493

## gene information
gene_info <- simulate_result[["simulate_result"]][["row_meta"]]
### proportion of DEGs
table(gene_info$de_gene)[2]/nrow(result)
#    yes 
# 0.1685