This document demonstrates how to use the BASiCS method; we hope you find it helpful.
Before simulating datasets, it is important to estimate some essential parameters from a real dataset so that the simulated data are more realistic. BASiCS often raises errors during execution, so for this demonstration we use a dataset that runs successfully; it can be downloaded here.
library(simmethods)
# Load data
file_path <- "../../../../preprocessed_data/data95_pancreatic-alpha-cell-maturation_zhang.rds"
data <- readRDS(file_path)
ref_data <- t(data$data$counts)
BASiCS allows users to supply prior information about cell batches as a numeric vector that specifies the batch label of each cell. Data95 does not contain batch information, so we randomly assign labels to the cells.
set.seed(666)
batch_label <- sample(c(1,2), size = ncol(ref_data), replace = TRUE)
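Before passing the labels on, it can help to confirm that the vector holds one label per cell and to see how many cells fall in each batch. The sketch below uses a small toy matrix (`toy_counts`, a made-up stand-in for `ref_data`) so that it runs without the real data; the names `toy_counts`, `toy_batch`, and `batch_sizes` are illustrative only.

```r
# Toy stand-in for ref_data: 5 genes (rows) x 20 cells (columns); illustrative only
set.seed(666)
toy_counts <- matrix(rpois(100, lambda = 5), nrow = 5, ncol = 20)

# One randomly assigned batch label (1 or 2) per cell, as in the text
toy_batch <- sample(c(1, 2), size = ncol(toy_counts), replace = TRUE)

# There must be exactly one label per cell (column)
stopifnot(length(toy_batch) == ncol(toy_counts))

# Number of cells assigned to each batch
batch_sizes <- table(toy_batch)
print(batch_sizes)
```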
Use the simmethods::BASiCS_estimation command to run the estimation step. Note that it may take a long time.
estimate_result <- simmethods::BASiCS_estimation(
  ref_data = ref_data,
  other_prior = list(batch.condition = batch_label),
  verbose = TRUE,
  seed = 8
)
Alternatively, users can supply prior information about the ERCC spike-in control RNA, which involves three important parameters:
The names of the ERCC spike-in RNAs: the corresponding row names in the reference data must start with ERCC-. If not, an error may occur.
dilution.factor: The dilution factor used to dilute the ERCC spike-in mix.
volume: The volume (in microliters) of spike-in mix used in the sequencing step.
Check the names of the ERCC spike-in RNAs:
rownames(ref_data)[grep("^ERCC-", rownames(ref_data))]
# [1] "ERCC-00116" "ERCC-00025" "ERCC-00165" "ERCC-00053" "ERCC-00112"
# [6] "ERCC-00078" "ERCC-00084" "ERCC-00019" "ERCC-00163" "ERCC-00099"
# [11] "ERCC-00160" "ERCC-00059" "ERCC-00035" "ERCC-00092" "ERCC-00170"
# [16] "ERCC-00144" "ERCC-00062" "ERCC-00044" "ERCC-00157"
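The same pattern match can be used to confirm that spike-in rows exist and to separate them from the endogenous genes. This is a minimal sketch on a made-up matrix (`toy_counts`, with illustrative gene names and counts); it is not part of simmethods and only shows the row-name check.

```r
# Toy count matrix with two genes and two spike-ins; values are illustrative
toy_counts <- matrix(
  c(3, 0, 5, 1,
    7, 2, 0, 4,
    10, 12, 9, 11,
    6, 5, 7, 8),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("GeneA", "GeneB", "ERCC-00116", "ERCC-00025"),
                  paste0("cell", 1:4))
)

# TRUE for rows whose name starts with "ERCC-"
is_spike <- grepl("^ERCC-", rownames(toy_counts))

# Stop early if no spike-in rows are present
if (!any(is_spike)) {
  stop("No row names start with 'ERCC-'")
}

# Split the matrix into spike-in rows and endogenous gene rows
spike_counts <- toy_counts[is_spike, , drop = FALSE]
gene_counts  <- toy_counts[!is_spike, , drop = FALSE]
```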
Prepare the other two parameters:
other_prior <- list(
  dilution.factor = data$data_info$dilution_factor,
  volume = data$data_info$volume
)
Execute the parameter estimation (it may take a long time):
estimate_result <- simmethods::BASiCS_estimation(
  ref_data = ref_data,
  other_prior = other_prior,
  verbose = TRUE,
  seed = 8
)
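Because the estimation step is slow, one option is to save its result to disk and reload it in later sessions instead of re-running it. The sketch below shows the idea with a hypothetical helper, `cached_result()`, and a placeholder `slow_step()` standing in for the real estimation call; neither is part of simmethods.

```r
# Hypothetical helper: run `compute()` once and reuse the saved result afterwards
cached_result <- function(cache_file, compute) {
  if (file.exists(cache_file)) {
    return(readRDS(cache_file))
  }
  result <- compute()
  saveRDS(result, cache_file)
  result
}

# Placeholder for a slow step; replace with the real estimation call
slow_step <- function() {
  Sys.sleep(0.1)
  list(value = 42)
}

cache_file <- tempfile(fileext = ".rds")
first  <- cached_result(cache_file, slow_step)  # computes and saves
second <- cached_result(cache_file, slow_step)  # reads the saved file
```

In practice, `compute` could be a function that wraps the `simmethods::BASiCS_estimation` call, and `cache_file` a path inside the project rather than a temporary file.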
After estimating parameters from a real dataset, we simulate datasets from the learned parameters under different scenarios.
simulate_result <- simmethods::BASiCS_simulation(
  parameters = estimate_result[["estimate_result"]],
  return_format = "list",
  seed = 1
)
# nCells: 322
# nGenes: 6119
# nBatches: 1
result <- simulate_result[["simulate_result"]]
dim(result$count_data)
# [1] 6138 322
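The reported gene count (6119) and the number of rows in `count_data` (6138) differ by 19, which matches the number of ERCC names listed earlier. One plausible reading, which is an assumption rather than something stated here by the package, is that the spike-in rows are kept in the output alongside the simulated genes. The arithmetic can be checked directly:

```r
n_genes  <- 6119  # nGenes reported by the simulation
n_spikes <- 19    # number of ERCC names listed earlier
n_rows   <- 6138  # rows of result$count_data

# Rows not accounted for by the simulated genes
extra_rows <- n_rows - n_genes

# Does the difference match the number of spike-ins?
extra_rows == n_spikes
```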
In BASiCS, we can set batchCells and nGenes to specify the number of cells and genes.
Here, we simulate a new dataset with 1000 cells and 1000 genes:
simulate_result <- simmethods::BASiCS_simulation(
  parameters = estimate_result[["estimate_result"]],
  return_format = "list",
  other_prior = list(batchCells = 1000,
                     nGenes = 1000),
  seed = 3
)
# nCells: 1000
# nGenes: 1000
# nBatches: 1
result <- simulate_result[["simulate_result"]][["count_data"]]
dim(result)
# [1] 1019 1000
BASiCS imposes strict rules on simulating cell batches: 1) users can simulate cell batches only when cell batch labels were used for parameter estimation; 2) the number of simulated batches must equal the number of real cell batches used in parameter estimation.
Because the most recent parameter estimation used spike-in information rather than cell batch labels, we cannot simulate data with batch effects from it. For demonstration purposes, however, we still show how cell batches would be simulated.
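The two conditions above can be written as a small pre-flight check. The function below, `can_simulate_batches()`, is a hypothetical helper for illustration only and is not part of simmethods or BASiCS; it takes the batch labels used in estimation (or `NULL` if none were used) and the intended `batchCells` vector.

```r
# Hypothetical helper (not part of simmethods): checks the two conditions above
can_simulate_batches <- function(estimation_batches, batchCells) {
  # Rule 1: batch labels must have been used during estimation
  if (is.null(estimation_batches)) {
    return(FALSE)
  }
  # Rule 2: same number of batches in estimation and simulation
  length(unique(estimation_batches)) == length(batchCells)
}

labels_used <- c(1, 2, 1, 2, 2)  # toy labels with two batches

no_labels   <- can_simulate_batches(NULL, c(100, 200, 300))         # no labels used
mismatch    <- can_simulate_batches(labels_used, c(100, 200, 300))  # 2 vs 3 batches
same_number <- can_simulate_batches(labels_used, c(100, 200))       # 2 vs 2 batches
```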
The number of cell batches is determined by the batchCells parameter, whose length gives the number of batches to simulate. For example, setting batchCells = c(100, 200, 300) simulates three batches of cells.
simulate_result <- simmethods::BASiCS_simulation(
  parameters = estimate_result[["estimate_result"]],
  return_format = "list",
  other_prior = list(batchCells = c(100, 200, 300),
                     nGenes = 1000),
  seed = 3
)
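To make the meaning of `batchCells` concrete, the sketch below derives the number of batches and a per-cell batch label from the vector. It assumes that each element is the number of cells in one batch, so that the total cell count is the sum of the elements; how simmethods assigns labels internally is not shown here, and `batch_labels` is an illustrative name.

```r
batchCells <- c(100, 200, 300)

n_batches <- length(batchCells)  # number of batches
n_cells   <- sum(batchCells)     # total number of cells (assumed to be the sum)

# One batch label per cell: 100 ones, then 200 twos, then 300 threes
batch_labels <- rep(seq_along(batchCells), times = batchCells)
table(batch_labels)
```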