BASiCS

This document demonstrates how to use the BASiCS method for parameter estimation and data simulation. We hope it is helpful.

Estimating parameters from a real dataset

Before simulating datasets, it is important to estimate some essential parameters from a real dataset so that the simulated data are more realistic. BASiCS frequently raises errors during execution, so for this demonstration we use a dataset that is known to run through successfully; it can be downloaded here.

library(simmethods)
# Load data
file_path <- "../../../../preprocessed_data/data95_pancreatic-alpha-cell-maturation_zhang.rds"
data <- readRDS(file_path)
ref_data <- t(data$data$counts)
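
The spike-in check further down reads gene names from rownames(ref_data), so the transposed matrix is expected to have genes in rows and cells in columns. The following is a minimal sketch of a sanity check on a small made-up matrix; `toy_counts`, `toy_ref`, and their dimension names are illustrative assumptions and not part of the dataset:

```r
# Toy stand-in for data$data$counts: 3 cells (rows) x 4 genes (columns).
toy_counts <- matrix(
  c(0, 2, 5, 1,
    3, 0, 0, 4,
    1, 1, 7, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("cell", 1:3),
                  c("GeneA", "GeneB", "ERCC-00001", "ERCC-00002"))
)

# Transpose so that genes are rows and cells are columns.
toy_ref <- t(toy_counts)

# Basic checks: a numeric matrix of non-negative whole numbers with gene names.
stopifnot(
  is.matrix(toy_ref),
  is.numeric(toy_ref),
  all(toy_ref >= 0),
  all(toy_ref == round(toy_ref)),
  !is.null(rownames(toy_ref))
)
dim(toy_ref)
# [1] 4 3
```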

Prior information of cell batches

BASiCS allows users to supply prior information on cell batches as a numeric vector that specifies the batch label of each cell. Data95 does not contain batch information, so we randomly sample labels for the cells.

set.seed(666)
batch_label <- sample(c(1,2), size = ncol(ref_data), replace = TRUE)
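
Because the labels are sampled at random, it can be worth confirming that there is one label per cell and that both batches actually received cells. A minimal sketch, assuming a hypothetical cell count `n_cells` in place of ncol(ref_data):

```r
# Hypothetical number of cells; in the tutorial this is ncol(ref_data).
n_cells <- 322

set.seed(666)
batch_label <- sample(c(1, 2), size = n_cells, replace = TRUE)

# One label per cell, and both batches should be represented.
stopifnot(length(batch_label) == n_cells,
          length(unique(batch_label)) == 2)

# Number of cells assigned to each batch.
table(batch_label)
```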

Use the simmethods::BASiCS_estimation command to run the estimation step. Note that it may take a long time.

estimate_result <- simmethods::BASiCS_estimation(
  ref_data = ref_data,
  other_prior = list(batch.condition = batch_label),
  verbose = TRUE,
  seed = 8
)

Prior information of ERCC spike-in control RNA

Alternatively, users can supply prior information on the ERCC spike-in control RNA. Three things are needed:

  • Spike-in genes: the count matrix must contain spike-in genes whose names start with ERCC-. Otherwise, an error may occur.
  • dilution.factor: The dilution factor applied to the ERCC spike-in mix.
  • volume: The volume (in microliters) of spike-in mix used in the sequencing step.

Check the names of the ERCC spike-in RNAs:

rownames(ref_data)[grep("^ERCC-", rownames(ref_data))]
#  [1] "ERCC-00116" "ERCC-00025" "ERCC-00165" "ERCC-00053" "ERCC-00112"
#  [6] "ERCC-00078" "ERCC-00084" "ERCC-00019" "ERCC-00163" "ERCC-00099"
# [11] "ERCC-00160" "ERCC-00059" "ERCC-00035" "ERCC-00092" "ERCC-00170"
# [16] "ERCC-00144" "ERCC-00062" "ERCC-00044" "ERCC-00157"
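
The same prefix match can be turned into an early guard, so that a missing spike-in set is reported before the long-running estimation starts. This is a minimal sketch on hypothetical gene names; `gene_names` stands in for rownames(ref_data), and the error message is our own wording, not one produced by BASiCS:

```r
# Hypothetical gene names standing in for rownames(ref_data).
gene_names <- c("Gcg", "Ins1", "ERCC-00116", "ERCC-00025", "Actb")

# Logical index of spike-in rows, identified by the "ERCC-" prefix.
is_spike <- grepl("^ERCC-", gene_names)

# Stop early with a clear message if no spike-in genes are present.
if (!any(is_spike)) {
  stop("No genes with the 'ERCC-' prefix were found in the count matrix.")
}
sum(is_spike)
# [1] 2
```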

Prepare the other two parameters:

other_prior <- list(dilution.factor = data$data_info$dilution_factor,
                    volume = data$data_info$volume)

Execute the parameter estimation (it may take a long time):

estimate_result <- simmethods::BASiCS_estimation(
  ref_data = ref_data,
  other_prior = other_prior,
  verbose = TRUE,
  seed = 8
)

Simulating datasets using BASiCS

After estimating parameters from a real dataset, we simulate datasets from the learned parameters under different scenarios:

  1. Datasets with default parameters
  2. Determine the number of cells and genes
  3. Simulate two or more batches

Datasets with default parameters

simulate_result <- simmethods::BASiCS_simulation(
  parameters = estimate_result[["estimate_result"]],
  return_format = "list",
  seed = 1
)
# nCells: 322
# nGenes: 6119
# nBatches: 1
result <- simulate_result[["simulate_result"]]
dim(result$count_data)
# [1] 6138  322
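
The count matrix has 6138 rows although nGenes is reported as 6119. The difference equals the number of ERCC names printed during the spike-in check, which suggests that the spike-in rows are included alongside the simulated genes. This is an inference from the numbers shown here, not something the output states. The arithmetic is:

```r
n_genes  <- 6119  # nGenes reported by the simulation
n_spikes <- 19    # ERCC names printed during the spike-in check
n_genes + n_spikes
# [1] 6138
```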

Determine the number of cells and genes

In BASiCS, we can set batchCells and nGenes to specify the number of cells and genes.

Here, we simulate a new dataset with 1000 cells and 1000 genes:

simulate_result <- simmethods::BASiCS_simulation(
  parameters = estimate_result[["estimate_result"]],
  return_format = "list",
  other_prior = list(batchCells = 1000,
                     nGenes = 1000),
  seed = 3
)
# nCells: 1000
# nGenes: 1000
# nBatches: 1
result <- simulate_result[["simulate_result"]][["count_data"]]
dim(result)
# [1] 1019 1000

Simulate two or more batches

There is a strict rule for simulating cell batches using BASiCS: 1) Users can simulate cell batches only when cell batch labels were used for parameter estimation; 2) The number of simulated batches must equal the number of real cell batches used in parameter estimation.

Because the most recent parameter estimation above used spike-in information rather than cell batch labels, we cannot simulate data with batch effects from it. For demonstration purposes, however, we show how cell batches would be simulated.

The number of cell batches is determined by the batchCells parameter, whose length gives the number of batches to simulate. For example, setting batchCells = c(100, 200, 300) requests three batches of cells.

simulate_result <- simmethods::BASiCS_simulation(
  parameters = estimate_result[["estimate_result"]],
  return_format = "list",
  other_prior = list(batchCells = c(100,200,300),
                     nGenes = 1000),
  seed = 3
)
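
The rule stated above can be checked before calling the simulation. The following is a minimal sketch in base R, assuming the estimation used the two randomly sampled batch labels from earlier and that each element of batchCells gives the number of cells in one batch; the variable names `n_est_batches`, `n_sim_batches`, and `batches_match` are illustrative only:

```r
# Hypothetical batch labels used at estimation time (two batches).
set.seed(666)
batch_label <- sample(c(1, 2), size = 322, replace = TRUE)

# Requested simulation layout: one entry per batch.
batchCells <- c(100, 200, 300)

n_est_batches <- length(unique(batch_label))
n_sim_batches <- length(batchCells)

# The rule requires the two counts to match.
batches_match <- n_sim_batches == n_est_batches
batches_match
# [1] FALSE

# Total number of cells requested across all batches.
sum(batchCells)
# [1] 600
```

Under these assumptions, a two-element vector such as batchCells = c(150, 150) would satisfy the rule for an estimate built on two batches.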