BEARscc

This document demonstrates how to use the BEARscc method to estimate parameters from a real dataset and to simulate new data.

Estimating parameters from a real dataset

Before simulating datasets, it is important to estimate some essential parameters from a real dataset so that the simulated data are more realistic. The reference data can be downloaded from Zenodo (the URL is given in the code below).

BEARscc uses ERCC spike-in genes as the reference for measuring the variation in the real dataset, so users must pay attention to the following points.

  1. Make sure that the count matrix contains spike-in genes whose names start with the prefix ERCC-. Otherwise, an error may occur.

  2. BEARscc needs Ensembl gene IDs for the estimation step, so it is best to convert the gene IDs beforehand. Users can also input official gene IDs, in which case the procedure converts them to Ensembl gene IDs. Note that this conversion may lose some genes whose IDs cannot be matched.

  3. If the gene IDs need to be converted, users must supply the species parameter, either mouse or human, so that the corresponding database is used for the conversion.

  4. Two other important parameters are dilution.factor and volume:

  • dilution.factor: The factor by which the ERCC spike-in mix was diluted.
  • volume: The volume (in microliters) of spike-in mix used in the sequencing step.
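
To give a feel for why these two parameters matter, the sketch below converts a spike-in concentration into an expected number of molecules per sample using the standard mole-to-molecule conversion. It is only an illustration of the arithmetic: the concentration value is hypothetical, and the exact formula used internally by simmethods or BEARscc is not shown in this document and may differ.

```r
# Hypothetical concentration of one spike-in transcript in the undiluted mix
# (attomoles per microliter). This value is illustrative only.
conc_attomol_per_ul <- 30000

dilution_factor <- 50000   # same value as dilution.factor used in this document
volume_ul <- 0.03          # same value as volume used in this document
avogadro <- 6.02214076e23  # molecules per mole

# attomoles/uL -> moles/uL -> molecules/uL, then scale by the volume added
# and divide by the dilution factor.
molecules <- conc_attomol_per_ul * 1e-18 * avogadro * volume_ul / dilution_factor
round(molecules)
```

A larger dilution factor or a smaller volume lowers the expected number of spike-in molecules, which is why both values need to match the actual experiment.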
library(simmethods)
library(SingleCellExperiment)
# Load data (downloaded from https://zenodo.org/record/8251596/files/data23_GSE62270.rds?download=1)
data <- readRDS("../../../../preprocessed_data/data23_GSE62270.rds")
ref_data <- data$data
## Other prior information: dilution factor, spike-in volume (microliters), and species.
other_prior <- list(dilution.factor = 50000, volume = 0.03, species = "mouse")
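
Before running the estimation, it can be worth checking the first two notes above against the count matrix. The following is a minimal sketch on a toy matrix that stands in for ref_data; the object names and the regular expression for Ensembl-style IDs are assumptions made for illustration, not part of simmethods, and IDs with version suffixes (such as ".1") would not match this pattern.

```r
# Toy count matrix standing in for ref_data (genes in rows, cells in columns).
set.seed(1)
toy_counts <- matrix(
  rpois(20, lambda = 5),
  nrow = 5,
  dimnames = list(
    c("ENSMUSG00000000001", "ENSMUSG00000000028", "Gapdh",
      "ERCC-00002", "ERCC-00130"),
    paste0("cell", 1:4)
  )
)

# Rows whose names start with the ERCC- prefix are treated as spike-ins.
is_spike <- grepl("^ERCC-", rownames(toy_counts))

# Rough pattern for Ensembl gene IDs (e.g. ENSG... or ENSMUSG...).
looks_ensembl <- grepl("^ENS[A-Z]*G[0-9]{11}$", rownames(toy_counts))

if (!any(is_spike)) {
  stop("No rows with the ERCC- prefix were found in the count matrix.")
}

# Number of spike-in rows, and genes that would need ID conversion.
n_spike <- sum(is_spike)
needs_conversion <- rownames(toy_counts)[!is_spike & !looks_ensembl]
n_spike
needs_conversion
```

With real data, replacing toy_counts by ref_data shows how many spike-ins are present and which gene IDs are not yet in Ensembl format.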

Use the simmethods::BEARscc_estimation function to run the estimation step.

estimate_result <- simmethods::BEARscc_estimation(
  ref_data = ref_data,
  other_prior = other_prior,
  verbose = TRUE,
  seed = 8
)
# Estimating parameters using BEARscc
# [1] "Fitting parameter alpha to establish spike-in derived noise model."
# [1] "Estimating error for spike-ins with alpha = 0"
# [1] "Estimating error for spike-ins with alpha = 0.25"
# [1] "Estimating error for spike-ins with alpha = 0.5"
# [1] "Estimating error for spike-ins with alpha = 0.75"
# [1] "Estimating error for spike-ins with alpha = 1"
# [1] "Warning: there are no spike-ins that were detected inevery sample. As a result the actual transcript countthreshold, k, at which drop-outs are not present will beextrapolated rather than interpolated. The extrapolated value for k is, 2205."
# [1] "There are adequate spike-in drop-outs to build the drop-out model. Estimating the drop-out model now."

Simulating datasets using BEARscc

BEARscc does not let users set the number of cells or genes. The dimensions of the simulated data are printed when the simulation runs.

simulate_result <- simmethods::BEARscc_simulation(
  parameters = estimate_result[["estimate_result"]],
  return_format = "list",
  seed = 111
)
# nCells: 672
# nGenes: 21427
# [1] "Creating a simulated replicated counts matrix: 1."
result <- simulate_result[["simulate_result"]]
dim(result$count_data)
# [1] 21427   672
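
Because the number of cells and genes cannot be chosen, one way to see what was returned is to compare the simulated matrix with the reference matrix. The sketch below uses two toy matrices as stand-ins for ref_data and result$count_data; the names ref and sim are hypothetical, and whether the two matrices agree in a real run should be checked rather than assumed.

```r
# Toy matrices standing in for ref_data and result$count_data.
gene_names <- paste0("gene", 1:3)
cell_names <- paste0("cell", 1:2)
ref <- matrix(0, nrow = 3, ncol = 2, dimnames = list(gene_names, cell_names))
sim <- matrix(1, nrow = 3, ncol = 2, dimnames = list(gene_names, cell_names))

# Compare the shape and the gene order of the two matrices.
same_shape <- identical(dim(sim), dim(ref))
same_genes <- identical(rownames(sim), rownames(ref))
c(same_shape = same_shape, same_genes = same_genes)
```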