Parameter Estimation

Estimating the essential parameters from real datasets is a necessary step before simulating a new dataset. In this vignette, we demonstrate three ways of performing the estimation step:

  • Using the local R environment
  • Using Docker, called from R
  • Handling special estimation scenarios with the simpipe package

Estimate Parameters From Local R

Make sure that the three main packages (simutils, simmethods and simpipe) are installed before running the estimation. If they are not, please refer to Installation.

Step1: Load Packages

First, load the packages:

library(simmethods)
library(simpipe)

Step2: Prepare Your Data and Prior Information

Estimation requires a real (reference) dataset as input. You can load the example dataset from simmethods with:

ref_data <- simmethods::data

The gene expression profile should be an ordinary (dense) matrix, not a sparse matrix or a data frame.
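If your own data arrive as a data frame, base R's as.matrix() converts them to an ordinary matrix. The sketch below uses a small made-up count table (toy_counts is a hypothetical stand-in, not the package's example data) and assumes that genes are in rows and cells are in columns.

```r
# Hypothetical toy counts: 3 genes (rows) x 4 cells (columns), stored as a data frame.
toy_counts <- data.frame(
  cell1 = c(0, 5, 2),
  cell2 = c(1, 0, 3),
  cell3 = c(4, 2, 0),
  cell4 = c(0, 0, 7),
  row.names = c("geneA", "geneB", "geneC")
)

# Convert to an ordinary (dense) numeric matrix; row and column names are kept.
toy_matrix <- as.matrix(toy_counts)

# Basic checks before passing the object on as ref_data.
stopifnot(is.matrix(toy_matrix), is.numeric(toy_matrix))
dim(toy_matrix)
```

A sparse matrix from the Matrix package can usually be densified with as.matrix() as well, at the cost of more memory.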

The prior information on cell groups is also included in the simmethods package; we demonstrate how to use it later.

group_information <- simmethods::group_condition

Step3: Estimate Parameters

Example A: Splat (Only need gene expression matrix)

Splat is one of the methods in the Splatter package, and it needs only the reference (real) data to learn its parameters. We can call the Splat_estimation function directly:

estimation_result <- simmethods::Splat_estimation(
  ref_data = ref_data,
  seed = 111)

The returned list contains two types of information:

  • estimate_result, the parameters learned by Splat
  • estimate_detection, the running time and memory usage measured by the peakRAM package

Example B: zingeR (Cell group information is needed)

The zingeR method requires cell group information. We can prepare a numeric vector that specifies the group of every cell in the expression matrix.

group_information <- as.numeric(simmethods::group_condition)

The other_prior parameter takes a list of additional prior information, including:

  • group.condition, the numeric vector of cell group labels.
  • batch.condition, the numeric vector of cell batch labels.
  • other named entries, depending on the method
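When group labels are stored as text, they need to be converted to numeric codes before they go into the list. The following is a minimal sketch with made-up labels (cell_labels and batch_codes are hypothetical, and the labels are assumed to follow the order of the cells in the matrix). Calling as.numeric() directly on character labels would yield NA, so the labels go through factor() first.

```r
# Hypothetical labels for six cells, assumed to be in the same order as the cells.
cell_labels <- c("ctrl", "ctrl", "treated", "treated", "ctrl", "treated")

# factor() assigns integer codes by sorted level: ctrl -> 1, treated -> 2.
group_codes <- as.numeric(factor(cell_labels))

# Hypothetical batch labels, already numeric.
batch_codes <- c(1, 1, 1, 2, 2, 2)

# Collect the prior information in a named list for the other_prior argument.
other_prior <- list(
  group.condition = group_codes,
  batch.condition = batch_codes
)

# Each vector should have one entry per cell.
stopifnot(length(other_prior$group.condition) == length(cell_labels))
str(other_prior)
```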

After preparing the dataset and prior information, we can use zingeR to estimate the parameters.

estimation_result <- simmethods::zingeR_estimation(
  ref_data = ref_data,
  other_prior = list(group.condition = group_information),
  seed = 111,
  verbose = TRUE)
# Estimating parameters using zingeR

Example C: Parameter and Prior Information

Users may want to know which parameters a method exposes and what kinds of prior information it needs. You can browse the help page with help(function_name) or ?function_name.

For example, if we want to know the parameters in the SPsimSeq method, we can call help(SPsimSeq_simulation).

help(SPsimSeq_simulation)

The help page details the prior information the method requires and the parameters that users commonly set:

# Details
# In addition to simulate datasets with default parameters, users want to simulate other kinds of datasets, e.g. a counts matrix with 2 or more cell groups. In SPsimSeq, you can set extra parameters to simulate datasets.
# 
# The customed parameters you can set are below:
# 
# nCells. In SPsimSeq, you can set nCells directly. For example, if you want to simulate 1000 cells, you can type other_prior = list(nCells = 1000).
# 
# nGenes. You can directly set other_prior = list(nGenes = 5000) to simulate 5000 genes.
# 
# group.condition. You can input cell group information as an integer vector to specify which group that each cell belongs to. See Examples.
# 
# de.prob. You can directly set other_prior = list(de.prob = 0.2) to simulate DEGs that account for 20 percent of all genes.
# 
# fc.group. You can directly set other_prior = list(fc.group = 2) to specify the minimum fold change of DEGs.
# 
# batch.condition. You can input cell batch information as an integer vector to specify which batch that each cell belongs to. See Examples.

Example D: Default Parameters

We only provide the default parameters for some of the methods:

  • Splat, SplatPop, Kersplat
  • Simple
  • SCRIP
  • Lun, Lun2
  • zinbwave
  • BASiCS
  • ESCO

To get the default parameters of SCRIP, run:

SCRIP_param <- simutils::default_parameters("SCRIP")

The default-parameter object can be used directly in the simulation step.

Estimate Parameters From Docker in R

Estimating parameters with a Docker container from R follows the same steps demonstrated above; only the function and the R package differ.

First, start the Docker service and check the installation:

library(simpipe2docker)
simpipe2docker::test_docker_installation(detailed = TRUE)
# ✔ Docker is installed
# ✔ Docker daemon is running
# ✔ Docker is at correct version (>1.0): 1.41
# ✔ Docker is in linux mode
# ✔ Docker can pull images
# ✔ Docker can run image
# ✔ Docker can mount temporary volumes
# ✔ Docker test successful -----------------------------------------------------------------
# [1] TRUE
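If the check above fails at the first step, a quick look at the search path can help. The sketch below uses only base R and merely tests whether a docker executable is visible from the R session; it does not verify that the daemon is running, and it is not part of simpipe2docker.

```r
# Sys.which() returns the full path of an executable, or "" if it is not found.
docker_path <- Sys.which("docker")
docker_found <- nzchar(unname(docker_path))

if (docker_found) {
  message("docker executable found at: ", docker_path)
} else {
  message("docker executable not found on PATH; install or start Docker first.")
}
```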

Next, prepare your data and prior information:

data <- simmethods::data
group_condition <- as.numeric(simmethods::group_condition)

Estimate parameters with the Splat method:

estimation_result <- simpipe2docker::estimate_parameters_container(
  ref_data = data,
  method = "Splat", 
  other_prior = list(group.condition = group_condition),
  seed = 111,
  verbose = TRUE
)
# Learning parameters from data 1
# Running /usr/local/bin/docker run --name \
#   20230510_151550__container__wjYWE9mSHC -e 'TMPDIR=/tmp2' --workdir \
#   /home/admin/ -v \
#   '/var/folders/1l/xmc98tgx0m37wxtbtwnl6h7c0000gn/T//RtmpyDFvJI:/home/admin/docker_path' \
#   -v \
#   '/tmp/folders/1l/xmc98tgx0m37wxtbtwnl6h7c0000gn/T//RtmpyDFvJI/file414b397e13a0/tmp:/tmp2' \
#   duohongrui/simpipe
# WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
# Estimating parameters using Splat
# Output is saved to  /var/folders/1l/xmc98tgx0m37wxtbtwnl6h7c0000gn/T//RtmpyDFvJI 
# Attempting to read output into R

Users can also supply a list of multiple datasets and use more than one method in the estimation step. In that case, refer to the next topic, Estimate Parameters By Simpipe Package, and use the corresponding function from the simpipe2docker package instead.

Estimate Parameters By Simpipe Package

In addition to calling functions from the simmethods package, users can also use the estimate_parameters function in the simpipe package. This has two advantages:

  • One dataset can be estimated by multiple methods
  • Multiple datasets can be estimated by multiple methods

One dataset ▶️ Multiple Methods

If you want to estimate parameters from one dataset with several simulation methods, make sure that you already know the prior information every method requires. For example, to estimate parameters using three methods (Splat, zingeR and powsimR), we should browse the documentation of these three methods.

After checking the documentation, we list the necessary prior information and optional custom parameters here:

  • Splat: none (just a gene expression matrix)
  • zingeR: prior information (a numeric vector of cell groups)
  • powsimR: optional parameters (RNAseq, Protocol and Normalisation)

Then we write these parameters in a list:

other_prior <- list(group.condition = as.numeric(simmethods::group_condition),
                    RNAseq = "singlecell",
                    Protocol = "UMI",
                    Normalisation = "scran")
estimation_result <- simpipe::estimate_parameters(
  method = c("Splat", "zingeR", "powsimR"),
  ref_data = ref_data,
  other_prior = other_prior)
# Registered S3 method overwritten by 'gdata':
#   method         from  
#   reorder.factor gplots
# Estimating parameters using powsimR
# Estimating parameters using estimateParam function
# The provided count matrix has 160 out of 160 single cells and 4000 out of 4000 genes with at least 1 count.
# 29 out of 160 single cells were determined to be outliers and removed prior to normalisation.
# 3 genes out of 4000 were deemed unexpressed and removed prior to normalisation.
# Using calculateSumFactors, i.e. deconvolution over all cells!
# Estimating moments.
# Fitting models.
# For 3996 out of 4000 genes, mean, dispersion and dropout could be estimated. 131 out of 160 single cells were used for this.
# Estimating parameters using Splat
# Estimating parameters using zingeR

If the necessary information is not supplied, an error message is raised. Also make sure that the method names are spelled correctly.
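A misspelled method name is easy to catch before starting a long run. The sketch below compares the requested names against a hand-written reference vector; known_methods is a hypothetical list containing only the three methods used in this section, not the package's full set of supported methods.

```r
# Hypothetical reference list: only the methods used in this section.
known_methods <- c("Splat", "zingeR", "powsimR")

# Names we intend to pass to the method argument; "zingR" is a deliberate typo.
requested <- c("Splat", "zingR", "powsimR")

# setdiff() returns the requested names that are not in the reference list.
unknown <- setdiff(requested, known_methods)

if (length(unknown) > 0) {
  message("Check the spelling of: ", paste(unknown, collapse = ", "))
}
```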

The result is a list of three elements, one per method, which indicates that the estimation is complete:

names(estimation_result)
# [1] "refdata_powsimR" "refdata_Splat"   "refdata_zingeR"

Multiple datasets ▶️ Multiple Methods

Multiple datasets can also be estimated by several methods with the estimate_parameters function. In addition to the prior information and optional parameters, the ref_data argument of estimate_parameters must be a named list when multiple datasets are involved.

Here, we first create a data list with custom names (data1 and data2):

data_list <- list(data1 = ref_data,
                  data2 = ref_data)
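Because the list names identify each dataset in the output, it can be worth validating them before a long run. The following is a minimal sketch in base R; toy_list uses two small stand-in matrices rather than real expression data, and the checks are a suggestion, not something the package requires in this form.

```r
# Stand-in datasets: two small matrices instead of real expression data.
toy_list <- list(
  data1 = matrix(1:6, nrow = 2),
  data2 = matrix(7:12, nrow = 2)
)

list_names <- names(toy_list)

# Every dataset needs a non-empty, unique name and must be a matrix.
stopifnot(
  !is.null(list_names),
  all(nzchar(list_names)),
  !anyDuplicated(list_names),
  all(vapply(toy_list, is.matrix, logical(1)))
)
```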

Then, set the prior information:

other_prior <- list(group.condition = as.numeric(simmethods::group_condition),
                    RNAseq = "singlecell",
                    Protocol = "UMI",
                    Normalisation = "scran")

Execute the procedure:

estimation_result <- simpipe::estimate_parameters(
  method = c("Splat", "zingeR", "powsimR"),
  ref_data = data_list,
  other_prior = other_prior)
# Estimating parameters using powsimR
# Estimating parameters using estimateParam function
# The provided count matrix has 160 out of 160 single cells and 4000 out of 4000 genes with at least 1 count.
# 29 out of 160 single cells were determined to be outliers and removed prior to normalisation.
# 3 genes out of 4000 were deemed unexpressed and removed prior to normalisation.
# Using calculateSumFactors, i.e. deconvolution over all cells!
# Estimating moments.
# Fitting models.
# For 3996 out of 4000 genes, mean, dispersion and dropout could be estimated. 131 out of 160 single cells were used for this.
# Estimating parameters using Splat
# Estimating parameters using zingeR
# Estimating parameters using powsimR
# Estimating parameters using estimateParam function
# The provided count matrix has 160 out of 160 single cells and 4000 out of 4000 genes with at least 1 count.
# 29 out of 160 single cells were determined to be outliers and removed prior to normalisation.
# 3 genes out of 4000 were deemed unexpressed and removed prior to normalisation.
# Using calculateSumFactors, i.e. deconvolution over all cells!
# Estimating moments.
# Fitting models.
# For 3996 out of 4000 genes, mean, dispersion and dropout could be estimated. 131 out of 160 single cells were used for this.
# Estimating parameters using Splat
# Estimating parameters using zingeR

We will see a list of six elements in the result:

names(estimation_result)
# [1] "data1_powsimR" "data1_Splat"   "data1_zingeR"  "data2_powsimR"
# [5] "data2_Splat"   "data2_zingeR"
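Since each element is named after its dataset and method, the results for one dataset or one method can be picked out by name. The sketch below uses a stand-in list (toy_result) that carries the six names printed above but only placeholder contents.

```r
# Stand-in for the real result: same names as above, placeholder contents.
result_names <- c("data1_powsimR", "data1_Splat", "data1_zingeR",
                  "data2_powsimR", "data2_Splat", "data2_zingeR")
toy_result <- setNames(vector("list", length(result_names)), result_names)

# All results for one dataset: names starting with "data1_".
data1_results <- toy_result[startsWith(names(toy_result), "data1_")]

# All results for one method: names ending with "_Splat".
splat_results <- toy_result[endsWith(names(toy_result), "_Splat")]

names(data1_results)
names(splat_results)
```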

All of the above steps can also be run in a Docker container; only the function name changes, to estimate_parameters_container in the simpipe2docker package.