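Estimating essential parameters from real datasets is a necessary step before simulating a new dataset. In this vignette, we demonstrate three ways of performing the estimation step: calling the estimation functions in simmethods directly, running the estimation in a Docker container through simpipe2docker, and using the estimate_parameters function in simpipe.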
Make sure that you have already installed the three main packages (simutils, simmethods and simpipe) before running the estimation. If not, please refer to Installation.
First, load the packages:
library(simmethods)
library(simpipe)
Estimation requires a real (reference) dataset as input. You can load the example dataset in simmethods by:
ref_data <- simmethods::data
The gene expression profile should be a dense matrix, not a sparse matrix or a data frame.
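If your counts arrive as a data frame or a sparse matrix, they can be converted before estimation. The following is a minimal sketch using base R and the Matrix package; the toy table and all object names are made up for illustration and are not part of simmethods.

```r
library(Matrix)

# A small hypothetical counts table: genes in rows, cells in columns
counts_df <- data.frame(
  cell1 = c(0, 3, 1),
  cell2 = c(2, 0, 0),
  row.names = c("geneA", "geneB", "geneC")
)

# Data frame -> dense matrix
dense_from_df <- as.matrix(counts_df)

# Build a sparse matrix from the same values, then convert it back to dense
sparse_counts <- Matrix(dense_from_df, sparse = TRUE)
dense_from_sparse <- as.matrix(sparse_counts)

# Both objects are now ordinary base R matrices
is.matrix(dense_from_df)
is.matrix(dense_from_sparse)
```

Converting a large sparse matrix to a dense one can use considerably more memory, so it may be worth checking the matrix dimensions first.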
Prior information about cell groups is also provided in the simmethods package, and we will demonstrate how to use it later.
group_information <- simmethods::group_condition
Splat is one of the methods in the Splatter package, and it needs only the reference (real) data to learn the useful parameters. We can call the Splat_estimation function directly to do so.
estimation_result <- simmethods::Splat_estimation(
ref_data = ref_data,
seed = 111)
The result is a list containing two types of information:
estimate_result, the parameters learned by Splat.
estimate_detection, the running time and memory usage measured by the peakRAM package.
The zingeR method requires information about cell groups. We can prepare a numeric vector that specifies the group identity of every cell in the expression matrix.
group_information <- as.numeric(simmethods::group_condition)
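Note that as.numeric() applied to a factor returns the integer codes of its levels, so the numeric labels follow the level order rather than the original label text. Below is a small sketch with made-up labels (that simmethods::group_condition is stored as a factor is an assumption here); it also checks that there is one label per cell.

```r
# Hypothetical group labels for six cells
cell_labels <- factor(c("A", "B", "A", "C", "B", "A"))

# Level codes: A -> 1, B -> 2, C -> 3
group_vector <- as.numeric(cell_labels)

# Toy count matrix with one column per cell
toy_counts <- matrix(0, nrow = 3, ncol = 6)

# The label vector must have exactly one entry per cell
stopifnot(length(group_vector) == ncol(toy_counts))

group_vector
# [1] 1 2 1 3 2 1
```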
The other_prior parameter accepts a list of other prior information, including:
group.condition, the numeric vector of cell group labels.
batch.condition, the numeric vector of cell batch labels.
After preparing the dataset and prior information, we can use zingeR to estimate parameters.
estimation_result <- simmethods::zingeR_estimation(
ref_data = ref_data,
other_prior = list(group.condition = group_information),
seed = 111,
verbose = TRUE)
# Estimating parameters using zingeR
Sometimes users may want to know which parameters a method exposes and what kinds of prior information it needs. Usually, users can browse the help page with help(function_name) or ?function_name.
For example, to see the parameters of the SPsimSeq method, we can call help(SPsimSeq_simulation).
help(SPsimSeq_simulation)
The help page details the prior information that the method requires and the parameters that users commonly set.
# Details
# In addtion to simulate datasets with default parameters, users want to simulate other kinds of datasets, e.g. a counts matrix with 2 or more cell groups. In SPsimSeq, you can set extra parameters to simulate datasets.
#
# The customed parameters you can set are below:
#
# nCells. In SPsimSeq, you can set nCells directly. For example, if you want to simulate 1000 cells, you can type other_prior = list(nCells = 1000).
#
# nGenes. You can directly set other_prior = list(nGenes = 5000) to simulate 5000 genes.
#
# group.condition. You can input cell group information as an integer vector to specify which group that each cell belongs to. See Examples.
#
# de.prob. You can directly set other_prior = list(de.prob = 0.2) to simulate DEGs that account for 20 percent of all genes.
#
# fc.group. You can directly set other_prior = list(fc.group = 2) to specify the minimum fold change of DEGs.
#
# batch.condition. You can input cell batch information as an integer vector to specify which batch that each cell belongs to. See Examples.
Default parameters are provided for only some of the methods.
To get the default parameters of SCRIP, run:
SCRIP_param <- simutils::default_parameters("SCRIP")
The default parameter object can be used directly in the simulation step.
Estimating parameters with a Docker container from R works the same way as demonstrated above; only the function and the R package differ.
First, start the Docker service and check the installation:
library(simpipe2docker)
simpipe2docker::test_docker_installation(detailed = TRUE)
# ✔ Docker is installed
# ✔ Docker daemon is running
# ✔ Docker is at correct version (>1.0): 1.41
# ✔ Docker is in linux mode
# ✔ Docker can pull images
# ✔ Docker can run image
# ✔ Docker can mount temporary volumes
# ✔ Docker test successful -----------------------------------------------------------------
# [1] TRUE
Next, prepare your data and prior information:
data <- simmethods::data
group_condition <- as.numeric(simmethods::group_condition)
Estimate parameters with the Splat method:
estimation_result <- simpipe2docker::estimate_parameters_container(
ref_data = data,
method = "Splat",
other_prior = list(group.condition = group_condition),
seed = 111,
verbose = TRUE
)
# Learning parameters from data 1
# Running /usr/local/bin/docker run --name \
# 20230510_151550__container__wjYWE9mSHC -e 'TMPDIR=/tmp2' --workdir \
# /home/admin/ -v \
# '/var/folders/1l/xmc98tgx0m37wxtbtwnl6h7c0000gn/T//RtmpyDFvJI:/home/admin/docker_path' \
# -v \
# '/tmp/folders/1l/xmc98tgx0m37wxtbtwnl6h7c0000gn/T//RtmpyDFvJI/file414b397e13a0/tmp:/tmp2' \
# duohongrui/simpipe
# WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
# Estimating parameters using Splat
# Output is saved to /var/folders/1l/xmc98tgx0m37wxtbtwnl6h7c0000gn/T//RtmpyDFvJI
# Attempting to read output into R
Users can also pass a list of multiple datasets and use more than one method to perform the estimation step. In this case, refer to the next topic, Estimate Parameters By Simpipe Package, and just switch to the corresponding function in the simpipe2docker package.
In addition to calling functions from the simmethods package, users can also use the estimate_parameters function in the simpipe package. There are some advantages:
If you want to estimate parameters from one dataset with several simulation methods, please make sure that you already know the prior information that every method requires. For example, if we want to estimate parameters using three methods (Splat, zingeR and powsimR), we should browse the vignettes of these three methods.
After checking the vignettes, we list the necessary prior information and optional custom parameters here:
Then we write these parameters in a list:
other_prior = list(group.condition = as.numeric(group_condition),
RNAseq = "singlecell",
Protocol = "UMI",
Normalisation = "scran")
estimation_result <- simpipe::estimate_parameters(
method = c("Splat", "zingeR", "powsimR"),
ref_data = ref_data,
other_prior = other_prior)
# Registered S3 method overwritten by 'gdata':
# method from
# reorder.factor gplots
# Estimating parameters using powsimR
# Estimating parameters using estimateParam function
# The provided count matrix has 160 out of 160 single cells and 4000 out of 4000 genes with at least 1 count.
# 29 out of 160 single cells were determined to be outliers and removed prior to normalisation.
# 3 genes out of 4000 were deemed unexpressed and removed prior to normalisation.
# Using calculateSumFactors, i.e. deconvolution over all cells!
# Estimating moments.
# Fitting models.
# For 3996 out of 4000 genes, mean, dispersion and dropout could be estimated. 131 out of 160 single cells were used for this.
# Estimating parameters using Splat
# Estimating parameters using zingeR
If the necessary information is not provided, an error message is raised. Also make sure that the method names are spelled correctly.
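Because a misspelled method name only surfaces as an error at run time, it can help to compare the requested names against the ones you intend to use beforehand. This is a small base R sketch; the known_methods vector is a hand-written assumption for this example, not a list exported by simpipe.

```r
# Hand-maintained list of the method names we intend to use (an assumption)
known_methods <- c("Splat", "zingeR", "powsimR")

# Names as typed by the user; "zingR" is a deliberate typo
requested <- c("Splat", "zingR", "powsimR")

# Any requested name that is not in the known list
unknown <- setdiff(requested, known_methods)

if (length(unknown) > 0) {
  message("Unrecognised method name(s): ", paste(unknown, collapse = ", "))
}
```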
The result is a list of three elements, which means the estimation is done:
names(estimation_result)
# [1] "refdata_powsimR" "refdata_Splat" "refdata_zingeR"
Multiple datasets can also be estimated with several methods using the estimate_parameters function. Besides the prior information and optional parameters, the ref_data parameter of estimate_parameters should be a named list when multiple datasets are involved.
Here, we first create a data list with custom names (data1 and data2):
data_list <- list(data1 = ref_data,
data2 = ref_data)
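Since the list names end up as prefixes of the result names, it is worth checking that every dataset is named and that the names are unique. Here is a minimal sketch with a hypothetical helper, check_named_list, and toy matrices standing in for real data:

```r
# Hypothetical helper: TRUE if x is a list of matrices with unique, non-empty names
check_named_list <- function(x) {
  nms <- names(x)
  is.list(x) &&
    !is.null(nms) &&
    all(nzchar(nms)) &&
    anyDuplicated(nms) == 0 &&
    all(vapply(x, is.matrix, logical(1)))
}

# Toy count matrix standing in for a real dataset
set.seed(111)
toy <- matrix(rpois(20, lambda = 5), nrow = 4,
              dimnames = list(paste0("gene", 1:4), paste0("cell", 1:5)))

toy_list <- list(data1 = toy, data2 = toy)

check_named_list(toy_list)             # properly named list
check_named_list(list(toy, toy))       # names missing
check_named_list(list(a = toy, a = toy)) # duplicated names
```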
Then, set the prior information:
other_prior = list(group.condition = as.numeric(group_condition),
RNAseq = "singlecell",
Protocol = "UMI",
Normalisation = "scran")
Execute the procedure:
estimation_result <- simpipe::estimate_parameters(
method = c("Splat", "zingeR", "powsimR"),
ref_data = data_list,
other_prior = other_prior)
# Estimating parameters using powsimR
# Estimating parameters using estimateParam function
# The provided count matrix has 160 out of 160 single cells and 4000 out of 4000 genes with at least 1 count.
# 29 out of 160 single cells were determined to be outliers and removed prior to normalisation.
# 3 genes out of 4000 were deemed unexpressed and removed prior to normalisation.
# Using calculateSumFactors, i.e. deconvolution over all cells!
# Estimating moments.
# Fitting models.
# For 3996 out of 4000 genes, mean, dispersion and dropout could be estimated. 131 out of 160 single cells were used for this.
# Estimating parameters using Splat
# Estimating parameters using zingeR
# Estimating parameters using powsimR
# Estimating parameters using estimateParam function
# The provided count matrix has 160 out of 160 single cells and 4000 out of 4000 genes with at least 1 count.
# 29 out of 160 single cells were determined to be outliers and removed prior to normalisation.
# 3 genes out of 4000 were deemed unexpressed and removed prior to normalisation.
# Using calculateSumFactors, i.e. deconvolution over all cells!
# Estimating moments.
# Fitting models.
# For 3996 out of 4000 genes, mean, dispersion and dropout could be estimated. 131 out of 160 single cells were used for this.
# Estimating parameters using Splat
# Estimating parameters using zingeR
We will see a list of six elements in the result:
names(estimation_result)
# [1] "data1_powsimR" "data1_Splat" "data1_zingeR" "data2_powsimR"
# [5] "data2_Splat" "data2_zingeR"
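The result names combine the dataset name and the method name with an underscore, so they can be split to look up results by dataset or by method. The sketch below uses base R only and assumes that method names contain no underscore (dataset names may); the object names are illustrative.

```r
# Result names as printed above
result_names <- c("data1_powsimR", "data1_Splat", "data1_zingeR",
                  "data2_powsimR", "data2_Splat", "data2_zingeR")

# Everything before the last underscore is the dataset, the rest is the method
dataset <- sub("_[^_]+$", "", result_names)
method <- sub("^.*_", "", result_names)

result_index <- data.frame(name = result_names,
                           dataset = dataset,
                           method = method,
                           stringsAsFactors = FALSE)

# Names of all Splat results across datasets
splat_names <- result_index$name[result_index$method == "Splat"]
splat_names
# [1] "data1_Splat" "data2_Splat"
```

The selected names can then be used to subset the result list, for example estimation_result[splat_names].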
All of the above steps can also be run in a Docker container; only the function needs to change, to estimate_parameters_container in the simpipe2docker package.