Lun

Here Lun method will be demonstrated clearly and hope that this document can help you.

Estimating parameters from a real dataset

Before simulating datasets, it is important to estimate some essential parameters from a real dataset in order to make the simulated data more real. If you do not have a single-cell transcriptomics count matrix now, you can use the data collected in simmethods package by simmethods:data command.

library(simmethods)
library(SingleCellExperiment)
# Load data
ref_data <- simmethods::data
dim(ref_data)
# [1] 4000  160

Using simmethods::Lun_estimation command to execute the estimation step.

estimate_result <- simmethods::Lun_estimation(ref_data = ref_data,
                                              verbose = T,
                                              seed = 10)
# Estimating parameters using Lun

Simulating datasets using Lun

After estimating parameter from a real dataset, we will simulate a dataset based on the learned parameters with different scenarios.

  1. Datasets with default parameters
  2. Determin the number of cells and genes
  3. Simulate two or more groups

Datasets with default parameters

The reference data contains 160 cells and 4000 genes, if we simulate datasets with default parameters and then we will obtain a new data which has the same size as the reference data. In addtion, the simulated dataset will have one group of cells.

simulate_result <- simmethods::Lun_simulation(
  parameters = estimate_result[["estimate_result"]],
  return_format = "SCE",
  seed = 111
)
# nCells: 160
# nGenes: 4000
# nGroups: 1
# de.prob: 0.25
# fc.up.group: 5
# fc.down.group: 0
SCE_result <- simulate_result[["simulate_result"]]
dim(SCE_result)
# [1] 4000  160
head(colData(SCE_result))
# DataFrame with 6 rows and 2 columns
#         cell_name       group
#       <character> <character>
# Cell1       Cell1      Group1
# Cell2       Cell2      Group1
# Cell3       Cell3      Group1
# Cell4       Cell4      Group1
# Cell5       Cell5      Group1
# Cell6       Cell6      Group1
head(rowData(SCE_result))
# DataFrame with 6 rows and 1 column
#         gene_name
#       <character>
# Gene1       Gene1
# Gene2       Gene2
# Gene3       Gene3
# Gene4       Gene4
# Gene5       Gene5
# Gene6       Gene6

Determin the number of cells and genes

In Lun, we can not set nCells directly and should set groupCells instead. For example, if we want to simulate 500 cells, we can type other_prior = list(groupCells = 500). For genes, we can just set nGenes. Here, we simulate a new dataset with 500 cells and 2000 genes:

simulate_result <- simmethods::Lun_simulation(
  parameters = estimate_result[["estimate_result"]],
  return_format = "list",
  other_prior = list(groupCells = 500,
                     nGenes = 2000),
  seed = 111
)
# nCells: 500
# nGenes: 2000
# nGroups: 1
# de.prob: 0.5
# fc.up.group: 5
# fc.down.group: 0
result <- simulate_result[["simulate_result"]][["count_data"]]
dim(result)
# [1] 2000  500

Simulate two or more groups

In Lun, we can not set nGroups directly and should set prob.group instead. For example, if we want to simulate 2 groups, we can type other_prior = list(prob.group = c(0.5, 0.5)). Note that the sum of prob.group numeric vector must equal to 1, so we can also set prob.group = c(0.3, 0.7).

In addtion, if we want to simulate three or more groups, we should obey the rules:

  • The length of prob.group vector must always equal to the number of groups.
  • The sum of prob.group numeric vector must equal to 1.

For demonstration, we will simulate three groups using the learned parameters. We can set de.prob = 0.2 to simulate 20% genes as DEGs.

simulate_result <- simmethods::Lun_simulation(
  parameters = estimate_result[["estimate_result"]],
  return_format = "list",
  other_prior = list(groupCells = 1000,
                     nGenes = 3000,
                     prob.group = c(0.1, 0.3, 0.6),
                     de.prob = 0.2),
  seed = 111
)
# nCells: 1000
# nGenes: 3000
# nGroups: 3
# de.prob: 0.2
# fc.up.group: 5
# fc.down.group: 0

If you encounter the error which is like Warning: NAs producedError in [[<-.data.frame (tmp, paste0(“DEFacGroup”, idx), value = c(5, :**, please set a higher gene number and try again.

result <- simulate_result[["simulate_result"]][["count_data"]]
dim(result)
# [1] 3000 1000
## cell information
cell_info <- simulate_result[["simulate_result"]][["col_meta"]]
table(cell_info$group)
# 
# Group1 Group2 Group3 
#    100    300    600
## gene information
gene_info <- simulate_result[["simulate_result"]][["row_meta"]]
### the proportion of DEGs
table(gene_info$de_gene)[2]/nrow(result) ## de.prob = 0.2
# yes 
# 0.2

We can see that the proportion of differential expressed genes is 0.2 (default is 1). Next, if we want to know the fold change between two groups, we will do division with the groups that we are interested in.

## fc between group2 and group1
fc_group1_to_group2 <- gene_info$DEFacGroup2/gene_info$DEFacGroup1
## fc between group3 and group1
fc_group1_to_group3 <- gene_info$DEFacGroup3/gene_info$DEFacGroup1
## fc between group3 and group2
fc_group2_to_group3 <- gene_info$DEFacGroup3/gene_info$DEFacGroup2

In addtion, users can also specify the foldchange of up-regulated or down-regulated DEGs by fc.up.group or fc.down.group.

simulate_result <- simmethods::Lun_simulation(parameters = estimate_result[["estimate_result"]],
                                              other_prior = list(prob.group = c(0.4, 0.6),
                                                                 de.prob = 0.2,
                                                                 fc.up.group = 2,
                                                                 fc.down.group = 0.5),
                                              return_format = "list",
                                              verbose = TRUE,
                                              seed = 111)
# nCells: 160
# nGenes: 4000
# nGroups: 2
# de.prob: 0.2
# fc.up.group: 2
# fc.down.group: 0.5
# Simulating datasets using Lun
# Getting parameters...
# Simulating means...
# Simulating cell means...
# Simulating counts...
# Creating final dataset...
# Sparsifying assays...
# Automatically converting to sparse matrices, threshold = 0.95
# Converting 'counts' to sparse matrix: estimated sparse size 0.82 * dense matrix
# Skipping 'CellMeans': estimated sparse size 1.5 * dense matrix
# Done!
row_data <- simulate_result[["simulate_result"]][["row_meta"]]
### fc.up.group
max(row_data$DEFacGroup1)
# [1] 2
### fc.down.group
min(row_data$DEFacGroup1)
# [1] 0.5