Data Assessment

After obtaining the simulated datasets and corresponding information of cells and genes, we can assess the simulation output with the cell or gene information. For example, if we want to know how reliable the simulated groups of cells are, or how accurate the differential expressed genes between cell groups are, we can adopt the evaluation procedure implemented in the simpipe R package.

There are four aspects related to the simulation output that can be assessed:

  1. Cell groups
  2. DEGs between cell groups
  3. Cell batches
  4. Cellular differentiation trajectory

In this chapter, we will demonstrate how to assess a simulated gene expression data on these four aspects using the simpipe R package.

Load the reference data, and perform the parameter estimation.

library(simmethods)
library(SingleCellExperiment)
# Load data
ref_data <- simmethods::data
estimate_result <- simmethods::Splat_estimation(ref_data = ref_data,
                                                verbose = T,
                                                seed = 10)
# Estimating parameters using Splat

Evaluate the simulated cell groups

Simulate the dataset with cell groups

Initially, we should simulate a dataset with four cell groups using Splat method:

simulate_result <- simmethods::Splat_simulation(
  parameters = estimate_result[["estimate_result"]],
  return_format = "list",
  other_prior = list(batchCells = 1000,
                     nGenes = 1000,
                     prob.group = c(0.2, 0.3, 0.1, 0.4)),
  seed = 111
)
# nCells: 1000
# nGenes: 1000
# nGroups: 4
# de.prob: 0.1
# nBatches: 1

Prepare the labels of cell groups

### cell metadata
cell_metadata <- simulate_result$simulate_result$col_meta
cell_groups <- cell_metadata$group

Assess the simulated cell groups

cell_group_assess <- simpipe::data_functionality_summary(
  data = simulate_result$simulate_result$count_data,
  group = cell_groups
)
# The group information is input...
# -------------------------------------------------
#              Evaluating cell groups
# -------------------------------------------------
# 1-Calculating CDI...
# 2-Calculating ROUGE...
# Warning in simutils::calculate_cluster_properties(data = group_data, dist = NULL, : The ROUGE calculation failed
# 3-Calculating silhouette...
# 4-Calculating dunn...
# 5-Calculating connectivity...
# 6-Calculating DB index...
str(cell_group_assess)
# List of 4
#  $ group_evaluation     :List of 1
#   ..$ data:List of 6
#   .. ..$ CDI         : num 5788015
#   .. ..$ ROUGE       : logi NA
#   .. ..$ silhouette  : num -0.212
#   .. ..$ dunn        : num 4.41e-07
#   .. ..$ connectivity: num 1055
#   .. ..$ DB_index    : num 35.1
#  $ batch_evaluation     : NULL
#  $ DEGs_evaluation      : NULL
#  $ trajectory_evaluation: NULL

Extract the evaluation result of cell groups:

assess_group_result <- cell_group_assess$group_evaluation
assess_group_result
# $data
# $data$CDI
# [1] 5788015
# 
# $data$ROUGE
# [1] NA
# 
# $data$silhouette
# [1] -0.2119363
# 
# $data$dunn
# [1] 4.413479e-07
# 
# $data$connectivity
# [1] 1055.473
# 
# $data$DB_index
# [1] 35.06495

As you see, there are six metrics used for assessing the quality of cell groups:

CDI: Clustering Deviation Index. If the CDI score is small, the simulated label set is close to the true label set and more reliable.

ROUGE: ROUGE is an entropy-based metric, enabling accurate, sensitive and robust assessment of cell cluster purity with the concept of quantifying the randomness of gene expression in cells. A cell population with higher purity will receive a value close to 1 and reversely close to 0.

silhouette: Average silouette width (ASW). The silouette width is a common internal metric to measure the extent of within-cluster distances of a cell and between-cluster distances of that cell to the closest cluster. The average silouette width (ASW) ranges from -1 to1.

dunn: Dunn Index. Dunn Index is a clustering validity index by calculating the ratio of the smallest distance between cells which are not in the same cluster to the largest distance of cells in the same cluster. The higher the index, the better the clustering.

connectivity: The connectivity metric is designed to measure the connectedness degree of the partitioning clusters, which can be also used to assess the clustering performance.

DB_index: Davies-Bouldin Index (DBI). DBI is a measurement of similarity between intra-cluster dispersion and separation of inter-clusters. The smaller the DBI is, the better the clustering.

Evaluate the simulated DEGs

Simulate the dataset with specified proportion of DEGs

Firstly, we will simulate a dataset with two cell groups with 40% DEGs:

simulate_result <- simmethods::Splat_simulation(
  parameters = estimate_result[["estimate_result"]],
  return_format = "list",
  other_prior = list(batchCells = 1000,
                     nGenes = 1000,
                     prob.group = c(0.4, 0.6),
                     de.prob = 0.4),
  seed = 111
)
# nCells: 1000
# nGenes: 1000
# nGroups: 2
# de.prob: 0.4
# nBatches: 1

Then, check the information of simulated DEGs

gene_metadata <- simulate_result$simulate_result$row_meta
table(gene_metadata$de_gene)/length(gene_metadata$de_gene)
# 
#    no   yes 
# 0.646 0.354

Prepare the information of DEGs

There are some essential information related to the DEGs to prepare:

  1. DEGs: A list of DEGs with the names of xxxvsxxx. Note that the names of DEGs are in the rownames of the matrix or the dataframe and the names of xxx is in the group characters. Needed if a matrix or a dataframe is input.

  2. DEA_method: The DEA method to get the DEGs. Choices: edgeRQLF, edgeRQLFDetRate, MASTcpmDetRate, MASTtpmDetRate, MASTcpm, MASTtpm, limmatrend, limmavoom, ttest and wilcox. Default is edgeRQLFDetRate.

  3. model_method: The method to establish the model using DEGs. SVM, Decision tree or RF (Random Forest). Default is SVM.

Now we prepare the list of DEGs between two cell groups:

col_meta <- simulate_result$simulate_result$col_meta
row_meta <- simulate_result$simulate_result$row_meta
DEGs_list <- simpipe::prepare_DEGs_list(col_meta, row_meta)
# Group2vsGroup1
str(DEGs_list)
# List of 1
#  $ Group2vsGroup1: chr [1:354] "Gene3" "Gene5" "Gene11" "Gene14" ...

Then, we can chose the method for differential expression analysis and the machine learning method for modeling:

DEA_method <- "edgeRQLFDetRate"
model_method <- "SVM"

Assess the validity of the simulated DEGs

assess_DEGs <- simpipe::data_functionality_summary(
  data = simulate_result$simulate_result$count_data,
  group = col_meta$group,
  DEGs = DEGs_list,
  DEA_method = DEA_method,
  model_method = model_method
)
# The DEGs information is input...
# The group information is input...
# -------------------------------------------------
#                   Evaluating DEGs
# -------------------------------------------------
# 1--Distribution of null data...
# 0 genes are removed when filtering
# --------------------------------------------------
# Performing DEA with the 1/1 paired of groups.
# Performing DEA using edgeR QLF model including the cellular detection rate
# 2--True proportions of DEGs...
# 0 genes are removed when filtering
# --------------------------------------------------
# Performing DEA with the 1/1 paired of groups.
# Performing DEA using edgeR QLF model including the cellular detection rate
# 3--Modeling using DEGs...
# Preprocessing data...
# Modeling by SVM...
# Predicting...
# Setting levels: control = Group1, case = Group2
# Setting direction: controls < cases
# -------------------------------------------------
#              Evaluating cell groups
# -------------------------------------------------
# 1-Calculating CDI...
# 2-Calculating ROUGE...
# Warning in simutils::calculate_cluster_properties(data = group_data, dist = NULL, : The ROUGE calculation failed
# 3-Calculating silhouette...
# 4-Calculating dunn...
# 5-Calculating connectivity...
# 6-Calculating DB index...

Show the structure of the result:

str(assess_DEGs$DEGs_evaluation)
# List of 1
#  $ data:List of 3
#   ..$ distribution_score:List of 2
#   .. ..$ DEA_result_combinations:List of 1
#   .. .. ..$ Group2vsGroup1:'data.frame':	646 obs. of  5 variables:
#   .. .. .. ..$ logFC : num [1:646] -0.223 -0.199 0.21 -0.138 0.134 ...
#   .. .. .. ..$ logCPM: num [1:646] 8.17 8.31 7.88 9.6 9.58 ...
#   .. .. .. ..$ F     : num [1:646] 10.9 9.08 8.66 7.35 6.1 ...
#   .. .. .. ..$ PValue: num [1:646] 0.000995 0.002641 0.003324 0.006807 0.013677 ...
#   .. .. .. ..$ FDR   : num [1:646] 0.643 0.716 0.716 0.998 0.998 ...
#   .. ..$ distribution_score     : num 1
#   ..$ true_proportions  :List of 2
#   .. ..$ DEA_result_combinations:List of 5
#   .. .. ..$ true_prop         :List of 1
#   .. .. .. ..$ Group2vsGroup1: num 0.853
#   .. .. ..$ DEA               :List of 1
#   .. .. .. ..$ Group2vsGroup1:List of 1
#   .. .. .. .. ..$ edgeRQLFDetRate:List of 2
#   .. .. .. .. .. ..$ de_genes  : chr [1:330] "Gene529" "Gene552" "Gene758" "Gene569" ...
#   .. .. .. .. .. ..$ DEA_result:'data.frame':	1000 obs. of  5 variables:
#   .. .. .. .. .. .. ..$ logFC : num [1:1000] 1.32 -1.91 1.67 0.97 1.48 ...
#   .. .. .. .. .. .. ..$ logCPM: num [1:1000] 13.79 10.93 9.92 13.93 10.26 ...
#   .. .. .. .. .. .. ..$ F     : num [1:1000] 2451 2095 1061 1037 1007 ...
#   .. .. .. .. .. .. ..$ PValue: num [1:1000] 7.38e-282 1.11e-256 2.44e-163 1.03e-160 2.46e-157 ...
#   .. .. .. .. .. .. ..$ FDR   : num [1:1000] 7.38e-279 5.53e-254 8.14e-161 2.57e-158 4.92e-155 ...
#   .. .. ..$ DEGs_num          : int 354
#   .. .. ..$ DEGs_total        : int 354
#   .. .. ..$ weighted_true_prop: num 0.853
#   .. ..$ true_proportion        : num 0.853
#   ..$ SVM_result        :List of 6
#   .. ..$ SVM_result:List of 2
#   .. .. ..$ conf_matrix:List of 6
#   .. .. .. ..$ positive: chr "Group1"
#   .. .. .. ..$ table   : 'table' int [1:2, 1:2] 78 1 1 120
#   .. .. .. .. ..- attr(*, "dimnames")=List of 2
#   .. .. .. .. .. ..$ Prediction: chr [1:2] "Group1" "Group2"
#   .. .. .. .. .. ..$ Reference : chr [1:2] "Group1" "Group2"
#   .. .. .. ..$ overall : Named num [1:7] 0.99 0.979 0.964 0.999 0.605 ...
#   .. .. .. .. ..- attr(*, "names")= chr [1:7] "Accuracy" "Kappa" "AccuracyLower" "AccuracyUpper" ...
#   .. .. .. ..$ byClass : Named num [1:11] 0.987 0.992 0.987 0.992 0.987 ...
#   .. .. .. .. ..- attr(*, "names")= chr [1:11] "Sensitivity" "Specificity" "Pos Pred Value" "Neg Pred Value" ...
#   .. .. .. ..$ mode    : chr "everything"
#   .. .. .. ..$ dots    : list()
#   .. .. .. ..- attr(*, "class")= chr "confusionMatrix"
#   .. .. ..$ roc        :List of 15
#   .. .. .. ..$ percent           : logi FALSE
#   .. .. .. ..$ sensitivities     : num [1:201] 1 1 1 1 1 1 1 1 1 1 ...
#   .. .. .. ..$ specificities     : num [1:201] 0 0.0127 0.0253 0.038 0.0506 ...
#   .. .. .. ..$ thresholds        : num [1:201] -Inf 5.54e-05 8.97e-05 1.07e-04 1.18e-04 ...
#   .. .. .. ..$ direction         : chr "<"
#   .. .. .. ..$ cases             : Named num [1:121] 0.999 0.994 0.999 1 0.999 ...
#   .. .. .. .. ..- attr(*, "names")= chr [1:121] "Cell7" "Cell8" "Cell26" "Cell31" ...
#   .. .. .. ..$ controls          : Named num [1:79] 0.011249 0.000637 0.002514 0.003981 0.000594 ...
#   .. .. .. .. ..- attr(*, "names")= chr [1:79] "Cell29" "Cell36" "Cell43" "Cell49" ...
#   .. .. .. ..$ fun.sesp          :function (thresholds, controls, cases, direction)  
#   .. .. .. ..$ auc               : 'auc' num 0.999
#   .. .. .. .. ..- attr(*, "partial.auc")= logi FALSE
#   .. .. .. .. ..- attr(*, "percent")= logi FALSE
#   .. .. .. .. ..- attr(*, "roc")=List of 15
#   .. .. .. .. .. ..$ percent           : logi FALSE
#   .. .. .. .. .. ..$ sensitivities     : num [1:201] 1 1 1 1 1 1 1 1 1 1 ...
#   .. .. .. .. .. ..$ specificities     : num [1:201] 0 0.0127 0.0253 0.038 0.0506 ...
#   .. .. .. .. .. ..$ thresholds        : num [1:201] -Inf 5.54e-05 8.97e-05 1.07e-04 1.18e-04 ...
#   .. .. .. .. .. ..$ direction         : chr "<"
#   .. .. .. .. .. ..$ cases             : Named num [1:121] 0.999 0.994 0.999 1 0.999 ...
#   .. .. .. .. .. .. ..- attr(*, "names")= chr [1:121] "Cell7" "Cell8" "Cell26" "Cell31" ...
#   .. .. .. .. .. ..$ controls          : Named num [1:79] 0.011249 0.000637 0.002514 0.003981 0.000594 ...
#   .. .. .. .. .. .. ..- attr(*, "names")= chr [1:79] "Cell29" "Cell36" "Cell43" "Cell49" ...
#   .. .. .. .. .. ..$ fun.sesp          :function (thresholds, controls, cases, direction)  
#   .. .. .. .. .. ..$ auc               : 'auc' num 0.999
#   .. .. .. .. .. .. ..- attr(*, "partial.auc")= logi FALSE
#   .. .. .. .. .. .. ..- attr(*, "percent")= logi FALSE
#   .. .. .. .. .. .. ..- attr(*, "roc")=List of 8
#   .. .. .. .. .. .. .. ..$ percent      : logi FALSE
#   .. .. .. .. .. .. .. ..$ sensitivities: num [1:201] 1 1 1 1 1 1 1 1 1 1 ...
#   .. .. .. .. .. .. .. ..$ specificities: num [1:201] 0 0.0127 0.0253 0.038 0.0506 ...
#   .. .. .. .. .. .. .. ..$ thresholds   : num [1:201] -Inf 5.54e-05 8.97e-05 1.07e-04 1.18e-04 ...
#   .. .. .. .. .. .. .. ..$ direction    : chr "<"
#   .. .. .. .. .. .. .. ..$ cases        : Named num [1:121] 0.999 0.994 0.999 1 0.999 ...
#   .. .. .. .. .. .. .. .. ..- attr(*, "names")= chr [1:121] "Cell7" "Cell8" "Cell26" "Cell31" ...
#   .. .. .. .. .. .. .. ..$ controls     : Named num [1:79] 0.011249 0.000637 0.002514 0.003981 0.000594 ...
#   .. .. .. .. .. .. .. .. ..- attr(*, "names")= chr [1:79] "Cell29" "Cell36" "Cell43" "Cell49" ...
#   .. .. .. .. .. .. .. ..$ fun.sesp     :function (thresholds, controls, cases, direction)  
#   .. .. .. .. .. .. .. ..- attr(*, "class")= chr "roc"
#   .. .. .. .. .. ..$ call              : language roc.default(response = test_group, predictor = attr(predict_prob, "probabilities")[,      1])
#   .. .. .. .. .. ..$ original.predictor: Named num [1:200] 0.9991 0.9937 0.999 0.0112 0.9998 ...
#   .. .. .. .. .. .. ..- attr(*, "names")= chr [1:200] "Cell7" "Cell8" "Cell26" "Cell29" ...
#   .. .. .. .. .. ..$ original.response : Factor w/ 2 levels "Group1","Group2": 2 2 2 1 2 2 1 1 1 2 ...
#   .. .. .. .. .. ..$ predictor         : Named num [1:200] 0.9991 0.9937 0.999 0.0112 0.9998 ...
#   .. .. .. .. .. .. ..- attr(*, "names")= chr [1:200] "Cell7" "Cell8" "Cell26" "Cell29" ...
#   .. .. .. .. .. ..$ response          : Factor w/ 2 levels "Group1","Group2": 2 2 2 1 2 2 1 1 1 2 ...
#   .. .. .. .. .. ..$ levels            : chr [1:2] "Group1" "Group2"
#   .. .. .. .. .. ..- attr(*, "class")= chr "roc"
#   .. .. .. ..$ call              : language roc.default(response = test_group, predictor = attr(predict_prob, "probabilities")[,      1])
#   .. .. .. ..$ original.predictor: Named num [1:200] 0.9991 0.9937 0.999 0.0112 0.9998 ...
#   .. .. .. .. ..- attr(*, "names")= chr [1:200] "Cell7" "Cell8" "Cell26" "Cell29" ...
#   .. .. .. ..$ original.response : Factor w/ 2 levels "Group1","Group2": 2 2 2 1 2 2 1 1 1 2 ...
#   .. .. .. ..$ predictor         : Named num [1:200] 0.9991 0.9937 0.999 0.0112 0.9998 ...
#   .. .. .. .. ..- attr(*, "names")= chr [1:200] "Cell7" "Cell8" "Cell26" "Cell29" ...
#   .. .. .. ..$ response          : Factor w/ 2 levels "Group1","Group2": 2 2 2 1 2 2 1 1 1 2 ...
#   .. .. .. ..$ levels            : chr [1:2] "Group1" "Group2"
#   .. .. .. ..- attr(*, "class")= chr "roc"
#   .. ..$ AUC       : num 0.999
#   .. ..$ Accuracy  : num 0.99
#   .. ..$ Precision : num 0.987
#   .. ..$ Recall    : num 0.987
#   .. ..$ F1        : num 0.987

We determine three approaches to evaluate the validity of the simulated DEGs:

names(assess_DEGs$DEGs_evaluation$data)
# [1] "distribution_score" "true_proportions"   "SVM_result"

distribution_score: We hypothesized that if there are no DEGs between two samples, the false positive rate of detected DEGs should be close to 0.05 and P values should follow a uniform distribution. We directly removed them from the gene expression matrix and then DEA was performed using edgeRQLFDetRate to obtain the P values. Next, we defined the null hypothesis that the P values follows a uniform distribution. The Pearson Chi-Squared test was then used to determine whether to reject the null hypothesis under the threshold (P≤0.05). If the null hypothesis is accepted for the dataset with only two cell groups, the distribution score is assigned to 1.

true_proportions: We chosed the most optimal algorithm (edgeRQLFDetRate, edgeR QLF model including the cellular detection rate) that performed best in a previous benchmarking study to identify DEGs on the simulated data. All identified DEGs were considered as the ground truth. Then, the ratio of candidate DEGs (returned by the given simulation method) to the true DEGs was calculated. For three or more groups, it was weighted by the proportion of DEGs from pairwise groups to all of them.

SVM_result: The third approach is to construct models using the simulated gene expression matrices with only DEGs. Accuracy, precision, recall, F1 score and area under ROC curve (AUC) were applied to assess the model performance. For multi-class classification, macro-averaged precision, recall and F1 score were computed.

The situation where more than two groups are simulated

Splat, SCRIP, Lun and ESCO can return the DEGs from each pair of cell groups, but other methods (powsimR, muscat, scDesign, SPARSim, SPsimSeq, Lun2) only return the total DEGs of all pairs of groups when the number of simulated cell groups is higher than 2. In this special case, we can only assess their validity through the performance of the prediction models using the simulated DEGs.

Evaluate the simulated cell batches

Simulate the dataset with cell batches

Firstly, we will simulate three batches:

simulate_result <- simmethods::Splat_simulation(
  parameters = estimate_result[["estimate_result"]],
  return_format = "list",
  other_prior = list(batchCells = c(200, 300, 500),
                     nGenes = 2000),
  seed = 111
)
# nCells: 1000
# nGenes: 2000
# nGroups: 1
# de.prob: 0.1
# nBatches: 3

Then, check the batch information of cells:

cell_metadata <- simulate_result$simulate_result$col_meta
table(cell_metadata$batch)
# 
# Batch1 Batch2 Batch3 
#    200    300    500

Prepare the information of cell batches

Some necessary parameters and information related to the cell batches are needed:

  1. batch: The characters of batch names which cells belong to.
  2. k: k-nearest neighborhoods of cells.
batch_labels <- cell_metadata$batch

We can set k value as the square root of the cell number:

k <- round(sqrt(length(batch_labels)))

Assess the simulated cell batches

assess_batches <- simpipe::data_functionality_summary(
  data = simulate_result$simulate_result$count_data,
  batch = batch_labels,
  k = k
)
# The batch information is input...
# -------------------------------------------------
#              Evaluating cell batches
# -------------------------------------------------
# Calculate cms, ILSI, Mixing Metric and Shannon entropy...
# Calculate kBET...
# Calculate Average Silouette Width for batch..
# Calculate principal component regression...

List the assessment results of cell batches:

str(assess_batches$batch_evaluation)
# List of 1
#  $ data:List of 8
#   ..$ cms            : num [1:1000] 0 0.00321 0 0 0 ...
#   ..$ ISI            : num [1:1000] 0.441 2.01 1.719 2.582 1.36 ...
#   ..$ LISI           : num 1.59
#   ..$ mm             : num [1:1000] 32 16 23 15 28 27 15 16 12 12 ...
#   ..$ shannon_entropy: num [1:1000] 0.213 0.734 0.756 0.932 0.734 ...
#   ..$ kBET           : num [1:1000] 1 1 1 1 1 1 1 1 1 1 ...
#   ..$ AWS_batch      : num 0.0199
#   ..$ pcr            : num 0.0107

Eight metrics are used for assessing the quality of cell batches:

cms: Cell-specific mixing score. CMS is an innovative measurement to determine whether the batch-wise distances in a KNN graph follow the same distribution using the Andersion-Darling test. Ideally, the method yields a lower CMS, indicating it can simulate more reliable data with different batches.

ISI: Inverse Simpson’s Index. ISI is an interpretable metric used to measure diversity within each neighborhood. A lower ISI score indicates the stronger batch effects. Otherwise, a higher ISI score suggests cells are well-mixed across different batches.

LISI: Local Inverse Simpson’s Index. To address the problems of the metric sensitivity to the local distances between cells and the interpretation, LISI builds Gaussian kernel-based distributions of neighborhoods and uses the inverse Simpson’s Index. LISI scores represent the expected cell numbers to be sampled before two are drawn from the same batch. A lower LISI score indicates stronger batch effects. Otherwise, a LISI score close to the batch number suggests cells are well-mixed across different batches.

mm: Mixing metric. The mixing metric utilizes the median position of the k-th cell in the KNN graphs from each batch to determine the degree of batch effects. The lower value of this score, the better mixed the cells from different batches are.

shannon_entropy: Shannon entropy quantifies the randomness and complexity of scRNA-seq datasets containing cells from different sources of batches.

kBET: k-nearest neighbor batch effect test. kBET assesses the batch mixing by comparing the batch distribution within KNNs of a cell with the global one.

AWS_batch: Average silhouette width for batch. ASW can also be used for batch evaluation, where the batch effect is more pronounced when the AWS closes to 1 or -1.

pcr: Principal component regression. The PCR score represents the total contribution of the batch variable to the variance of a dataset.

Evaluate the simulated trajectories

Simulate the dataset with cellular trajectory

Firstly, we will simulate a trajectory dataset with five cell groups:

simulate_result <- simmethods::Splat_simulation(
  parameters = estimate_result[["estimate_result"]],
  return_format = "list",
  other_prior = list(nGenes = 1000,
                     prob.group = rep(0.2, 5),
                     paths = TRUE),
  seed = 111
)
# nCells: 160
# nGenes: 1000
# nGroups: 5
# de.prob: 0.1
# nBatches: 1
# Simulating trajectory datasets by Splat

The cell numbers of the real and simulated data should be the same.

Assess the simulated trajectory

Before assessing the similarity of the reference and the simulated trajectory, we should prepare the reference data and simulated data (the information of cell groups in the real or simulated data is optionally input).

ref_data <- ref_data
ref_data_grouping <- simmethods::group_condition
sim_data <- simulate_result$simulate_result$count_data
sim_data_grouping <- simulate_result$simulate_result$col_meta$group

Calculate the metrics for quantifying the similarity of two trajectories (if the information of simulated cell groups is provided, the simulated groups are also evaluated):

assess_traj <- simpipe::data_functionality_summary(
  data = sim_data,
  group = sim_data_grouping,
  ref_data = ref_data,
  ref_data_grouping = ref_data_grouping,
  algorithm = "Hungarian",
  seed = 666
)
# The group information is input...
# -------------------------------------------------
#              Evaluating cell groups
# -------------------------------------------------
# 1-Calculating CDI...
# 2-Calculating ROUGE...
# Warning in simutils::calculate_cluster_properties(data = group_data, dist = NULL, : The ROUGE calculation failed
# 3-Calculating silhouette...
# 4-Calculating dunn...
# 5-Calculating connectivity...
# 6-Calculating DB index...
# -------------------------------------------------
#              Evaluating trajectories
# -------------------------------------------------
# The group information of reference data is input...
# Performing trajectory inference by Slingshot for reference data...
# Executing 'slingshot' on '20231001_211112__data_wrapper__o90IhR6HPl'
# With parameters: list(cluster_method = "pam", ndim = 20L, shrink = 1L, reweight = TRUE,     reassign = TRUE, thresh = 0.001, maxit = 10L, stretch = 2L,     smoother = "smooth.spline", shrink.method = "cosine")
# inputs: expression
# priors :
# Using full covariance matrix
# Performing trajectory inference by Slingshot for simulated data...
# Executing 'slingshot' on '20231001_211112__data_wrapper__0KawbKxpqj'
# With parameters: list(cluster_method = "pam", ndim = 20L, shrink = 1L, reweight = TRUE,     reassign = TRUE, thresh = 0.001, maxit = 10L, stretch = 2L,     smoother = "smooth.spline", shrink.method = "cosine")
# inputs: expression
# priors :
# Using full covariance matrix
# Match cells in simulated and real data...
# Performing PCA...
# Performing Harmony...
# Calculate correlation matrix...
# 
# Match simulated and real cells using Hungarian...
#     reference simulation match_value
# 1 SC_serum_45      Cell1   0.3906363
# 2  SC_serum_3      Cell2   0.2300600
# 3    SC_2i_12      Cell3   0.3109244
# 4    SC_2i_39      Cell4   0.2567587
# 5 SC_serum_64      Cell5   0.3596158
# 6 SC_serum_16      Cell6   0.3042017
# Calculating correlation of distances...
# Waypoints of 9
# Waypoints of 17
# Waypoints of 25
# Waypoints of 33
# Waypoints of 41
# Waypoints of 49
# Waypoints of 57
# Waypoints of 65
# Waypoints of 73
# Waypoints of 80
# Change information in reference data...
# Calculating HIM...
# Warning: `data_frame()` was deprecated in tibble 1.1.0.
# ℹ Please use `tibble()` instead.
# ℹ The deprecated feature was likely used in the dyneval package.
#   Please report the issue to the authors.
# This warning is displayed once every 8 hours.
# Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
# generated.
# Calculating F1 branches...
# Calculating F1 milestones...
# Done...

Display the result:

traj_result <- assess_traj$trajectory_evaluation
names(traj_result$data)
# [1] "HIM"           "F1_branches"   "F1_milestones" "Cor_dist"
str(traj_result$data)
# List of 4
#  $ HIM          : num 0.509
#  $ F1_branches  : num 0.364
#  $ F1_milestones: num 0.269
#  $ Cor_dist     :List of 2
#   ..$ cor_dist    :List of 2
#   .. ..$ cor_dist    : num 0.0167
#   .. ..$ every_result: num [1:10] 0.0394 0.033 0.0339 0.0341 0.0199 ...
#   ..$ match_result:List of 5
#   .. ..$ PCA_raw           : num [1:320, 1:50] -13.74 -8.76 -46.39 -36.64 9.81 ...
#   .. .. ..- attr(*, "dimnames")=List of 2
#   .. .. .. ..$ : chr [1:320] "SC_2i_1" "SC_2i_2" "SC_2i_3" "SC_2i_4" ...
#   .. .. .. ..$ : chr [1:50] "PC1" "PC2" "PC3" "PC4" ...
#   .. ..$ cor_result        : num [1:160, 1:160] -0.0854 0.1298 -0.1354 0.0151 0.231 ...
#   .. .. ..- attr(*, "dimnames")=List of 2
#   .. .. .. ..$ : chr [1:160] "Cell1" "Cell2" "Cell3" "Cell4" ...
#   .. .. .. ..$ : chr [1:160] "SC_2i_1" "SC_2i_2" "SC_2i_3" "SC_2i_4" ...
#   .. ..$ harmony_embeddings: num [1:320, 1:50] -18.22 -8.57 -52.81 -40.82 6.8 ...
#   .. .. ..- attr(*, "dimnames")=List of 2
#   .. .. .. ..$ : chr [1:320] "SC_2i_1" "SC_2i_2" "SC_2i_3" "SC_2i_4" ...
#   .. .. .. ..$ : chr [1:50] "PC1" "PC2" "PC3" "PC4" ...
#   .. ..$ cell_pair         :'data.frame':	160 obs. of  3 variables:
#   .. .. ..$ reference  : chr [1:160] "SC_serum_45" "SC_serum_3" "SC_2i_12" "SC_2i_39" ...
#   .. .. ..$ simulation : chr [1:160] "Cell1" "Cell2" "Cell3" "Cell4" ...
#   .. .. ..$ match_value: num [1:160] 0.391 0.23 0.311 0.257 0.36 ...
#   .. ..$ cost              : num -56.2

Four metrics are used in the assessment of two cellular trajectories:

HIM: Hamming–Ipsen–Mikhailov. The HIM metric is a distance function quantifying the difference between two milestone networks, which linearly combines a local Hamming distance and a global Ipsen-Mikhailov distance.

F1_branches: F1 score for branches. F1branches score is used to measure the accuracy of branch assignment to cells.

F1_milestones: F1 score for milestones. F1milestones is for comparing the arrangements of cells belonging to different milestones.

Cor_dist: Correlation between geodesic distances. A cell located at the same position in the real and simulated trajectories should have equal relative distances to all other cells. Correlation between geodesic distance measures the correlation between the two relative distances of a cell from the real and simulated trajectory. To match the real and simulated cells within two trajectories, the Hungarian algorithm implemented in RcppHungarian R package was adopted.