2  Methods

2.1 Pre-Processing

2.1.1 Case Selection

Prior to pre-processing, samples were split into a training, a confirmation, and a validation set.

  • Training
    • CS1: OOU, OOUE, VOA, MAYO, MTL
    • CS2: OOU, OOUE, VOA, MAYO, OVAR3, OVAR11, JAPAN, MTL, POOL-CTRL
    • CS3: OOU, OOUE, VOA, POOL-1, POOL-2, POOL-3
  • Confirmation:
    • CS3: TNCO
  • Validation:
    • CS3: DOVE4

2.1.2 Quality Control

Before normalization, we calculated several quality control measures and excluded samples that failed to achieve sample quality in one or more of these measures.

  • Linearity of positive control genes: If the R-squared from a linear model of positive controls and their concentrations is less than 0.95 or missing, then the sample is flagged.
  • Imaging quality: The sample is flagged if the field of view percentage is less than 75%.
  • Positive Control flag: We consider the two smallest positive controls at concentrations 0.5 and 1. If these two controls are less than the lower limit of detection (defined as two standard deviations below the mean of the negative control expression), or if the mean negative control expression is 0, the sample is flagged.
  • The signal-to-noise ratio or percent of genes detected: These two measures are defined as the ratio of the average housekeeping gene expression over the upper limit of detection, defined as two standard deviations above the mean of the negative control expression (or 0 if this limit is less than 0.001), and the proportion of endogenous genes with expression greater than the upper limit of detection. These measures are flagged if they are below a pre-specified threshold, which is determined visually by considering their bivariate distribution in a scatterplot. In this case, we used 100 for the SNR threshold and 50% for the threshold for genes detected. Note: these thresholds were determined by examining the relationship in Section 3.3.2.

2.1.3 Housekeeping Genes Normalization

The full training set (n=1257) comprised of data from three CodeSets (CS) 1, 2, and 3. Data normalization removes technical variation from high-throughput platforms to improve the validity of comparative analyses.

Each CodeSet was first normalized to housekeeping genes: ACTB, RPL19, POLR1B, SDHA, and PGK1. Housekeeping genes encode proteins responsible for basic cell function and have consistent expression in all cells. All expression values were log2 transformed. Normalization to housekeeping genes corrects the viable RNA from each sample. This is achieved by subtracting the average log (base 2)-transformed expression of the housekeeping genes from the log (base 2)-transformed expression of each gene:

\[log_2(\text{endogenous gene expression}) - \text{average(}log_2(\text{housekeeping gene expresssion})) = \text{relative expression} \tag{2.1}\]

2.1.4 Between CodeSet and Site Normalization

To normalize between CodeSets, we randomly selected five specimens, one from each histotype, among specimens repeated in all three CodeSets. This formed the reference set (Random 1). We selected only one sample from each histotype to use as few samples as possible for normalization and retain the rest for analysis.

A reference-based approach (Talhouk et al. (2016)) was used to normalize CS1 to CS3 and CS2 to CS3 across their common genes:

\[\begin{aligned} \text{X-Norm}_{\text{CS1}} = X_{\text{CS1}} + {\bar{R}_{\text{CS3}}} - {\bar{R}_{\text{CS1}}} \\ \text{X-Norm}_{\text{CS2}} = X_{\text{CS2}} + {\bar{R}_{\text{CS3}}} - {\bar{R}_{\text{CS2}}} \end{aligned} \tag{2.2}\]

Samples in CS3 were processed at three different locations; we also had to normalize for “site” in this CodeSet. Finally, the CS3 expression samples were included in the training set without further normalization:

\[\begin{aligned} \text{X-Norm}_{\text{CS3-USC}} = X_{\text{CS3-USC}} + {\bar{R}_{\text{CS3-VAN}}} - {\bar{R}_{\text{CS3-USC}}} \\ \text{X-Norm}_{\text{CS3-AOC}} = X_{\text{CS3-AOC}} + {\bar{R}_{\text{CS3-VAN}}} - {\bar{R}_{\text{CS3-AOC}}} \end{aligned} \tag{2.3}\]

The CS3 expression samples were included in the training set without further normalization. CS3 cross-site replicates were all chosen from Vancouver, thus we did not retain any samples from USC or AOC. Finally, the initial training set is assembled by combining the normalized datasets for CS1 and CS2 along with the CS3 expression not used in normalization:

\[\begin{aligned} \text{Training Set} &= \text{X-Norm}_{\text{CS1}} + \text{X-Norm}_{\text{CS2}} + \text{X}_{\text{CS3-VAN}} + \text{X-Norm}_{\text{CS3-USC}} + \text{X-Norm}_{\text{CS3-AOC}} \\ &= \text{X-Norm}_{\text{CS1}} + \text{X-Norm}_{\text{CS2}} + \text{X}_{\text{CS3}} \end{aligned} \tag{2.4}\]

Figure 2.1: Venn diagram of common and unique gene targets covered by each CodeSet

2.1.5 Final Processing

We map ovarian histotypes to all remaining samples and keep the major histotypes for building the predictive model: high-grade serous ovarian carcinoma (HGSOC), clear cell ovarian carcinoma (CCOC), endometrioid ovarian carcinoma (ENOC), mucinous ovarian carcinoma (MUOC), and low-grade serous ovarian carcinoma (LGSOC).

Duplicate cases (two samples with the same ottaID) were removed before generating the final training set to use for fitting the classification models. All CS3 cases were preferred over CS1 and CS2, and CS3-Vancouver cases were preferred over CS3-AOC and CS3-USC when selecting duplicates.

The final training set used only genes that were common across all three CodeSets.

Figure 2.2: Cohorts Selection

2.2 Classifiers

We used a supervised learning framework to train four multinomial classification algorithms in the Training Set. The full prediction model pipeline was run using SLURM batch jobs submitted to a partition on a Rocky 9 server. All resampling techniques, pre-processing, model specification, hyperparameter tuning, and evaluation metrics were implemented using the tidymodels suite of packages in R (v4.5.2). The classifiers we used are:

  • Random Forest (rf)
  • Support Vector Machine (svm)
  • XGBoost (xgb)
  • Regularized Multinomial Regression (mr)

2.2.1 Resampling of Training Set

We used a nested cross-validation design to evaluate each classifier and perform hyperparameter tuning. An outer 5-fold CV stratified by histotype, along with an inner 5-fold CV with two repeats also stratified by histotype, ensured that the test sets of the inner dataset contained a reasonable number of cases from the smallest minority class.

The hyperparameter combination that yielded the highest average F1 score across the inner training sets was selected for assessing prediction performance in the outer training loop. We avoided using the bootstrap method for outer resampling to prevent overlap between the training and test sets caused by sampling with replacement. This overlap could lead to inflated performance estimates, as some observations would be used for both training and evaluation in the inner loop.

2.2.2 Hyperparameter Tuning

The following specifications for each classifier were used for tuning hyperparameters:

  • rf: The number of trees was fixed at 500. Two other hyperparameters were tuned across 10 randomly selected points in a space-filling parameter grid design: 

    • mtry: number of randomly selected predictors 

    • min_n: minimal node size

  • xgb: The number of trees was fixed at 500. Seven other hyperparameters were tuned across 10 randomly selected points in a space-filling parameter grid design:

    • mtry: number of randomly selected predictors 

    • min_n: minimal node size 

    • tree_depth: maximum depth of the tree (splits) 

    • learn_rate: rate at which the boosting algorithm adapts from iteration-to-iteration (shrinkage parameter) 

    • loss_reduction: reduction in the loss function required to split further 

    • sample_size: proportion of observations exposed to fitting routine 

    • stop_iter: number of iterations without improvement before stopping

  • svm: The cost and sigma hyperparameters were tuned across 10 randomly selected points in a space-filling parameter grid design. We tuned the cost parameter in the range [1, 8]. The range for tuning the sigma parameter was obtained from the 10% and 90% quantiles of the estimates from the kernlab::sigest() function.

  • mr: The penalty (lambda) and mixture (alpha) hyperparameters were tuned across 100 randomly selected points in a space-filling parameter grid design.

The hyperparameter combination that resulted in the highest average F1-score across the inner training sets was selected for each classifier to use as the model for assessing prediction performance in the outer training loop.

2.2.3 Subsampling

One of the main concerns in the analysis is that the distribution of the five main histotypes is highly imbalanced, which has implications for the accuracy of the classifier. To mitigate this, we investigated several strategies based on subsampling approaches:

  • None: No subsampling is performed
  • Down-sampling: All levels except the minority class are sampled down to the same frequency as the minority class
  • Up-sampling: All levels except the majority class are sampled up to the same frequency as the majority class
  • SMOTE: All levels except the majority class have synthetic data generated until they have the same frequency as the majority class
  • Hybrid: All levels except the majority class have synthetic data generated up to 50% of the frequency of the majority class, then the majority class is sampled down to the same frequency as the rest.

Figure 2.3 helps visualize how the distribution of classes changes when we apply subsampling techniques to handle class imbalance:

Figure 2.3: Visualization of Subsampling Techniques

All subsampling methods to mitigate class imbalance were applied within the inner CV loop.

2.2.4 Workflows

The four algorithms and five subsampling methods created 20 unique classification workflows. For example, the hybrid_xgb workflow pre-processes the training set by applying a hybrid subsampling method and subsequently uses the XGBoost algorithm to classify ovarian histotypes. We additionally tested two ensemble algorithms, the two-step, and the sequential algorithms described in the following sections.

2.3 Two-Step Algorithm

The HGSOC histotype comprises most cases among ovarian carcinoma patients, while the remaining patients are uniformly split across ENOC, CCOC, MUOC, and LGSOC histotypes. We can implement a two-step algorithm as such:

  • Step 1: Select the best classifier to predict HGSOC vs. non-HGSOC, using the workflow with the best per-class F1-score for HGSOC when fit to a 5-class prediction model

  • Step 2: remove HGSOCs from the dataset and re-train a 4-class multinomial classifier to predict the four remaining non-HGSOC histotypes. Select the workflow with the best macro-averaged F1-score when fit to a 4-class prediction model. 

Let

\[\begin{aligned} & X_k = \text{Training data with k classes} \\ & C_k = \text{Class with highest}\;F_1\;\text{score from training}\;X_k \\ & W_k = \text{Workflow associated with}\;C_k \end{aligned} \tag{2.5}\]

Figure 2.4 shows how the two-step algorithm works:

two_step_algorithm a X 5 b1 Store HGSOC predictions a->b1 Train HGSOC vs. all using W 5   b2 Select W 5  for HGSOC a->b2 Train multinomial classes c X 4 b1->c Remove HGSOC cases from X 5   j Aggregate all predictions b1->j d1 Store remaining predictions c->d1 Train multinomial classes with best workflow d2 Select best workflow with highest overall F1-score c->d2 Retrain multinomial classes d1->j
Figure 2.4: Two-Step Algorithm

2.3.1 Aggregating Predictions

The aggregation for two-step predictions is quite straightforward:

  1. Predict HGSOC vs. non-HGSOC
  2. Among all non-HGSOC cases, predict CCOC vs. LGSOC vs. MUOC vs. ENOC
two_step_predictions a Predict HGSOC vs. non-HGSOC b1 HGSOC a->b1 b2 non-HGSOC a->b2 c Predict CCOC vs. LGSOC vs. MUOC vs. ENOC b2->c d1 CCOC c->d1 f1 LGSOC c->f1 h1 MUOC c->h1 h2 ENOC c->h2
Figure 2.5: Aggregating Predictions for Two-Step Algorithm

2.4 Sequential Algorithm

Instead of training on k classes simultaneously using multinomial classifiers, we can use a sequential algorithm that iteratively performs one-vs-all binary classifications k-1 times to obtain a final prediction of all cases. At each step in the sequence, we classify one class vs. all other classes, where the classes that make up the “other” class are those not equal to the current “one” class and exclude all “one” classes from previous steps. For example, if the “one” class in step 1 was HGSOC, the “other” classes would include CCOC, ENOC, LGSOC, and MUOC. If the “one” class in step 2 was CCOC, the “other” classes would include ENOC, LGSOC, and MUOC. 

The order of classes and workflows to use at each step in the sequential algorithm is determined after removing the data associated with an already classified class and retraining using the remaining data multinomial classifiers, as described before. The class and workflow to use for the next step in the sequence are selected based on the best per-class evaluation metric value (e.g., F1-score). 

Figure 2.6 illustrates how the sequential algorithm works for K=5, using ovarian histotypes as an example for the classes.

sequential_algorithm a X 5 b1 Store X 5  predictions a->b1 Train C 5  vs. all using W 5   b2 Select W 5  and C 5 a->b2 Train multinomial classes c X 4 b1->c Remove C 5  cases from X 5 j Aggregate all predictions b1->j d1 Store X 4  predictions c->d1 Train C 4  vs. all using W 4   d2 Select W 4  and C 4 c->d2 Retrain multinomial classes f X 3 d1->f Remove C 4  cases from X 4 d1->j g1 Store X 3  predictions f->g1 Train C 3  vs. all using W 3   g2 Select W 3  and C 3 f->g2 Retrain multinomial classes h X 2 g1->h Remove C 3  cases from X 3 g1->j i1 Store X 2  predictions h->i1 Train C 2  vs. all using W 2   i2 Select W 2  and C 2 h->i2 Retrain multinomial classes i1->j
Figure 2.6: Sequential Algorithm

2.4.1 Aggregating Predictions

We have to aggregate the one-vs-all predictions from each of the sequential algorithm workflows in order to obtain a final class prediction on a holdout test set. Each sequential workflow has to be assessed on every sample to ensure that cases classified into the “other” class from a previous step of the sequence are eventually assigned to a predicted class. For example, if based on certain class-specific metrics, we determined that the order of classes in the sequential algorithm was to predict HGSC vs. non-HGSC, CCOC vs. non-CCOC, LGSOC vs. non-LGSOC, and then MUOC vs. ENOC. Figure 2.7 illustrates how the final predictions are assigned:

sequential_predictions a Predicted HGSOC vs. non-HGSOC b1 HGSOC a->b1 b2 non-HGSOC a->b2 c Predicted CCOC vs. non-CCOC b2->c d1 CCOC c->d1 d2 non-CCOC c->d2 e Predicted LGSOC vs. non-LGSOC d2->e f1 LGSOC e->f1 f2 non-LGSOC e->f2 g Predicted MUOC vs. ENOC f2->g h1 MUOC g->h1 h2 ENOC g->h2
Figure 2.7: Aggregating Predictions for Sequential Algorithm

2.5 Performance Evaluation

2.5.1 Class Metrics

We use accuracy, sensitivity, specificity, F1-score, kappa, and balanced accuracy, as class metrics to measure both training and test performance between different workflows. Multiclass extensions of these metrics can be calculated except for F1-score, where we use macro-averaging to obtain an overall metric. Class-specific metrics are calculated by recoding classes into one-vs-all categories for each class. 

2.5.1.1 Accuracy

The accuracy is defined as the proportion of correct predictions out of all cases:

\[\text{accuracy} = \frac{TP}{TP + FP + FN + TN} \tag{2.6}\]

2.5.1.2 Sensitivity

Sensitivity is the proportional of correctly predicted positive cases, out of all cases that were truly positive

\[\text{sensitivity} = \frac{TP}{TP + FN} \tag{2.7}\]

2.5.1.3 Specificity

Specificity is the proportional of correctly predicted negative cases, out of all cases that were truly negative.

\[\text{specificity} = \frac{TN}{TN + FP} \tag{2.8}\]

2.5.1.4 F1-Score

The F-measure can be thought of as a harmonic mean between precision and recall:

\[F_{meas} = \frac{(1 + \beta^2) \times precision \times recall}{(\beta^2 \times precision) + recall} \tag{2.9}\]

The \(\beta\) value can be adjusted to place more weight upon precision or recall. The most common value is \(\beta\) is 1, which is also commonly known as the F1-score. A multiclass extension doesn’t exist for the F1-score, so we use macro-averaging to calculate this metric when there are more than two classes. For example, with \(k\) classes, the macro-averaged F1-score is equal to:

\[{F_1}_{macro} = \frac{1}{k} \sum_{i=1}^{k}{F_1}_{i} \tag{2.10}\]

where each \({F_1}_{i}\) is the F1-score computed frrom recoding classes into \(k=i\) vs. \(k \neq i\).

In situations where there is not at least one predicted case for each of the classes (e.g. for a poor classifier), \({F_1}_{i}\) is undefined because the per-class precision of class \(i\) is undefined. Those \({F_1}_{i}\) terms are removed from the \({F_1}_{macro}\) equation and the resulting value may be inflated. Interpreting the F1-score in such a case would be misleading.

2.5.1.5 Balanced Accuracy

Balanced accuracy is the arithmetic mean of sensitivity and specificity.

\[\text{Balanced Accuracy} = \frac{\text{Sensitivity} + \text{Specificity}}{2} \tag{2.11}\]

2.5.1.6 Kappa

Kappa is the defined as:

\[\text{kappa} = \frac{p_0 - p_e}{1 - p_e} \tag{2.12}\]

where \(p_0\) is the observed agreement among raters and \(p_e\) is the hypothetical probability of agreement due to random chance.

2.5.2 AUC

The area under the receiver operating curve (AUC) is calculated by adding up the area under the curve formed by plotting sensitivity vs. 1 - specificity. The Hand-till method is used as a multiclass extension for the AUC.

We did not use AUC to measure class-specific training set performance because combining predicted probabilities in a one-vs-all fashion might be potentially misleading. The sum of probabilities that add up to the “other” class is not equivalent to the predicted probability of the “other” class when using a multiclass classifier.

Instead, we only reported ROC curves and their associated AUCs for test set performance among the highest ranked algorithms.

2.6 Rank Aggregation

To select the best algorithm, we implemented a two-stage rank aggregation procedure using the Genetic Algorithm (Pihur et al. 2009). First, we ranked all workflows based on per-class F1-scores, balanced accuracy, and kappa to see which workflows performed well in predicting all five histotypes. Then, we took the ranks from these three performance metrics and performed a second run of rank aggregation. The top 5 workflows were determined from the final rank aggregation result.

2.7 Gene Optimization

We want to discover an optimal set of genes for the classifiers while including specific genes from other studies such as PrOTYPE and SPOT. A total of 72 genes are used in the classifier training set.

The classifier set includes 16 genes that overlap with the PrOTYPE classifier: COL11A1, CD74, CD2, TIMP3, LUM, CYTIP, COL3A1, THBS2, TCF7L1, HMGA2, FN1, POSTN, COL1A2, COL5A2, PDZK1IP1, FBN1.

The classifier set also includes 13 genes that overlap with the SPOT signature: HIF1A, CXCL10, DUSP4, SOX17, MITF, CDKN3, BRCA2, CEACAM5, ANXA4, SERPINE1, TCF7L1, CRABP2, DNAJC9.

We obtain 28 genes from the union of PrOTYPE and SPOT genes that we want to include in the final classifier, regardless of model performance. We then incrementally add genes one at a time from the remaining 44 candidate genes based on a variable importance rank to the set of 28 base genes and recalculate performance metrics. The optimal number of genes is determined by the number of genes needed to achieve the highest average F1-Score across classes, including the overall metric in the average. We use an F1-score that is averaged across cross-validation folds (5) and class groups (6: Overall, HGSOC, CCOC, ENOC, MUOC, LGSOC) to compare performance between different number of genes selected. 

Here is the breakdown of genes used and whether they belong to the PrOTYPE and/or SPOT sets:

Table 2.1: Gene Distribution
Genes PrOTYPE SPOT
TCF7L1
COL11A1
CD74
CD2
TIMP3
LUM
CYTIP
COL3A1
THBS2
HMGA2
FN1
POSTN
COL1A2
COL5A2
PDZK1IP1
FBN1
HIF1A
CXCL10
DUSP4
SOX17
MITF
CDKN3
BRCA2
CEACAM5
ANXA4
SERPINE1
CRABP2
DNAJC9
C10orf116
GAD1
TPX2
KGFLP2
EGFL6
KLK7
PBX1
LIN28B
TFF3
MUC5B
FUT3
STC1
BCL2
PAX8
GCNT3
GPR64
ADCYAP1R1
IGKC
BRCA1
IGJ
TFF1
MET
CYP2C18
CYP4B1
SLC3A1
EPAS1
HNF1B
IL6
ATP5G3
DKK4
SENP8
CAPN2
C1orf173
CPNE8
IGFBP1
WT1
TP53
SEMA6A
SERPINA5
ZBED1
TSPAN8
SCGB1D2
LGALS4
MAP1LC3A

2.7.1 Variable Importance

Variable importance is calculated using either a model-based approach if it is available, or a permutation-based VI score otherwise (e.g., for SVM). The variable importance scores are averaged across the outer training folds, and then ranked from highest to lowest.

For the sequential and two-step classifiers, we calculate an overall VI rank by taking the cumulative union of genes at each variable importance rank across all sequences, until all genes have been included.

The variable importance measures are:

  • Random Forest: impurity measure (Gini index)

  • XGBoost: gain (fractional contribution of each feature to the model based on the total gain of the corresponding features’s splits)

  • SVM: permutation based p-values

  • Multinomial regression: absolute value of estimated coefficients at cross-validated lambda value