4  Results

We summarize the cross-validated per-class performance in the training set. The metrics of interest are accuracy, F1-score, and kappa. For each metric, workflows are ordered by their mean estimates across the outer folds of the nested CV.
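The ordering rule can be sketched as follows. This is a minimal illustration: the workflow names match those used later in this chapter, but the fold-level scores are invented placeholders, not the study's actual estimates.

```python
from statistics import mean

# Hypothetical outer-fold metric estimates per workflow (illustrative values only).
outer_fold_scores = {
    "up_xgb":    [0.87, 0.85, 0.88],
    "smote_svm": [0.84, 0.83, 0.85],
    "up_rf":     [0.85, 0.84, 0.86],
}

# Order workflows by their mean estimate across the outer folds, best first.
ranked = sorted(outer_fold_scores,
                key=lambda wf: mean(outer_fold_scores[wf]),
                reverse=True)
print(ranked)  # ['up_xgb', 'up_rf', 'smote_svm']
```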

4.1 Training Set

4.2 Rank Aggregation

Multi-step methods:

  • sequential: the sequential algorithm uses the following sequence of subsampling methods and algorithms:
    • HGSOC vs. non-HGSOC using upsampling and XGBoost
    • CCOC vs. non-CCOC using no subsampling and random forest
    • ENOC vs. non-ENOC using no subsampling and support vector machine
    • MUOC vs. LGSOC using SMOTE subsampling and random forest
  • two_step: the two-step algorithm uses the following sequence of subsampling methods and algorithms:
    • HGSOC vs. non-HGSOC using upsampling and XGBoost
    • CCOC vs. ENOC vs. MUOC vs. LGSOC using SMOTE subsampling and support vector machine
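The cascade logic shared by these multi-step methods can be sketched as below. The binary classifiers are toy stand-ins (simple callables keyed on one feature), not the fitted XGBoost, random forest, and SVM models described above; all thresholds are illustrative.

```python
# Each step pairs a positive label with a binary classifier; a classifier here
# is any callable returning True when the sample belongs to the positive class.
def sequential_predict(sample, steps, final_step):
    for label, is_positive in steps:
        if is_positive(sample):
            return label
        # Negative cases fall through to the next one-vs-rest classifier.
    # The last step in the actual sequential workflow is a direct
    # MUOC-vs-LGSOC decision; here final_step plays that role.
    return final_step(sample)

# Toy stand-in classifiers (illustrative only).
steps = [
    ("HGSOC", lambda s: s["x"] > 0.9),
    ("CCOC",  lambda s: s["x"] > 0.7),
    ("ENOC",  lambda s: s["x"] > 0.5),
]
final = lambda s: "MUOC" if s["x"] > 0.3 else "LGSOC"

print(sequential_predict({"x": 0.95}, steps, final))  # HGSOC
print(sequential_predict({"x": 0.2},  steps, final))  # LGSOC
```

The two-step variant follows the same pattern with a single HGSOC-vs-rest step followed by one four-class decision.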

We conduct rank aggregation using a two-stage nested approach:

  1. First, we rank aggregate the per-class metrics for F1-score, balanced accuracy, and kappa.
  2. Then, we take the aggregated lists from the three metrics and perform a final rank aggregation.
  3. The top workflows from the final rank aggregation are used for gene optimization in the confirmation set.
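One simple way to carry out each aggregation stage is mean-rank (Borda-style) aggregation, sketched below. This is an assumption for illustration, not necessarily the aggregation method used in the study, and the rank lists are invented placeholders.

```python
from statistics import mean

def aggregate(rank_lists):
    """Aggregate several rank lists by mean rank; lower mean rank is better."""
    workflows = rank_lists[0].keys()
    mean_rank = {wf: mean(rl[wf] for rl in rank_lists) for wf in workflows}
    return sorted(workflows, key=lambda wf: mean_rank[wf])

# Stage 1: hypothetical per-class ranks for one metric (illustrative values).
per_class_f1 = [
    {"sequential": 1, "two_step": 2, "up_xgb": 3},  # HGSOC ranks
    {"sequential": 2, "two_step": 3, "up_xgb": 1},  # CCOC ranks
]
print(aggregate(per_class_f1))  # ['sequential', 'up_xgb', 'two_step']
```

Stage 2 applies the same function to the three per-metric aggregated lists to produce the final ordering.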

4.2.1 Across Classes

4.2.2 Across Metrics

Table 4.10: Rank Aggregation Comparison of Metrics Used in Training Set
Rank F1 Balanced Accuracy Kappa
1 sequential sequential sequential
2 two_step two_step two_step
3 up_xgb up_mr up_xgb
4 up_rf smote_mr up_rf
5 smote_svm up_xgb smote_svm
6 hybrid_svm smote_xgb hybrid_rf
7 smote_xgb hybrid_mr smote_xgb
8 hybrid_rf hybrid_xgb hybrid_svm
9 up_svm down_xgb up_svm
10 hybrid_xgb smote_svm hybrid_xgb
11 none_rf hybrid_rf smote_rf
12 smote_mr down_mr smote_mr
13 smote_rf hybrid_svm none_svm
14 hybrid_mr up_rf none_rf
15 up_mr smote_rf hybrid_mr
16 down_mr down_rf up_mr
17 down_rf none_svm down_mr
18 down_svm down_svm down_rf
19 down_xgb none_rf down_svm
20 NA up_svm down_xgb
21 NA none_mr none_mr
22 NA none_xgb none_xgb
Table 4.11: Top 5 Workflows from Final Rank Aggregation
Rank Workflow
1 sequential
2 two_step
3 up_xgb
4 up_rf
5 smote_svm

4.2.3 Top Workflows

We look at the per-class evaluation metrics of the top 5 workflows.

Table 4.12: Top Workflow Per-Class Evaluation Metrics
Histotypes
Metric Workflow HGSOC CCOC ENOC MUOC LGSOC
Accuracy sequential 0.963 (0.956, 0.968) 0.917 (0.896, 0.958) 0.856 (0.774, 0.909) 0.951 (0.882, 1) 0.951 (0.882, 1)
2-STEP 0.963 (0.956, 0.968) 0.934 (0.896, 0.958) 0.839 (0.729, 0.896) 0.892 (0.812, 0.96) 0.971 (0.938, 1)
Up-XGB 0.96 (0.944, 0.976) 0.985 (0.968, 0.996) 0.963 (0.956, 0.976) 0.977 (0.968, 0.988) 0.981 (0.976, 0.988)
Up-RF 0.955 (0.92, 0.976) 0.983 (0.972, 0.992) 0.959 (0.948, 0.972) 0.979 (0.964, 0.988) 0.982 (0.968, 0.988)
SMOTE-SVM 0.95 (0.936, 0.968) 0.979 (0.968, 0.988) 0.954 (0.94, 0.964) 0.978 (0.96, 0.984) 0.981 (0.972, 0.984)
Sensitivity sequential 0.978 (0.966, 0.99) 0.814 (0.75, 0.882) 0.865 (0.812, 0.941) 0.965 (0.909, 1) 0.92 (0.6, 1)
2-STEP 0.978 (0.966, 0.99) 0.866 (0.824, 0.933) 0.735 (0.556, 0.824) 0.842 (0.75, 0.923) 0.767 (0, 1)
Up-XGB 0.981 (0.976, 0.986) 0.817 (0.571, 0.941) 0.679 (0.444, 0.818) 0.8 (0.538, 1) 0.35 (0.25, 0.667)
Up-RF 0.988 (0.979, 0.995) 0.793 (0.571, 0.882) 0.635 (0.444, 0.812) 0.766 (0.462, 0.909) 0.183 (0, 0.333)
SMOTE-SVM 0.966 (0.955, 0.977) 0.757 (0.571, 0.85) 0.681 (0.5, 0.864) 0.748 (0.538, 0.909) 0.642 (0.5, 0.75)
Specificity sequential 0.897 (0.875, 0.918) 0.969 (0.938, 1) 0.847 (0.733, 0.875) 0.92 (0.6, 1) 0.965 (0.909, 1)
2-STEP 0.897 (0.875, 0.918) 0.969 (0.935, 1) 0.893 (0.833, 0.935) 0.908 (0.833, 0.973) 0.977 (0.953, 1)
Up-XGB 0.875 (0.826, 0.919) 0.996 (0.992, 1) 0.981 (0.97, 0.991) 0.986 (0.979, 0.992) 0.993 (0.984, 1)
Up-RF 0.822 (0.738, 0.892) 0.996 (0.992, 1) 0.98 (0.966, 0.987) 0.989 (0.983, 0.996) 0.997 (0.988, 1)
SMOTE-SVM 0.88 (0.804, 0.919) 0.993 (0.987, 1) 0.971 (0.962, 0.975) 0.989 (0.983, 0.996) 0.987 (0.976, 0.996)
F1-Score sequential 0.977 (0.973, 0.98) 0.868 (0.828, 0.933) 0.86 (0.788, 0.914) 0.966 (0.923, 1) 0.91 (0.75, 1)
2-STEP 0.977 (0.973, 0.98) 0.899 (0.848, 0.933) 0.755 (0.606, 0.848) 0.788 (0.667, 0.923) 0.733 (0, 1)
Up-XGB 0.975 (0.964, 0.986) 0.862 (0.667, 0.97) 0.685 (0.471, 0.857) 0.753 (0.636, 0.87) 0.368 (0.25, 0.571)
Up-RF 0.972 (0.949, 0.986) 0.849 (0.696, 0.938) 0.652 (0.471, 0.788) 0.757 (0.571, 0.87) 0.362 (0.286, 0.4)
SMOTE-SVM 0.969 (0.96, 0.981) 0.811 (0.667, 0.919) 0.638 (0.5, 0.809) 0.752 (0.583, 0.833) 0.524 (0.364, 0.714)
Balanced Accuracy sequential 0.938 (0.928, 0.947) 0.892 (0.859, 0.938) 0.856 (0.773, 0.908) 0.943 (0.8, 1) 0.943 (0.8, 1)
2-STEP 0.938 (0.928, 0.947) 0.917 (0.88, 0.952) 0.814 (0.694, 0.88) 0.875 (0.792, 0.948) 0.872 (0.479, 1)
Up-XGB 0.928 (0.901, 0.952) 0.906 (0.781, 0.971) 0.83 (0.714, 0.905) 0.893 (0.765, 0.99) 0.671 (0.621, 0.829)
Up-RF 0.905 (0.858, 0.941) 0.894 (0.784, 0.941) 0.808 (0.714, 0.898) 0.878 (0.727, 0.95) 0.59 (0.5, 0.665)
SMOTE-SVM 0.923 (0.885, 0.948) 0.875 (0.781, 0.925) 0.826 (0.735, 0.919) 0.869 (0.761, 0.948) 0.814 (0.746, 0.869)
Kappa sequential 0.879 (0.862, 0.895) 0.808 (0.754, 0.903) 0.712 (0.547, 0.818) 0.877 (0.679, 1) 0.877 (0.679, 1)
2-STEP 0.879 (0.862, 0.895) 0.85 (0.769, 0.903) 0.635 (0.402, 0.769) 0.716 (0.538, 0.896) 0.718 (-0.029, 1)
Up-XGB 0.871 (0.822, 0.905) 0.854 (0.65, 0.968) 0.666 (0.452, 0.844) 0.741 (0.62, 0.863) 0.36 (0.239, 0.566)
Up-RF 0.85 (0.768, 0.903) 0.84 (0.682, 0.933) 0.63 (0.452, 0.773) 0.746 (0.554, 0.863) 0.213 (0, 0.396)
SMOTE-SVM 0.839 (0.783, 0.876) 0.8 (0.65, 0.913) 0.614 (0.48, 0.789) 0.741 (0.563, 0.825) 0.516 (0.352, 0.706)
Figure 4.7: Top 5 Workflow Per-Class Evaluation Metrics by Metric
Table 4.13: Top Workflow Per-Class Evaluation Metrics and Ranks
Workflow Rank HGSOC CCOC ENOC MUOC LGSOC
F1-Score
sequential 1 0.977 0.868 0.860 0.966 0.910
2-STEP 2 0.977 0.899 0.755 0.788 0.733
Up-XGB 3 0.975 0.862 0.685 0.753 0.368
Up-RF 4 0.972 0.849 0.652 0.757 0.362
SMOTE-SVM 5 0.969 0.811 0.638 0.752 0.524
Balanced Accuracy
sequential 1 0.938 0.892 0.856 0.943 0.943
2-STEP 2 0.938 0.917 0.814 0.875 0.872
Up-XGB 5 0.928 0.906 0.830 0.893 0.671
SMOTE-SVM 10 0.923 0.875 0.826 0.869 0.814
Up-RF 14 0.905 0.894 0.808 0.878 0.590
Kappa
sequential 1 0.879 0.808 0.712 0.877 0.877
2-STEP 2 0.879 0.850 0.635 0.716 0.718
Up-XGB 3 0.871 0.854 0.666 0.741 0.360
Up-RF 4 0.850 0.840 0.630 0.746 0.213
SMOTE-SVM 5 0.839 0.800 0.614 0.741 0.516
Figure 4.8: Top 5 Workflow Per-Class Evaluation Metrics by Metric

Cases misclassified at an earlier step of the classifier sequence are excluded from subsequent steps within the training set CV folds. Consequently, we cannot piece together the test-fold predictions from the sequential and two-step algorithms to obtain overall metrics.

4.3 Confirmation Set

We now assess how the top five workflows perform in the confirmation set, using the class-specific F1-scores. The top-performing method will be selected for gene optimization.

Table 4.14: Evaluation Metrics on Confirmation Set Models
Histotypes
Method Metric Overall HGSOC CCOC ENOC MUOC LGSOC
Sequential Accuracy 0.830 0.860 0.977 0.882 0.969 0.974
Sensitivity 0.592 0.950 0.847 0.486 0.593 0.083
Specificity 0.923 0.683 0.993 0.961 0.985 0.990
F1-Score 0.618 0.900 0.891 0.578 0.615 0.105
Balanced Accuracy 0.757 0.817 0.920 0.723 0.789 0.537
Kappa 0.648 0.670 0.877 0.512 0.599 0.093
2-STEP Accuracy 0.833 0.860 0.970 0.889 0.978 0.969
Sensitivity 0.608 0.950 0.861 0.477 0.667 0.083
Specificity 0.923 0.683 0.984 0.972 0.992 0.986
F1-Score 0.633 0.900 0.867 0.590 0.720 0.091
Balanced Accuracy 0.766 0.817 0.923 0.724 0.829 0.535
Kappa 0.655 0.670 0.850 0.530 0.709 0.075
Up-XGB Accuracy 0.844 0.868 0.970 0.896 0.980 0.975
Sensitivity 0.658 0.958 0.875 0.467 0.741 0.250
Specificity 0.927 0.693 0.982 0.981 0.990 0.989
F1-Score 0.680 0.905 0.869 0.599 0.755 0.273
Balanced Accuracy 0.793 0.825 0.929 0.724 0.865 0.619
Kappa 0.678 0.688 0.852 0.544 0.744 0.260
Up-RF Accuracy 0.835 0.857 0.975 0.883 0.974 0.981
Sensitivity 0.613 0.972 0.875 0.383 0.667 0.167
Specificity 0.918 0.633 0.988 0.983 0.987 0.997
F1-Score 0.648 0.900 0.887 0.522 0.679 0.250
Balanced Accuracy 0.765 0.802 0.931 0.683 0.827 0.582
Kappa 0.646 0.654 0.873 0.466 0.665 0.243
SMOTE-SVM Accuracy 0.827 0.866 0.958 0.888 0.972 0.970
Sensitivity 0.650 0.939 0.861 0.477 0.556 0.417
Specificity 0.927 0.725 0.970 0.970 0.990 0.981
F1-Score 0.656 0.902 0.821 0.586 0.625 0.345
Balanced Accuracy 0.788 0.832 0.916 0.723 0.773 0.699
Kappa 0.651 0.690 0.797 0.525 0.611 0.330
Figure 4.9: Evaluation Metrics on Confirmation Set Models
Figure 4.10: Entropy vs. Predicted Probability in Confirmation Set
Figure 4.11: Gene Optimized Workflows Per-Class Metrics in Confirmation Set
Figure 4.12: Confusion Matrices for Confirmation Set Models

4.4 Gene Optimization

From Figure 4.9, we see that although Up-XGB has the highest overall evaluation metrics, SMOTE-SVM better predicts the rarest histotype, LGSOC. Thus we select both workflows for gene optimization in the confirmation set.

4.4.1 SMOTE-SVM

Figure 4.18: Gene Optimization for SMOTE-SVM Classifier using F1-Score

In the SMOTE-SVM classifier, the optimal number of genes is the fewest needed to achieve a mean F1-score above 0.7, reached at 22 added genes and highlighted in red in Figure 4.18. Hence the total number of genes used is n = 28 + 22 = 50.
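The selection rule, fewest added genes whose mean F1-score exceeds 0.7, can be sketched as below; the F1 values along the curve are illustrative placeholders, not the values in Figure 4.18.

```python
BASE_GENES = 28
F1_THRESHOLD = 0.7

# Hypothetical mean F1-score at each count of added genes (illustrative values).
mean_f1_by_added = {10: 0.62, 15: 0.66, 22: 0.71, 30: 0.72, 43: 0.73}

# Fewest added genes whose mean F1-score exceeds the threshold.
n_added = min(n for n, f1 in mean_f1_by_added.items() if f1 > F1_THRESHOLD)
n_total = BASE_GENES + n_added
print(n_added, n_total)  # 22 50
```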

The gene profile of the optimal set of genes used is displayed in Table 4.15. Base genes in the PrOTYPE and SPOT sets are annotated with green circles, and the added genes are annotated with yellow circles. The added genes are: EGFL6, IGJ, IGKC, TP53, DKK4, MUC5B, SLC3A1, MAP1LC3A, IGFBP1, CPNE8, SERPINA5, SCGB1D2, STC1, EPAS1, BRCA1, KGFLP2, SENP8, BCL2, PBX1, KLK7, C10orf116 and LIN28B. Unused genes are annotated with red crosses.

Table 4.15: Gene Profile of Optimal Set in SMOTE-SVM Workflow
Set Genes PrOTYPE SPOT Optimal Set Candidate Rank
Base COL11A1
CD74
CD2
TIMP3
LUM
CYTIP
COL3A1
THBS2
TCF7L1
HMGA2
FN1
POSTN
COL1A2
COL5A2
PDZK1IP1
FBN1
HIF1A
CXCL10
DUSP4
SOX17
MITF
CDKN3
BRCA2
CEACAM5
ANXA4
SERPINE1
CRABP2
DNAJC9
Candidates EGFL6 1
IGJ 2
IGKC 3
TP53 4
DKK4 5
MUC5B 6
SLC3A1 7
MAP1LC3A 8
IGFBP1 9
CPNE8 10
SERPINA5 11
SCGB1D2 12
STC1 13
EPAS1 14
BRCA1 15
KGFLP2 16
SENP8 17
BCL2 18
PBX1 19
KLK7 20
C10orf116 21
LIN28B 22
LGALS4 23
ADCYAP1R1 24
IL6 25
ZBED1 26
WT1 27
TFF1 28
GCNT3 29
HNF1B 30
TFF3 31
CYP4B1 32
CYP2C18 33
TSPAN8 34
FUT3 35
MET 36
ATP5G3 37
SEMA6A 38
GPR64 39
PAX8 40
C1orf173 41
GAD1 42
CAPN2 43

4.4.2 Up-XGB

Figure 4.19: Gene Optimization for Up-XGB Classifier using Balanced Accuracy

In the Up-XGB classifier, the optimal number of genes is the one that maximizes the mean balanced accuracy, reached at 12 added genes and highlighted in red in Figure 4.19. Hence the total number of genes used is n = 28 + 12 = 40.
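Unlike the threshold rule used for SMOTE-SVM, this criterion is a straight argmax, sketched below; the balanced-accuracy values are illustrative placeholders, not the values in Figure 4.19.

```python
BASE_GENES = 28

# Hypothetical mean balanced accuracy per count of added genes (illustrative).
mean_bacc_by_added = {5: 0.74, 12: 0.79, 20: 0.77, 43: 0.76}

# Number of added genes maximizing the mean balanced accuracy.
n_added = max(mean_bacc_by_added, key=mean_bacc_by_added.get)
n_total = BASE_GENES + n_added
print(n_added, n_total)  # 12 40
```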

The gene profile of the optimal set of genes used is displayed in Table 4.16. Base genes in the PrOTYPE and SPOT sets are annotated with green circles, and the added genes are annotated with yellow circles. The added genes are: HNF1B, TPX2, TFF1, CYP2C18, TFF3, WT1, GPR64, KLK7, SLC3A1, IGFBP1, GAD1 and LGALS4. Unused genes are annotated with red crosses.

Table 4.16: Gene Profile of Optimal Set in Up-XGB Workflow
Set Genes PrOTYPE SPOT Optimal Set Candidate Rank
Base COL11A1
CD74
CD2
TIMP3
LUM
CYTIP
COL3A1
THBS2
TCF7L1
HMGA2
FN1
POSTN
COL1A2
COL5A2
PDZK1IP1
FBN1
HIF1A
CXCL10
DUSP4
SOX17
MITF
CDKN3
BRCA2
CEACAM5
ANXA4
SERPINE1
CRABP2
DNAJC9
Candidates HNF1B 1
TPX2 2
TFF1 3
CYP2C18 4
TFF3 5
WT1 6
GPR64 7
KLK7 8
SLC3A1 9
IGFBP1 10
GAD1 11
LGALS4 12
MET 13
GCNT3 14
FUT3 15
C1orf173 16
EGFL6 17
MUC5B 18
C10orf116 19
DKK4 20
IL6 21
CAPN2 22
KGFLP2 23
BRCA1 24
CYP4B1 25
IGKC 26
PBX1 27
TSPAN8 28
SEMA6A 29
SENP8 30
PAX8 31
TP53 32
SERPINA5 33
ATP5G3 34
CPNE8 35
LIN28B 36
STC1 37
EPAS1 38
BCL2 39
MAP1LC3A 40
SCGB1D2 41
ADCYAP1R1 42
IGJ 43

4.4.3 Gene List Comparisons in Confirmation Set

We train the SMOTE-SVM and Up-XGB workflows using the base and optimal gene lists in the training set. The models are evaluated on the confirmation set. Overall and per-class results are shown in Table 4.17. The gene lists are:

  1. Base (n=28): among the overlapping genes, the base set from the PrOTYPE and SPOT lists

  2. Optimal (n=40,50): among the overlapping genes, the base set plus the additional number of genes that result in the optimal value for a selected evaluation metric, as assessed in Figure 4.18 and Figure 4.19

Table 4.17: Model Comparisons using Different Gene Lists in Confirmation Set
Histotypes
Method Metric Overall HGSOC CCOC ENOC MUOC LGSOC
SMOTE-SVM, Optimal Accuracy 0.833 0.872 0.963 0.885 0.974 0.974
Sensitivity 0.674 0.939 0.806 0.533 0.593 0.500
Specificity 0.931 0.743 0.982 0.955 0.990 0.983
F1-Score 0.682 0.907 0.829 0.606 0.653 0.414
Balanced Accuracy 0.802 0.841 0.894 0.744 0.791 0.741
Kappa 0.665 0.705 0.808 0.540 0.639 0.401
SMOTE-SVM, Base Accuracy 0.815 0.860 0.952 0.874 0.970 0.974
Sensitivity 0.672 0.903 0.889 0.514 0.556 0.500
Specificity 0.930 0.775 0.960 0.946 0.989 0.983
F1-Score 0.660 0.895 0.805 0.576 0.612 0.414
Balanced Accuracy 0.801 0.839 0.924 0.730 0.772 0.741
Kappa 0.641 0.685 0.778 0.503 0.597 0.401
Up-XGB, Optimal Accuracy 0.833 0.866 0.969 0.886 0.974 0.972
Sensitivity 0.626 0.955 0.875 0.430 0.704 0.167
Specificity 0.925 0.693 0.981 0.978 0.985 0.987
F1-Score 0.639 0.904 0.863 0.558 0.691 0.182
Balanced Accuracy 0.775 0.824 0.928 0.704 0.845 0.577
Kappa 0.656 0.684 0.845 0.499 0.677 0.168
Up-XGB, Base Accuracy 0.808 0.854 0.953 0.861 0.974 0.975
Sensitivity 0.593 0.950 0.875 0.308 0.667 0.167
Specificity 0.916 0.665 0.963 0.972 0.987 0.990
F1-Score 0.602 0.896 0.808 0.426 0.679 0.200
Balanced Accuracy 0.754 0.808 0.919 0.640 0.827 0.579
Kappa 0.602 0.653 0.781 0.360 0.665 0.188
Figure 4.20: Gene List Comparisons of Evaluation Metrics in Confirmation Set

4.5 Validation Set

Based on the results in Figure 4.20, we assess the SMOTE-SVM models trained on the training set with the optimal and all-overlap gene lists, evaluating them on the validation set.

4.5.1 Evaluation Metrics

Table 4.18: Evaluation Metrics on Training Set Models in Validation Set
Histotypes
Method Metric Overall HGSOC CCOC ENOC MUOC LGSOC
SMOTE-SVM, All Overlap Accuracy 0.884 0.908 0.959 0.950 0.970 0.981
Sensitivity 0.792 0.908 0.986 0.682 0.600 0.783
Specificity 0.961 0.908 0.956 0.979 0.976 0.986
F1-Score 0.706 0.939 0.786 0.727 0.400 0.679
Balanced Accuracy 0.876 0.908 0.971 0.830 0.788 0.884
Kappa 0.716 0.752 0.764 0.700 0.386 0.670
SMOTE-SVM, Optimal Accuracy 0.874 0.900 0.961 0.945 0.966 0.974
Sensitivity 0.747 0.907 0.942 0.659 0.533 0.696
Specificity 0.954 0.877 0.962 0.976 0.974 0.982
F1-Score 0.671 0.934 0.788 0.703 0.348 0.582
Balanced Accuracy 0.851 0.892 0.952 0.818 0.754 0.839
Kappa 0.689 0.729 0.767 0.673 0.333 0.569
SMOTE-SVM, Base Accuracy 0.834 0.865 0.953 0.921 0.960 0.971
Sensitivity 0.717 0.877 0.957 0.477 0.533 0.739
Specificity 0.937 0.821 0.953 0.969 0.967 0.977
F1-Score 0.617 0.910 0.759 0.542 0.308 0.567
Balanced Accuracy 0.827 0.849 0.955 0.723 0.750 0.858
Kappa 0.601 0.637 0.734 0.499 0.291 0.552
Figure 4.21: Evaluation Metrics on Validation Set Models

4.5.2 Confusion Matrices

Figure 4.22: Confusion Matrix for Training Set Models evaluated on Validation Data

4.5.3 ROC Curves

4.5.4 Calibration Plots

4.5.5 Summary

The SMOTE-SVM, All Overlap model has the highest overall F1-score. A summary of pertinent results is shown in Figure 4.29.

Figure 4.29: Validation Summary

4.5.6 Additional Explorations

Table 4.19: Clinicopathologic characteristics between correct and incorrect predictions of ENOC cases
Characteristic Predicted ENOC Correctly (N = 60)1 Missed ENOC (N = 28)1 p-value2
Age at diagnosis 53 (46, 64) 56 (51, 62) 0.6
Tumour grade   0.008
    low grade 43 (91%) 16 (64%)
    high grade 4 (8.5%) 9 (36%)
    Unknown 13 3
FIGO tumour stage   <0.001
    I 49 (83%) 13 (46%)
    II-IV 10 (17%) 15 (54%)
    Unknown 1 0
Race   >0.9
    white 53 (91%) 23 (92%)
    non-white 5 (8.6%) 2 (8.0%)
    Unknown 2 3
ARID1A   >0.9
    absent/subclonal 11 (18%) 5 (18%)
    present 49 (82%) 23 (82%)
WT1   0.4
    diffuse (>50%) 2 (3.3%) 3 (11%)
    focal (1-50%) 2 (3.3%) 1 (3.6%)
    negative 56 (93%) 24 (86%)
TP53   0.031
    mutated 1 (1.7%) 4 (15%)
    wild type 59 (98%) 23 (85%)
    Unknown 0 1
PR   <0.001
    diffuse (>50%) 39 (65%) 7 (25%)
    focal (1-50%) 10 (17%) 5 (18%)
    negative 11 (18%) 16 (57%)
P16   0.012
    abnormal block 1 (1.7%) 5 (18%)
    abnormal complete absence 16 (27%) 9 (32%)
    normal 43 (72%) 14 (50%)
NAPSIN A   >0.9
    negative 58 (97%) 27 (100%)
    positive 2 (3.3%) 0 (0%)
    Unknown 0 1
1 Median (Q1, Q3); n (%)
2 Wilcoxon rank sum test; Fisher’s exact test; Pearson’s Chi-squared test
Figure 4.30: Volcano Plots of Validation Set Predictions
Figure 4.31: Boxplot of Most Differentially Expressed Genes
Figure 4.32: Subtype Prediction Summary among Predicted HGSC Samples