4  Results

We summarize cross-validated per-class performance in the training set. The F1-score, balanced accuracy, and kappa are the metrics of interest. Workflows are ordered by their mean estimates across the outer folds of the nested CV for each metric.
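
For reference, the ordering can be reproduced by averaging each metric over the outer folds. A minimal sketch in Python, assuming numpy arrays and a hypothetical fit_and_tune helper that performs the inner-loop hyperparameter tuning:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def mean_outer_fold_metric(workflow, X, y, metric, n_outer=5, seed=1):
    """Mean of a metric over nested-CV outer folds; hyperparameters are
    tuned on each outer-training split by fit_and_tune (hypothetical)."""
    outer = StratifiedKFold(n_splits=n_outer, shuffle=True, random_state=seed)
    fold_scores = []
    for train_idx, test_idx in outer.split(X, y):
        model = fit_and_tune(workflow, X[train_idx], y[train_idx])  # inner CV
        fold_scores.append(metric(y[test_idx], model.predict(X[test_idx])))
    return float(np.mean(fold_scores))
```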

4.1 Training Set

4.2 Rank Aggregation

Multi-step methods (prediction logic for both is sketched after the list):

  • sequential: the sequence of subsampling methods and algorithms used is:
    • HGSC vs. non-HGSC using upsampling and XGBoost
    • CCOC vs. non-CCOC using no subsampling and random forest
    • ENOC vs. non-ENOC using no subsampling and support vector machine
    • LGSC vs. MUC using SMOTE subsampling and random forest
  • two_step: the sequence of subsampling methods and algorithms used is:
    • HGSC vs. non-HGSC using upsampling and XGBoost
    • CCOC vs. ENOC vs. MUC vs. LGSC using SMOTE subsampling and support vector machine
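
The sketch below illustrates how both multi-step methods chain their classifiers. This is a simplification, not the exact implementation: m_hgsc, m_ccoc, m_enoc, m_lgsc_muc, and m_rest are hypothetical fitted models whose predict returns a histotype label for a single sample.

```python
def predict_sequential(x, m_hgsc, m_ccoc, m_enoc, m_lgsc_muc):
    """Sequential cascade: each binary step either assigns a histotype
    or passes the case on to the next classifier."""
    if m_hgsc.predict(x) == "HGSC":   # step 1: upsampling + XGBoost
        return "HGSC"
    if m_ccoc.predict(x) == "CCOC":   # step 2: no subsampling + random forest
        return "CCOC"
    if m_enoc.predict(x) == "ENOC":   # step 3: no subsampling + SVM
        return "ENOC"
    return m_lgsc_muc.predict(x)      # step 4: SMOTE + random forest, LGSC vs. MUC

def predict_two_step(x, m_hgsc, m_rest):
    """Two-step variant: HGSC first, then a single 4-class model."""
    if m_hgsc.predict(x) == "HGSC":   # step 1: upsampling + XGBoost
        return "HGSC"
    return m_rest.predict(x)          # step 2: SMOTE + SVM over CCOC/ENOC/MUC/LGSC
```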

We conduct rank aggregation using a two-stage nested approach (a simplified sketch follows the numbered steps):

  1. First, we rank-aggregate the per-class rankings separately for F1-score, balanced accuracy, and kappa.
  2. Then we take the aggregated lists from the three metrics and perform a final rank aggregation.
  3. The top workflows from the final rank aggregation are used for gene optimization in the confirmation set.
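
The sketch below illustrates the two-stage structure with a simple Borda-count (mean-rank) aggregator. This is illustrative only: the actual aggregation algorithm used in the analysis may differ, and class_rankings is a hypothetical mapping from metric name to its five per-class ranked lists of workflows.

```python
from collections import defaultdict

def borda_aggregate(ranked_lists):
    """Consensus ranking by mean Borda rank (lower mean rank = better)."""
    positions = defaultdict(list)
    for ranked in ranked_lists:
        for rank, workflow in enumerate(ranked):
            positions[workflow].append(rank)
    return sorted(positions, key=lambda w: sum(positions[w]) / len(positions[w]))

# Stage 1: within each metric, aggregate the per-class rankings.
per_metric = {m: borda_aggregate(class_rankings[m])  # class_rankings is hypothetical
              for m in ("f1", "balanced_accuracy", "kappa")}
# Stage 2: aggregate the three per-metric lists into the final ranking.
final_ranking = borda_aggregate(list(per_metric.values()))
```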

4.2.1 Across Classes

4.2.2 Across Metrics

Table 4.10: Rank Aggregation Comparison of Metrics Used
Rank  F1          Balanced Accuracy  Kappa
1     sequential  sequential         sequential
2     two_step    two_step           two_step
3     up_xgb      up_mr              up_xgb
4     up_rf       smote_mr           up_rf
5     smote_svm   up_xgb             smote_svm
6     hybrid_svm  smote_xgb          hybrid_rf
7     smote_xgb   hybrid_mr          smote_xgb
8     hybrid_rf   hybrid_xgb         hybrid_svm
9     up_svm      down_xgb           up_svm
10    hybrid_xgb  smote_svm          hybrid_xgb
11    none_rf     hybrid_rf          smote_rf
12    smote_mr    down_mr            smote_mr
13    smote_rf    hybrid_svm         none_svm
14    hybrid_mr   up_rf              none_rf
15    up_mr       smote_rf           hybrid_mr
16    down_mr     down_rf            up_mr
17    down_rf     none_svm           down_mr
18    down_svm    down_svm           down_rf
19    down_xgb    none_rf            down_svm
20    NA          up_svm             down_xgb
21    NA          none_mr            none_mr
22    NA          none_xgb           none_xgb
Table 4.11: Top 5 Workflows from Final Rank Aggregation
Rank  Workflow
1     sequential
2     two_step
3     up_xgb
4     up_rf
5     smote_svm

4.2.3 Top Workflows

We look at the per-class evaluation metrics of the top 5 workflows.

Table 4.12: Top Workflow Per-Class Evaluation Metrics
Histotypes
Metric Workflow HGSC CCOC ENOC LGSC MUC
Accuracy sequential 0.963 (0.956, 0.968) 0.917 (0.896, 0.958) 0.856 (0.774, 0.909) 0.951 (0.882, 1) 0.951 (0.882, 1)
2-STEP 0.963 (0.956, 0.968) 0.934 (0.896, 0.958) 0.839 (0.729, 0.896) 0.971 (0.938, 1) 0.892 (0.812, 0.96)
Up-XGB 0.96 (0.944, 0.976) 0.985 (0.968, 0.996) 0.963 (0.956, 0.976) 0.981 (0.976, 0.988) 0.977 (0.968, 0.988)
Up-RF 0.955 (0.92, 0.976) 0.983 (0.972, 0.992) 0.959 (0.948, 0.972) 0.982 (0.968, 0.988) 0.979 (0.964, 0.988)
SMOTE-SVM 0.95 (0.936, 0.968) 0.979 (0.968, 0.988) 0.954 (0.94, 0.964) 0.981 (0.972, 0.984) 0.978 (0.96, 0.984)
Sensitivity sequential 0.978 (0.966, 0.99) 0.814 (0.75, 0.882) 0.865 (0.812, 0.941) 0.92 (0.6, 1) 0.965 (0.909, 1)
2-STEP 0.978 (0.966, 0.99) 0.866 (0.824, 0.933) 0.735 (0.556, 0.824) 0.767 (0, 1) 0.842 (0.75, 0.923)
Up-XGB 0.981 (0.976, 0.986) 0.817 (0.571, 0.941) 0.679 (0.444, 0.818) 0.35 (0.25, 0.667) 0.8 (0.538, 1)
Up-RF 0.988 (0.979, 0.995) 0.793 (0.571, 0.882) 0.635 (0.444, 0.812) 0.183 (0, 0.333) 0.766 (0.462, 0.909)
SMOTE-SVM 0.966 (0.955, 0.977) 0.757 (0.571, 0.85) 0.681 (0.5, 0.864) 0.642 (0.5, 0.75) 0.748 (0.538, 0.909)
Specificity sequential 0.897 (0.875, 0.918) 0.969 (0.938, 1) 0.847 (0.733, 0.875) 0.965 (0.909, 1) 0.92 (0.6, 1)
2-STEP 0.897 (0.875, 0.918) 0.969 (0.935, 1) 0.893 (0.833, 0.935) 0.977 (0.953, 1) 0.908 (0.833, 0.973)
Up-XGB 0.875 (0.826, 0.919) 0.996 (0.992, 1) 0.981 (0.97, 0.991) 0.993 (0.984, 1) 0.986 (0.979, 0.992)
Up-RF 0.822 (0.738, 0.892) 0.996 (0.992, 1) 0.98 (0.966, 0.987) 0.997 (0.988, 1) 0.989 (0.983, 0.996)
SMOTE-SVM 0.88 (0.804, 0.919) 0.993 (0.987, 1) 0.971 (0.962, 0.975) 0.987 (0.976, 0.996) 0.989 (0.983, 0.996)
F1-Score sequential 0.977 (0.973, 0.98) 0.868 (0.828, 0.933) 0.86 (0.788, 0.914) 0.91 (0.75, 1) 0.966 (0.923, 1)
2-STEP 0.977 (0.973, 0.98) 0.899 (0.848, 0.933) 0.755 (0.606, 0.848) 0.733 (0, 1) 0.788 (0.667, 0.923)
Up-XGB 0.975 (0.964, 0.986) 0.862 (0.667, 0.97) 0.685 (0.471, 0.857) 0.368 (0.25, 0.571) 0.753 (0.636, 0.87)
Up-RF 0.972 (0.949, 0.986) 0.849 (0.696, 0.938) 0.652 (0.471, 0.788) 0.362 (0.286, 0.4) 0.757 (0.571, 0.87)
SMOTE-SVM 0.969 (0.96, 0.981) 0.811 (0.667, 0.919) 0.638 (0.5, 0.809) 0.524 (0.364, 0.714) 0.752 (0.583, 0.833)
Balanced Accuracy sequential 0.938 (0.928, 0.947) 0.892 (0.859, 0.938) 0.856 (0.773, 0.908) 0.943 (0.8, 1) 0.943 (0.8, 1)
2-STEP 0.938 (0.928, 0.947) 0.917 (0.88, 0.952) 0.814 (0.694, 0.88) 0.872 (0.479, 1) 0.875 (0.792, 0.948)
Up-XGB 0.928 (0.901, 0.952) 0.906 (0.781, 0.971) 0.83 (0.714, 0.905) 0.671 (0.621, 0.829) 0.893 (0.765, 0.99)
Up-RF 0.905 (0.858, 0.941) 0.894 (0.784, 0.941) 0.808 (0.714, 0.898) 0.59 (0.5, 0.665) 0.878 (0.727, 0.95)
SMOTE-SVM 0.923 (0.885, 0.948) 0.875 (0.781, 0.925) 0.826 (0.735, 0.919) 0.814 (0.746, 0.869) 0.869 (0.761, 0.948)
Kappa sequential 0.879 (0.862, 0.895) 0.808 (0.754, 0.903) 0.712 (0.547, 0.818) 0.877 (0.679, 1) 0.877 (0.679, 1)
2-STEP 0.879 (0.862, 0.895) 0.85 (0.769, 0.903) 0.635 (0.402, 0.769) 0.718 (-0.029, 1) 0.716 (0.538, 0.896)
Up-XGB 0.871 (0.822, 0.905) 0.854 (0.65, 0.968) 0.666 (0.452, 0.844) 0.36 (0.239, 0.566) 0.741 (0.62, 0.863)
Up-RF 0.85 (0.768, 0.903) 0.84 (0.682, 0.933) 0.63 (0.452, 0.773) 0.213 (0, 0.396) 0.746 (0.554, 0.863)
SMOTE-SVM 0.839 (0.783, 0.876) 0.8 (0.65, 0.913) 0.614 (0.48, 0.789) 0.516 (0.352, 0.706) 0.741 (0.563, 0.825)
Figure 4.7: Top 5 Workflow Per-Class Evaluation Metrics by Metric
Table 4.13: Top Workflow Per-Class Evaluation Metrics and Ranks
Workflow Rank HGSC CCOC ENOC LGSC MUC
F1-Score
sequential 1 0.977 0.868 0.860 0.910 0.966
2-STEP 2 0.977 0.899 0.755 0.733 0.788
Up-XGB 3 0.975 0.862 0.685 0.368 0.753
Up-RF 4 0.972 0.849 0.652 0.362 0.757
SMOTE-SVM 5 0.969 0.811 0.638 0.524 0.752
Balanced Accuracy
sequential 1 0.938 0.892 0.856 0.943 0.943
2-STEP 2 0.938 0.917 0.814 0.872 0.875
Up-XGB 5 0.928 0.906 0.830 0.671 0.893
SMOTE-SVM 10 0.923 0.875 0.826 0.814 0.869
Up-RF 14 0.905 0.894 0.808 0.590 0.878
Kappa
sequential 1 0.879 0.808 0.712 0.877 0.877
2-STEP 2 0.879 0.850 0.635 0.718 0.716
Up-XGB 3 0.871 0.854 0.666 0.360 0.741
Up-RF 4 0.850 0.840 0.630 0.213 0.746
SMOTE-SVM 5 0.839 0.800 0.614 0.516 0.741
Figure 4.8: Top 5 Workflow Per-Class Evaluation Metrics by Metric

Cases misclassified at an earlier step of the classifier sequence are excluded from subsequent steps within the training set CV folds. For example, a case wrongly assigned HGSC at step 1 never reaches the later classifiers, so the held-out predictions of later steps cover only the cases passed forward. Thus, we cannot piece together the held-out fold predictions from the sequential and two-step algorithms to obtain overall multiclass metrics.

4.3 Optimal Gene Sets

4.3.1 Sequential Algorithm

Figure 4.9: Gene Optimization for Sequential Classifier

In the sequential algorithm, all sequences have relatively flat average F1-scores across the number of genes added. One point of differentiation is sequence 3, where the maximum F1-score peaks at 5 genes added, so the optimal number of genes is n = 28 + 5 = 33. The added genes are CYP2C18, HNF1B, EGFL6, IGFBP1, and IGKC.
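
The optimization can be read as greedy forward addition over the ranked candidate list: starting from the 28 base genes, candidates are added one at a time in rank order and the cross-validated F1-score is recorded after each addition. A minimal sketch, where cv_f1 is a hypothetical helper that refits the workflow on a given gene set and returns its cross-validated F1-score:

```python
def optimize_gene_count(base_genes, ranked_candidates, cv_f1):
    """Add candidate genes in rank order, score after each addition,
    and keep the count that maximizes the cross-validated F1-score."""
    genes, scores = list(base_genes), []
    for gene in ranked_candidates:
        genes.append(gene)
        scores.append(cv_f1(genes))          # refit with one more gene
    best_k = scores.index(max(scores)) + 1   # e.g. best_k = 5 -> n = 28 + 5 = 33
    return best_k, list(base_genes) + list(ranked_candidates[:best_k])
```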

Table 4.14: Gene Profile of Optimal Set in Sequential Algorithm
Set Genes PrOTYPE SPOT Optimal Set Candidate Rank
Base COL11A1
CD74
CD2
TIMP3
LUM
CYTIP
COL3A1
THBS2
TCF7L1
HMGA2
FN1
POSTN
COL1A2
COL5A2
PDZK1IP1
FBN1
HIF1A
CXCL10
DUSP4
SOX17
MITF
CDKN3
BRCA2
CEACAM5
ANXA4
SERPINE1
CRABP2
DNAJC9
Candidates CYP2C18 1
HNF1B 2
EGFL6 3
IGFBP1 4
IGKC 5
LGALS4 6
WT1 7
TFF3 8
ZBED1 9
IL6 10
BRCA1 11
TFF1 12
GPR64 13
TSPAN8 14
SLC3A1 15
MUC5B 16
CPNE8 17
TPX2 18
PBX1 19
TP53 20
C1orf173 21
MET 22
SEMA6A 23
FUT3 24
GCNT3 25
KLK7 26
GAD1 27
PAX8 28
DKK4 29
IGJ 30
MAP1LC3A 31
EPAS1 32
SERPINA5 33
CYP4B1 34
STC1 35
ADCYAP1R1 36
C10orf116 37
LIN28B 38
SENP8 39
ATP5G3 40
CAPN2 41
KGFLP2 42
BCL2 43
SCGB1D2 44

4.3.2 2-STEP

Figure 4.10: Gene Optimization for 2-STEP Classifier
Table 4.15: Gene Profile of Optimal Set in 2-STEP Workflow
Set Genes PrOTYPE SPOT Optimal Set Candidate Rank
Base COL11A1
CD74
CD2
TIMP3
LUM
CYTIP
COL3A1
THBS2
TCF7L1
HMGA2
FN1
POSTN
COL1A2
COL5A2
PDZK1IP1
FBN1
HIF1A
CXCL10
DUSP4
SOX17
MITF
CDKN3
BRCA2
CEACAM5
ANXA4
SERPINE1
CRABP2
DNAJC9
Candidates CYP2C18 1
MUC5B 2
HNF1B 3
WT1 4
IGKC 5
EGFL6 6
SLC3A1 7
TPX2 8
TP53 9
MET 10
C1orf173 11
BRCA1 12
GAD1 13
TFF3 14
ZBED1 15
STC1 16
CPNE8 17
IGJ 18
CYP4B1 19
BCL2 20
GPR64 21
KLK7 22
GCNT3 23
KGFLP2 24
TSPAN8 25
ATP5G3 26
SENP8 27
CAPN2 28
LIN28B 29
SEMA6A 30
EPAS1 31
SERPINA5 32
MAP1LC3A 33
DKK4 34
ADCYAP1R1 35
FUT3 36
IGFBP1 37
SCGB1D2 38
PAX8 39
IL6 40
C10orf116 41
LGALS4 42
PBX1 43
TFF1 44

4.3.3 Up-XGB

Figure 4.11: Gene Optimization for Up-XGB Classifier

In the Up-XGB classifier, the mean F1-score peaks at 2 genes added, so the optimal number of genes is n = 28 + 2 = 30. The added genes are HNF1B and TPX2.

Table 4.16: Gene Profile of Optimal Set in Up-XGB Workflow
Set Genes PrOTYPE SPOT Optimal Set Candidate Rank
Base COL11A1
CD74
CD2
TIMP3
LUM
CYTIP
COL3A1
THBS2
TCF7L1
HMGA2
FN1
POSTN
COL1A2
COL5A2
PDZK1IP1
FBN1
HIF1A
CXCL10
DUSP4
SOX17
MITF
CDKN3
BRCA2
CEACAM5
ANXA4
SERPINE1
CRABP2
DNAJC9
Candidates HNF1B 1
TPX2 2
TFF1 3
CYP2C18 4
TFF3 5
WT1 6
GPR64 7
KLK7 8
SLC3A1 9
IGFBP1 10
GAD1 11
LGALS4 12
MET 13
GCNT3 14
FUT3 15
C1orf173 16
EGFL6 17
MUC5B 18
C10orf116 19
DKK4 20
IL6 21
CAPN2 22
KGFLP2 23
BRCA1 24
CYP4B1 25
IGKC 26
PBX1 27
TSPAN8 28
SEMA6A 29
SENP8 30
PAX8 31
TP53 32
SERPINA5 33
ATP5G3 34
CPNE8 35
LIN28B 36
STC1 37
EPAS1 38
BCL2 39
MAP1LC3A 40
SCGB1D2 41
ADCYAP1R1 42
IGJ 43
ZBED1 44

4.4 Test Set Performance

We now examine how our best methods perform in the confirmation and validation sets, using the class-specific F1-scores.

The top three methods are the sequential, 2-STEP, and Up-XGB classifiers. For each, we test two variants: one using the full set of genes and one using the optimal set of genes, giving six models in total.
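
As an illustration, the class-specific F1-scores are one-vs-rest F1-scores over the multiclass predictions; with scikit-learn (the labels below are placeholders, not study data):

```python
from sklearn.metrics import f1_score

classes = ["HGSC", "CCOC", "ENOC", "LGSC", "MUC"]
y_true = ["HGSC", "CCOC", "ENOC", "LGSC", "MUC", "HGSC"]  # placeholder labels
y_pred = ["HGSC", "CCOC", "ENOC", "MUC", "MUC", "HGSC"]   # placeholder predictions

# average=None returns the one-vs-rest F1-score per class, ordered by labels=.
per_class_f1 = f1_score(y_true, y_pred, labels=classes, average=None)
print(dict(zip(classes, per_class_f1.round(3))))
```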

4.4.1 Confirmation Set

Table 4.17: Evaluation Metrics on Confirmation Set Models
Histotypes
Method Metric Overall HGSC CCOC ENOC LGSC MUC
Sequential, Full Set Accuracy 0.830 0.860 0.977 0.882 0.974 0.969
Sensitivity 0.592 0.950 0.847 0.486 0.083 0.593
Specificity 0.923 0.683 0.993 0.961 0.990 0.985
F1-Score 0.618 0.900 0.891 0.578 0.105 0.615
Balanced Accuracy 0.757 0.817 0.920 0.723 0.537 0.789
Kappa 0.648 0.670 0.877 0.512 0.093 0.599
Sequential, Optimal Set Accuracy 0.818 0.850 0.967 0.872 0.978 0.967
Sensitivity 0.624 0.946 0.847 0.402 0.333 0.593
Specificity 0.918 0.665 0.982 0.966 0.990 0.984
F1-Score 0.645 0.893 0.853 0.512 0.364 0.604
Balanced Accuracy 0.771 0.805 0.915 0.684 0.662 0.788
Kappa 0.622 0.647 0.835 0.445 0.353 0.587
2-STEP, Full Set Accuracy 0.833 0.860 0.970 0.889 0.969 0.978
Sensitivity 0.608 0.950 0.861 0.477 0.083 0.667
Specificity 0.923 0.683 0.984 0.972 0.986 0.992
F1-Score 0.633 0.900 0.867 0.590 0.091 0.720
Balanced Accuracy 0.766 0.817 0.923 0.724 0.535 0.829
Kappa 0.655 0.670 0.850 0.530 0.075 0.709
2-STEP, Optimal Set Accuracy 0.807 0.843 0.966 0.860 0.972 0.974
Sensitivity 0.612 0.939 0.847 0.355 0.250 0.667
Specificity 0.914 0.656 0.981 0.961 0.986 0.987
F1-Score 0.624 0.887 0.847 0.458 0.250 0.679
Balanced Accuracy 0.763 0.797 0.914 0.658 0.618 0.827
Kappa 0.600 0.629 0.828 0.385 0.236 0.665
Up-XGB, Full Set Accuracy 0.844 0.868 0.970 0.896 0.975 0.980
Sensitivity 0.658 0.958 0.875 0.467 0.250 0.741
Specificity 0.927 0.693 0.982 0.981 0.989 0.990
F1-Score 0.680 0.905 0.869 0.599 0.273 0.755
Balanced Accuracy 0.793 0.825 0.929 0.724 0.619 0.865
Kappa 0.678 0.688 0.852 0.544 0.260 0.744
Up-XGB, Optimal Set Accuracy 0.816 0.855 0.952 0.877 0.970 0.978
Sensitivity 0.621 0.939 0.875 0.383 0.167 0.741
Specificity 0.921 0.693 0.961 0.976 0.986 0.989
F1-Score 0.624 0.895 0.803 0.509 0.174 0.741
Balanced Accuracy 0.771 0.816 0.918 0.679 0.576 0.865
Kappa 0.625 0.662 0.775 0.448 0.159 0.729
Figure 4.12: Entropy vs. Predicted Probability in Confirmation Set
Figure 4.13: Gene Optimized Workflows Per-Class Metrics in Confirmation Set
Figure 4.14: Confusion Matrices for Confirmation Set Models

4.4.2 Validation Set

Table 4.18: Evaluation Metrics on Validation Set Model, Up-XGB, Full Set
Histotypes
Metric Overall HGSC CCOC ENOC LGSC MUC
Accuracy 0.902 0.915 0.975 0.951 0.979 0.983
Sensitivity 0.797 0.937 1.000 0.602 0.533 0.913
Specificity 0.954 0.836 0.973 0.989 0.986 0.985
F1-Score 0.742 0.945 0.862 0.707 0.457 0.737
Balanced Accuracy 0.876 0.886 0.987 0.796 0.760 0.949
Kappa 0.743 0.756 0.849 0.681 0.447 0.729
Figure 4.21: Up-XGB, Full Set, Per-Class Metrics in Validation Set
Figure 4.22: Confusion Matrix for Validation Set Model
Figure 4.23: ROC Curves for Up-XGB, Full Set Model in Validation Set
Figure 4.24: Validation Summary
Table 4.19: Clinicopathologic characteristics of correctly and incorrectly predicted ENOC cases
Characteristic          Predicted ENOC Correctly   Missed ENOC   p-value²
                        N = 53¹                    N = 35¹
Age at diagnosis        52 (46, 63)                57 (51, 63)   0.2
Tumour grade                                                     0.002
    low grade           41 (93%)                   18 (64%)
    high grade          3 (6.8%)                   10 (36%)
    Unknown             9                          7
FIGO tumour stage                                                0.2
    I                   40 (77%)                   22 (63%)
    II-IV               12 (23%)                   13 (37%)
    Unknown             1                          0
Race                                                             >0.9
    white               48 (92%)                   28 (90%)
    non-white           4 (7.7%)                   3 (9.7%)
    Unknown             1                          4
ARID1A                                                           0.4
    absent/subclonal    11 (21%)                   5 (14%)
    present             42 (79%)                   30 (86%)
¹ Median (Q1, Q3); n (%)
² Wilcoxon rank sum test; Pearson’s Chi-squared test; Fisher’s exact test
Figure 4.25: Volcano Plots of Validation Set Predictions
Figure 4.26: Subtype Prediction Summary among Predicted HGSC Samples