4  Results

We summarize the cross-validated per-class performance of each workflow in the training set. The metrics of interest are accuracy, F1-score, and kappa. For each metric, workflows are ordered by their mean estimates across the outer folds of the nested CV.
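The ordering rule is simple: average a metric's outer-fold estimates per workflow and sort. A minimal sketch with hypothetical fold values (the fold numbers below are illustrative, not results from this study):

```python
from statistics import mean

# Hypothetical outer-fold F1 estimates for three workflows
# (names follow the subsampling_algorithm convention used in the tables).
outer_fold_f1 = {
    "smote_rf":   [0.85, 0.88, 0.86, 0.87, 0.84],
    "up_svm":     [0.80, 0.79, 0.83, 0.81, 0.82],
    "hybrid_xgb": [0.86, 0.84, 0.85, 0.88, 0.83],
}

# Order workflows by their mean estimate across outer folds, best first.
ranked = sorted(outer_fold_f1, key=lambda w: mean(outer_fold_f1[w]), reverse=True)
print(ranked)
```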

4.1 Training Set

4.2 Rank Aggregation

Multi-step methods:

  • sequential: the sequential algorithm applies the following sequence of subsampling methods and classification algorithms:
    • HGSC vs. non-HGSC using upsampling and random forest
    • CCOC vs. non-CCOC using SMOTE subsampling and XGBoost
    • ENOC vs. non-ENOC using hybrid subsampling and support vector machine
    • LGSC vs. MUC using hybrid subsampling and random forest
  • two_step: the two-step algorithm applies the following sequence of subsampling methods and classification algorithms:
    • HGSC vs. non-HGSC using SMOTE subsampling and random forest
    • CCOC vs. ENOC vs. MUC vs. LGSC using hybrid subsampling and support vector machine
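The control flow shared by these multi-step methods can be sketched as a cascade: each step is a one-vs-rest binary classifier, samples predicted positive exit with that label, and the remainder proceed to the next step. The `predict_sequential` helper and the string-prefix toy classifiers below are illustrative stand-ins, not the fitted models:

```python
def predict_sequential(samples, steps, fallback):
    """Cascade of one-vs-rest steps.

    steps: list of (label, is_positive) pairs applied in order; samples
    predicted positive are assigned that label and removed. The final
    step separates the last two classes, with `fallback` as its negative.
    """
    remaining = list(samples)
    labels = {}
    for label, is_positive in steps[:-1]:
        still = []
        for s in remaining:
            if is_positive(s):
                labels[s] = label
            else:
                still.append(s)
        remaining = still
    last_label, is_last = steps[-1]
    for s in remaining:
        labels[s] = last_label if is_last(s) else fallback
    return labels

# Toy classifiers keyed on sample-name prefixes, mirroring the
# HGSC -> CCOC -> ENOC -> (LGSC vs. MUC) sequence above.
steps = [
    ("HGSC", lambda s: s.startswith("h")),
    ("CCOC", lambda s: s.startswith("c")),
    ("ENOC", lambda s: s.startswith("e")),
    ("LGSC", lambda s: s.startswith("l")),
]
out = predict_sequential(["h1", "c1", "e1", "l1", "m1"], steps, fallback="MUC")
```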

We conduct rank aggregation using a two-stage nested approach:

  1. First, we rank-aggregate the per-class metrics for F1-score, balanced accuracy, and kappa.
  2. Then we take the aggregated lists from the three metrics and perform a final rank aggregation.
  3. The top workflows from the final rank aggregation are used for gene optimization in the confirmation set.
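The two-stage scheme can be illustrated with a simple Borda-style aggregation, used here only as a stand-in for the study's actual rank-aggregation method; the rankings below are toy inputs:

```python
def borda_aggregate(rank_lists):
    """Aggregate ranked lists: sum each item's positions, lower is better."""
    scores = {}
    for ranking in rank_lists:
        for pos, item in enumerate(ranking):
            scores[item] = scores.get(item, 0) + pos
    return sorted(scores, key=scores.get)

# Stage 1: aggregate per-class rankings within each metric (toy data).
f1_per_class    = [["seq", "smote_rf", "two_step"], ["seq", "two_step", "smote_rf"]]
bacc_per_class  = [["seq", "smote_rf", "two_step"], ["smote_rf", "seq", "two_step"]]
kappa_per_class = [["seq", "smote_rf", "two_step"], ["seq", "smote_rf", "two_step"]]

stage1 = [borda_aggregate(m) for m in (f1_per_class, bacc_per_class, kappa_per_class)]

# Stage 2: aggregate the three per-metric lists into a final ranking.
final = borda_aggregate(stage1)
print(final)
```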

4.2.1 Across Classes

4.2.2 Across Metrics

Table 4.10: Rank Aggregation Comparison of Metrics Used
Rank F1 Balanced Accuracy Kappa
1 sequential sequential sequential
2 two_step smote_xgb smote_rf
3 smote_rf hybrid_xgb smote_xgb
4 hybrid_xgb smote_mr hybrid_xgb
5 smote_xgb down_mr two_step
6 up_rf two_step up_svm
7 hybrid_svm up_xgb up_xgb
8 up_svm hybrid_mr up_rf
9 smote_svm smote_rf smote_svm
10 up_xgb up_mr hybrid_svm
11 hybrid_rf down_svm hybrid_rf
12 none_svm up_svm none_svm
13 smote_mr hybrid_rf smote_mr
14 hybrid_mr smote_svm hybrid_mr
15 up_mr hybrid_svm up_mr
16 down_mr up_rf down_mr
17 down_svm down_rf none_rf
18 down_rf none_svm down_svm
19 down_xgb none_rf down_rf
20 NA down_xgb down_xgb
21 NA none_mr none_mr
22 NA none_xgb none_xgb
Table 4.11: Top 5 Workflows from Final Rank Aggregation
Rank Workflow
1 sequential
2 smote_rf
3 smote_xgb
4 hybrid_xgb
5 two_step

4.2.3 Top Workflows

We look at the per-class evaluation metrics of the top 5 workflows.

Table 4.12: Top Workflow Per-Class Evaluation Metrics
Histotypes
Metric Workflow HGSC CCOC ENOC LGSC MUC
Accuracy sequential 0.951 (0.94, 0.964) 0.929 (0.875, 0.96) 0.857 (0.781, 0.935) 0.95 (0.867, 1) 0.95 (0.867, 1)
SMOTE-RF 0.955 (0.936, 0.98) 0.983 (0.972, 0.988) 0.959 (0.94, 0.972) 0.982 (0.972, 0.992) 0.976 (0.96, 0.984)
SMOTE-XGB 0.957 (0.936, 0.98) 0.98 (0.96, 0.992) 0.959 (0.937, 0.976) 0.985 (0.976, 0.992) 0.972 (0.968, 0.98)
hybrid-XGB 0.954 (0.936, 0.968) 0.982 (0.968, 0.992) 0.959 (0.925, 0.972) 0.983 (0.972, 0.992) 0.972 (0.964, 0.988)
2-STEP 0.949 (0.924, 0.964) 0.909 (0.826, 0.957) 0.848 (0.783, 0.936) 0.957 (0.935, 0.978) 0.931 (0.891, 0.957)
Sensitivity sequential 0.975 (0.961, 0.99) 0.863 (0.75, 0.941) 0.817 (0.75, 0.938) 0.96 (0.8, 1) 0.947 (0.818, 1)
SMOTE-RF 0.979 (0.969, 0.986) 0.833 (0.6, 0.944) 0.646 (0.556, 0.75) 0.312 (0, 0.667) 0.788 (0.462, 0.909)
SMOTE-XGB 0.965 (0.951, 0.986) 0.846 (0.6, 0.944) 0.63 (0.444, 0.818) 0.655 (0.25, 0.857) 0.856 (0.615, 1)
hybrid-XGB 0.964 (0.956, 0.972) 0.846 (0.667, 0.944) 0.646 (0.333, 0.773) 0.655 (0.25, 0.857) 0.839 (0.538, 1)
2-STEP 0.967 (0.95, 0.98) 0.839 (0.688, 0.933) 0.754 (0.583, 1) 0.883 (0.667, 1) 0.856 (0.786, 1)
Specificity sequential 0.851 (0.833, 0.875) 0.963 (0.938, 1) 0.899 (0.812, 0.938) 0.947 (0.818, 1) 0.96 (0.8, 1)
SMOTE-RF 0.861 (0.776, 0.946) 0.993 (0.987, 0.996) 0.98 (0.966, 0.988) 0.994 (0.984, 1) 0.986 (0.979, 0.996)
SMOTE-XGB 0.921 (0.857, 0.965) 0.989 (0.983, 1) 0.981 (0.97, 0.991) 0.99 (0.98, 0.996) 0.978 (0.971, 0.988)
hybrid-XGB 0.91 (0.837, 0.947) 0.991 (0.987, 0.996) 0.981 (0.97, 0.991) 0.989 (0.976, 0.996) 0.979 (0.967, 0.992)
2-STEP 0.871 (0.766, 0.957) 0.947 (0.9, 1) 0.884 (0.812, 1) 0.966 (0.921, 1) 0.96 (0.917, 1)
F1-Score sequential 0.97 (0.963, 0.978) 0.891 (0.8, 0.941) 0.852 (0.774, 0.938) 0.92 (0.8, 1) 0.963 (0.9, 1)
SMOTE-RF 0.972 (0.959, 0.988) 0.858 (0.72, 0.919) 0.663 (0.571, 0.762) 0.421 (0.222, 0.667) 0.742 (0.545, 0.833)
SMOTE-XGB 0.973 (0.96, 0.988) 0.84 (0.643, 0.941) 0.654 (0.5, 0.857) 0.576 (0.286, 0.857) 0.733 (0.667, 0.783)
hybrid-XGB 0.971 (0.96, 0.981) 0.852 (0.714, 0.944) 0.659 (0.387, 0.829) 0.55 (0.333, 0.857) 0.729 (0.609, 0.87)
2-STEP 0.969 (0.954, 0.978) 0.865 (0.733, 0.941) 0.738 (0.615, 0.897) 0.782 (0.667, 0.842) 0.864 (0.762, 0.917)
Balanced Accuracy sequential 0.913 (0.899, 0.93) 0.913 (0.844, 0.955) 0.858 (0.781, 0.935) 0.953 (0.9, 1) 0.953 (0.9, 1)
SMOTE-RF 0.92 (0.878, 0.966) 0.913 (0.798, 0.968) 0.813 (0.772, 0.864) 0.653 (0.5, 0.831) 0.887 (0.724, 0.946)
SMOTE-XGB 0.943 (0.906, 0.97) 0.917 (0.792, 0.964) 0.806 (0.709, 0.905) 0.823 (0.621, 0.927) 0.917 (0.801, 0.986)
hybrid-XGB 0.937 (0.899, 0.959) 0.919 (0.827, 0.97) 0.814 (0.652, 0.882) 0.822 (0.623, 0.927) 0.909 (0.763, 0.988)
2-STEP 0.919 (0.863, 0.954) 0.893 (0.794, 0.951) 0.819 (0.745, 0.956) 0.924 (0.833, 0.989) 0.908 (0.858, 0.971)
Kappa sequential 0.842 (0.805, 0.881) 0.839 (0.71, 0.911) 0.715 (0.562, 0.871) 0.884 (0.706, 1) 0.884 (0.706, 1)
SMOTE-RF 0.856 (0.799, 0.922) 0.849 (0.706, 0.912) 0.641 (0.539, 0.74) 0.33 (0, 0.663) 0.73 (0.525, 0.825)
SMOTE-XGB 0.866 (0.8, 0.922) 0.83 (0.622, 0.937) 0.632 (0.467, 0.844) 0.569 (0.276, 0.853) 0.719 (0.65, 0.772)
hybrid-XGB 0.854 (0.797, 0.9) 0.842 (0.697, 0.94) 0.638 (0.348, 0.814) 0.543 (0.326, 0.853) 0.715 (0.59, 0.863)
2-STEP 0.833 (0.745, 0.883) 0.796 (0.605, 0.908) 0.632 (0.465, 0.851) 0.758 (0.647, 0.802) 0.818 (0.692, 0.888)
Figure 4.7: Top 5 Workflow Per-Class Evaluation Metrics by Metric
Table 4.13: Top Workflow Per-Class Evaluation Metrics and Ranks
Workflow Rank HGSC CCOC ENOC LGSC MUC
F1-Score
sequential 1 0.970 0.891 0.852 0.920 0.963
2-STEP 2 0.969 0.865 0.738 0.782 0.864
SMOTE-RF 3 0.972 0.858 0.663 0.421 0.742
hybrid-XGB 4 0.971 0.852 0.659 0.550 0.729
SMOTE-XGB 5 0.973 0.840 0.654 0.576 0.733
Balanced Accuracy
sequential 1 0.913 0.913 0.858 0.953 0.953
SMOTE-XGB 2 0.943 0.917 0.806 0.823 0.917
hybrid-XGB 3 0.937 0.919 0.814 0.822 0.909
2-STEP 6 0.919 0.893 0.819 0.924 0.908
SMOTE-RF 9 0.920 0.913 0.813 0.653 0.887
Kappa
sequential 1 0.842 0.839 0.715 0.884 0.884
SMOTE-RF 2 0.856 0.849 0.641 0.330 0.730
SMOTE-XGB 3 0.866 0.830 0.632 0.569 0.719
hybrid-XGB 4 0.854 0.842 0.638 0.543 0.715
2-STEP 5 0.833 0.796 0.632 0.758 0.818
Figure 4.8: Top 5 Workflow Per-Class Evaluation Metrics by Metric

Misclassified cases from a previous step of the sequence of classifiers are not included in subsequent steps of the training set CV folds. Thus, we cannot piece together the test set predictions from the sequential and two-step algorithms to obtain overall metrics.

4.3 Optimal Gene Sets

4.3.1 Sequential Algorithm

Figure 4.9: Gene Optimization for Sequential Classifier

In the sequential algorithm, all sequences have relatively flat average F1-scores across the number of genes added. However, in sequence 4 the F1-score peaks at 9 added genes, so the optimal number of genes is n = 28 + 9 = 37. The added genes are: CYP2C18, HNF1B, EGFL6, TFF3, IL6, CYP4B1, LGALS4, SLC3A1, and IGFBP1.
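The selection rule behind these curves can be sketched as follows: candidate genes are added one at a time in rank order, and we keep the count at which the mean F1-score peaks. The curve values below are hypothetical, chosen only to reproduce a peak at 9 added genes:

```python
BASE_GENES = 28  # size of the fixed base panel

def optimal_gene_count(mean_f1_by_added):
    """mean_f1_by_added[k] = mean F1 after adding the top-(k+1) candidates.

    Returns (number of genes added, total panel size)."""
    best_added = max(range(len(mean_f1_by_added)),
                     key=lambda i: mean_f1_by_added[i]) + 1
    return best_added, BASE_GENES + best_added

# Hypothetical mean-F1 curve with its maximum at 9 added genes.
curve = [0.84, 0.85, 0.85, 0.86, 0.85, 0.86, 0.86, 0.87, 0.88, 0.87, 0.86]
added, total = optimal_gene_count(curve)
print(added, total)  # 9 added genes -> panel of 37
```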

Table 4.14: Gene Profile of Optimal Set in Sequential Algorithm
Set Genes PrOTYPE SPOT Optimal Set Candidate Rank
Base COL11A1
CD74
CD2
TIMP3
LUM
CYTIP
COL3A1
THBS2
TCF7L1
HMGA2
FN1
POSTN
COL1A2
COL5A2
PDZK1IP1
FBN1
HIF1A
CXCL10
DUSP4
SOX17
MITF
CDKN3
BRCA2
CEACAM5
ANXA4
SERPINE1
CRABP2
DNAJC9
Candidates CYP2C18 1
HNF1B 2
EGFL6 3
TFF3 4
IL6 5
CYP4B1 6
LGALS4 7
SLC3A1 8
IGFBP1 9
WT1 10
MUC5B 11
TFF1 12
GPR64 13
TP53 14
BRCA1 15
MET 16
FUT3 17
CPNE8 18
TPX2 19
PBX1 20
EPAS1 21
SCGB1D2 22
KLK7 23
SEMA6A 24
DKK4 25
CAPN2 26
GAD1 27
STC1 28
IGJ 29
GCNT3 30
TSPAN8 31
SERPINA5 32
C1orf173 33
PAX8 34
LIN28B 35
ZBED1 36
ATP5G3 37
BCL2 38
KGFLP2 39
IGKC 40
SENP8 41
MAP1LC3A 42
C10orf116 43
ADCYAP1R1 44

4.3.2 SMOTE-RF

Figure 4.10: Gene Optimization for SMOTE-RF Classifier

In the SMOTE-RF classifier, the mean F1-score is highest at 16 added genes, so the optimal number of genes is n = 28 + 16 = 44. The added genes are: HNF1B, TFF3, TPX2, SLC3A1, CYP2C18, TFF1, WT1, KLK7, IGFBP1, LGALS4, GAD1, GCNT3, C1orf173, CAPN2, FUT3, and DKK4.
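The SMOTE half of this workflow rests on one idea: a synthetic minority sample is an interpolation between a minority point and one of its nearest minority-class neighbours. A minimal toy sketch of that idea (not the full library implementation, which uses k nearest neighbours and feature scaling):

```python
import random

def smote_one(minority, seed=0):
    """Generate one synthetic point from a list of minority-class vectors."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    # Nearest neighbour among the other minority points (squared Euclidean).
    others = [p for p in minority if p is not base]
    nn = min(others, key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    # Interpolate a random fraction of the way from base toward its neighbour.
    gap = rng.random()
    return tuple(a + gap * (b - a) for a, b in zip(base, nn))

# Toy 2-D minority class; the synthetic point lies on a segment
# between two of these points.
pts = [(0.0, 0.0), (1.0, 0.0), (4.0, 4.0)]
synth = smote_one(pts, seed=1)
```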

Table 4.15: Gene Profile of Optimal Set in SMOTE-RF Workflow
Set Genes PrOTYPE SPOT Optimal Set Candidate Rank
Base COL11A1
CD74
CD2
TIMP3
LUM
CYTIP
COL3A1
THBS2
TCF7L1
HMGA2
FN1
POSTN
COL1A2
COL5A2
PDZK1IP1
FBN1
HIF1A
CXCL10
DUSP4
SOX17
MITF
CDKN3
BRCA2
CEACAM5
ANXA4
SERPINE1
CRABP2
DNAJC9
Candidates HNF1B 1
TFF3 2
TPX2 3
SLC3A1 4
CYP2C18 5
TFF1 6
WT1 7
KLK7 8
IGFBP1 9
LGALS4 10
GAD1 11
GCNT3 12
C1orf173 13
CAPN2 14
FUT3 15
DKK4 16
C10orf116 17
MUC5B 18
MET 19
GPR64 20
IGKC 21
PAX8 22
ATP5G3 23
CPNE8 24
PBX1 25
IL6 26
TP53 27
KGFLP2 28
EGFL6 29
SEMA6A 30
CYP4B1 31
STC1 32
EPAS1 33
BRCA1 34
LIN28B 35
TSPAN8 36
SERPINA5 37
SCGB1D2 38
BCL2 39
ZBED1 40
ADCYAP1R1 41
IGJ 42
SENP8 43
MAP1LC3A 44

4.3.3 2-STEP

Figure 4.11: Gene Optimization for 2-STEP Classifier
Table 4.16: Gene Profile of Optimal Set in 2-STEP Workflow
Set Genes PrOTYPE SPOT Optimal Set Candidate Rank
Base COL11A1
CD74
CD2
TIMP3
LUM
CYTIP
COL3A1
THBS2
TCF7L1
HMGA2
FN1
POSTN
COL1A2
COL5A2
PDZK1IP1
FBN1
HIF1A
CXCL10
DUSP4
SOX17
MITF
CDKN3
BRCA2
CEACAM5
ANXA4
SERPINE1
CRABP2
DNAJC9
Candidates CYP2C18 1
MUC5B 2
HNF1B 3
IL6 4
SLC3A1 5
EGFL6 6
WT1 7
ZBED1 8
MET 9
SENP8 10
KLK7 11
TFF3 12
CPNE8 13
STC1 14
GAD1 15
LIN28B 16
IGJ 17
DKK4 18
EPAS1 19
GCNT3 20
SCGB1D2 21
CYP4B1 22
C1orf173 23
IGFBP1 24
TPX2 25
SEMA6A 26
ATP5G3 27
SERPINA5 28
FUT3 29
C10orf116 30
KGFLP2 31
ADCYAP1R1 32
TP53 33
PBX1 34
GPR64 35
LGALS4 36
CAPN2 37
BCL2 38
MAP1LC3A 39
TSPAN8 40
TFF1 41
PAX8 42
BRCA1 43
IGKC 44

4.4 Test Set Performance

We now assess how our best methods perform in the confirmation and validation sets, using the class-specific F1-scores.

The top two methods are the sequential and SMOTE-RF classifiers. Evaluating each with either the full set of genes or the optimal set of genes yields two additional methods to test.
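The class-specific F1-score used throughout these tables treats each histotype one-vs-rest and computes F1 from the resulting confusion counts. A self-contained sketch with toy labels:

```python
def per_class_f1(y_true, y_pred, label):
    """One-vs-rest F1 for a single class label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Toy labels: 3 HGSC cases, one of which is misclassified as CCOC.
y_true = ["HGSC", "HGSC", "CCOC", "ENOC", "HGSC"]
y_pred = ["HGSC", "CCOC", "CCOC", "ENOC", "HGSC"]
f1_hgsc = per_class_f1(y_true, y_pred, "HGSC")
print(round(f1_hgsc, 3))  # 0.8
```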

4.4.1 Confirmation Set

Table 4.17: Evaluation Metrics on Confirmation Set Models
Histotypes
Method Metric Overall HGSC CCOC ENOC LGSC MUC
Sequential, Full Set Accuracy 0.829 0.861 0.964 0.888 0.975 0.969
Sensitivity 0.591 0.950 0.861 0.467 0.083 0.593
Specificity 0.923 0.688 0.977 0.972 0.992 0.985
F1-Score 0.610 0.901 0.844 0.581 0.111 0.615
Balanced Accuracy 0.757 0.819 0.919 0.720 0.538 0.789
Kappa 0.646 0.674 0.823 0.521 0.100 0.599
Sequential, Optimal Set Accuracy 0.816 0.852 0.963 0.875 0.970 0.972
Sensitivity 0.554 0.955 0.875 0.383 0.000 0.556
Specificity 0.916 0.651 0.974 0.974 0.989 0.990
F1-Score 0.573 0.895 0.840 0.506 0.000 0.625
Balanced Accuracy 0.735 0.803 0.924 0.679 0.494 0.773
Kappa 0.614 0.648 0.819 0.443 -0.014 0.611
SMOTE-RF, Full Set Accuracy 0.841 0.866 0.972 0.897 0.975 0.972
Sensitivity 0.652 0.953 0.875 0.477 0.250 0.704
Specificity 0.927 0.697 0.984 0.981 0.989 0.984
F1-Score 0.667 0.904 0.875 0.607 0.273 0.679
Balanced Accuracy 0.789 0.825 0.930 0.729 0.619 0.844
Kappa 0.673 0.685 0.859 0.553 0.260 0.664
SMOTE-RF, Optimal Set Accuracy 0.840 0.869 0.966 0.900 0.980 0.964
Sensitivity 0.669 0.943 0.861 0.505 0.333 0.704
Specificity 0.930 0.725 0.979 0.979 0.992 0.976
F1-Score 0.677 0.905 0.849 0.628 0.381 0.623
Balanced Accuracy 0.800 0.834 0.920 0.742 0.663 0.840
Kappa 0.676 0.696 0.830 0.574 0.371 0.604
2-STEP, Full Set Accuracy 0.835 0.861 0.966 0.891 0.972 0.980
Sensitivity 0.651 0.941 0.875 0.486 0.250 0.704
Specificity 0.927 0.706 0.977 0.972 0.986 0.992
F1-Score 0.669 0.900 0.851 0.598 0.250 0.745
Balanced Accuracy 0.789 0.824 0.926 0.729 0.618 0.848
Kappa 0.664 0.677 0.832 0.538 0.236 0.735
2-STEP, Optimal Set Accuracy 0.843 0.866 0.967 0.900 0.975 0.977
Sensitivity 0.639 0.953 0.875 0.495 0.167 0.704
Specificity 0.927 0.697 0.979 0.981 0.990 0.989
F1-Score 0.660 0.904 0.857 0.624 0.200 0.717
Balanced Accuracy 0.783 0.825 0.927 0.738 0.579 0.846
Kappa 0.676 0.685 0.839 0.570 0.188 0.705
Figure 4.12: Entropy vs. Predicted Probability in Confirmation Set
Figure 4.13: Gene Optimized Workflows Per-Class Metrics in Confirmation Set
Figure 4.14: Confusion Matrices for Confirmation Set Models
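The prediction entropy plotted in Figure 4.12 can be computed from each sample's predicted class-probability vector; a sketch using Shannon entropy in nats (whether the study used nats or bits is an assumption here):

```python
import math

def prediction_entropy(probs):
    """Shannon entropy (nats) of a predicted-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A confident prediction has low entropy; a uniform one over the five
# histotypes is maximal (ln 5).
confident = prediction_entropy([0.96, 0.01, 0.01, 0.01, 0.01])
uniform = prediction_entropy([0.2] * 5)
```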

4.4.2 Validation Set

Table 4.18: Evaluation Metrics on Validation Set Model, SMOTE-RF, Optimal Set
Histotypes
Metric Overall HGSC CCOC ENOC LGSC MUC
Accuracy 0.889 0.909 0.971 0.949 0.973 0.977
Sensitivity 0.783 0.911 0.986 0.727 0.467 0.826
Specificity 0.961 0.903 0.970 0.973 0.982 0.980
F1-Score 0.706 0.940 0.840 0.736 0.368 0.644
Balanced Accuracy 0.872 0.907 0.978 0.850 0.724 0.903
Kappa 0.728 0.754 0.824 0.707 0.355 0.633
Figure 4.21: SMOTE-RF Per-Class Metrics in Validation Set
Figure 4.22: Confusion Matrix for Validation Set Model
Figure 4.23: ROC Curves for SMOTE-RF, Optimal Set Model in Validation Set
Table 4.19: Clinicopathological characteristics between correct and incorrect predictions of ENOC cases
Characteristic Predicted ENOC Correctly (N = 64)¹ Missed ENOC (N = 24)¹ p-value²
Age at diagnosis 53 (46, 62) 59 (53, 64) 0.040
Tumour grade <0.001
    low grade 49 (94%) 10 (50%)
    high grade 3 (5.8%) 10 (50%)
    Unknown 12 4
FIGO tumour stage 0.030
    I 49 (78%) 13 (54%)
    II-IV 14 (22%) 11 (46%)
    Unknown 1 0
Race 0.7
    white 58 (92%) 18 (90%)
    non-white 5 (7.9%) 2 (10%)
    Unknown 1 4
ARID1A 0.8
    absent/subclonal 11 (17%) 5 (21%)
    present 53 (83%) 19 (79%)
¹ Median (Q1, Q3); n (%)
² Wilcoxon rank sum test; Fisher’s exact test; Pearson’s Chi-squared test
Figure 4.24: Volcano Plots of Validation Set Predictions
Figure 4.25: Subtype Prediction Summary among Predicted HGSC Samples