Abstract
This paper presents the post-analysis of the first edition of the HEad and neCK TumOR (HECKTOR) challenge. The challenge was held as a satellite event of the 23rd International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2020, and was the first of its kind to focus on lesion segmentation in combined FDG-PET and CT image modalities. The challenge's task is the automatic segmentation of the Gross Tumor Volume (GTV) of Head and Neck (H&N) oropharyngeal primary tumors in FDG-PET/CT images. To this end, the participants were given a training set of 201 cases from four different centers, and their methods were tested on a held-out set of 53 cases from a fifth center. The methods were ranked according to the Dice Score Coefficient (DSC) averaged across all test cases. An additional inter-observer agreement study was organized to assess the difficulty of the task from a human perspective. 64 teams registered for the challenge, of which 10 provided a paper detailing their approach. The best method obtained an average DSC of 0.7591, a large improvement over our proposed baseline method (DSC of 0.6610) and over the inter-observer agreement (DSC of 0.61). The automatic methods successfully leveraged the wealth of metabolic and structural properties of the combined PET and CT modalities, significantly outperforming human inter-observer agreement.
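The ranking metric named in the abstract can be sketched in a few lines. This is a minimal illustration, not the challenge's evaluation code: it assumes binary masks flattened to lists of 0/1 and the usual definition DSC = 2|A∩B| / (|A| + |B|). The function names and the convention for two empty masks are assumptions made for the example.

```python
def dice_score(pred, truth):
    """DSC = 2*|pred AND truth| / (|pred| + |truth|) for flat 0/1 masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumption: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


def mean_dice(cases):
    """Average the per-case DSC over a list of (pred, truth) pairs."""
    scores = [dice_score(pred, truth) for pred, truth in cases]
    return sum(scores) / len(scores)
```

Averaging per-case scores, rather than pooling all voxels into one DSC, gives every test case the same weight regardless of tumor size.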
Conclusion
This paper presents the HECKTOR 2020 challenge on the segmentation of the primary tumor of oropharyngeal H&N cancer in FDG PET/CT. Detailed information was reported on the dataset, participation, and segmentation performance. Good participation, with 18 teams and 10 participants' publications, allowed us to compare state-of-the-art segmentation methods on this challenging task. The results are very satisfactory, with the winning team achieving an average DSC of 0.7591, which is superior to the inter-observer agreement (average DSC 0.6110). These results were obtained with a strict testing scheme, as the test cases were all from an unseen center. It is reasonable to expect better results if the proposed methods are fine-tuned on a few examples from this center. All participants used U-Net based deep learning models, most of them with a 3D architecture and standard pre-processing techniques.
Results
This section groups the results by challenge participation, algorithms used, segmentation performance, inter-observer agreement, ensembling "super-algorithm", simple PET thresholding, the relation between tumor size and segmentation performance, false positive analysis, and alternative ranking of the methods.
Figure
Fig. 1. Case examples of 2D sagittal slices of fused PET/CT images from each of the five centers. These images are obtained after resampling the PET image and the CT image to 1×1×1 mm³ with tricubic interpolation. The CT window in Hounsfield units is [−140, 260] and the PET window in SUV is .
Fig. 2. Examples of results of the winning algorithm (andrei.iantsen (Iantsen et al., 2021b)). The automatic segmentation results (green) and ground truth annotations (red) are displayed on 2D slices of PET (right) and CT (left) images. The reported DSC is computed on the entire image (see Eq. 1). (a), (b) Excellent segmentation results, detecting the GTVt of the primary oropharyngeal tumor localized at the base of the tongue and discarding the laterocervical lymph nodes despite high FDG uptake on PET. (c) Incorrect segmentation of the top volume at the level of the soft palate. (d) Incorrect segmentation of the smaller volume below the level of the hyoid bone.
Fig. 3. Box plots of the distribution of the 53 test DSCs for each participant, ordered by decreasing rank.
Fig. 4. Box plots of the distribution of DSCs across the 10 participants for each of the 53 patients in the test set.
Fig. 5. Segmentation performance of the PET thresholding-based method at different percentages of maximum SUV. Three results are reported: the automatic PET threshold, the semi-automatic PET threshold (indicating the location of the ground truth GTVt), and the semi-automatic PET and CT (for removing the air) threshold.
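The thresholding idea behind Fig. 5 can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function name `threshold_pet`, the flat-list voxel layout, computing the maximum SUV inside the indicated region, and the −500 HU air cutoff are all assumptions made for the example.

```python
def threshold_pet(suv, fraction, region=None, ct=None, air_hu=-500.0):
    """Return a 0/1 mask of voxels with SUV >= fraction * SUVmax.

    suv, region and ct are flat lists with one value per voxel.
    region (optional 0/1 list) restricts the search to an indicated
    area, mimicking a semi-automatic setting (assumption).
    ct (optional, in HU) removes voxels below air_hu; the -500 HU
    cutoff is an illustrative assumption, not a value from the paper.
    """
    inside = region if region is not None else [1] * len(suv)
    candidates = [s for s, r in zip(suv, inside) if r]
    if not candidates:
        return [0] * len(suv)
    cutoff = fraction * max(candidates)
    mask = [1 if r and s >= cutoff else 0 for s, r in zip(suv, inside)]
    if ct is not None:
        # Discard voxels that the CT identifies as air.
        mask = [m if hu >= air_hu else 0 for m, hu in zip(mask, ct)]
    return mask
```

Calling it with only `suv` and `fraction` corresponds to a fully automatic threshold; passing `region`, and then `ct` as well, approximates the two semi-automatic variants named in the caption.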
Fig. 6. Scatter plot of DSC vs. tumor volume (voxel count in the VOI) for the 10 participants. The corresponding Spearman correlation is 0.43.
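For readers unfamiliar with the statistic quoted in Fig. 6, the Spearman correlation is the Pearson correlation computed on ranks. The sketch below uses only the standard library and assigns average ranks to ties; the function names are illustrative, and it assumes neither input is constant (otherwise the denominator is zero).

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Because only ranks enter the computation, any monotonic relation between tumor volume and DSC yields a coefficient of 1, whether or not it is linear.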
Fig. 7. Average DSC of each team's algorithm as a function of tumor volume. This figure was generated by distributing the 53 test volumes into 4 bins of n = 13, 13, 13, and 14, and then computing the average DSC for each bin.
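The binning described for Fig. 7 can be reproduced with a short sketch. It assumes the cases are sorted by tumor volume and that any remainder goes to the last bins, which yields 13, 13, 13, and 14 for 53 cases; the caption does not say how the extra case was assigned, so that choice and the function names are assumptions.

```python
def bin_sizes(n_items, n_bins):
    """Near-equal bin sizes; any remainder goes to the last bins (assumption)."""
    base, extra = divmod(n_items, n_bins)
    return [base + (1 if i >= n_bins - extra else 0) for i in range(n_bins)]


def mean_dsc_per_volume_bin(volumes, dscs, n_bins=4):
    """Sort cases by volume, split them into bins, average the DSC per bin.

    Assumes there are at least n_bins cases, so no bin is empty.
    """
    order = sorted(range(len(volumes)), key=lambda i: volumes[i])
    means, start = [], 0
    for size in bin_sizes(len(order), n_bins):
        members = order[start:start + size]
        means.append(sum(dscs[i] for i in members) / size)
        start += size
    return means
```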
Fig. 8. Histogram of the Euclidean distance of the FP voxels to the closest ground truth GTVt voxel and GTVn voxel. We evaluate here the prediction of the first-ranked participant (andrei.iantsen) (a) and our baseline 3D PET/CT (b). For comparison, the False Discovery Rate (FDR), i.e. FP/(FP+TP), is 0.15 with 544,343 TPs in (a), and 0.37 with 621,413 TPs in (b).
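The two quantities in Fig. 8 can be sketched as below. This is a brute-force illustration under stated assumptions: masks are flat 0/1 lists for the FDR, voxels are `(z, y, x)` index tuples for the distances, the spacing defaults to 1 mm isotropic, and the ground truth is non-empty. It is not the authors' analysis code, and a distance transform would be preferable on full-size volumes.

```python
import math


def false_discovery_rate(pred, truth):
    """FDR = FP / (FP + TP) for flat 0/1 masks; 0.0 if nothing is predicted."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    return fp / (fp + tp) if fp + tp else 0.0


def fp_distances(pred_voxels, truth_voxels, spacing=(1.0, 1.0, 1.0)):
    """Distance (mm) from each false-positive voxel to the closest truth voxel.

    Voxels are (z, y, x) index tuples; truth_voxels must be non-empty.
    """
    truth = set(truth_voxels)

    def dist(a, b):
        return math.sqrt(sum(((i - j) * s) ** 2 for i, j, s in zip(a, b, spacing)))

    # Keep only predicted voxels that are not in the ground truth (the FPs).
    return [min(dist(v, t) for t in truth) for v in pred_voxels if v not in truth]
```

Running `fp_distances` once against the GTVt voxels and once against the GTVn voxels gives the two sets of distances that such a histogram compares.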
Fig. 9. Ranking robustness against changes in test data. The robustness is assessed by ranking 1000 bootstraps of the test set. The size of the circles is proportional to the number of times a team obtained the corresponding rank across the bootstraps. The dashed lines represent the 95% confidence intervals computed from the bootstrap analysis. The current ranking, i.e. the one used in this challenge, is obtained by averaging the DSCs across all test cases. The alternative ranking is computed by averaging the rankings of each team across the test cases.
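The bootstrap procedure in Fig. 9 can be sketched as follows. This is a minimal illustration assuming per-case DSCs stored as `{team: [dsc per case]}`, resampling of test cases with replacement, and ties broken by team order; the function names and the tie handling are assumptions, not the challenge's analysis code.

```python
import random


def rank_by_mean_dsc(scores):
    """Rank 1 = highest DSC averaged over cases (the challenge ranking)."""
    means = {team: sum(v) / len(v) for team, v in scores.items()}
    ordered = sorted(means, key=lambda team: -means[team])
    return {team: r for r, team in enumerate(ordered, start=1)}


def rank_by_mean_rank(scores):
    """Rank teams on every case, then order them by their summed rank."""
    teams = list(scores)
    n_cases = len(scores[teams[0]])
    totals = {team: 0 for team in teams}
    for c in range(n_cases):
        ordered = sorted(teams, key=lambda team: -scores[team][c])
        for r, team in enumerate(ordered, start=1):
            totals[team] += r
    ordered = sorted(teams, key=lambda team: totals[team])
    return {team: r for r, team in enumerate(ordered, start=1)}


def bootstrap_rank_counts(scores, ranking, n_boot=1000, seed=0):
    """Count how often each team obtains each rank over resampled test sets."""
    rng = random.Random(seed)
    teams = list(scores)
    n_cases = len(scores[teams[0]])
    counts = {team: {} for team in teams}
    for _ in range(n_boot):
        # Draw test cases with replacement, same size as the original set.
        picked = [rng.randrange(n_cases) for _ in range(n_cases)]
        resampled = {team: [scores[team][i] for i in picked] for team in teams}
        for team, r in ranking(resampled).items():
            counts[team][r] = counts[team].get(r, 0) + 1
    return counts
```

The two ranking functions can disagree: a team with one very high score and several low ones may lead on mean DSC while trailing on mean rank, which is the kind of sensitivity such a comparison is meant to expose.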
Table
Table 1. List of scanners used in the different centers.
Table 2. Summary of the algorithms in terms of main components used: 2D or 3D U-Net, resampling, preprocessing, training or testing data augmentation, loss used for optimization, an ensemble of multiple models for test prediction, and postprocessing of the results. We use the following abbreviations for the preprocessing: Clipping (C) and Standardization (S); if a step is applied to only one modality, that modality is specified in parentheses. For the image resampling, we specify whether the algorithms use Isotropic (I) or Anisotropic (A) resampling and Nearest Neighbor (NN), Linear (L), or Cubic (Cu) interpolation. We use the following abbreviations for the losses: Cross-Entropy (CE), Mumford-Shah (MS), and Mean Absolute Error (MAE). More details can be found in the respective participants' publications.
Table 3. Summary of the challenge results as of April 2021. The average DSC, precision, recall, SDSC, and median HD95 are reported for the baseline algorithms and every team (the best result of each team). The unit of the HD95 is [mm]. The participant names are reported when no team name was provided. The ranking is only provided for teams that presented their method in a paper submission. The post-challenge results are denoted by an asterisk ∗. Bold values represent the best scores for each metric, excluding post-challenge results since we do not have any information about their method.
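Table 3 mixes two kinds of aggregation: averages for the overlap metrics and a median for HD95. The sketch below illustrates that under stated assumptions: masks are flat 0/1 lists, per-case HD95 values are already computed elsewhere, and the function and key names are hypothetical.

```python
from statistics import mean, median


def precision_recall(pred, truth):
    """Voxel-wise precision TP/(TP+FP) and recall TP/(TP+FN) for 0/1 masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if t and not p)
    # Assumption: an undefined ratio (empty denominator) is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def summarize(per_case):
    """Aggregate per-case dicts with 'precision', 'recall' and 'hd95' keys."""
    return {
        "precision": mean(c["precision"] for c in per_case),
        "recall": mean(c["recall"] for c in per_case),
        # The median is less sensitive than the mean to a few very large distances.
        "hd95_mm": median(c["hd95"] for c in per_case),
    }
```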