Title

CXR-LT 2024: A MICCAI challenge on long-tailed, multi-label, and zero-shot disease classification from chest X-ray

Introduction

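The CXR-LT series is a community-driven initiative that aims to improve lung disease classification from chest X-rays (CXR), address the challenges of open long-tailed lung disease classification, and enhance the measurability of state-of-the-art techniques (Holste et al., 2022). The first event, CXR-LT 2023 (Holste et al., 2024), pursued these goals by providing high-quality CXR benchmark data for model development and by conducting detailed evaluations to identify persistent issues affecting lung disease classification performance. CXR-LT 2023 attracted broad interest: 59 teams made more than 500 unique submissions. The task setup and data have since served as the basis for numerous studies (Hong et al., 2024; Huijben et al., 2024; Park and Ryu, 2024; Li et al., 2024a).

As the second event in the series, CXR-LT 2024 keeps the overall design and goals of its predecessor while adding a new focus on zero-shot learning. This addition is intended to address limitations identified in CXR-LT 2023. The number of unique radiological findings has been estimated to exceed 4,500 (Budovec et al., 2014), which suggests that the true distribution of clinical findings on CXR is at least two orders of magnitude larger than what existing benchmark datasets can cover. Effectively tackling the "long tail" of radiological abnormalities therefore requires models that can generalize to new classes in a "zero-shot" manner.

This paper gives an overview of the CXR-LT 2024 challenge, which comprises two long-tailed tasks that drew broad participation and one newly introduced zero-shot task. Tasks 1 and 2 focus on long-tailed classification, with Task 1 using a large, noisy test set and Task 2 a small, manually annotated test set; Task 3 targets zero-shot generalization to unseen diseases. Each task follows the general framework established in CXR-LT 2023, providing participants with a large, automatically labeled training set of more than 250,000 CXR images and 40 binary disease labels. Final submissions are evaluated on a separate held-out test set prepared in the same way as the training set. In the sections that follow, we describe the task setup and evaluation criteria, detail the data curation process, and then present the results for each task. We then summarize key insights from the top solutions and offer a practical perspective. Finally, based on our findings, we propose future directions for few-shot and zero-shot disease classification, emphasizing the potential of multimodal foundation models.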

Abstract

The CXR-LT series is a community-driven initiative designed to enhance lung disease classification using chest X-rays (CXR). It tackles challenges in open long-tailed lung disease classification and enhances the measurability of state-of-the-art techniques. The first event, CXR-LT 2023, aimed to achieve these goals by providing high-quality benchmark CXR data for model development and conducting comprehensive evaluations to identify ongoing issues impacting lung disease classification performance. Building on the success of CXR-LT 2023, CXR-LT 2024 expands the dataset to 377,110 chest X-rays (CXRs) and 45 disease labels, including 19 new rare disease findings. It also introduces a new focus on zero-shot learning to address limitations identified in the previous event. Specifically, CXR-LT 2024 features three tasks: (i) long-tailed classification on a large, noisy test set, (ii) long-tailed classification on a manually annotated "gold standard" subset, and (iii) zero-shot generalization to five previously unseen disease findings. This paper provides an overview of CXR-LT 2024, detailing the data curation process and consolidating state-of-the-art solutions, including the use of multimodal models for rare disease detection, advanced generative approaches to handle noisy labels, and zero-shot learning strategies for unseen diseases. Additionally, the expanded dataset enhances disease coverage to better represent real-world clinical settings, offering a valuable resource for future research. By synthesizing the insights and innovations of participating teams, we aim to advance the development of clinically realistic and generalizable diagnostic models for chest radiography.

Method

2.1. Main tasks

The CXR-LT 2024 challenge includes three tasks: (1) long-tailed classification on a large, noisy test set, (2) long-tailed classification on a small, manually annotated test set, and (3) zero-shot generalization to previously unseen diseases. All can be formulated as multi-label classification problems.

Given the severe label imbalance in these tasks, the primary evaluation metric was mean average precision (mAP), specifically the "macro-averaged" AP across classes. While the area under the receiver operating characteristic curve (AUROC) is often used for similar datasets (Wang et al., 2017; Seyyed-Kalantari et al., 2020), it can be heavily inflated in the presence of class imbalance (Fernández et al., 2018; Davis and Goadrich, 2006). In contrast, mAP is more suitable for long-tailed, multi-label settings as it measures performance across decision thresholds without degrading under class imbalance (Rethmeier and Augenstein, 2022). For thoroughness, mean AUROC (mAUROC) and mean F1 score (mF1), with a threshold of 0.5, were computed as auxiliary classification metrics. We also calculated the mean expected calibration error (ECE) (Naeini et al., 2015) to quantify bias. To further enhance clinical interpretability, we also report per-class F1 scores, as well as macro- and micro-averaged F1 scores and false-negative rates for critical findings, in addition to the challenge's primary evaluation metric. We believe these additions provide a more granular understanding of model performance in practical settings.
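To make the metrics above concrete, the following is a minimal pure-Python sketch of macro-averaged AP, F1 at a 0.5 threshold, and a binned ECE. It is an illustration under stated assumptions, not the challenge's evaluation code: the function names are hypothetical, tied scores are ranked in input order, a score of exactly 0.5 counts as positive, and the ECE uses 10 equal-width bins comparing the observed positive rate with the mean predicted probability, which may differ from the organizers' exact formulation.

```python
from typing import List, Sequence


def average_precision(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AP for one class: mean precision measured at the rank of each positive.

    Assumption: tied scores are ranked in input order (no special tie handling).
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        return float("nan")  # AP is undefined without any positive example
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


def macro_map(Y_true: List[List[int]], Y_score: List[List[float]]) -> float:
    """Macro-averaged AP: unweighted mean of per-class AP.

    Rows are samples and columns are classes; classes with no positives are skipped.
    """
    n_classes = len(Y_true[0])
    aps = []
    for c in range(n_classes):
        col_true = [row[c] for row in Y_true]
        col_score = [row[c] for row in Y_score]
        if sum(col_true) > 0:
            aps.append(average_precision(col_true, col_score))
    return sum(aps) / len(aps)


def f1_at_threshold(y_true: Sequence[int], y_score: Sequence[float],
                    threshold: float = 0.5) -> float:
    """F1 for one class after binarizing scores (assumption: score >= threshold is positive)."""
    tp = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, y_score) if t == 0 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s < threshold)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def expected_calibration_error(y_true: Sequence[int], y_prob: Sequence[float],
                               n_bins: int = 10) -> float:
    """Binned ECE for one class (assumed variant with equal-width probability bins)."""
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for t, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((t, p))
    n = len(y_true)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        frac_pos = sum(t for t, _ in b) / len(b)   # observed positive rate in the bin
        mean_prob = sum(p for _, p in b) / len(b)  # mean predicted probability in the bin
        ece += (len(b) / n) * abs(frac_pos - mean_prob)
    return ece
```

Because the macro average weights every class equally, a rare finding contributes as much to mAP as a common one, which is the property that makes it a natural headline metric for a long-tailed label set. Mean F1 and mean ECE follow the same pattern: compute the per-class value, then average across classes.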

Conclusion

In summary, we organized CXR-LT 2024 to address the challenges of long-tailed, multi-label disease classification and zero-shot learning from chest X-rays. For this purpose, we have curated and released a large, long-tailed, multi-label CXR dataset containing 377,110 images, each labeled with one or more findings from a set of 45 disease categories. Additionally, we have provided a publicly available "gold standard" subset with human-annotated consensus labels to facilitate further evaluation. Finally, we outline a pathway to enhance the reliability, generalizability, and practicality of methods, with the ultimate goal of making them applicable in real-world clinical settings.

Results

3.1. Participation

The CXR-LT challenge received 96 team applications on CodaLab, of which 61 were approved after providing proof of credentialed access to MIMIC-CXR-JPG (Johnson et al., 2019b). During the Development Phase, 29 teams participated, submitting a total of 661, 349, and 364 unique submissions to the public leaderboard for Tasks 1, 2, and 3, respectively. In the final Test Phase, a total of 17 teams participated. We invited the top 9 teams to present at the CXR-LT 2024 challenge event at MICCAI 2024 and included them in this study. Since two teams excelled in both Tasks 1 and 2, this comprised the top 4 solutions in Tasks 1 and 2 as well as the top 3 solutions for the zero-shot Task 3. Table 3 summarizes the top-performing groups participating in one or more of these tasks and their system descriptions. Additional details, including all presentation slides, are available on GitHub, allowing readers to explore the specifics of all methods in greater depth.

Figure

Fig. 1. Long-tailed distribution of the CXR-LT 2024 challenge dataset. The dataset was formed by extending the MIMIC-CXR benchmark to include 12 new clinical findings (red) by parsing radiology reports.

Fig. 2. Representative chest X-rays from the challenge dataset, each demonstrating multiple findings. (a) Includes the Hilum label (new in CXR-LT 2024); (b) shows Fracture (introduced in CXR-LT 2023); and (c) displays original MIMIC-CXR labels (Cardiomegaly, Edema, Lung Opacity).

Fig. 3. Comparison of performance on CXR-LT Task 1 data (Section 2.2.1) and gold standard Task 2 data (Section 2.2.2).

Table

Table 1. Characteristics of the datasets used in the three tasks.

Table 2. Schedule of CXR-LT 2024.

Table 3. Overview of top-performing CXR-LT 2024 challenge solutions. ENS - ensemble; LRW - loss reweighting; VL - vision-language.

Table 4. mAP of the top-4 teams' final models on all 40 classes evaluated on the test set in Task 1. mAUROC, mF1, and mECE are also presented, with the numbers in parentheses indicating the rankings based on the corresponding evaluation metric.

Table 5. Long-tailed classification performance on "head", "medium", and "tail" classes by average mAP within each category. These categories were determined by the relative frequency of each class in the training set (denoted in parentheses). The rightmost column denotes the average of head, medium, and tail mAP. The best mAP in each column appears in bold.

Table 6. mAP of the top-4 teams' final models on all 26 classes evaluated on the gold standard test set in Task 2. mAUROC, mF1, and mECE are also presented, with the numbers in parentheses indicating the rankings based on the corresponding evaluation metric.

Table 7. Performance evaluation of the final models from the top 3 teams on the test set for all five unseen classes in Task 3. mAUROC, mF1, and mECE are also presented, with the numbers in parentheses indicating the rankings based on the corresponding evaluation metric.
