Paper Title

Contrastive Learning of Medical Visual Representations from Paired Images and Text

Paper Authors

Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D. Manning, Curtis P. Langlotz

Paper Abstract

Learning visual representations of medical images (e.g., X-rays) is core to medical image understanding but its progress has been held back by the scarcity of human annotations. Existing work commonly relies on fine-tuning weights transferred from ImageNet pretraining, which is suboptimal due to drastically different image characteristics, or rule-based label extraction from the textual report data paired with medical images, which is inaccurate and hard to generalize. Meanwhile, several recent studies show exciting results from unsupervised contrastive learning from natural images, but we find these methods help little on medical images because of their high inter-class similarity. We propose ConVIRT, an alternative unsupervised strategy to learn medical visual representations by exploiting naturally occurring paired descriptive text. Our new method of pretraining medical image encoders with the paired text data via a bidirectional contrastive objective between the two modalities is domain-agnostic, and requires no additional expert input. We test ConVIRT by transferring our pretrained weights to 4 medical image classification tasks and 2 zero-shot retrieval tasks, and show that it leads to image representations that considerably outperform strong baselines in most settings. Notably, in all 4 classification tasks, our method requires only 10\% as much labeled training data as an ImageNet initialized counterpart to achieve better or comparable performance, demonstrating superior data efficiency.
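The bidirectional contrastive objective described above can be sketched as an InfoNCE-style loss computed in both directions: image-to-text and text-to-image, each treating the paired embedding as the positive and all other items in the batch as negatives. The sketch below is illustrative, not the authors' implementation; the function name, the temperature `tau`, and the direction weight `lam` are hypothetical parameters chosen for the example.

```python
import numpy as np

def bidirectional_contrastive_loss(img_emb, txt_emb, tau=0.1, lam=0.75):
    """Illustrative bidirectional contrastive loss over a batch of
    N paired image and text embeddings (each of shape (N, d))."""
    # L2-normalize so dot products become cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sim = img @ txt.T / tau  # (N, N) similarity logits, pairs on the diagonal

    def info_nce(logits):
        # cross-entropy with the matching pair (diagonal) as the target class
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # image-to-text direction plus text-to-image direction, weighted by lam
    return lam * info_nce(sim) + (1 - lam) * info_nce(sim.T)
```

Because the loss only asks each image embedding to land near its own report's text embedding relative to other reports in the batch, it needs no class labels, which is what makes the pretraining annotation-free.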
