Paper Title

On the diminishing return of labeling clinical reports

Authors

Jean-Baptiste Lamare, Tobi Olatunji, Li Yao

Abstract

Ample evidence suggests that better machine learning models may be steadily obtained by training on increasingly larger datasets for natural language processing (NLP) problems in non-medical domains. Whether the same holds true for medical NLP has thus far not been thoroughly investigated. This work shows that this is indeed not always the case. We report the somewhat counter-intuitive observation that performant medical NLP models may be obtained with a small amount of labeled data, quite the opposite of the common belief, most likely due to the domain specificity of the problem. We quantitatively show the effect of training data size on a fixed test set, composed of two of the largest public chest X-ray radiology report datasets, for the task of abnormality classification. The trained models not only make efficient use of the training data but also outperform the current state-of-the-art rule-based systems by a significant margin.
