Paper Title
Recovering Patient Journeys: A Corpus of Biomedical Entities and Relations on Twitter (BEAR)
Authors
Abstract
Text mining and information extraction for the medical domain have focused on scientific text generated by researchers. However, researchers' direct access to individual patient experiences or patient-doctor interactions can be limited. Information provided on social media, e.g., by patients and their relatives, complements the knowledge in scientific text. It reflects the patient's journey and their subjective perspective on the process of developing symptoms, being diagnosed and offered a treatment, being cured or learning to live with a medical condition. The value of this type of data is therefore twofold: First, it offers direct access to people's perspectives. Second, it might cover information that is not available elsewhere, including self-treatment or self-diagnoses. Named entity recognition and relation extraction are methods to structure information that is available in unstructured text. However, existing medical social media corpora have focused on a comparably small set of entities and relations and on particular domains, rather than putting the patient at the center of the analysis. With this paper, we contribute a corpus with a rich set of annotation layers, motivated by the goal of uncovering and modeling patients' journeys and experiences in more detail. We label 14 entity classes (incl. environmental factors, diagnostics, biochemical processes, patients' quality-of-life descriptions, pathogens, medical conditions, and treatments) and 20 relation classes (e.g., prevents, influences, interactions, causes), most of which have not been considered before for social media data. The publicly available dataset consists of 2,100 tweets with approx. 6,000 entity and 3,000 relation annotations. In a corpus analysis, we find that over 80% of documents contain relevant entities. Over 50% of tweets express relations, which we consider essential for uncovering patients' narratives about their journeys.