Paper Title

Optimizing Bi-Encoder for Named Entity Recognition via Contrastive Learning

Authors

Sheng Zhang, Hao Cheng, Jianfeng Gao, Hoifung Poon

Abstract

We present a bi-encoder framework for named entity recognition (NER), which applies contrastive learning to map candidate text spans and entity types into the same vector representation space. Prior work predominantly approaches NER as sequence labeling or span classification. We instead frame NER as a representation learning problem that maximizes the similarity between the vector representations of an entity mention and its type. This makes it easy to handle nested and flat NER alike, and can better leverage noisy self-supervision signals. A major challenge to this bi-encoder formulation for NER lies in separating non-entity spans from entity mentions. Instead of explicitly labeling all non-entity spans as the same class $\texttt{Outside}$ ($\texttt{O}$) as in most prior methods, we introduce a novel dynamic thresholding loss. Experiments show that our method performs well in both supervised and distantly supervised settings, for nested and flat NER alike, establishing new state of the art across standard datasets in the general domain (e.g., ACE2004, ACE2005) and high-value verticals such as biomedicine (e.g., GENIA, NCBI, BC5CDR, JNLPBA). We release the code at github.com/microsoft/binder.
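At inference time, this bi-encoder view reduces to nearest-type search: score each candidate span embedding against every entity-type embedding, and fall back to $\texttt{O}$ when no similarity clears a threshold. The sketch below illustrates only that decision rule, with toy vectors, cosine similarity, and a fixed threshold; the function and vector names are illustrative, and the paper's actual threshold is learned dynamically rather than fixed.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify_span(span_vec, type_vecs, threshold):
    """Return the best-matching entity type for a span embedding,
    or 'O' (Outside) if no type similarity exceeds the threshold."""
    best_type, best_sim = "O", threshold
    for type_name, type_vec in type_vecs.items():
        sim = cosine(span_vec, type_vec)
        if sim > best_sim:
            best_type, best_sim = type_name, sim
    return best_type

# Toy example: 2-d embeddings for two entity types.
types = {"PER": [1.0, 0.0], "ORG": [0.0, 1.0]}
print(classify_span([0.9, 0.1], types, 0.5))   # close to PER -> "PER"
print(classify_span([0.5, 0.5], types, 0.99))  # nothing clears threshold -> "O"
```

Because every span competes against a shared set of type embeddings rather than a dedicated $\texttt{O}$ classifier, nested spans can each receive their own type independently, which is why the formulation handles nested and flat NER uniformly.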
