Title

SDA: Simple Discrete Augmentation for Contrastive Sentence Representation Learning

Authors

Dongsheng Zhu, Zhenyu Mao, Jinghui Lu, Rui Zhao, Fei Tan

Abstract

Contrastive learning has recently achieved compelling performance in unsupervised sentence representation. As an essential element, however, data augmentation protocols have not been well explored. The pioneering work SimCSE, which resorts to a simple dropout mechanism (viewed as continuous augmentation), surprisingly dominates discrete augmentations such as cropping, word deletion, and synonym replacement, as reported. To understand the underlying rationales, we revisit existing approaches and attempt to hypothesize the desiderata of reasonable data augmentation methods: a balance of semantic consistency and expression diversity. We then develop three simple yet effective discrete sentence augmentation schemes: punctuation insertion, modal verbs, and double negation. They act as minimal noise at the lexical level to produce diverse forms of sentences. Furthermore, standard negation is capitalized on to generate negative samples, alleviating the feature suppression involved in contrastive learning. We experimented extensively with semantic textual similarity on diverse datasets. The results consistently support the superiority of the proposed methods. Our key code is available at https://github.com/Zhudongsheng75/SDA
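Of the three schemes, punctuation insertion is the most mechanical: punctuation marks are injected at word boundaries so the surface form changes while the meaning stays intact. The sketch below illustrates the idea in Python; the function name, the candidate punctuation set, and the insertion policy are assumptions for illustration, not the authors' actual SDA implementation.

```python
import random

# Candidate marks to inject; an assumed set, not necessarily the paper's.
PUNCTS = [",", ".", ";", "!", "?", ":"]

def punctuation_insert(sentence, n_insert=1, seed=None):
    """Insert n_insert punctuation marks at random word boundaries.

    A minimal sketch of the punctuation-insertion augmentation: the
    content words are untouched, so semantic consistency is preserved
    while the surface form gains diversity.
    """
    rng = random.Random(seed)
    words = sentence.split()
    for _ in range(n_insert):
        # Pick a boundary after any word (never before the first word).
        pos = rng.randint(1, len(words))
        words.insert(pos, rng.choice(PUNCTS))
    return " ".join(words)
```

In a contrastive setup, the augmented sentence would serve as the positive view of the original; because only punctuation tokens are added, stripping them recovers the original word sequence exactly.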
