Title
PART: Pre-trained Authorship Representation Transformer
Authors
Abstract
Authors writing documents imprint identifying information within their texts: vocabulary, register, punctuation, misspellings, or even emoji usage. Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors. Using stylometric representations is more suitable, but learning them is itself an open research challenge. In this paper, we propose PART, a contrastively trained model designed to learn \textbf{authorship embeddings} instead of semantics. We train our model on ~1.5M texts belonging to 1,162 literary authors, 17,287 blog posters, and 135 corporate email accounts; a heterogeneous set with identifiable writing styles. We evaluate the model on current challenges, achieving competitive performance. We also evaluate our model on the test splits of the datasets, achieving 72.39\% zero-shot accuracy when bounded to 250 authors, 54\% and 56\% higher than RoBERTa embeddings. We qualitatively assess the representations with different data visualizations on the available datasets, observing features such as the gender, age, or occupation of the author.
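The abstract describes a contrastive objective over author identity: texts written by the same author are pulled together in embedding space while texts by different authors are pushed apart, and zero-shot identification then reduces to nearest-neighbour search over the resulting embeddings. The sketch below is a minimal illustration of that idea, not the paper's released code; the exact loss form, the temperature value, and the PyTorch setup are assumptions for illustration only.

import torch
import torch.nn.functional as F

def contrastive_authorship_loss(embeddings: torch.Tensor,
                                author_ids: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """Supervised InfoNCE-style loss: within a batch, positives are
    pairs of texts written by the same author (an assumed formulation)."""
    z = F.normalize(embeddings, dim=-1)                  # unit-norm author embeddings
    sim = z @ z.T / temperature                          # pairwise cosine similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))      # exclude self-similarity
    positives = (author_ids[:, None] == author_ids[None, :]) & ~self_mask
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)  # log-softmax over the batch
    pos_counts = positives.sum(dim=1).clamp(min=1)
    per_anchor = -log_prob.masked_fill(~positives, 0.0).sum(dim=1) / pos_counts
    return per_anchor[positives.any(dim=1)].mean()       # anchors with >= 1 positive

# Zero-shot author identification then reduces to nearest-neighbour search:
# embed one reference text per candidate author, embed the query text, and
# pick the author whose reference embedding is most similar.
emb = torch.randn(8, 128)                                # stand-in for encoder output
ids = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])             # four authors, two texts each
print(contrastive_authorship_loss(emb, ids).item())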