Title
PART: Pre-trained Authorship Representation Transformer
Authors
Abstract
Authors writing documents imprint identifying information within their texts: vocabulary, register, punctuation, misspellings, or even emoji usage. Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors. Using stylometric representations is more suitable, but learning them is itself an open research challenge. In this paper, we propose PART, a contrastively trained model designed to learn \textbf{authorship embeddings} instead of semantics. We train our model on ~1.5M texts belonging to 1,162 literary authors, 17,287 blog posters, and 135 corporate email accounts; a heterogeneous set with identifiable writing styles. We evaluate the model on current challenges, achieving competitive performance. We also evaluate our model on the test splits of the datasets, achieving 72.39\% zero-shot accuracy when bounded to 250 authors, 54\% and 56\% higher than RoBERTa embeddings. We qualitatively assess the representations with different data visualizations on the available datasets, observing features such as the gender, age, or occupation of the author.
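The abstract describes a contrastive objective over author identity: texts written by the same author are pulled together in embedding space while texts by different authors are pushed apart, and zero-shot identification then reduces to nearest-neighbour search over the resulting embeddings. The sketch below is a minimal illustration of that idea, not the paper's released code; the exact loss form, the temperature value, and the PyTorch setup are assumptions for illustration only.

import torch
import torch.nn.functional as F

def contrastive_authorship_loss(embeddings: torch.Tensor,
                                author_ids: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """Supervised InfoNCE-style loss: within a batch, positives are
    pairs of texts written by the same author (an assumed formulation)."""
    z = F.normalize(embeddings, dim=-1)                  # unit-norm author embeddings
    sim = z @ z.T / temperature                          # pairwise cosine similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))      # exclude self-similarity
    positives = (author_ids[:, None] == author_ids[None, :]) & ~self_mask
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)  # log-softmax over the batch
    pos_counts = positives.sum(dim=1).clamp(min=1)
    per_anchor = -log_prob.masked_fill(~positives, 0.0).sum(dim=1) / pos_counts
    return per_anchor[positives.any(dim=1)].mean()       # anchors with >= 1 positive

# Zero-shot author identification then reduces to nearest-neighbour search:
# embed one reference text per candidate author, embed the query text, and
# pick the author whose reference embedding is most similar.
emb = torch.randn(8, 128)                                # stand-in for encoder output
ids = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])             # four authors, two texts each
print(contrastive_authorship_loss(emb, ids).item())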