The Multilingual Amazon Reviews Corpus

Paper Title

The Multilingual Amazon Reviews Corpus

Authors

Phillip Keung, Yichao Lu, György Szarvas, Noah A. Smith

Abstract

We present the Multilingual Amazon Reviews Corpus (MARC), a large-scale collection of Amazon reviews for multilingual text classification. The corpus contains reviews in English, Japanese, German, French, Spanish, and Chinese, which were collected between 2015 and 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID, and the coarse-grained product category (e.g., 'books', 'appliances'). The corpus is balanced across the 5 possible star ratings, so each rating constitutes 20% of the reviews in each language. For each language, there are 200,000, 5,000, and 5,000 reviews in the training, development, and test sets, respectively. We report baseline results for supervised text classification and zero-shot cross-lingual transfer learning by fine-tuning a multilingual BERT model on reviews data. We propose the use of mean absolute error (MAE) instead of classification accuracy for this task, since MAE accounts for the ordinal nature of the ratings.
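
The abstract's argument for MAE over accuracy can be illustrated with a small sketch. The star ratings below are invented toy values, not data or results from the corpus, and the helper names `accuracy` and `mean_absolute_error` are illustrative rather than the authors' evaluation code. The sketch compares two hypothetical classifiers that make the same number of mistakes but differ in how far their predictions fall from the true rating.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true star rating exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average absolute distance, in stars, between prediction and truth."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy star ratings (1 to 5), made up for illustration only.
y_true = [5, 5, 1, 3]
near_miss = [4, 4, 2, 3]  # wrong predictions are off by one star
far_miss = [1, 1, 5, 3]   # wrong predictions are off by four stars

acc_near = accuracy(y_true, near_miss)
acc_far = accuracy(y_true, far_miss)
mae_near = mean_absolute_error(y_true, near_miss)
mae_far = mean_absolute_error(y_true, far_miss)

print(f"accuracy: near={acc_near:.2f} far={acc_far:.2f}")
print(f"MAE:      near={mae_near:.2f} far={mae_far:.2f}")
```

Both toy classifiers get one of four ratings exactly right, so accuracy cannot tell them apart, while MAE penalizes the classifier whose errors land further from the true rating.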
