Title

Combining Deep Learning and String Kernels for the Localization of Swiss German Tweets

Authors

Mihaela Gaman, Radu Tudor Ionescu

Abstract

In this work, we introduce the methods proposed by the UnibucKernel team in solving the Social Media Variety Geolocation task featured in the 2020 VarDial Evaluation Campaign. We address only the second subtask, which targets a data set composed of nearly 30 thousand Swiss German Jodels. The dialect identification task is about accurately predicting the latitude and longitude of test samples. We frame the task as a double regression problem, employing a variety of machine learning approaches to predict both latitude and longitude. From simple models for regression, such as Support Vector Regression, to deep neural networks, such as Long Short-Term Memory networks and character-level convolutional neural networks, and, finally, to ensemble models based on meta-learners, such as XGBoost, our interest is focused on approaching the problem from a few different perspectives, in an attempt to minimize the prediction error. With the same goal in mind, we also considered many types of features, from high-level features, such as BERT embeddings, to low-level features, such as character n-grams, which are known to provide good results in dialect identification. Our empirical results indicate that the handcrafted model based on string kernels outperforms the deep learning approaches. Nevertheless, our best performance is given by the ensemble model that combines both handcrafted and deep learning models.
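To illustrate the double-regression framing described in the abstract, the following is a minimal sketch, not the authors' implementation: a crude character n-gram overlap (in the spirit of a spectrum string kernel) weights training examples, and each coordinate is predicted by a separate kernel-weighted regression. The toy texts and coordinates below are invented for illustration.

```python
def char_ngrams(text, n_min=1, n_max=4):
    """Multiset of character n-grams of lengths n_min..n_max."""
    grams = {}
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            g = text[i:i + n]
            grams[g] = grams.get(g, 0) + 1
    return grams

def kernel(a, b):
    """Spectrum-kernel-style similarity: shared n-gram counts."""
    return sum(min(c, b.get(g, 0)) for g, c in a.items())

# Toy corpus: short Swiss German texts with (latitude, longitude) labels
# (made-up coordinates, for illustration only).
train = [
    ("grüezi mitenand", (47.37, 8.54)),
    ("hoi zäme", (47.05, 8.30)),
    ("sali du", (46.95, 7.45)),
]

def predict(text):
    """Treat the task as two regressions, one per coordinate:
    a kernel-weighted average of the training coordinates."""
    q = char_ngrams(text)
    weights = [kernel(q, char_ngrams(t)) for t, _ in train]
    total = sum(weights) or 1
    lat = sum(w * c[0] for w, (_, c) in zip(weights, train)) / total
    lon = sum(w * c[1] for w, (_, c) in zip(weights, train)) / total
    return lat, lon

print(predict("grüezi zäme"))  # predicted (latitude, longitude)
```

In the paper this role is played by proper string kernels with Support Vector Regression and, ultimately, by an ensemble with deep models; the sketch only shows how character-level similarity can drive two independent coordinate regressions.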
