Paper title
Exploring Software Reusability Metrics with Q&A Forum Data
Paper authors
Paper abstract
Question and answer (Q&A) forums contain valuable information about software reuse, but their unstructured free text makes them challenging to analyse. Here we introduce a new approach (LANLAN), using word embeddings and machine learning, to harness information available in StackOverflow. Specifically, we consider two kinds of user communication that describe difficulties encountered in software reuse: 'problem reports' point to potential defects, while 'support requests' ask for clarification on software usage. Word embeddings were trained on 1.6 billion tokens from StackOverflow and applied to identify which Q&A forum messages (from two large open source projects: Eclipse and Bioconductor) correspond to problem reports or support requests. LANLAN achieved an area under the receiver operating characteristic curve (AUROC) of over 0.9; it can be used to explore the relationship between software reusability metrics and the difficulties encountered by users, and to predict the number of difficulties users will face in the future. Q&A forum data can help improve understanding of software reuse, and may be harnessed as an additional resource to evaluate software reusability metrics.
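To make the pipeline in the abstract concrete, the following is a minimal sketch of its general shape: represent each message as the average of its word embeddings, score it as a problem report versus a support request, and evaluate the scores with AUROC. Everything specific here is an assumption for illustration only: the toy three-dimensional vectors, the function names, and the nearest-centroid scoring are hypothetical stand-ins, since the abstract does not state which classifier or embedding dimensions LANLAN uses.

```python
from math import sqrt

# Hypothetical toy embeddings (NOT from the paper); real embeddings would be
# trained on a large corpus and have far more dimensions.
TOY_EMBEDDINGS = {
    "error": [0.9, 0.1, 0.0],
    "crash": [0.8, 0.2, 0.1],
    "exception": [0.85, 0.15, 0.05],
    "how": [0.1, 0.9, 0.2],
    "install": [0.2, 0.8, 0.1],
    "use": [0.15, 0.85, 0.3],
}


def embed_message(text):
    """Average the embeddings of known tokens; return None if no token is known."""
    vectors = [TOY_EMBEDDINGS[t] for t in text.lower().split() if t in TOY_EMBEDDINGS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def problem_score(vec, problem_centroid, support_centroid):
    """Higher values mean the message looks more like a problem report."""
    return cosine(vec, problem_centroid) - cosine(vec, support_centroid)


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labelled messages: 1 = problem report, 0 = support request.
train = [("error crash", 1), ("exception error", 1), ("how install", 0), ("how use", 0)]
problem_c = centroid([embed_message(t) for t, y in train if y == 1])
support_c = centroid([embed_message(t) for t, y in train if y == 0])

held_out = [("crash exception", 1), ("install use", 0)]
scores = [problem_score(embed_message(t), problem_c, support_c) for t, _ in held_out]
print(auroc(scores, [y for _, y in held_out]))
```

In practice the embeddings would come from a trained model and the scoring step from a fitted classifier; the pairwise AUROC above is quadratic in the number of messages and is meant only to show what the metric measures.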