Exploring Software Reusability Metrics with Q&A Forum Data

Paper Title

Exploring Software Reusability Metrics with Q&A Forum Data

Author

Patrick, Matthew T.

Abstract

Question and answer (Q&A) forums contain valuable information regarding software reuse, but they can be challenging to analyse due to their unstructured free text. Here we introduce a new approach (LANLAN), using word embeddings and machine learning, to harness information available in StackOverflow. Specifically, we consider two different kinds of user communication describing difficulties encountered in software reuse: 'problem reports' point to potential defects, while 'support requests' ask for clarification on software usage. Word embeddings were trained on 1.6 billion tokens from StackOverflow and applied to identify which Q&A forum messages (from two large open source projects: Eclipse and Bioconductor) correspond to problem reports or support requests. LANLAN achieved an area under the receiver operator curve (AUROC) of over 0.9; it can be used to explore the relationship between software reusability metrics and difficulties encountered by users, as well as predict the number of difficulties users will face in the future. Q&A forum data can help improve understanding of software reuse, and may be harnessed as an additional resource to evaluate software reusability metrics.
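As a rough illustration of the two ingredients the abstract names (word embeddings to represent forum messages, and AUROC to evaluate a classifier), the sketch below averages word vectors per message and computes AUROC from pairwise comparisons. It is not the LANLAN implementation: the tiny embedding table, the example messages, the scoring rule, and the helper names `embed_message` and `auroc` are all hypothetical, chosen only to keep the example self-contained.

```python
from typing import Dict, List, Sequence

# Hypothetical toy embeddings. Real embeddings would be trained on a large
# corpus (the abstract mentions 1.6 billion StackOverflow tokens).
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "crash": [1.0, 0.0],
    "exception": [0.9, 0.1],
    "bug": [0.8, 0.2],
    "how": [0.0, 1.0],
    "install": [0.1, 0.9],
    "use": [0.2, 0.8],
}


def embed_message(text: str, embeddings: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of known words; return a zero vector if none are known."""
    dim = len(next(iter(embeddings.values())))
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Assumed labels for illustration: 1 = problem report, 0 = support request.
messages = [
    ("crash exception on startup", 1),
    ("bug crash when saving", 1),
    ("how install plugin", 0),
    ("how use the package", 0),
]

# Toy scoring rule (an assumption, standing in for a trained classifier):
# first embedding dimension minus the second.
vectors = [embed_message(text, TOY_EMBEDDINGS) for text, _ in messages]
scores = [v[0] - v[1] for v in vectors]
labels = [label for _, label in messages]

print(auroc(scores, labels))
```

In practice the scoring rule would be replaced by a classifier trained on labelled messages, and AUROC would be measured on held-out data rather than on the examples used to build the model.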
