Paper Title

Enabling Collaborative Data Science Development with the Ballet Framework

Authors

Micah J. Smith, Jürgen Cito, Kelvin Lu, Kalyan Veeramachaneni

Abstract

While the open-source software development model has led to successful large-scale collaborations in building software systems, data science projects are frequently developed by individuals or small teams. We describe challenges to scaling data science collaborations and present a conceptual framework and ML programming model to address them. We instantiate these ideas in Ballet, a lightweight framework for collaborative, open-source data science through a focus on feature engineering, and an accompanying cloud-based development environment. Using our framework, collaborators incrementally propose feature definitions to a repository; each definition is subjected to an ML performance evaluation and can be automatically merged into an executable feature engineering pipeline. We leverage Ballet to conduct a case study analysis of an income prediction problem with 27 collaborators, and discuss implications for future designers of collaborative projects.
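
The propose, evaluate, and merge loop described in the abstract can be illustrated with a small sketch. This is not Ballet's API or its actual evaluation procedure: the class `FeaturePipeline`, its `propose` and `transform` methods, the two thresholds, and the toy data are all hypothetical, and a simple correlation check stands in for the ML performance evaluation.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, List

Row = Dict[str, float]
Feature = Callable[[Row], float]


def correlation(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (sx * sy)


class FeaturePipeline:
    """Hypothetical stand-in for a shared feature repository."""

    def __init__(self, min_relevance: float = 0.3, max_redundancy: float = 0.95):
        # Both thresholds are illustrative assumptions, not values from the paper.
        self.min_relevance = min_relevance
        self.max_redundancy = max_redundancy
        self.accepted: Dict[str, Feature] = {}

    def propose(self, name: str, feature: Feature,
                rows: List[Row], target: List[float]) -> bool:
        """Evaluate a proposed feature and merge it only if it passes."""
        values = [feature(r) for r in rows]
        # Reject features that carry too little signal about the target.
        if abs(correlation(values, target)) < self.min_relevance:
            return False
        # Reject features that nearly duplicate an already accepted one.
        for existing in self.accepted.values():
            existing_values = [existing(r) for r in rows]
            if abs(correlation(values, existing_values)) > self.max_redundancy:
                return False
        self.accepted[name] = feature
        return True

    def transform(self, rows: List[Row]) -> List[List[float]]:
        """Apply every accepted feature, in acceptance order, to each row."""
        return [[f(r) for f in self.accepted.values()] for r in rows]


# Toy data invented for illustration only.
rows = [
    {"age": 25, "hours": 40},
    {"age": 35, "hours": 45},
    {"age": 45, "hours": 30},
    {"age": 55, "hours": 50},
]
income = [30, 50, 40, 80]

pipeline = FeaturePipeline()
pipeline.propose("age", lambda r: r["age"], rows, income)
pipeline.propose("age_doubled", lambda r: 2 * r["age"], rows, income)
pipeline.propose("hours", lambda r: r["hours"], rows, income)
```

In this sketch each proposal is judged independently against the current state of the pipeline, so contributors can work in parallel and only features that add something new are merged. A real system would replace the correlation check with an evaluation tied to model performance.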
