Paper Title

Time and the Value of Data

Authors

Ehsan Valavi, Joel Hestness, Newsha Ardalani, Marco Iansiti

Abstract

Managers often believe that collecting more data will continually improve the accuracy of their machine learning models. However, we argue in this paper that when data lose relevance over time, it may be optimal to collect a limited amount of recent data instead of keeping around an infinite supply of older (less relevant) data. In addition, we argue that increasing the stock of data by including older datasets may, in fact, damage the model's accuracy. Expectedly, the model's accuracy improves by increasing the flow of data (defined as data collection rate); however, it requires other tradeoffs in terms of refreshing or retraining machine learning models more frequently. Using these results, we investigate how the business value created by machine learning models scales with data and when the stock of data establishes a sustainable competitive advantage. We argue that data's time-dependency weakens the barrier to entry that the stock of data creates. As a result, a competing firm equipped with a limited (yet sufficient) amount of recent data can develop more accurate models. This result, coupled with the fact that older datasets may deteriorate models' accuracy, suggests that created business value doesn't scale with the stock of available data unless the firm offloads less relevant data from its data repository. Consequently, a firm's growth policy should incorporate a balance between the stock of historical data and the flow of new data. We complement our theoretical results with an experiment. In the experiment, we empirically measure the loss in the accuracy of a next word prediction model trained on datasets from various time periods. Our empirical measurements confirm the economic significance of the value decline over time. For example, 100MB of text data, after seven years, becomes as valuable as 50MB of current data for the next word prediction task.
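The closing example in the abstract (100MB of seven-year-old text being worth about 50MB of current text) can be turned into a rough back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that data value decays exponentially with age, with a half-life calibrated to that single data point. The abstract does not state a functional form, so the decay model, the function names, and the constant-flow scenario are all hypothetical and not the authors' method.

```python
import math

# Assumption: one calibration point taken from the abstract's example
# (100MB after seven years is worth about 50MB of current data).
HALF_LIFE_YEARS = 7.0


def effective_size_mb(size_mb: float, age_years: float,
                      half_life_years: float = HALF_LIFE_YEARS) -> float:
    """Amount of current data with the same value as `size_mb` of data
    that is `age_years` old, under the hypothetical exponential decay."""
    return size_mb * 0.5 ** (age_years / half_life_years)


def effective_stock_mb(flow_mb_per_year: float, years_kept: float,
                       half_life_years: float = HALF_LIFE_YEARS) -> float:
    """Effective size of a data stock collected at a constant flow and
    retained for `years_kept` years: the integral of
    flow * exp(-rate * age) over ages from 0 to `years_kept`."""
    rate = math.log(2) / half_life_years
    return flow_mb_per_year * (1.0 - math.exp(-rate * years_kept)) / rate


if __name__ == "__main__":
    print(effective_size_mb(100, 7))
    for years in (7, 14, 28, 100):
        print(years, round(effective_stock_mb(100, years), 1))
```

Under this assumed model, the effective stock is proportional to the flow but approaches a ceiling of `flow / rate` however long the history is kept, which is one way to read the abstract's contrast between stock and flow. The sketch does not capture the stronger claim that adding older data can actively reduce accuracy.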
