Paper Title
Model Assertions for Monitoring and Improving ML Models
Paper Authors
Paper Abstract
ML models are increasingly deployed in settings with real world interactions such as vehicles, but unfortunately, these models can fail in systematic ways. To prevent errors, ML engineering teams monitor and continuously improve these models. We propose a new abstraction, model assertions, that adapts the classical use of program assertions as a way to monitor and improve ML models. Model assertions are arbitrary functions over a model's input and output that indicate when errors may be occurring, e.g., a function that triggers if an object rapidly changes its class in a video. We propose methods of using model assertions at all stages of ML system deployment, including runtime monitoring, validating labels, and continuously improving ML models. For runtime monitoring, we show that model assertions can find high confidence errors, where a model returns the wrong output with high confidence, which uncertainty-based monitoring techniques would not detect. For training, we propose two methods of using model assertions. First, we propose a bandit-based active learning algorithm that can sample from data flagged by assertions and show that it can reduce labeling costs by up to 40% over traditional uncertainty-based methods. Second, we propose an API for generating "consistency assertions" (e.g., the class change example) and weak labels for inputs where the consistency assertions fail, and show that these weak labels can improve relative model quality by up to 46%. We evaluate model assertions on four real-world tasks with video, LIDAR, and ECG data.
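To make the class-change example from the abstract concrete, the sketch below shows one plausible form such a model assertion could take. It is an illustrative assumption, not the paper's implementation: the function name, the per-frame prediction format (a dict mapping track IDs to class labels), and the `max_flips` threshold are all hypothetical choices for this sketch. The assertion is simply an arbitrary function over model outputs that flags tracked objects whose predicted class flips rapidly across consecutive video frames.

```python
# Illustrative sketch (hypothetical format, not the paper's API): a minimal
# model assertion that fires when a tracked object's predicted class changes
# rapidly between consecutive video frames.

from collections import defaultdict


def class_change_assertion(predictions, max_flips=1):
    """Return track IDs whose predicted class flips more than `max_flips`
    times across consecutive frames.

    `predictions` is assumed to be a list of per-frame dicts mapping
    track_id -> predicted class label (an assumed format for this sketch).
    """
    flips = defaultdict(int)   # number of class changes seen per track
    last_class = {}            # most recent class seen per track
    flagged = set()            # tracks for which the assertion fires

    for frame in predictions:
        for track_id, cls in frame.items():
            if track_id in last_class and cls != last_class[track_id]:
                flips[track_id] += 1
                if flips[track_id] > max_flips:
                    flagged.add(track_id)
            last_class[track_id] = cls
    return flagged


# Example: track 7 flips car -> person -> car within a few frames, which a
# correct detector is unlikely to do, so the assertion flags it.
frames = [{7: "car"}, {7: "person"}, {7: "car"}]
print(class_change_assertion(frames))  # {7}
```

In the workflow the abstract describes, the inputs flagged by such an assertion could then feed runtime monitoring dashboards, active-learning label selection, or weak labels for retraining.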