Paper Title
Robots Understanding Contextual Information in Human-Centered Environments using Weakly Supervised Mask Data Distillation
Paper Authors
Paper Abstract
Contextual information in human environments, such as signs, symbols, and objects, provides important information for robots to use for exploration and navigation. To identify and segment contextual information from complex images obtained in these environments, data-driven methods such as Convolutional Neural Networks (CNNs) are used. However, these methods require large amounts of human-labeled data, which are slow and time-consuming to obtain. Weakly supervised methods address this limitation by generating pseudo segmentation labels (PSLs). In this paper, we present the novel Weakly Supervised Mask Data Distillation (WeSuperMaDD) architecture for autonomously generating PSLs using CNNs not specifically trained for the task of context segmentation, i.e., CNNs trained for object classification, image captioning, etc. WeSuperMaDD uniquely generates PSLs using image features learned from sparse, limited-diversity data, which is common in robot navigation tasks in human-centered environments (malls, grocery stores). Our proposed architecture uses a new mask refinement system that automatically searches for the PSL with the fewest foreground pixels that satisfies cost constraints. This removes the need for handcrafted heuristic rules. Extensive experiments successfully validated the performance of WeSuperMaDD in generating PSLs for datasets with text of various scales, fonts, and perspectives in multiple indoor/outdoor environments. A comparison with Naive, GrabCut, and Pyramid methods found a significant improvement in label and segmentation quality. Moreover, a context segmentation CNN trained using the WeSuperMaDD architecture achieved measurable improvements in accuracy compared to one trained with Naive PSLs. Our method also had comparable performance to existing state-of-the-art text detection and segmentation methods on real datasets without requiring segmentation labels for training.
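To illustrate the mask-refinement selection rule described in the abstract (choosing the PSL with the fewest foreground pixels that still satisfies the cost constraints), the following is a minimal sketch. The names `candidate_masks`, `cost_fn`, and `cost_threshold` are hypothetical placeholders, and the actual cost formulation is defined in the paper itself; this is not the authors' implementation.

```python
import numpy as np

def select_psl(candidate_masks, cost_fn, cost_threshold):
    """Hedged sketch of the selection rule: among candidate pseudo
    segmentation labels (binary HxW arrays), return the one with the
    fewest foreground pixels whose cost stays within the threshold."""
    best_mask, best_fg = None, np.inf
    for mask in candidate_masks:
        fg_pixels = int(mask.sum())            # count foreground pixels
        if cost_fn(mask) <= cost_threshold and fg_pixels < best_fg:
            best_mask, best_fg = mask, fg_pixels
    return best_mask                           # None if no candidate satisfies the constraint
```

The point of the rule, as stated in the abstract, is that the cost constraint replaces handcrafted heuristic rules for deciding which refined mask to keep.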