Paper Title
What Should the System Do Next?: Operative Action Captioning for Estimating System Actions
Paper Authors
Paper Abstract
Human-assisting systems such as robots need to correctly understand the surrounding situation from their observations and output the support actions that humans require. Language is one of the most important channels for communicating with humans, so robots must be able to express the results of their understanding and action planning. In this study, we propose a new task, operative action captioning, which estimates and verbalizes the actions a system should take in a human-assisting domain. We constructed a system that outputs a verbal description of a possible operative action that changes the current state into a given target state. Through crowdsourcing in daily-life situations, we collected a dataset in which each sample consists of two images as observations, expressing the current state and the state changed by the actions, and a caption describing the actions that transform the current state into the target state. We then built a system that estimates the operative action and expresses it as a caption. Since the caption of an operative action is expected to contain state-changing actions, we use scene-graph prediction as an auxiliary task, because the events described in the scene graphs correspond to the state changes. Experimental results showed that our system successfully described the operative actions that should be performed to move from the current state to the target state. The auxiliary task of predicting scene graphs improved the quality of the estimation results.
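To make the multi-task setup in the abstract more concrete, the following is a minimal sketch of one plausible architecture: two image observations (current and target state) are encoded, a caption describing the operative action is decoded, and scene-graph prediction serves as an auxiliary objective. The ResNet-18 backbone, the GRU decoder, the multi-label relation head, and the loss weighting are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: module choices and sizes are assumptions,
# not the architecture reported in the paper.
import torch
import torch.nn as nn
import torchvision.models as models


class OperativeActionCaptioner(nn.Module):
    def __init__(self, vocab_size, num_sg_labels, hidden=512, aux_weight=0.5):
        super().__init__()
        # Shared image encoder (backbone with the final fc layer removed).
        backbone = models.resnet18(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])
        self.proj = nn.Linear(2 * 512, hidden)  # fuse current + target features

        # Caption decoder: embeds previous tokens and predicts the next one.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.word_head = nn.Linear(hidden, vocab_size)

        # Auxiliary head: multi-label prediction over scene-graph relations,
        # intended to push the fused features to encode the state changes.
        self.sg_head = nn.Linear(hidden, num_sg_labels)
        self.aux_weight = aux_weight

    def forward(self, img_current, img_target, caption_in):
        f_cur = self.encoder(img_current).flatten(1)        # (B, 512)
        f_tgt = self.encoder(img_target).flatten(1)         # (B, 512)
        ctx = self.proj(torch.cat([f_cur, f_tgt], dim=1))   # (B, hidden)

        # Decode the caption conditioned on the fused image context.
        emb = self.embed(caption_in)                         # (B, T, hidden)
        out, _ = self.decoder(emb, ctx.unsqueeze(0))         # init hidden = ctx
        word_logits = self.word_head(out)                    # (B, T, vocab)

        sg_logits = self.sg_head(ctx)                        # (B, num_sg_labels)
        return word_logits, sg_logits

    def loss(self, word_logits, caption_out, sg_logits, sg_targets):
        cap_loss = nn.functional.cross_entropy(
            word_logits.reshape(-1, word_logits.size(-1)),
            caption_out.reshape(-1))
        sg_loss = nn.functional.binary_cross_entropy_with_logits(
            sg_logits, sg_targets)
        # Joint objective: caption loss plus weighted auxiliary scene-graph loss.
        return cap_loss + self.aux_weight * sg_loss
```

In this sketch the auxiliary scene-graph loss shares the fused image representation with the caption decoder, which is one straightforward way the auxiliary task could regularize captioning toward state changes, consistent with the role described in the abstract.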