Paper Title
Frame-To-Frame Consistent Semantic Segmentation
Paper Authors
Paper Abstract
In this work, we aim for temporally consistent semantic segmentation throughout frames in a video. Many semantic segmentation algorithms process images individually, which leads to an inconsistent scene interpretation due to illumination changes, occlusions and other variations over time. To achieve a temporally consistent prediction, we train a convolutional neural network (CNN) which propagates features through consecutive frames in a video using a convolutional long short-term memory (ConvLSTM) cell. Besides the temporal feature propagation, we penalize inconsistencies in our loss function. Our experiments show that performance improves when video information is used compared to single-frame prediction. After adding the ConvLSTM to propagate features through time on top of ESPNet, the mean intersection over union (mIoU) metric on the Cityscapes validation set increases from 45.2 % for single frames to 57.9 % for video data. Most importantly, inconsistency decreases from 4.5 % to 1.3 %, a reduction of 71.1 %. Our results indicate that the added temporal information produces a frame-to-frame consistent and more accurate image understanding compared to single-frame processing. Code and videos are available at https://github.com/mrebol/f2f-consistent-semantic-segmentation
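The abstract describes two ingredients: a ConvLSTM cell that carries encoder features from frame to frame, and a loss term that penalizes prediction changes between consecutive frames. The sketch below illustrates this structure under stated assumptions; it is not the authors' released code (see the repository linked above). The module names (`ConvLSTMCell`, `TemporalSegmenter`, `consistency_penalty`), channel sizes, the encoder interface, and the exact form of the consistency term are illustrative assumptions.

```python
# Minimal sketch: ConvLSTM-based temporal feature propagation for video
# semantic segmentation, plus an assumed frame-to-frame consistency penalty.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, kernel_size=3):
        super().__init__()
        self.hid_ch = hid_ch
        # One convolution produces the input, forget, output and cell gates.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


class TemporalSegmenter(nn.Module):
    """Per-frame encoder features are passed through a ConvLSTM whose hidden
    state carries information from earlier frames into the current prediction."""

    def __init__(self, encoder, feat_ch, num_classes, hid_ch=64):
        super().__init__()
        self.encoder = encoder                      # e.g. an ESPNet-style backbone (assumed)
        self.convlstm = ConvLSTMCell(feat_ch, hid_ch)
        self.classifier = nn.Conv2d(hid_ch, num_classes, 1)

    def forward(self, clip):                        # clip: (B, T, 3, H, W)
        b, t, _, h, w = clip.shape
        state, logits = None, []
        for i in range(t):
            feat = self.encoder(clip[:, i])         # features of frame i
            if state is None:                       # zero-initialize hidden/cell state
                zeros = feat.new_zeros(b, self.convlstm.hid_ch, *feat.shape[-2:])
                state = (zeros, zeros.clone())
            state = self.convlstm(feat, state)
            out = self.classifier(state[0])
            logits.append(F.interpolate(out, size=(h, w),
                                        mode="bilinear", align_corners=False))
        return torch.stack(logits, dim=1)           # (B, T, C, H, W)


def consistency_penalty(logits):
    """Assumed form of the inconsistency term: penalize changes between the
    class probabilities of consecutive frames (motion is ignored here)."""
    prob = logits.softmax(dim=2)
    return (prob[:, 1:] - prob[:, :-1]).abs().mean()
```

In this reading, the total training loss would combine the usual per-frame cross-entropy with a weighted `consistency_penalty`, so the network is rewarded both for correct labels and for stable predictions across neighboring frames.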