Spatiotemporal Classification with limited labels using Constrained Clustering for large datasets

Paper title

Spatiotemporal Classification with limited labels using Constrained Clustering for large datasets

Authors

Praveen Ravirathinam, Rahul Ghosh, Ke Wang, Keyang Xuan, Ankush Khandelwal, Hilary Dugan, Paul Hanson, Vipin Kumar

Abstract

Creating separable representations via representation learning and clustering is critical when analyzing large unstructured datasets with only a few labels. Separable representations can lead to supervised models with better classification capabilities and can also aid in generating new labeled samples. Most unsupervised and semi-supervised methods for analyzing large datasets do not leverage the small number of existing labels to obtain better representations. In this paper, we propose a spatiotemporal clustering paradigm that combines spatial and temporal features with a constrained loss to produce separable representations. We demonstrate this method on the newly published ReaLSAT dataset, which captures surface water dynamics for over 680,000 lakes across the world and is therefore an essential dataset for ecology and sustainability. Using this large unlabeled dataset, we first show that a spatiotemporal representation is better than a purely spatial or purely temporal one. We then show how an even better representation can be learned using a constrained loss with few labels. We conclude by showing how our method, using few labels, can pick out new labeled samples from the unlabeled data, which can be used to augment supervised methods, leading to better classification.
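The abstract does not spell out the form of the constrained loss. As a rough illustration only, the sketch below uses a standard contrastive-style pairwise penalty as a stand-in: labeled samples with the same label are pulled together in the embedding space, and labeled samples with different labels are pushed apart up to a margin. The function name, the `margin` parameter, and the toy embeddings and labels are all hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def pairwise_constraint_loss(embeddings, labels, margin=1.0):
    """Contrastive-style penalty over all pairs of labeled samples.

    This is an assumed stand-in, not the paper's loss.

    embeddings: list of equal-length float vectors, one per sample.
    labels: dict mapping sample index -> class label (labeled subset only).
    margin: hypothetical hyperparameter; the minimum distance wanted
            between samples that carry different labels.
    """
    total, n_pairs = 0.0, 0
    for i, j in combinations(sorted(labels), 2):
        dist = math.dist(embeddings[i], embeddings[j])
        if labels[i] == labels[j]:
            # Must-link pair: same label, so penalize any separation.
            total += dist ** 2
        else:
            # Cannot-link pair: different labels, so penalize only
            # pairs that sit closer together than the margin.
            total += max(0.0, margin - dist) ** 2
        n_pairs += 1
    # Average over labeled pairs; zero when fewer than two labels exist.
    return total / n_pairs if n_pairs else 0.0


# Toy data (hypothetical): samples 0 and 1 share a label, sample 2 has a
# different label, and sample 3 is unlabeled, so the penalty ignores it.
embeddings = [[0.0, 0.0], [0.0, 0.5], [3.0, 4.0], [1.0, 1.0]]
labels = {0: 0, 1: 0, 2: 1}
loss = pairwise_constraint_loss(embeddings, labels, margin=1.0)
```

In a semi-supervised setup, a term like this would typically be added to an unsupervised clustering or reconstruction objective, so that the few available labels shape the representation while the unlabeled samples still drive most of the training signal.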
