A Spatiotemporal Mask Autoencoder for One-shot Video Object Segmentation
Baiyu Chen, Li Zhao, Sixian Chan
FAIML 2024
Abstract
This paper introduces a novel architecture for video object segmentation (VOS) that achieves greater label efficiency. Previous studies have primarily tackled this problem with matching-based or propagation-based architectures that rely on fully annotated datasets. In contrast, we propose the spatiotemporal mask autoencoder (STMAE), a VOS architecture trained with annotations from the first frame only. Specifically, STMAE first aggregates a coarse mask from previous frames using the visual correspondences provided by an image encoder, and then reconstructs it into a precise mask. We further propose a one-shot training strategy that learns general object representations for VOS from the first-frame mask alone: a reconstruction loss guides the network to recover the first-frame mask from the spatiotemporal aggregation. Finally, extensive experiments on the DAVIS and YouTube-VOS datasets demonstrate that STMAE achieves remarkable performance while substantially reducing the labor-intensive annotation burden.
Method
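The core of STMAE is a two-stage pipeline: a coarse mask is first aggregated from past frames via visual correspondence between encoder features, and the mask is then reconstructed into a precise prediction. Below is a minimal PyTorch sketch of the aggregation step, assuming a generic image encoder that produces per-frame feature maps; the function name `aggregate_coarse_mask`, the softmax-over-memory formulation, and the temperature value are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F


def aggregate_coarse_mask(query_feat, memory_feats, memory_masks, temperature=0.07):
    """Propagate past masks to the current frame via feature affinity.

    query_feat:   (C, H, W)    encoder features of the current frame
    memory_feats: (T, C, H, W) encoder features of past frames
    memory_masks: (T, 1, H, W) soft object masks of past frames
    Returns a coarse soft mask of shape (1, H, W) for the current frame.
    """
    C, H, W = query_feat.shape
    q = F.normalize(query_feat.reshape(C, -1), dim=0)           # (C, HW)
    k = F.normalize(memory_feats.reshape(-1, C, H * W), dim=1)  # (T, C, HW)
    k = k.permute(1, 0, 2).reshape(C, -1)                       # (C, T*HW)
    v = memory_masks.reshape(1, -1)                             # (1, T*HW)

    # Cosine affinity between every current-frame location and every
    # memory location, sharpened by a softmax over the memory axis.
    affinity = (k.t() @ q) / temperature                        # (T*HW, HW)
    weights = affinity.softmax(dim=0)

    # Each current-frame location pools mask values from its correspondences.
    coarse = v @ weights                                        # (1, HW)
    return coarse.reshape(1, H, W)
```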
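Training uses only the first-frame mask. One plausible way to realize the reconstruction loss described in the abstract is a forward-backward cycle: propagate the frame-0 mask through the clip, then aggregate the propagated masks back onto frame 0 and penalize the reconstruction error against the lone annotation. The sketch below reuses `aggregate_coarse_mask` from above; the `encoder`/`decoder` callables and the cycle formulation are assumptions for illustration, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F


def one_shot_loss(encoder, decoder, frames, first_mask):
    """Compute the one-shot reconstruction loss on a clip.

    frames:     (T, 3, H, W) video clip
    first_mask: (1, h, w) ground-truth mask of frame 0 (the only annotation)
    encoder:    callable mapping frames -> (T, C, h, w) feature maps
    decoder:    callable mapping (coarse_mask, frame_features) -> mask logits
    """
    feats = encoder(frames)                                    # (T, C, h, w)
    num_frames = feats.shape[0]

    # Forward pass: carry the frame-0 mask through the rest of the clip.
    masks = [first_mask]
    for t in range(1, num_frames):
        coarse = aggregate_coarse_mask(feats[t], feats[:t], torch.stack(masks))
        masks.append(torch.sigmoid(decoder(coarse, feats[t])))

    # Backward pass: aggregate the propagated masks back onto frame 0 and
    # reconstruct the first-frame mask from that spatiotemporal aggregation.
    coarse0 = aggregate_coarse_mask(feats[0], feats[1:], torch.stack(masks[1:]))
    recon_logits = decoder(coarse0, feats[0])

    # Supervision comes solely from the frame-0 annotation.
    return F.binary_cross_entropy_with_logits(recon_logits, first_mask)
```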
Comparison
STMAE is compared against prior VOS methods on the following benchmarks:
- DAVIS 2016
- DAVIS 2017
- YouTube-VOS 2018