A Spatiotemporal Mask Autoencoder for One-shot Video Object Segmentation

Baiyu Chen, Li Zhao, Sixian Chan

FAIML 2024

Abstract

This paper introduces a novel architecture for video object segmentation (VOS) aimed at greater label efficiency. Previous studies have primarily tackled the problem with either matching-based or propagation-based architectures, both relying on fully annotated datasets. In contrast, we propose the spatiotemporal mask autoencoder (STMAE), a VOS architecture trained with annotations from the first frame only. Specifically, STMAE produces a precise mask by first aggregating a coarse mask from previous frames, based on the visual correspondence provided by an image encoder, and then reconstructing it. We further propose a one-shot training strategy that learns general object representations for VOS from the first-frame mask alone. This strategy incorporates a reconstruction loss that guides the network to reconstruct the first-frame mask from the spatiotemporal aggregation. Finally, extensive experiments on the DAVIS and YouTube-VOS datasets demonstrate that STMAE achieves remarkable performance while alleviating the labor-intensive annotation burden.

Method

Spatiotemporal Mask Autoencoder
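
Per the abstract, an image encoder provides visual correspondence between frames, and the previous frames' masks are aggregated through that correspondence into a coarse mask for the current frame. Below is a minimal sketch of one such attention-style readout; the function name, cosine normalisation, and softmax temperature are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def aggregate_coarse_mask(feat_prev, feat_curr, mask_prev, temperature=0.07):
    """Attention-style mask readout (illustrative sketch; names,
    cosine normalisation, and the temperature are assumptions, not
    the paper's exact formulation).

    feat_prev: (C, H, W) image-encoder features of a previous frame
    feat_curr: (C, H, W) image-encoder features of the current frame
    mask_prev: (1, H, W) soft object mask of the previous frame
    returns:   (1, H, W) coarse mask estimate for the current frame
    """
    C, H, W = feat_prev.shape
    # Flatten spatial dims and L2-normalise each location's feature
    # vector so the dot product becomes a cosine correspondence score.
    k = F.normalize(feat_prev.reshape(C, -1), dim=0)        # (C, HW)
    q = F.normalize(feat_curr.reshape(C, -1), dim=0)        # (C, HW)
    weights = F.softmax((q.t() @ k) / temperature, dim=1)   # (HW, HW)
    # Each current-frame location reads out a weighted average of the
    # previous-frame mask values it corresponds to.
    coarse = weights @ mask_prev.reshape(-1, 1)             # (HW, 1)
    return coarse.reshape(1, H, W)
```

Because the softmax makes each row a convex combination, the coarse mask stays in [0, 1] and can be averaged across several previous frames before being refined into the final precise mask.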

One-shot Training
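
The abstract specifies only that a reconstruction loss guides the network to recover the first-frame mask from the spatiotemporal aggregation. One plausible reading is a cycle-style objective: propagate the mask to later frames, aggregate it back to frame 0, and penalise the error against the single available annotation. The sketch below (reusing `aggregate_coarse_mask` from above; the cycle scheme itself is an assumed interpretation, not necessarily the paper's) illustrates the idea.

```python
import torch
import torch.nn.functional as F

def one_shot_loss(feats, first_mask):
    """Assumed cycle-style reading of the one-shot objective:
    propagate the first-frame mask forward to each later frame,
    aggregate it back to frame 0, and penalise the reconstruction
    error against the only available annotation.

    feats:      (T, C, H, W) image-encoder features for T frames
    first_mask: (1, H, W) float ground-truth mask of frame 0
    """
    T = feats.shape[0]
    recon = []
    for t in range(1, T):
        # Forward pass of the cycle: frame 0 -> frame t.
        m_t = aggregate_coarse_mask(feats[0], feats[t], first_mask)
        # Backward pass: frame t -> frame 0.
        recon.append(aggregate_coarse_mask(feats[t], feats[0], m_t))
    # Spatiotemporal aggregation: average the per-frame reconstructions.
    recon = torch.stack(recon).mean(dim=0)
    # Reconstruction loss against the first-frame annotation.
    return F.binary_cross_entropy(recon.clamp(1e-6, 1 - 1e-6), first_mask)
```

Since gradients flow through both directions of the cycle, the encoder is pushed toward correspondences that keep the object consistent across frames, using no labels beyond the first frame.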

Comparison

  • DAVIS 2016
  • DAVIS 2017
  • YouTube-VOS 2018

Limitations
