Authors: Valentinos Pariza, Mohammadreza Salehi, Gertjan Burghouts, Francesco Locatello, Yuki M. Asano
Abstract: We propose sorting patch representations across views as a novel
self-supervised learning signal to improve pretrained representations. To this
end, we introduce NeCo: Patch Neighbor Consistency, a novel training loss that
enforces patch-level nearest neighbor consistency across a student and teacher
model, relative to reference batches. Our method leverages a differentiable
sorting method applied on top of pretrained representations, such as
DINOv2-registers to bootstrap the learning signal and further improve upon
them. This dense post-pretraining leads to superior performance across various
models and datasets, despite requiring only 19 hours on a single GPU. We
demonstrate that this method generates high-quality dense feature encoders and
establish several new state-of-the-art results: +5.5% and + 6% for
non-parametric in-context semantic segmentation on ADE20k and Pascal VOC, and
+7.2% and +5.7% for linear segmentation evaluations on COCO-Things and -Stuff.
Source: http://arxiv.org/abs/2408.11054v1