Authors: Ziyang Wu, Jingyuan Zhang, Druv Pai, XuDong Wang, Chandan Singh, Jianwei Yang, Jianfeng Gao, Yi Ma
Abstract: DINO and DINOv2 are two model families widely used to learn
representations from unlabeled image data at large scale. Their learned
representations often enable state-of-the-art performance for downstream tasks,
such as image classification and segmentation. However, they rely on many
empirically motivated design choices, and their training pipelines are
complex and unstable: many hyperparameters must be carefully tuned to keep
the representations from collapsing, which makes these models difficult to
improve or adapt to new domains. In this work, we
posit that most of these empirically motivated idiosyncrasies in the
pre-training pipelines can be removed, and that adding an explicit coding
rate term to the loss function suffices to prevent collapse of the
representations. As a result, we obtain
highly simplified variants of DINO and DINOv2, which we call SimDINO and
SimDINOv2, respectively. Remarkably, these simplified models are more robust to
different design choices, such as network architecture and hyperparameters, and
they learn even higher-quality representations, as measured by performance on
downstream tasks, offering a Pareto improvement over the corresponding DINO and
DINOv2 models. This work highlights the potential of using simplifying design
principles to improve the empirical practice of deep learning.
Source: http://arxiv.org/abs/2502.10385v1
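
Note: the coding rate term referenced in the abstract is, in the authors' related work, the log-determinant volume measure from the coding-rate (MCR^2) literature; the exact form and weighting used in SimDINO and SimDINOv2 are specified in the paper itself. A minimal PyTorch sketch of such a regularizer, assuming features are the rows of a matrix Z and using a hypothetical distortion parameter eps, might look like:

import torch

def coding_rate(Z: torch.Tensor, eps: float = 0.5) -> torch.Tensor:
    # Z: [n, d] batch of n feature vectors of dimension d.
    # Returns R(Z) = 1/2 * logdet(I_d + d / (n * eps^2) * Z^T Z),
    # a measure of the volume spanned by the features; maximizing it
    # (or penalizing its negative) discourages representation collapse.
    n, d = Z.shape
    I = torch.eye(d, device=Z.device, dtype=Z.dtype)
    cov = Z.transpose(0, 1) @ Z * (d / (n * eps ** 2))
    return 0.5 * torch.logdet(I + cov)

# Hypothetical usage: subtract a weighted coding rate from the
# self-distillation objective so the student features stay spread out.
# loss = distillation_loss - gamma * coding_rate(student_features)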