Do Pre-trained Vision-Language Models Encode Object States?

Authors: Kaleb Newman, Shijie Wang, Yuan Zang, David Heffren, Chen Sun

Abstract: For a vision-language model (VLM) to understand the physical world, such as
cause and effect, a first step is to capture the temporal dynamics of the
visual world, for example how the physical states of objects evolve over time
(e.g., a whole apple being cut into a sliced apple). Our paper investigates whether VLMs
pre-trained on web-scale data learn to encode object states, which can be
extracted with zero-shot text prompts. We curate an object state recognition
dataset ChangeIt-Frames, and evaluate nine open-source VLMs, including models
trained with contrastive and generative objectives. We observe that while these
state-of-the-art vision-language models can reliably perform object
recognition, they consistently fail to accurately distinguish the objects’
physical states. Through extensive experiments, we identify three areas for
improvement for VLMs to better encode object states, namely the quality of
object localization, the architecture to bind concepts to objects, and the
objective to learn discriminative visual and language encoders on object
states. Data and code are released.

Source: http://arxiv.org/abs/2409.10488v1
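The abstract describes extracting object states from pre-trained VLMs with zero-shot text prompts. Below is a minimal sketch of how such zero-shot object-state classification can be set up with a contrastive VLM (CLIP via the Hugging Face transformers library); the prompt wording, checkpoint, and image path are illustrative assumptions, not the paper's exact evaluation protocol or dataset.

```python
# Minimal sketch: zero-shot object-state recognition with a contrastive VLM (CLIP).
# Assumes the Hugging Face `transformers` library; prompts and the image path are
# illustrative and do not reproduce the paper's ChangeIt-Frames evaluation setup.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

image = Image.open("apple.jpg")  # hypothetical image of an apple in some state
prompts = [
    "a photo of a whole apple",
    "a photo of a sliced apple",
]

# Encode the image and the state prompts, then compare them in the shared embedding space.
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Softmax over image-text similarity scores gives a zero-shot distribution over states.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
for prompt, p in zip(prompts, probs.tolist()):
    print(f"{prompt}: {p:.3f}")
```

The paper's finding is that scores from this kind of prompt-based probe reliably separate object categories but often fail to separate physical states of the same object, which motivates the three proposed areas of improvement.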
