OLIVE: Object Level In-Context Visual Embeddings (2024)

Timothy Ossowski¹, Junjie Hu¹,²
¹Department of Computer Science, ²Department of Biostatistics and Medical Informatics
University of Wisconsin, Madison, WI, USA
ossowski@wisc.edu, junjie.hu@wisc.edu

Abstract

Recent generalist vision-language models (VLMs) have demonstrated impressive reasoning capabilities across diverse multimodal tasks. However, these models still struggle with fine-grained object level understanding and grounding. In terms of modeling, existing VLMs implicitly align text tokens with image patch tokens, which prevents alignment at a common granularity and inevitably introduces noisy, spurious background features. Additionally, these models struggle when generalizing to unseen visual concepts and may not be reliable for domain-specific tasks without further fine-tuning. To address these limitations, we propose a novel method to prompt large language models with in-context visual object vectors, thereby enabling controllable object level reasoning. This eliminates the necessity of fusing a lengthy array of image patch features and significantly speeds up training. Furthermore, we propose region-level retrieval using our object representations, facilitating rapid adaptation to new objects without additional training. Our experiments reveal that our method achieves competitive referring object classification and captioning performance, while also offering zero-shot generalization and robustness to visually challenging contexts. Our code and models are available at https://github.com/tossowski/OLIVE.

1 Introduction

Despite their popularity, many existing VLMs such as LLaVA Liu et al. (2023), MiniGPT4 Zhu et al. (2023a), and mPLUG-Owl Ye et al. (2023) process the entire image for visual understanding, leading to two major shortcomings. First, these VLMs use a visual transformer to split an image into a grid of image patches and embed them into a lengthy array of image patch embeddings, with object level features scattered across different positions of the array. This creates a granularity mismatch between the image patch tokens and text tokens, further creating difficulty in aligning and grounding visual objects to text concepts. Second, feeding all image patch embeddings to the large language model (LLM) decoder is problematic due to the resulting long context and the inefficiency of including in-context examples from multiple images.

To improve fine-grained visual alignment, recent region-based VLMs are pre-trained to integrate object level information into the LLM decoder. GPT4ROI Zhang et al. (2023b) pre-trains LLMs to understand ROIAlign features He et al. (2017) extracted from bounding boxes. Other similar methods such as Shikra Chen et al. (2023) or Kosmos-2 Peng et al. (2024) ground and refer to objects using text in multimodal referential dialogues. FERRET You et al. (2023) and ViP-LLaVA Cai et al. (2023) further support free-form shapes as referring input by summarizing visual features sampled within the region of interest. Although these methods improve object level reasoning, they still fail at recognizing unseen/rare objects and are sensitive to spurious background features, as shown in §5. Even powerful closed-source multimodal models such as GPT4V are unreliable to deploy in high-stakes domain-specific situations such as the medical domain Senkaiahliyan et al. (2023).

A straightforward way to handle generalization to unseen visual content is to integrate a retrieval component. Methods such as REVEAL Hu et al. (2023) and MuRAG Chen et al. (2022) provide retrieved multimodal facts as supplementary context to help VLMs generalize to new concepts without further training. However, these models do not consider object level retrieval and in-context prediction. Models such as Flamingo Alayrac et al. (2022) and Qwen-VL Bai et al. (2023) allow for in-context examples from multiple images, yet do not support object level retrieval and reasoning.

To address the above issues, we propose to encode object level in-context visual embeddings (OLIVE) to enhance LLMs with region-level reasoning capabilities. Critically, we omit lengthy image patch features and encode visual object embeddings by a lightweight encoder of 20 million parameters, allowing for faster training and direct connection to existing LLMs. This preserves the full functionality of the original LLMs, while also introducing novel multimodal reasoning abilities. Furthermore, our object level retrieval module allows for more precise queries and retrieved information to help the model adapt to domain-specific tasks with limited training data. Our contributions are summarized below and in Table 1:

  • We propose a lightweight object encoder that can be connected to existing LLMs to enable controllable object level multimodal reasoning with free-form input annotations.

  • Our model omits image patch features and summarizes object features into a single vector, significantly reducing context length for more efficient training and inference, and allowing for in-context examples from multiple images.

  • We conduct extensive experiments with region-retrieval of object level features and showcase rapid adaptation to unseen visual concepts.

Table 1: Comparison of model capabilities.

| Model | Free-form Visual Prompts | Free-form Text Prompts | Visual Generalization | Generative Approach | Multi-Image |
|---|---|---|---|---|---|
| Ferret | ✓ | ✓ | ✗ | ✓ | ✗ |
| Flamingo | ✗ | ✓ | ✓ | ✓ | ✓ |
| GPT4ROI | ✗ | ✓ | ✗ | ✓ | ✗ |
| GLAMM | ✗ | ✓ | ✗ | ✓ | ✗ |
| RegionCLIP | ✗ | ✓ | ✗ | ✗ | ✗ |
| Llama-Adapter v2 | ✗ | ✗ | ✗ | ✓ | ✗ |
| ViP-LLaVA | ✓ | ✗ | ✗ | ✓ | ✗ |
| OLIVE | ✓ | ✓ | ✓ | ✓ | ✓ |

2 Preliminaries

Generative VLM Architecture

Recent generative VLMs (e.g., LLaVA, BLIP-2) adopt a similar architecture that connects a pre-trained visual encoder $\phi_v$ and a pre-trained language model decoder $\phi_t$ through a lightweight fusion neural network, denoted as $\phi_c$. Specifically, the fusion module first uses a projection function to map a visual feature $\mathbf{v} \in \mathcal{V}$ to the text embedding space $\mathcal{X}$ of the language model decoder, and then fuses the visual and text embeddings as input to the language model decoder. Formally, given an image $v$ and a text prompt $x$, the decoder takes in the combined feature $\mathbf{x}$ to autoregressively predict the output $y$.

$$\mathbf{x}_t = \texttt{TxtEmbed}(x;\phi_t) \in \mathcal{X} \qquad (1)$$
$$\mathbf{v} = \texttt{ImgEncoder}(v;\phi_v) \in \mathcal{V} \qquad (2)$$
$$\mathbf{x}_v = \texttt{Project}(\mathbf{v};\phi_c) \in \mathcal{X} \qquad (3)$$
$$\mathbf{x} = \texttt{Fuse}(\mathbf{x}_v, \mathbf{x}_t;\phi_c) \qquad (4)$$
$$p_{\text{vlm}}(y \mid v, x) = \prod_{j=0}^{|y|} p_{\phi_t}(y_j \mid \mathbf{x}, y_{<j}) \qquad (5)$$

Different from prior fusion modules (e.g., linear projection in LLaVA, gated cross-attention in Flamingo, and Q-former in BLIP-2) that project the whole image features, we propose an object level encoder (§3.1) that captures fine-grained region features and speeds up training and inference.
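As a concrete reference, the following is a minimal sketch of Eqs. (1)-(5) in PyTorch-style code. The class and argument names (e.g., `MinimalVLM`, `d_vision`, `d_text`) are illustrative assumptions rather than the actual OLIVE implementation, and the fusion here is simple concatenation.

```python
import torch
import torch.nn as nn

class MinimalVLM(nn.Module):
    """Minimal sketch of Eqs. (1)-(5): embed, encode, project, fuse, decode."""

    def __init__(self, img_encoder, lm, d_vision=1024, d_text=4096):
        super().__init__()
        self.img_encoder = img_encoder               # frozen phi_v (e.g., a ViT)
        self.lm = lm                                 # phi_t, an autoregressive LM decoder
        self.project = nn.Linear(d_vision, d_text)   # projection inside phi_c, Eq. (3)

    def forward(self, pixel_values, text_embeds, labels=None):
        v = self.img_encoder(pixel_values)           # Eq. (2): patch-level features
        x_v = self.project(v)                        # Eq. (3): map to the text space
        x = torch.cat([x_v, text_embeds], dim=1)     # Eq. (4): concatenation as a simple Fuse
        # Eq. (5): the decoder autoregressively predicts y given the fused sequence;
        # labels are assumed to be padded/aligned to the fused sequence length.
        return self.lm(inputs_embeds=x, labels=labels)
```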

Visual Instruction Tuning

We adopt a similar visual instruction-tuning approach as Liu et al. (2023) by fine-tuning parts of the VLM parameters (e.g., $\phi_c$ and/or $\phi_t$) on instruction-following data. The training objective is based on maximum likelihood estimation for next-token predictions given the input image and the text prompt. Different from prior work using pure text prompts, our object encoder and retrieval module (§3.1, §3.2) enable the use of code-switched prompt sequences mixing text tokens and image object tokens, and rapid adaptation to unseen domains via in-context prediction.
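A hedged sketch of this training objective follows: next-token cross-entropy computed only over the response tokens, with the prompt positions masked out. The function name and tensor layout are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def instruction_tuning_loss(logits, target_ids, response_mask):
    """Next-token MLE over the response only; prompt tokens do not contribute.
    logits: (B, T, V); target_ids: (B, T); response_mask: (B, T), 1 on answer tokens."""
    shifted_logits = logits[:, :-1, :]            # position t predicts token t+1
    shifted_targets = target_ids[:, 1:]
    mask = response_mask[:, 1:].float()
    nll = F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        shifted_targets.reshape(-1),
        reduction="none",
    ).view_as(mask)
    return (nll * mask).sum() / mask.sum().clamp(min=1.0)
```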

3 Method

[Figure 1: Overview of the main components of OLIVE.]

This section, as well as Figure 1, highlights the main components of our method. We first design an object encoder (§3.1) to learn visual object embeddings in a shared vision-text space, then apply a similarity search over object embeddings to retrieve relevant visual objects (§3.2), and finally construct a code-switched multimodal prompt to integrate the retrieved object information for generation (§3.3).

3.1 Object Encoder

Following popular region-grounded models such as FERRET You et al. (2023), we allow for free-form annotation of objects using the object segmentation mask $\mathbf{o}_{\text{mask}}$ as input. Specifically, we first encode an image $v$ with a vision transformer Dosovitskiy et al. (2020) to obtain patch-level features $\mathbf{v}$:

๐ฏ๐ฏ\displaystyle\mathbf{v}bold_v=ImgEncoderโข(v;ฯ•v)โˆˆโ„(n2+1)ร—d,absentImgEncoder๐‘ฃsubscriptitalic-ฯ•๐‘ฃsuperscriptโ„superscript๐‘›21๐‘‘\displaystyle=\texttt{ImgEncoder}(v;\phi_{v})\in\mathbb{R}^{(n^{2}+1)\times d},= ImgEncoder ( italic_v ; italic_ฯ• start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT ) โˆˆ blackboard_R start_POSTSUPERSCRIPT ( italic_n start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + 1 ) ร— italic_d end_POSTSUPERSCRIPT ,(6)

where n๐‘›nitalic_n is the grid size and d๐‘‘ditalic_d is the dimension of hidden states. To further obtain an object level feature ๐ฏobjsubscript๐ฏobj\mathbf{v}_{\text{obj}}bold_v start_POSTSUBSCRIPT obj end_POSTSUBSCRIPT from the image, we first extract a subset of the image features ๐ฏmaskedsubscript๐ฏmasked\mathbf{v}_{\text{masked}}bold_v start_POSTSUBSCRIPT masked end_POSTSUBSCRIPT corresponding to the binary object segmentation mask ๐จmasksubscript๐จmask\mathbf{o}_{\text{mask}}bold_o start_POSTSUBSCRIPT mask end_POSTSUBSCRIPT:

$$\mathbf{v}_{\text{masked}} = \mathbf{v}[\texttt{Flatten}(\mathbf{o}_{\text{mask}})] \in \mathbb{R}^{l\times d} \qquad (7)$$

where ๐จmasksubscript๐จmask\mathbf{o}_{\text{mask}}bold_o start_POSTSUBSCRIPT mask end_POSTSUBSCRIPT is a nร—n๐‘›๐‘›n\times nitalic_n ร— italic_n binary matrix, indicating the corresponding image patches occupied by an object in the image, and l๐‘™litalic_l denotes the number of the occupied patches. These segmentation masks can be created by automatic segmentation tools such as SAMKirillov etal. (2023) or provided by human selection on the image. The segmentation mask is first flattened and used to select object patches ๐ฏmaskedsubscript๐ฏmasked\mathbf{v}_{\text{masked}}bold_v start_POSTSUBSCRIPT masked end_POSTSUBSCRIPT from ๐ฏ๐ฏ\mathbf{v}bold_v. Finally, we obtain the object embedding by compressing ๐ฏmaskedsubscript๐ฏmasked\mathbf{v}_{\text{masked}}bold_v start_POSTSUBSCRIPT masked end_POSTSUBSCRIPT into a single vector ๐ฏobjsubscript๐ฏobj\mathbf{v}_{\text{obj}}bold_v start_POSTSUBSCRIPT obj end_POSTSUBSCRIPT.

$$\mathbf{v}_{\text{obj}} = \texttt{ObjectEncoder}(\mathbf{v}_{\text{masked}};\phi_c) \in \mathbb{R}^{d}, \qquad (8)$$

where the object encoder uses a lightweight 2-layer transformer that acts similarly to a visual resampler Zou et al. (2023a); Li et al. (2023), followed by a learnable linear layer that further projects the visual representation into the text space Liu et al. (2023).
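One plausible reading of Eqs. (7)-(8) is sketched below: a single learnable query cross-attends to the masked patch features through a small transformer, and a linear layer maps the result into the LM's embedding space. The module names, dimensions, and the learnable-query design are assumptions for illustration, not the exact OLIVE architecture.

```python
import torch
import torch.nn as nn

class ObjectEncoder(nn.Module):
    """Sketch of Eqs. (7)-(8): select the object's patch features and compress
    them into a single vector with a small resampler-style transformer."""

    def __init__(self, d_vision=1024, d_text=4096, n_layers=2, n_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, d_vision))   # one learnable query vector
        layer = nn.TransformerDecoderLayer(d_vision, n_heads, batch_first=True)
        self.resampler = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.to_text_space = nn.Linear(d_vision, d_text)          # projection to the LM space

    def forward(self, patch_feats, obj_mask):
        # patch_feats: (n*n + 1, d) ViT outputs; obj_mask: (n, n) binary object mask.
        grid = patch_feats[1:]                                     # drop the [CLS] token
        v_masked = grid[obj_mask.flatten().bool()]                 # Eq. (7): (l, d)
        v_obj = self.resampler(self.query, v_masked.unsqueeze(0))  # cross-attend to object patches
        return self.to_text_space(v_obj[0, 0])                     # Eq. (8): one d_text vector
```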

3.2 Visual Object Retrieval

In many cases, the object of interest does not resemble anything seen during training. With our visual object embeddings, we can easily perform object level retrieval to match an open class of visual objects and integrate the retrieved information into the language decoder for predicting unseen or rare objects from specific domains (e.g., biomedicine). To this end, we assume access to a retrieval set $\mathcal{R}=\{(\mathbf{o}_i, d_i, v_i)\}_{i=1}^{m}$, where each triple consists of an object's segmentation mask $\mathbf{o}_i$, the object's text description $d_i$, and the image $v_i$ containing this object. To retrieve relevant objects from $\mathcal{R}$, we use a similar object encoding as in §3.1, except that we use the mean pooling of $\mathbf{v}_{\text{masked}}$ as the object encoder in Eq. (9), since this simple strategy does not require any learnable parameters for projection to the text embedding space, and visual object embeddings can be pre-computed before any fine-tuning. However, we use a learnable object encoder in Eq. (8) to connect object embeddings to the LM decoder during instruction-tuning for text generation (§3.3).

$$\mathbf{v}_{\text{obj}} = \texttt{MeanPool}(\mathbf{v}_{\text{masked}}) \in \mathbb{R}^{d}, \qquad (9)$$

During retrieval, we compute a query vector $\mathbf{v}_{\text{query}}$ for a given object, and compute the cosine similarity scores between $\mathbf{v}_{\text{query}}$ and all the visual object embeddings from $\mathcal{R}$ to obtain the top $k$ closest triples, denoted as $\mathcal{K}=\{(\mathbf{o}_i, d_i, v_i)\}_{i=1}^{k}$.
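A minimal sketch of this retrieval step is given below, assuming the retrieval-set embeddings have been pre-computed with the same mean pooling; the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def retrieve_topk(query_patch_feats, retrieval_embeddings, retrieval_records, k=5):
    """Sketch of the retrieval in Section 3.2: mean-pool the query object's patch
    features (Eq. 9) and return the k most cosine-similar entries of the retrieval set.
    retrieval_embeddings: (m, d) pre-computed object vectors;
    retrieval_records: list of m (mask, description, image) triples."""
    v_query = query_patch_feats.mean(dim=0)                                   # Eq. (9)
    sims = F.cosine_similarity(v_query.unsqueeze(0), retrieval_embeddings, dim=-1)
    scores, idx = sims.topk(k)
    return [(retrieval_records[i], s) for i, s in zip(idx.tolist(), scores.tolist())]
```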

3.3 In-context Prompt Construction

As the visual object embeddings are projected into the text embedding space of the LM decoder, this allows us to construct a code-switched prompt that mixes visual objects with text tokens for the LM decoder (e.g., Llama 2 Touvron et al. (2023)). In addition, as our object encoder compresses a visual object into a single vector $\mathbf{v}_{\text{obj}}$, this significantly shortens the length of the visual tokens that the LM decoder needs to fuse with text tokens. Therefore, we can easily integrate multiple retrieved object embeddings into the prompt to augment the LM decoder for in-context text generation. Specifically, we define a special vocabulary token [obj] which can be inserted flexibly into the user prompt $x$. For example, the user can ask "[obj] Describe this part of the image" to perform region-level description. The embedding of this token is directly replaced with its corresponding visual object embedding. Formally, given a text prompt $x$ that contains indexed [obj] tokens referring to an object $\mathbf{v}_{\text{obj}}$ of interest in an image $v$ and its relevant objects in $\mathcal{K}$, we define a prompting function that replaces the text embedding of each [obj] with its corresponding visual object embedding, and integrates the top $k$ most similar objects $\mathcal{K}$ as in-context examples. For example, a prompt with retrieved in-context examples can be "The top [k] related objects are: [obj_1] is a [label], ... [obj_k] is a [label]. [obj_query] What is this?". We provide more details about in-context prompt templates and construction in Appendix A.

$$\mathbf{x} = \texttt{Prompt}(x, \mathbf{v}_{\text{obj}}, \mathcal{K}) \qquad (10)$$

Finally, we feed the multimodal prompt $\mathbf{x}$ into the LM decoder for text generation following Eq. (5). Note that compared to prior VLMs (e.g., LLaVA) that directly fuse the patch-level features $\mathbf{v}$ of the whole image (Eq. 6), with object information scattered around different positions in $\mathbf{v}$, our object encoding is computationally more efficient and speeds up training that involves multiple in-context objects in the multimodal prompt.
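The sketch below illustrates one way to implement this code-switching, assuming a HuggingFace-style tokenizer with [obj] registered as a special token and access to the decoder's token-embedding layer; the helper name and arguments are assumptions for illustration.

```python
import torch

def build_multimodal_prompt(tokenizer, embed_tokens, text, obj_embeddings, obj_token="[obj]"):
    """Sketch of Eq. (10): tokenize a code-switched prompt and substitute the
    embedding of every [obj] placeholder with its object vector (in prompt order)."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    embeds = embed_tokens(ids).clone()                 # (T, d_text) text embeddings
    obj_id = tokenizer.convert_tokens_to_ids(obj_token)
    positions = (ids == obj_id).nonzero(as_tuple=True)[0]
    assert len(positions) == len(obj_embeddings), "one object vector per [obj] token"
    for pos, v_obj in zip(positions.tolist(), obj_embeddings):
        embeds[pos] = v_obj                            # code-switch: object vector replaces text embedding
    return embeds.unsqueeze(0)                         # (1, T, d_text) input for the LM decoder
```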

4 Experimental Settings

In this section, we first describe the two main object-level tasks for evaluation (§4.1) together with the datasets used (§4.2). We then describe the three variants of our model (§4.3), the training details (§4.4), and the other baselines in comparison (§4.5).

4.1 Object-level Tasks

Referring Object Classification

Given an object referred to by its image location (e.g., a segmentation mask or bounding box), the model is instructed to generate text that predicts the object's class label from a predefined label set $\mathcal{C}=\{c_1, c_2, \dots, c_n\}$. We provide the ground truth segmentation mask to eliminate localization errors and focus on evaluating the models' understanding of image objects.

Referring Expression Generation

Given an input image object referred to by a segmentation mask, the model is instructed to generate a natural language expression that semantically matches multiple ground-truth references $\mathcal{R}=\{r_1, r_2, \dots, r_m\}$. We use METEOR Banerjee and Lavie (2005) and CIDEr Vedantam et al. (2015) scores to evaluate the quality of the generated descriptions.

4.2 Datasets

This section describes the different datasets used in our experiments, with more details in Appendix C (Table 4).

Common Objects in Context (COCO)

Lin et al. (2014) is a popular visual reasoning dataset with over 800,000 object-level annotations for 80 categories of objects. We use it to train our model to understand region input since it contains high-quality segmentation annotations. We use the standardized train and validation 2017 splits for the detection task, and discard a few (<1%) small segmentation annotations that fail to be converted into a binary mask. Following Zhong et al. (2022), we evaluate in the setting where ground-truth segmentations are provided as input to eliminate localization errors. We use the standard metric of mean average precision (mAP) for object detection using the COCO API (https://github.com/cocodataset/cocoapi), as well as overall accuracy.
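For reference, mAP can be computed with the COCO API roughly as follows; the file paths are placeholders, and the predictions JSON is assumed to follow the standard COCO results schema.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Ground-truth annotations and model predictions (placeholder paths).
coco_gt = COCO("annotations/instances_val2017.json")
coco_dt = coco_gt.loadRes("predictions.json")   # [{"image_id", "category_id", "segmentation", "score"}, ...]

evaluator = COCOeval(coco_gt, coco_dt, iouType="segm")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()                           # prints AP, AP50, AP75, etc.
```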

refCOCOg

Kazemzadeh et al. (2014) is a variant of the COCO dataset with about 50,000 annotations for objects and their descriptions. We use the data to train our model to describe image regions and use their standardized train/validation split.

ChestX-Ray8 (CXR8)

Wang et al. (2017) is a medical dataset consisting of 108,948 frontal-view X-ray images. The image annotations for the 8 possible pathologies are text-mined from the radiology reports using NLP tools. A small subset of 984 images contains bounding box annotations of the pathology. We use this subset for our zero-shot domain adaptation experiments, splitting the data into a 16% retrieval set and 84% test data. The retrieval set consists of 20 examples of each pathology, and we use overall accuracy as the evaluation metric.
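A simple sketch of this split follows, assuming each annotated example is an (image_id, bbox, pathology) triple; the function name and seed are illustrative.

```python
import random
from collections import defaultdict

def split_cxr8(examples, per_class=20, seed=0):
    """Hold out `per_class` boxes per pathology as the retrieval set
    (8 x 20 = 160 examples) and use the remaining 824 as test data."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for image_id, bbox, pathology in examples:
        by_class[pathology].append((image_id, bbox, pathology))
    retrieval, test = [], []
    for items in by_class.values():
        rng.shuffle(items)
        retrieval.extend(items[:per_class])
        test.extend(items[per_class:])
    return retrieval, test
```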

4.3 OLIVE Variants

OLIVE-R (Retrieval-only)

This retrieval-only method predicts the answer to the user question by taking a majority vote of the top $k$ retrieved examples. For simplicity, we fix $k=5$ for this setting unless otherwise specified and analyze the effect of $k$ in Figure 6. Although simple, this baseline proves to be effective and provides salient additional context as described in §4.3. However, this discriminative model does not allow for free-form text generation for tasks such as region captioning.
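A minimal sketch of OLIVE-R's prediction rule, reusing the retrieval helper sketched in §3.2; the names are illustrative.

```python
from collections import Counter

def olive_r_predict(retrieved, k=5):
    """Majority vote over the labels of the top-k retrieved objects.
    `retrieved` is a score-sorted list of ((mask, description, image), score) pairs."""
    labels = [record[1] for record, _ in retrieved[:k]]   # the text description/label d_i
    return Counter(labels).most_common(1)[0][0]
```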

OLIVE-G (Generative-only)

This model is trained to generate free-form text based solely on the user question and corresponding object features. We omit the retrieved information to observe the capability of the standalone object representations. We find that even without retrieval, the model can learn to perform more challenging object-level tasks such as region description. The final decoder input can be expressed as a variant of Eq. (10):

$$\mathbf{x} = \texttt{Prompt}(x, \mathbf{v}_{\text{obj}}). \qquad (11)$$

OLIVE-RG (Full)

Our full model generates text outputs based on in-context object examples from retrieval. The multimodal in-context prompt is constructed using Eq. (10). This prompt includes the retrieved object features, their labels, and their similarity scores. The exact construction can be found in Appendix A. The top $k$ retrieved multimodal documents in $\mathcal{K}$ are obtained using the same retrieval described in §3.2 and ordered in increasing relevance score. Both OLIVE-G and OLIVE-RG use greedy decoding for text generation.

4.4 Training Details

Our model uses a frozen ViT-L/14 vision transformer from a CLIP model to obtain patch-level features. For our LLM backbone, we use either Llama 2-7B or GPT-2 (124M) Radford et al. (2019). The LLM is fine-tuned with LoRA Hu et al. (2021), as we find this improves model performance. We use the train splits of two region-level datasets (i.e., COCO and refCOCOg) as our training data for their respective tasks, and evaluate models on their corresponding validation splits because their test data does not have object-level annotations. More details are in Table 7, and we leave further hyperparameter search to future exploration. We additionally find that we can train a multi-task model by combining the datasets for all object-level tasks (details in Appendix E).
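As a rough illustration, LoRA adapters can be attached to the decoder with the peft library as sketched below; the rank, alpha, and target modules are illustrative choices rather than the paper's exact configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

lm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attach low-rank adapters to attention projections
    task_type="CAUSAL_LM",
)
lm = get_peft_model(lm, lora_cfg)
lm.print_trainable_parameters()            # only the adapters (plus the object encoder) are trained
```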

4.5 Other Baselines in Comparison

CLIP

Radford et al. (2021) Contrastive Language-Image Pretraining (CLIP) learns a joint vision-language space between images and their matching captions. We use this method for zero-shot object classification by predicting the class label with the highest cosine similarity to the cropped region.
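A sketch of this crop-and-score baseline using the HuggingFace CLIP interface is shown below; the checkpoint name and prompt template are illustrative assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_crop_classify(image: Image.Image, box, class_names):
    """Crop the referred region and pick the class with the highest image-text similarity."""
    crop = image.crop(box)                                    # box = (x0, y0, x1, y1)
    prompts = [f"a photo of a {c}" for c in class_names]
    inputs = processor(text=prompts, images=crop, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    return class_names[out.logits_per_image.argmax(dim=-1).item()]
```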

BioMedCLIP

Zhang et al. (2023a) The authors train a CLIP model aligned to biomedical image-text pairs, achieving state-of-the-art performance on a variety of medical tasks. We use this model as a baseline for object classification in the medical domain.

RegionCLIP

Zhong et al. (2022) This model learns region-text level alignment through soft labels obtained from CLIP. We use it for referring object detection based on ROIAlign features.

Kosmos 2

Peng et al. (2024) This generative VLM trains an LLM decoder to perform a variety of visual grounding tasks using their newly introduced grounded image-text (GRIT) dataset. We compare with their results on referring expression generation on the refCOCOg dataset.

Flamingo

Alayrac et al. (2022) This generative model learns to connect frozen visual features and LLMs by training on interleaved image-text data. We evaluate Flamingo's few-shot performance on referring expression generation on cropped image regions. We use an open-source implementation trained on the multimodal C4 Zhu et al. (2023b) and LAION-2b Schuhmann et al. (2022) datasets.

5 Results and Analysis

[Figure 2]


Table 2: Referring object classification accuracy on the CXR8 dataset in the classification and generative settings.

| Method Type | Pre-Training Data | Method | Accuracy |
|---|---|---|---|
| Classification | None | OLIVE-R | 33.5 |
| | PMC-15 | BioMedCLIP | 32.5 |
| | PMC-15 | BioMedCLIP_crop | 23.3 |
| | CLIP400M | CLIP | 14.0 |
| | None | Random Guess | 12.5 |
| | CLIP400M | CLIP_crop | 11.2 |
| Generative | COCO | OLIVE-RG | 31.2 |
| | C4 + LAION-2b | Flamingo-9B | 12.5 |
| | COCO | OLIVE-G | 0.0 |

5.1 Referring Object Classification

Unseen Object Classification

One of the benefits of our retrieval-augmented system is its rapid generalization to unseen visual concepts. We estimate this capability by training on the COCO dataset and evaluating object classification on an unseen medical dataset that has drastically different types of images and limited training data. Table 2 shows the performance of our method on the CXR8 dataset in either a classification or generative setting. Even with as few as 20 examples per class in the medical retrieval set, OLIVE-R achieves competitive performance compared to domain-adapted models (i.e., BioMedCLIP), which we hypothesize is because of our region-level retrieval and in-distribution retrieval set. We also note that our generative approach OLIVE-RG can utilize the retrieved in-context examples and achieve similar performance to BioMedCLIP, despite only being trained on COCO images. Without retrieval, the generative model fails catastrophically with 0% accuracy, and zero-shot CLIP performs about as well as random guessing.

Rare Object Classification

We also investigate our model's performance on rare but seen objects. Figure 3 shows our method's performance on the five rarest classes in the COCO dataset. For OLIVE-G and OLIVE-RG, we use a 224-pixel-resolution visual encoder to match the CLIP visual encoder. OLIVE-G tends to have lower performance on the rare classes. However, when combining retrieval with parameterized methods in OLIVE-RG and OLIVE-RG-336px, the performance on rare classes improves significantly, with OLIVE-RG-336px performing better than CLIP on all rare classes. OLIVE-RG also achieves better performance on three out of five classes despite being trained on less data. Our model's overall performance can be found in Table 5 (Appendix D).

[Figure 3]

5.2 Referring Expression Generation

Captioning Unseen Objects

In addition to referring object classification, we investigate our model's ability to caption out-of-distribution objects. Figure 2 illustrates an example of asking our model to describe animals not seen during training. Without retrieval, OLIVE-G fails to describe the shark and turtle. However, after manually adding just 5 labeled objects of turtles and sharks to the existing retrieval set, OLIVE-RG accurately describes the object and provides supporting examples for its prediction. The label description for each object in the retrieval set is only the name of the animal, but the model generates additional characteristics in its description. Appendix B shows more examples of zero-shot adaptation to unseen visual concepts in the object classification setting.

[Figure 4]

Challenging Visual Context

To test the quality of the representations generated from our object encoder, we qualitatively evaluate our model's predictions in adversarial visual contexts. Figure 7 shows a white dog and a black cat in a "yin-yang" shape. We observe that free-form annotation allows for more precise user queries and object descriptions, and illustrates other properties such as scene content awareness and patch-level detail, as shown in Appendix B. While many VLMs can accurately understand normal scenes, Figure 4 illustrates an example in which an object-level representation may be necessary, with recent works struggling to caption the snowboarder on the beach. The detailed performance of our model on the refCOCOg captioning task can be found in Table 6 (Appendix F).

5.3 In-context Example Size

[Figure 5]

Since our method omits image patch features and compresses object information into a single vector, it can process many objects from different images at once. In Figure 5, we highlight the difference in context length for various methods when prompted with multimodal in-context examples. We assume an average prompt length of 30 tokens accompanying each in-context image example for all models. Even approaches designed for interleaved image-text data such as Flamingo insert multiple latent vectors for each image, incurring a higher cost than our approach.
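A back-of-the-envelope version of this comparison is sketched below, using the 30-token-per-example assumption from Figure 5; the per-image visual token counts for the other approaches are typical published values (e.g., 64 resampler latents, 576 patch tokens) used here only as illustrative assumptions.

```python
# Rough context-length comparison for k in-context object examples.
TOKENS_PER_IMAGE = {
    "OLIVE (single object vector)": 1,
    "Resampler-style (e.g., 64 latents per image)": 64,
    "Full patch features (e.g., 576 patches per image)": 576,
}

def context_length(k, visual_tokens_per_example, text_tokens=30):
    """Total prompt length when each in-context example carries ~30 text tokens."""
    return k * (visual_tokens_per_example + text_tokens)

for name, v in TOKENS_PER_IMAGE.items():
    print(f"{name}: {context_length(8, v)} tokens for 8 in-context examples")
```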

5.4 Sensitivity to Retrieval: Coverage and $k$

[Figure 6]

In Figure 6 we analyze the effect of changing the size of the object retrieval set as well as the number of retrieved examples $k$. To thoroughly test various settings, we evaluate the retrieval-only approach (OLIVE-R) on the validation split of the COCO dataset using different-sized subsets of the training data for retrieval. We ensure the retrieval set contains an equal amount of each object class when possible. Our results indicate that the optimal value of $k$ depends on the size of the retrieval set. With a small retrieval set (red), performance is lower and the optimal $k$ tends to be smaller. Larger retrieval sets (blue, green) benefit from retrieving more examples and achieve greater performance.

[Figure 7]

5.5 Object Vector Visualization

Having a single vector representation for each object allows for visualization using dimensionality reduction. In Figure 8, we perform principal component analysis (PCA) on the hidden states of object vectors at different layers in the LLM decoder. We plot 200 examples from each of 10 object categories and note several patterns. First, objects from the same class tend to appear together, even though they appear in different visual contexts. This suggests that the object encoder has semantic understanding of the visual concepts. Second, the object vectors naturally form hierarchical clusters where objects from the same super class such as vehicle, animal, or fruit have overlapping clusters. Lastly, the clustering appears similar across all layers, with only minor variations.
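The analysis in Figure 8 can be reproduced with a few lines of standard tooling, as sketched below; the function name and plotting details are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_object_vectors(hidden_states, labels):
    """Project object-vector hidden states from one decoder layer to 2D with PCA
    and color the points by object class (200 examples per class in Figure 8).
    hidden_states: (N, d) array; labels: length-N list of class names."""
    coords = PCA(n_components=2).fit_transform(np.asarray(hidden_states))
    labels = np.array(labels)
    for cls in sorted(set(labels.tolist())):
        pts = coords[labels == cls]
        plt.scatter(pts[:, 0], pts[:, 1], s=8, label=cls)
    plt.legend(fontsize=6)
    plt.show()
```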

[Figure 8]

6 Related Work

Grounding in Language and Vision

A popular approach for aligning vision and language embeddings is contrastive learning, as in CLIP and ALIGN Li et al. (2021). However, these methods align the entire image representation, leading to poor reasoning on image details for downstream vision-language tasks. RegionCLIP Zhong et al. (2022) and GLIP Li et al. (2022) address this issue by proposing fine-grained alignment with region-text pairs during pretraining. GLIPv2 Zhang et al. (2022) further improves the pretraining and alignment by introducing localization, detection, and other tasks. Another recent popular approach involves training models on automatically curated region-level data from image-caption pairs Peng et al. (2024). Many other works focus on region-level alignment during pretraining for greater vision-language understanding You et al. (2023); Chen et al. (2023); Zeng et al. (2022b, a). More generally, a recent study Bugliarello et al. (2023) shows that VLMs with fine-grained object-level pretraining such as X-VLM Zeng et al. (2022a) have better reasoning ability. Other works align vision and language using regularization or losses that create relation-aware cross-attention between modalities Pandey et al. (2023); Ren et al. (2021).

Visual Resampling

Visual resampling is a popular technique to compress long sequences of image features into a few rich vector representations. This is achieved by constructing a fixed number of learnable vectors that attend to the visual features through cross-attention layers. Models such as BLIP-2 Li et al. (2023) first explored this idea to connect frozen vision features to LLMs efficiently by summarizing the content of the image. Other methods, including X-Decoder Zou et al. (2023a) and SEEM Zou et al. (2023b), use resampling to encode various types of prompts or intents, which improves the LLM's decoding ability. Additionally, works such as Flamingo Alayrac et al. (2022) and Qwen-VL Bai et al. (2023) show that multiple images can be inserted in-context into the prompt by compressing image features with resamplers, enabling few-shot capabilities. Our work visually resamples object representations for object-conditioned text generation, and uses only a single vector for the representation. This allows for more fine-grained reasoning and longer in-context prompting.

Retrieval Augmented VLMs

In the text domain, learning to retrieve relevant documents to enhance the LLM query Guu et al. (2020) has been explored extensively Wang et al. (2023). Recent VLM works follow a similar approach, retrieving multimodal documents to improve performance on knowledge-intensive tasks and generalization to rare situations. Gao et al. (2022) summarize visual content into natural language to use as a query for dense passage retrieval. MuRAG Chen et al. (2022) proposes a multimodal image-text memory bank to help models answer challenging knowledge-based visual questions such as "What shape is the pediment on top of the white house?" REVEAL Hu et al. (2023) and RA-VQA Lin and Byrne (2022) learn a trainable multimodal retriever similar to REALM Guu et al. (2020) during pretraining to fetch relevant documents for answering questions, achieving state-of-the-art performance on datasets such as VQAv2 Antol et al. (2015) and OKVQA Schwenk et al. (2022). To the best of our knowledge, we are the first to integrate region-level retrieval with LLMs, in which the multimodal documents are indexed by object-level visual features.

7 Conclusion

We present a simple approach to insert object level visual embeddings into large language model decoders, enabling object level reasoning with flexible prompt structure. Our object encoder compresses fine-grained region level information into a single vector, enabling in-context prompting with objects from multiple images and more efficient training and inference. In addition, we introduce the idea of region retrieval, which allows for precise queries free of image background noise and rapid generalization to rare and unseen objects with no parameter updates. We hope our method may help researchers design vision language models which can adapt to their needs by simply updating the retrieval set or object encoder, while also being responsive to varying user intents using LLM prompting techniques.

8 Limitations

While our approach provides a flexible way for users to supply object-level prompts, it does not output bounding boxes or other region-level grounding. This may be addressed in future research by further fine-tuning on region-level instruction-tuning data as done in FERRET You et al. (2023), GLAMM Rasheed et al. (2023), and other region-level VLM pretraining. At the moment, we also do not explore generic image tasks such as VQA or image captioning. However, a potential solution is to use our object encoder to connect to existing VLMs (e.g., LLaVA) that excel at these tasks. Lastly, our results in the retrieval setting depend on the quality of the retrieved examples. Curating a high-quality retrieval set at the object level can be challenging. However, existing tools such as GLIPv2 Zhang et al. (2022) allow for semi-automatic generation of region-level data, as used by KOSMOS-2 Peng et al. (2024) in developing the GRIT dataset.

9 Ethical Considerations

Biases From Pretrained LLMs

Since our model uses existing pretrained LLMs such as Llama 2 or GPT2, it may inherit some of the social biases or toxicity acquired during their pretraining stages. While Llama 2 undergoes extensive alignment to human values through reinforcement learning from human feedback (RLHF) Griffith et al. (2013), some of these toxic behaviors may still be present in the aligned model. We make sure to only use images of common objects in the COCO dataset, which, to the best of our knowledge, do not contain any of these biases or violent scenes. Nevertheless, further testing to ensure the impartiality of the model may be necessary before deploying it in widespread technologies.

Domain Adaptation

Some of our experiments involve evaluating our model in a data-scarce domain in a zero-shot manner with in-context prompting. While this is a promising direction for efficient domain adaptation, users should take caution in directly using model predictions, as this is a challenging task due to distribution shift. We encourage human-in-the-loop interaction to sanity-check the outputs. Different from other ICL prompting methods, we provide retrieved examples and similarity scores that can help determine the trustworthiness of the model prediction, which may be valuable for high-risk domains such as medicine.

Acknowledgement

Ossowski and Hu are supported by the Wisconsin Alumni Research Foundation and the National Institute Of Biomedical Imaging And Bioengineering of the National Institutes of Health under Award Number R01EB033782. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

References

  • Alayrac etal. (2022)Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr,Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds,etal. 2022.Flamingo: a visual language model for few-shot learning.Advances in Neural Information Processing Systems,35:23716โ€“23736.
  • Antol etal. (2015)Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra,CLawrence Zitnick, and Devi Parikh. 2015.Vqa: Visual question answering.In Proceedings of the IEEE international conference on computervision, pages 2425โ€“2433.
  • Bai etal. (2023)Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, JunyangLin, Chang Zhou, and Jingren Zhou. 2023.Qwen-vl: A frontier large vision-language model with versatileabilities.arXiv preprint arXiv:2308.12966.
  • Banerjee and Lavie (2005)Satanjeev Banerjee and Alon Lavie. 2005.Meteor: An automatic metric for mt evaluation with improvedcorrelation with human judgments.In Proceedings of the acl workshop on intrinsic and extrinsicevaluation measures for machine translation and/or summarization, pages65โ€“72.
  • Bugliarello etal. (2023)Emanuele Bugliarello, Laurent Sartran, Aishwarya Agrawal, LisaAnne Hendricks,and Aida Nematzadeh. 2023.Measuring progress in fine-grained vision-and-language understanding.In Proceedings of the 61st Annual Meeting of the Associationfor Computational Linguistics.
  • Cai etal. (2023)MuCai, Haotian Liu, SivaKarthik Mustikovela, GregoryP Meyer, Yuning Chai,Dennis Park, and YongJae Lee. 2023.Making large multimodal models understand arbitrary visual prompts.arXiv preprint arXiv:2312.00784.
  • Chen etal. (2023)Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao.2023.Shikra: Unleashing multimodal llmโ€™s referential dialogue magic.arXiv preprint arXiv:2306.15195.
  • Chen etal. (2022)Wenhu Chen, Hexiang Hu, XiChen, Pat Verga, and WilliamW Cohen. 2022.Murag: Multimodal retrieval-augmented generator for open questionanswering over images and text.In Proceedings of the 2022 Conference on Empirical Methods inNatural Language Processing.
  • Dosovitskiy etal. (2020)Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn,Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, GeorgHeigold, Sylvain Gelly, etal. 2020.An image is worth 16x16 words: Transformers for image recognition atscale.arXiv preprint arXiv:2010.11929.
  • Gao etal. (2022)Feng Gao, Qing Ping, Govind Thattai, Aishwarya Reganti, YingNian Wu, and PremNatarajan. 2022.Transform-retrieve-generate: Natural language-centricoutside-knowledge visual question answering.In Proceedings of the IEEE/CVF Conference on Computer Visionand Pattern Recognition, pages 5067โ€“5077.
  • Griffith etal. (2013)Shane Griffith, Kaushik Subramanian, Jonathan Scholz, CharlesL Isbell, andAndreaL Thomaz. 2013.Policy shaping: Integrating human feedback with reinforcementlearning.Advances in neural information processing systems, 26.
  • Guu etal. (2020)Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020.Retrieval augmented language model pre-training.In International Conference on Machine Learning, pages3929โ€“3938. PMLR.
  • He etal. (2017)Kaiming He, Georgia Gkioxari, Piotr Dollรกr, and Ross Girshick. 2017.Mask r-cnn.In Proceedings of the IEEE international conference on computervision, pages 2961โ€“2969.
  • Hu etal. (2021)EdwardJ Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, SheanWang, LuWang, and Weizhu Chen. 2021.Lora: Low-rank adaptation of large language models.arXiv preprint arXiv:2106.09685.
  • Hu etal. (2023)Ziniu Hu, Ahmet Iscen, Chen Sun, Zirui Wang, Kai-Wei Chang, Yizhou Sun,Cordelia Schmid, DavidA Ross, and Alireza Fathi. 2023.Reveal: Retrieval-augmented visual-language pre-training withmulti-source multimodal knowledge memory.In Proceedings of the IEEE/CVF Conference on Computer Visionand Pattern Recognition, pages 23369โ€“23379.
  • Kazemzadeh etal. (2014)Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. 2014.Referitgame: Referring to objects in photographs of natural scenes.In Proceedings of the 2014 conference on empirical methods innatural language processing (EMNLP), pages 787โ€“798.
  • Kirillov etal. (2023)Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, LauraGustafson, Tete Xiao, Spencer Whitehead, AlexanderC Berg, Wan-Yen Lo, etal.2023.Segment anything.arXiv preprint arXiv:2304.02643.
  • Li etal. (2023)Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023.Blip-2: Bootstrapping language-image pre-training with frozen imageencoders and large language models.In International Conference on Machine Learning.
  • Li etal. (2021)Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong,and Steven ChuHong Hoi. 2021.Align before fuse: Vision and language representation learning withmomentum distillation.Advances in neural information processing systems,34:9694โ€“9705.
  • Li etal. (2022)LiunianHarold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li,Yiwu Zhong, Lijuan Wang, LuYuan, Lei Zhang, Jenq-Neng Hwang, etal. 2022.Grounded language-image pre-training.In Proceedings of the IEEE/CVF Conference on Computer Visionand Pattern Recognition, pages 10965โ€“10975.
  • Lin etal. (2014)Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, DevaRamanan, Piotr Dollรกr, and CLawrence Zitnick. 2014.Microsoft coco: Common objects in context.In Computer Visionโ€“ECCV 2014: 13th European Conference,Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages740โ€“755. Springer.
  • Lin and Byrne (2022)Weizhe Lin and Bill Byrne. 2022.Retrieval augmented visual question answering with outside knowledge.In Proceedings of the 2022 Conference on Empirical Methods inNatural Language Processing.
  • Liu etal. (2023)Haotian Liu, Chunyuan Li, Qingyang Wu, and YongJae Lee. 2023.Visual instruction tuning.arXiv preprint arXiv:2304.08485.
  • Pandey etal. (2023)Rohan Pandey, Rulin Shao, PaulPu Liang, Ruslan Salakhutdinov, andLouis-Philippe Morency. 2023.Cross-modal attention congruence regularization for vision-languagerelation alignment.In Proceedings of the 61st Annual Meeting of the Associationfor Computational Linguistics (Volume 1: Long Papers), pages 5444โ€“5455,Toronto, Canada. Association for Computational Linguistics.
  • Peng etal. (2024)Zhiliang Peng, Wenhui Wang, LiDong, Yaru Hao, Shaohan Huang, Shuming Ma, andFuru Wei. 2024.Kosmos-2: Grounding multimodal large language models to the world.In Proceedings of the Twelfth International Conference onLearning Representations (ICLR).
  • Radford etal. (2021)Alec Radford, JongWook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh,Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark,etal. 2021.Learning transferable visual models from natural languagesupervision.In International conference on machine learning, pages8748โ€“8763. PMLR.
  • Radford etal. (2019)Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, IlyaSutskever, etal. 2019.Language models are unsupervised multitask learners.OpenAI blog, 1(8):9.
  • Rasheed etal. (2023)Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan,Hisham Cholakkal, RaoM Anwer, Erix Xing, Ming-Hsuan Yang, and FahadS Khan.2023.Glamm: Pixel grounding large multimodal model.arXiv preprint arXiv:2311.03356.
  • Ren etal. (2021)Shuhuai Ren, Junyang Lin, Guangxiang Zhao, Rui Men, AnYang, Jingren Zhou,XuSun, and Hongxia Yang. 2021.Learning relation alignment for calibrated cross-modal retrieval.arXiv preprint arXiv:2105.13868.
  • Schuhmann etal. (2022)Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, RossWightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, MitchellWortsman, etal. 2022.Laion-5b: An open large-scale dataset for training next generationimage-text models.Advances in Neural Information Processing Systems,35:25278โ€“25294.
  • Schwenk etal. (2022)Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, andRoozbeh Mottaghi. 2022.A-okvqa: A benchmark for visual question answering using worldknowledge.In European Conference on Computer Vision, pages 146โ€“162.Springer.
  • Senkaiahliyan etal. (2023)Senthujan Senkaiahliyan, Augustin Toma, Jun Ma, An-Wen Chan, Andrew Ha, KevinRAn, Hrishikesh Suresh, Barry Rubin, and BoWang. 2023.Gpt-4v (ision) unsuitable for clinical care and education: Aclinician-evaluated assessment.medRxiv, pages 2023โ€“11.
  • Touvron etal. (2023)Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, YasmineBabaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale,etal. 2023.Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288.
  • Vedantam etal. (2015)Ramakrishna Vedantam, CLawrenceZitnick, and Devi Parikh. 2015.Cider: Consensus-based image description evaluation.In Proceedings of the IEEE conference on computer vision andpattern recognition, pages 4566โ€“4575.
  • Wang etal. (2023)Liang Wang, Nan Yang, and Furu Wei. 2023.Learning to retrieve in-context examples for large language models.arXiv preprint arXiv:2307.07164.
  • Wang etal. (2017)Xiaosong Wang, Yifan Peng, LeLu, Zhiyong Lu, Mohammadhadi Bagheri, andRonaldM Summers. 2017.Chestx-ray8: Hospital-scale chest x-ray database and benchmarks onweakly-supervised classification and localization of common thorax diseases.In Proceedings of the IEEE conference on computer vision andpattern recognition, pages 2097โ€“2106.
  • Wu etal. (2022)Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan,and Lijuan Wang. 2022.Grit: A generative region-to-text transformer for objectunderstanding.arXiv preprint arXiv:2212.00280.
  • Ye etal. (2023) Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chaoya Jiang, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qian Qi, Ji Zhang, and Fei Huang. 2023. mPLUG-Owl: Modularization empowers large language models with multimodality.
  • You etal. (2023)Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang,Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. 2023.Ferret: Refer and ground anything anywhere at any granularity.arXiv preprint arXiv:2310.07704.
  • Yu etal. (2017)Licheng Yu, Hao Tan, Mohit Bansal, and TamaraL Berg. 2017.A joint speaker-listener-reinforcer model for referring expressions.In Proceedings of the IEEE conference on computer vision andpattern recognition, pages 7282โ€“7290.
  • Zareian etal. (2021)Alireza Zareian, KevinDela Rosa, DerekHao Hu, and Shih-Fu Chang. 2021.Open-vocabulary object detection using captions.In Proceedings of the IEEE/CVF Conference on Computer Visionand Pattern Recognition, pages 14393โ€“14402.
  • Zeng etal. (2022a)Yan Zeng, Xinsong Zhang, and Hang Li. 2022a.Multi-grained vision language pre-training: Aligning texts withvisual concepts.In Proceedings of the Thirty-ninth International Conference onMachine Learning.
  • Zeng etal. (2022b) Yan Zeng, Xinsong Zhang, Hang Li, Jiawei Wang, Jipeng Zhang, and Wangchunshu Zhou. 2022b. $X^2$-VLM: All-in-one pre-trained model for vision-language tasks. arXiv preprint arXiv:2211.12402.
  • Zhang etal. (2022)Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Li, XiyangDai, Lijuan Wang, LuYuan, Jenq-Neng Hwang, and Jianfeng Gao. 2022.Glipv2: Unifying localization and vision-language understanding.Advances in Neural Information Processing Systems,35:36067โ€“36080.
  • Zhang etal. (2023a)Sheng Zhang, Yanbo Xu, Naoto Usuyama, Jaspreet Bagga, Robert Tinn, Sam Preston,Rajesh Rao, MuWei, Naveen Valluri, Cliff Wong, etal. 2023a.Large-scale domain-specific pretraining for biomedicalvision-language processing.arXiv preprint arXiv:2303.00915.
  • Zhang etal. (2023b)Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, KaiChen, and Ping Luo. 2023b.Gpt4roi: Instruction tuning large language model onregion-of-interest.arXiv preprint arXiv:2307.03601.
  • Zhong etal. (2022)Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella,LiunianHarold Li, Luowei Zhou, Xiyang Dai, LuYuan, Yin Li, etal. 2022.Regionclip: Region-based language-image pretraining.In Proceedings of the IEEE/CVF Conference on Computer Visionand Pattern Recognition, pages 16793โ€“16803.
  • Zhu etal. (2023a)Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny.2023a.Minigpt-4: Enhancing vision-language understanding with advancedlarge language models.arXiv preprint arXiv:2304.10592.
  • Zhu etal. (2023b)Wanrong Zhu, Jack Hessel, Anas Awadalla, SamirYitzhak Gadre, Jesse Dodge, AlexFang, Youngjae Yu, Ludwig Schmidt, WilliamYang Wang, and Yejin Choi.2023b.Multimodal c4: An open, billion-scale corpus of images interleavedwith text.In Advances in Neural Information Processing Systems (D&B).
  • Zou etal. (2023a)Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, XiyangDai, Harkirat Behl, Jianfeng Wang, LuYuan, etal. 2023a.Generalized decoding for pixel, image, and language.In Proceedings of the IEEE/CVF Conference on Computer Visionand Pattern Recognition, pages 15116โ€“15127.
  • Zou etal. (2023b)Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Gao, andYongJae Lee. 2023b.Segment everything everywhere all at once.In Advances in Neural Information Processing Systems (poster).

Appendix A Prompt Templates

Table 3: Prompts used to instruct the LLM decoder.

Object Classification
- ICL prompt for retrieved examples: "You are a helpful vision assistant trained to help people analyze images. The top [k] related objects are: [obj] is a [label] with confidence [score] ... [obj] is a [label] with confidence [score]. [vanilla prompt]"
- Vanilla prompts: "[obj] What is this? Answer in 1-2 words"; "[obj] What is this object? Answer with a short word or phrase."; "[obj] Identify this object."; "Here is an object [obj]. What is this? Answer with a short word or phrase."

Region Description
- ICL prompt for retrieved examples: "You are a helpful vision assistant trained to help people analyze images. The top [k] related objects are: [obj] is a [label] with confidence [score] ... [obj] is a [label] with confidence [score]. [vanilla prompt]"
- Vanilla prompts: "[obj] Briefly describe this image region."; "[obj] Describe this part of the image."; "[obj] Share some details about what's happening here in the image."; "[obj] Break down what you see in this particular part of the picture."; "[obj] Describe what you notice in this area of the picture."

Table 3 contains all the prompts we use to instruct the LLM decoder.

Appendix B Qualitative Examples

[Figures 9-12]

Here we include several selected examples showcasing the strengths and weaknesses of our approach.

Visual Concept Generalization

In Figure 9 we demonstrate more examples of rapid generalization to new visual concepts. Many existing methods confidently predict concepts from their pretraining, while ours can predict new concepts on the fly.

Scene Content Awareness

Even though our object representation involves masking out image patch features from other parts of the image, we have observed that the object vector still contains information about its surroundings. Figure 10 illustrates this phenomenon, where OLIVE can include the cow in its description, despite not including any image patches corresponding to the cow in the user selection.

Patch level Detail

Our method can also identify and describe small objects at the patch level. Figure 11 shows an example of object classification on smaller objects.

Describing Partially Visible Objects

We notice that our model can make mistakes when describing occluded or partially visible objects as seen in Figure 12. We hypothesize that the training data of refCOCOg does not include these kinds of image regions, which also limits its availability in retrieval data. This may be addressed with larger-scale pre-training on data such as GRIT which likely includes more occluded objects.

Errors in Detailed Description

While our model can identify the object most of the time, it sometimes gets minor details incorrect, for example the colors of a shirt or other piece of clothing, as seen in Figure 12. This may be due to the extreme compression into a single vector. Future work may consider visually resampling the object features into more than one latent vector for detailed captioning, while still using the single-vector representation for retrieval.

Appendix C Dataset Information

Table 4: Dataset splits used in our training and evaluation.

| Dataset | Train Split | Validation Split | Retrieval Set (Train Split) | Retrieval Set (Test Split) | Number of Classes |
|---|---|---|---|---|---|
| COCO | 849,586 | 36,320 | 849,586 | 849,586 | 80 |
| refCOCOg | 44,822 | 5,000 | 849,586 | 849,586 | - |
| CXR8 | - | 824 | - | 160 | 8 |

Table 4 provides more details on the dataset splits used in our training and evaluation. Our COCO train and validation splits are slightly smaller than usual because of our approach of using segmentation masks: we omit some excessively small segmentations, which account for less than 1% of the data. For tasks that require training (COCO and refCOCOg), we use the train split of the COCO object detection dataset as our retrieval data. We make sure to omit the closest match when training object detection on COCO with retrieval to avoid label leakage. We also confirm that no images from the training split are repeated in the validation split for either dataset.

Appendix D Referring Object Classification

Table 5: Referring object classification results on the COCO validation set.

| Method Type | Method | Accuracy | mAP |
|---|---|---|---|
| Classification | OLIVE-R | 64.1 | 40.5 |
| | CLIP ViT-L/14 | 40.9 | 45.1 |
| | RegionCLIP RN50 | - | 61.4 |
| | OVR | - | 44.5 |
| Generative | OLIVE-G (GPT2) | 76.6 | 60.4 |
| | OLIVE-G (Llama 2) | 76.8 | 60.3 |
| | OLIVE-RG (GPT2) | 74.8 | 57.5 |
| | OLIVE-RG (Llama 2) | 74.1 | 56.2 |

This task requires the LLM to predict the object class label given a ground-truth input annotation (e.g., bounding box, segmentation, etc.). We follow a similar evaluation protocol to Zhong et al. (2022) and Zareian et al. (2021), in which the ground-truth annotation is supplied to avoid localization error. Table 5 shows the overall referring object classification accuracy and mAP for our methods (to simplify the calculation, we assign a confidence score of 1 to each prediction; the reported mAP may be lower than the true value when using more accurate probabilities). We observe several findings. First, although retrieved examples help with domain adaptation and rare objects, they do not improve overall in-domain performance. Second, both the Llama 2 and GPT2 baselines have similar performance on the task, suggesting that even smaller models can learn vision-language grounding. Lastly, even our retrieval-only baseline, which requires no training, has better accuracy than some parameterized methods such as CLIP.

[Figure 13]

Appendix E Multi-Task Model

We also explore the possibility of training a multi-task model using a curriculum learning strategy similar to LLaVA Liu et al. (2023). We first train the model on the referring object classification task to perform object-word level alignment. The model is then trained on the referring expression generation task, and finally on an object instruction-following dataset Cai et al. (2023) with many different tasks. For each stage of training, we formulate the task in an instruction-following manner through the prompts in Table 3. This allows the model to be responsive to many different user intents (Figure 13).

Appendix F Referring Expression Generation

Table 6: Referring expression generation on the refCOCOg validation set.

| Method | METEOR | CIDEr |
|---|---|---|
| OLIVE-G (Llama 2) | 16.5 | 64.0 |
| OLIVE-RG (Llama 2) | 16.6 | 67.7 |
| OLIVE-G (GPT2) | 16.4 | 70.9 |
| OLIVE-RG (GPT2) | 17.0 | 75.0 |
| SLR Yu et al. (2017) | 15.4 | 59.2 |
| SLR+Rerank Yu et al. (2017) | 15.9 | 66.2 |
| GLAMM Rasheed et al. (2023) | 16.2 | 105.0 |
| GRIT Wu et al. (2022) | 15.2 | 71.6 |
| Kosmos 2 (zero-shot) | 12.2 | 60.3 |
| Kosmos 2 (few-shot k=2) | 13.8 | 62.2 |
| Kosmos 2 (few-shot k=4) | 14.1 | 62.2 |
| Flamingo-9B (zero-shot) | 9.2 | 34.3 |
| Flamingo-9B (few-shot k=2) | 10.2 | 36.2 |
| Flamingo-9B (few-shot k=4) | 12.3 | 39.6 |

We study our model's overall performance on referring expression generation by quantitatively evaluating it on the refCOCOg validation set, shown in Table 6. Several findings can be observed. First, including retrieved multimodal documents results in slightly better performance. Second, the size of the LLM can be changed without much performance difference, with GPT2 performing slightly better than Llama 2. Third, having global image context contained in the object representation is important, as methods that crop the image region (e.g., Flamingo) perform worse.

Appendix G Training Hyperparameters

We provide the detailed training hyperparameters in Table 7.

Table 7: Training hyperparameters.

| Hyperparameter | Classification | Generation |
|---|---|---|
| Epochs | 1 | 5 |
| Batch Size | 4 | 4 |
| Training Steps | ~200,000 | ~56,030 |
| Learning Rate | 2e-5 | 2e-5 |
| Optimizer | Adam | Adam |
| GPU Used | GTX 3090 | GTX 3090 |
| Train Time (hours) | 24 | 7.5 |
