OLIVE: Object Level In-Context Visual Embeddings (2024)

Timothy Ossowski¹, Junjie Hu¹,²
¹Department of Computer Science, ²Department of Biostatistics and Medical Informatics
University of Wisconsin, Madison, WI, USA
ossowski@wisc.edu, junjie.hu@wisc.edu

Abstract

Recent generalist vision-language models (VLMs) have demonstrated impressive reasoning capabilities across diverse multimodal tasks. However, these models still struggle with fine-grained object level understanding and grounding. In terms of modeling, existing VLMs implicitly align text tokens with image patch tokens, which prevents alignment at a common granularity and inevitably introduces noisy, spurious background features. Additionally, these models struggle when generalizing to unseen visual concepts and may not be reliable for domain-specific tasks without further fine-tuning. To address these limitations, we propose a novel method to prompt large language models with in-context visual object vectors, thereby enabling controllable object level reasoning. This eliminates the necessity of fusing a lengthy array of image patch features and significantly speeds up training. Furthermore, we propose region-level retrieval using our object representations, facilitating rapid adaptation to new objects without additional training. Our experiments reveal that our method achieves competitive referring object classification and captioning performance, while also offering zero-shot generalization and robustness to visually challenging contexts. Our code and models are available at https://github.com/tossowski/OLIVE.

1 Introduction

Despite their popularity, many existing VLMs such as LLaVA Liu et al. (2023), MiniGPT4 Zhu et al. (2023a), and mPLUG-Owl Ye et al. (2023) process the entire image for visual understanding, leading to two major shortcomings. First, these VLMs use a visual transformer to split an image into a grid of image patches and embed them into a lengthy array of image patch embeddings, with object level features scattered across different positions of the array. This creates a granularity mismatch between the image patch tokens and text tokens, further creating difficulty in aligning and grounding visual objects to text concepts. Second, feeding all image patch embeddings to the large language model (LLM) decoder is problematic due to the resulting long context and the inefficiency of including in-context examples from multiple images.

To improve fine-grained visual alignment, recent region-based VLMs are pre-trained to integrate object level information into the LLM decoder. GPT4ROI Zhang et al. (2023b) pre-trains LLMs to understand ROIAlign features He et al. (2017) extracted from bounding boxes. Other similar methods such as Shikra Chen et al. (2023) or Kosmos-2 Peng et al. (2024) ground and refer to objects using text in multimodal referential dialogues. FERRET You et al. (2023) and ViP-LLaVA Cai et al. (2023) further support free-form shapes as referring input by summarizing visual features sampled within the region of interest. Although these methods improve object level reasoning, they still fail at recognizing unseen/rare objects and are sensitive to spurious background features, as shown in §5. Even powerful closed-source multimodal models such as GPT4V are unreliable to deploy in high-stakes domain-specific situations such as the medical domain Senkaiahliyan et al. (2023).

A straightforward way to handle generalization to unseen visual content is to integrate a retrieval component. Methods such as REVEAL Hu et al. (2023) and MuRAG Chen et al. (2022) provide retrieved multimodal facts as supplementary context to help VLMs generalize to new concepts without further training. However, these models do not consider object level retrieval and in-context prediction. Models such as Flamingo Alayrac et al. (2022) and Qwen-VL Bai et al. (2023) allow for in-context examples from multiple images, yet do not support object level retrieval and reasoning.

To address the above issues, we propose to encode object level in-context visual embeddings (OLIVE) to enhance LLMs with region-level reasoning capabilities. Critically, we omit lengthy image patch features and encode visual object embeddings by a lightweight encoder of 20 million parameters, allowing for faster training and direct connection to existing LLMs. This preserves the full functionality of the original LLMs, while also introducing novel multimodal reasoning abilities. Furthermore, our object level retrieval module allows for more precise queries and retrieved information to help the model adapt to domain-specific tasks with limited training data. Our contributions are summarized below and in Table 1:

  • We propose a lightweight object encoder that can be connected to existing LLMs to enable controllable object level multimodal reasoning with free-form input annotations.

  • Our model omits image patch features and summarizes object features into a single vector, significantly reducing context length for more efficient training and inference, and allowing for in-context examples from multiple images.

  • We conduct extensive experiments with region-retrieval of object level features and showcase rapid adaptation to unseen visual concepts.

Table 1: Comparison of model capabilities.

| Model | Free-form Visual Prompts | Free-form Text Prompts | Visual Generalization | Generative Approach | Multi-Image |
|---|---|---|---|---|---|
| Ferret | ✓ | ✓ | ✗ | ✓ | ✗ |
| Flamingo | ✗ | ✓ | ✓ | ✓ | ✓ |
| GPT4ROI | ✗ | ✓ | ✗ | ✓ | ✗ |
| GLAMM | ✗ | ✓ | ✗ | ✓ | ✗ |
| RegionCLIP | ✗ | ✓ | ✗ | ✗ | ✗ |
| Llama-Adapter v2 | ✗ | ✗ | ✗ | ✓ | ✗ |
| ViP-LLaVA | ✓ | ✗ | ✗ | ✓ | ✗ |
| OLIVE | ✓ | ✓ | ✓ | ✓ | ✓ |

2 Preliminaries

Generative VLM Architecture

Recent generative VLMs (e.g., LLaVA, BLIP-2) adopt a similar architecture that connects a pre-trained visual encoder $\phi_v$ and a pre-trained language model decoder $\phi_t$ through a lightweight fusion neural network, denoted as $\phi_c$. Specifically, the fusion module first uses a projection function to map a visual feature $\mathbf{v} \in \mathcal{V}$ to the text embedding space $\mathcal{X}$ of the language model decoder, and then fuses the visual and text embeddings as input to the language model decoder. Formally, given an image $v$ and a text prompt $x$, the decoder takes in the combined feature $\mathbf{x}$ to autoregressively predict the output $y$.

$$\mathbf{x}_t = \texttt{TxtEmbed}(x;\phi_t) \in \mathcal{X} \qquad (1)$$
$$\mathbf{v} = \texttt{ImgEncoder}(v;\phi_v) \in \mathcal{V} \qquad (2)$$
$$\mathbf{x}_v = \texttt{Project}(\mathbf{v};\phi_c) \in \mathcal{X} \qquad (3)$$
$$\mathbf{x} = \texttt{Fuse}(\mathbf{x}_v, \mathbf{x}_t;\phi_c) \qquad (4)$$
$$p_{\text{vlm}}(y \mid v, x) = \prod_{j=0}^{|y|} p_{\phi_t}(y_j \mid \mathbf{x}, y_{<j}) \qquad (5)$$

Different from prior fusion modules (e.g., linear projection in LLaVA, gated cross-attention in Flamingo, and Q-former in BLIP-2) that project the whole image features, we propose an object level encoder (§3.1) that captures fine-grained region features and speeds up training and inference.
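As a concrete reference, the following is a minimal sketch of Eqs. (1)-(5) in PyTorch-style code. The class and argument names (e.g., `MinimalVLM`, `d_vision`, `d_text`) are illustrative assumptions rather than the actual OLIVE implementation, and the fusion here is simple concatenation.

```python
import torch
import torch.nn as nn

class MinimalVLM(nn.Module):
    """Minimal sketch of Eqs. (1)-(5): embed, encode, project, fuse, decode."""

    def __init__(self, img_encoder, lm, d_vision=1024, d_text=4096):
        super().__init__()
        self.img_encoder = img_encoder               # frozen phi_v (e.g., a ViT)
        self.lm = lm                                 # phi_t, an autoregressive LM decoder
        self.project = nn.Linear(d_vision, d_text)   # projection inside phi_c, Eq. (3)

    def forward(self, pixel_values, text_embeds, labels=None):
        v = self.img_encoder(pixel_values)           # Eq. (2): patch-level features
        x_v = self.project(v)                        # Eq. (3): map to the text space
        x = torch.cat([x_v, text_embeds], dim=1)     # Eq. (4): concatenation as a simple Fuse
        # Eq. (5): the decoder autoregressively predicts y given the fused sequence;
        # labels are assumed to be padded/aligned to the fused sequence length.
        return self.lm(inputs_embeds=x, labels=labels)
```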

Visual Instruction Tuning

We adopt a similar visual instruction-tuning approach as Liu et al. (2023) by fine-tuning parts of the VLM parameters (e.g., $\phi_c$ and/or $\phi_t$) on instruction-following data. The training objective is based on maximum likelihood estimation for next-token predictions given the input image and the text prompt. Different from prior work using pure text prompts, our object encoder and retrieval module (§3.1, §3.2) enable the use of code-switched prompt sequences mixing text tokens and image object tokens, and rapid adaptation to unseen domains via in-context prediction.
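A hedged sketch of this training objective follows: next-token cross-entropy computed only over the response tokens, with the prompt positions masked out. The function name and tensor layout are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def instruction_tuning_loss(logits, target_ids, response_mask):
    """Next-token MLE over the response only; prompt tokens do not contribute.
    logits: (B, T, V); target_ids: (B, T); response_mask: (B, T), 1 on answer tokens."""
    shifted_logits = logits[:, :-1, :]            # position t predicts token t+1
    shifted_targets = target_ids[:, 1:]
    mask = response_mask[:, 1:].float()
    nll = F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        shifted_targets.reshape(-1),
        reduction="none",
    ).view_as(mask)
    return (nll * mask).sum() / mask.sum().clamp(min=1.0)
```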

3 Method

[Figure 1: Overview of the main components of OLIVE.]

This section, as well as Figure 1, highlights the main components of our method. We first design an object encoder (§3.1) to learn visual object embeddings in a shared vision-text space, then apply a similarity search over object embeddings to retrieve relevant visual objects (§3.2), and finally construct a code-switched multimodal prompt to integrate the retrieved object information for generation (§3.3).

3.1 Object Encoder

Following popular region-grounded models such as FERRET You et al. (2023), we allow for free-form annotation of objects using the object segmentation mask $\mathbf{o}_{\text{mask}}$ as input. Specifically, we first encode an image $v$ with a vision transformer Dosovitskiy et al. (2020) to obtain patch-level features $\mathbf{v}$:

๐ฏ๐ฏ\displaystyle\mathbf{v}bold_v=ImgEncoderโข(v;ฯ•v)โˆˆโ„(n2+1)ร—d,absentImgEncoder๐‘ฃsubscriptitalic-ฯ•๐‘ฃsuperscriptโ„superscript๐‘›21๐‘‘\displaystyle=\texttt{ImgEncoder}(v;\phi_{v})\in\mathbb{R}^{(n^{2}+1)\times d},= ImgEncoder ( italic_v ; italic_ฯ• start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT ) โˆˆ blackboard_R start_POSTSUPERSCRIPT ( italic_n start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + 1 ) ร— italic_d end_POSTSUPERSCRIPT ,(6)

where n๐‘›nitalic_n is the grid size and d๐‘‘ditalic_d is the dimension of hidden states. To further obtain an object level feature ๐ฏobjsubscript๐ฏobj\mathbf{v}_{\text{obj}}bold_v start_POSTSUBSCRIPT obj end_POSTSUBSCRIPT from the image, we first extract a subset of the image features ๐ฏmaskedsubscript๐ฏmasked\mathbf{v}_{\text{masked}}bold_v start_POSTSUBSCRIPT masked end_POSTSUBSCRIPT corresponding to the binary object segmentation mask ๐จmasksubscript๐จmask\mathbf{o}_{\text{mask}}bold_o start_POSTSUBSCRIPT mask end_POSTSUBSCRIPT:

$$\mathbf{v}_{\text{masked}} = \mathbf{v}[\texttt{Flatten}(\mathbf{o}_{\text{mask}})] \in \mathbb{R}^{l\times d} \qquad (7)$$

where ๐จmasksubscript๐จmask\mathbf{o}_{\text{mask}}bold_o start_POSTSUBSCRIPT mask end_POSTSUBSCRIPT is a nร—n๐‘›๐‘›n\times nitalic_n ร— italic_n binary matrix, indicating the corresponding image patches occupied by an object in the image, and l๐‘™litalic_l denotes the number of the occupied patches. These segmentation masks can be created by automatic segmentation tools such as SAMKirillov etal. (2023) or provided by human selection on the image. The segmentation mask is first flattened and used to select object patches ๐ฏmaskedsubscript๐ฏmasked\mathbf{v}_{\text{masked}}bold_v start_POSTSUBSCRIPT masked end_POSTSUBSCRIPT from ๐ฏ๐ฏ\mathbf{v}bold_v. Finally, we obtain the object embedding by compressing ๐ฏmaskedsubscript๐ฏmasked\mathbf{v}_{\text{masked}}bold_v start_POSTSUBSCRIPT masked end_POSTSUBSCRIPT into a single vector ๐ฏobjsubscript๐ฏobj\mathbf{v}_{\text{obj}}bold_v start_POSTSUBSCRIPT obj end_POSTSUBSCRIPT.

$$\mathbf{v}_{\text{obj}} = \texttt{ObjectEncoder}(\mathbf{v}_{\text{masked}};\phi_c) \in \mathbb{R}^{d}, \qquad (8)$$

where the object encoder uses a lightweight 2-layer transformer that acts similarly to a visual resampler Zou et al. (2023a); Li et al. (2023), followed by a learnable linear layer that further projects the visual representation into the text space Liu et al. (2023).
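One plausible reading of Eqs. (7)-(8) is sketched below: a single learnable query cross-attends to the masked patch features through a small transformer, and a linear layer maps the result into the LM's embedding space. The module names, dimensions, and the learnable-query design are assumptions for illustration, not the exact OLIVE architecture.

```python
import torch
import torch.nn as nn

class ObjectEncoder(nn.Module):
    """Sketch of Eqs. (7)-(8): select the object's patch features and compress
    them into a single vector with a small resampler-style transformer."""

    def __init__(self, d_vision=1024, d_text=4096, n_layers=2, n_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, d_vision))   # one learnable query vector
        layer = nn.TransformerDecoderLayer(d_vision, n_heads, batch_first=True)
        self.resampler = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.to_text_space = nn.Linear(d_vision, d_text)          # projection to the LM space

    def forward(self, patch_feats, obj_mask):
        # patch_feats: (n*n + 1, d) ViT outputs; obj_mask: (n, n) binary object mask.
        grid = patch_feats[1:]                                     # drop the [CLS] token
        v_masked = grid[obj_mask.flatten().bool()]                 # Eq. (7): (l, d)
        v_obj = self.resampler(self.query, v_masked.unsqueeze(0))  # cross-attend to object patches
        return self.to_text_space(v_obj[0, 0])                     # Eq. (8): one d_text vector
```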

3.2 Visual Object Retrieval

In many cases, the object of interest does not resemble anything seen during training. With our visual object embeddings, we can easily perform object level retrieval to match an open class of visual objects and integrate the retrieved information into the language decoder for predicting unseen or rare objects from specific domains (e.g., biomedicine). To this end, we assume access to a retrieval set $\mathcal{R}=\{(\mathbf{o}_i, d_i, v_i)\}_{i=1}^{m}$, where each triple consists of an object's segmentation mask $\mathbf{o}_i$, the object's text description $d_i$, and the image $v_i$ containing this object. To retrieve relevant objects from $\mathcal{R}$, we use a similar object encoding as in §3.1, except that we use the mean pooling of $\mathbf{v}_{\text{masked}}$ as the object encoder in Eq. (9), since this simple strategy does not require any learnable parameters for projection to the text embedding space, and visual object embeddings can be pre-computed before any fine-tuning. However, we use a learnable object encoder in Eq. (8) to connect object embeddings to the LM decoder during instruction-tuning for text generation (§3.3).

$$\mathbf{v}_{\text{obj}} = \texttt{MeanPool}(\mathbf{v}_{\text{masked}}) \in \mathbb{R}^{d}, \qquad (9)$$

During retrieval, we compute a query vector $\mathbf{v}_{\text{query}}$ for a given object, and compute the cosine similarity scores between $\mathbf{v}_{\text{query}}$ and all the visual object embeddings from $\mathcal{R}$ to obtain the top $k$ closest triples, denoted as $\mathcal{K}=\{(\mathbf{o}_i, d_i, v_i)\}_{i=1}^{k}$.
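A minimal sketch of this retrieval step is given below, assuming the retrieval-set embeddings have been pre-computed with the same mean pooling; the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def retrieve_topk(query_patch_feats, retrieval_embeddings, retrieval_records, k=5):
    """Sketch of the retrieval in Section 3.2: mean-pool the query object's patch
    features (Eq. 9) and return the k most cosine-similar entries of the retrieval set.
    retrieval_embeddings: (m, d) pre-computed object vectors;
    retrieval_records: list of m (mask, description, image) triples."""
    v_query = query_patch_feats.mean(dim=0)                                   # Eq. (9)
    sims = F.cosine_similarity(v_query.unsqueeze(0), retrieval_embeddings, dim=-1)
    scores, idx = sims.topk(k)
    return [(retrieval_records[i], s) for i, s in zip(idx.tolist(), scores.tolist())]
```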

3.3 In-context Prompt Construction

As the visual object embeddings are projected into the text embedding space of the LM decoder, this allows us to construct a code-switched prompt that mixes visual objects with text tokens for the LM decoder (e.g., Llama 2 Touvron et al. (2023)). In addition, as our object encoder compresses a visual object into a single vector $\mathbf{v}_{\text{obj}}$, this significantly shortens the length of the visual tokens that the LM decoder needs to fuse with text tokens. Therefore, we can easily integrate multiple retrieved object embeddings into the prompt to augment the LM decoder for in-context text generation. Specifically, we define a special vocabulary token [obj] which can be inserted flexibly into the user prompt $x$. For example, the user can ask "[obj] Describe this part of the image" to perform region-level description. The embedding of this token is directly replaced with its corresponding visual object embedding. Formally, given a text prompt $x$ that contains indexed [obj] tokens referring to an object $\mathbf{v}_{\text{obj}}$ of interest in an image $v$ and its relevant objects in $\mathcal{K}$, we define a prompting function that replaces the text embedding of each [obj] with its corresponding visual object embedding, and integrates the top $k$ most similar objects $\mathcal{K}$ as in-context examples. For example, a prompt with retrieved in-context examples can be "The top [k] related objects are: [obj_1] is a [label], ... [obj_k] is a [label]. [obj_query] What is this?". We provide more details about in-context prompt templates and construction in Appendix A.

$$\mathbf{x} = \texttt{Prompt}(x, \mathbf{v}_{\text{obj}}, \mathcal{K}) \qquad (10)$$

Finally, we feed the multimodal prompt $\mathbf{x}$ into the LM decoder for text generation following Eq. (5). Note that compared to prior VLMs (e.g., LLaVA) that directly fuse the patch-level features $\mathbf{v}$ of the whole image (Eq. 6), with object information scattered around different positions in $\mathbf{v}$, our object encoding is computationally more efficient and speeds up training that involves multiple in-context objects in the multimodal prompt.
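The sketch below illustrates one way to implement this code-switching, assuming a HuggingFace-style tokenizer with [obj] registered as a special token and access to the decoder's token-embedding layer; the helper name and arguments are assumptions for illustration.

```python
import torch

def build_multimodal_prompt(tokenizer, embed_tokens, text, obj_embeddings, obj_token="[obj]"):
    """Sketch of Eq. (10): tokenize a code-switched prompt and substitute the
    embedding of every [obj] placeholder with its object vector (in prompt order)."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    embeds = embed_tokens(ids).clone()                 # (T, d_text) text embeddings
    obj_id = tokenizer.convert_tokens_to_ids(obj_token)
    positions = (ids == obj_id).nonzero(as_tuple=True)[0]
    assert len(positions) == len(obj_embeddings), "one object vector per [obj] token"
    for pos, v_obj in zip(positions.tolist(), obj_embeddings):
        embeds[pos] = v_obj                            # code-switch: object vector replaces text embedding
    return embeds.unsqueeze(0)                         # (1, T, d_text) input for the LM decoder
```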

4 Experimental Settings

In this section, we first describe the two main object-level tasks for evaluation (§4.1) together with the datasets used (§4.2). We then describe the three variants of our model (§4.3), the training details (§4.4), and the other baselines in comparison (§4.5).

4.1 Object-level Tasks

Referring Object Classification

Given an object referred to by its image location (e.g., a segmentation mask or bounding box), the model is instructed to generate text that predicts the object's class label from a predefined label set $\mathcal{C}=\{c_1, c_2, \dots, c_n\}$. We provide the ground truth segmentation mask to eliminate localization errors and focus on evaluating the models' understanding of image objects.

Referring Expression Generation

Given an input image object referred to by a segmentation mask, the model is instructed to generate a natural language expression that semantically matches multiple ground-truth references $\mathcal{R}=\{r_1, r_2, \dots, r_m\}$. We use METEOR Banerjee and Lavie (2005) and CIDEr Vedantam et al. (2015) scores to evaluate the quality of the generated descriptions.

4.2 Datasets

This section describes the different datasets used in our experiments, with more details in Appendix C (Table 4).

Common Objects in Context (COCO)

Lin et al. (2014) is a popular visual reasoning dataset with over 800,000 object-level annotations for 80 categories of objects. We use it to train our model to understand region input since it contains high-quality segmentation annotations. We use the standardized train and validation 2017 splits for the detection task, and discard a few (<1%) small segmentation annotations that fail to be converted into a binary mask. Following Zhong et al. (2022), we evaluate in the setting where ground-truth segmentations are provided as input to eliminate localization errors. We use the standard metric of mean average precision (mAP) for object detection using the COCO API (https://github.com/cocodataset/cocoapi), as well as overall accuracy.
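For reference, mAP can be computed with the COCO API roughly as follows; the file paths are placeholders, and the predictions JSON is assumed to follow the standard COCO results schema.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Ground-truth annotations and model predictions (placeholder paths).
coco_gt = COCO("annotations/instances_val2017.json")
coco_dt = coco_gt.loadRes("predictions.json")   # [{"image_id", "category_id", "segmentation", "score"}, ...]

evaluator = COCOeval(coco_gt, coco_dt, iouType="segm")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()                           # prints AP, AP50, AP75, etc.
```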

refCOCOg

Kazemzadeh et al. (2014) is a variant of the COCO dataset with about 50,000 annotations for objects and their descriptions. We use the data to train our model to describe image regions and use their standardized train/validation split.

ChestX-Ray8 (CXR8)

Wang et al. (2017) is a medical dataset consisting of 108,948 frontal-view X-ray images. The image annotations for the 8 possible pathologies are text-mined from the radiology reports using NLP tools. A small subset of 984 images contains bounding box annotations of the pathology. We use this subset for our zero-shot domain adaptation experiments, splitting the data into a 16% retrieval set and 84% test data. The retrieval set consists of 20 examples of each pathology, and we use overall accuracy as the evaluation metric.
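A simple sketch of this split follows, assuming each annotated example is an (image_id, bbox, pathology) triple; the function name and seed are illustrative.

```python
import random
from collections import defaultdict

def split_cxr8(examples, per_class=20, seed=0):
    """Hold out `per_class` boxes per pathology as the retrieval set
    (8 x 20 = 160 examples) and use the remaining 824 as test data."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for image_id, bbox, pathology in examples:
        by_class[pathology].append((image_id, bbox, pathology))
    retrieval, test = [], []
    for items in by_class.values():
        rng.shuffle(items)
        retrieval.extend(items[:per_class])
        test.extend(items[per_class:])
    return retrieval, test
```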

4.3 OLIVE Variants

OLIVE-R (Retrieval-only)

This retrieval-only method predicts the answer to the user question by taking a majority vote of the top $k$ retrieved examples. For simplicity, we fix $k=5$ for this setting unless otherwise specified and analyze the effect of $k$ in Figure 6. Although simple, this baseline proves to be effective and provides salient additional context as described in §4.3. However, this discriminative model does not allow for free-form text generation for tasks such as region captioning.
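A minimal sketch of OLIVE-R's prediction rule, reusing the retrieval helper sketched in §3.2; the names are illustrative.

```python
from collections import Counter

def olive_r_predict(retrieved, k=5):
    """Majority vote over the labels of the top-k retrieved objects.
    `retrieved` is a score-sorted list of ((mask, description, image), score) pairs."""
    labels = [record[1] for record, _ in retrieved[:k]]   # the text description/label d_i
    return Counter(labels).most_common(1)[0][0]
```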

OLIVE-G (Generative-only)

This model is trained to generate free-form text based solely on the user question and corresponding object features. We omit the retrieved information to observe the capability of the standalone object representations. We find that even without retrieval, the model can learn to perform more challenging object-level tasks such as region description. The final decoder input can be expressed as a variant of Eq. (10):

$$\mathbf{x} = \texttt{Prompt}(x, \mathbf{v}_{\text{obj}}). \qquad (11)$$

OLIVE-RG (Full)

Our full model generates text outputs based on in-context object examples from retrieval. The multimodal in-context prompt is constructed using Eq. (10). This prompt includes the retrieved object features, their labels, and their similarity scores. The exact construction can be found in Appendix A. The top $k$ retrieved multimodal documents in $\mathcal{K}$ are obtained using the same retrieval described in §3.2 and ordered in increasing relevance score. Both OLIVE-G and OLIVE-RG use greedy decoding for text generation.

4.4 Training Details

Our model uses a frozen ViT-L/14 vision transformer from a CLIP model to obtain patch-level features. For our LLM backbone, we use either Llama 2-7B or GPT-2 (124M) Radford et al. (2019). The LLM is fine-tuned with LoRA Hu et al. (2021), as we find this improves model performance. We use the train splits of two region-level datasets (i.e., COCO and refCOCOg) as our training data for their respective tasks, and evaluate models on their corresponding validation splits because their test data does not have object-level annotations. More details are in Table 7, and we leave further hyperparameter search to future exploration. We additionally find that we can train a multi-task model by combining the datasets for all object-level tasks (details in Appendix E).
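As a rough illustration, LoRA adapters can be attached to the decoder with the peft library as sketched below; the rank, alpha, and target modules are illustrative choices rather than the paper's exact configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

lm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attach low-rank adapters to attention projections
    task_type="CAUSAL_LM",
)
lm = get_peft_model(lm, lora_cfg)
lm.print_trainable_parameters()            # only the adapters (plus the object encoder) are trained
```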

4.5 Other Baselines in Comparison

CLIP

Radford et al. (2021) Contrastive Language-Image Pretraining (CLIP) learns a joint vision-language space between images and their matching captions. We use this method for zero-shot object classification by predicting the class label with the highest cosine similarity to the cropped region.
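A sketch of this crop-and-score baseline using the HuggingFace CLIP interface is shown below; the checkpoint name and prompt template are illustrative assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_crop_classify(image: Image.Image, box, class_names):
    """Crop the referred region and pick the class with the highest image-text similarity."""
    crop = image.crop(box)                                    # box = (x0, y0, x1, y1)
    prompts = [f"a photo of a {c}" for c in class_names]
    inputs = processor(text=prompts, images=crop, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    return class_names[out.logits_per_image.argmax(dim=-1).item()]
```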

BioMedCLIP

Zhang et al. (2023a) The authors train a CLIP model aligned to biomedical image-text pairs, achieving state-of-the-art performance on a variety of medical tasks. We use this model as a baseline for object classification in the medical domain.

RegionCLIP

Zhong et al. (2022) This model learns region-text level alignment through soft labels obtained from CLIP. We use it for referring object detection based on ROIAlign features.

Kosmos 2

Peng et al. (2024) This generative VLM trains an LLM decoder to perform a variety of visual grounding tasks using their newly introduced grounded image-text (GRIT) dataset. We compare with their results on referring expression generation on the refCOCOg dataset.

Flamingo

Alayrac et al. (2022) This generative model learns to connect frozen visual features and LLMs by training on interleaved image-text data. We evaluate Flamingo's few-shot performance on referring expression generation on cropped image regions. We use an open-source implementation trained on the multimodal C4 Zhu et al. (2023b) and LAION-2b Schuhmann et al. (2022) datasets.

5 Results and Analysis

[Figure 2]


Table 2: Referring object classification accuracy on the CXR8 dataset in the classification and generative settings.

| Method Type | Pre-Training Data | Method | Accuracy |
|---|---|---|---|
| Classification | None | OLIVE-R | 33.5 |
| | PMC-15 | BioMedCLIP | 32.5 |
| | PMC-15 | BioMedCLIP_crop | 23.3 |
| | CLIP400M | CLIP | 14.0 |
| | None | Random Guess | 12.5 |
| | CLIP400M | CLIP_crop | 11.2 |
| Generative | COCO | OLIVE-RG | 31.2 |
| | C4 + LAION-2b | Flamingo-9B | 12.5 |
| | COCO | OLIVE-G | 0.0 |

5.1 Referring Object Classification

Unseen Object Classification

One of the benefits of our retrieval-augmented system is its rapid generalization to unseen visual concepts. We estimate this capability by training on the COCO dataset and evaluating object classification on an unseen medical dataset that has drastically different types of images and limited training data. Table 2 shows the performance of our method on the CXR8 dataset in either a classification or generative setting. Even with as few as 20 examples per class in the medical retrieval set, OLIVE-R achieves competitive performance compared to domain-adapted models (i.e., BioMedCLIP), which we hypothesize is because of our region-level retrieval and in-distribution retrieval set. We also note that our generative approach OLIVE-RG can utilize the retrieved in-context examples and achieve similar performance to BioMedCLIP, despite only being trained on COCO images. Without retrieval, the generative model fails catastrophically with 0% accuracy, and zero-shot CLIP performs about as well as random guessing.

Rare Object Classification

We also investigate our model's performance on rare but seen objects. Figure 3 shows our method's performance on the five rarest classes in the COCO dataset. For OLIVE-G and OLIVE-RG, we use a 224-pixel-resolution visual encoder to match the CLIP visual encoder. OLIVE-G tends to have lower performance on the rare classes. However, when combining retrieval with parameterized methods in OLIVE-RG and OLIVE-RG-336px, the performance on rare classes improves significantly, with OLIVE-RG-336px performing better than CLIP on all rare classes. OLIVE-RG also achieves better performance on three out of five classes despite being trained on less data. Our model's overall performance can be found in Table 5 (Appendix D).

[Figure 3]

5.2 Referring Expression Generation

Captioning Unseen Objects

In addition to referring object classification, we investigate our model's ability to caption out-of-distribution objects. Figure 2 illustrates an example of asking our model to describe animals not seen during training. Without retrieval, OLIVE-G fails to describe the shark and turtle. However, after manually adding just 5 labeled objects of turtles and sharks to the existing retrieval set, OLIVE-RG accurately describes the object and provides supporting examples for its prediction. The label description for each object in the retrieval set is only the name of the animal, but the model generates additional characteristics in its description. Appendix B shows more examples of zero-shot adaptation to unseen visual concepts in the object classification setting.

[Figure 4]

Challenging Visual Context

To test the quality of the representations generated from our object encoder, we qualitatively evaluate our model's predictions in adversarial visual contexts. Figure 7 shows a white dog and a black cat in a "yin-yang" shape. We observe that free-form annotation allows for more precise user queries and object descriptions, and illustrates other properties such as scene content awareness and patch-level detail, as shown in Appendix B. While many VLMs can accurately understand normal scenes, Figure 4 illustrates an example in which an object-level representation may be necessary, with recent works struggling to caption the snowboarder on the beach. The detailed performance of our model on the refCOCOg captioning task can be found in Table 6 (Appendix F).

5.3 In-context Example Size

[Figure 5]

Since our method omits image patch features and compresses object information into a single vector, it can process many objects from different images at once. In Figure 5, we highlight the difference in context length for various methods when prompted with multimodal in-context examples. We assume an average prompt length of 30 tokens accompanying each in-context image example for all models. Even approaches designed for interleaved image-text data such as Flamingo insert multiple latent vectors for each image, incurring a higher cost than our approach.
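A back-of-the-envelope version of this comparison is sketched below, using the 30-token-per-example assumption from Figure 5; the per-image visual token counts for the other approaches are typical published values (e.g., 64 resampler latents, 576 patch tokens) used here only as illustrative assumptions.

```python
# Rough context-length comparison for k in-context object examples.
TOKENS_PER_IMAGE = {
    "OLIVE (single object vector)": 1,
    "Resampler-style (e.g., 64 latents per image)": 64,
    "Full patch features (e.g., 576 patches per image)": 576,
}

def context_length(k, visual_tokens_per_example, text_tokens=30):
    """Total prompt length when each in-context example carries ~30 text tokens."""
    return k * (visual_tokens_per_example + text_tokens)

for name, v in TOKENS_PER_IMAGE.items():
    print(f"{name}: {context_length(8, v)} tokens for 8 in-context examples")
```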

5.4 Sensitivity to Retrieval: Coverage and $k$

[Figure 6]

In Figure 6 we analyze the effect of changing the size of the object retrieval set as well as the number of retrieved examples $k$. To thoroughly test various settings, we evaluate the retrieval-only approach (OLIVE-R) on the validation split of the COCO dataset using different-sized subsets of the training data for retrieval. We ensure the retrieval set contains an equal amount of each object class when possible. Our results indicate that the optimal value of $k$ depends on the size of the retrieval set. With a small retrieval set (red), performance is lower and the optimal $k$ tends to be smaller. Larger retrieval sets (blue, green) benefit from retrieving more examples and achieve greater performance.

[Figure 7]

5.5 Object Vector Visualization

Having a single vector representation for each object allows for visualization using dimensionality reduction. In Figure 8, we perform principal component analysis (PCA) on the hidden states of object vectors at different layers in the LLM decoder. We plot 200 examples from each of 10 object categories and note several patterns. First, objects from the same class tend to appear together, even though they appear in different visual contexts. This suggests that the object encoder has semantic understanding of the visual concepts. Second, the object vectors naturally form hierarchical clusters where objects from the same super class such as vehicle, animal, or fruit have overlapping clusters. Lastly, the clustering appears similar across all layers, with only minor variations.
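The analysis in Figure 8 can be reproduced with a few lines of standard tooling, as sketched below; the function name and plotting details are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_object_vectors(hidden_states, labels):
    """Project object-vector hidden states from one decoder layer to 2D with PCA
    and color the points by object class (200 examples per class in Figure 8).
    hidden_states: (N, d) array; labels: length-N list of class names."""
    coords = PCA(n_components=2).fit_transform(np.asarray(hidden_states))
    labels = np.array(labels)
    for cls in sorted(set(labels.tolist())):
        pts = coords[labels == cls]
        plt.scatter(pts[:, 0], pts[:, 1], s=8, label=cls)
    plt.legend(fontsize=6)
    plt.show()
```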

[Figure 8]

6 Related Work

Grounding in Language and Vision

A popular approach for aligning vision and language embeddings is contrastive learning, as in CLIP and ALIGN Li et al. (2021). However, these methods align the entire image representation, leading to poor reasoning on image details for downstream vision-language tasks. RegionCLIP Zhong et al. (2022) and GLIP Li et al. (2022) address this issue by proposing fine-grained alignment with region-text pairs during pretraining. GLIPv2 Zhang et al. (2022) further improves the pretraining and alignment by introducing localization, detection, and other tasks. Another recent popular approach involves training models on automatically curated region-level data from image-caption pairs Peng et al. (2024). Many other works focus on region-level alignment during pretraining for greater vision-language understanding You et al. (2023); Chen et al. (2023); Zeng et al. (2022b, a). More generally, a recent study Bugliarello et al. (2023) shows that VLMs with fine-grained object-level pretraining such as X-VLM Zeng et al. (2022a) have better reasoning ability. Other works align vision and language using regularization or losses that create relation-aware cross-attention between modalities Pandey et al. (2023); Ren et al. (2021).

Visual Resampling

Visual resampling is a popular technique to compress long sequences of image features into a few rich vector representations. This is achieved by constructing a fixed number of learnable vectors that attend to the visual features through cross-attention layers. Models such as BLIP-2 Li et al. (2023) first explored this idea to connect frozen vision features to LLMs efficiently by summarizing the content of the image. Other methods, including X-Decoder Zou et al. (2023a) and SEEM Zou et al. (2023b), use resampling to encode various types of prompts or intents, which improves the LLM's decoding ability. Additionally, works such as Flamingo Alayrac et al. (2022) and Qwen-VL Bai et al. (2023) show that multiple images can be inserted in-context into the prompt by compressing image features with resamplers, enabling few-shot capabilities. Our work visually resamples object representations for object-conditioned text generation, and uses only a single vector for the representation. This allows for more fine-grained reasoning and longer in-context prompting.

Retrieval Augmented VLMs

In the text domain, learning to retrieve relevant documents to enhance the LLM query Guu et al. (2020) has been explored extensively Wang et al. (2023). Recent VLM works follow a similar approach, retrieving multimodal documents to improve performance on knowledge-intensive tasks and generalization to rare situations. Gao et al. (2022) summarize visual content into natural language to use as a query for dense passage retrieval. MuRAG Chen et al. (2022) proposes a multimodal image-text memory bank to help models answer challenging knowledge-based visual questions such as "What shape is the pediment on top of the white house?" REVEAL Hu et al. (2023) and RA-VQA Lin and Byrne (2022) learn a trainable multimodal retriever similar to REALM Guu et al. (2020) during pretraining to fetch relevant documents for answering questions, achieving state-of-the-art performance on datasets such as VQAv2 Antol et al. (2015) and OKVQA Schwenk et al. (2022). To the best of our knowledge, we are the first to integrate region-level retrieval with LLMs, in which the multimodal documents are indexed by object-level visual features.

7 Conclusion

We present a simple approach to insert object level visual embeddings into large language model decoders, enabling object level reasoning with flexible prompt structure. Our object encoder compresses fine-grained region level information into a single vector, enabling in-context prompting with objects from multiple images and more efficient training and inference. In addition, we introduce the idea of region retrieval, which allows for precise queries free of image background noise and rapid generalization to rare and unseen objects with no parameter updates. We hope our method may help researchers design vision language models which can adapt to their needs by simply updating the retrieval set or object encoder, while also being responsive to varying user intents using LLM prompting techniques.

8 Limitations

While our approach provides a flexible way for users to supply object-level prompts, it does not output bounding boxes or other region-level grounding. This may be addressed in future research by further fine-tuning on region-level instruction-tuning data as done in FERRET You et al. (2023), GLAMM Rasheed et al. (2023), and other region-level VLM pretraining. At the moment, we also do not explore generic image tasks such as VQA or image captioning. However, a potential solution is to use our object encoder to connect to existing VLMs (e.g., LLaVA) that excel at these tasks. Lastly, our results in the retrieval setting depend on the quality of the retrieved examples. Curating a high-quality retrieval set at the object level can be challenging. However, existing tools such as GLIPv2 Zhang et al. (2022) allow for semi-automatic generation of region-level data, as used by KOSMOS-2 Peng et al. (2024) in developing the GRIT dataset.

9 Ethical Considerations

Biases From Pretrained LLMs

Since our model uses existing pretrained LLMs such as Llama 2 or GPT2, it may inherit some of the social biases or toxicity acquired during their pretraining stages. While Llama 2 undergoes extensive alignment to human values through reinforcement learning from human feedback (RLHF) Griffith et al. (2013), some of these toxic behaviors may still be present in the aligned model. We make sure to only use images of common objects in the COCO dataset, which, to the best of our knowledge, do not contain any of these biases or violent scenes. Nevertheless, further testing to ensure the impartiality of the model may be necessary before deploying it in widespread technologies.

Domain Adaptation

Some of our experiments involve evaluating our model in a data-scarce domain in a zero-shot manner with in-context prompting. While this is a promising direction for efficient domain adaptation, users should take caution in directly using model predictions, as this is a challenging task due to distribution shift. We encourage human-in-the-loop interaction to sanity-check the outputs. Different from other ICL prompting methods, we provide retrieved examples and similarity scores that can help determine the trustworthiness of the model prediction, which may be valuable for high-risk domains such as medicine.

Acknowledgement

Ossowski and Hu are supported by the Wisconsin Alumni Research Foundation and the National Institute Of Biomedical Imaging And Bioengineering of the National Institutes of Health under Award Number R01EB033782. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

References

  • Alayrac etal. (2022)Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr,Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds,etal. 2022.Flamingo: a visual language model for few-shot learning.Advances in Neural Information Processing Systems,35:23716โ€“23736.
  • Antol etal. (2015)Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra,CLawrence Zitnick, and Devi Parikh. 2015.Vqa: Visual question answering.In Proceedings of the IEEE international conference on computervision, pages 2425โ€“2433.
  • Bai etal. (2023)Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, JunyangLin, Chang Zhou, and Jingren Zhou. 2023.Qwen-vl: A frontier large vision-language model with versatileabilities.arXiv preprint arXiv:2308.12966.
  • Banerjee and Lavie (2005)Satanjeev Banerjee and Alon Lavie. 2005.Meteor: An automatic metric for mt evaluation with improvedcorrelation with human judgments.In Proceedings of the acl workshop on intrinsic and extrinsicevaluation measures for machine translation and/or summarization, pages65โ€“72.
  • Bugliarello etal. (2023)Emanuele Bugliarello, Laurent Sartran, Aishwarya Agrawal, LisaAnne Hendricks,and Aida Nematzadeh. 2023.Measuring progress in fine-grained vision-and-language understanding.In Proceedings of the 61st Annual Meeting of the Associationfor Computational Linguistics.
  • Cai etal. (2023)MuCai, Haotian Liu, SivaKarthik Mustikovela, GregoryP Meyer, Yuning Chai,Dennis Park, and YongJae Lee. 2023.Making large multimodal models understand arbitrary visual prompts.arXiv preprint arXiv:2312.00784.
  • Chen etal. (2023)Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao.2023.Shikra: Unleashing multimodal llmโ€™s referential dialogue magic.arXiv preprint arXiv:2306.15195.
  • Chen etal. (2022)Wenhu Chen, Hexiang Hu, XiChen, Pat Verga, and WilliamW Cohen. 2022.Murag: Multimodal retrieval-augmented generator for open questionanswering over images and text.In Proceedings of the 2022 Conference on Empirical Methods inNatural Language Processing.
  • Dosovitskiy etal. (2020)Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn,Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, GeorgHeigold, Sylvain Gelly, etal. 2020.An image is worth 16x16 words: Transformers for image recognition atscale.arXiv preprint arXiv:2010.11929.
  • Gao etal. (2022)Feng Gao, Qing Ping, Govind Thattai, Aishwarya Reganti, YingNian Wu, and PremNatarajan. 2022.Transform-retrieve-generate: Natural language-centricoutside-knowledge visual question answering.In Proceedings of the IEEE/CVF Conference on Computer Visionand Pattern Recognition, pages 5067โ€“5077.
  • Griffith etal. (2013)Shane Griffith, Kaushik Subramanian, Jonathan Scholz, CharlesL Isbell, andAndreaL Thomaz. 2013.Policy shaping: Integrating human feedback with reinforcementlearning.Advances in neural information processing systems, 26.
  • Guu etal. (2020)Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020.Retrieval augmented language model pre-training.In International Conference on Machine Learning, pages3929โ€“3938. PMLR.
  • He etal. (2017)Kaiming He, Georgia Gkioxari, Piotr Dollรกr, and Ross Girshick. 2017.Mask r-cnn.In Proceedings of the IEEE international conference on computervision, pages 2961โ€“2969.
  • Hu etal. (2021)EdwardJ Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, SheanWang, LuWang, and Weizhu Chen. 2021.Lora: Low-rank adaptation of large language models.arXiv preprint arXiv:2106.09685.
  • Hu etal. (2023)Ziniu Hu, Ahmet Iscen, Chen Sun, Zirui Wang, Kai-Wei Chang, Yizhou Sun,Cordelia Schmid, DavidA Ross, and Alireza Fathi. 2023.Reveal: Retrieval-augmented visual-language pre-training withmulti-source multimodal knowledge memory.In Proceedings of the IEEE/CVF Conference on Computer Visionand Pattern Recognition, pages 23369โ€“23379.
  • Kazemzadeh etal. (2014)Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. 2014.Referitgame: Referring to objects in photographs of natural scenes.In Proceedings of the 2014 conference on empirical methods innatural language processing (EMNLP), pages 787โ€“798.
  • Kirillov etal. (2023)Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, LauraGustafson, Tete Xiao, Spencer Whitehead, AlexanderC Berg, Wan-Yen Lo, etal.2023.Segment anything.arXiv preprint arXiv:2304.02643.
  • Li etal. (2023)Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023.Blip-2: Bootstrapping language-image pre-training with frozen imageencoders and large language models.In International Conference on Machine Learning.
  • Li etal. (2021)Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong,and Steven ChuHong Hoi. 2021.Align before fuse: Vision and language representation learning withmomentum distillation.Advances in neural information processing systems,34:9694โ€“9705.
  • Li etal. (2022)LiunianHarold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li,Yiwu Zhong, Lijuan Wang, LuYuan, Lei Zhang, Jenq-Neng Hwang, etal. 2022.Grounded language-image pre-training.In Proceedings of the IEEE/CVF Conference on Computer Visionand Pattern Recognition, pages 10965โ€“10975.
  • Lin etal. (2014)Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, DevaRamanan, Piotr Dollรกr, and CLawrence Zitnick. 2014.Microsoft coco: Common objects in context.In Computer Visionโ€“ECCV 2014: 13th European Conference,Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages740โ€“755. Springer.
  • Lin and Byrne (2022)Weizhe Lin and Bill Byrne. 2022.Retrieval augmented visual question answering with outside knowledge.In Proceedings of the 2022 Conference on Empirical Methods inNatural Language Processing.
  • Liu etal. (2023)Haotian Liu, Chunyuan Li, Qingyang Wu, and YongJae Lee. 2023.Visual instruction tuning.arXiv preprint arXiv:2304.08485.
  • Pandey etal. (2023)Rohan Pandey, Rulin Shao, PaulPu Liang, Ruslan Salakhutdinov, andLouis-Philippe Morency. 2023.Cross-modal attention congruence regularization for vision-languagerelation alignment.In Proceedings of the 61st Annual Meeting of the Associationfor Computational Linguistics (Volume 1: Long Papers), pages 5444โ€“5455,Toronto, Canada. Association for Computational Linguistics.
  • Peng etal. (2024)Zhiliang Peng, Wenhui Wang, LiDong, Yaru Hao, Shaohan Huang, Shuming Ma, andFuru Wei. 2024.Kosmos-2: Grounding multimodal large language models to the world.In Proceedings of the Twelfth International Conference onLearning Representations (ICLR).
  • Radford etal. (2021)Alec Radford, JongWook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh,Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark,etal. 2021.Learning transferable visual models from natural languagesupervision.In International conference on machine learning, pages8748โ€“8763. PMLR.
  • Radford etal. (2019)Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, IlyaSutskever, etal. 2019.Language models are unsupervised multitask learners.OpenAI blog, 1(8):9.
  • Rasheed etal. (2023)Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan,Hisham Cholakkal, RaoM Anwer, Erix Xing, Ming-Hsuan Yang, and FahadS Khan.2023.Glamm: Pixel grounding large multimodal model.arXiv preprint arXiv:2311.03356.
  • Ren etal. (2021)Shuhuai Ren, Junyang Lin, Guangxiang Zhao, Rui Men, AnYang, Jingren Zhou,XuSun, and Hongxia Yang. 2021.Learning relation alignment for calibrated cross-modal retrieval.arXiv preprint arXiv:2105.13868.
  • Schuhmann etal. (2022)Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, RossWightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, MitchellWortsman, etal. 2022.Laion-5b: An open large-scale dataset for training next generationimage-text models.Advances in Neural Information Processing Systems,35:25278โ€“25294.
  • Schwenk etal. (2022)Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, andRoozbeh Mottaghi. 2022.A-okvqa: A benchmark for visual question answering using worldknowledge.In European Conference on Computer Vision, pages 146โ€“162.Springer.
  • Senkaiahliyan etal. (2023)Senthujan Senkaiahliyan, Augustin Toma, Jun Ma, An-Wen Chan, Andrew Ha, KevinRAn, Hrishikesh Suresh, Barry Rubin, and BoWang. 2023.Gpt-4v (ision) unsuitable for clinical care and education: Aclinician-evaluated assessment.medRxiv, pages 2023โ€“11.
  • Touvron etal. (2023)Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, YasmineBabaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale,etal. 2023.Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288.
  • Vedantam etal. (2015)Ramakrishna Vedantam, CLawrenceZitnick, and Devi Parikh. 2015.Cider: Consensus-based image description evaluation.In Proceedings of the IEEE conference on computer vision andpattern recognition, pages 4566โ€“4575.
  • Wang etal. (2023)Liang Wang, Nan Yang, and Furu Wei. 2023.Learning to retrieve in-context examples for large language models.arXiv preprint arXiv:2307.07164.
  • Wang etal. (2017)Xiaosong Wang, Yifan Peng, LeLu, Zhiyong Lu, Mohammadhadi Bagheri, andRonaldM Summers. 2017.Chestx-ray8: Hospital-scale chest x-ray database and benchmarks onweakly-supervised classification and localization of common thorax diseases.In Proceedings of the IEEE conference on computer vision andpattern recognition, pages 2097โ€“2106.
  • Wu etal. (2022)Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan,and Lijuan Wang. 2022.Grit: A generative region-to-text transformer for objectunderstanding.arXiv preprint arXiv:2212.00280.
  • Ye etal. (2023) Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chaoya Jiang, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qian Qi, Ji Zhang, and Fei Huang. 2023. mPLUG-Owl: Modularization empowers large language models with multimodality.
  • You etal. (2023)Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang,Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. 2023.Ferret: Refer and ground anything anywhere at any granularity.arXiv preprint arXiv:2310.07704.
  • Yu etal. (2017)Licheng Yu, Hao Tan, Mohit Bansal, and TamaraL Berg. 2017.A joint speaker-listener-reinforcer model for referring expressions.In Proceedings of the IEEE conference on computer vision andpattern recognition, pages 7282โ€“7290.
  • Zareian etal. (2021)Alireza Zareian, KevinDela Rosa, DerekHao Hu, and Shih-Fu Chang. 2021.Open-vocabulary object detection using captions.In Proceedings of the IEEE/CVF Conference on Computer Visionand Pattern Recognition, pages 14393โ€“14402.
  • Zeng etal. (2022a)Yan Zeng, Xinsong Zhang, and Hang Li. 2022a.Multi-grained vision language pre-training: Aligning texts withvisual concepts.In Proceedings of the Thirty-ninth International Conference onMachine Learning.
  • Zeng etal. (2022b) Yan Zeng, Xinsong Zhang, Hang Li, Jiawei Wang, Jipeng Zhang, and Wangchunshu Zhou. 2022b. $X^2$-VLM: All-in-one pre-trained model for vision-language tasks. arXiv preprint arXiv:2211.12402.
  • Zhang etal. (2022)Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Li, XiyangDai, Lijuan Wang, LuYuan, Jenq-Neng Hwang, and Jianfeng Gao. 2022.Glipv2: Unifying localization and vision-language understanding.Advances in Neural Information Processing Systems,35:36067โ€“36080.
  • Zhang etal. (2023a)Sheng Zhang, Yanbo Xu, Naoto Usuyama, Jaspreet Bagga, Robert Tinn, Sam Preston,Rajesh Rao, MuWei, Naveen Valluri, Cliff Wong, etal. 2023a.Large-scale domain-specific pretraining for biomedicalvision-language processing.arXiv preprint arXiv:2303.00915.
  • Zhang etal. (2023b)Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, KaiChen, and Ping Luo. 2023b.Gpt4roi: Instruction tuning large language model onregion-of-interest.arXiv preprint arXiv:2307.03601.
  • Zhong etal. (2022)Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella,LiunianHarold Li, Luowei Zhou, Xiyang Dai, LuYuan, Yin Li, etal. 2022.Regionclip: Region-based language-image pretraining.In Proceedings of the IEEE/CVF Conference on Computer Visionand Pattern Recognition, pages 16793โ€“16803.
  • Zhu etal. (2023a)Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny.2023a.Minigpt-4: Enhancing vision-language understanding with advancedlarge language models.arXiv preprint arXiv:2304.10592.
  • Zhu etal. (2023b)Wanrong Zhu, Jack Hessel, Anas Awadalla, SamirYitzhak Gadre, Jesse Dodge, AlexFang, Youngjae Yu, Ludwig Schmidt, WilliamYang Wang, and Yejin Choi.2023b.Multimodal c4: An open, billion-scale corpus of images interleavedwith text.In Advances in Neural Information Processing Systems (D&B).
  • Zou etal. (2023a)Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, XiyangDai, Harkirat Behl, Jianfeng Wang, LuYuan, etal. 2023a.Generalized decoding for pixel, image, and language.In Proceedings of the IEEE/CVF Conference on Computer Visionand Pattern Recognition, pages 15116โ€“15127.
  • Zou etal. (2023b)Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Gao, andYongJae Lee. 2023b.Segment everything everywhere all at once.In Advances in Neural Information Processing Systems (poster).

Appendix A Prompt Templates

Table 3: Prompts used to instruct the LLM decoder.

Object Classification
- ICL prompt for retrieved examples: "You are a helpful vision assistant trained to help people analyze images. The top [k] related objects are: [obj] is a [label] with confidence [score] ... [obj] is a [label] with confidence [score]. [vanilla prompt]"
- Vanilla prompts: "[obj] What is this? Answer in 1-2 words"; "[obj] What is this object? Answer with a short word or phrase."; "[obj] Identify this object."; "Here is an object [obj]. What is this? Answer with a short word or phrase."

Region Description
- ICL prompt for retrieved examples: "You are a helpful vision assistant trained to help people analyze images. The top [k] related objects are: [obj] is a [label] with confidence [score] ... [obj] is a [label] with confidence [score]. [vanilla prompt]"
- Vanilla prompts: "[obj] Briefly describe this image region."; "[obj] Describe this part of the image."; "[obj] Share some details about what's happening here in the image."; "[obj] Break down what you see in this particular part of the picture."; "[obj] Describe what you notice in this area of the picture."

Table 3 contains all the prompts we use to instruct the LLM decoder.

Appendix B Qualitative Examples

[Figures 9-12]

Here we include several selected examples showcasing the strengths and weaknesses of our approach.

Visual Concept Generalization

In Figure 9 we demonstrate more examples of rapid generalization to new visual concepts. Many existing methods confidently predict concepts from their pretraining, while ours can predict new concepts on the fly.

Scene Content Awareness

Even though our object representation involves masking out image patch features from other parts of the image, we have observed that the object vector still contains information about its surroundings. Figure 10 illustrates this phenomenon, where OLIVE can include the cow in its description, despite not including any image patches corresponding to the cow in the user selection.

Patch level Detail

Our method can also identify and describe small objects at the patch level. Figure 11 shows an example of object classification on smaller objects.

Describing Partially Visible Objects

We notice that our model can make mistakes when describing occluded or partially visible objects as seen in Figure 12. We hypothesize that the training data of refCOCOg does not include these kinds of image regions, which also limits its availability in retrieval data. This may be addressed with larger-scale pre-training on data such as GRIT which likely includes more occluded objects.

Errors in Detailed Description

While our model can identify the object most of the time, it sometimes gets minor details incorrect, for example the colors of a shirt or other piece of clothing, as seen in Figure 12. This may be due to the extreme compression into a single vector. Future work may consider visually resampling the object features into more than one latent vector for detailed captioning, while still using the single-vector representation for retrieval.

Appendix C Dataset Information

Table 4: Dataset splits used in our training and evaluation.

| Dataset | Train Split | Validation Split | Retrieval Set (Train Split) | Retrieval Set (Test Split) | Number of Classes |
|---|---|---|---|---|---|
| COCO | 849,586 | 36,320 | 849,586 | 849,586 | 80 |
| refCOCOg | 44,822 | 5,000 | 849,586 | 849,586 | - |
| CXR8 | - | 824 | - | 160 | 8 |

Table 4 provides more details on the dataset splits used in our training and evaluation. Our COCO train and validation splits are slightly smaller than usual because of our approach of using segmentation masks: we omit some excessively small segmentations, which account for less than 1% of the data. For tasks that require training (COCO and refCOCOg), we use the train split of the COCO object detection dataset as our retrieval data. We make sure to omit the closest match when training object detection on COCO with retrieval to avoid label leakage. We also confirm that no images from the training split are repeated in the validation split for either dataset.

Appendix D Referring Object Classification

Table 5: Referring object classification results on the COCO validation set.

| Method Type | Method | Accuracy | mAP |
|---|---|---|---|
| Classification | OLIVE-R | 64.1 | 40.5 |
| | CLIP ViT-L/14 | 40.9 | 45.1 |
| | RegionCLIP RN50 | - | 61.4 |
| | OVR | - | 44.5 |
| Generative | OLIVE-G (GPT2) | 76.6 | 60.4 |
| | OLIVE-G (Llama 2) | 76.8 | 60.3 |
| | OLIVE-RG (GPT2) | 74.8 | 57.5 |
| | OLIVE-RG (Llama 2) | 74.1 | 56.2 |

This task requires the LLM to predict the object class label given a ground-truth input annotation (e.g., bounding box, segmentation, etc.). We follow a similar evaluation protocol to Zhong et al. (2022) and Zareian et al. (2021), in which the ground-truth annotation is supplied to avoid localization error. Table 5 shows the overall referring object classification accuracy and mAP for our methods (to simplify the calculation, we assign a confidence score of 1 to each prediction; the reported mAP may be lower than the true value when using more accurate probabilities). We observe several findings. First, although retrieved examples help with domain adaptation and rare objects, they do not improve overall in-domain performance. Second, both the Llama 2 and GPT2 baselines have similar performance on the task, suggesting that even smaller models can learn vision-language grounding. Lastly, even our retrieval-only baseline, which requires no training, has better accuracy than some parameterized methods such as CLIP.

[Figure 13]

Appendix E Multi-Task Model

We also explore the possibility of training a multi-task model using a curriculum learning strategy similar to LLaVA Liu et al. (2023). We first train the model on the referring object classification task to perform object-word level alignment. The model is then trained on the referring expression generation task, and finally on an object instruction-following dataset Cai et al. (2023) with many different tasks. For each stage of training, we formulate the task in an instruction-following manner through the prompts in Table 3. This allows the model to be responsive to many different user intents (Figure 13).

Appendix F Referring Expression Generation

Table 6: Referring expression generation on the refCOCOg validation set.

| Method | METEOR | CIDEr |
|---|---|---|
| OLIVE-G (Llama 2) | 16.5 | 64.0 |
| OLIVE-RG (Llama 2) | 16.6 | 67.7 |
| OLIVE-G (GPT2) | 16.4 | 70.9 |
| OLIVE-RG (GPT2) | 17.0 | 75.0 |
| SLR Yu et al. (2017) | 15.4 | 59.2 |
| SLR+Rerank Yu et al. (2017) | 15.9 | 66.2 |
| GLAMM Rasheed et al. (2023) | 16.2 | 105.0 |
| GRIT Wu et al. (2022) | 15.2 | 71.6 |
| Kosmos 2 (zero-shot) | 12.2 | 60.3 |
| Kosmos 2 (few-shot k=2) | 13.8 | 62.2 |
| Kosmos 2 (few-shot k=4) | 14.1 | 62.2 |
| Flamingo-9B (zero-shot) | 9.2 | 34.3 |
| Flamingo-9B (few-shot k=2) | 10.2 | 36.2 |
| Flamingo-9B (few-shot k=4) | 12.3 | 39.6 |

We study our model's overall performance on referring expression generation by quantitatively evaluating it on the refCOCOg validation set, shown in Table 6. Several findings can be observed. First, including retrieved multimodal documents results in slightly better performance. Second, the size of the LLM can be changed without much performance difference, with GPT2 performing slightly better than Llama 2. Third, having global image context contained in the object representation is important, as methods that crop the image region (e.g., Flamingo) perform worse.

Appendix G Training Hyperparameters

We provide the detailed training hyperparameters in Table 7.

Table 7: Training hyperparameters.

| Hyperparameter | Classification | Generation |
|---|---|---|
| Epochs | 1 | 5 |
| Batch Size | 4 | 4 |
| Training Steps | ~200,000 | ~56,030 |
| Learning Rate | 2e-5 | 2e-5 |
| Optimizer | Adam | Adam |
| GPU Used | GTX 3090 | GTX 3090 |
| Train Time (hours) | 24 | 7.5 |
