Open-Vocabulary Detection


Object detection is a computer vision task that involves locating and classifying objects within images or videos. A detector is trained on example images annotated with the objects they contain and their respective classes, where the classes are pre-defined, such as "car", "person", "dog", etc. This is a unimodal approach: only the image modality is used to extract information about the objects, and the model learns to identify them in terms of form, shape and color.

The last few years have seen unprecedented advances in multi-modal approaches, models that combine multiple modalities (text, image, sound, etc.). These models are used, for example, for text-to-image generation, visual question answering or natural language visual reasoning. These approaches evolved with the advances of Foundation Models [^1], that is, large models pre-trained on vast quantities of data at scale (often by self-supervised or semi-supervised learning). Foundation models initially came from the field of Natural Language Processing (NLP) through the development of text models based on the transformer architecture, but as it became possible to feed the transformer not only text but also images, new techniques arrived to process image and text simultaneously; the result is often referred to in the literature as Vision-Language Models (VLMs). The combination of these modalities brings many benefits.

One recent setting that has emerged from VLMs is that of Open-Vocabulary Object Detection (OVD), where a detector is trained to locate and categorize not only a well-known list of classes but also unseen objects defined by an unbounded vocabulary. These models usually take an image encoder that was pre-trained in conjunction with a text encoder and attach to it a detector whose head is modified to compare region features with text embeddings, often through an attention mechanism. This is explained in more detail below.

Problem statement

Open-Vocabulary Detection General Architecture

The majority of OVD models are based on CLIP [^2]. CLIP consists of two encoders (image and text) with the same output dimension, trained to correctly match each image in a set with its corresponding text description; this is called contrastive learning. The match between image and text embeddings is measured by their cosine similarity, so both encoders are effectively trained to share the same embedding space. The CLIP model can be seen below:

[Figure: the CLIP contrastive pre-training setup]
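As a minimal sketch of this training objective (assuming the batched image and text features have already been produced by the two encoders; all names below are illustrative):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss over a batch of N matching (image, text) pairs.

    image_features, text_features: tensors of shape (N, d) coming from the
    image encoder and the text encoder, which share the output dimension d.
    """
    # L2-normalise so the dot product becomes the cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (N, N) similarity matrix: entry (i, j) compares image i with caption j.
    logits = image_features @ text_features.t() / temperature

    # The correct caption for image i is caption i, and vice versa.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # match images to texts
    loss_t2i = F.cross_entropy(logits.t(), targets)  # match texts to images
    return (loss_i2t + loss_t2i) / 2
```

The temperature scales the logits before the softmax; in CLIP it is learned during training, but a fixed value is enough for the sketch.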

Once we have a pair of contrastively pre-trained image and text encoders, we need to modify them so that we can compute a similarity score not only between an entire image and a text prompt (as in the original CLIP model) but also between a region of an image and a text. We use the pre-trained image encoder as the backbone of an object detector, and in the last layer of the classification head, instead of k-way classification, the head is modified so that the classification features can be projected directly onto pre-computed text embeddings of the classes.

This allows one to encode virtually any text with the text encoder and, using the scalar product, quantify how similar the text is to the image regions proposed by the detector, e.g. "red dog", "well-dressed man", "old-looking car", etc. The general architecture can be seen below:

OVD architecture
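The modified classification head boils down to a matrix of cosine similarities between region features and text embeddings. A minimal sketch, assuming the region features have already been projected into the shared image-text embedding space (tensor shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

def open_vocab_class_scores(region_features, class_embeddings, temperature=0.01):
    """Score each proposed region against an arbitrary list of text queries.

    region_features:  (R, d) features of the R regions proposed by the detector.
    class_embeddings: (C, d) text embeddings of the C queries, pre-computed once
                      with the text encoder (e.g. "red dog", "old looking car").
    Returns a (R, C) matrix of per-region class probabilities.
    """
    region_features = F.normalize(region_features, dim=-1)
    class_embeddings = F.normalize(class_embeddings, dim=-1)

    # Cosine similarity between every region and every text query.
    similarity = region_features @ class_embeddings.t()

    # A softmax over the queries turns similarities into a distribution,
    # replacing the fixed k-way classification layer of a standard detector.
    return (similarity / temperature).softmax(dim=-1)

# Usage sketch with random placeholders: 5 regions, 3 queries, 512-d embeddings.
scores = open_vocab_class_scores(torch.randn(5, 512), torch.randn(3, 512))
print(scores.shape)  # torch.Size([5, 3])
```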

Advances

It is important to emphasize that the only parts of the base detector that are altered are the backbone and the classification head. This means the approach can be applied to a wide range of detector architectures in the most creative ways. From the current literature we can cite the following models and some of the contributions they offer; a more detailed look at these and other models, without needing to read the original papers, can be found in my master thesis report [^8].

  1. RegionCLIP [^4] (Zhong et al. 2021) develops a region-text pre-training that leverages pre-trained VLMs on image-caption data and also trains the classification head using distillation from a teacher Vision-Language Model (VLM). It uses a Faster R-CNN architecture.

  2. F-VLM [^5] (Kuo et al. 2022) uses a frozen pre-trained VLM and trains only the detector’s head using an object detection dataset. It changes the detector’s head during inference to allow for classes not seen during training. Using a frozen language model (LM) has the downside that the vision model is “forced” into the language-model “mould”.

  3. OWL-ViT [^6] (Minderer et al. 2022) employs a Vision Transformer (ViT) as the image encoder. It fine-tunes an open-vocabulary detector, whose encoder was pre-trained in a contrastive manner, on various detection/grounding datasets. To adapt the image encoder for detection, the authors remove the token pooling and final projection layer of the ViT, while box coordinates are obtained by passing the token representations through a small fully connected multi-layer neural network. A usage sketch of this model is shown after this list.

  4. UniDetector [^7] (Wang et al. 2023) proposes to use multiple datasets with heterogeneous label spaces to train the detector. To promote generalization to novel categories, it performs probability calibration and proposes a decoupled training scheme that trains the class-agnostic Region Proposal Network (RPN) and the class-aware classification head separately.
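Several of these models are available as pre-trained checkpoints. As an illustration, here is a possible way to run zero-shot detection with OWL-ViT through the Hugging Face transformers library; this is a sketch, the exact post-processing method can differ between library versions, and "street.jpg" is only a placeholder image:

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

# Zero-shot detection: the text queries define the vocabulary at inference time.
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("street.jpg")  # any RGB image (placeholder path)
queries = [["a red car", "a person on a bike", "a traffic light"]]

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to boxes/scores/labels in the original image coordinates.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.1, target_sizes=target_sizes
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{queries[0][label.item()]}: {score:.2f} at {box.tolist()}")
```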

OVD models acquire a much larger vocabulary by learning from low-cost image-caption pairs, which makes them well adapted to detection in the open world, where novel unseen classes can appear at any time, compared to standard closed-vocabulary detection methods. They are also well suited to anomaly detection: since these models are pre-trained on a large amount of data, they offer superior robustness and transfer better to real-life applications [^3].

New approaches are being studied every day, each with its drawbacks and advantages. One can pre-train a model on a large image-caption dataset and use it as the backbone of a detector, taking advantage of the dense features obtained during pre-training, which benefits Out-of-Distribution detection (e.g. F-VLM), or directly pre-train it on the grounding task (like RegionCLIP). It is possible to apply either transformer-based (e.g. OWL-ViT) or CNN-based architectures. The backbone can also be fine-tuned for object detection (e.g. UniDetector), which increases performance for In-Distribution classes but can distort the features for Out-of-Distribution detection. Training the detector in two stages can also be a good alternative, decoupling the problem into a localization task and a classification task and approaching it from two fronts.

Conclusion

Overall, Open-Vocabulary Detection models make very good use of pre-trained multi-modal models that are becoming increasingly powerful and freely available online. They offer better generalization than unimodal image models because they learn deeper meaning and context from text. They can be used for text-to-image search, object recognition, anomaly detection and visual grounding.

This approach is very recent and was the subject of my master thesis at ENSTA Paris, so if you want a more in-depth look at this very exciting new field of computer vision, I invite you to check my master thesis report [^8].


[^1]: Sharon Goldman. Foundation models: 2022's AI paradigm shift. VentureBeat, 13 September 2022.

[^2]: Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision, February 2021. URL http://arxiv.org/abs/2103.00020. arXiv:2103.00020 [cs].

[^3]: Stanislav Fort, Jie Ren, and Balaji Lakshminarayanan. Exploring the Limits of Out-of-Distribution Detection, July 2021. URL http://arxiv.org/abs/2106.03004. arXiv:2106.03004 [cs].

[^4]: Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, and Jianfeng Gao. RegionCLIP: Region-based Language-Image Pretraining, December 2021. URL http://arxiv.org/abs/2112.09106. arXiv:2112.09106.

[^5]: Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, and Anelia Angelova. F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models, 2023.

[^6]: Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. Simple Open-Vocabulary Object Detection with Vision Transformers, 2022.

[^7]: Zhenyu Wang, Yali Li, Xi Chen, Ser-Nam Lim, Antonio Torralba, Hengshuang Zhao, and Shengjin Wang. Detecting Everything in the Open World: Towards Universal Object Detection, March 2023. URL http://arxiv.org/abs/2303.11749. arXiv:2303.11749.

[^8]: Pietro Tanure. Anomaly Detection through Vision-Language Models, 2023.
