Generalist YOLO

Abstract

Background Statement:

  • Generalist models, capable of handling multiple modalities and tasks simultaneously, are currently one of the hottest research topics.

Problem Statement:

  • However, because different tasks interfere with each other during training, existing generalist models require a very large decoder to achieve good results across tasks, which makes real-time prediction difficult for current generalist models.

Purpose:

  • This paper introduces Generalist YOLO, which takes a significant step towards real-time prediction systems for visual language generalist models.

Methods:

  • The proposed Generalist YOLO uses a unified encoder to reduce conflicts between different tasks, thereby decreasing the complexity required by the decoder.
  • It also introduces a primary-secondary co-attention mechanism that allows different tasks to learn together more effectively, achieving high efficiency and high accuracy.
  • We propose a semantically consistent asymmetric training strategy, allowing the various tasks to benefit from the performance improvements brought by the latest research results in their respective fields.

Results:

  • The proposed Generalist YOLO achieves excellent results on various vision and language tasks based on MS COCO.

Contribution:

  • While maintaining high accuracy across all tasks, it is 135 times faster than existing generalist models.
  • The source code is released on GitHub at GeneralistYOLO.

Introduction

  • Multi-task Visual Language Model (VLM) is an important component for building Artificial General Intelligence (AGI)
  • VLM includes two types of tasks:
    1. vision tasks, which describe a unique attribute of the targets, such as image classification, object detection, instance segmentation, and semantic segmentation;
    2. language tasks, which describe the correlation between multiple objects and scenes, such as image captioning, visual grounding, and visual question answering.
  • Generalist Model: usually refers to a VLM capable of handling different vision language tasks

Artificial Intelligence Tasks

For example, recall that in NLP (Natural Language Processing) we have tasks like machine translation, question answering, and so on.

  • Shortcomings of today's VLMs (2025):
    1. They need to rely on powerful pre-trained models, common ones being large-dataset pre-trained models, task-specific pre-trained models, and vision-language foundation models.
    2. They need large task-specific decoders, because each type of task requires its own decoder.
    3. They require long inference times: most methods run at less than 1 fps, some even at less than 0.1 fps.
  • The authors think these problems stem from the design of the encoder-decoder architecture and of the attention, so they put emphasis on creating the previously cited Unified Encoder (i.e. a “generalist” encoder capable of handling multiple tasks) and Primary-Secondary Co-Attention Mechanism (to exchange information across tasks)
  • ???? (claim from the paper that I do not fully understand:) “We found that due to the differences in semantic levels between different tasks, the most advanced training methods designed for various tasks cannot be applied in the process of training generalist models.”
  • “The proposed Generalist YOLO does not rely on pretrained models and additional datasets, and achieves the accuracy of state-of-the-art generalist models” -> OK, if the authors wrote this, it means that other proposed solutions do use pretrained models and additional pre-training or fine-tuning datasets

Other related works are generalist models, foundation models, and multi-task models, so I assume that Generalist YOLO is meant to combine capabilities from all three categories.

Generalists Models

Foundation Models

Multi-Task Models

Proposed Models

Generalist YOLO

Unified Encoder

“Unified”: I think it may refer to the fact that, before this model, you had to use different encoders for different inputs
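If that reading is right, the idea can be sketched as one shared encoder whose features are computed once and then reused by several lightweight task heads, instead of one encoder per task. All names and the toy math below are my own illustration, not the paper's actual architecture:

```python
# Hypothetical sketch of a "unified" encoder shared across tasks.
# The feature extraction and heads are toy stand-ins for illustration only.

def unified_encoder(image):
    """Stand-in for a shared backbone: maps an image to one feature vector."""
    flat = [p for row in image for p in row]
    return [sum(flat) / len(flat), max(flat), min(flat)]

def detection_head(feat):
    """Toy task head: pretends to score 'objectness' from the shared features."""
    return feat[0] + 0.5 * feat[1]

def captioning_head(feat):
    """Toy task head: pretends to pick a caption from the shared features."""
    return "bright scene" if feat[0] > 0.5 else "dark scene"

image = [[0.9, 0.8], [0.7, 0.6]]
feat = unified_encoder(image)           # encoded once ...
outputs = {
    "detection": detection_head(feat),  # ... then reused by every task head,
    "caption": captioning_head(feat),   # instead of one encoder per task.
}
print(outputs)
```

The point of the sketch is only the data flow: the expensive shared computation happens once, and each task adds only a small head, which is consistent with the paper's goal of shrinking the decoders.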

Primary-Secondary Co-Attention Mechanism

Maybe the model uses two attention mechanisms at the same time, one primary and one secondary. It would be interesting to know what the relation between the two is, e.g. whether the model uses one on some occasions and the other on others.
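One plausible reading of a "co-attention" between two task streams is cross-attention: the primary task's features act as queries over the secondary task's features (and possibly vice versa), so the tasks exchange information. This is only my guess at the mechanism; every name below is invented for illustration:

```python
import math

# Hypothetical sketch of co-attention between two task streams:
# the primary stream queries the secondary stream's key/value pairs.

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    """Each query attends over all key/value pairs (scaled dot product)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

primary = [[1.0, 0.0], [0.0, 1.0]]    # e.g. detection-stream features
secondary = [[1.0, 1.0], [0.0, 2.0]]  # e.g. captioning-stream features

# Primary stream enriched with information from the secondary stream:
enriched_primary = cross_attention(primary, secondary, secondary)
print(enriched_primary)
```

If the mechanism really is asymmetric (primary vs. secondary), one direction of this exchange may be weighted or gated differently from the other, but the paper's sections would have to confirm that.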

Semantically Consistent Asymmetric Training

  • Semantically: with regard to meaning.
  • Semantically Consistent: consistent with regard to meaning.
  • Asymmetric: ????
  • Training: maybe the training is different from how you normally train a NN. I understand that “semantically” may refer to an image-to-text capability that should stay consistent, so maybe it is something like self-evaluation for LLMs. Is the training supervised or unsupervised?
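One guess at what "asymmetric" could mean in joint training: each task gets its own loss and its own weight, rather than one symmetric, equally weighted sum. The weights and loss values below are invented purely to illustrate that reading:

```python
# Hypothetical sketch: "asymmetric" joint training read as per-task loss
# weighting. All numbers and weights here are invented for illustration.

def multi_task_loss(task_losses, task_weights):
    """Combine per-task losses with task-specific (possibly unequal) weights."""
    return sum(task_weights[t] * loss for t, loss in task_losses.items())

task_losses = {"detection": 0.8, "segmentation": 1.2, "captioning": 2.0}

# Symmetric baseline: every task weighted equally.
symmetric = multi_task_loss(task_losses, {t: 1.0 for t in task_losses})

# Asymmetric variant: tasks weighted differently (weights invented).
asymmetric = multi_task_loss(
    task_losses,
    {"detection": 1.0, "segmentation": 0.5, "captioning": 0.25},
)
print(symmetric, asymmetric)
```

Whether the paper's strategy actually works this way (or instead applies different augmentations, targets, or schedules per task) would need to be checked against the method section.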

Experiments

Experimental Settings

  • MS COCO 2017 dataset
    • Kaggle link to dataset: 2017-2017
    • Description: The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.
    • Original paper introducing the dataset: arXiv:1405.0312

Comparison with (other) State-of-the-arts (foundation models)

Ablation Studies

Semantically consistent asymmetric training

Primary-secondary co-attention mechanism

Relaxed Optimizer

Unified encoder

Inference Time Comparison

Qualitative Visualization Results

Conclusion