Category: Analysis
Daniel Mercer
Share
Listen On

Meta FAIR and researchers from New York University have conducted a systematic study on how multimodal AI models can be trained from scratch. Their findings challenge several widely held assumptions in the field.

Large language models have defined the era of foundation models. However, in their paper “Beyond Language Modeling,” the researchers argue that text is ultimately a lossy compression of reality. Referencing Plato’s allegory of the cave, they suggest that language models have learned to describe the shadows on the wall without ever seeing the objects that cast them. In addition, high-quality text data is finite and may eventually become exhausted.

Examples of the four training data types: plain text, image-text pairs, action-driven video sequences, and raw video. | Image: Tong et al.
Examples of the four training data types: plain text, image-text pairs, action-driven video sequences, and raw video. | Image: Tong et al. 

To address this limitation, the team trained a single model entirely from scratch. It combines standard next-token prediction for language with a diffusion method called Flow Matching for visual data. Training data included text, raw video, image-text pairs, and action-conditioned videos. A key methodological choice was to avoid building on an existing language model, preventing previously learned knowledge from biasing the results.

Above, the model architecture that combines text and image prediction in a single model. Below, the five axes studied. | Image: Tong et al.
Above, the model architecture that combines text and image prediction in a single model. Below, the five axes studied. | Image: Tong et al. 

One visual encoder is enough for both understanding and generation

Previous approaches such as Janus or BAGEL use separate visual encoders for image understanding and image generation. According to the study, this separation may not be necessary.

A Representation Autoencoder (RAE) based on the SigLIP 2 image model outperformed traditional VAE encoders in both image generation and visual understanding, while maintaining language performance at the level of a pure text model.

RAE based on SigLIP 2 outperforms VAE-based encoders in image generation and visual understanding without compromising speech performance. | Image: Tong et al.
RAE based on SigLIP 2 outperforms VAE-based encoders in image generation and visual understanding without compromising speech performance. | Image: Tong et al. 

Instead of two separate processing paths, a single encoder can handle both tasks, simplifying the architecture significantly.

Another common assumption is that vision and language compete with each other inside a model. The study suggests otherwise. Training on raw video data without text annotations did not harm language performance. In fact, on a validation dataset, the model trained on text and video slightly outperformed the text-only baseline.

More text improves image generation: For every visual token budget, additional text reduces diffusion loss and increases the GenEval score beyond the purely visual baseline. | ​​Image: Tong et al.
More text improves image generation: For every visual token budget, additional text reduces diffusion loss and increases the GenEval score beyond the purely visual baseline. | ​​Image: Tong et al

An interesting synergy effect also appeared. When 20 billion VQA tokens (visual question answering data) were combined with 80 billion tokens from video, image-text pairs (MetaCLIP), or text, the resulting model outperformed another model trained on 100 billion VQA tokens alone.

World modeling emerges naturally

The researchers also tested whether the model could predict future visual states. Given a current image and a navigation instruction, the model had to predict the next visual frame. Actions were encoded directly as text, meaning no architectural changes were required.

The results suggest that world-modeling abilities emerge primarily from general multimodal training, rather than from specific navigation datasets. With just one percent of task-specific data, the model achieved competitive performance.

The model generates image sequences based on keyboard inputs (W, A, D) or natural language instructions such as
The model generates image sequences based on keyboard inputs (W, A, D) or natural language instructions such as "get out of the shadow!", without ever having seen such inputs during training. | Image: Tong et al. 

The system could even respond to natural language commands such as “Get out of the shadow!” and generate appropriate image sequences, despite never having encountered such instructions during training.

Mixture-of-Experts learns specialization automatically

The study also explored Mixture-of-Experts (MoE) architectures. In this approach, each input token is routed only to a subset of specialized network modules rather than activating the entire model. This reduces computational cost while increasing total model capacity.

In a model with 13.5 billion parameters, of which only 1.5 billion are active per token, the MoE architecture outperformed both dense models and manually designed separation strategies.

Interestingly, the model developed specialization on its own. It assigned significantly more experts to language than to vision. Early layers were dominated by text specialists, while deeper layers increasingly contained visual and multimodal experts.

The model naturally develops a specialization: Early layers are dominated by text experts, while in deeper layers the proportion of visual and multimodal experts increases. | Image: Tong et al.
The model naturally develops a specialization: Early layers are dominated by text experts, while in deeper layers the proportion of visual and multimodal experts increases. | Image: Tong et al. 

Another notable finding is that image understanding and image generation activated the same experts, with correlations of at least 0.90 across all layers. The researchers interpret this as evidence supporting Rich Sutton’s “Bitter Lesson” — that learning from large datasets often outperforms carefully engineered solutions.

Vision requires far more data than language

Training AI models always involves a trade-off between model size and the amount of training data. The well-known Chinchilla scaling laws suggest that for language models, both should grow at roughly the same rate.

However, when applying scaling laws to a combined vision-language model, the researchers discovered a major asymmetry. For language, the traditional balance still holds. For vision, the optimal strategy shifts heavily toward more data rather than a larger model.

As models scale up, the difference becomes dramatic. Starting from a base model with 1 billion parameters, the study estimates that the relative demand for vision data compared with language data increases 14-fold at 100 billion parameters and 51-fold at 1 trillion parameters.

The scaling laws for vision and language differ fundamentally: language follows an almost balanced chinchilla scaling, while vision requires significantly more data. | Image: Tong et al.
The scaling laws for vision and language differ fundamentally: language follows an almost balanced chinchilla scaling, while vision requires significantly more data. | Image: Tong et al. 

This imbalance is difficult to resolve in traditional dense models, where every parameter is active during each computation step.

However, the Mixture-of-Experts architecture helps mitigate the issue. Because only a fraction of experts are activated per token, the model can have a very large total parameter count without proportional increases in computational cost. This allows language to benefit from high parameter capacity while vision benefits from large datasets. According to the study, MoE reduces the scaling asymmetry between the two modalities by roughly half.

The researchers emphasize that their work focuses on pretraining only. Finetuning and reinforcement learning were not examined in detail. Nevertheless, the findings suggest that the boundary between multimodal models and world models may increasingly blur.

Massive volumes of unlabeled video data remain largely unused today. The study indicates that these datasets could be incorporated into AI training without harming language performance, potentially opening the door to more powerful multimodal systems in the future.

AI Research Contributor
Daniel Mercer is an AI research contributor specializing in large language models, benchmarking, and multimodal systems. He writes about model capabilities, limitations, and real-world performance across leading AI assistants and platforms.

Recent Podcasts

Adobe Reinvents Document Work with Acrobat Studio and AI

Adobe has fundamentally reimagined document workflows with the launch of Acrobat Studio, a...

AI as a Role Model for Generation Alpha: Promise, Risks, and the Future of Childhood

Artificial intelligence is becoming the main role model for Generation Alpha. 2026 may mark a...

AI as a Toy: Why Humanity Always Misuses New Technology First

Artificial intelligence could, in theory, help solve all of humanity’s problems. Stop wars, cure...