A new theoretical framework suggests that the controllability of large language models and image generators is far more fragile than previously assumed. The ability to steer model outputs strongly depends on the specific task and the model used.
For humans, generating an even or odd number on request is trivial. However, language models show dramatic differences in performance. While Gemma3-4B handles this task with near-perfect calibration, other models such as SmolLM3-3B fail outright. According to a new study by Apple, these fluctuations may reflect a fundamental weakness in generative AI systems.
Researchers from Apple and Spain’s Universitat Pompeu Fabra conducted a systematic investigation into how controllable language models and image generators truly are. Their conclusion is sobering: a model’s ability to produce desired outputs depends heavily on the specific combination of model architecture, task type, and initial prompt.
The researchers distinguish between two concepts that are often conflated in practice. Controllability refers to whether a model can reach desired outputs from any starting state. Calibration, by contrast, describes how accurately a model follows the user’s request. A system may, in principle, be capable of producing all target outputs, yet consistently deviate from the prompt.
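To make the distinction concrete, the two quantities can be sketched roughly as follows. This is an illustration of the idea rather than the paper's formalism: `generate`, `prompt_for`, the start states, and the `distance` function are hypothetical placeholders for whatever model and task are being tested.

```python
# Illustrative only (assumed interfaces, not the authors' toolkit):
# `generate(prompt, state)` returns the model's output for a request issued
# from a given starting state, `prompt_for(t)` builds the request for target t,
# and `distance` compares an output with the requested target.

def controllability(generate, prompt_for, start_states, targets):
    """Share of discrete target outputs the model reaches from every start state."""
    reachable = sum(
        all(generate(prompt_for(t), s) == t for s in start_states)
        for t in targets
    )
    return reachable / len(targets)

def calibration_error(generate, prompt_for, start_states, targets, distance):
    """Average gap between what was requested and what was actually produced."""
    gaps = [distance(generate(prompt_for(t), s), t)
            for t in targets for s in start_states]
    return sum(gaps) / len(gaps)
```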
Unpredictable Performance Across Language Models
The team tested SmolLM3-3B, Qwen3-4B, and Gemma3-4B across tasks such as controlling text formality, string length, and generating even or odd numbers.
In a formality control task using five-shot prompting, Qwen3-4B and Gemma3-4B achieved full controllability within five dialogue rounds, while SmolLM3-3B remained uncontrollable. The researchers also observed strong overshooting: even when given explicit feedback about the target formality level, models frequently overcorrected and swung past the target in the opposite direction.
The even–odd number task further revealed stark unpredictability. Qwen3-4B achieved perfect controllability. Gemma3-4B delivered near-perfect calibration but struggled to maintain full controllability across the entire output space.
The charts illustrate how three language models adjust text formality across five dialogue rounds. Yellow zones mark controllable regions, while gray areas remain unreachable. Five-shot prompting significantly improves controllability — except in SmolLM3-3B. | Source: Cheng et al.
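A feedback loop like the formality experiment can be approximated in a few lines of code. The sketch below is a loose reconstruction, not the authors' protocol: `ask_model` and `score_formality` are hypothetical stand-ins for the chat model under test and a formality scorer, and overshooting is counted whenever a revision swings past the target.

```python
# Hedged sketch of a multi-round formality-control loop (my illustration,
# not the study's protocol). `ask_model(messages) -> str` and
# `score_formality(text) -> float in [0, 1]` are hypothetical stand-ins.

def formality_control_run(ask_model, score_formality, target=0.8, rounds=5):
    """Request a target formality level, feed back the measured score each
    round, and record the trajectory so overshooting becomes visible."""
    messages = [{
        "role": "user",
        "content": f"Write a short product description with formality {target:.1f} on a 0-1 scale."
    }]
    trajectory = []
    for _ in range(rounds):
        reply = ask_model(messages)
        score = score_formality(reply)
        trajectory.append(score)
        feedback = (
            f"Your reply scored {score:.2f} in formality; the target is {target:.1f}. "
            "Please revise it accordingly."
        )
        messages += [{"role": "assistant", "content": reply},
                     {"role": "user", "content": feedback}]
    # Count how often a revision crosses the target, i.e. overshoots it.
    overshoots = sum(
        1 for prev, cur in zip(trajectory, trajectory[1:])
        if (prev < target < cur) or (prev > target > cur)
    )
    return trajectory, overshoots
```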
Experiments with Qwen models ranging from 0.6 to 14 billion parameters also showed that larger models tend to be more controllable. However, most improvements plateaued around 4 billion parameters.
Image Generators Struggle With Object Placement
For text-to-image models such as FLUX-s and SDXL, the researchers examined control over object count, spatial positioning, and color saturation. FLUX-s performed best at controlling object count: requesting more objects reliably increased their number in the generated images. Exact counts, however, were rarely achieved, with an average deviation of approximately 3.5 objects.
The sharpest gap between controllability and calibration emerged in color saturation. Although both FLUX-s and SDXL could generate images across the full saturation spectrum, the actual saturation level bore little relationship to the prompt. Correlation between requested and produced saturation was below 0.1.
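The saturation finding can be probed in a similar spirit. The sketch below is a rough check, not the study's pipeline: `generate_image` is a hypothetical wrapper that returns a PIL image for a prompt, the prompt template is invented, and the correlation uses `statistics.correlation`, available from Python 3.10.

```python
# Hedged sketch of a saturation check, assuming a hypothetical
# `generate_image(prompt)` that wraps FLUX-s or SDXL and returns a PIL image.
import colorsys
from statistics import correlation  # Pearson's r, Python 3.10+

def mean_saturation(image):
    """Average HSV saturation over all pixels, in [0, 1]."""
    pixels = list(image.convert("RGB").getdata())
    sats = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)[1] for r, g, b in pixels]
    return sum(sats) / len(sats)

def saturation_correlation(generate_image, levels=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Correlate the requested saturation levels with the measured ones."""
    requested, measured = [], []
    for level in levels:
        img = generate_image(f"a bowl of fruit, color saturation {level:.0%}")
        requested.append(level)
        measured.append(mean_saturation(img))
    return correlation(requested, measured)
```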
Open-Source Toolkit Released
The framework draws on control theory concepts, formalizing AI dialogue as a control system. The researchers released their methodology as an open-source toolkit, enabling systematic evaluation of model controllability.
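In that framing, the conversation history plays the role of the system state, each user prompt is a control input, and the model's reply is the observed output; controllability asks whether some sequence of prompts can drive the output to a given target within a bounded number of rounds. The sketch below is a simplified, deterministic rendering of that idea, not the released toolkit's API; `step` is a hypothetical function that advances the dialogue by one round.

```python
# Rough sketch of the control-system view (my simplification, not the
# authors' formalism): `step(state, prompt)` advances the dialogue one
# round and returns (new_state, output); the search assumes determinism
# and a small candidate prompt set, since it grows exponentially.

def reachable(step, initial_state, target, candidate_prompts, horizon=5):
    """Breadth-first search over prompt sequences of length <= horizon:
    is there any sequence that makes the model emit `target`?"""
    frontier = [initial_state]
    for _ in range(horizon):
        next_frontier = []
        for state in frontier:
            for prompt in candidate_prompts:
                new_state, output = step(state, prompt)
                if output == target:
                    return True
                next_frontier.append(new_state)
        frontier = next_frontier
    return False
```

A target that no prompt sequence reaches within the horizon corresponds to the gray, unreachable regions in the charts above.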
The models examined reached up to 14 billion parameters; frontier systems such as GPT-5 or Claude 4.5 — widely used in daily interactions — were not included. However, the authors emphasize that the framework is architecture-agnostic and applies to any generative model. The observed scaling trends suggest that larger models become more controllable, but do not indicate that the problem disappears at frontier scale.
Instead, the toolkit offers the first structured approach to testing controllability even in state-of-the-art systems. The authors argue for a shift in perspective: controllability should not be assumed but explicitly measured. Their findings suggest that no model or prompting method performs reliably across all tasks, even in simple scenarios.
Fragile controllability is not the only concern. A recent Anthropic study demonstrated that models can simulate compliance with safety rules while secretly pursuing alternative objectives. Additionally, AI systems can detect evaluation settings and adjust behavior accordingly, potentially undermining benchmark reliability.