Training Priors Predict Text-To-Image Model Performance

Since the publication of this article in 2023, various reinforcement learning methods that give direct feedback on this type of problem appear to have helped diffusion models (and similar architectures) generalize better. I still expect the same underlying issues to persist within the models, perhaps in more muted and less obvious ways. We do not yet seem to have found a systematic fix for this type of issue.

Text-to-image models can generate "astronaut riding a horse" but struggle with "horse riding an astronaut." Why? We tested whether this reflects training priors—the frequency of subject–verb–object (SVO) patterns in the model's training data—using Stable Diffusion 2.1.

The core finding: the more often an SVO triad appears in training data, the better the model generates an aligned image—and the worse it handles the flipped ordering. This suggests a "mix-and-match" mechanism rather than true compositional generalization. The model stitches together familiar patterns rather than reasoning about abstract relations.

Individual term frequencies matter too. How often a word appears as an agent versus a patient in training has a significant effect on generation quality. A "ball" is common as an object (patient) but rarely as a subject (agent)—so "ball chasing a dog" fails not just because the triad is rare, but because "ball" as agent is itself unusual.

Dog chasing a ball (left) and ball chasing a dog (right). *Left:* The common triad yields a correct image. *Right:* The flipped prompt produces two dogs; the model cannot depict "ball" as an agent.

Experimental design

Each prompt encodes a triad ⟨subject, verb, object⟩. For a given triad, the more frequent ordering is the default and the reverse is flipped. We estimate SVO counts from LAION captions and regress alignment ratings against these counts.

Term	Definition
SVO	Estimated count of the relation ⟨s, v, o⟩
OVS	Estimated count of the flipped relation ⟨o, v, s⟩
Sxx, xVx, xxO	Frequency of each term in its given role
Oxx, xxS	Frequency of each term in the opposite role
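
As a rough illustration of how the quantities in this table could be tallied, here is a minimal sketch that assumes the parsed captions are already available as (subject, verb, object) tuples. The function names `role_counts` and `features`, and the keys `subject_as_object` and `object_as_subject`, are hypothetical and not taken from the paper. The two opposite-role lookups are the raw ingredients for Oxx and xxS.

```python
from collections import Counter

def role_counts(triads):
    """Tally whole triads and each term's count per role."""
    triads = list(triads)
    triad_counts = Counter(triads)
    as_subject = Counter(s for s, _, _ in triads)
    as_verb = Counter(v for _, v, _ in triads)
    as_object = Counter(o for _, _, o in triads)
    return triad_counts, as_subject, as_verb, as_object

def features(triad, triad_counts, as_subject, as_verb, as_object):
    """Raw (untransformed) counts for one prompt triad."""
    s, v, o = triad
    return {
        "SVO": triad_counts[(s, v, o)],    # the relation as prompted
        "OVS": triad_counts[(o, v, s)],    # the flipped relation
        "Sxx": as_subject[s],              # subject term in the subject role
        "xVx": as_verb[v],                 # verb frequency
        "xxO": as_object[o],               # object term in the object role
        "subject_as_object": as_object[s], # opposite-role count (hypothetical key)
        "object_as_subject": as_subject[o],# opposite-role count (hypothetical key)
    }

# Toy data, not from LAION.
toy = [("dog", "chase", "ball")] * 3 + [("ball", "chase", "dog"), ("dog", "eat", "food")]
print(features(("dog", "chase", "ball"), *role_counts(toy)))
```

A `Counter` returns 0 for unseen keys, so triads or roles that never occur need no special handling.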

All counts are log-transformed: log₁₀(count + 1). We fit a linear model, Alignment ~ SVO + OVS + Sxx + xVx + xxO + Oxx + xxS, on the individual (disaggregated) crowdsourced ratings, N = 5 per image. Alignment is measured on a 5-point Likert scale normalized to [0, 1]. "Success" is defined as alignment ≥ 0.75.
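
The transform and the regression above can be sketched with NumPy. This is an illustrative reconstruction, not the authors' analysis code. It assumes ordinary least squares with an intercept and a Likert rating r in 1..5 mapped to (r − 1) / 4; the article does not state either detail. All function names are hypothetical.

```python
import numpy as np

FEATURES = ["SVO", "OVS", "Sxx", "xVx", "xxO", "Oxx", "xxS"]

def log_count(count):
    """log10(count + 1), as described in the text."""
    return np.log10(np.asarray(count, dtype=float) + 1.0)

def normalize_likert(rating, low=1, high=5):
    """Map a 1..5 rating onto [0, 1] (assumed mapping)."""
    return (np.asarray(rating, dtype=float) - low) / (high - low)

def is_success(alignment, threshold=0.75):
    """Success criterion from the text: alignment >= 0.75."""
    return np.asarray(alignment) >= threshold

def fit_alignment(counts, ratings):
    """Least-squares fit of normalized alignment on log counts.

    counts:  array of shape (n, 7), raw counts in FEATURES order
    ratings: array of shape (n,), one disaggregated rating per row
    """
    X = log_count(counts)
    X = np.column_stack([np.ones(len(X)), X])  # intercept column (assumption)
    y = normalize_likert(ratings)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return dict(zip(["Intercept"] + FEATURES, coef))
```

Because the predictors are log₁₀ counts, each fitted coefficient reads as the change in alignment per order-of-magnitude change in the corresponding count.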

Our hypotheses: Forward—increased SVO frequency improves alignment. Backward—increased OVS frequency hurts alignment (the model defaults to the more common ordering).

Dataset

We use Stable Diffusion 2.1, which was trained on LAION (aesthetic score ≥ 4.5, ~1.37B image–text pairs). We parse ~10% of the captions (~134M sentences) with spaCy and extract ~50M unique SVO triads. From these we curate ~769 triads with counts, generated images, and crowdsourced alignment ratings. Each prompt follows the template: “A photograph of a {subject} {verb} a {object}.” Ratings are collected via SurgeAI with 5 annotators per image.
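
For concreteness, the prompt template can be filled as in the short sketch below. The helper names are hypothetical, and the sketch assumes the verb is supplied already inflected (for example "chasing"), since the template itself does no conjugation.

```python
TEMPLATE = "A photograph of a {subject} {verb} a {object}."

def make_prompt(subject, verb, obj):
    """Fill the template quoted in the text."""
    return TEMPLATE.format(subject=subject, verb=verb, object=obj)

def prompt_pair(subject, verb, obj):
    """Return the prompt for <s, v, o> and for the swapped roles <o, v, s>."""
    return make_prompt(subject, verb, obj), make_prompt(obj, verb, subject)

forward, swapped = prompt_pair("dog", "chasing", "ball")
print(forward)  # A photograph of a dog chasing a ball.
print(swapped)  # A photograph of a ball chasing a dog.
```

Note that the template as quoted always uses the article "a", so a noun such as "astronaut" would need separate handling that is not shown here.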

Partition	N	Mean	Median
SVO Isolated	755	0.56	0.75
OVS Isolated	835	0.44	0.25
Default	1635	0.50	0.50
Flipped	1635	0.46	0.25

Forward: increased SVO increases alignment

When OVS = 0 (isolated SVO triads), we see a strong positive effect: each order-of-magnitude increase in SVO count yields +0.31 alignment. Once SVO exceeds 10², 18 of 26 prompts succeed (alignment ≥ 0.75). The correlation between SVO count and alignment is 0.421 (p < 10⁻⁷).

Term	Effect	p
SVO	+0.31	0.00
xVx	+0.12	0.01
Oxx	−0.18	0.00
xxS	−0.23	0.00

SVO frequency (log-scaled) versus alignment for isolated SVO triads (OVS = 0). The dashed trend line shows the positive relationship. At low frequencies, outcomes are scattered; above 10², most prompts succeed.

Backward: role typicality drives failure

When SVO = 0 (isolated OVS triads), alignment is generally poor. Surprisingly, the OVS count itself is not significant (p = 0.38). Instead, the drop is better explained by role typicality: the term xxS (how often the object word appears as a subject elsewhere) has a strong negative effect (−0.19, p < 0.01). This means "ball" fails as an agent not because "ball chasing dog" competes with "dog chasing ball," but because "ball" is almost never seen in the subject role at all in training.

Term	Effect	p
OVS	−0.06	0.38
xxO	+0.19	0.00
xxS	−0.19	0.00

OVS frequency versus alignment for isolated OVS triads (SVO = 0). The weak negative trend is not statistically significant—role typicality of individual terms is a stronger predictor than the flipped triad count.

Interaction: both frequencies nonzero

When both SVO and OVS are nonzero, we split into default (SVO > OVS) and flipped (OVS > SVO) partitions, each with 1,635 ratings.

Term	Default effect	Default p	Flipped effect	Flipped p
SVO	+0.25	0.00	+0.14	0.04
OVS	−0.40	0.00	−0.11	0.11
Oxx	−0.13	0.00	−0.06	0.01
xxS	−0.18	0.00	+0.04	0.36

For default orderings, both SVO and OVS are large and significant—the model benefits from seeing the relation but is also hurt by familiarity with the reverse. For flipped orderings, the effects are weaker: SVO is still positive but OVS is only directionally correct and not significant. This asymmetry suggests the model's generative process is more sensitive to default patterns. Flipped prompts may face a "floor effect" where alignment is already low regardless of counts.
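
The four partitions used in the analysis (SVO isolated, OVS isolated, default, flipped) follow directly from the two triad counts. The sketch below is one way to express that split. The function name is hypothetical, and returning `None` for ties or for triads with both counts at zero is an assumption, since the article does not say how such cases are handled.

```python
def partition(svo, ovs):
    """Assign a prompt to a partition from its raw SVO and OVS counts."""
    if svo > 0 and ovs == 0:
        return "SVO isolated"
    if svo == 0 and ovs > 0:
        return "OVS isolated"
    if svo > ovs:
        return "default"   # both nonzero, prompted order is the more common one
    if ovs > svo:
        return "flipped"   # both nonzero, reverse order is the more common one
    return None            # both zero, or a tie: not covered by the text

print(partition(120, 0))   # SVO isolated
print(partition(40, 3))    # default
print(partition(3, 40))    # flipped
```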

Takeaways

Training priors shape text-to-image outputs in multiple, measurable ways. Frequency effects are stronger for default (more common) orderings than for flipped ones. There is no strong evidence that ⟨horse, ride, astronaut⟩ fails because ⟨astronaut, ride, horse⟩ is common. Instead, multiple training statistics—triad counts and individual term role frequencies—jointly determine alignment.

These findings mirror patterns in human language processing, where frequency and typicality modulate comprehension and production. See Mahowald et al. (2022) for a discussion of how distributional statistics in language relate to cognitive processing. The model's behavior is consistent with a mechanism that relies heavily on surface co-occurrence statistics rather than abstract compositional reasoning.

Citation

@article{lovering-pavlick-2023-training,
  title   = {Training Priors Predict Text-To-Image Model Performance},
  author  = {Lovering, Charles and Pavlick, Ellie},
  journal = {arXiv preprint arXiv:2306.01755},
  year    = {2023},
  url     = {https://arxiv.org/abs/2306.01755}
}