The Spatial Blind Spot in Generative Model Evaluation

Why Current Metrics Miss Composition: A Kernel for Evaluating Spatial Priors in Image Models
(Field Explainer)

[Figure: chart of object occlusion versus appearance error. Four clusters of error points surround a central dark gray circle labeled 'BLIND SPOT': object occlusion errors (top left), appearance errors (top right), missed detection errors (bottom left), and localization errors (bottom right).]

Most evaluation metrics assess what appears in the image, but not how the model organizes space. The VTL kernel measures spatial reasoning: the internal geometry of the frame.

Introduction:

Most evaluation metrics for generative image models (FID, IS, KID, CLIP-based metrics, T2I-CompBench, and GenEval) measure semantic correctness and feature-space realism. They evaluate whether models produce the right objects, with the right textures, in a distributionally plausible way. What they cannot see is composition: the placement of mass and void, the presence of tension and compression, and the way spatial priors shape every image even when the content is technically correct. As a result, current benchmarks reward models that collapse toward statistically safe layouts (centered objects, shallow diagonals, radial falloff) while inadvertently penalizing structural deviation.

This document introduces a kernel-based geometric framework for measuring the spatial priors image models default to. The kernel is a compact field of seven deterministic primitives: Delta x,y (placement offset), rv (void ratio), rho_r (packing density), mu (cohesion), xp (peripheral pull), theta (orientation stability), and ds (structural thickness). Each primitive is computed directly from pixel geometry, without learned weights or training data. Together they expose compositional bias and collapse behavior that semantic metrics cannot detect, including the Radial Collapse Prior (RCP), a systematic tendency toward center-weighted, void-starved layouts that scores well on every existing benchmark while representing a fundamental failure of spatial reasoning.

The kernel does not compete with existing metrics. It completes them. FID, IS, CLIP, and GenEval reliably answer two questions: did the model render the right things, and does the output resemble the distribution it was trained on? The kernel answers a third: how does the model distribute mass and void, and which spatial priors does it fall back on when uncertain? These are orthogonal measurements. A model can achieve excellent scores on all existing benchmarks while collapsing compositionally into the same layout on every generation, and current evaluation infrastructure has no way to detect it.

1. The Structural Blind Spot

Semantic metrics share a common hidden assumption: realism plus semantic faithfulness equals quality. Diversity is measured as the number of semantic classes produced. Realism is measured as proximity to the photographic training distribution. Correctness is measured via CLIP similarity or classifier alignment. None of these axes have anything to say about where objects are placed, whether the frame is being fully used, or whether the model has collapsed all spatial energy into a single attractor basin.

The problem is structural, not incidental. The easiest way to produce a realistic image is a centered object. The easiest way to satisfy a CLIP metric is to isolate the subject against a clean background. The easiest way to reduce FID is to mimic photographic priors, which already favor shallow diagonals, upper-right key light, and radial falloff. Safety fine-tuning compounds this: models trained to avoid ambiguity learn to prefer the compositional postures least likely to trigger edge cases, which are precisely the centered, symmetric, low-tension layouts that constitute collapse.

The result is a measurement ecosystem that unintentionally rewards the exact spatial failures generative models already exhibit. The specific collapse signatures that go undetected include:

  • Central framing (subjects drift toward Delta x near 0 regardless of prompt)

  • Circular/radial composition (rho_r peaks at center with symmetric decay)

  • Void starvation (rv suppressed except at perimeter strips)

  • Symmetric void distribution (emptiness placed decoratively, not structurally)

  • Horizon-locking (subjects flatten to a fixed vertical band)

  • Perspective collapse (spatial depth compressed into shallow planes)

  • Foreground crowding with empty horizons

These behaviors are not random noise. They are inductive priors born from training distributions and reinforced by evaluation regimes that cannot see them. A generative model that always produces radially centered compositions with symmetric voids is not reasoning spatially; it is executing a learned prior. Measuring it with FID and CLIP will tell you it is doing very well.

2. What Current Metrics Actually Measure

The following is a compressed summary of what modern evaluation systems actually capture and what they cannot.

Inception Score (IS)

IS measures classifier confidence (whether generated objects are recognizable as a semantic category) and output diversity (whether many different classes are represented across samples). Its hidden assumption is that a good image is one that can be confidently labeled. It is completely blind to geometry, mass distribution, and void logic. A perfectly centered portrait with radial lighting and void-starved edges scores identically to a compositionally sophisticated one.

Fréchet Inception Distance (FID)

FID measures the distance between generated and real feature distributions in Inception embedding space, capturing both realism and feature diversity. It is blind to placement, attractor basins, and collapse onto default compositions. An image model can produce a cluster of outputs all centered at Delta x near 0 with identical spatial structure and achieve excellent FID if the texture and content vary. Feature diversity is not structural diversity.

Kernel Inception Distance (KID)

KID applies the same logic as FID but uses a maximum mean discrepancy estimator, which is unbiased and better behaved at small sample sizes. It inherits all the same blind spots: geometry, spatial priors, and compositional collapse are entirely invisible.

CLIP-Based Metrics

CLIP metrics measure text-image semantic similarity and object correctness. They answer whether the right things are present in the image, not where they are, whether the model used the full frame, or whether composition collapsed into a spatial prior. A prompt requesting a figure in a vast open field can be satisfied with the figure dead-center, void symmetrically distributed, and no spatial tension present, and CLIP will score it as correct.

T2I-CompBench and GenEval

These benchmarks measure compositional semantics: multi-object correctness, attribute binding, counting, and relation accuracy. They represent a meaningful advance over single-object metrics. But their hidden assumption reveals a critical gap: compositional correctness is treated as relational correctness. A prompt specifying a red cube left of a blue sphere can be satisfied with both objects dead-center, aligned on the horizon line, under radial lighting, satisfying all relational constraints while exhibiting every collapse mode in the kernel's vocabulary. The spatial degrees of freedom, void allocation, and forbidden zone behavior remain completely unobserved.

Five of the kernel's primitives (Delta x,y, rv, rho_r, mu, xp) form the core field and are computed on every evaluation. Two extended primitives (theta and ds) are invoked when more precision is required: for architectural compositions where gravity alignment matters, or for figurative work where mark weight and surface depth carry structural meaning. Note that xp is not a primitive in the strict sense: it is a field invariant derived from the interaction of Delta x, rv, rho_r, and mu, describing the net directional pull toward frame boundaries or center. It is the kernel's most sensitive collapse detector.


Together, these form a field representation of composition.

Where existing metrics see “a person standing,” the kernel sees:

  • where the body sits along the frame (Δx)

  • how much of the field is activated by texture, shadow, and object boundaries (rᵥ)

  • whether mass organizes radially (ρᵣ)

  • whether the frame binds excessively (μ)

  • where invisible attractors pull (xₚ)

[Table: kernel primitives and the bias each one relates to]

  Primitive                  Related bias
  Displacement (Δx)          symmetry bias
  Void ratio (rᵥ)            void collapse
  Packing density (ρᵣ)       crowding
  Cohesion (μ)               over-binding
  Peripheral pull (xₚ)       dominant attractors

[Figure: six images of a woman in different poses and drawing styles, with grid overlays indicating facial and body alignment.]

4. Consequence

Current metrics reward object correctness, not spatial reasoning. A model can score high on FID, IS, and CLIP while collapsing into the same layout every time, and no existing benchmark will register the failure.

Semantic diversity does not imply structural diversity. A model can generate fifty kinds of cats, all centered, all with diagonal lighting, all with symmetric void distribution. The semantic space is explored; the compositional space is not.

Feature diversity hides compositional collapse. Realistic texture and accurate object rendering are not the same thing as meaningful geometry. A model can produce photorealistic output while exercising only a fraction of the available compositional degrees of freedom.

Spatial priors are invisible without a geometric measure. Radial collapse, symmetric voids, default diagonals, and horizon-locking do not affect FID, IS, or CLIP scores at all. They can be pervasive and systematic without appearing anywhere in existing evaluation infrastructure.

The kernel reveals the model's actual degrees of freedom: which frame regions it avoids, what collapses under geometric pressure, and what remains spatially stable across varied prompts and seeds. This measures behavioral range, not quality: the difference between a model that can only produce centered compositions and one that explores the full compositional field.

This is the missing layer between denoising priors, perceptual metrics, and compositional reasoning. The kernel does not replace any of them. It sits between them and makes the spatial dimension legible.

5. How the Kernel Complements Existing Metrics

The table below maps what each existing metric measures, what it cannot see, and which failure modes it systematically misses. The kernel row describes what geometric measurement adds across all of them.

[Table: one row per metric, with columns Metric, Rewards, Blind to, and Kernel adds.]

The right framing is additive: FID realism plus geometric deviation; IS class diversity plus spatial novelty; CLIP prompt faithfulness plus compositional correctness; GenEval object relations plus field structure. Each existing metric becomes multi-dimensional rather than semantic-only. The kernel identifies attractor basins, failure morphologies, collapse tendencies, forbidden zones, and compositional drift under perturbation, none of which appear elsewhere.

[Table: rows for FID, IS, KID, CLIP scores, GenEval, T2I-CompBench, and the Kernel, with columns for what each measures, what it is blind to, and why failure cases are missed.]

6. Conclusion

Today's evaluation ecosystem measures what a model puts into an image, not how it organizes space. The kernel provides this missing dimension at minimal cost: five core primitives plus two optional extended ones, all model-agnostic, prompt-agnostic, and seed-agnostic. No training data, no learned weights, no model access required.

By making compositional priors explicit and measurable, the kernel enables better model diagnosis, better guidance and steering, clearer training objectives, safer avoidance of collapse basins, and more controllable generative behavior. It enables researchers to ask questions that currently have no instrumentation: which spatial behaviors are inherited from training data, which are emergent from architecture, and which are controllable through intervention?

A generative model that cannot reason spatially cannot reason at all. The kernel provides the first consistent way to measure that reasoning.

Appendix: Technical Notes

A. Kernel Definitions

Each primitive is computed deterministically from image output with no access to model weights, training data, or prompt context. All seven are model-agnostic; they operate on rendered pixels, not on architecture or training signal.

Delta x,y (Placement offset) measures the distance between the mass centroid of the primary subject region and the frame barycenter. It is the most direct indicator of radial collapse: a model that consistently returns Delta x near zero regardless of prompt is not composing, it is defaulting. Displacement variance across a seed sweep tells you how much compositional range the model actually exercises.
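The placement offset described above can be sketched in a few lines. This is a minimal illustration, not the author's reference implementation: it assumes the subject has already been reduced to a 2D "mass map" (for example a saliency or segmentation mask), takes the frame barycenter to be the geometric center of the pixel grid, and normalizes offsets by frame width and height. The names placement_offset and offset_spread are hypothetical.

```python
import numpy as np

def placement_offset(mass):
    """Return (dx, dy): mass centroid minus frame center, as a fraction of frame size.

    `mass` is a 2D array of non-negative weights for the subject region
    (assumption: produced upstream by a saliency or segmentation step).
    """
    h, w = mass.shape
    total = mass.sum()
    if total == 0:
        return 0.0, 0.0  # assumption: an empty mass map is reported as centered
    ys, xs = np.mgrid[0:h, 0:w]
    cx = (xs * mass).sum() / total
    cy = (ys * mass).sum() / total
    return float((cx - (w - 1) / 2) / w), float((cy - (h - 1) / 2) / h)

def offset_spread(masses):
    """Variance of the offset magnitude across a seed sweep of mass maps."""
    magnitudes = [np.hypot(*placement_offset(m)) for m in masses]
    return float(np.var(magnitudes))
```

A spread near zero across many seeds, together with offsets near zero, is the pattern the text describes as defaulting rather than composing.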

rv (Void ratio) measures the fraction of pixels belonging to low-density, low-texture areas. This is not the same as empty space in the naive sense; it tracks whether the model treats void as an active compositional field or merely as background fill. Void starvation (rv near zero except at perimeter strips) is a reliable RCP+ indicator, and it never shows up in FID or CLIP scores.
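One way to approximate the void ratio is to treat a pixel as void when its local gradient magnitude is small. The sketch below makes that assumption explicit: the texture threshold, the strip width, and the function names are all hypothetical choices, and the source does not specify how "low-density, low-texture" is operationalized. The second function separates perimeter void from interior void, which is what the void-starvation signature depends on.

```python
import numpy as np

def void_mask(gray, texture_thresh=0.05):
    """Boolean mask of low-texture pixels in a grayscale image scaled to [0, 1].

    Texture is approximated by gradient magnitude (an assumption of this sketch).
    """
    gy, gx = np.gradient(gray.astype(float))
    return np.hypot(gx, gy) < texture_thresh

def void_ratio(gray, texture_thresh=0.05):
    """Fraction of pixels classified as void."""
    return float(void_mask(gray, texture_thresh).mean())

def perimeter_vs_interior_void(gray, strip=0.1, texture_thresh=0.05):
    """Return (void fraction in the perimeter strip, void fraction in the interior).

    `strip` is the strip width as a fraction of each frame dimension (hypothetical default).
    """
    mask = void_mask(gray, texture_thresh)
    h, w = mask.shape
    sy, sx = max(1, int(h * strip)), max(1, int(w * strip))
    interior = np.zeros((h, w), dtype=bool)
    interior[sy:h - sy, sx:w - sx] = True
    return float(mask[~interior].mean()), float(mask[interior].mean())
```

Under these assumptions, a high perimeter value paired with an interior value near zero would correspond to void allocated only at the edges.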

rho_r (Packing density) measures local density around the dominant subject region. High rho_r signals that the model is compressing visual mass into a single zone rather than distributing weight across the frame. Combined with low Delta x, it produces the characteristic foreground-crowded, horizon-empty layout that semantic metrics score as perfectly normal.
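The radial behavior of rho_r can be inspected with a ring-averaged density profile around the frame center. The following is a sketch under stated assumptions: the input is a 2D mass map, rings are equal-width bands of normalized radius, and the function name radial_profile and the default ring count are hypothetical. A profile that peaks in the first ring and decays outward matches the "peak at center with symmetric decay" pattern named earlier.

```python
import numpy as np

def radial_profile(mass, n_rings=4):
    """Mean mass in concentric rings around the frame center, innermost ring first."""
    h, w = mass.shape
    ys, xs = np.mgrid[0:h, 0:w]
    r = np.hypot(xs - (w - 1) / 2, ys - (h - 1) / 2)
    r_norm = r / max(r.max(), 1e-9)  # 0 at the center, 1 at the farthest corner
    ring = np.minimum((r_norm * n_rings).astype(int), n_rings - 1)
    sums = np.bincount(ring.ravel(), weights=mass.ravel().astype(float), minlength=n_rings)
    counts = np.bincount(ring.ravel(), minlength=n_rings)
    return sums / np.maximum(counts, 1)
```

For example, comparing profile[0] to the mean of the whole profile gives a rough center-concentration ratio; any cutoff for calling that ratio "collapsed" would have to be chosen empirically.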

mu (Cohesion) measures the degree to which visually distinct masses fuse under uncertainty. High mu is not inherently a failure: in figurative work with extended gesture, high cohesion with distributed Delta x indicates intentional grouping. The failure mode is high mu combined with centered Delta x, which indicates the model is binding all spatial energy to a single attractor rather than holding multiple compositional elements in tension.
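The source does not give a formula for mu, so the sketch below substitutes a simple proxy: the share of subject pixels that belong to the largest 4-connected component of a binary subject mask. A value of 1.0 means everything has fused into one mass; lower values mean several separate masses coexist. The proxy, the mask input, and the names component_sizes and cohesion are assumptions for illustration only.

```python
from collections import deque
import numpy as np

def component_sizes(mask):
    """Sizes of the 4-connected components of a boolean mask (breadth-first flood fill)."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    sizes = []
    for y in range(h):
        for x in range(w):
            if not mask[y, x] or seen[y, x]:
                continue
            seen[y, x] = True
            queue, size = deque([(y, x)]), 0
            while queue:
                cy, cx = queue.popleft()
                size += 1
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            sizes.append(size)
    return sizes

def cohesion(mask):
    """Proxy for mu: fraction of subject pixels in the largest connected mass."""
    sizes = component_sizes(mask)
    return max(sizes) / sum(sizes) if sizes else 0.0
```

As the text notes, a high value is only a warning sign when it coincides with a centered placement offset, so this proxy is meant to be read alongside Delta x rather than on its own.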

xp (Peripheral pull) is the kernel's derived field invariant, computed as a function of Delta x, rv, rho_r, and mu. It describes the net direction of compositional gravity: whether the frame's energy is pulling toward boundaries (anti-collapse, expressive) or toward center (RCP-positive, collapsed). Because xp emerges from the interaction of the four core primitives rather than being measured directly, it is the most sensitive collapse detector in the set, capable of flagging RCP behavior that each individual primitive might only weakly suggest.

theta (Orientation stability) is an extended primitive, invoked for architectural, figurative, and load-bearing compositions where the direction of structural weight matters. It detects gravitational drift: whether the model is maintaining believable orientation in space or allowing compositional elements to float free of implied physics. In figure drawing evaluation, it distinguishes standing weight from gesture from collapse.

ds (Structural thickness / surface depth) is the second extended primitive, used when mark weight and material permeability are compositionally significant. It distinguishes planar, surface-resolved rendering from volumetrically rich spatial organization: the difference between a figure drawn as outline versus one built from mass. In generative model evaluation, low ds combined with high rho_r often signals the characteristic thinness of models that have learned to fill space without inhabiting it.

B. Example Collapse Signatures

Radial Collapse-Positive Morphology (RCP+): Low |Delta x|, suppressed rv except at perimeter strips, radial rho_r peak at center, high mu, strong inward xp vectors. This is the dominant failure mode across current generative models. Sora outputs cluster within a radius of 0.15 of the geometric center; Midjourney exhibits a 5.46% coefficient of variation in radial compliance, tighter than natural photography.

Anti-Collapse Morphology: High |Delta x| variance, asymmetric void distribution, distributed rho_r across multiple regions, lower mu, xp directional pull toward frame edges rather than center. Figure 3 in the figurative example set demonstrates this: xp = 0.469 indicates strong anti-RCP geometry despite its structural complexity.

Boundary Aversion: Delta x drifts toward center even when prompts explicitly push outward. The model's spatial prior overrides the prompt's compositional intent. Detectable only through kernel measurement across varied seeds.

Void Starvation: rv near 0 except at perimeter strips. The model allocates empty space only at edges, treating void as border decoration rather than active compositional field. Correlates strongly with high mu and RCP+ morphology.

These collapse signatures map directly onto VTL compositional axes: A4 Elastic Continuity, A5 Mark Commitment, A27 Rupture Overload, and A30 Referential Recursion. All signatures are falsifiable: if a model claiming anti-RCP behavior shows low |Delta x|, symmetric rv, and inward xp, the kernel exposes the mismatch between architectural claim and measured output.

C. Minimal Experiment Setup

A minimal kernel evaluation requires three things: a fixed seed with variable structural tokens; a kernel map computed to classify collapse versus non-collapse; and a comparison with FID, IS, and CLIP metrics to confirm orthogonality. The expected result is near-zero correlation between kernel dimensions and semantic metrics, which would confirm that the kernel measures a genuinely independent axis of model behavior.
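The orthogonality check in this setup amounts to a cross-correlation table between kernel dimensions and semantic scores. The sketch below shows one way to compute it with Pearson correlation. It assumes per-sample scores have already been collected into two arrays, and the function name cross_correlation is hypothetical. It produces no results of its own; the near-zero expectation stated above is a hypothesis to be tested on real evaluation runs.

```python
import numpy as np

def cross_correlation(kernel_scores, semantic_scores):
    """Pearson r between every kernel column and every semantic-metric column.

    kernel_scores:   shape (n_samples, n_kernel_dims), e.g. columns for Delta x, rv, rho_r, mu
    semantic_scores: shape (n_samples, n_metrics), e.g. per-sample CLIP similarity
    Returns an array of shape (n_kernel_dims, n_metrics).
    Note: a constant column has zero variance and yields NaN.
    """
    k = np.asarray(kernel_scores, dtype=float)
    s = np.asarray(semantic_scores, dtype=float)
    kz = (k - k.mean(axis=0)) / k.std(axis=0)
    sz = (s - s.mean(axis=0)) / s.std(axis=0)
    return kz.T @ sz / len(k)
```

Because FID, IS, and KID are set-level rather than per-image scores, applying this to them would require scoring batches (for example one batch per structural token) and treating each batch as a sample.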

D. Field Diagram (Verbal Description)

The kernel's primary collapse space is visualized as a 2D map with Delta x offset on the horizontal axis and rv void ratio on the vertical. Four attractor basins are visible:

  • B0: Centered, low void. The collapse basin. RCP-positive morphology. This is where most generative model outputs cluster.

  • B1: Right-pulled, moderate void. Lateral displacement with moderate emptiness. Indicates asymmetric compositional intent.

  • B2: Left-pulled, dense void. Counter-lateral with compressed texture regions.

  • B3: High-void, high-displacement. Unstable and expressive. The domain of structural risk-taking.

The RCP basin is defined as B0 with sharply inward xp vectors. A model exhibiting RCP+ morphology will concentrate its outputs in B0 regardless of prompt variation: seed-stable, structurally invariant, and geometrically collapsed.
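The basin map can be turned into a rough classifier over (Delta x, rv) pairs. The sketch below is illustrative only: every threshold is a hypothetical placeholder, since the source gives no numeric basin boundaries, and B1 and B2 are separated purely by the sign of Delta x, which simplifies the verbal description. Points that are centered but not void-starved fall outside the four named basins and are reported as "unassigned".

```python
def classify_basin(dx, rv, center_tol=0.1, low_void=0.2, high_void=0.6, high_disp=0.3):
    """Assign a (Delta x, rv) point to one of the four basins described above.

    All thresholds are hypothetical defaults, not values taken from the source.
    """
    if abs(dx) < center_tol:
        return "B0" if rv < low_void else "unassigned"
    if rv >= high_void and abs(dx) >= high_disp:
        return "B3"  # high void, high displacement
    return "B1" if dx > 0 else "B2"  # right-pulled vs left-pulled

def collapse_share(points, **thresholds):
    """Fraction of (dx, rv) points that land in the collapse basin B0."""
    labels = [classify_basin(dx, rv, **thresholds) for dx, rv in points]
    return labels.count("B0") / len(labels) if labels else 0.0
```

Under this sketch, a collapse share that stays high while prompts and seeds vary would correspond to the concentration in B0 described above.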

E. Integration with Existing Metrics

The kernel is designed for combination, not substitution. The intended integrations are: FID realism scaled by geometric deviation; IS class diversity weighted by spatial novelty; CLIP prompt faithfulness intersected with compositional correctness; GenEval object relations extended by field structure measurement. Each pairing turns a one-dimensional semantic score into a two-dimensional evaluation covering both what the model renders and how it organizes space.

F. Minimal Illustrative Example

Consider two images with identical objects and identical CLIP scores. Image A places the subject centered with radial falloff, RCP-positive. Image B places the subject off-axis with distributed voids, RCP-negative. IS, FID, and CLIP treat them as equivalent. The kernel separates them instantly. This is the measurement gap the framework addresses: semantic compliance is measured by everyone; geometric behavior is measured only by the kernel.
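The two-image comparison can be reproduced with synthetic frames. The sketch below builds the same square "subject" twice, once centered and once off-axis, and uses an identical sorted pixel list as a crude stand-in for any measure that sees content but not position. That stand-in is an assumption of this sketch and is not a claim about how CLIP or FID are computed. The helper names make_frame and centroid_offset are hypothetical.

```python
import numpy as np

def make_frame(top, left, size=64, blob=16):
    """A blank square frame with one bright square subject at (top, left)."""
    frame = np.zeros((size, size))
    frame[top:top + blob, left:left + blob] = 1.0
    return frame

def centroid_offset(mass):
    """(dx, dy) of the mass centroid from the frame center, as a fraction of frame size."""
    h, w = mass.shape
    ys, xs = np.mgrid[0:h, 0:w]
    total = mass.sum()
    dx = ((xs * mass).sum() / total - (w - 1) / 2) / w
    dy = ((ys * mass).sum() / total - (h - 1) / 2) / h
    return float(dx), float(dy)

image_a = make_frame(24, 24)  # subject centered
image_b = make_frame(40, 4)   # same subject, pushed low and left

# Position-blind view: both frames contain exactly the same pixel values.
same_content = np.array_equal(np.sort(image_a.ravel()), np.sort(image_b.ravel()))

dx_a, dy_a = centroid_offset(image_a)
dx_b, dy_b = centroid_offset(image_b)
print(same_content, (dx_a, dy_a), (dx_b, dy_b))
```

The position-blind comparison reports the frames as identical, while the offsets differ: Image A sits at the center and Image B is displaced on both axes.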

G. Implementation Notes

The kernel works on any model output: diffusion, video diffusion, GAN, or autoregressive. It requires no training data, no model weights, and no architectural access. It operates entirely on rendered outputs and is model-agnostic by design. Runtime is O(n) over image segments and can be batched for large-scale evaluation runs.

Authorship

This system was developed independently as a practitioner's tool. It does not build directly on institutional research or published critique systems but acknowledges adjacent dialogues in generative art, computational aesthetics, and perceptual theory.

© 2025 Russell Parrish / A.rtist I.nfluencer.

All rights reserved. No part of this system, visual material, or accompanying documents may be reproduced, distributed, or transmitted in any form or by any means, including AI training datasets, without explicit written permission from the creator. A.rtist I.nfluencer and all associated frameworks, critique systems, and visual outputs are protected as original intellectual property.

ORCID: 0009-0008-9781-7995