Spatial Reasoning Failures & Diagnostics
Failure Morphologies
A Generative Field Framework for Measuring Spatial Priors in Image Models (Researcher’s Edition)
This introduces a minimal six-primitive kernel (Δx, rᵥ, ρᵣ, μ, θ, ds) that measures spatial priors in image generation models: the geometric biases governing composition, placement, and structural stability that current semantic metrics cannot detect. Cross-engine testing reveals stable compositional fingerprints: GPT exhibits right-bias with high void, MidJourney shows left-weighted compression, and each model occupies a distinct region in a measurable spatial field. The framework detects structural instability 3-4 inference steps before visible semantic collapse, with perturbation experiments revealing characteristic drift patterns such as Sora's distinctive collapse→restoration→re-collapse sequence as it seeks its original spatial prior. These measurements extend existing evaluation frameworks into the spatial domain, with applications in version regression detection, early failure identification, model comparison, and compositional steering. All primitives are reproducible from explicit geometric coordinates, with validation across generation systems. Github Folder
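As a rough sketch of how two of the kernel's primitives can be computed from pixels alone, the snippet below derives a placement offset (Δx) and a void ratio (rᵥ) from gradient mass. The 10%-of-max threshold and the [-0.5, 0.5] normalisation are illustrative assumptions, not the kernel's canonical definitions:

```python
import numpy as np

def gradient_mass(img: np.ndarray) -> np.ndarray:
    """Gradient magnitude |grad I| as a proxy for visual mass."""
    gy, gx = np.gradient(img.astype(float))
    return np.hypot(gx, gy)

def delta_x(img: np.ndarray) -> float:
    """Horizontal centroid offset of gradient mass in [-0.5, 0.5];
    0 means the mass centroid sits on the vertical midline."""
    m = gradient_mass(img)
    total = m.sum()
    if total == 0:
        return 0.0
    cols = np.arange(img.shape[1])
    cx = (m.sum(axis=0) * cols).sum() / total
    return cx / (img.shape[1] - 1) - 0.5

def void_ratio(img: np.ndarray, frac: float = 0.1) -> float:
    """r_v: fraction of pixels whose gradient mass falls below `frac`
    of the maximum -- an assumed stand-in for the adaptive threshold."""
    m = gradient_mass(img)
    if m.max() == 0:
        return 1.0
    return float((m < frac * m.max()).mean())

# Synthetic test frame: a bright square in the right half.
img = np.zeros((64, 64))
img[24:40, 40:56] = 1.0
print(round(delta_x(img), 3), round(void_ratio(img), 3))
```

On this synthetic frame the centroid lands right of centre (positive Δx) and most of the frame registers as void, matching the qualitative fingerprints described above.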
Precise Mapping: Kernel Metrics → Gradient-Field Operations
This document defines a deterministic way to measure spatial priors in images by mapping the Visual Lens’s seven compositional kernel metrics (Δx, rᵥ, ρᵣ, μ, xₚ, θ, ds) directly onto operations over the gradient field |∇I|. Instead of looking at semantics or model internals, the document treats every image as a force field of mass, void, and pull, extracted via Sobel gradients, adaptive masks, skeletons, and ridge structures. Each metric is given both a compositional intuition (what it “means” in a picture) and a precise gradient-based definition (how it is computed), making the system reproducible and model-agnostic. Together, these metrics expose inductive biases and forbidden zones in composition space: regions models prefer, avoid, or snap back to, providing an instrument for fingerprinting, comparing, and stress-testing generative image systems. Github Folder
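To illustrate the mask-based side of the mapping, here is a minimal sketch of two further metrics computed over a binary mass mask of |∇I|: a packing density (ρᵣ) and a cohesion score (μ). The threshold, bounding-box, and RMS-spread definitions are assumptions standing in for the document's precise gradient-based definitions:

```python
import numpy as np

def mass_mask(img: np.ndarray, frac: float = 0.1) -> np.ndarray:
    """Binary mass mask from |grad I|; `frac`-of-max is an assumed threshold."""
    gy, gx = np.gradient(img.astype(float))
    m = np.hypot(gx, gy)
    if m.max() == 0:
        return np.zeros_like(m, dtype=bool)
    return m >= frac * m.max()

def packing_density(img: np.ndarray) -> float:
    """rho_r: mass pixels per unit area of their own bounding box."""
    mask = mass_mask(img)
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:
        return 0.0
    box = (ys.ptp() + 1) * (xs.ptp() + 1)
    return float(mask.sum() / box)

def cohesion(img: np.ndarray) -> float:
    """mu: 1 minus the normalised RMS distance of mass pixels from
    their centroid (values near 1 = a single tight cluster)."""
    mask = mass_mask(img)
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:
        return 0.0
    cy, cx = ys.mean(), xs.mean()
    rms = np.sqrt(((ys - cy) ** 2 + (xs - cx) ** 2).mean())
    half_diag = np.hypot(*mask.shape) / 2
    return float(1.0 - rms / half_diag)

# One tight centred cluster: high cohesion, moderate packing.
tight = np.zeros((64, 64))
tight[28:36, 28:36] = 1.0
print(round(cohesion(tight), 3), round(packing_density(tight), 3))
```

Scattering the same mass into two far-apart clusters drops μ sharply while leaving the per-cluster structure unchanged, which is exactly the kind of difference feature-space metrics cannot see.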
A Diagnostic Taxonomy for Spatial Reasoning Failures in Generative Image Models
Generative image models often produce visually fluent scenes that nonetheless fail in spatial reasoning. This study presents a diagnostic taxonomy and metric suite for identifying seven recurrent morphologies of collapse. Each morphology pairs a visible symptom with a measurable signature that appears when geometric reasoning plateaus or drifts. The resulting field grid and tagging protocol convert aesthetic irregularities into reproducible diagnostics, revealing how image models substitute correlation for causality or abandon volumetric logic under constraint. By integrating metric behavior with morphological observation, this framework bridges computational evaluation and visual analysis, offering a method to read model cognition.
Radial Collapse: A Visual Prior + Kernel-Mapped Failure Mode
Generative AI models exhibit systematic compositional bias toward radial density distributions that collapse multi-force spatial negotiation into single-attractor equilibrium. We formalize this as Radial Collapse Prior (RCP), demonstrate visual detection via concentric overlay analysis, and provide quantitative measurement via compositional primitives (Δx, rᵥ, ρᵣ, μ, xₚ). Analysis of 200+ AI-generated images shows 78% high RCP conformity vs. 31% in human-composed artworks. The study presents detection rules, routing interventions, and profile-aware tolerance frameworks within the Visual Thinking Lens protocol.
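A concentric-ring measurement in the spirit of the overlay analysis can be sketched as follows. The ring count and the "inner rings" cutoff are illustrative assumptions, not the exact RCP procedure:

```python
import numpy as np

def radial_profile(img: np.ndarray, nbins: int = 8) -> np.ndarray:
    """Share of gradient mass falling in `nbins` concentric rings
    around the frame centre (profile sums to 1 for non-blank images)."""
    gy, gx = np.gradient(img.astype(float))
    m = np.hypot(gx, gy)
    h, w = img.shape
    yy, xx = np.mgrid[:h, :w]
    r = np.hypot(yy - (h - 1) / 2, xx - (w - 1) / 2)
    r = r / r.max()
    bins = np.minimum((r * nbins).astype(int), nbins - 1)
    prof = np.bincount(bins.ravel(), weights=m.ravel(), minlength=nbins)
    return prof / prof.sum() if prof.sum() else prof

def rcp_score(img: np.ndarray, inner: int = 3, nbins: int = 8) -> float:
    """Assumed RCP-conformity proxy: fraction of gradient mass
    inside the innermost rings (a single-attractor composition
    concentrates nearly all mass here)."""
    return float(radial_profile(img, nbins)[:inner].sum())

# Centred mass scores high; the same mass pushed to a corner scores low.
centered = np.zeros((64, 64))
centered[28:36, 28:36] = 1.0
corner = np.zeros((64, 64))
corner[2:10, 2:10] = 1.0
print(round(rcp_score(centered), 3), round(rcp_score(corner), 3))
```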
Playbook: Why Structurally-Informed Prompts Produce Stable Images
Moving images outside trained priors creates structural instability that frequently triggers collapse. This playbook documents how structurally-informed prompts produce stable AI-generated images by aligning with how diffusion models actually generate content, through latent fields and geometric constraints rather than semantic object descriptions. It demonstrates that prompts containing spatial operators (offset, void control, directional forces) function as inference-time regularizers that prevent radial collapse and maintain compositional integrity. The framework translates abstract geometric primitives (Δx, rᵥ, ρᵣ, μ, xₚ) into natural language prompting strategies, providing both theoretical explanation and practical paired examples showing collapse vs. anti-collapse outputs across multiple image generation platforms.
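The translation from geometric primitives to prompt language can be sketched as a lookup from target values to spatial phrases. The thresholds and wording below are hypothetical illustrations, not the playbook's canonical operators:

```python
def spatial_prompt(subject: str, dx: float = 0.0, rv: float = 0.5) -> str:
    """Append spatial-operator phrases to a subject description.
    `dx` is a target horizontal offset in [-0.5, 0.5]; `rv` is a
    target void ratio in [0, 1]. Both thresholds are assumptions."""
    parts = [subject]
    if dx > 0.1:
        parts.append("placed well right of centre, heavy negative space on the left")
    elif dx < -0.1:
        parts.append("placed well left of centre, heavy negative space on the right")
    if rv > 0.7:
        parts.append("minimalist composition, most of the frame left empty")
    elif rv < 0.3:
        parts.append("dense composition filling the frame edge to edge")
    return ", ".join(parts)

print(spatial_prompt("a lighthouse", dx=0.3, rv=0.8))
```

The point of such a mapping is that the appended phrases act as soft constraints at inference time: they bias the latent field toward an off-centre, high-void configuration rather than naming additional objects.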
Why Existing Metrics Fail to Measure Composition or Detect Spatial Priors - and What the Kernel Adds
Current generative model evaluation relies on metrics that measure semantic correctness (CLIPScore, T2I-CompBench, GenEval) and feature-space realism (FID, IS, KID), but none measure compositional geometry. This creates a critical blind spot: models can achieve excellent benchmark scores while exhibiting severe spatial biases, central placement defaults, various forms of collapse, void compression, and symmetry lock. These structural failures are invisible to existing metrics because they operate in feature space or semantic alignment space, not compositional space. This paper introduces a kernel-based measurement framework that directly quantifies composition and spatial priors through geometric primitives: Δx (placement offset), rᵥ (void ratio), ρᵣ (packing density), μ (cohesion), θ (orientation stability), ds (structural thickness), and the derived field invariant xₚ (peripheral pull). The kernel complements existing metrics by measuring how models compose, not just what they generate, closing a fundamental gap in generative model evaluation.
The Spatial Blind Spot in Generative Model Evaluation (Field Explainer)
Current generative model evaluation measures semantic correctness and feature realism but ignores spatial organization, where mass is placed, how void functions, and which geometric priors dominate composition. This document introduces a five-primitive kernel (Δx, rᵥ, ρᵣ, μ, xₚ) that measures compositional geometry, exposing biases like Radial Collapse Prior that existing metrics cannot detect. The kernel augments rather than replaces FID/CLIP/T2I-CompBench, adding the missing spatial dimension to evaluation infrastructure. Three figure studies demonstrate how models can achieve identical semantic scores while exhibiting fundamentally different compositional sophistication, differences the kernel quantifies but feature-space metrics miss entirely.
Mass, Not Subject: Reading AI-Generated Images Through Gradient Fields
A companion guide to Semantic Diversity Masks Geometric Uniformity. This guide bridges the perception gap between what viewers see (semantic objects) and what AI systems generate (gradient-field structures). When most viewers look at AI images, they see butterflies, businessmen, or cathedrals: apparently very different subjects. When artists and gradient analysis look at the same images, they see structural mass: edges, contrast boundaries, and visual weight distribution that often resolves to nearly identical compositional templates regardless of subject matter. We explain how diffusion models establish spatial structure in early generation steps (before semantic content appears), why this creates radial, center-weighted compositions, and how seven kernel metrics translate between artistic perception and computational measurement. Understanding "mass, not subject" is essential for evaluating the compositional constraints documented in the main research.
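The "mass, not subject" claim can be made operational by comparing downsampled gradient-mass maps: two images with different subjects but the same compositional template produce similar signatures. A minimal sketch, where the grid size and the cosine measure are assumptions of this illustration:

```python
import numpy as np

def composition_signature(img: np.ndarray, grid: int = 8) -> np.ndarray:
    """Coarse map of gradient mass: the structural weight a viewer
    perceives beneath the semantic content."""
    gy, gx = np.gradient(img.astype(float))
    m = np.hypot(gx, gy)
    h, w = img.shape
    m = m[: h - h % grid, : w - w % grid]
    sig = m.reshape(grid, m.shape[0] // grid, grid, m.shape[1] // grid).sum(axis=(1, 3))
    return sig / sig.sum() if sig.sum() else sig

def template_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two compositional signatures."""
    sa = composition_signature(a).ravel()
    sb = composition_signature(b).ravel()
    denom = np.linalg.norm(sa) * np.linalg.norm(sb)
    return float(sa @ sb / denom) if denom else 0.0

# Two different "subjects" (square vs. diamond), same centred placement.
a = np.zeros((64, 64)); a[24:40, 24:40] = 1.0
yy, xx = np.mgrid[:64, :64]
b = np.zeros((64, 64)); b[np.abs(yy - 31.5) + np.abs(xx - 31.5) < 10] = 1.0
print(round(template_similarity(a, b), 3))
```

Despite depicting different shapes, the two centred frames share a signature; moving one shape to a corner breaks the similarity, which is the distinction the companion study measures at scale.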
Semantic Diversity Masks Geometric Uniformity: Compositional Monoculture in MidJourney
This effort analyzed 400 MidJourney v7 images across 100 semantically diverse prompts and found that despite generating butterflies, businessmen, cathedrals, and fractals, the model uses only 34% of available horizontal space (Δx = 0.005 ± 0.044) and maintains 80-96% void regardless of subject (rᵥ = 0.850 ± 0.034). Using VTL kernel metrics, seven gradient-field measurements that quantify compositional structure independent of semantic content, we expose forbidden zones where the model systematically avoids compositional strategies fundamental to artistic practice, including rule of thirds, extreme asymmetry, and deliberate minimalism. Current evaluation metrics (CLIP, FID) measure semantic accuracy and distributional similarity but remain completely blind to these geometric constraints, allowing a model to score perfectly while exhibiting severe compositional monoculture. We provide the first model-agnostic framework for measuring compositional structure as an independent evaluation dimension, demonstrating that different subjects resolve to identical geometric skeletons beneath their semantic variety.
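Under the reported statistics (Δx = 0.005 ± 0.044), a rough Gaussian model shows how little probability mass ever reaches a rule-of-thirds placement. The band width around the thirds offset and the Gaussian assumption itself are illustrative; only the mean and standard deviation come from the study:

```python
import math

def zone_occupancy(mean: float, std: float, lo: float, hi: float) -> float:
    """Probability that |dx| falls in [lo, hi] under Normal(mean, std),
    counting both the left and right sides of the frame."""
    def cdf(x: float) -> float:
        return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))
    return (cdf(hi) - cdf(lo)) + (cdf(-lo) - cdf(-hi))

# Rule-of-thirds band assumed here as |dx| within 0.02 of 1/6.
p = zone_occupancy(0.005, 0.044, 1 / 6 - 0.02, 1 / 6 + 0.02)
print(f"{p:.4%}")
```

The result is a vanishingly small occupancy of the thirds band, consistent with the "forbidden zone" framing: the model's spread is so narrow that classical off-centre placements are effectively unreachable.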
Measuring Compositional Collapse: MidJourney’s Geometric Monoculture
This presents quantitative evidence of severe compositional collapse in AI image generation using Radial Compliance Analysis (RCA-2), a novel geometric measurement framework. Analyzing 400 MidJourney outputs across 100 semantically diverse prompts, the analysis finds extreme geometric uniformity despite semantic variation: mass radial compliance shows a coefficient of variation of just 5.46%, representing 3-6× compression relative to expected photographic diversity, with mass centroids clustering within 9.5% of frame center and void ratio locked at 87% regardless of prompt content. This monoculture emerges from optimization dynamics: preference optimization rewards centering, denoising stabilizes around radial configurations, and computational efficiency favors symmetric patterns, as evidenced by "repair mechanisms" where explicit off-center requests yield only modest displacement. These findings demonstrate that models can achieve high scores on existing semantic benchmarks while operating within a catastrophically narrow compositional subspace, representing a measurable reduction in model expressiveness driven by preference tuning rather than training data composition. Github Folder
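The headline compression figures can be reproduced mechanically from per-image compliance scores. The photographic-baseline CV used below is an assumed figure for illustration; only the 5.46% model CV comes from the study:

```python
import statistics

def coefficient_of_variation(xs: list[float]) -> float:
    """CV = population standard deviation / mean, as a percentage."""
    return 100.0 * statistics.pstdev(xs) / statistics.mean(xs)

def compression_factor(model_cv: float, baseline_cv: float) -> float:
    """How many times narrower the model's spread is than the baseline's."""
    return baseline_cv / model_cv

# Reported model CV of 5.46% against an assumed photographic baseline
# of ~20% lands inside the study's 3-6x compression range.
print(round(compression_factor(5.46, 20.0), 2))
```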
Measuring Compositional Collapse: Sora’s Geometric Monoculture
This continues the documentation of compositional collapse in AI image generation models using Radial Compliance Analysis (RCA-2), a geometric measurement framework. This time analyzing 200 Sora outputs and comparing against 400 MidJourney images across 100 semantically diverse prompts, we find similar geometric uniformity despite semantic variation: Sora exhibits mass radial compliance with a coefficient of variation of just 4.09%, representing 90% compression relative to expected photographic diversity, with mass centroids clustering within 5.3% of frame center regardless of prompt content. Cross-model validation reveals both systems converge on identical geometric attractors (RCS ≈ 0.63, Δr ≈ 0.05, void ratio ≈ 0.82), with Sora showing 25% tighter constraints than MidJourney's architecture. These findings reveal a measurable reduction in model expressiveness that intensifies with architectural sophistication. Github Folder