Spatial Reasoning Failures & Diagnostics
Failure Morphologies
A Generative Field Framework for Measuring Spatial Priors in Image Models (Researcher’s Edition)
This introduces a minimal six-primitive kernel (Δx, rᵥ, ρᵣ, μ, θ, ds) that measures spatial priors in image generation models: the geometric biases governing composition, placement, and structural stability that current semantic metrics cannot detect. Cross-engine testing reveals stable compositional fingerprints: GPT exhibits right-bias with high void, MidJourney shows left-weighted compression, and each model occupies a distinct region in a measurable spatial field. The framework detects structural instability 3-4 inference steps before visible semantic collapse, with perturbation experiments revealing characteristic drift patterns such as Sora's distinctive collapse→restoration→re-collapse sequence as it seeks its original spatial prior. These measurements extend existing evaluation frameworks into the spatial domain, with applications in version regression detection, early failure identification, model comparison, and compositional steering. All primitives are reproducible from explicit geometric coordinates, with validation across generation systems. Github Folder
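As a rough sketch of how two of the kernel's primitives can be computed from pixels alone, the snippet below derives a placement offset (Δx) and a void ratio (rᵥ) from gradient mass. The 10%-of-max threshold and the [-0.5, 0.5] normalisation are illustrative assumptions, not the kernel's canonical definitions:

```python
import numpy as np

def gradient_mass(img: np.ndarray) -> np.ndarray:
    """Gradient magnitude |grad I| as a proxy for visual mass."""
    gy, gx = np.gradient(img.astype(float))
    return np.hypot(gx, gy)

def delta_x(img: np.ndarray) -> float:
    """Horizontal centroid offset of gradient mass in [-0.5, 0.5];
    0 means the mass centroid sits on the vertical midline."""
    m = gradient_mass(img)
    total = m.sum()
    if total == 0:
        return 0.0
    cols = np.arange(img.shape[1])
    cx = (m.sum(axis=0) * cols).sum() / total
    return cx / (img.shape[1] - 1) - 0.5

def void_ratio(img: np.ndarray, frac: float = 0.1) -> float:
    """r_v: fraction of pixels whose gradient mass falls below `frac`
    of the maximum -- an assumed stand-in for the adaptive threshold."""
    m = gradient_mass(img)
    if m.max() == 0:
        return 1.0
    return float((m < frac * m.max()).mean())

# Synthetic test frame: a bright square in the right half.
img = np.zeros((64, 64))
img[24:40, 40:56] = 1.0
print(round(delta_x(img), 3), round(void_ratio(img), 3))
```

On this synthetic frame the centroid lands right of centre (positive Δx) and most of the frame registers as void, matching the qualitative fingerprints described above.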
Precise Mapping: Kernel Metrics → Gradient-Field Operations
This document defines a deterministic way to measure spatial priors in images by mapping the Visual Lens’s seven compositional kernel metrics (Δx, rᵥ, ρᵣ, μ, xₚ, θ, ds) directly onto operations over the gradient field |∇I|. Instead of looking at semantics or model internals, the document treats every image as a force field of mass, void, and pull, extracted via Sobel gradients, adaptive masks, skeletons, and ridge structures. Each metric is given both a compositional intuition (what it “means” in a picture) and a precise gradient-based definition (how it is computed), making the system reproducible and model-agnostic. Together, these metrics expose inductive biases and forbidden zones in composition space: regions models prefer, avoid, or snap back to, providing an instrument for fingerprinting, comparing, and stress-testing generative image systems. Github Folder
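To illustrate the mask-based side of the mapping, here is a minimal sketch of two further metrics computed over a binary mass mask of |∇I|: a packing density (ρᵣ) and a cohesion score (μ). The threshold, bounding-box, and RMS-spread definitions are assumptions standing in for the document's precise gradient-based definitions:

```python
import numpy as np

def mass_mask(img: np.ndarray, frac: float = 0.1) -> np.ndarray:
    """Binary mass mask from |grad I|; `frac`-of-max is an assumed threshold."""
    gy, gx = np.gradient(img.astype(float))
    m = np.hypot(gx, gy)
    if m.max() == 0:
        return np.zeros_like(m, dtype=bool)
    return m >= frac * m.max()

def packing_density(img: np.ndarray) -> float:
    """rho_r: mass pixels per unit area of their own bounding box."""
    mask = mass_mask(img)
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:
        return 0.0
    box = (ys.ptp() + 1) * (xs.ptp() + 1)
    return float(mask.sum() / box)

def cohesion(img: np.ndarray) -> float:
    """mu: 1 minus the normalised RMS distance of mass pixels from
    their centroid (values near 1 = a single tight cluster)."""
    mask = mass_mask(img)
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:
        return 0.0
    cy, cx = ys.mean(), xs.mean()
    rms = np.sqrt(((ys - cy) ** 2 + (xs - cx) ** 2).mean())
    half_diag = np.hypot(*mask.shape) / 2
    return float(1.0 - rms / half_diag)

# One tight centred cluster: high cohesion, moderate packing.
tight = np.zeros((64, 64))
tight[28:36, 28:36] = 1.0
print(round(cohesion(tight), 3), round(packing_density(tight), 3))
```

Scattering the same mass into two far-apart clusters drops μ sharply while leaving the per-cluster structure unchanged, which is exactly the kind of difference feature-space metrics cannot see.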
A Diagnostic Taxonomy for Spatial Reasoning Failures in Generative Image Models
Generative image models often produce visually fluent scenes that nonetheless fail in spatial reasoning. This study presents a diagnostic taxonomy and metric suite for identifying seven recurrent morphologies of collapse. Each morphology pairs a visible symptom with a measurable signature that appears when geometric reasoning plateaus or drifts. The resulting field grid and tagging protocol convert aesthetic irregularities into reproducible diagnostics, revealing how image models substitute correlation for causality or abandon volumetric logic under constraint. By integrating metric behavior with morphological observation, this framework bridges computational evaluation and visual analysis, offering a method to read model cognition.
Radial Collapse: A Visual Prior + Kernel-Mapped Failure Mode
Generative AI models exhibit systematic compositional bias toward radial density distributions that collapse multi-force spatial negotiation into single-attractor equilibrium. We formalize this as Radial Collapse Prior (RCP), demonstrate visual detection via concentric overlay analysis, and provide quantitative measurement via compositional primitives (Δx, rᵥ, ρᵣ, μ, xₚ). Analysis of 200+ AI-generated images shows 78% high RCP conformity vs. 31% in human-composed artworks. The study presents detection rules, routing interventions, and profile-aware tolerance frameworks within the Visual Thinking Lens protocol.
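A concentric-ring measurement in the spirit of the overlay analysis can be sketched as follows. The ring count and the "inner rings" cutoff are illustrative assumptions, not the exact RCP procedure:

```python
import numpy as np

def radial_profile(img: np.ndarray, nbins: int = 8) -> np.ndarray:
    """Share of gradient mass falling in `nbins` concentric rings
    around the frame centre (profile sums to 1 for non-blank images)."""
    gy, gx = np.gradient(img.astype(float))
    m = np.hypot(gx, gy)
    h, w = img.shape
    yy, xx = np.mgrid[:h, :w]
    r = np.hypot(yy - (h - 1) / 2, xx - (w - 1) / 2)
    r = r / r.max()
    bins = np.minimum((r * nbins).astype(int), nbins - 1)
    prof = np.bincount(bins.ravel(), weights=m.ravel(), minlength=nbins)
    return prof / prof.sum() if prof.sum() else prof

def rcp_score(img: np.ndarray, inner: int = 3, nbins: int = 8) -> float:
    """Assumed RCP-conformity proxy: fraction of gradient mass
    inside the innermost rings (a single-attractor composition
    concentrates nearly all mass here)."""
    return float(radial_profile(img, nbins)[:inner].sum())

# Centred mass scores high; the same mass pushed to a corner scores low.
centered = np.zeros((64, 64))
centered[28:36, 28:36] = 1.0
corner = np.zeros((64, 64))
corner[2:10, 2:10] = 1.0
print(round(rcp_score(centered), 3), round(rcp_score(corner), 3))
```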
Playbook: Why Structurally-Informed Prompts Produce Stable Images
Moving images outside trained priors creates structural instability that frequently triggers collapse. This playbook documents how structurally-informed prompts produce stable AI-generated images by aligning with how diffusion models actually generate content, through latent fields and geometric constraints rather than semantic object descriptions. It demonstrates that prompts containing spatial operators (offset, void control, directional forces) function as inference-time regularizers that prevent radial collapse and maintain compositional integrity. The framework translates abstract geometric primitives (Δx, rᵥ, ρᵣ, μ, xₚ) into natural language prompting strategies, providing both theoretical explanation and practical paired examples showing collapse vs. anti-collapse outputs across multiple image generation platforms.
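The translation from geometric primitives to prompt language can be sketched as a lookup from target values to spatial phrases. The thresholds and wording below are hypothetical illustrations, not the playbook's canonical operators:

```python
def spatial_prompt(subject: str, dx: float = 0.0, rv: float = 0.5) -> str:
    """Append spatial-operator phrases to a subject description.
    `dx` is a target horizontal offset in [-0.5, 0.5]; `rv` is a
    target void ratio in [0, 1]. Both thresholds are assumptions."""
    parts = [subject]
    if dx > 0.1:
        parts.append("placed well right of centre, heavy negative space on the left")
    elif dx < -0.1:
        parts.append("placed well left of centre, heavy negative space on the right")
    if rv > 0.7:
        parts.append("minimalist composition, most of the frame left empty")
    elif rv < 0.3:
        parts.append("dense composition filling the frame edge to edge")
    return ", ".join(parts)

print(spatial_prompt("a lighthouse", dx=0.3, rv=0.8))
```

The point of such a mapping is that the appended phrases act as soft constraints at inference time: they bias the latent field toward an off-centre, high-void configuration rather than naming additional objects.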
Why Existing Metrics Fail to Measure Composition or Detect Spatial Priors - and What the Kernel Adds
Current generative model evaluation relies on metrics that measure semantic correctness (CLIPScore, T2I-CompBench, GenEval) and feature-space realism (FID, IS, KID), but none measure compositional geometry. This creates a critical blind spot: models can achieve excellent benchmark scores while exhibiting severe spatial biases, central placement defaults, various forms of collapse, void compression, and symmetry lock. These structural failures are invisible to existing metrics because they operate in feature space or semantic alignment space, not compositional space. This paper introduces a kernel-based measurement framework that directly quantifies composition and spatial priors through geometric primitives: Δx (placement offset), rᵥ (void ratio), ρᵣ (packing density), μ (cohesion), θ (orientation stability), ds (structural thickness), and the derived field invariant xₚ (peripheral pull). The kernel complements existing metrics by measuring how models compose, not just what they generate, closing a fundamental gap in generative model evaluation.
The Spatial Blind Spot in Generative Model Evaluation (Field Explainer)
Current generative model evaluation measures semantic correctness and feature realism but ignores spatial organization, where mass is placed, how void functions, and which geometric priors dominate composition. This document introduces a five-primitive kernel (Δx, rᵥ, ρᵣ, μ, xₚ) that measures compositional geometry, exposing biases like Radial Collapse Prior that existing metrics cannot detect. The kernel augments rather than replaces FID/CLIP/T2I-CompBench, adding the missing spatial dimension to evaluation infrastructure. Three figure studies demonstrate how models can achieve identical semantic scores while exhibiting fundamentally different compositional sophistication, differences the kernel quantifies but feature-space metrics miss entirely.
Mass, Not Subject: Reading AI-Generated Images Through Gradient Fields
A companion guide to Semantic Diversity Masks Geometric Uniformity. This guide bridges the perception gap between what viewers see (semantic objects) and what AI systems generate (gradient-field structures). When most viewers look at AI images, they see butterflies, businessmen, or cathedrals: apparently very different subjects. When artists and gradient analysis look at the same images, they see structural mass: edges, contrast boundaries, and visual weight distribution that often resolves to nearly identical compositional templates regardless of subject matter. We explain how diffusion models establish spatial structure in early generation steps (before semantic content appears), why this creates radial, center-weighted compositions, and how seven kernel metrics translate between artistic perception and computational measurement. Understanding "mass, not subject" is essential for evaluating the compositional constraints documented in the main research.
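The "mass, not subject" claim can be made operational by comparing downsampled gradient-mass maps: two images with different subjects but the same compositional template produce similar signatures. A minimal sketch, where the grid size and the cosine measure are assumptions of this illustration:

```python
import numpy as np

def composition_signature(img: np.ndarray, grid: int = 8) -> np.ndarray:
    """Coarse map of gradient mass: the structural weight a viewer
    perceives beneath the semantic content."""
    gy, gx = np.gradient(img.astype(float))
    m = np.hypot(gx, gy)
    h, w = img.shape
    m = m[: h - h % grid, : w - w % grid]
    sig = m.reshape(grid, m.shape[0] // grid, grid, m.shape[1] // grid).sum(axis=(1, 3))
    return sig / sig.sum() if sig.sum() else sig

def template_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two compositional signatures."""
    sa = composition_signature(a).ravel()
    sb = composition_signature(b).ravel()
    denom = np.linalg.norm(sa) * np.linalg.norm(sb)
    return float(sa @ sb / denom) if denom else 0.0

# Two different "subjects" (square vs. diamond), same centred placement.
a = np.zeros((64, 64)); a[24:40, 24:40] = 1.0
yy, xx = np.mgrid[:64, :64]
b = np.zeros((64, 64)); b[np.abs(yy - 31.5) + np.abs(xx - 31.5) < 10] = 1.0
print(round(template_similarity(a, b), 3))
```

Despite depicting different shapes, the two centred frames share a signature; moving one shape to a corner breaks the similarity, which is the distinction the companion study measures at scale.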
Semantic Diversity Masks Geometric Uniformity: Compositional Monoculture in MidJourney
This effort analyzed 400 MidJourney v7 images across 100 semantically diverse prompts and found that despite generating butterflies, businessmen, cathedrals, and fractals, the model uses only 34% of available horizontal space (Δx = 0.005 ± 0.044) and maintains 80-96% void regardless of subject (rᵥ = 0.850 ± 0.034). Using VTL kernel metrics, seven gradient-field measurements that quantify compositional structure independent of semantic content, we expose forbidden zones where the model systematically avoids compositional strategies fundamental to artistic practice, including rule of thirds, extreme asymmetry, and deliberate minimalism. Current evaluation metrics (CLIP, FID) measure semantic accuracy and distributional similarity but remain completely blind to these geometric constraints, allowing a model to score perfectly while exhibiting severe compositional monoculture. We provide the first model-agnostic framework for measuring compositional structure as an independent evaluation dimension, demonstrating that different subjects resolve to identical geometric skeletons beneath their semantic variety.
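Under the reported statistics (Δx = 0.005 ± 0.044), a rough Gaussian model shows how little probability mass ever reaches a rule-of-thirds placement. The band width around the thirds offset and the Gaussian assumption itself are illustrative; only the mean and standard deviation come from the study:

```python
import math

def zone_occupancy(mean: float, std: float, lo: float, hi: float) -> float:
    """Probability that |dx| falls in [lo, hi] under Normal(mean, std),
    counting both the left and right sides of the frame."""
    def cdf(x: float) -> float:
        return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))
    return (cdf(hi) - cdf(lo)) + (cdf(-lo) - cdf(-hi))

# Rule-of-thirds band assumed here as |dx| within 0.02 of 1/6.
p = zone_occupancy(0.005, 0.044, 1 / 6 - 0.02, 1 / 6 + 0.02)
print(f"{p:.4%}")
```

The result is a vanishingly small occupancy of the thirds band, consistent with the "forbidden zone" framing: the model's spread is so narrow that classical off-centre placements are effectively unreachable.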
Measuring Compositional Collapse: MidJourney’s Geometric Monoculture
This presents quantitative evidence of severe compositional collapse in AI image generation using Radial Compliance Analysis (RCA-2), a novel geometric measurement framework. Analyzing 400 MidJourney outputs across 100 semantically diverse prompts, the analysis finds extreme geometric uniformity despite semantic variation: mass radial compliance shows a coefficient of variation of just 5.46%, representing 3-6× compression relative to expected photographic diversity, with mass centroids clustering within 9.5% of frame center and void ratio locked at 87% regardless of prompt content. This monoculture emerges from optimization dynamics: preference optimization rewards centering, denoising stabilizes around radial configurations, and computational efficiency favors symmetric patterns, as evidenced by "repair mechanisms" where explicit off-center requests yield only modest displacement. These findings demonstrate that models can achieve high scores on existing semantic benchmarks while operating within a catastrophically narrow compositional subspace, representing a measurable reduction in model expressiveness driven by preference tuning rather than training data composition. Github Folder
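The headline compression figures can be reproduced mechanically from per-image compliance scores. The photographic-baseline CV used below is an assumed figure for illustration; only the 5.46% model CV comes from the study:

```python
import statistics

def coefficient_of_variation(xs: list[float]) -> float:
    """CV = population standard deviation / mean, as a percentage."""
    return 100.0 * statistics.pstdev(xs) / statistics.mean(xs)

def compression_factor(model_cv: float, baseline_cv: float) -> float:
    """How many times narrower the model's spread is than the baseline's."""
    return baseline_cv / model_cv

# Reported model CV of 5.46% against an assumed photographic baseline
# of ~20% lands inside the study's 3-6x compression range.
print(round(compression_factor(5.46, 20.0), 2))
```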
Measuring Compositional Collapse: Sora’s Geometric Monoculture
This continues the documentation of compositional collapse in AI image generation models using Radial Compliance Analysis (RCA-2), a geometric measurement framework. This time analyzing 200 Sora outputs and comparing against 400 MidJourney images across 100 semantically diverse prompts, we find similar geometric uniformity despite semantic variation: Sora exhibits mass radial compliance with a coefficient of variation of just 4.09%, representing 90% compression relative to expected photographic diversity, with mass centroids clustering within 5.3% of frame center regardless of prompt content. Cross-model validation reveals both systems converge on identical geometric attractors (RCS ≈ 0.63, Δr ≈ 0.05, void ratio ≈ 0.82), with Sora showing 25% tighter constraints than MidJourney's architecture. These findings reveal a measurable reduction in model expressiveness that intensifies with architectural sophistication. Github Folder