Mass, Not Subject
Reading AI-Generated Images Through Gradient Fields
The thing you are looking at is not the image. The image is the mass.
A butterfly. A businessman. A cathedral. A fractal. Most people see subjects. Artists see structure. An image model generating all four is not exercising four compositional strategies — it is applying one template, dressing it in different semantics, and serving the result as variety. Look closer and the seams show: subjects centered, framed in implicit rectangles, angled just enough that each image feels distinct. Small deltas, carefully maintained, so the semantic surface survives inspection while the geometric substrate does not. And that is only what is visible without measurement.
This document bridges the gap between readings. It explains what mass means across three distinct vocabularies (artist, researcher, engineer) and how gradient-field analysis makes the same phenomenon visible and measurable regardless of which language you bring. The underlying claim is simple and empirically supported: semantic diversity does not necessarily produce compositional diversity.
This is a pre-read for Semantic Diversity Masks Geometric Uniformity, which documents the full measurement and analysis across 400 MidJourney images spanning 100 different prompts, where the compositional fingerprint barely moves. If you understand gradient fields and compositional geometry, skip there directly. If you want the vocabulary bridge first, read on.
Part 1: Mass
Mass does not mean subject. Mass means regions of rapid visual change: edges, contrast transitions, texture boundaries. The high-gradient zones that define forms, not the semantic objects those forms represent. When this document says 'the mass is centered,' it means the gradient-weighted centroid sits near the frame's barycenter. It does not necessarily mean the object(s) are centered. Those are different things, and the difference is the entire point.
The figure shows three still lifes. The semantic centers are the objects themselves, typically grouped. The structural mass centroid (kernel) marks where the gradient energy is. The Sobel operator finds edges, meaning transitions in pixel intensity, and the 85th percentile of that gradient field defines the structural mask. The centroid is the gradient-weighted center of those edge pixels. It answers: where is the structural activity in this image, the contours, texture edges, object boundaries, and surface transitions? It doesn't know what anything is. A highly textured background surface will pull the centroid just as hard as a foreground object. One might object that in the third image the mass centroid sits in the center even though all the objects are arranged around the edges. This is because the objects are spread roughly evenly, like numerals around a clock dial, and counterbalance each other, while the wood in that image remains finely detailed, so edge pixels live there too. Thus the centroid ends up in the center.
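A minimal sketch of that measurement, assuming NumPy and a 2-D grayscale array as input. The function names, the edge-padding choice, and the convention that offsets run from -0.5 to +0.5 with zero at frame center are assumptions made for illustration; the kernel's actual implementation may differ in details such as smoothing or resizing.

```python
import numpy as np

def sobel_magnitude(gray):
    """Gradient magnitude of a 2-D luminance array using 3x3 Sobel kernels."""
    g = np.pad(gray.astype(float), 1, mode="edge")
    # Horizontal derivative: right column minus left column (weights 1, 2, 1).
    gx = (g[:-2, 2:] + 2 * g[1:-1, 2:] + g[2:, 2:]
          - g[:-2, :-2] - 2 * g[1:-1, :-2] - g[2:, :-2])
    # Vertical derivative: bottom row minus top row (weights 1, 2, 1).
    gy = (g[2:, :-2] + 2 * g[2:, 1:-1] + g[2:, 2:]
          - g[:-2, :-2] - 2 * g[:-2, 1:-1] - g[:-2, 2:])
    return np.hypot(gx, gy)

def mass_centroid(gray, percentile=85.0):
    """Return (delta_x, delta_y), the gradient-weighted centroid offset.

    Assumed convention: -0.5 is the left/top edge, +0.5 the right/bottom edge.
    """
    mag = sobel_magnitude(gray)
    mask = mag >= np.percentile(mag, percentile)  # structural mask
    weights = np.where(mask, mag, 0.0)
    total = weights.sum()
    if total == 0:  # perfectly flat image: no mass anywhere
        return 0.0, 0.0
    h, w = gray.shape
    ys, xs = np.mgrid[0:h, 0:w]
    cx = (weights * xs).sum() / total
    cy = (weights * ys).sum() / total
    return cx / (w - 1) - 0.5, cy / (h - 1) - 0.5

# Synthetic test frame: one bright square placed left of center.
frame = np.zeros((64, 64))
frame[24:40, 8:24] = 1.0
delta_x, delta_y = mass_centroid(frame)
print(delta_x < 0, abs(delta_y) < 1e-6)
```

The square's outline is the only structure in the frame, so the centroid lands on the square and the horizontal offset is negative. The function never learns what the square is; it only sees where intensity changes.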
The saliency centroid (spectral residual) is where the image is locally surprising relative to its surroundings. The spectral residual method (Hou and Zhang, 2007) computes the difference between the image and a heavily blurred version of itself, then squares it. High values are pixels that contrast strongly against their local context, things that stand out from their background. The centroid is the weighted center of the top 20% of that map. It answers: where does the eye go, and what region is visually distinct from everything around it? The third image raises the same question here, and it can be read this way: with the objects distributed and the center empty, the empty table itself becomes the subject, and that is where the weighted average lands. There is no single grouping to claim a definitive subject.
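The procedure as described can be sketched in a few lines. This follows the description above (squared difference from a blurred copy, then the weighted center of the top 20%) rather than the Fourier-domain formulation of spectral residual. A box blur stands in for the heavy blur, and the blur radius and function names are assumptions made for illustration.

```python
import numpy as np

def box_blur(img, radius):
    """Mean filter over a (2*radius+1)^2 window, computed from an integral image."""
    k = 2 * radius + 1
    padded = np.pad(img.astype(float), radius, mode="edge")
    # Prepend a zero row and column so window sums are four-corner differences.
    integral = np.pad(padded.cumsum(axis=0).cumsum(axis=1), ((1, 0), (1, 0)))
    window_sum = (integral[k:, k:] - integral[:-k, k:]
                  - integral[k:, :-k] + integral[:-k, :-k])
    return window_sum / (k * k)

def saliency_centroid(gray, radius=8, top_fraction=0.20):
    """Return (delta_x, delta_y) of the local-contrast centroid.

    Assumed convention: -0.5 is the left/top edge, +0.5 the right/bottom edge.
    """
    contrast = (gray.astype(float) - box_blur(gray, radius)) ** 2
    cutoff = np.quantile(contrast, 1.0 - top_fraction)  # keep the top 20%
    weights = np.where(contrast >= cutoff, contrast, 0.0)
    total = weights.sum()
    if total == 0:  # flat image: nothing stands out
        return 0.0, 0.0
    h, w = gray.shape
    ys, xs = np.mgrid[0:h, 0:w]
    cx = (weights * xs).sum() / total
    cy = (weights * ys).sum() / total
    return cx / (w - 1) - 0.5, cy / (h - 1) - 0.5

# Synthetic test frame: a small bright patch in the lower right.
frame = np.zeros((64, 64))
frame[44:52, 40:48] = 1.0
delta_x, delta_y = saliency_centroid(frame)
print(delta_x > 0, delta_y > 0)
```

On a flat ground with one isolated patch, this centroid and the structural centroid agree, which is the co-located case. They separate only when texture and local contrast live in different places.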
What the gap between them tells you: when they're close, structure and visual attention are co-located; the thing to look at is also where the gradient energy is concentrated. That's the normal case for a single isolated object or a set of grouped objects on a ground.
When they diverge, something is happening. Common causes: a highly textured background pulling the structural centroid away from the visually prominent subject; a smooth-surfaced foreground object (low gradient, high contrast against background) pulling the saliency centroid away from the structural activity; or a multi-object scene where gradient mass and visual attention are genuinely distributed differently. In the still life on the right above, there is a higher level of gradient energy at the bottom of the bowl, along with a darkened lower quarter that nothing in the top half offsets, and this pulls saliency down. In other examples with more dramatic offsets, you can see how saliency often chases the mass.
Saliency is typically the vehicle of study in AI image analysis, frequently described as representing the "focal point" of model decision-making or human attention. In that usage, however, saliency is a weighted average of pixel gradients with respect to model outputs; it requires access to the network's internal activations and propagates backward through the decision pathway. It answers: what drove the model to this classification or generation choice? It is introspective. It is semantic. And it collapses to a point, the focal region that explains the decision.
The kernel, by contrast, calculates mass from the XY distribution of gradient magnitude in the rendered output itself: no model access required, no decision pathway, no backward propagation. It asks a different question: where does the visual weight of this image resolve, and how is that weight distributed across the full frame? Not a focal point but a field. Not where the model looked, but where the image's own physics settled.
The distinction matters in one precise way: saliency, by design, confirms the centering bias rather than measuring it. If you ask "where did the model attend?", you will find the center, because that is where semantic priority lives and where the radial attention prior concentrates energy simultaneously. The two signals are entangled. Saliency cannot separate them.
The kernel separates them by never asking about the model at all. Delta x = 0.005 is not a claim about model attention. It is a claim about where the gradient-weighted centroid of the final image resolves. Those are measurably different things, and the difference is what makes the kernel capable of detecting the compositional prior that saliency, by its own construction, is constitutionally unable to see.
Saliency maps the decision. The kernel maps the image, after the decision has already been made.
The kernel-saliency gap is not just a measurement artifact; it is a behavioral signature. When the two centroids are close, structure and attention are co-located: the image is organized around a single dominant mass, and the eye goes exactly where the gradient energy concentrates. When they diverge, something more complex is happening. But the gap itself, measured consistently across a corpus, tells you something about what a model was optimized for.
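One way to put a number on the gap, assuming it is taken as the Euclidean distance between the two centroids in frame-normalized coordinates. That definition, the function names, and the sample coordinates below are assumptions for illustration, not measurements from any corpus.

```python
import math

def centroid_gap(kernel_xy, saliency_xy):
    """Euclidean distance between two centroids in frame-normalized units."""
    return math.hypot(kernel_xy[0] - saliency_xy[0],
                      kernel_xy[1] - saliency_xy[1])

def mean_gap(pairs):
    """Average gap over a list of (kernel_xy, saliency_xy) pairs."""
    gaps = [centroid_gap(kernel, saliency) for kernel, saliency in pairs]
    return sum(gaps) / len(gaps)

# Hypothetical (kernel, saliency) centroid pairs, one per image.
pairs = [
    ((0.00, 0.10), (0.00, 0.12)),
    ((0.02, 0.15), (0.05, 0.19)),
]
print(round(mean_gap(pairs), 3))
```

A single image's gap describes that image. The corpus mean is the quantity that starts to describe the model.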
Firefly's outputs, across the tested corpus, showed a persistently low gap. This is arguably not because of compositional sophistication, but because the images are sparse edge fields with dominant focal clusters positioned in the lower portion of the frame. Structure and attention agree because both are pointing at the same isolated subject against a clean, low-gradient ground. This is a recognizable compositional grammar: editorial product photography, marketing hero shots, stock imagery built for text overlay in the upper register. The kernel-saliency alignment is not evidence of spatial range. It is evidence of a very specific, very narrow target use case, reproduced with high consistency.
Where MidJourney's signature is radial collapse toward center, Firefly's signature is focal-bottom with void above. Different attractor basin. Same fundamental constraint: one template, many subjects, no forbidden zones attempted.
As one can see, the object that draws the eye is also where the edges are, with no offset between the two. Within the tested corpus and without adversarial prompting, Firefly showed a tendency toward low placement.
Mass as Common Vocabulary
The three vocabularies describing this phenomenon converge on the same structure, which is why we study mass rather than semantics or aesthetics. It avoids the traps of judgment and holds along the whole path from artistic intent to engineering measurement. Mass can be an agreed truth in composition as it relates to the focal point or points within the spatial field.
In this context, mass can dress itself in a variety of artistic outputs, such as coherent structure, energetic states, statistical equilibrium, dispersion or surface texture. Structure survives semantics: occupancy, field, and centroid distributions remain.
For artists: mass is visual weight. Dark shapes, bright highlights, hard edges, textured areas that pull the eye. Not objects, but contrast boundaries. A standing figure's mass includes its shadow, the tonal relationships between figure and background, the edge where light meets dark. When an experienced painter says 'this image is centered,' they're describing where the optical weight resolves, not where the subject sits.
For researchers: mass is high-gradient regions in the luminance field. Areas where pixel values change rapidly: edges, contrast transitions, texture boundaries. This is the same phenomenon the artist perceives, described in spatial frequency terms.
For engineers: mass is pixels above the 75th-85th percentile of gradient magnitude after Sobel filtering. The specific locations where the model placed sharp transitions. Same structure, quantified.
Mass = regions of rapid visual change. Where the image has structure rather than emptiness. This is the skeleton, not the skin.
This shift from subject to mass is the only vocabulary move this document asks for. Everything else follows. When a kernel measurement says Delta x = -0.143 for a 'left-weighted' composition, it is not contradicting your eye. It is measuring something your eye was not tracking: not the subject, but the mass, which is the scaffold of the composition. An artist may make a picture of a woman reading, but they place it through mass rather than through the figure, because mass provides the structure, depth, and spatial stabilization (or destabilization) of any subject within its environment, whether the artist is working the surface or studying depth. Mass placement is almost always the underlying scaffolding of an image. The image is never just a woman reading; it is the window, the wall, the light, the shapes and movement of the scene.
As an example, returning to this image, the gradient-weighted centroid of that scene is actually closer to center than the figure's position suggests, because the book in her hands, the window frame, the light source, and the background gradients all pull mass rightward to counterbalance the figure. That is compositional sophistication, practiced here. And the kernel finds it in every image.
Part 2: Why Composition Sets Before Content
The most important thing to understand about diffusion-model generation is the sequence. Users see the final image. The compositional constraint was sealed much earlier.
Steps 1-10: The spatial prior activates. Pure noise collapses toward a rough layout. Transformer attention establishes where 'important content' belongs. This creates a radial attention gradient: center tokens have 360-degree context; edge tokens have 180-degree context. The gradient falloff from center to periphery is the first physics of the image. Delta x and rv lock in here, before any semantic content exists.
Steps 10-30: Semantic content populates the template. Text-prompt tokens activate learned associations. Butterfly, businessman, cathedral tokens fire and place content into the pre-existing attention field. They do not choose where to go. The template was chosen. Prompts influence content. They rarely override structure.
Steps 30-50: Details refine. Edges sharpen, colors settle, textures resolve. The image begins to 'look like' its subject. This is what users evaluate. It is the last thing to arrive and the only thing most metrics measure.
The architecture reinforces this prior at every level. Center tokens have maximum contextual access. Training data skews toward centered, balanced subjects. RLHF reward models prefer centered, readable compositions because evaluators rate them as 'good.' The architecture, training data, and fine-tuning all push toward the same attractor basin. This is not a bug. It is a learned equilibrium.
The model generates subjects that fit its compositional physics, not subjects that best match the semantic prompt. When prompted with 'grand cathedral interior,' it generates radial architecture because radial structures match the compositional template, not because cathedrals are inherently radial.
Part 3: The Kernel
Seven gradient-field primitives measure the spatial forces governing any image. They are model-agnostic, contrast-invariant, deterministic, and computationally fast: O(n) over image segments. Same image, same result, every time. No learned components.
Two findings from this table deserve particular attention. First: xp (peripheral pull) = 0.394 arises through fragmentation (low mu = 0.268), not through lateral displacement (Delta x = 0.005). The model creates the appearance of edge tension by scattering detail or strong gradients across the frame rather than by actually placing the subject off-center. This produces what artists describe as 'muddy edges': compositional energy at the boundary that does not resolve into deliberate placement. In AI output it arguably serves as an offset that keeps the image 'centered.' Existing metrics cannot detect this distinction.
Second: ds = 0.0167 indicates thin, filamentary structures throughout. The model is rendering simulated depth rather than volumetric form. This is quantified evidence of something artists observe intuitively: AI-generated figures lack mass in the dimensional sense. They are surface renditions. The gradient is present; the weight is not.
Part 4: What 400 Images Show
Horizontal Placement (Delta x)
Theoretical range: -0.5 (extreme left) to +0.5 (extreme right). Observed range: -0.146 to +0.191. 95% of images fall within +/-0.089 of center.
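A hedged sketch of how such a spread summary could be computed from a list of per-image Delta x values, assuming NumPy. The function names are illustrative, and the evenly spaced values below are synthetic stand-ins, not the measured corpus.

```python
import numpy as np

def observed_range(delta_x_values):
    """Smallest and largest horizontal offsets in the corpus."""
    values = np.asarray(delta_x_values, dtype=float)
    return float(values.min()), float(values.max())

def central_halfwidth(delta_x_values, coverage=0.95):
    """Half-width around zero that contains roughly `coverage` of the offsets."""
    magnitudes = np.abs(np.asarray(delta_x_values, dtype=float))
    return float(np.quantile(magnitudes, coverage))

# Synthetic offsets, evenly spaced between -0.1 and +0.1.
synthetic = np.linspace(-0.1, 0.1, 201)
print(observed_range(synthetic))
print(round(central_halfwidth(synthetic), 3))
```

The half-width is measured from frame center rather than from the corpus mean, which matches the phrasing 'within +/- of center.' A corpus with a biased mean would need both numbers reported.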
In the MidJourney monoculture set, the most extreme left image is the woman reading by a window, Delta x = -0.144, 15% off-center. Human artists routinely place figures at 40% displacement for the rule of thirds. The most extreme right image is a fractal pattern at Delta x = +0.191. Still well within what any painter or photographer would call 'centered.'
When told explicitly to place a figure 'in the extreme lower-left corner, vast empty white wall,' MidJourney generates Delta x = -0.1186. The prompt is honored semantically. A small figure exists. White wall surrounds it. Gradients pull the composition up and toward the center. The spatial prior remains. The prior is stronger than the prompt.
Void Ratio (rv)
Theoretical range: 0 (all edges, fully packed) to 1.0 (completely empty). Observed range: 0.795 to 0.955. All 400 images cluster between 80% and 96% void.
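This section does not spell out the exact definition of rv, so the following is only a sketch under a stated assumption: void is the fraction of pixels whose gradient magnitude falls below a cutoff set relative to the image's peak gradient. A per-image percentile cutoff would pin the ratio to a constant, which is why a relative cutoff is assumed here. The function name and the default cutoff are hypothetical.

```python
import numpy as np

def void_ratio(grad_mag, rel_threshold=0.1):
    """Fraction of pixels with gradient magnitude below rel_threshold * peak.

    Returns 1.0 for a perfectly flat field (completely empty).
    """
    grad_mag = np.asarray(grad_mag, dtype=float)
    peak = grad_mag.max()
    if peak == 0:
        return 1.0
    return float(np.mean(grad_mag < rel_threshold * peak))

# Synthetic gradient field: one row of strong edges in an otherwise flat frame.
field = np.zeros((10, 10))
field[0, :] = 1.0
print(void_ratio(field))
```

Under this assumption, rv = 0 means every pixel carries edge energy and rv = 1 means none does, matching the theoretical range given above.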
Most void-heavy image: spring garden, rv = 0.954. 95% empty space, yet centered (Delta x = -0.053). Most packed image: octopus, rv = 0.795. Still 80% void, still centered (Delta x = +0.050). Even the outliers respect the template.
Different textures. Different colors. Different subjects. Same spatial structure. The model has learned to generate infinite semantic variations of one compositional template.
Part 5: Forbidden Zones
When all 400 images are plotted in Delta x-rv space, the revealing feature is not the cluster. It is the emptiness around it. Entire compositional territories are absent, not by chance, but by architectural constraint.
These are not rare strategies. Rule of thirds (Delta x >= 0.33) is the first compositional technique taught in photography and painting. Poster-density design (rv < 0.60) is standard graphic design practice. The model's forbidden zones map directly onto the art school curriculum.
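The two thresholds quoted above can be turned into a simple check on any (Delta x, rv) pair. This is a sketch: the zone labels and function name are illustrative, and applying the displacement threshold symmetrically to left and right is an assumption.

```python
def compositional_zones(delta_x, rv, thirds_offset=0.33, dense_rv=0.60):
    """Return the named zones a (delta_x, rv) measurement falls into.

    Thresholds are the ones quoted in the text; an empty list means the
    image sits outside both zones.
    """
    zones = []
    if abs(delta_x) >= thirds_offset:  # assumed symmetric, left or right
        zones.append("rule-of-thirds displacement")
    if rv < dense_rv:
        zones.append("poster density")
    return zones

# The most right-displaced Delta x and the most packed rv reported for the
# corpus, combined here into one hypothetical worst-case point.
print(compositional_zones(0.191, 0.795))
# A hypothetical measurement with strong displacement and typical void.
print(compositional_zones(-0.40, 0.90))
```

Even the corpus extremes return an empty list, which is the forbidden-zone claim restated as a predicate.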
Canonical works occupy coordinates MidJourney cannot reach under standard prompting. This is not to say it can't be done: through adversarial prompting, compositional balance can be shifted.
However, such shifts tend to be tradeoffs, with the displacement becoming the singular focus of the prompt. Compare the canon. Caravaggio's The Calling of Saint Matthew uses extreme edge lighting and radical asymmetry. Degas composes figures in competing directional vectors. Hokusai's The Great Wave places mass at the periphery with radical displacement. Vermeer constructs through rectangular framing devices that generate distributed gradient regions across multiple spatial planes. This is not a quality argument. It is a compositional range argument. AI, at present, needs a prompt focused on one rule to snap outside the boundary of its operating window.
Part 6: Two Organizational Strategies
Below are side-by-side comparisons. Top: MidJourney woman-reading-by-window with radial mass overlay (Delta x = -0.143 despite figure's left placement). Bottom: Vermeer's Girl Reading a Letter by an Open Window with gradient overlay showing distributed planar mass (Delta x = 0.039). Vermeer, one might quickly point out, is very centered.
Let us compare. Vermeer's composition distributes mass across multiple planar zones through rectangular framing devices. MidJourney, on the other hand, organizes radially from center outward regardless of subject position, as in many of its images.
Vermeer (Delta x = 0.039): Composition organized through rectangular framing devices. Window, curtain, letter, and painted map each generate gradient mass in distinct spatial planes. The mass is distributed across multiple zones at different depths, creating planar transitions. The eye moves on a path constructed from interlocking shapes.
AI, across a number of corpus runs, distributes mass radially outward from the center.
This is not a quality comparison. It is evidence of organizational strategy: diffusion models employ a structurally distinct compositional system that persists regardless of semantic content, and that system differs fundamentally from how trained artists construct space.
Part 7: Why Existing Metrics Cannot See This
CLIP measures semantic similarity. It does not ask where the butterfly is, only whether a butterfly is present. A centered butterfly and a Hokusai-positioned butterfly score identically.
FID measures distance between generated and real feature distributions. The training data has the same compositional bias as the generated outputs. The benchmark is built on the bias it is supposed to measure.
Aesthetic predictors learn human preferences. Human preferences have been shaped by decades of photography, social media, and AI output. Centered, balanced, readable compositions score well because that is what the rater population has been trained on. The feedback loop is closed.
T2I-CompBench and GenEval measure relational correctness. Spatial constraints are met while compositional collapse remains invisible.
The evaluation ecosystem measures what goes into the image. Nobody was measuring where it went, or whether it went to the same place every time. Geometric behavior is measured only by the kernel.
Conclusion: The Template
Diffusion models do not generate images. They generate a compositional template in the first ten denoising steps, populate it with semantic content in the following twenty, and refine the surface details in the last twenty. Users see step fifty and interpret it as diversity.
The template parameters for MidJourney in vertical 2:3 format: Delta x = 0.005 +/- 0.044 (centered), rv = 0.850 +/- 0.034 (85% void), mass organized within a radial envelope. This template loads for butterflies, businessmen, cathedrals, and fractals. Different content. Same structure.
The contribution of this framework is measurement infrastructure: quantitative bounds on compositional range, vocabulary for describing what artists have always perceived and researchers have lacked tools to verify, a diagnostic that makes the invisible visible and, therefore, addressable.
Authorship
This work is offered as a contribution, not criticism. The models can generate a potentially infinite number of images. But they have measurable compositional constraints that existing metrics don't capture. This framework makes those constraints visible and provides infrastructure for addressing them.
(c) 2025 Russell Parrish / A.rtist I.nfluencer. All rights reserved. No part of this system, visual material, or accompanying documents may be reproduced, distributed, or transmitted in any form or by any means, including AI training datasets, without explicit written permission from the creator. www.artistinfluencer.com
Citation: Russell Parrish. Mass, Not Subject: Reading AI-Generated Images Through Gradient Fields. A.rtist I.nfluencer, 2025. ORCID: 0009-0008-9781-7995