Human vision combines low-resolution "gist" information from the visual periphery with sparse but high-resolution information from fixated locations to construct a coherent understanding of a visual scene.
In this paper, we introduce MetamerGen, a tool for generating scenes that are aligned with latent human scene representations. MetamerGen is a latent diffusion model that combines scene-gist information obtained from the visual periphery with high-resolution information obtained at scene-viewing fixations, generating image metamers of what humans understand after viewing a scene.
What are scene metamers? Scene metamers are physically distinct images that produce indistinguishable perceptual experiences in human observers. They reveal the representational structure of the visual system by identifying image variations that preserve the latent features underlying human scene understanding.
Each trial began with free viewing of an original scene, after which the viewer's fixation coordinates were passed to MetamerGen to generate a new image conditioned on their gaze (45 participants). The generated image was then shown briefly, and participants made a same–different judgment to assess perceptual metamerism. A separate experiment with 12 participants used randomized fixation locations to test how individual gaze strategies influence metamerism.
MetamerGen model architecture
High-resolution and blurred, low-resolution images are processed through DINOv2-Base to extract patch tokens. Foveal features are obtained by applying binary masks to the high-resolution patch tokens, retaining only fixated regions. Both foveal and peripheral patch tokens are processed through separate Perceiver-based query networks that compress the features into conditioning tokens compatible with Stable Diffusion's cross-attention mechanism. The resulting dual conditioning streams are integrated into the pretrained UNet.
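The dual-stream conditioning above can be sketched in PyTorch. Everything here is an illustrative assumption rather than the paper's exact configuration: the tensor sizes, the 16-query single-cross-attention stand-in for the Perceiver networks, and zeroing (rather than dropping) non-fixated patch tokens.

```python
import torch
import torch.nn as nn

class PerceiverQuery(nn.Module):
    """Compress a variable number of patch tokens into a fixed set of
    conditioning tokens via cross-attention (a minimal stand-in for the
    Perceiver-based query networks described above)."""
    def __init__(self, dim=768, n_queries=16, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, tokens):                       # tokens: (B, N, dim)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)        # (B, n_queries, dim)
        return out

B, N, D = 2, 256, 768                                # 16x16 patch grid (illustrative)
hi_res_tokens = torch.randn(B, N, D)                 # stand-in for DINOv2 patch tokens
lo_res_tokens = torch.randn(B, N, D)                 # blurred, low-resolution "gist" pathway

# Binary foveal mask: keep only tokens at fixated patch locations
# (zeroing non-fixated tokens is a simplification of the binary masking).
fix_mask = torch.zeros(B, N, dtype=torch.bool)
fix_mask[:, :32] = True                              # pretend the first 32 patches were fixated
foveal_tokens = hi_res_tokens * fix_mask.unsqueeze(-1)

foveal_net, periph_net = PerceiverQuery(), PerceiverQuery()
cond = torch.cat([foveal_net(foveal_tokens), periph_net(lo_res_tokens)], dim=1)
print(cond.shape)                                    # torch.Size([2, 32, 768])
```

The concatenated `cond` tensor plays the role of the conditioning tokens fed to the diffusion UNet's cross-attention layers.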
Probing the features of scene metamers across the visual hierarchy
Multi-level feature analysis pipeline using a neurally grounded model: (Top) Early, mid, and late network layers serve as proxies for different stages of processing across the hierarchy of visual brain areas. (Bottom) As feature similarity increased at these processing levels, the proportion of participants judging generated images as metameric also increased. These effects were clearer when metamers were generated from a viewer's own fixated locations (salmon) than from randomly sampled locations (turquoise).
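This kind of multi-level comparison can be sketched as follows, using a tiny randomly initialized network in place of the neurally grounded model (the layer choices, sizes, and inputs below are arbitrary placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-in network; "early", "mid", and "late" taps play the role of
# proxies for successive stages of the visual hierarchy.
net = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),     # early
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),    # mid
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),   # late
)
taps = {1: "early", 3: "mid", 5: "late"}

def layer_features(x):
    """Collect flattened activations at the tapped layers."""
    feats = {}
    for i, layer in enumerate(net):
        x = layer(x)
        if i in taps:
            feats[taps[i]] = x.flatten(1)
    return feats

original, generated = torch.randn(1, 3, 32, 32), torch.randn(1, 3, 32, 32)
f_o, f_g = layer_features(original), layer_features(generated)
for name in ("early", "mid", "late"):
    sim = F.cosine_similarity(f_o[name], f_g[name]).item()
    print(f"{name}: {sim:.3f}")
```

Per-level cosine similarities like these can then be related to the proportion of "same" judgments at each stage.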
Semantics outperforms structure in driving scene metamerism
(Left) Mid-level visual features driving metamer judgments: For metamers generated from the viewer's own fixation locations (salmon), changes in monocular depth estimates of scene structure strongly predicted "same" judgments of the generated scenes. Alignment in proto-object segmentation between original and generated scenes, quantified by mIoU, similarly predicted the metamerism rate, with higher mIoU scores corresponding to higher proportions of "same" judgments, although this relationship was less pronounced.
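The mIoU measure used here is the average, over classes, of the intersection-over-union between two label maps. A minimal sketch, assuming the proto-object labels of the two maps are already in correspondence (the toy 3×3 maps below are made up):

```python
import numpy as np

def mean_iou(seg_a, seg_b, n_classes):
    """Mean intersection-over-union between two integer label maps."""
    ious = []
    for c in range(n_classes):
        a, b = seg_a == c, seg_b == c
        union = np.logical_or(a, b).sum()
        if union == 0:
            continue                    # class absent from both maps
        ious.append(np.logical_and(a, b).sum() / union)
    return float(np.mean(ious))

orig = np.array([[0, 0, 1], [0, 1, 1], [2, 2, 2]])  # "original" segmentation
gen  = np.array([[0, 0, 1], [0, 1, 1], [2, 2, 1]])  # "generated" segmentation
print(round(mean_iou(orig, gen, 3), 3))             # → 0.806
```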
(Right) High-level visual features driving metamer judgments: Semantic similarity strongly predicts metameric scene understanding, with larger DreamSim distances corresponding to reduced perceptual alignment. CLIP similarity shows a similar trend, but not when scenes were generated from randomly sampled locations (turquoise).
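Both high-level measures reduce to comparisons between image embeddings: CLIP similarity is a cosine similarity between the two images' embeddings, and DreamSim distance is typically computed as one minus a cosine similarity in its own embedding space. A toy sketch with random unit vectors standing in for the pretrained encoders' outputs (the 512-d size is illustrative):

```python
import torch
import torch.nn.functional as F

# Random embeddings standing in for pretrained CLIP / DreamSim encoder
# outputs; in practice these come from the respective models.
emb_original  = F.normalize(torch.randn(1, 512), dim=-1)
emb_generated = F.normalize(torch.randn(1, 512), dim=-1)

clip_sim = F.cosine_similarity(emb_original, emb_generated).item()
dreamsim_dist = 1.0 - clip_sim   # distance of the form 1 - cosine similarity
print(f"similarity={clip_sim:.3f}, distance={dreamsim_dist:.3f}")
```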
MetamerGen metameric vs. non-metameric scenes by human vs. random fixations
(1/2) Human Fixations: Additional metameric vs. non-metameric judgment examples.
(Left) Original images with human fixations overlaid in red and corresponding generated images judged as "same" by participants.
(Right) Original images with fixations and generated images judged as "different" by participants.
(2/2) Random Fixations: Additional metameric vs. non-metameric judgment examples.
(Left) Original images with randomly-sampled fixations overlaid in red and corresponding generated images judged as "same" by participants.
(Right) Original images with fixations and generated images judged as "different" by participants.
Acknowledgements
We would like to thank the National Science Foundation for supporting this work through awards 2123920 and 2444540 to GJZ, and the National Institutes of Health through their award R01EY030669, also to GJZ. AL is supported by NSF-GRFP award 2234683. AG is supported by NSF grants IIS-2123920 and IIS-2212046 awarded to DS. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding agency.
Bibtex
@inproceedings{
raina2026generating,
title={Generating metamers of human scene understanding},
author={Ritik Raina and Abe Leite and Alexandros Graikos and Seoyoung Ahn and Dimitris Samaras and Greg Zelinsky},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=cSDXx8V6K9}
}