Human vision combines low-resolution "gist" information from the visual periphery with sparse but high-resolution information from fixated locations to construct a coherent understanding of a visual scene.
In this paper, we introduce MetamerGen, a tool for generating scenes that are aligned with latent human scene representations. MetamerGen is a latent diffusion model that combines scene-gist information obtained from the visual periphery with high-resolution information obtained at scene-viewing fixations, generating image metamers of what humans understand after viewing a scene.
What are scene metamers? Scene metamers are physically distinct images that produce indistinguishable perceptual experiences in human observers. They reveal the representational structure of the visual system by identifying image variations that preserve the latent features underlying human scene understanding.
Each trial began with free viewing of an original scene, after which the viewer's fixation coordinates were passed to MetamerGen to generate a new image conditioned on their gaze (45 participants). The generated image was then shown briefly, and participants made a same–different judgment to assess perceptual metamerism. A separate experiment with 12 participants used randomized fixation locations to test how individual gaze strategies influence metamerism.
MetamerGen model architecture
High-resolution and blurred, low-resolution images are processed through DINOv2-Base to extract patch tokens. Foveal features are obtained by applying binary masks to the high-resolution patch tokens, retaining only fixated regions. Both foveal and peripheral patch tokens are processed through separate Perceiver-based query networks that compress the features into conditioning tokens compatible with Stable Diffusion's cross-attention mechanism. The resulting dual conditioning streams are integrated into the pretrained UNet.
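The dual-stream conditioning above can be sketched in PyTorch. Everything here is an illustrative assumption rather than the paper's exact configuration: the tensor sizes, the 16-query single-cross-attention stand-in for the Perceiver networks, and zeroing (rather than dropping) non-fixated patch tokens.

```python
import torch
import torch.nn as nn

class PerceiverQuery(nn.Module):
    """Compress a variable number of patch tokens into a fixed set of
    conditioning tokens via cross-attention (a minimal stand-in for the
    Perceiver-based query networks described above)."""
    def __init__(self, dim=768, n_queries=16, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, tokens):                       # tokens: (B, N, dim)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)        # (B, n_queries, dim)
        return out

B, N, D = 2, 256, 768                                # 16x16 patch grid (illustrative)
hi_res_tokens = torch.randn(B, N, D)                 # stand-in for DINOv2 patch tokens
lo_res_tokens = torch.randn(B, N, D)                 # blurred, low-resolution "gist" pathway

# Binary foveal mask: keep only tokens at fixated patch locations
# (zeroing non-fixated tokens is a simplification of the binary masking).
fix_mask = torch.zeros(B, N, dtype=torch.bool)
fix_mask[:, :32] = True                              # pretend the first 32 patches were fixated
foveal_tokens = hi_res_tokens * fix_mask.unsqueeze(-1)

foveal_net, periph_net = PerceiverQuery(), PerceiverQuery()
cond = torch.cat([foveal_net(foveal_tokens), periph_net(lo_res_tokens)], dim=1)
print(cond.shape)                                    # torch.Size([2, 32, 768])
```

The concatenated `cond` tensor plays the role of the conditioning tokens fed to the diffusion UNet's cross-attention layers.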
Probing the features of scene metamers across the visual hierarchy
Multi-level feature analysis pipeline using a neurally grounded model: (Top) Early, mid, and late network layers serve as proxies for different stages of processing across the hierarchy of visual brain areas. (Bottom) As feature similarity increased at these processing levels, the proportion of participants judging generated images as metameric also increased. These effects were clearer when metamers were generated from a viewer's own fixated locations (salmon) than from randomly sampled locations (turquoise).
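This kind of multi-level comparison can be sketched as follows, using a tiny randomly initialized network in place of the neurally grounded model (the layer choices, sizes, and inputs below are arbitrary placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-in network; "early", "mid", and "late" taps play the role of
# proxies for successive stages of the visual hierarchy.
net = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),     # early
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),    # mid
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),   # late
)
taps = {1: "early", 3: "mid", 5: "late"}

def layer_features(x):
    """Collect flattened activations at the tapped layers."""
    feats = {}
    for i, layer in enumerate(net):
        x = layer(x)
        if i in taps:
            feats[taps[i]] = x.flatten(1)
    return feats

original, generated = torch.randn(1, 3, 32, 32), torch.randn(1, 3, 32, 32)
f_o, f_g = layer_features(original), layer_features(generated)
for name in ("early", "mid", "late"):
    sim = F.cosine_similarity(f_o[name], f_g[name]).item()
    print(f"{name}: {sim:.3f}")
```

Per-level cosine similarities like these can then be related to the proportion of "same" judgments at each stage.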
Semantics outperforms structure in driving scene metamerism
(Left) Mid-level visual features driving metamer judgments: For metamers generated from the viewer's own fixation locations (salmon), changes in monocular depth estimates of scene structure strongly predicted "same" judgments of the generated scenes. Alignment in proto-object segmentation between original and generated scenes, quantified by mIoU, similarly predicted the metamerism rate, with higher mIoU scores corresponding to higher proportions of "same" judgments, although this relationship was less pronounced.
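The mIoU measure used here is the average, over classes, of the intersection-over-union between two label maps. A minimal sketch, assuming the proto-object labels of the two maps are already in correspondence (the toy 3×3 maps below are made up):

```python
import numpy as np

def mean_iou(seg_a, seg_b, n_classes):
    """Mean intersection-over-union between two integer label maps."""
    ious = []
    for c in range(n_classes):
        a, b = seg_a == c, seg_b == c
        union = np.logical_or(a, b).sum()
        if union == 0:
            continue                    # class absent from both maps
        ious.append(np.logical_and(a, b).sum() / union)
    return float(np.mean(ious))

orig = np.array([[0, 0, 1], [0, 1, 1], [2, 2, 2]])  # "original" segmentation
gen  = np.array([[0, 0, 1], [0, 1, 1], [2, 2, 1]])  # "generated" segmentation
print(round(mean_iou(orig, gen, 3), 3))             # → 0.806
```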
(Right) High-level visual features driving metamer judgments: Semantic similarity strongly predicts metameric scene understanding, with larger DreamSim distances corresponding to reduced perceptual alignment. CLIP similarity shows a similar trend, but not when scenes were generated from randomly sampled locations (turquoise).
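Both high-level measures reduce to comparisons between image embeddings: CLIP similarity is a cosine similarity between the two images' embeddings, and DreamSim distance is typically computed as one minus a cosine similarity in its own embedding space. A toy sketch with random unit vectors standing in for the pretrained encoders' outputs (the 512-d size is illustrative):

```python
import torch
import torch.nn.functional as F

# Random embeddings standing in for pretrained CLIP / DreamSim encoder
# outputs; in practice these come from the respective models.
emb_original  = F.normalize(torch.randn(1, 512), dim=-1)
emb_generated = F.normalize(torch.randn(1, 512), dim=-1)

clip_sim = F.cosine_similarity(emb_original, emb_generated).item()
dreamsim_dist = 1.0 - clip_sim   # distance of the form 1 - cosine similarity
print(f"similarity={clip_sim:.3f}, distance={dreamsim_dist:.3f}")
```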
MetamerGen metameric vs. non-metameric scenes by human vs. random fixations
(1/2) Human Fixations: Additional metameric vs. non-metameric judgment examples.
(Left) Original images with human fixations overlaid in red and corresponding generated images judged as "same" by participants.
(Right) Original images with fixations and generated images judged as "different" by participants.
(2/2) Random Fixations: Additional metameric vs. non-metameric judgment examples.
(Left) Original images with randomly-sampled fixations overlaid in red and corresponding generated images judged as "same" by participants.
(Right) Original images with fixations and generated images judged as "different" by participants.
Acknowledgements
We would like to thank the National Science Foundation for supporting this work through awards 2123920 and 2444540 to GJZ, and the National Institutes of Health through their award R01EY030669, also to GJZ. AL is supported by NSF-GRFP award 2234683. AG is supported by NSF grants IIS-2123920 and IIS-2212046 awarded to DS. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding agency.
Bibtex
@inproceedings{
raina2026generating,
title={Generating metamers of human scene understanding},
author={Ritik Raina and Abe Leite and Alexandros Graikos and Seoyoung Ahn and Dimitris Samaras and Greg Zelinsky},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=cSDXx8V6K9}
}