SPIN-SGG: spatial integration for open-vocabulary scene graph generation

Bibliographic Details
Main Authors: Nanhao Liang, Xiaoyuan Yang, Shengyi Wang, Yong Liu, Yingwei Xia
Format: Article
Language: English
Published: Elsevier, 2025-08-01
Series: Journal of King Saud University: Computer and Information Sciences
Subjects: Scene graph generation; Spatial integration; Open-vocabulary detection; Vision-language alignment
Online Access: https://doi.org/10.1007/s44443-025-00203-2
author Nanhao Liang
Xiaoyuan Yang
Shengyi Wang
Yong Liu
Yingwei Xia
author_sort Nanhao Liang
collection DOAJ
description Abstract Scene Graph Generation (SGG) aims to represent visual scenes as structured graphs in which objects and their pairwise relationships are modeled as relational triples. However, conventional SGG methods suffer from two critical limitations: inadequate modeling of rich spatial relations beyond 2D layouts, and poor generalization in open-vocabulary scenarios. To address these limitations, we propose SPIN-SGG, a novel framework that explicitly integrates multi-dimensional spatial information (planar geometry, relative depth, and topological structure) into an open-vocabulary SGG pipeline without requiring ground-truth 3D annotations. Our approach follows a two-stage design. First, we generate pseudo-3D scene reconstructions from monocular images using multi-view synthesis and point cloud estimation, and construct a spatially enriched instruction-tuning dataset with fine-grained spatial predicates (e.g., left of, above, inside). Second, we propose a spatially aware vision-language model, trained on both static scene graph descriptions and dynamic spatial reasoning tasks such as question answering and multi-turn dialogue. To further enhance spatial layout consistency, we incorporate layer-aware clustering and object-level depth anchoring in the scene parsing module. Extensive experiments on the PSG benchmark and a newly curated SpatialSGG dataset show that SPIN-SGG significantly outperforms previous open-vocabulary SGG methods, improving mR@50 by +1.3% on PSG and QA accuracy by +2.7% on SpatialSGG, demonstrating robust and comprehensive spatial reasoning capabilities.
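The description above couples relational triples with fine-grained spatial predicates derived from 2D layout and relative depth. As a purely illustrative sketch, and not the authors' pipeline, the Python snippet below shows one way such predicates (left of, above, inside, in front of) could be derived from 2D bounding boxes plus a per-object relative depth and emitted as (subject, predicate, object) triples; the Obj dataclass, the depth_margin threshold, and the rule set are hypothetical assumptions introduced here for clarity.

# Illustrative sketch only: coarse spatial predicates from 2D boxes + relative depth.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Obj:
    label: str
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image coordinates
    depth: float                             # relative depth, smaller = closer to camera

def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def contains(outer, inner):
    # True if the inner box lies entirely within the outer box
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def spatial_predicates(a: Obj, b: Obj, depth_margin: float = 0.1) -> List[str]:
    # Hypothetical rule set; a learned model would replace these hard-coded tests.
    preds = []
    (ax, ay), (bx, by) = center(a.box), center(b.box)
    if contains(b.box, a.box):
        preds.append("inside")        # a is fully enclosed by b's box
    if ax < bx:
        preds.append("left of")       # planar x-axis relation
    if ay < by:
        preds.append("above")         # planar y-axis relation (y grows downward)
    if a.depth + depth_margin < b.depth:
        preds.append("in front of")   # relative-depth relation
    return preds

def triples(objects: List[Obj]) -> List[Tuple[str, str, str]]:
    # Emit (subject, predicate, object) triples for every ordered object pair
    out = []
    for a in objects:
        for b in objects:
            if a is b:
                continue
            out.extend((a.label, p, b.label) for p in spatial_predicates(a, b))
    return out

# Example: a cup on a table, with the cup closer to the camera
scene = [Obj("cup", (100, 80, 160, 140), depth=0.4),
         Obj("table", (40, 120, 400, 300), depth=0.6)]
print(triples(scene))

Running the example prints triples such as ('cup', 'above', 'table') and ('cup', 'in front of', 'table'); a real system would learn richer predicates from the instruction-tuning data rather than rely on hard-coded geometric rules.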
format Article
id doaj-art-b42f33e9e8844ee78e7fdbdb92c8208d
institution Kabale University
issn 1319-1578
2213-1248
language English
publishDate 2025-08-01
publisher Elsevier
record_format Article
series Journal of King Saud University: Computer and Information Sciences
spelling doaj-art-b42f33e9e8844ee78e7fdbdb92c8208d | indexed 2025-08-24T11:53:22Z | eng | Elsevier | Journal of King Saud University: Computer and Information Sciences | ISSN 1319-1578, 2213-1248 | Vol. 37, No. 7 (2025-08-01) | https://doi.org/10.1007/s44443-025-00203-2 | SPIN-SGG: spatial integration for open-vocabulary scene graph generation | Nanhao Liang, Xiaoyuan Yang, Shengyi Wang, Yong Liu, Yingwei Xia (all: Hefei Institutes of Physical Science, Chinese Academy of Sciences) | Subjects: Scene graph generation; Spatial integration; Open-vocabulary detection; Vision-language alignment
title SPIN-SGG: spatial integration for open-vocabulary scene graph generation
topic Scene graph generation
Spatial integration
Open-vocabulary detection
Vision-language alignment
url https://doi.org/10.1007/s44443-025-00203-2