SPIN-SGG: spatial integration for open-vocabulary scene graph generation

Bibliographic Details
Main Authors: Nanhao Liang, Xiaoyuan Yang, Shengyi Wang, Yong Liu, Yingwei Xia
Format: Article
Language: English
Published: Elsevier, 2025-08-01
Series: Journal of King Saud University: Computer and Information Sciences
Subjects: Scene graph generation; Spatial integration; Open-vocabulary detection; Vision-language alignment
Online Access: https://doi.org/10.1007/s44443-025-00203-2
author Nanhao Liang
Xiaoyuan Yang
Shengyi Wang
Yong Liu
Yingwei Xia
author_sort Nanhao Liang
collection DOAJ
description Abstract Scene Graph Generation (SGG) aims to represent visual scenes as structured graphs in which objects and their pairwise relationships are modeled as relational triples. However, conventional SGG methods suffer from two critical limitations: inadequate modeling of rich spatial relations beyond 2D layouts, and poor generalization in open-vocabulary scenarios. To address these limitations, we propose SPIN-SGG, a novel framework that explicitly integrates multi-dimensional spatial information (planar geometry, relative depth, and topological structure) into an open-vocabulary SGG pipeline without requiring ground-truth 3D annotations. Our approach follows a two-stage design. First, we generate pseudo-3D scene reconstructions from monocular images using multi-view synthesis and point cloud estimation, and construct a spatially enriched instruction-tuning dataset with fine-grained spatial predicates (e.g., left of, above, inside). Second, we propose a spatially aware vision-language model, trained on both static scene graph descriptions and dynamic spatial reasoning tasks such as question answering and multi-turn dialogue. To further enhance spatial layout consistency, we incorporate layer-aware clustering and object-level depth anchoring in the scene parsing module. Extensive experiments on the PSG benchmark and a newly curated SpatialSGG dataset show that SPIN-SGG significantly outperforms previous open-vocabulary SGG methods, improving mR@50 by +1.3% on PSG and QA accuracy by +2.7% on SpatialSGG, demonstrating robust and comprehensive spatial reasoning capabilities.
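The description above couples relational triples with fine-grained spatial predicates derived from 2D layout and relative depth. As a purely illustrative sketch, and not the authors' pipeline, the Python snippet below shows one way such predicates (left of, above, inside, in front of) could be derived from 2D bounding boxes plus a per-object relative depth and emitted as (subject, predicate, object) triples; the Obj dataclass, the depth_margin threshold, and the rule set are hypothetical assumptions introduced here for clarity.

# Illustrative sketch only: coarse spatial predicates from 2D boxes + relative depth.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Obj:
    label: str
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image coordinates
    depth: float                             # relative depth, smaller = closer to camera

def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def contains(outer, inner):
    # True if the inner box lies entirely within the outer box
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def spatial_predicates(a: Obj, b: Obj, depth_margin: float = 0.1) -> List[str]:
    # Hypothetical rule set; a learned model would replace these hard-coded tests.
    preds = []
    (ax, ay), (bx, by) = center(a.box), center(b.box)
    if contains(b.box, a.box):
        preds.append("inside")        # a is fully enclosed by b's box
    if ax < bx:
        preds.append("left of")       # planar x-axis relation
    if ay < by:
        preds.append("above")         # planar y-axis relation (y grows downward)
    if a.depth + depth_margin < b.depth:
        preds.append("in front of")   # relative-depth relation
    return preds

def triples(objects: List[Obj]) -> List[Tuple[str, str, str]]:
    # Emit (subject, predicate, object) triples for every ordered object pair
    out = []
    for a in objects:
        for b in objects:
            if a is b:
                continue
            out.extend((a.label, p, b.label) for p in spatial_predicates(a, b))
    return out

# Example: a cup on a table, with the cup closer to the camera
scene = [Obj("cup", (100, 80, 160, 140), depth=0.4),
         Obj("table", (40, 120, 400, 300), depth=0.6)]
print(triples(scene))

Running the example prints triples such as ('cup', 'above', 'table') and ('cup', 'in front of', 'table'); a real system would learn richer predicates from the instruction-tuning data rather than rely on hard-coded geometric rules.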
format Article
id doaj-art-b42f33e9e8844ee78e7fdbdb92c8208d
institution Kabale University
issn 1319-1578
2213-1248
language English
publishDate 2025-08-01
publisher Elsevier
record_format Article
series Journal of King Saud University: Computer and Information Sciences
spelling doaj-art-b42f33e9e8844ee78e7fdbdb92c8208d | indexed 2025-08-24T11:53:22Z | eng | Elsevier | Journal of King Saud University: Computer and Information Sciences | ISSN 1319-1578, 2213-1248 | Vol. 37, No. 7 (2025-08-01) | https://doi.org/10.1007/s44443-025-00203-2 | SPIN-SGG: spatial integration for open-vocabulary scene graph generation | Nanhao Liang, Xiaoyuan Yang, Shengyi Wang, Yong Liu, Yingwei Xia (all: Hefei Institutes of Physical Science, Chinese Academy of Sciences) | Subjects: Scene graph generation; Spatial integration; Open-vocabulary detection; Vision-language alignment
title SPIN-SGG: spatial integration for open-vocabulary scene graph generation
topic Scene graph generation
Spatial integration
Open-vocabulary detection
Vision-language alignment
url https://doi.org/10.1007/s44443-025-00203-2