SPIN-SGG: Spatial Integration for Open-Vocabulary Scene Graph Generation

Bibliographic Details
Main Authors: Nanhao Liang, Xiaoyuan Yang, Shengyi Wang, Yong Liu, Yingwei Xia
Format: Article
Language: English
Published: Springer 2025-08-01
Series: Journal of King Saud University: Computer and Information Sciences
Online Access: https://doi.org/10.1007/s44443-025-00203-2
Description
Summary: Scene Graph Generation (SGG) aims to represent visual scenes as structured graphs in which objects and their pairwise relationships are modeled as relational triples. Conventional SGG methods, however, suffer from two key limitations: inadequate modeling of rich spatial relations beyond 2D layouts, and poor generalization in open-vocabulary scenarios. To address these limitations, we propose SPIN-SGG, a novel framework that explicitly integrates multi-dimensional spatial information (planar geometry, relative depth, and topological structure) into an open-vocabulary SGG pipeline without requiring ground-truth 3D annotations. Our approach follows a two-stage design. First, we generate pseudo-3D scene reconstructions from monocular images using multi-view synthesis and point cloud estimation, and construct a spatially enriched instruction-tuning dataset that includes fine-grained spatial predicates (e.g., left of, above, inside). Second, we propose a spatially aware vision-language model, trained with both static scene graph descriptions and dynamic spatial reasoning tasks such as question answering and multi-turn dialogue. To further enhance spatial layout consistency, we incorporate layer-aware clustering and object-level depth anchoring in the scene parsing module. Extensive experiments on the PSG benchmark and a newly curated SpatialSGG dataset show that SPIN-SGG significantly outperforms previous open-vocabulary SGG methods, improving mR@50 by +1.3% on PSG and QA accuracy by +2.7% on SpatialSGG, demonstrating robust and comprehensive spatial reasoning capabilities.
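To make the first stage concrete, here is a minimal sketch (illustrative Python, not the authors' code) of how fine-grained spatial predicates such as left of, above, or in front of could be derived from 2D bounding boxes combined with per-object relative depth from a monocular estimator. The predicate set, the containment rule, and the normalization are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    box: tuple    # (x1, y1, x2, y2) in pixels
    depth: float  # relative depth in [0, 1]; smaller = closer to the camera

def _center(box):
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0

def spatial_predicate(a, b, img_w, img_h):
    """Return the dominant spatial relation of `a` with respect to `b`."""
    # Containment in the image plane takes priority over directional cues.
    if (a.box[0] >= b.box[0] and a.box[1] >= b.box[1]
            and a.box[2] <= b.box[2] and a.box[3] <= b.box[3]):
        return "inside"

    ax, ay = _center(a.box)
    bx, by = _center(b.box)
    # Normalize each displacement so the three axes are comparable.
    dx = (ax - bx) / img_w
    dy = (ay - by) / img_h  # image y grows downward
    dz = a.depth - b.depth

    cues = {
        "left of": -dx, "right of": dx,
        "above": -dy, "below": dy,
        "in front of": -dz, "behind": dz,
    }
    return max(cues, key=cues.get)

# Toy example: a cup to the left of, and closer than, a vase.
cup = Instance(box=(100, 200, 160, 260), depth=0.30)
vase = Instance(box=(320, 180, 400, 320), depth=0.45)
print(spatial_predicate(cup, vase, img_w=640, img_h=480))  # -> "left of"
```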
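The spatially enriched instruction-tuning data could then pair each pseudo-3D triple with both task styles the abstract mentions: a static scene-graph description and dynamic question answering. The prompt templates and the `triples_to_samples` helper below are hypothetical.

```python
def triples_to_samples(triples):
    """Convert (subject, predicate, object) triples into instruction data."""
    samples = []
    # Static task: describe the whole scene graph in one response.
    description = "; ".join(f"the {s} is {p} the {o}" for s, p, o in triples)
    samples.append({
        "instruction": "Describe the spatial layout of the scene.",
        "response": description.capitalize() + ".",
    })
    # Dynamic task: one spatial question per triple.
    for s, p, o in triples:
        samples.append({
            "instruction": f"Where is the {s} relative to the {o}?",
            "response": f"The {s} is {p} the {o}.",
        })
    return samples

for sample in triples_to_samples([("cup", "left of", "vase"),
                                  ("vase", "on", "shelf")]):
    print(sample)
```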
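Finally, a rough sketch of what object-level depth anchoring and layer-aware clustering might look like, assuming each object is anchored to the median depth inside its segmentation mask and objects are then grouped into depth layers with a simple 1-D k-means; the layer count and initialization are assumptions, not details from the paper.

```python
import numpy as np

def depth_anchor(depth_map, mask):
    """Anchor an object to a single depth value: the median depth inside
    its mask, which is robust to outliers at object boundaries."""
    return float(np.median(depth_map[mask]))

def cluster_layers(anchors, n_layers=3, n_iter=20):
    """1-D k-means over per-object depth anchors; returns a layer id per
    object (0 = nearest layer, ..., n_layers - 1 = farthest)."""
    anchors = np.asarray(anchors, dtype=float)
    # Initialize centers at evenly spaced quantiles of the anchors.
    centers = np.quantile(anchors, np.linspace(0.1, 0.9, n_layers))
    for _ in range(n_iter):
        labels = np.argmin(np.abs(anchors[:, None] - centers[None, :]), axis=1)
        for k in range(n_layers):
            if np.any(labels == k):
                centers[k] = anchors[labels == k].mean()
    return labels

# Toy anchors for five objects (relative depth, smaller = closer).
print(cluster_layers([0.20, 0.25, 0.50, 0.80, 0.85]))  # -> [0 0 1 2 2]
```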
ISSN: 1319-1578 (print); 2213-1248 (electronic)