SPIN-SGG: spatial integration for open-vocabulary scene graph generation
Abstract Scene Graph Generation (SGG) aims to represent visual scenes as structured graphs, where objects and their pairwise relationships are modeled as relational triples. However, conventional SGG methods often struggle with two crucial limitations: inadequate modeling of rich spatial relations beyond 2D layouts, and poor generalization in open-vocabulary scenarios. To address these limitations, we propose SPIN-SGG, a novel framework that explicitly integrates multi-dimensional spatial information—including planar geometry, relative depth, and topological structure—into an open-vocabulary SGG pipeline, without requiring ground-truth 3D annotations. Our approach builds upon a two-stage design. First, we generate pseudo-3D scene reconstructions from monocular images using multi-view synthesis and point cloud estimation, and construct a spatially enriched instruction-tuning dataset that includes fine-grained spatial predicates (e.g., left of, above, inside). Second, we propose a spatially aware visual-language model, trained with both static scene graph descriptions and dynamic spatial reasoning tasks such as question answering and multi-turn dialogue. To further enhance spatial layout consistency, we incorporate layer-aware clustering and object-level depth anchoring in the scene parsing module. Extensive experiments on the PSG benchmark and a newly curated SpatialSGG dataset demonstrate that SPIN-SGG significantly outperforms previous open-vocabulary SGG methods, with improvements of +1.3% in mR@50 on PSG and +2.7% QA accuracy on SpatialSGG, showcasing its robust and comprehensive spatial reasoning capabilities.
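The abstract describes scene graphs built from relational triples between detected objects, enriched with fine-grained spatial predicates (e.g., left of, above, inside) and relative depth. The following is a minimal, hypothetical sketch (not the authors' code; all class and field names are assumptions) of how such a spatially enriched scene graph could be represented:

```python
from dataclasses import dataclass, field

# Hypothetical illustration of a spatially enriched scene graph;
# names and fields are assumptions, not SPIN-SGG's implementation.

@dataclass
class SceneObject:
    name: str        # open-vocabulary category label, e.g. "cup"
    box_2d: tuple    # planar geometry: (x1, y1, x2, y2) in image coordinates
    depth: float     # relative depth, e.g. from a pseudo-3D reconstruction

@dataclass
class RelationTriple:
    subject: SceneObject
    predicate: str   # spatial predicate such as "left of", "above", "inside"
    obj: SceneObject

@dataclass
class SceneGraph:
    objects: list = field(default_factory=list)
    triples: list = field(default_factory=list)

# Toy example: "the cup is on the table and left of the lamp"
cup = SceneObject("cup", (120, 80, 180, 140), depth=2.1)
table = SceneObject("table", (40, 130, 400, 300), depth=2.3)
lamp = SceneObject("lamp", (260, 40, 320, 150), depth=2.6)

graph = SceneGraph(
    objects=[cup, table, lamp],
    triples=[
        RelationTriple(cup, "on", table),
        RelationTriple(cup, "left of", lamp),
    ],
)

for t in graph.triples:
    print(f"{t.subject.name} -- {t.predicate} --> {t.obj.name}")
```

A structure along these lines could then be serialized into the static scene graph descriptions and spatial question-answering turns that the instruction-tuning setup in the abstract refers to.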
| Main Authors: | Nanhao Liang, Xiaoyuan Yang, Shengyi Wang, Yong Liu, Yingwei Xia |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier, 2025-08-01 |
| Series: | Journal of King Saud University: Computer and Information Sciences |
| Subjects: | Scene graph generation; Spatial integration; Open-vocabulary detection; Vision-language alignment |
| Online Access: | https://doi.org/10.1007/s44443-025-00203-2 |
| _version_ | 1849225863071006720 |
|---|---|
| author | Nanhao Liang Xiaoyuan Yang Shengyi Wang Yong Liu Yingwei Xia |
| author_facet | Nanhao Liang Xiaoyuan Yang Shengyi Wang Yong Liu Yingwei Xia |
| author_sort | Nanhao Liang |
| collection | DOAJ |
| description | Abstract Scene Graph Generation (SGG) aims to represent visual scenes as structured graphs, where objects and their pairwise relationships are modeled as relational triples. However, conventional SGG methods often struggle with two crucial limitations: inadequate modeling of rich spatial relations beyond 2D layouts, and poor generalization in open-vocabulary scenarios. To address these limitations, we propose SPIN-SGG, a novel framework that explicitly integrates multi-dimensional spatial information—including planar geometry, relative depth, and topological structure—into an open-vocabulary SGG pipeline, without requiring ground-truth 3D annotations. Our approach builds upon a two-stage design. First, we generate pseudo-3D scene reconstructions from monocular images using multi-view synthesis and point cloud estimation, and construct a spatially enriched instruction-tuning dataset that includes fine-grained spatial predicates (e.g., left of, above, inside). Second, we propose a spatially aware visual-language model, trained with both static scene graph descriptions and dynamic spatial reasoning tasks such as question answering and multi-turn dialogue. To further enhance spatial layout consistency, we incorporate layer-aware clustering and object-level depth anchoring in the scene parsing module. Extensive experiments on the PSG benchmark and a newly curated SpatialSGG dataset demonstrate that SPIN-SGG significantly outperforms previous open-vocabulary SGG methods, with improvements of +1.3% in mR@50 on PSG and +2.7% QA accuracy on SpatialSGG, showcasing its robust and comprehensive spatial reasoning capabilities. |
| format | Article |
| id | doaj-art-b42f33e9e8844ee78e7fdbdb92c8208d |
| institution | Kabale University |
| issn | 1319-1578 2213-1248 |
| language | English |
| publishDate | 2025-08-01 |
| publisher | Elsevier |
| record_format | Article |
| series | Journal of King Saud University: Computer and Information Sciences |
| spelling | doaj-art-b42f33e9e8844ee78e7fdbdb92c8208d; 2025-08-24T11:53:22Z; eng; Elsevier; Journal of King Saud University: Computer and Information Sciences; 1319-1578; 2213-1248; 2025-08-01; 37; 7; 1; 14; 10.1007/s44443-025-00203-2; SPIN-SGG: spatial integration for open-vocabulary scene graph generation; Nanhao Liang, Xiaoyuan Yang, Shengyi Wang, Yong Liu, Yingwei Xia (all: Hefei Institutes of Physical Science, Chinese Academy of Sciences); Abstract Scene Graph Generation (SGG) aims to represent visual scenes as structured graphs, where objects and their pairwise relationships are modeled as relational triples. However, conventional SGG methods often struggle with two crucial limitations: inadequate modeling of rich spatial relations beyond 2D layouts, and poor generalization in open-vocabulary scenarios. To address these limitations, we propose SPIN-SGG, a novel framework that explicitly integrates multi-dimensional spatial information—including planar geometry, relative depth, and topological structure—into an open-vocabulary SGG pipeline, without requiring ground-truth 3D annotations. Our approach builds upon a two-stage design. First, we generate pseudo-3D scene reconstructions from monocular images using multi-view synthesis and point cloud estimation, and construct a spatially enriched instruction-tuning dataset that includes fine-grained spatial predicates (e.g., left of, above, inside). Second, we propose a spatially aware visual-language model, trained with both static scene graph descriptions and dynamic spatial reasoning tasks such as question answering and multi-turn dialogue. To further enhance spatial layout consistency, we incorporate layer-aware clustering and object-level depth anchoring in the scene parsing module. Extensive experiments on the PSG benchmark and a newly curated SpatialSGG dataset demonstrate that SPIN-SGG significantly outperforms previous open-vocabulary SGG methods, with improvements of +1.3% in mR@50 on PSG and +2.7% QA accuracy on SpatialSGG, showcasing its robust and comprehensive spatial reasoning capabilities.; https://doi.org/10.1007/s44443-025-00203-2; Scene graph generation; Spatial integration; Open-vocabulary detection; Vision-language alignment |
| spellingShingle | Nanhao Liang Xiaoyuan Yang Shengyi Wang Yong Liu Yingwei Xia SPIN-SGG: spatial integration for open-vocabulary scene graph generation Journal of King Saud University: Computer and Information Sciences Scene graph generation Spatial integration Open-vocabulary detection Vision-language alignment |
| title | SPIN-SGG: spatial integration for open-vocabulary scene graph generation |
| title_full | SPIN-SGG: spatial integration for open-vocabulary scene graph generation |
| title_fullStr | SPIN-SGG: spatial integration for open-vocabulary scene graph generation |
| title_full_unstemmed | SPIN-SGG: spatial integration for open-vocabulary scene graph generation |
| title_short | SPIN-SGG: spatial integration for open-vocabulary scene graph generation |
| title_sort | spin sgg spatial integration for open vocabulary scene graph generation |
| topic | Scene graph generation Spatial integration Open-vocabulary detection Vision-language alignment |
| url | https://doi.org/10.1007/s44443-025-00203-2 |
| work_keys_str_mv | AT nanhaoliang spinsggspatialintegrationforopenvocabularyscenegraphgeneration AT xiaoyuanyang spinsggspatialintegrationforopenvocabularyscenegraphgeneration AT shengyiwang spinsggspatialintegrationforopenvocabularyscenegraphgeneration AT yongliu spinsggspatialintegrationforopenvocabularyscenegraphgeneration AT yingweixia spinsggspatialintegrationforopenvocabularyscenegraphgeneration |