Spatial Position Reasoning of Image Entities Based on Location Words

The endeavor of spatial position reasoning effectively simulates the sensory and comprehension faculties of artificial intelligence, especially within the purview of multimodal modeling that fuses imagery with linguistic data. Recent progress in visual image–language models has marked significant ad...

Bibliographic Details
Main Authors:	Xingguo Qin, Ya Zhou, Jun Li
Format:	Article
Language:	English
Published:	MDPI AG 2024-12-01
Series:	Mathematics
Subjects:	visual–spatial reasoning, locative preposition, contrastive learning, image–text retrieval
Online Access:	https://www.mdpi.com/2227-7390/12/24/3940

collection	DOAJ
description	The endeavor of spatial position reasoning effectively simulates the sensory and comprehension faculties of artificial intelligence, especially within the purview of multimodal modeling that fuses imagery with linguistic data. Recent progress in visual image–language models has marked significant advancements in multimodal reasoning tasks. Notably, contrastive learning models based on the Contrastive Language-Image pre-training (CLIP) framework have attracted substantial interest. Predominantly, current contrastive learning models focus on nominal and verbal elements within image descriptions, while spatial locatives receive comparatively less attention. However, prepositional spatial indicators are pivotal for encapsulating the critical positional data between entities within images, which is essential for the reasoning capabilities of image–language models. This paper introduces a spatial location reasoning model that is founded on spatial locative terms. The model concentrates on spatial prepositions within image descriptions, models the locational interrelations between entities in images through these prepositions, evaluates and corroborates the spatial interconnections of entities within images, and harmonizes the precision with image–textual descriptions. This model represents an enhancement of the CLIP model, delving deeply into the semantic characteristics of spatial prepositions and highlighting their directive role in visual language models. Empirical evidence suggests that the proposed model adeptly captures the correlation of spatial indicators in both image and textual representations across open datasets. The incorporation of spatial position terms into the model was observed to elevate the average predictive accuracy by approximately three percentage points.
id	doaj-art-827511a00dfe4615af3e1890b4d06ec6
institution	Kabale University
issn	2227-7390
spelling	doaj-art-827511a00dfe4615af3e1890b4d06ec6 | 2024-12-27T14:38:04Z | eng | MDPI AG | Mathematics | ISSN 2227-7390 | 2024-12-01 | vol. 12, no. 24, art. 3940 | doi:10.3390/math12243940 | Spatial Position Reasoning of Image Entities Based on Location Words | Xingguo Qin, Ya Zhou, Jun Li (all: School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China)
