Statistical analysis of Hindi multi-word expressions using multiple threshold method

Bibliographic Details
Main Authors: Rakhi Joon, Archana Singhal, Nupur Chugh, Gargi Mishra
Format: Article
Language: English
Published: Springer 2025-05-01
Series: Discover Artificial Intelligence
Online Access: https://doi.org/10.1007/s44163-025-00291-z
Description
Summary: Multiword Expression (MWE) extraction is an important part of text processing and is used to find the correct meaning of a text phrase. MWEs are lexical phrases of two or more words that together exhibit a semantic property. They play a vital role in many Natural Language Processing (NLP) applications, such as machine translation, information retrieval, and text processing. Much of the research in this area has focused on the extraction and analysis of MWEs in English and other natural languages, while Hindi MWEs have received little attention from earlier researchers. The proposed work explores the statistical aspects of Hindi MWEs using several statistical measures, and considers the different classes of a functional classification of Hindi MWEs for the statistical analysis and experiments. The paper focuses on evaluating the following measures: Pointwise Mutual Information (PMI), Dice Coefficient (DC), Modified Dice Coefficient (MDC), Lexical Fixedness (LF), Syntactic Fixedness (SF), and Relevance Measure (RM). The MWEs are evaluated on a benchmark dataset collected from Hindi novels written by “Munshi Premchand Ji”. The best statistical measures are also identified for each functional category of Hindi MWEs, and different threshold values are obtained for evaluating the functional categories. The threshold value represents the maximum corpus size that one can select for efficient evaluation of a specific category of Hindi MWEs. The approach is applied to two different datasets to compare and justify the results.
ISSN: 2731-0809
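
Illustration: two of the measures named in the summary, PMI and the Dice Coefficient, have widely used textbook definitions for adjacent word pairs. The sketch below computes both from raw counts. It is not the authors' implementation: the function name `bigram_scores`, the whitespace tokenization, the probability estimates, and the toy sentence are all assumptions made for illustration. The record does not define MDC, LF, SF, or RM, so they are not attempted here.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Return {(w1, w2): (pmi, dice)} for every adjacent word pair.

    Hypothetical helper using the textbook definitions:
      PMI(w1, w2)  = log2( P(w1, w2) / (P(w1) * P(w2)) )
      Dice(w1, w2) = 2 * f(w1, w2) / (f(w1) + f(w2))
    """
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)  # guard against division by zero

    scores = {}
    for (w1, w2), pair_count in bigram_counts.items():
        p_pair = pair_count / n_bigrams
        p_w1 = unigram_counts[w1] / n_unigrams
        p_w2 = unigram_counts[w2] / n_unigrams
        pmi = math.log2(p_pair / (p_w1 * p_w2))
        dice = 2 * pair_count / (unigram_counts[w1] + unigram_counts[w2])
        scores[(w1, w2)] = (pmi, dice)
    return scores


# Toy input (made up, not taken from the paper's dataset).
toy_tokens = "धीरे धीरे चलो धीरे धीरे बोलो".split()
for pair, (pmi, dice) in bigram_scores(toy_tokens).items():
    print(pair, round(pmi, 3), round(dice, 3))
```

Candidate pairs can then be ranked by either score. How the paper's corpus-size thresholds interact with such rankings is not described in this record, so no thresholding step is sketched.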