Statistical analysis of Hindi multi-word expressions using multiple threshold method

Abstract Multiword Expressions (MWEs) extraction is one of the important aspects of text processing, which is used to find the correct meaning of a text phrase. MWEs are lexical phrases consisting of two or more words exhibiting semantic property. MWEs play a vital role in many Natural Language Proc...

Bibliographic Details
Main Authors: Rakhi Joon, Archana Singhal, Nupur Chugh, Gargi Mishra
Format: Article
Language: English
Published: Springer 2025-05-01
Series: Discover Artificial Intelligence
Subjects:
Online Access: https://doi.org/10.1007/s44163-025-00291-z
_version_ 1849312003885105152
author Rakhi Joon
Archana Singhal
Nupur Chugh
Gargi Mishra
author_facet Rakhi Joon
Archana Singhal
Nupur Chugh
Gargi Mishra
author_sort Rakhi Joon
collection DOAJ
description Abstract: Multiword expression (MWE) extraction is an important part of text processing and is used to find the correct meaning of a text phrase. MWEs are lexical phrases of two or more words that exhibit semantic properties. MWEs play a vital role in many Natural Language Processing (NLP) applications, such as machine translation, information retrieval, and text processing. Much of the research in this area has focused on the extraction and analysis of MWEs in English and other natural languages; MWEs in Hindi have received little attention from earlier researchers. The proposed work explores the statistical aspects of Hindi MWEs using several statistical measures. Different classes of the functional classification of Hindi MWEs are considered for the statistical analysis and experiments. The paper mainly evaluates the following statistical measures: Pointwise Mutual Information (PMI), Dice Coefficient (DC), Modified Dice Coefficient (MDC), Lexical Fixedness (LF), Syntactic Fixedness (SF), and Relevance Measure (RM). The dataset used for the evaluation of MWEs is a benchmark dataset collected from Hindi novels written by “Munshi Premchand Ji”. The best statistical measures are also identified for each functional category of Hindi MWEs. Different threshold values have been obtained for the evaluation of the functional categories. Each threshold value represents the maximum corpus size that one can select for efficient evaluation of a specific category of Hindi MWEs. The approach has been applied to two different datasets to compare and justify the obtained results.
format Article
id doaj-art-2ec4377f987c49eda0a9cc473cadb53d
institution Kabale University
issn 2731-0809
language English
publishDate 2025-05-01
publisher Springer
record_format Article
series Discover Artificial Intelligence
spelling doaj-art-2ec4377f987c49eda0a9cc473cadb53d
2025-08-20T03:53:13Z
eng
Springer
Discover Artificial Intelligence
2731-0809
2025-05-01
51121
10.1007/s44163-025-00291-z
Statistical analysis of Hindi multi-word expressions using multiple threshold method
Rakhi Joon (0), Department of Computer Science and Engineering, Bharati Vidyapeeth’s College of Engineering
Archana Singhal (1), Department of Computer Science, IP College for Women
Nupur Chugh (2), Department of Computer Science and Engineering, Bharati Vidyapeeth’s College of Engineering
Gargi Mishra (3), Department of Computer Science and Engineering, Bharati Vidyapeeth’s College of Engineering
https://doi.org/10.1007/s44163-025-00291-z
Multi-word expressions
Hindi multi-words
Multiple thresholds
Statistical measures
spellingShingle Rakhi Joon
Archana Singhal
Nupur Chugh
Gargi Mishra
Statistical analysis of Hindi multi-word expressions using multiple threshold method
Discover Artificial Intelligence
Multi-word expressions
Hindi multi-words
Multiple thresholds
Statistical measures
title Statistical analysis of Hindi multi-word expressions using multiple threshold method
title_full Statistical analysis of Hindi multi-word expressions using multiple threshold method
title_fullStr Statistical analysis of Hindi multi-word expressions using multiple threshold method
title_full_unstemmed Statistical analysis of Hindi multi-word expressions using multiple threshold method
title_short Statistical analysis of Hindi multi-word expressions using multiple threshold method
title_sort statistical analysis of hindi multi word expressions using multiple threshold method
topic Multi-word expressions
Hindi multi-words
Multiple thresholds
Statistical measures
url https://doi.org/10.1007/s44163-025-00291-z
work_keys_str_mv AT rakhijoon statisticalanalysisofhindimultiwordexpressionsusingmultiplethresholdmethod
AT archanasinghal statisticalanalysisofhindimultiwordexpressionsusingmultiplethresholdmethod
AT nupurchugh statisticalanalysisofhindimultiwordexpressionsusingmultiplethresholdmethod
AT gargimishra statisticalanalysisofhindimultiwordexpressionsusingmultiplethresholdmethod
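
The description above names Pointwise Mutual Information (PMI) and the Dice Coefficient (DC) among the measures evaluated. As a rough illustration only, the sketch below computes the standard textbook forms of both for adjacent word pairs. It is not the authors' implementation: the function name `bigram_scores`, the probability estimates, and the romanized toy sentence are assumptions made for this example, and the paper's Modified Dice Coefficient, fixedness measures, Relevance Measure, and threshold method are not reproduced here.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Return {(w1, w2): (pmi, dice)} for every adjacent word pair.

    Hypothetical helper for illustration; uses the standard definitions
    PMI = log2(P(w1, w2) / (P(w1) * P(w2))) and
    Dice = 2 * f(w1, w2) / (f(w1) + f(w2)).
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(n_tokens - 1, 1)  # avoid division by zero on tiny input

    scores = {}
    for (w1, w2), f12 in bigrams.items():
        p12 = f12 / n_bigrams            # relative frequency of the pair
        p1 = unigrams[w1] / n_tokens     # relative frequency of each word
        p2 = unigrams[w2] / n_tokens
        pmi = math.log2(p12 / (p1 * p2))
        dice = 2 * f12 / (unigrams[w1] + unigrams[w2])
        scores[(w1, w2)] = (pmi, dice)
    return scores


# Toy romanized token list, invented for this example (not from the dataset).
tokens = ["vah", "dhire", "dhire", "chala", "aur", "dhire", "dhire", "ruka"]
scores = bigram_scores(tokens)

# Print candidate pairs from highest to lowest Dice score.
for pair, (pmi, dice) in sorted(scores.items(), key=lambda kv: -kv[1][1]):
    print(pair, round(pmi, 3), round(dice, 3))
```

On a real corpus, the counts would come from the full tokenized text, and low-frequency pairs would usually be filtered out first, since PMI tends to overrate rare pairs.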