Multi-word units (and tokenization more generally): a multi-dimensional and largely information-theoretic approach

It has been argued that most of corpus linguistics involves one of four fundamental methods: frequency lists, dispersion, collocation, and concordancing. All these presuppose (if only implicitly) the definition of a unit: the element whose frequency in a corpus, in corpus parts, or around a search w...

Bibliographic Details
Main Author:	Stefan Th. Gries
Format:	Article
Language:	English
Published:	Université Jean Moulin - Lyon 3 2022-03-01
Series:	Lexis: Journal in English Lexicology
Subjects:	corpus linguistics; multi-word units; n-grams; frequency; dispersion; association
Online Access:	https://journals.openedition.org/lexis/6231

Similar Items

Blending creativity and productivity: on the issue of delimiting the boundaries of blends as a type of word formation
by: Natalia Beliaeva
Published: (2019-12-01)

Advancing Arabic Word Embeddings: A Multi-Corpora Approach with Optimized Hyperparameters and Custom Evaluation
by: Azzah Allahim, et al.
Published: (2024-11-01)

Big data mining and comparative analyses across lexica on the relationship between syllable complexity and word stress
by: Amanda Post da Silveira
Published: (2023-12-01)

The Use of Corpora in Word Formation Research
by: Pius ten Hacken, et al.
Published: (2014-01-01)

SBU-WSD-Corpus: A Sense Annotated Corpus for Persian All-words Word Sense Disambiguation
by: Hossein Rouhizadeh, et al.
Published: (2022-07-01)

Meaning Extensions of Grasp: A Corpus-Based Study
by: Marie Nordlund
Published: (2010-04-01)

UnifiedCut: A Simple and Efficient Neural Model for Thai, Burmese and Khmer Word Segmentation
by: Yonghua Wen, et al.
Published: (2024-12-01)

When more is less: the impact of multimorphemic words on learning word meaning
by: Niveen Omar, et al.
Published: (2024-12-01)

Towards Kyrgyz stop words
by: Ruslan Isaev, et al.
Published: (2023-12-01)

Linguistic and sociological perspective on the perception of profanity in Moscow in 2024
by: Ekaterina R. Dobrushina
Published: (2024-01-01)

A FIRST APPROACH TO THE LEXICAL PROFILE OF TELECOMMUNICATION ENGLISH: FREQUENCY, DISTRIBUTION, RESTRICTION AND KEYNESS
by: Camino Rea Rizzo
Published: (2009-10-01)

Manipulative Influence in the Spanish Political Discourse: Words Stereotypes, Words Symbols and Words Slogans
by: K. V. Kucherenko
Published: (2012-10-01)

Un objet lexicographique non identifié : le dictionnaire usuel des bionymes
by: Michèle Debrenne
Published: (2016-06-01)

Frequency effects in acquisition and processing of complex verbal constructions in Romance: Experimental and corpus perspectives on Spanish as well as potential implications for foreign language acquisition
by: Birgit Füreder
Published: (2025-06-01)

Combinability and Stability Analysis of Lexical Units by Statistical Methods (Exemplified by the Verb Take)
by: Marina S. Matytcina, et al.
Published: (2024-09-01)

Where do new words like boobage, flamage, ownage come from? Tracking the history of ‑age words from 1100 to 2000 in the OED3
by: Chris A. Smith
Published: (2018-12-01)

INFORMATION TECHNOLOGIES IN OPTIMIZING SCIENTIFIC RESEARCH IN THE SPHERE OF THEORETICAL AND APPLIED LINGUISTICS IN THE DIGITAL AGE
by: M. V. Kamensky
Published: (2022-02-01)

The microbial landscape action on development destructive erysipelas forms
by: L. A. Vasilevskaya
Published: (2019-06-01)

Marking the beginning and end of the Tatar word: The system of vowels
by: A.M. Galieva, et al.
Published: (2022-10-01)

Abbreviation and its Place in the Word-Formation System of Contemporary English Language
by: Hurskaya Volha
Published: (2024-12-01)

The Weight of Words
by: Laurent Metzger
Published: (2017-07-01)

Synchronic and diachronic analysis of the word families of motion verbs (using the root morphemes -khod-, -id-, and -shed- as an example)
by: O.I. Dmitrieva, et al.
Published: (2017-10-01)

Modern Trends in Chinese Language: Integration of Letter Words into the Language System
by: A.R. Alikberova, et al.
Published: (2016-12-01)

The Subtitling Strategies Adopted to Render the Four-Letter Words 'Fuck, Shit' and Their Variants in the French Version of 'Orange Is the New Black': A Corpus-Based Study
by: Eponine Moreau
Published: (2024-12-01)

Tracing the scope of fear in corpus: similarities and differences in cross-domain/genre texts
by: Ignacio Rodríguez Sánchez, et al.
Published: (2024-12-01)

If-conditional sentences across Asian Englishes
by: Ariel Robert Ponce, et al.
Published: (2020-12-01)

Learning novel words for motion by speakers of structurally different languages
by: Irmak Su Tütüncü, et al.
Published: (2024-12-01)

Non-Usual Word Formation in Poetic Discourse at the Turn of the 20th–21st Centuries
by: Irina V. Erofeeva, et al.
Published: (2024-09-01)

Language's Units in contemporary Poetry
by: محمد خسروی شکیب
Published: (2010-12-01)

Coregistration of eye movements and EEG reveals frequency effects of words and their constituent characters in natural silent Chinese reading
by: Taishen Zeng, et al.
Published: (2025-01-01)

lol thats how reddit talks ;) : le site américain Reddit comme espace de variation de l’anglais. Étude de corpus intersectionnelle et quantitative d’usages non standard, au prisme du genre, de l’âge et de l’ethnicité
by: Marie Flesch
Published: (2020-12-01)

Corpus based study of verbs explain and clarify as an example of assistance in pedagogical settings
by: Séguin Maja
Published: (2020-12-01)