Size of corpora and collocations: The case of Russian
With the arrival of information technologies in linguistics, compiling a large corpus, and a corpus of web texts in particular, has become a purely technical matter. These new opportunities have revived the question of corpus size, which can be formulated as follows: are larger corpora b...
| Main Authors: | Maria Khokhlova, Vladimir Benko |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | University of Ljubljana Press (Založba Univerze v Ljubljani), 2020-08-01 |
| Series: | Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave |
| Subjects: | Collocations; Russian corpora; corpus size; corpus linguistics; statistical measures |
| Online Access: | https://journals.uni-lj.si/slovenscina2/article/view/9153 |
| _version_ | 1849470313354493952 |
|---|---|
| author | Maria Khokhlova; Vladimir Benko |
| author_facet | Maria Khokhlova; Vladimir Benko |
| author_sort | Maria Khokhlova |
| collection | DOAJ |
| description | With the arrival of information technologies in linguistics, compiling a large corpus, and a corpus of web texts in particular, has become a purely technical matter. These new opportunities have revived the question of corpus size, which can be formulated as follows: are larger corpora better for linguistic research or, more precisely, do lexicographers need to analyze larger numbers of collocations? The paper reports experiments on collocation identification in low-frequency lexis using corpora of different sizes (1 million, 10 million, 100 million and 1.2 billion words). We selected low-frequency adjectives, nouns and verbs from the Russian Frequency Dictionary and tested the following hypotheses: 1) collocations in low-frequency lexis are better represented in larger corpora; 2) frequent collocations presented in dictionaries have low occurrence counts in small corpora; 3) statistical measures for collocation extraction behave differently in corpora of different sizes. The results show that corpora of under 100 million words are not representative enough for studying collocations, especially those with nouns and verbs. MI and Dice tend to extract less reliable collocations as corpus size grows, whereas t-score and Fisher’s exact test give better results for larger corpora. |
| format | Article |
| id | doaj-art-a00dcaaa77d24bcbae6ebbcf65d35ab8 |
| institution | Kabale University |
| issn | 2335-2736 |
| language | English |
| publishDate | 2020-08-01 |
| publisher | University of Ljubljana Press (Založba Univerze v Ljubljani) |
| record_format | Article |
| series | Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave |
| spelling | doaj-art-a00dcaaa77d24bcbae6ebbcf65d35ab8; 2025-08-20T03:25:11Z; eng; University of Ljubljana Press (Založba Univerze v Ljubljani); Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave; ISSN 2335-2736; 2020-08-01; vol. 8, no. 2; DOI 10.4312/slo2.0.2020.2.58-77; Size of corpora and collocations: The case of Russian; Maria Khokhlova (St Petersburg State University, Russia); Vladimir Benko (Slovak Academy of Sciences, Bratislava, Slovakia); https://journals.uni-lj.si/slovenscina2/article/view/9153; Collocations; Russian corpora; corpus size; corpus linguistics; statistical measures |
| spellingShingle | Maria Khokhlova; Vladimir Benko; Size of corpora and collocations: The case of Russian; Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave; Collocations; Russian corpora; corpus size; corpus linguistics; statistical measures |
| title | Size of corpora and collocations: The case of Russian |
| title_full | Size of corpora and collocations: The case of Russian |
| title_fullStr | Size of corpora and collocations: The case of Russian |
| title_full_unstemmed | Size of corpora and collocations: The case of Russian |
| title_short | Size of corpora and collocations: The case of Russian |
| title_sort | size of corpora and collocations the case of russian |
| topic | Collocations; Russian corpora; corpus size; corpus linguistics; statistical measures |
| url | https://journals.uni-lj.si/slovenscina2/article/view/9153 |
| work_keys_str_mv | AT mariakhokhlova sizeofcorporaandcollocationsthecaseofrussian AT vladimirbenko sizeofcorporaandcollocationsthecaseofrussian |
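
The description names four association measures (MI, Dice, t-score and Fisher’s exact test) without giving formulas. As a rough illustration only, the sketch below computes them from raw frequencies using common textbook definitions; these definitions, the function name `association_scores`, and the toy counts are assumptions, and the article itself may use different variants.

```python
import math


def association_scores(f_xy, f_x, f_y, n):
    """Score a word pair from raw counts (hypothetical helper, textbook formulas).

    f_xy: co-occurrence frequency of the pair
    f_x, f_y: frequencies of each word on its own
    n: corpus size in the same counting unit
    """
    if f_xy <= 0:
        raise ValueError("f_xy must be positive for MI and t-score")
    # Co-occurrences expected if the two words were independent.
    expected = f_x * f_y / n
    # Pointwise mutual information: log ratio of observed to expected.
    mi = math.log2(f_xy / expected)
    # Dice coefficient: overlap relative to the two marginal frequencies.
    dice = 2 * f_xy / (f_x + f_y)
    # t-score: observed minus expected, scaled by sqrt(observed).
    t_score = (f_xy - expected) / math.sqrt(f_xy)
    # One-sided Fisher's exact test: hypergeometric probability of seeing
    # at least f_xy co-occurrences (exact integers, so only for small counts).
    upper = min(f_x, f_y)
    tail = sum(
        math.comb(f_x, k) * math.comb(n - f_x, f_y - k)
        for k in range(f_xy, upper + 1)
    )
    fisher_p = tail / math.comb(n, f_y)
    return {"mi": mi, "dice": dice, "t_score": t_score, "fisher_p": fisher_p}


# Toy counts, not taken from the article.
scores = association_scores(f_xy=20, f_x=100, f_y=50, n=10_000)
print(scores)
```

Under these definitions MI and Dice are ratios, so they can stay high for pairs seen only a handful of times, whereas the t-score and Fisher’s test respond to the absolute amount of evidence. That may be one reason the measures behave differently as the corpus grows, though the record does not say so.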