Revisiting the idea of a representative linguistic corpus

A Dinu, A Vlad
2022 14th International Conference on Communications (COMM), 2022 - ieeexplore.ieee.org
The paper revisits the notion of representativeness of a linguistic corpus obtained through successive concatenations of author corpora. The concept is investigated through statistical probability estimation: multiple author corpora are pieced together, and the consistency of the resulting construction is evaluated against the overall corpus consisting of the full texts of all the authors considered. Essential for this investigation are the concepts of statistical probability estimation and the associated confidence intervals. Furthermore, starting from the new methodology proposed in this paper, we made use of previous results and drew correlations with the notions of minimum statistical independence distance, Zipf's Law priority areas and common words. The experimental results show that when data from the analyzed corpus were collected in an i.i.d. (independent and identically distributed) manner, the analyzed word sets had a very good overlap with the set of Zipf Area 1 words (99%) and with the common words among author corpora (75%). These results can be aligned with the idea of representativeness of a linguistic corpus.
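The abstract does not spell out the estimation procedure, so the following is only a minimal sketch of the two generic ingredients it names: a probability estimate with a confidence interval, and an overlap measure between word sets. The function names `estimate_probability` and `overlap_fraction`, the normal-approximation (Wald) interval, and the definition of overlap as a share of the candidate set are all assumptions made for illustration; they are not taken from the paper.

```python
import math


def estimate_probability(tokens, word, z=1.96):
    """Relative frequency of `word` in `tokens` with a Wald interval.

    Assumes the tokens behave like an i.i.d. sample; z=1.96 corresponds
    to an approximate 95% confidence level. Illustrative only.
    """
    n = len(tokens)
    if n == 0:
        raise ValueError("need at least one token")
    p_hat = sum(1 for t in tokens if t == word) / n
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    # Clip the interval to the valid probability range [0, 1].
    return p_hat, max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


def overlap_fraction(candidate, reference):
    """Share of distinct `candidate` words that also occur in `reference`.

    This definition of overlap is an assumption, not the paper's.
    """
    candidate = set(candidate)
    if not candidate:
        return 0.0
    return len(candidate & set(reference)) / len(candidate)
```

Calling `estimate_probability(tokens, "the")` on a token list returns the point estimate together with the lower and upper interval bounds, and `overlap_fraction(analyzed_words, zipf_area_words)` returns a value between 0 and 1 that could be compared with percentages such as those quoted above. The Wald interval is the simplest choice and becomes unreliable for very rare words, where a Wilson or exact interval would be a more careful option.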