Sentence length statistics for learner translator corpus, reference corpus and professional corpus

                          Learner    RNC         Professional
Corpus size (in tokens)   612,839    2,691,142   154,484
No.
Comparisons of the normalized token counts against the corpus size
still suggest that, as a whole, test-takers of higher proficiency levels used more lexical bundle tokens.
The normalization was obtained through this formula: NF = AF/CS; that is, the normalized frequency (NF) is equal to the absolute frequency (AF) divided by the corpus size (CS).
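The formula above can be sketched in a few lines of code. This is a minimal illustration, not the authors' implementation: the scaling to a per-million-words base is a common convention added here for readability, and the hit counts for the three corpora are hypothetical.

```python
def normalized_frequency(absolute_freq, corpus_size, per=1_000_000):
    """NF = AF / CS, here scaled to a common base (per million words)."""
    return absolute_freq / corpus_size * per

# Corpus sizes in tokens, taken from the sentence-length table above.
corpora = {"Learner": 612_839, "RNC": 2_691_142, "Professional": 154_484}

# Hypothetical absolute frequencies of one lexical bundle in each corpus.
hits = {"Learner": 120, "RNC": 310, "Professional": 25}

for name, cs in corpora.items():
    nf = normalized_frequency(hits[name], cs)
    print(f"{name}: {nf:.1f} per million words")
```

Scaling both raw counts to the same base is what makes frequencies from corpora of very different sizes (612,839 vs. 2,691,142 tokens) directly comparable.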
Corpus size and composition: Evidence from the inflectional morphology of nouns in Old English and Old Frisian.
Notably, the performance of all free systems degrades when the corpus size
exceeds 1M words.
At the same time, the gains due to corpus size
level off at 30-50 million words.
This result shows that the larger the corpus size,
the better the acoustic model.
Corpus size is given by the number of tokens in every corpus, that is, by the total of running words.
Keywords: lexical unit, lexical unit identification, token/type ratio, Dice score, corpus size, average minimum law
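Two of the measures named in the keywords can be sketched directly. This is a generic illustration with made-up counts, assuming the standard definitions: the type/token ratio as distinct word types over total tokens, and the Dice coefficient for a word pair as twice its co-occurrence frequency over the sum of the individual frequencies.

```python
def type_token_ratio(tokens):
    """TTR: number of distinct word types divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)

def dice_score(pair_freq, freq_a, freq_b):
    """Dice coefficient for a word pair: 2 * f(a,b) / (f(a) + f(b))."""
    return 2 * pair_freq / (freq_a + freq_b)

tokens = "the corpus size of the learner corpus".split()
print(round(type_token_ratio(tokens), 3))  # 5 types over 7 tokens
print(dice_score(10, 20, 30))              # hypothetical counts -> 0.4
```

Note that TTR is itself sensitive to corpus size (longer texts repeat more types), which is one reason corpus size matters for lexical-unit identification.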
Corpus size is obviously a matter of considerable discussion; it is not the point of this particular paper but the subject of further research.
A corpus, following Huizhog, should include the highest possible number of entries in order to obtain reliable results, although some researchers have lately pointed out that corpus size
is not so important, depending on the research goals (Krausse, 2005).
This can, on the one hand, be attributed to the different corpus sizes,
with the reference corpus being significantly larger than EnCon.