Corpus CesCa: Compiling a corpus of written Catalan produced by school children.
Year
2012
Place of publication
International Journal of Corpus Linguistics, 17:3, 428-441
ISBN
ISSN 1384-6655
This paper outlines the compilation of a corpus of Catalan written production. The CesCa corpus presents a picture of the Catalan written language throughout compulsory schooling. It contains two kinds of data: vocabularies of five semantic fields comprising 242,404 lexical forms, and textual data of four different discourse genres consisting of 207,028 tokens. Both the vocabularies and the textual data have been morphologically analyzed and lemmatized. The corpus is freely available. This paper will outline the main features of the corpus and make some suggestions as to the uses to which the corpus can be put.
Thursday, 6 July 2017 - 11:26