Corpus linguistics and signed languages: no lemmata, no corpus

Author: JOHNSTON, Trevor
Year: 2008
Publisher: Paris: ELDA, 2008
License type: Copyright
Format: Digital


Linguistics, Linguistics » Sign language transcription systems


A fundamental problem in the creation of signed language corpora is lemmatisation. Lemmatisation, the classification or identification of related word forms under a single label or lemma (the equivalent of headwords or headsigns in a dictionary), is central to the process of corpus creation. The reason is that signed language corpora, as with all modern linguistic corpora, need to be machine-readable, and this means that sign annotations should not only be informed by linguistic theory but also that tags appended to these annotations should be used consistently and systematically. In addition, a corpus must also be well documented (i.e., with accurate and relevant metadata) and representative of the language community (i.e., of relevant registers and sociolinguistic variation). All this requires dedicated technology (e.g., ELAN), standards and protocols (e.g., IMDI metadata descriptors), and transparent and agreed grammatical tags (e.g., grammatical class labels). However, it also requires the identification of lemmata, and this presupposes the unique identification of sign forms. In other words, a successful corpus project presupposes the availability of a reference dictionary or lexical database to facilitate lemma identification and consistency in lemmatisation. Without lemmatisation, a collection of recordings with various related appended annotation files cannot be used as a true linguistic corpus, as the counting, sorting, tagging, etc. of types and tokens is rendered virtually impossible.
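
The dependence of type and token counts on lemmatisation can be illustrated with a minimal Python sketch. It is not taken from the paper: the lexicon, the gloss spellings, and the function name `lemmatise` are all hypothetical, and the lexical database is reduced to a plain dictionary that maps each annotated sign form to one lemma label.

```python
from collections import Counter

# Hypothetical lexical database (invented entries): each annotated sign form
# maps to exactly one lemma label, as a reference dictionary would provide.
LEXICON = {
    "HOUSE": "HOUSE",
    "HOUSE-b": "HOUSE",      # assumed variant form of the same sign
    "GIVE": "GIVE",
    "GIVE-to-me": "GIVE",    # assumed modified form of the same sign
}


def lemmatise(annotations, lexicon):
    """Replace each annotation with its lemma; fail loudly on unknown forms."""
    lemmas = []
    for form in annotations:
        if form not in lexicon:
            raise KeyError(f"no lemma recorded for sign form {form!r}")
        lemmas.append(lexicon[form])
    return lemmas


# Invented annotation tier: five sign tokens as an annotator might gloss them.
annotations = ["HOUSE", "HOUSE-b", "GIVE", "GIVE-to-me", "HOUSE"]

raw_counts = Counter(annotations)                         # counts per surface gloss
lemma_counts = Counter(lemmatise(annotations, LEXICON))   # counts per lemma

print(len(raw_counts), "types before lemmatisation")
print(len(lemma_counts), "types after lemmatisation")
```

In this toy data the five tokens fall under four distinct glosses but only two lemmata, so any frequency count made on the raw glosses would split the tokens of one sign across several unrelated labels.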

In: Crasborn, Onno; Hanke, Thomas; Efthimiou, Eleni; Zwitserlood, Inge and Thoutenhoofd, Ernst (eds.) Construction and Exploitation of Sign Language Corpora. 3rd Workshop on the Representation and Processing of Sign Languages, pp. 82-87.