Framework for building linguistic corpora for a large language model project for the Heritage Nubian Language of Kenya
DOI:
https://doi.org/10.57040/tvpzzk79
Keywords:
Annotation, Framework, Large Language Model, Linguistic corpora, Parts-of-speech labeling, Web scrapings
Abstract
Documenting and preserving low-resource languages is an uphill task. Language technologies offer a way out for these beleaguered languages. However, these technologies depend on high-quality linguistic corpora, which are absent for understudied and under-resourced languages such as Kisii Town Nubian. This study aims to develop a framework for constructing linguistic corpora for the Kisii Town Heritage Nubian language that can be used to develop a large language model (LLM) and other language technologies. The objectives are to: 1) develop a framework for data collection and metadata labeling, and 2) identify the main tenets to be considered in developing the language technologies. Data collection will draw on local community knowledge experts and opinion leaders at Nyanchwa, where the Nubians reside. The process will also draw on linguists' knowledge of the language's terrain, with the necessary permissions and consents sought along the way. Diverse data will be assembled from written texts, recorded audio, web scrapings, and word lists to give a comprehensive view of the language. This will be followed by data processing and annotation. The processed data will be annotated for linguistic features such as phonology, morphology, syntax, semantics, and parts-of-speech labeling. The annotated data will then be structured into selective linguistic corpora under robust quality control guidelines. The deliverables of the project will be linguistic corpora for various domains of the Nubian language, the development of language technologies, and comprehensive documentation of the linguistic corpora. The results of this project will be consequential for language documentation and for technological support of this endangered language.
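As an illustration only, the collection, metadata labeling, and quality control steps described above might be sketched as a single corpus record plus a validation routine. Everything named here is an assumption made for the sketch and is not taken from the study: the class CorpusEntry, its field names, the function quality_issues, and the use of the Universal POS tag set. The tokens are placeholders, not real Nubian data.

```python
from dataclasses import dataclass, field

# Universal POS (UPOS) tags. Using this tag set is an assumption of the
# sketch; the study does not say which tag set it adopts.
UPOS_TAGS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}

# The four source types listed in the abstract (label spellings are hypothetical).
SOURCE_TYPES = {"written_text", "recorded_audio", "web_scraping", "word_list"}


@dataclass
class CorpusEntry:
    """One annotated unit of the corpus (hypothetical schema)."""
    entry_id: str
    source_type: str          # one of SOURCE_TYPES
    domain: str               # e.g. "folklore", "daily conversation"
    consent_obtained: bool    # permission recorded at collection time
    tokens: list = field(default_factory=list)    # list of str
    pos_tags: list = field(default_factory=list)  # list of str, one per token


def quality_issues(entry):
    """Return a list of human-readable problems; an empty list means the entry passes."""
    issues = []
    if entry.source_type not in SOURCE_TYPES:
        issues.append("unknown source type")
    if not entry.consent_obtained:
        issues.append("missing consent")
    if len(entry.tokens) != len(entry.pos_tags):
        issues.append("token/tag length mismatch")
    bad_tags = [tag for tag in entry.pos_tags if tag not in UPOS_TAGS]
    if bad_tags:
        issues.append("invalid POS tag: " + ", ".join(bad_tags))
    return issues


# Placeholder tokens stand in for real language data.
example = CorpusEntry(
    entry_id="demo-0001",
    source_type="word_list",
    domain="demo",
    consent_obtained=True,
    tokens=["tok1", "tok2"],
    pos_tags=["NOUN", "VERB"],
)
print(quality_issues(example))
```

Keeping consent and source type on every record, rather than in a separate log, is one possible design choice: it lets the quality check reject an entry before it reaches any training set.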
License
Copyright (c) 2024 Peter Nyansera Otieno
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.