Framework for building linguistic corpora for a large language model project for the Heritage Nubian Language of Kenya

Authors

  • Peter Nyansera Otieno, Languages, Linguistics and Literature, Kisii University, Kenya

DOI:

https://doi.org/10.57040/tvpzzk79

Keywords:

Annotation, Framework, Large Language Model, Linguistic corpora, Parts-of-speech labeling, Web scrapings

Abstract

Low-resource languages face an uphill task in their documentation and preservation. Language technologies offer a way out for these beleaguered languages. However, these technologies depend on high-quality linguistic corpora, which are absent for understudied and under-resourced languages like Kisii Town Nubian. This study aims to develop a framework for constructing linguistic corpora for the Kisii Town Heritage Nubian language that can be used to develop a large language model (LLM) and other language technologies. The objectives are to: 1) develop a framework for data collection and metadata labeling, and 2) identify the main tenets to be considered in developing the language technologies. Data collection will draw on local community knowledge experts and opinion leaders at Nyanchwa, where the Nubians reside. It will also draw on linguists' knowledge of the language's terrain, with the necessary permissions and consents sought throughout. Diverse data will be assembled from written texts, recorded audio, web scrapings, and word lists to give a comprehensive view of the language. This will be followed by data processing and annotation, with the processed data labeled for linguistic features such as phonology, morphology, syntax, and semantics, including parts-of-speech labeling. The annotated data will then be structured into selective linguistic corpora governed by robust quality control guidelines. The deliverables of the project will be linguistic corpora for various domains of the Nubian language, language technologies built on them, and comprehensive documentation of the corpora. The results of this project will be consequential for language documentation and for technological support of this endangered language.
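The collection, metadata labeling, and quality control steps described in the abstract can be illustrated with a minimal sketch. It is not the author's framework: the record fields, the source-type identifiers, the `quality_issues` helper, and the placeholder tokens `w1` and `w2` are all assumptions introduced here for illustration only.

```python
from dataclasses import dataclass, field
from typing import List

# The four data sources named in the abstract; the identifier spellings are assumptions.
SOURCE_TYPES = {"written_text", "recorded_audio", "web_scraping", "word_list"}


@dataclass
class CorpusRecord:
    """One corpus entry with hypothetical metadata fields."""
    record_id: str
    source_type: str          # expected to be one of SOURCE_TYPES
    text: str                 # transcription or written text
    consent_obtained: bool    # whether permission/consent was recorded
    domain: str = "general"   # hypothetical domain label
    tokens: List[str] = field(default_factory=list)
    pos_tags: List[str] = field(default_factory=list)  # one tag per token


def quality_issues(record: CorpusRecord) -> List[str]:
    """Return a list of quality-control problems found in a record (empty if none)."""
    issues = []
    if record.source_type not in SOURCE_TYPES:
        issues.append("unknown source type")
    if not record.consent_obtained:
        issues.append("missing consent")
    if not record.text.strip():
        issues.append("empty text")
    if len(record.tokens) != len(record.pos_tags):
        issues.append("token/tag length mismatch")
    return issues


# Placeholder content only; no real Nubian data is used here.
example = CorpusRecord(
    record_id="r1",
    source_type="word_list",
    text="w1 w2",
    consent_obtained=True,
    tokens=["w1", "w2"],
    pos_tags=["NOUN", "VERB"],
)
print(quality_issues(example))
```

Keeping consent and source type on every record, rather than in a separate log, is one way to let a quality check reject material that lacks the permissions the abstract calls for.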

Author Biography

  • Peter Nyansera Otieno, Languages, Linguistics and Literature, Kisii University, Kenya

    Peter Nyansera Otieno, Ph.D. (circa 1976), is a linguist in the Department of Languages, Linguistics and Literature at Kisii University. He teaches undergraduate and graduate courses in Phonetics, Phonology, Morphology, and Communication Skills. Nyansera is an interdisciplinary scholar with research interests in Acoustic Phonetics, Bantu Sound Systems, Socio-Phonetics, Lexicography, Language Impairment, Natural Language Processing, Linguistic Anthropology, Language Endangerment, and Language Documentation. He is a crusader for the revitalization of Gusii language and culture. He writes poetry and short stories that draw on his rich EkeGusii oral literature background. He serves as patron of the Bobasi Chapter of the AbaGusii Cultural and Development Council and of the Mwanyagetinge Heritage Council.

Published

2024-11-13

Section

Articles
