Natural Language Processing of a Low Resource Language (Igbo, an African Language)

Submission information

Submission Number: 156

Submission ID: 1744

Submission UUID: ed99afdf-d8fb-4166-aa1c-be2ff5a75ac8

Submission URI: /form/project

Created: Mon, 10/31/2022 - 20:56

Completed: Mon, 10/31/2022 - 20:56

Changed: Mon, 06/03/2024 - 13:05

Remote IP address: 71.200.179.51

Submitted by: Stanley Nwoji

Language: English

Is draft: No

Webform: Project

Project Title Natural Language Processing of a Low Resource Language (Igbo, an African Language)

Program CAREERS

Project Image {Empty}

Tags {Empty}

Status Complete

Project Leader

Project Leader Stanley Nwoji

Email SNwoji@harrisburgu.edu

Mobile Phone +1443.941.4064

Work Phone +1717.901.5152

Project Personnel

Mentor(s) Iheb Abdellatif

Student-facilitator(s) Atajan Abdyyev

Mentee(s) {Empty}

Project Information

Project Description Only about 20 languages fall into the high-resource category, yet most natural language processing (NLP) advances have been made in these languages, leaving out thousands of low-resource languages spoken by millions of people around the world. This is not only a technological problem; it is also a question of equity. One cause of the gap is the lack of corpora and other linguistic resources for low-resource languages. This study seeks to help fill that gap by creating a corpus of Igbo, an African language. We will apply machine learning and deep learning NLP techniques to analyze the corpus. The outcome of this project could be Igbo-language applications such as text categorization, information extraction, summarization, dialogue systems, and machine translation. We have already started building the Igbo_News corpus with Sketch Engine.

Project Information Subsection

Project Deliverables Phase 1:
Development of the Igbo Corpora from News content (~4 Weeks)
Cleaning of the Corpora (~3 Weeks)
Statistical Analysis of Corpora (~4 Weeks)

Phase 2:
Text Categorization using the Corpora (~3 Weeks)
Information Extraction on the Corpora (~3 Weeks)
Machine Translation using the Corpora (~4 Weeks)

Project Deliverables {Empty}

Student Research Computing Facilitator Profile The student facilitator should possess the following:
1. High emotional intelligence to work with other students, the mentor, and the PI
2. Experience with the Python language
3. Experience/Interest in NLP techniques and analyses
4. Good writing and communication skills.

Mentee Research Computing Profile {Empty}

Student Facilitator Programming Skill Level Practical applications

Mentee Programming Skill Level {Empty}

Project Institution {Empty}

Project Address {Empty}

Anchor Institution CR-Penn State

Preferred Start Date {Empty}

Start as soon as possible. No

Project Urgency 3 (on a scale from "Already behind" to "Start date is flexible")

Expected Project Duration (in months) 6

Launch Presentation

NLP Of Low Resource Language.pdf (419.76 KB)

Launch Presentation Date 09/13/2023

Wrap Presentation

Igbo NLP of Careers.project.wrap-1.pdf (854.63 KB)

Wrap Presentation Date 05/08/2024

Project Milestones

Milestone Title: Corpora Development
Milestone Description: Corpora Development. The lack of annotated corpora is one of the fundamental reasons why natural language processing of low-resource languages, such as Igbo, is a daunting task. Corpora development is the process of creating and curating a large collection of texts to be used in language research and NLP tasks. The Igbo corpora are a structured and organized set of texts that serves as a representative sample of the language. The steps involved in building the Igbo corpora include collecting data, preprocessing the collected data, annotating the data, and structuring the corpora. The purpose of developing the Igbo corpora is to provide researchers, linguists, and developers with a valuable resource for training and evaluating machine learning models and for studying Igbo language patterns.
Completion Date Goal: 2023-10-11
Actual Completion Date: 2023-10-10
Milestone Title: Data Cleaning
Milestone Description: Data Cleaning. Data cleaning (also called data cleansing or data scrubbing) involves identifying and correcting errors and removing inconsistencies, inaccuracies, and anomalies from the Igbo corpora. The process involves removing duplicate records, handling missing values, correcting inaccuracies, standardizing data formats, validating data integrity, and resolving inconsistencies and contradictions.
Completion Date Goal: 2023-11-01
Actual Completion Date: 2023-10-10
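A few of the cleaning steps named above (handling missing values, standardizing formats, removing duplicates) can be sketched in Python as follows. This is a minimal illustration, not the project's actual pipeline: the function name `clean_documents` and the choice of NFC normalization are assumptions, made because Igbo letters such as "ị", "ọ", and "ụ" can be encoded in more than one way in Unicode.

```python
import re
import unicodedata


def clean_documents(docs):
    """Normalize, de-duplicate, and drop empty documents, preserving order.

    Hypothetical helper for illustration; not part of the original project.
    """
    seen = set()
    cleaned = []
    for doc in docs:
        if doc is None:  # handle missing values
            continue
        # NFC puts letters such as "ọ" into one consistent encoding.
        text = unicodedata.normalize("NFC", doc)
        # Collapse runs of whitespace (including newlines) to one space.
        text = re.sub(r"\s+", " ", text).strip()
        if not text:  # drop records that are empty after cleaning
            continue
        key = text.casefold()  # case-insensitive duplicate detection
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(text)
    return cleaned
```

Validating data integrity and resolving contradictions depend on the corpus schema and are not covered by this sketch.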
Milestone Title: Statistical Analysis
Milestone Description: Statistical Analysis. Statistical analysis in NLP of the Igbo language involves applying statistical methods and techniques to analyze and understand Igbo textual data (the Igbo corpora). The focus here is to extract meaningful insights, patterns, and relationships from the Igbo corpora using quantitative measures and statistical models. The tasks involved include frequency analysis, sentiment analysis, text classification, named entity recognition, topic modeling, language modeling, and co-occurrence analysis. Statistical analysis also involves training and evaluating models on annotated corpora with machine learning algorithms such as decision trees, support vector machines, and neural networks.
Completion Date Goal: 2023-12-06
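Two of the tasks listed above, frequency analysis and co-occurrence analysis, can be illustrated with the Python standard library. The sketch below is an assumption-laden example rather than the analysis actually run on the corpora: the tokenizer, the function names, and the window size are all hypothetical. The token pattern keeps combining diacritics (U+0300 to U+036F) attached to their base letters, since Igbo tone marks are sometimes encoded that way.

```python
import re
from collections import Counter

# Letters plus combining marks; an illustrative tokenizer, not a linguistic standard.
TOKEN_RE = re.compile(r"(?:[^\W\d_]|[\u0300-\u036f])+")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


def word_frequencies(docs):
    """Count how often each token occurs across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts


def cooccurrences(docs, window=2):
    """Count unordered pairs of distinct tokens at most `window` positions apart."""
    pairs = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        for i, left in enumerate(tokens):
            for right in tokens[i + 1 : i + 1 + window]:
                if left != right:
                    pairs[tuple(sorted((left, right)))] += 1
    return pairs
```

Calling `word_frequencies(docs).most_common(20)` on a list of document strings would list the twenty most frequent tokens, which is a common first look at a new corpus.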
Milestone Title: Text Categorization
Milestone Description: Text Categorization. In text categorization, we assign predefined categories or labels to text documents based on their content to enable efficient organization, retrieval, and analysis. Doing this with the Igbo corpora we created involves data preprocessing, feature extraction, training data preparation, model training, model evaluation, and prediction.
Completion Date Goal: 2023-12-22
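The sequence described above (preprocessing, feature extraction, training, evaluation, prediction) can be sketched with a small multinomial Naive Bayes classifier. Naive Bayes is chosen here only because it fits in a few lines of standard-library Python; the project description does not say which classifier was used, and the class and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Multinomial Naive Bayes over bag-of-words counts with add-one smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # token counts per class
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            tokens = text.lower().split()          # simple whitespace preprocessing
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                if token not in self.vocabulary:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, documents, labels):
    """Fraction of documents whose predicted label matches the true label."""
    correct = sum(model.predict(d) == y for d, y in zip(documents, labels))
    return correct / len(labels)
```

In practice the model would be fitted on one part of the labeled corpus and `accuracy` computed on a held-out part, so that the score reflects documents the model has not seen.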
Milestone Title: Information Extraction
Milestone Description: Information Extraction. The focus in information extraction (IE) is to automatically extract structured information or knowledge from unstructured or semi-structured text data (such as the Igbo corpora). Our goal in IE is to identify specific types of information, such as entities, relationships, events, and attributes, and transform them into a more structured, machine-readable format. The tasks involved include named entity recognition (NER), relationship extraction, event extraction, attribute extraction, and template filling. To accomplish information extraction using the Igbo corpora, we hope to use one of these approaches: rule-based methods, statistical and machine learning methods, or a hybrid of the two.
Completion Date Goal: 2024-01-31
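Of the approaches mentioned above, the rule-based one is the simplest to illustrate. The sketch below performs named entity recognition with a lookup list (gazetteer) and one regular expression for ISO-style dates. The gazetteer entries, label names, and function name are assumptions made for illustration; a real gazetteer would be curated from the corpora.

```python
import re

# Hypothetical gazetteer for illustration only.
GAZETTEER = {
    "Enugu": "LOCATION",
    "Owerri": "LOCATION",
    "Chinua Achebe": "PERSON",
}

# Matches dates written as YYYY-MM-DD; other date formats need further rules.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_entities(text, gazetteer=GAZETTEER):
    """Return (start, end, surface_text, label) tuples sorted by position."""
    entities = []
    for name, label in gazetteer.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            entities.append((match.start(), match.end(), match.group(), label))
    for match in DATE_RE.finditer(text):
        entities.append((match.start(), match.end(), match.group(), "DATE"))
    return sorted(entities)
```

Each returned tuple is already in a structured, machine-readable form, so the output can feed template filling or serve as seed annotations for a statistical model in a hybrid setup.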
Milestone Title: Machine Translation
Milestone Description: Machine Translation. Machine translation (MT) involves the automated translation of text or speech from one language to another using computational methods and machine learning techniques. We want to take input text from high-resource languages, which we call sources, and generate accurate translations in the Igbo language. This process will involve preprocessing our corpora, statistical machine translation, neural machine translation, and postprocessing to refine the output.
Completion Date Goal: 2024-03-13

Github Contributions {Empty}

Planned Portal Contributions (if any) {Empty}

Planned Publications (if any) {Empty}

What will the student learn? {Empty}

What will the mentee learn? {Empty}

What will the Cyberteam program learn from this project? {Empty}

HPC resources needed to complete this project? {Empty}

Notes {Empty}

Final Report

What is the impact on the development of the principal discipline(s) of the project? {Empty}

What is the impact on other disciplines? {Empty}

Is there an impact on physical resources that form infrastructure? {Empty}

Is there an impact on the development of human resources for research computing? {Empty}

Is there an impact on institutional resources that form infrastructure? {Empty}

Is there an impact on information resources that form infrastructure? {Empty}

Is there an impact on technology transfer? {Empty}

Is there an impact on society beyond science and technology? {Empty}

Lessons Learned {Empty}

Overall results {Empty}