Natural Language Processing of a Low Resource Language (Igbo, an African Language)

Submission Number: 156
Submission ID: 1744
Submission UUID: ed99afdf-d8fb-4166-aa1c-be2ff5a75ac8
Submission URI: /form/project

Created: Mon, 10/31/2022 - 20:56
Completed: Mon, 10/31/2022 - 20:56
Changed: Mon, 06/03/2024 - 13:05

Submitted by: Stanley Nwoji
Language: English

Is draft: No
Webform: Project
CAREERS
Complete

Project Leader

Stanley Nwoji
+1443.941.4064
+1717.901.5152

Project Personnel

Iheb Abdellatif
Atajan Abdyyev

Project Information

Although only 20 languages fall into the high-resource category, most natural language processing (NLP) advances have been made in those 20 languages, leaving out thousands of low-resource languages spoken by millions of people around the world. This is not only a technological problem; it is also a matter of equity. One cause of the gap is the lack of corpora and other linguistic resources for low-resource languages. This study seeks to help close that gap by creating a corpus of Igbo, an African language. We will apply machine learning and deep learning NLP techniques to analyze the corpus. The project could lead to applications such as text categorization, information extraction, summarization, dialogue systems, and machine translation in the Igbo language. We have already started building the Igbo_News corpus with Sketch Engine.

Project Information Subsection

Phase 1:
Development of the Igbo Corpora from News content (~4 Weeks)
Cleaning of the Corpora (~3 Weeks)
Statistical Analysis of Corpora (~4 Weeks)
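
The cleaning and frequency-analysis steps of Phase 1 could look roughly like the sketch below. It is only an illustration under stated assumptions: the function names `clean_corpus` and `word_frequencies` are hypothetical, the input is assumed to be a list of raw text lines, and the project's actual pipeline (built with Sketch Engine) may work quite differently. Unicode NFC normalization is included because Igbo uses dotted letters such as ị, ọ, and ụ, which can be encoded in more than one way.

```python
import re
import unicodedata
from collections import Counter


def clean_corpus(lines):
    """Normalize, trim, and deduplicate raw text lines (order preserved)."""
    seen = set()
    cleaned = []
    for line in lines:
        # NFC composes a base letter with its combining mark where a
        # precomposed form exists, so equivalent spellings compare equal.
        text = unicodedata.normalize("NFC", line)
        # Collapse runs of whitespace and trim both ends.
        text = re.sub(r"\s+", " ", text).strip()
        if not text:
            continue  # drop empty records
        key = text.casefold()
        if key in seen:
            continue  # drop case-insensitive duplicates
        seen.add(key)
        cleaned.append(text)
    return cleaned


def word_frequencies(lines):
    """Count lowercase word tokens across all lines."""
    counts = Counter()
    for line in lines:
        # \w is Unicode-aware in Python 3, so precomposed letters such as
        # "ị" or "ọ" stay inside a token.
        counts.update(re.findall(r"\w+", line.lower()))
    return counts
```

Calling `word_frequencies(clean_corpus(raw_lines)).most_common(20)` would then list the 20 most frequent tokens. Text that carries tone marks as separate combining characters may need a more careful tokenizer than the simple `\w+` pattern assumed here.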

Phase 2:
Text Categorization using the Corpora (~3 Weeks)
Information Extraction on the Corpora (~3 Weeks)
Machine Translation using the Corpora (~4 Weeks)
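
As a rough illustration of the text categorization step in Phase 2, the sketch below trains a small multinomial Naive Bayes classifier with add-one smoothing using only the Python standard library. Naive Bayes is chosen here purely as a simple baseline; the project description does not say which model will be used. The names `train_naive_bayes`, `predict`, and `TOY_DOCS` are hypothetical, and the toy documents are invented for illustration and are not drawn from the Igbo_News corpus.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set (label names are assumptions).
TOY_DOCS = [
    ("egwuregwu otu", "sports"),
    ("otu egwuregwu mmeri", "sports"),
    ("ego ahia", "business"),
    ("ahia ego ulo", "business"),
]


def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def train_naive_bayes(labeled_docs):
    """Collect per-class document counts, token counts, and the vocabulary."""
    class_docs = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_docs[label] += 1
        tokens = tokenize(text)
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, token_counts, vocab


def predict(model, text):
    """Return the class with the highest log posterior."""
    class_docs, token_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        # Log prior: share of training documents with this label.
        score = math.log(class_docs[label] / total_docs)
        # Add-one (Laplace) smoothing over the training vocabulary.
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # ignore tokens never seen in training
                score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

A model is built with `model = train_naive_bayes(TOY_DOCS)` and queried with `predict(model, "some new text")`. With a real corpus, held-out evaluation data would be needed before trusting any prediction.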
The student facilitator should possess the following:
1. High emotional intelligence to work with other students, the mentor, and the PI
2. Experience with the Python language
3. Experience with or interest in NLP techniques and analyses
4. Good writing and communication skills
Practical applications
CR-Penn State
No
Already behind
3
Start date is flexible
6
09/13/2023
05/08/2024
  • Milestone Title: Corpora Development
    Milestone Description: Corpora Development. The lack of annotated corpora is one of the fundamental reasons why natural language processing of low-resource languages, such as Igbo, is a daunting task. Corpus development is the process of creating and curating a large collection of texts for use in language research and NLP tasks. The Igbo corpora are a structured and organized set of texts that serves as a representative sample of the language. Building them involves data collection, preprocessing of the collected data, annotation, and structuring of the corpora. The aim is to give researchers, linguists, and developers a valuable resource for training and evaluating machine learning models and for studying Igbo language patterns.
    Completion Date Goal: 2023-10-11
    Actual Completion Date: 2023-10-10
  • Milestone Title: Data Cleaning
    Milestone Description: Data Cleaning. Data cleaning (also called data cleansing or data scrubbing) involves identifying and correcting errors and removing inconsistencies, inaccuracies, and anomalies from the Igbo corpora. The process includes removing duplicate records, handling missing values, correcting inaccuracies, standardizing data formats, validating data integrity, and resolving inconsistencies and contradictions.
    Completion Date Goal: 2023-11-01
    Actual Completion Date: 2023-10-10
  • Milestone Title: Statistical Analysis
    Milestone Description: Statistical Analysis. Statistical analysis in NLP of the Igbo language involves applying statistical methods and techniques to analyze and understand Igbo textual data (the Igbo corpora). The focus is on extracting meaningful insights, patterns, and relationships from the corpora using quantitative measures and statistical models. The tasks include frequency analysis, sentiment analysis, text classification, named entity recognition, topic modeling, language modeling, and co-occurrence analysis. Models are trained and evaluated on the annotated corpora with machine learning algorithms such as decision trees, support vector machines, and neural networks.
    Completion Date Goal: 2023-12-06
  • Milestone Title: Text Categorization
    Milestone Description: Text Categorization. In text categorization, we assign predefined categories or labels to text documents based on their content to enable efficient organization, retrieval, and analysis. Doing this with the Igbo corpora we created involves data preprocessing, feature extraction, training data preparation, model training, model evaluation, and prediction.
    Completion Date Goal: 2023-12-22
  • Milestone Title: Information Extraction
    Milestone Description: Information Extraction. The focus of information extraction (IE) is to automatically extract structured information or knowledge from unstructured or semi-structured text data (such as the Igbo corpora). Our goal in IE is to identify specific types of information, such as entities, relationships, events, and attributes, and to transform them into a more structured, machine-readable format. The tasks include named entity recognition (NER), relationship extraction, event extraction, attribute extraction, and template filling. To accomplish information extraction on the Igbo corpora, we expect to use rule-based methods, statistical and machine learning methods, or a hybrid of the two.
    Completion Date Goal: 2024-01-31
  • Milestone Title: Machine Translation
    Milestone Description: Machine Translation. Machine translation (MT) is the automated translation of text or speech from one language to another using computational methods and machine learning techniques. We want to take input text from high-resource languages, which we call sources, and generate accurate translations in the Igbo language. This process will involve preprocessing of our corpora, statistical machine translation, neural machine translation, and postprocessing to refine the output.
    Completion Date Goal: 2024-03-13
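
The rule-based option named in the Information Extraction milestone can be sketched as a small gazetteer-and-pattern matcher. This is a minimal illustration, not the project's method: the `GAZETTEER` contents, the `DATE_PATTERN` regular expression, and the function name `extract_entities` are all assumptions made for the example, and a real gazetteer would have to be curated from the corpus itself.

```python
import re

# Hypothetical gazetteer; a real list would be curated from the corpus.
GAZETTEER = {
    "LOCATION": ["Enugu", "Owerri", "Aba"],
}

# Assumed date format (ISO style, e.g. 2023-10-10); news text may differ.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_entities(text, gazetteer=GAZETTEER):
    """Return (start, end, label, surface) tuples sorted by position."""
    found = []
    for label, names in gazetteer.items():
        for name in names:
            # Word boundaries keep "Aba" from matching inside "Abakaliki".
            pattern = r"\b" + re.escape(name) + r"\b"
            for m in re.finditer(pattern, text):
                found.append((m.start(), m.end(), label, m.group()))
    for m in DATE_PATTERN.finditer(text):
        found.append((m.start(), m.end(), "DATE", m.group()))
    return sorted(found)
```

Each returned tuple gives the character span, the entity label, and the matched text, which is one possible structured, machine-readable output. A statistical or hybrid approach would replace or supplement the fixed lists with a trained model.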

Final Report
