Offensive language identification using Hindi-English code-mixed tweets, and code-mixed data augmentation
Jahan, Md Saroar; Oussalah, Mourad; Mim, Jhuma Kabir; Islam, Mominul (2021-12-13)
Jahan, M. S., Oussalah, M., Mim, J. K., & Islam, M. (2021). Offensive Language Identification Using Hindi-English Code-Mixed Tweets, and Code-Mixed Data Augmentation. In P. Mehta, T. Mandl, P. Majumder, & Mandar M. (Eds.), Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation, Gandhinagar, India, December 13-17, 2021 (pp. 226-238). RWTH Aachen University. http://ceur-ws.org/Vol-3159/T1-23.pdf
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
https://creativecommons.org/licenses/by/4.0/
https://urn.fi/URN:NBN:fi-fe2022070551216
Abstract
Code-mixed text classification is challenging because labeled code-mixed datasets are scarce and pre-trained models for such text do not exist. This paper presents our HASOC-2021 offensive language identification results and main findings on the code-mixed (Hindi-English) Subtask 2. We propose a new method of code-mixed data augmentation that combines WordNet-based synonym replacement of Hindi and English words with phonetic conversion of Hinglish (Hindi-English) words. We used a 5.7k pre-annotated HASOC-2021 code-mixed dataset for training and data augmentation. The proposal’s feasibility was tested with Logistic Regression (LR) as a baseline, a Convolutional Neural Network (CNN), and BERT, each with and without data augmentation. The outcomes were promising, yielding an increase of almost 3% in classifier accuracy and F1 score compared to the baseline. Our official submission achieved a 66.56% F1 score and ranked 8th in the competition.
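As a rough illustration of the synonym-replacement idea mentioned in the abstract, the following minimal sketch swaps words for synonyms drawn from a small lookup table. It is not the authors' implementation: the table `TOY_SYNONYMS`, the function `augment`, and the `rate` and `seed` parameters are hypothetical names introduced here, and the hand-written table merely stands in for WordNet lookups. The phonetic conversion step is not shown.

```python
import random

# Hypothetical toy synonym table standing in for WordNet lookups.
# The entries are illustrative only and are not taken from the paper.
TOY_SYNONYMS = {
    "bad": ["awful", "terrible"],
    "movie": ["film"],
    "bura": ["kharab"],  # romanized Hindi, illustrative only
}


def augment(tokens, synonyms, rate=0.3, seed=0):
    """Return a copy of tokens where some words are swapped for a synonym.

    Each token that has an entry in `synonyms` is replaced with
    probability `rate`; all other tokens are kept unchanged.
    """
    rng = random.Random(seed)  # seeded for reproducible augmentation
    out = []
    for tok in tokens:
        candidates = synonyms.get(tok.lower())
        if candidates and rng.random() < rate:
            out.append(rng.choice(candidates))
        else:
            out.append(tok)
    return out


original = "yeh movie bahut bura hai".split()
# rate=1.0 replaces every token that has a synonym entry.
augmented = augment(original, TOY_SYNONYMS, rate=1.0)
```

Each augmented sentence would keep the label of the sentence it was derived from, so calling `augment` with different seeds is one way to enlarge a small labeled training set under these assumptions.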