Multilingual hate speech detection using deep learning
DOI: 10.11591/ijict.v14i3.pp1015-1023
Vincent Vincent, Amalia Zahra
The rise of social media has enabled public expression but has also fueled the spread of hate speech, contributing to social tensions and potential violence. Natural language processing (NLP), particularly text classification, has become essential for detecting hate speech. This study develops a hate speech detection model for Twitter using FastText embeddings with a bidirectional long short-term memory (Bi-LSTM) network and explores multilingual bidirectional encoder representations from transformers (M-BERT) for handling diverse languages. Data augmentation techniques, including easy data augmentation (EDA) methods, back translation, and generative adversarial networks (GANs), are employed to improve classification, especially on imbalanced datasets. Results show that data augmentation significantly boosts performance. With the FastText + Bi-LSTM model, the highest F1-scores are achieved by random insertion for Indonesian (F1-score: 0.889, accuracy: 0.879), synonym replacement for English (F1-score: 0.872, accuracy: 0.831), and random deletion for German (F1-score: 0.853, accuracy: 0.830). The M-BERT model performs best with random deletion for Indonesian (F1-score: 0.898, accuracy: 0.880), random swap for English (F1-score: 0.870, accuracy: 0.866), and random deletion for German (F1-score: 0.662, accuracy: 0.858). These findings underscore that the effectiveness of data augmentation varies by language and model. This research supports efforts to mitigate the impact of hate speech on social media by advancing multilingual detection capabilities.
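As a rough illustration of the pipeline described in the abstract (not the authors' implementation), the sketch below shows one EDA operation (random deletion) and a Bi-LSTM text classifier initialized with pretrained embeddings such as FastText. The names `embedding_matrix`, `vocab_size`, `embed_dim`, and the hyperparameters are hypothetical placeholders.

```python
# Minimal sketch, assuming an embedding matrix built from pretrained FastText
# vectors is already available; not the paper's exact architecture or settings.
import random
import tensorflow as tf


def random_deletion(tokens, p=0.1):
    """EDA random deletion: drop each token with probability p."""
    if len(tokens) <= 1:
        return tokens
    kept = [t for t in tokens if random.random() > p]
    # Guarantee at least one token survives the deletion pass.
    return kept if kept else [random.choice(tokens)]


def build_bilstm(vocab_size, embed_dim, embedding_matrix, n_classes=2):
    """Bi-LSTM classifier over frozen pretrained word embeddings."""
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(
            vocab_size, embed_dim,
            embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
            trainable=False),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model


# Example: augment a minority-class tweet before training.
print(random_deletion("this is an example tweet to augment".split(), p=0.2))
```

In this kind of setup, each EDA operation (synonym replacement, random insertion, random swap, random deletion) is applied to under-represented classes to rebalance the training data before the classifier is fit.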