Multilingual Code-Mixed Sentiment Analysis in Hate Speech

Article Fingerprint
Research ID J6UF8

Abstract

This paper presents a simple and effective method for identifying the sentiment (positive, negative, or neutral) of hate speech written in more than one language within the same sentence, which is common on Indian social media. Such code-mixed text (for example, Hindi and English together) is difficult for traditional sentiment analysis systems, which are built to work on a single language. Because very few labeled datasets exist for such text, we first collected tweets containing both hate and non-hate speech from Twitter. We then used a pretrained transformer model, trained on multilingual social media data, to automatically assign sentiment labels to these tweets. Using this labeled data, we trained six machine learning models, including an ensemble model. Our tests show that the ensemble model performs best, with higher accuracy, precision, recall, and F1-score than the other models. We found that hate speech usually carries negative sentiment, while non-hate speech is often neutral or positive. This work offers a scalable framework for sentiment classification in low-resource, code-mixed settings and lays a foundation for broader applications such as toxic comment moderation and social media monitoring.
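As a rough illustration of the final step described in the abstract, the sketch below combines the predictions of several base classifiers by hard majority voting and scores the result with accuracy and macro-averaged precision, recall, and F1. The abstract does not state which ensemble technique or base models were used, so majority voting, the function names `majority_vote` and `macro_scores`, and the toy predictions are all assumptions made purely for illustration; they are not the authors' implementation or data.

```python
from collections import Counter

LABELS = ("negative", "neutral", "positive")


def majority_vote(predictions):
    """Combine per-model label lists into one label per tweet (hard voting).

    predictions: list of lists, one inner list per base model, all of equal length.
    Ties go to the tied label voted for by the earliest model in the list.
    """
    combined = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        # Pick the first vote, in model order, whose label reached the top count.
        combined.append(next(v for v in votes if counts[v] == top))
    return combined


def macro_scores(y_true, y_pred, labels=LABELS):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    precisions, recalls, f1s = [], [], []
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Made-up labels and predictions for four tweets (illustrative only).
y_true = ["negative", "negative", "neutral", "positive"]
model_a = ["negative", "neutral", "neutral", "positive"]
model_b = ["negative", "negative", "positive", "positive"]
model_c = ["neutral", "negative", "neutral", "negative"]

ensemble = majority_vote([model_a, model_b, model_c])
scores = macro_scores(y_true, ensemble)
```

In this toy case each base model makes at least one mistake, but the mistakes fall on different tweets, so the vote recovers every label. That is the usual intuition for why an ensemble can outperform its members; it says nothing about the size of the gain reported in the paper.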

Conflict of Interest

The authors declare no conflict of interest.

Ethical Approval

Not applicable.

Data Availability

The datasets used in this study are openly available at [repository link] and the source code is available on GitHub at [GitHub link].

Funding

This work did not receive any external funding.

Related Research

  • Version of record: v1.0
  • Issue date: NA
  • Language: English

Open Access
Research Article
CC-BY-NC 4.0