Multilingual Code-Mixed Sentiment Analysis in Hate Speech

Abstract

This paper presents a simple and effective method for classifying the sentiment (positive, negative, or neutral) of hate speech that mixes more than one language within the same sentence, a pattern common on Indian social media. Such text, called code-mixed (for example, Hindi and English together), is difficult for traditional sentiment analysis systems, which are designed for a single language. Because very few labeled datasets exist for code-mixed text, we first collected tweets containing both hate and non-hate speech from Twitter. We then used a pretrained transformer model, trained on multilingual social media data, to assign sentiment labels to these tweets automatically. Using this labeled data, we trained six machine learning models, including an ensemble model. Our experiments show that the ensemble model performs best, achieving higher accuracy, precision, recall, and F1-score than the other models. We also found that hate speech usually carries negative sentiment, while non-hate speech is often neutral or positive. This work offers a scalable framework for sentiment classification in low-resource, code-mixed settings and lays a foundation for broader applications such as toxic comment moderation and social media monitoring.
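
To make the evaluation step concrete, the following is a minimal sketch of how an ensemble's predictions could be combined and scored on the four reported metrics. The abstract does not say how the ensemble is built or how the metrics are averaged, so hard majority voting and macro-averaging are assumptions here, and the function names `majority_vote` and `macro_scores` are hypothetical.

```python
from collections import Counter

# The three sentiment classes named in the abstract.
LABELS = ("negative", "neutral", "positive")


def majority_vote(predictions):
    """Combine per-model label lists by hard voting (an assumed ensemble rule).

    predictions: list of lists, one inner list per model, all the same length.
    Ties go to the tied label that appears first in model order.
    """
    combined = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        # Pick the first vote, in model order, whose label has the top count.
        combined.append(next(v for v in votes if counts[v] == top))
    return combined


def macro_scores(y_true, y_pred, labels=LABELS):
    """Return accuracy plus macro-averaged precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Undefined ratios (empty denominators) are scored as 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }
```

Calling `macro_scores(y_true, majority_vote([preds_a, preds_b, preds_c]))` alongside `macro_scores(y_true, preds_a)` for each individual model would give the kind of side-by-side comparison the abstract describes; the variable names are placeholders for illustration only.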

Keywords

NA

  • License

    Creative Commons Attribution 4.0 (CC BY 4.0)

  • Language & Pages

    English, NA