Study of Undersampling Method: Instance Hardness Threshold with Various Estimators for Hate Speech Classification

Naufal Azmi Verdikha; Teguh Bharata Adji; Adhistya Erna Permanasari

doi:10.22146/ijitee.42152

Naufal Azmi Verdikha (1*), Teguh Bharata Adji (2), Adhistya Erna Permanasari (3)

(1) Universitas Gadjah Mada
(2) Universitas Gadjah Mada
(3) Universitas Gadjah Mada
(*) Corresponding Author

Abstract

A text classification system is needed to address the problem of hate speech on social media. However, hate speech texts are very hard to find on social media, which makes the class distribution of the training data imbalanced. Classifiers trained on imbalanced data tend to perform poorly. Several methods address this problem; one of them is undersampling with the Instance Hardness Threshold (IHT) method. IHT balances the dataset by eliminating instances that are frequently misclassified. To find those instances, IHT requires an estimator, which is a classifier. This research compares estimators for the IHT method in solving the imbalanced data problem in hate speech classification with TF-IDF weighting. Three criteria are used to determine the best IHT variant: the class ratio of the dataset after undersampling, the time taken by the undersampling process, and the Index of Balanced Accuracy (IBA). The results show that IHT with a Logistic Regression estimator (IHT(LR)) has the fastest undersampling process (1.91 s), balances the dataset perfectly with a class ratio of 1:1, and achieves the best IBA in all estimation processes. These results make IHT(LR) the best of the compared methods for solving the imbalanced data problem in hate speech classification.
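The two ideas the abstract relies on, removing hard majority-class instances and scoring with IBA, can be illustrated with a minimal sketch. This is not the authors' code. The function names, the toy labels, and the probability values are hypothetical, and the probabilities stand in for what an estimator such as Logistic Regression would output for each instance's true class. The IBA formula used is the commonly stated form (1 + α · (TPR − TNR)) · TPR · TNR, with α = 0.1 chosen here as an assumption. The simplified selection always reaches a 1:1 ratio; since the abstract reports the resulting class ratio as a measured outcome, a library implementation such as the one in imbalanced-learn may well behave differently.

```python
# Illustrative sketch only: not the authors' code and not a library implementation.

def iht_undersample(labels, true_class_proba, majority_label):
    """Keep every minority sample and only the easiest majority samples.

    labels: class label of each sample.
    true_class_proba: hypothetical estimator output, i.e. the probability the
        estimator assigns to each sample's own true class (low means "hard").
    Returns the sorted indices of the samples that are kept.
    """
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    minority = [i for i, y in enumerate(labels) if y != majority_label]
    # Rank majority samples from easiest (high probability) to hardest.
    easiest_first = sorted(majority, key=lambda i: true_class_proba[i], reverse=True)
    # Keep as many majority samples as there are minority samples.
    kept_majority = easiest_first[:len(minority)]
    return sorted(minority + kept_majority)


def index_balanced_accuracy(tpr, tnr, alpha=0.1):
    """IBA_alpha = (1 + alpha * (TPR - TNR)) * TPR * TNR (alpha is an assumption)."""
    return (1 + alpha * (tpr - tnr)) * tpr * tnr


# Toy data: 2 hate speech samples (label 1) and 6 other samples (label 0).
labels = [1, 0, 0, 0, 1, 0, 0, 0]
proba = [0.70, 0.95, 0.40, 0.85, 0.60, 0.30, 0.90, 0.55]

kept = iht_undersample(labels, proba, majority_label=0)
print(kept)  # [0, 1, 4, 6]: both minority samples plus the two easiest majority samples
print(index_balanced_accuracy(tpr=0.8, tnr=0.9))
```

In this toy run the hardest majority samples (those with the lowest true-class probability) are dropped first, which mirrors the abstract's description of eliminating frequently misclassified data.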

Keywords

Hate Speech Classification, Imbalanced Data, Instance Hardness Threshold, TF-IDF
