Evaluation of normalization techniques in text classification for Portuguese

Merley Da Silva Conrado, Víctor Antonio Laguna Gutiérrez, Solange Oliveira Rezende

Research output: Book/Report, Book, peer-reviewed

2 Citations (Scopus)

Abstract

Text classification is an important task of Artificial Intelligence. Normally, this task uses large textual datasets whose representation is feasible because of normalization and selection techniques. In the literature, we can find three normalization techniques: stemming, lemmatization, and nominalization. Nevertheless, it is difficult to choose the most suitable technique for the text classification task. In this paper, we investigate this question experimentally by applying five different classifiers to four textual datasets in the Portuguese language. Additionally, the classification results are evaluated using unigrams, bigrams, and the combination of unigrams and bigrams. The results indicate that, in general, the number of terms obtained by each of the cases and the comprehensibility required in the results of the classification can be used as criteria to define the most suitable technique for the text classification task. © 2012 Springer-Verlag.
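
The three term representations named in the abstract (unigrams, bigrams, and their combination) can be illustrated with a minimal sketch. It only shows how the number of distinct terms changes with the representation, since the abstract cites that number as one selection criterion. The toy corpus and the helper names `ngrams` and `vocabulary` are assumptions made for illustration; they are not the authors' datasets or code, and no stemming, lemmatization, or nominalization is applied here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vocabulary(docs, sizes):
    """Count every n-gram of the requested sizes across all documents."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()  # naive whitespace tokenization (assumption)
        for n in sizes:
            counts.update(ngrams(tokens, n))
    return counts


# Hypothetical toy corpus, not data from the paper.
docs = ["o gato preto dorme", "o gato branco corre"]

unigrams = vocabulary(docs, (1,))
bigrams = vocabulary(docs, (2,))
combined = vocabulary(docs, (1, 2))

# Number of distinct terms under each representation.
print(len(unigrams), len(bigrams), len(combined))
```

On this toy corpus the combined representation has as many distinct terms as the unigram and bigram vocabularies together, which hints at why term count matters when choosing a representation for larger datasets.
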
Original language: Spanish
Status: Published - 23 Jul 2012
Externally published
