ABSTRACT:
Stop word list generation enables the elimination of stop words, which considerably reduces the size of the corpus vector space and indexing structure, yields a high compression rate, speeds up computation and increases the accuracy of information retrieval systems. The proposed system identifies stop words by their distinguishing characteristics, using a carefully adopted aggregated approach that combines Frequency Analysis, Word Distribution Analysis and Word Entropy Measure. Each method generates its own stop word list after a thorough text preprocessing stage, including diacritization of the Yoruba corpus; the lists are then aggregated using set theory, redefining stop words as words with high frequency, stable distribution and low information content. The system applies machine learning with a Multinomial Naïve Bayes classifier to automate the generation of the stop word list and to keep the list updated as new words emerge. When applied to a Yoruba language corpus, the system produced a standardized, automatically generated stop word list that outperforms existing lists, most of which were built using frequency measures alone, and it also identified stop words absent from those lists. The system generated 255 stop words and achieved a text compression rate of 63% when the stop words were removed from the text documents.
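As a minimal illustrative sketch only (none of this code is from the paper; the helper names, the per-document relative-frequency statistics and the top_k cut-off are assumptions), the set-theoretic aggregation described above can be pictured as intersecting three candidate lists, so that a word is kept as a stop word only when it is highly frequent, stably distributed across documents and low in information content:

```python
import math
from collections import Counter

def candidate_lists(docs, top_k=300):
    """Build three candidate stop word lists from a tokenized corpus (list of
    token lists): high corpus frequency, stable cross-document distribution
    (low variance of relative frequency), and low informativeness (occurrences
    spread evenly over documents, i.e. high entropy)."""
    corpus_counts = Counter(w for doc in docs for w in doc)
    vocab = list(corpus_counts)

    # Relative frequency of each word in each document
    per_doc = [Counter(doc) for doc in docs]
    rel = {w: [c[w] / max(len(d), 1) for c, d in zip(per_doc, docs)] for w in vocab}

    # Frequency list: the top_k most frequent words in the corpus
    by_freq = {w for w, _ in corpus_counts.most_common(top_k)}

    # Distribution list: words whose relative frequency varies least across documents
    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    by_var = set(sorted(vocab, key=lambda w: variance(rel[w]))[:top_k])

    # Entropy list: words spread evenly over documents carry little
    # discriminative information, so high entropy marks a stop word candidate
    def entropy(xs):
        total = sum(xs) or 1.0
        ps = [x / total for x in xs if x > 0]
        return -sum(p * math.log2(p) for p in ps)
    by_ent = set(sorted(vocab, key=lambda w: -entropy(rel[w]))[:top_k])

    return by_freq, by_var, by_ent

def aggregate_stop_words(docs):
    """Set-theoretic aggregation: a word is a stop word only if all three
    methods flag it (intersection of the candidate lists)."""
    by_freq, by_var, by_ent = candidate_lists(docs)
    return by_freq & by_var & by_ent
```

Using the intersection rather than the union is the conservative choice: it keeps only words that all three criteria agree on, which matches the redefinition of stop words as simultaneously frequent, stably distributed and uninformative.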
Keywords:
Natural Language Processing, Zipf’s law, Stop words, Entropy, Variance, Diacritization