Spanish is the second most spoken language in the world (Ethnologue, 2015). Consequently, a great variety of texts in Spanish is stored electronically. Such a corpus can provide information about a specific topic. Moreover, new knowledge can be generated from what is already known. This illustrates the considerable challenge that multilingual text classification faces.
Text mining has been used to organize these documents. The aim of text mining is to discover knowledge from a collected text corpus. It processes large collections of unlabeled text, exploring them in electronic form to find relationships in their content and thus establish patterns from which useful knowledge can be extracted. Textual information is gathered in a corpus that depends on the language. The corpus is a resource annotated with different kinds of linguistic information, which allows the same knowledge to be treated in different ways. This information may be semantic, syntactic, or pragmatic, and includes grammatical categories, syntactic relations, senses, anaphoric relations, rhetorical structures, etc.
This project focuses on the classification of documents in Spanish using text mining with the open source software Weka. Weka is a machine learning tool that contains a repository of algorithms for knowledge discovery and makes it easy to preprocess training documents. With this software, the results of different algorithms can be analyzed and compared using measures derived from a confusion matrix. First, text mining and its relationship with other disciplines are defined. Next, some articles on text mining are presented, followed by examples of related research. Then, significant methods for data preprocessing in text mining are explained. Once the text is ready to be classified, the C4.5 algorithm, which is based on decision trees, is defined. To demonstrate what has been covered, an example is worked through using Weka. Finally, the results of the experiments are presented as conclusions. Additionally, a tutorial on using Weka as a text mining tool is provided.
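To make the idea of comparing algorithms through a confusion matrix concrete, the following is a minimal sketch in Python, not Weka's own code. It assumes the layout Weka prints (rows are the actual classes, columns are the predicted classes). The class names and counts are invented for illustration and are not results from this project.

```python
from typing import Dict, List


def accuracy(matrix: List[List[int]]) -> float:
    """Fraction of documents on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def per_class_metrics(
    matrix: List[List[int]], labels: List[str]
) -> Dict[str, Dict[str, float]]:
    """Precision, recall and F1 per class.

    Assumes rows are actual classes and columns are predicted classes.
    """
    metrics: Dict[str, Dict[str, float]] = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]                          # predicted i, actually i
        fn = sum(matrix[i]) - tp                   # actually i, predicted other
        fp = sum(row[i] for row in matrix) - tp    # predicted i, actually other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Hypothetical confusion matrix for two invented document classes.
labels = ["sports", "politics"]
matrix = [
    [40, 10],  # actual "sports": 40 classified as sports, 10 as politics
    [5, 45],   # actual "politics": 5 classified as sports, 45 as politics
]

overall = accuracy(matrix)
by_class = per_class_metrics(matrix, labels)
print(f"accuracy: {overall:.3f}")
for name, values in by_class.items():
    print(name, {k: round(v, 3) for k, v in values.items()})
```

Computing the same measures for the matrix produced by each algorithm gives a common basis for comparison; per-class precision and recall are often more informative than accuracy alone when the classes are unbalanced.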
By Valeria Guevara