Document classification in Spanish is analyzed using text mining through Weka, an open-source software suite. Weka analyzes large amounts of data and helps decide which parts are most important, with the aim of making automatic predictions that support decision making. Compared with other data mining tools such as RapidMiner, IBM Cognos Business Intelligence, Microsoft SharePoint and Pentaho, Weka provides a friendly, easy-to-understand interface, loads data efficiently and has data mining as its main objective.
Text mining seeks to extract patterns from large collections of documents in order to gain new knowledge. Its purpose is the discovery of interesting groups, trends and associations, and the visualization of new findings.
Text mining is considered a subset of data mining; for this reason, it adopts data mining techniques that rely on machine learning algorithms. Computational linguistics also contributes techniques to text mining. This science studies natural language with computational methods in order to make it understandable by a computer.
Automatic categorization determines the subject matter of the documents in a collection. Unlike clustering, it chooses the class to which a document belongs from a list of predefined classes. Each category is trained through a prior manual categorization process.
Classification starts with a set of training texts that have been previously categorized, and then generates a classification model based on that set of examples. This model is able to assign the correct class to a new text. A decision tree is a classification technique that represents knowledge as an if-else structure encoded in the branches of a tree.
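As a minimal illustration of the if-else structure that a decision tree's branches encode, the following sketch hand-writes a tiny tree over word counts. The words tested at each node and the class labels are hypothetical, not taken from any trained model.

```python
# A hypothetical, hand-written decision tree over word frequencies.
# Each if/else corresponds to one branch of the tree.
def classify(word_counts):
    """word_counts: dict mapping each word to its frequency in a document."""
    if word_counts.get("insulina", 0) > 0:          # root node test
        return "Diabetes"
    elif word_counts.get("quimioterapia", 0) > 0:   # second branch test
        return "Cancer"
    else:
        return "Nutrition"                          # default leaf

print(classify({"insulina": 3}))         # Diabetes
print(classify({"quimioterapia": 1}))    # Cancer
```

A learned tree such as the one C4.5 produces has exactly this shape, except that the algorithm chooses the test at each node automatically from the training data.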
The text mining methodology provides a framework carried out in four stages: data acquisition, document preprocessing, information extraction and evaluation of results. Witten, Frank and Hall mention these steps in their work on the use of WEKA.
Data should be collected in a way that allows creating a training dataset. Witten, Frank and Hall consider three kinds of input data for text mining: concepts, instances and attributes. The concept specifies what is to be learned. An instance represents the data of a class to be classified, and it contains a set of specific characteristics called attributes. An attribute represents the measured value of one such characteristic in that instance. In the case of document classification, the classes will be nominal attributes, because the categories need not represent an order between them (ordinal attributes).
WEKA uses a standard format called Attribute-Relation File Format (ARFF) to represent the collection of documents as instances that share an ordered set of attributes, divided into three sections: relation, attributes and data.
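A minimal ARFF file for the document-classification task described here might look as follows. The attribute names match those produced later by the conversion application; the document texts are illustrative placeholders.

```
@relation Medicine

@attribute DocumentText string
@attribute docClass {Hemodialysis, Nutrition, Cancer, Obesity, Diet, Diabetes}

@data
'Patients undergoing hemodialysis require careful monitoring ...', Hemodialysis
'A balanced diet reduces the risk of chronic disease ...', Diet
```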
Data preprocessing is the preparation of the text through a series of operations that generate some kind of structured or semi-structured information for analysis. The most popular way to represent documents is with a vector that contains all the words found in the text along with their occurrence counts. Important preprocessing tasks for categorizing documents are stemming, lemmatization, stopword removal, tokenization and conversion to lowercase.
A stemming algorithm removes morphemes in order to find the relationship between words and their lexeme. Stopword removal excludes the words that do not help generate knowledge from the text. Tokenization separates the text into words using punctuation. Spanish punctuation marks are "; . ¿ ? ¡ ! – - ( ) ' « »", where the dot and the dash are ambiguous in Spanish; unlike English, which only marks the end of an exclamation or a question, Spanish also uses opening marks. Conversion to lowercase treats all terms equally regardless of letter case.
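The three simplest of these steps can be sketched together as follows. This is a hedged illustration, not Weka's implementation: the stopword list is a tiny sample rather than a full Spanish stoplist, and tokenization is approximated by splitting on anything that is not a letter, which covers the Spanish punctuation marks listed above.

```python
import re

# Illustrative sample of Spanish stopwords (not a complete stoplist).
STOPWORDS = {"el", "la", "de", "y", "en", "que", "los"}

def preprocess(text):
    # Lowercase conversion, then tokenization: split on any character
    # that is not a letter (handles ; . ¿ ? ¡ ! « » and whitespace).
    tokens = re.findall(r"[a-záéíóúüñ]+", text.lower())
    # Stopword removal: drop words that add no knowledge to the model.
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("¡La diabetes y la obesidad aumentan el riesgo!"))
# ['diabetes', 'obesidad', 'aumentan', 'riesgo']
```

In Weka these steps correspond to options of the StringToWordVector filter rather than user-written code.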
After data preprocessing, the next step is knowledge extraction. Document classification in Weka seeks to learn a predictive classification model. Such models are used to predict the class to which an instance belongs. The model is created using the C4.5 decision tree algorithm, as it is one of the simplest and most widely used algorithms for the classification task.
Weka generates a confusion matrix for the generated model. This shows, in an easy way, how many of the model's predictions were made correctly. The four possible outcomes are: TP, true positive: a positive instance predicted as positive; TN, true negative: a negative instance correctly classified as negative; FP, false positive: a negative instance incorrectly classified as positive; FN, false negative: a positive instance incorrectly classified as negative.
Precision and recall are relevant metrics for document classification. The classification model reports results in binary form in a confusion matrix, from which the predictive efficiency is calculated. Precision is the percentage of predicted positive cases that are correct: TP / (TP + FP). Recall, or sensitivity, is the ability to predict positive instances out of the total of all positive instances: TP / (TP + FN). These two measures are balanced in the F-measure, which summarizes the quality of the predictions. The resulting F1-measure is calculated by the following equation: (2 * precision * recall) / (precision + recall).
The training dataset was obtained from the Thompson Rivers University library: 71 randomly selected medical academic articles in English and Spanish, stored in PDF format. Based on the TRU library's classification, these documents were divided into six recognized categories: Hemodialysis, Nutrition, Cancer, Obesity, Diet and Diabetes. The documents are stored in directories named after their categories within a main folder called Medicine.
In order to build the ARFF file, an application was developed that generates the ARFF from a directory-based document collection. This application was implemented with the help of a library called iTextSharp, used to extract text from the Portable Document Format. The application is named Documents Directory to ARFF.
The resulting ARFF contains a string attribute called "DocumentText", which holds all the text found in the document, and the nominal attribute "docClass", which defines the class to which it belongs. Note that in recent versions of Weka, such as version 3.6.12 used here, the class attribute can never be named "class".
Various tests were applied to the same set of texts to assess the predictive accuracy of the model. A set of optimal options was derived from different combinations of options applied to the same training data. For each resulting model, the F-measure summarizing its predictive performance was calculated.
First, the best structure for the filter was analyzed with the J48 classifier options left unadjusted; in this stage the best parameters for the filter were selected. That best configuration was then used to assess the best settings for the J48 classifier algorithm. Based on a comparison chart, it was found that the combination of Stopwords + Word Tokenizer E&S + Lower Case Conversion, with minNumObj adjusted to 1 in the J48 algorithm, provides values of 1 for both recall and precision.
In conclusion, the best model results from applying the combination Stopwords + Word Tokenizer E&S + Lower Case Conversion in the data-preprocessing filter and further adjusting minNumObj to 1 in the J48 classifier algorithm.
By Valeria Guevara