Weka Tutorial on Document Classification


Weka was selected to build a model that classifies specialized documents from two different corpora (English and Spanish). The WEKA package is a collection of machine learning algorithms for data mining tasks. Text mining uses these algorithms to learn from examples, the “training set”, so that new texts can be classified into the categories that were analyzed. WEKA stands for Waikato Environment for Knowledge Analysis. For more information see http://www.cs.waikato.ac.nz/~ml/weka/.

Imagen2

Installing WEKA

Weka can be downloaded from:

http://www.cs.waikato.ac.nz/ml/weka/downloading.html.

This tutorial uses Weka 3.6.12.

For Windows:

Start WEKA from the program launcher located in the Weka folder. Weka's default directory is the same directory from which the file is loaded.

For Linux:

Open a terminal and run java -jar /installation/directory/weka.jar to start WEKA.

Following the text mining methodology, the work in Weka is organized into four stages: data acquisition, document preprocessing, information extraction and evaluation.

 

Data Acquisition

ARFF files are the primary input format for any classification task in WEKA. These files hold the basic input data for data mining (concepts, instances and attributes). An Attribute-Relation File Format (ARFF) file describes a list of instances of a concept together with their attributes.

The documents for the training data set were obtained from the Thompson Rivers University library: http://www.tru.ca/library.html. A total of 71 academic medical articles in English and Spanish were selected at random. These documents are stored in Portable Document Format (PDF). Following the classification used by the TRU library, the documents fall into six categories: Hemodialysis, Nutrition, Cancer, Obesity, Diet and Diabetes. The documents are stored in directories named after their categories inside a main folder called Medicine, as shown in the figure below.

WEKAImagen3

To build the arff file, an application was written in C# with Microsoft Visual Studio Professional 2012. It generates the arff from a directory that contains a collection of documents grouped into subdirectories named after their categories. The application relies on a library called iTextSharp to extract the text from PDF files.

The application, Documents Directory to ARFF, lets you specify the name of the relation, the location of the home directory that contains all documents subdivided into category directories, and any comments required. It also lets you specify the name of the generated file, with the arff extension, and its location. At the bottom of the application there are two buttons, one to exit and another to generate the arff file with the information described.

The application can be downloaded here: directoryPDFtoARFF

WEKAImagen4

The resulting arff contains a string attribute called “textoDocumento”, which holds all the text found in the document, and a nominal attribute called “docClass”, which defines the class the document belongs to. Note that in recent versions of Weka, such as the 3.6.12 used here, the class attribute can never be named “class”.

 

The file will be generated as follows:

%  tutorial de Weka para la Clasificación de Documentos.
@RELATION Medicina

@attribute textoDocumento string
@attribute docClass {Hemodialysis, Nutrition, Cancer, Obesity, Diet, Diabetes}

@data
"texto…", Hemodialysis
"texto…", Nutrition
"texto…", Cancer
"texto…", Obesity
"texto…", Diet
"texto…", Diabetes
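As a rough illustration of the file layout shown above, the following Python sketch writes an arff with the same two attributes. It is not the C# application described in this tutorial: the function names build_arff and quote_arff_string are hypothetical, the document text is assumed to have been extracted from the PDFs already, and escaping quotes with a backslash is an assumption about what the Weka reader accepts.

```python
from typing import Dict, List


def quote_arff_string(text: str) -> str:
    """Return the text as a double-quoted ARFF string value."""
    flat = " ".join(text.split())  # collapse newlines and repeated spaces
    # Assumption: backslash-escaped quotes are accepted by the ARFF reader.
    flat = flat.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{flat}"'


def build_arff(relation: str, docs_by_class: Dict[str, List[str]], comment: str = "") -> str:
    """Build the ARFF text: header, the two attributes, then one row per document."""
    lines = []
    if comment:
        lines.append(f"% {comment}")
    lines.append(f"@RELATION {relation}")
    lines.append("")
    lines.append("@attribute textoDocumento string")
    # Category names are used as-is; names containing spaces would need quoting.
    lines.append("@attribute docClass {" + ", ".join(docs_by_class) + "}")
    lines.append("")
    lines.append("@data")
    for label, texts in docs_by_class.items():
        for text in texts:
            lines.append(f"{quote_arff_string(text)}, {label}")
    return "\n".join(lines) + "\n"


# Hypothetical miniature input: category name -> list of extracted texts.
arff = build_arff(
    "Medicina",
    {"Hemodialysis": ["first sample\ntext"], "Diet": ['a "quoted" word']},
)
print(arff)
```

Flattening the whitespace keeps every instance on a single line, which makes the generated file easier to inspect by hand.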

 

Document Preprocessing

Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.

“Applications” is the first screen in Weka, where the desired sub-tool is selected. In this tutorial “Explorer” is selected. It consists of six panels: Preprocess, Classify, Cluster, Associate, Select attributes and Visualize.

 

Preprocess

This panel handles the preprocessing for the classification of documents.

To load the generated arff, click the “Open file …” button at the top of the panel.

Select the created file “medicinaWeka.arff”.

“Current relation” describes the dataset that has been loaded: the relation is named Medicina, it has 71 instances and 2 attributes. Below it, the “Attributes” section lists the attributes and allows them to be selected; in this case it shows “textoDocumento” and “docClass”.

When “docClass” is selected, the “Selected attribute” section describes it as a nominal attribute with 6 labels and shows the number of instances for each label: 11 instances for Hemodialysis and 12 instances for each of the others (Nutrition, Cancer, Obesity, Diet and Diabetes). At the bottom of this section a histogram of the “docClass” labels is displayed; hovering over the graph shows the attribute name, as the following figure illustrates.

WEKAImagen5
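The per-label counts shown in the “Selected attribute” section can also be checked outside Weka. The sketch below is a simplified illustration, not a full ARFF parser: the function name count_class_labels is hypothetical, and it assumes that every data row sits on one line and that the class label is whatever follows the last comma.

```python
from collections import Counter


def count_class_labels(arff_text: str) -> Counter:
    """Count instances per class, taking the text after the last comma as the label."""
    counts = Counter()
    in_data = False
    for raw in arff_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if line.lower() == "@data":
            in_data = True  # everything after this marker is an instance
            continue
        if in_data:
            label = line.rpartition(",")[2].strip()
            counts[label] += 1
    return counts


# Miniature made-up sample in the same layout as the generated file.
sample = """@RELATION Medicina
@attribute textoDocumento string
@attribute docClass {Hemodialysis, Nutrition}
@data
"some text, with a comma", Hemodialysis
"more text", Nutrition
"and more", Nutrition
"""
counts = count_class_labels(sample)
print(counts)
```

For the data set of this tutorial the counts should add up to 11 + 5 × 12 = 71 instances.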

Weka uses the StringToWordVector filter to convert the “textoDocumento” attribute into a set of attributes that represent the occurrence of words in the full text. The filter is found among Weka's unsupervised filters, which transform the data without needing to know the correct classification of each instance.

The filters are found by clicking the “Choose” button under the “Filter” section. This button opens a tree whose root is weka. From there select filters, then the unsupervised folder, then attribute, and finally StringToWordVector.

The options of the StringToWordVector filter can be configured to apply language processing techniques. To edit them, click on the filter name; a window opens that shows the following options.

A set of optimal options was obtained by applying different combinations of options to the same training data. For each resulting model the F-measure was calculated, which combines the precision and the recall of its predictions. The options that produced the greatest number of correctly predicted instances are as follows:

a) wordsToKeep: left at 1000; it defines the limit of words to keep per class. The doNotOperateOnPerClassBasis flag is left as “False”, so that wordsToKeep is enforced for each class separately.

b) TFTransform as “True”, IDFTransform as “True”, outputWordCounts as “True”, and normalizeDocLength set to “No normalization”.

With these settings the filter counts how often a word occurs in a document instead of only recording whether the term is present. outputWordCounts is the flag that switches from recording whether a word exists in the document to counting its occurrences, and with normalizeDocLength set to “No normalization” each word keeps the actual value of its tf-idf result in the document, no matter how short or long the document is.

c) lowerCaseTokens: as “True”, to convert all words to lowercase before they are added to the dictionary, so that the same word in lowercase and uppercase is not analyzed separately.

d) Stemmer: selects the algorithm that removes morphological affixes in a given language in order to reduce each word to its root. No stemmer is selected here, since the classification of texts is multilingual and a stemmer would only apply to one language. To configure no stemmer, click the “Choose” button and select “NullStemmer” from the menu that is displayed.

Weka offers a standard stemming algorithm for English from snowball.tartarus.org. Snowball is a string processing language designed for creating stemmers, and it features a stemming algorithm for Spanish. To use the Spanish algorithm, download snowball-20051019.jar from https://weka.wikispaces.com/Stemmers and store it in the location where the Weka application is. The algorithm is then made available by starting Weka from the command line with the following command.

For Windows: java -classpath "weka.jar;snowball-20051019.jar" weka.gui.GUIChooser

For Linux: java -classpath "weka.jar:snowball-20051019.jar" weka.gui.GUIChooser

To confirm that the jar was added, run the following command and check the java.class.path parameter:

java weka.core.SystemInfo

As shown in the following figure:

 

WEKAImagen6

Once the SnowballStemmer is available, select it by clicking the “Choose” button.

This button displays a menu; navigate to weka > core > stemmers and choose SnowballStemmer.

Click on the stemmer name and a window for setting the language will appear. For Spanish, type “spanish” in place of “porter” in the field labeled “stemmer” and click “OK”.

WEKAImagen7

 

e) Stopwords: determines whether a substring in a text is a word that provides no information about the text. These words come from a predefined Rainbow list, shown by default as Weka-3-6. Rainbow is a program that performs statistical text classification based on the Bow library. Rainbow has separate lists for English and Spanish; to cover both languages, the “ES-stopwords” file, which contains both Rainbow lists, is used. The “ES-stopwords” list can be downloaded here: ES-stopwords.

To change the list, click on Weka-3-6 next to the stopwords label and choose the previously downloaded “ES-stopwords” file. Set the useStoplist option to “True” to ignore the words that are in “ES-stopwords”.

f) Tokenizer: chooses the unit into which the “textoDocumento” attribute is separated. Click the “Choose” button and select “WordTokenizer” from the menu that is displayed. Click on the tokenizer name to open the window where the “delimiters” for English and Spanish are set. The delimiters are .,;:'”()?!“¿!-[]’<>“ which, for Spanish, include the additional characters for exclamation and interrogation.

As shown in the figure below:

WEKAImagen8

Another option is to choose NGramTokenizer, which divides the original text string into subsets of consecutive words that form a pattern with a unique meaning. Its default “delimiters” are the space character followed by \r\n\t.,;:'"()?! This is useful to help uncover patterns of words that together represent a meaningful context.

g) minTermFreq: left at the default of 1, the minimum frequency each word must have to be considered as an attribute; for this to apply per class, the “doNotOperateOnPerClassBasis” flag should be “False”.

h) periodicPruning: left at -1, that is, no pruning, so it will not remove low-frequency words.

i) attributeNamePrefix: left empty so that no prefix is added to the generated attributes.

j) attributeIndices: left as first-last so that all attributes, from the first to the last, are taken into account.

k) invertSelection: kept as “False” to work with the selected attributes.

At the bottom of the window you can save, cancel or apply the configuration. The window should look as follows:

Imagen9

To save the filter with these options, click the “Save …” button and then select the location and name.

To apply the filter with these options, click the “OK” button. This returns to the “Preprocess” window, where the “textoDocumento” attribute must be selected in the “Attributes” section.

Click the “Apply” button, located in the upper right of the “Filter” section. The Weka bird image in the lower right corner will start to dance until the process is complete.
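To make the effect of the filter options more concrete, here is a minimal Python sketch of the same idea: lowercase the text, split it on delimiters, drop stop words, count the words and weight the counts. It is an illustration under stated assumptions, not Weka's implementation. The delimiter set and the stop list are made up for the example, the weights assume the formulas log(1 + f) for the TF transform and log(N / df) for the IDF factor, no stemming is applied (as with NullStemmer), and wordsToKeep is not modeled.

```python
import math
from collections import Counter

# Assumed delimiter set and a tiny stop list, both for illustration only.
DELIMITERS = " \r\n\t.,;:'\"()?!¿¡-[]<>"
STOPWORDS = {"the", "and", "of", "el", "la", "de", "y"}
_TO_SPACE = str.maketrans({ch: " " for ch in DELIMITERS})


def tokenize(text):
    """Lowercase the text, split it on the delimiters and drop stop words."""
    words = text.lower().translate(_TO_SPACE).split()
    return [w for w in words if w not in STOPWORDS]


def word_vectors(documents):
    """Return one {word: weight} dict per document using log TF times IDF."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    n_docs = len(documents)
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # number of documents that contain each word
    vectors = []
    for c in counts:
        vec = {}
        for word, f in c.items():
            tf = math.log(1 + f)                     # assumed TF transform
            idf = math.log(n_docs / doc_freq[word])  # assumed IDF factor
            vec[word] = tf * idf                     # no document length normalization
        vectors.append(vec)
    return vectors


docs = ["The diet, the DIET!", "Diabetes and diet", "¿Cáncer? Diabetes."]
vectors = word_vectors(docs)
for doc, vec in zip(docs, vectors):
    print(doc, "->", vec)
```

Because no length normalization is applied, a word that is repeated in a long document keeps its larger weight, which matches the “No normalization” choice described above. A word that appears in every document receives a weight of zero under this weighting.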

 

Information extraction

After the data cleaning on the “Preprocess” tab, the next step is the extraction of information. Click on the “Classify” tab, the second panel of the Explorer.

This stage analyzes the attribute vectors to create the classification model that will describe the structure found in the analyzed information.

In Weka, the decision tree model J48 is considered the most popular for text classification. J48 is the Java implementation of the C4.5 algorithm. In this algorithm each node represents one of the possible decisions to be taken and each leaf represents the predicted class.
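C4.5 chooses the decision at each node by comparing candidate tests with the gain ratio criterion. The Python sketch below computes that criterion for a single binary test on a numeric attribute, such as the weight of one word. It is a simplified illustration only: the function names are hypothetical, the threshold is given rather than searched for, and the pruning and other refinements of J48 are left out.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary test `value <= threshold` on a numeric attribute."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # the test does not separate anything
    total = len(labels)
    remainder = (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)
    gain = entropy(labels) - remainder
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return gain / split_info


# Made-up weights of one word in four documents and their classes.
weights = [0.0, 0.0, 1.1, 0.7]
classes = ["Cancer", "Cancer", "Diet", "Diet"]
ratio = gain_ratio(weights, classes, threshold=0.5)
print(ratio)
```

In this made-up example the test separates the two classes perfectly, so it would be preferred over any test with a lower ratio.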

First, choose the classification algorithm with the “Choose” button located in the upper left side of the window.

 

Imagen10

This button displays a tree whose root is weka and whose subfolder is “classifiers”.

Within the subfolder weka.classifiers.trees, select the tree model J48, as shown in the following figure:

Imagen11

Click on the name of the J48 classifier, located next to the “Choose” button, to access its options.

The model can reach 100% correct classification on the training data by disabling pruning and setting the minimum number of instances in a leaf to 1. In this case the parameter changed is:

a) minNumObj: set to 1, leaving the other parameters in their default configuration.

Imagen12

In the “Test options” section the evaluation data is set.

Select “Use training set” to train the method with all available data and evaluate the results on the same input data.

Alternatively, you can partition the input data by selecting the “Percentage split” option and defining the percentage of the total input data used to build the classifier model, leaving the remaining part for testing.
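The idea behind a percentage split can be sketched in a few lines of Python. This is only an illustration: the function name percentage_split is hypothetical, the 66 percent value and the seed are example values, and Weka's own shuffling and rounding may differ.

```python
import random


def percentage_split(instances, train_percent=66.0, seed=1):
    """Shuffle the instances and split them into training and test lists."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_percent / 100.0)
    return shuffled[:cut], shuffled[cut:]


# 71 placeholder instances standing in for the documents of this tutorial.
data = [(f"doc{i}", "Diet" if i % 2 else "Cancer") for i in range(71)]
train, test = percentage_split(data, train_percent=66.0)
print(len(train), len(test))  # prints "47 24"
```

The model is built from the first list only, and the second list is used to measure how well it classifies documents it has not seen.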

 

Below “Test options” there is a menu that lists all attributes. In this case select “docClass”, because it is the attribute that acts as the result of the classification in this example.

Imagen13

The classification is started by pressing the “Start” button.

The Weka bird image in the bottom right will dance until the classification process ends.
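The F-measure used earlier to compare filter options can be computed from the actual and predicted classes once a run has finished. The following Python sketch shows the calculation for one class; the function name f_measure and the sample labels are made up for illustration.

```python
def f_measure(actual, predicted, label):
    """Harmonic mean of precision and recall for one class label."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == label and p == label)  # correctly predicted
    fp = sum(1 for a, p in pairs if a != label and p == label)  # wrongly predicted as label
    fn = sum(1 for a, p in pairs if a == label and p != label)  # missed instances
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up labels for four documents.
actual = ["Diet", "Diet", "Cancer", "Cancer"]
predicted = ["Diet", "Cancer", "Cancer", "Cancer"]
f_cancer = f_measure(actual, predicted, "Cancer")
f_diet = f_measure(actual, predicted, "Diet")
print(f_cancer, f_diet)
```

A value of 1 means that every instance of the class was found and nothing else was assigned to it; lower values indicate missed or wrongly assigned instances.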

 

WEKA creates a graphical representation of the J48 classification tree. The tree can be viewed by right-clicking on the last entry in the “Result list” and selecting the “Visualize tree” option.

 

Imagen14

 

The window can be adjusted to make the tree easier to read by right-clicking and selecting “Fit to Screen”, as shown in the image below.

 

Imagen15

 


By Valeria Guevara