Weka

 

The WEKA Tool


 

The weka is a native New Zealand bird that does not fly but has a penchant for shiny objects. [30] Newzealand.com. (2015). Old legends from New Zealand tell that these birds steal shiny items. The University of Waikato in New Zealand started developing a tool with that name because it would contain algorithms for data analysis. Currently the WEKA package is a collection of machine learning algorithms for data mining tasks. The Waikato Environment for Knowledge Analysis package contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. [31] Hall, M., Frank, E., Geoffrey H., Pfahringer, B., Reutemann, P., & Witten, IH (2009). This software analyzes large amounts of data and decides which parts are the most important. It aims to make automatic predictions that support decision making.

Weka vs. Other Machine Learning Tools

There are other tools for data mining, such as RapidMiner, IBM Cognos Business Intelligence, Microsoft SharePoint, and Pentaho. IBM Cognos Business Intelligence provides a display that is not very user-friendly. Microsoft SharePoint can create predictive business mining models, but mining is not its main objective. RapidMiner offers an excellent display of results, but datasets load more slowly than in Weka. Pentaho's graphical interface does not describe its options as clearly as Weka does.

Weka implements machine learning techniques in easy-to-learn Java under the GNU General Public License. WEKA can be used in three ways: through its graphical interface, through command line interfaces, and through its Java API. Although WEKA has not been used primarily for prediction problems in business, it supports the construction of new algorithms. It therefore turns out to be a very suitable tool for initial data analysis, classification, clustering, and research.

In this project, the Weka tool is used to create a predictive model using machine learning text classification algorithms.

Installation

Weka can be downloaded at: http://www.cs.waikato.ac.nz/~ml/weka/. This document refers to version 3.6.12, the latest at the time of writing. At the same URL you can find instructions for installation on different platforms.

On Windows, Weka is started from the program launcher in the folder of the downloaded Weka version, in this case weka-3-6. Weka's default directory is the directory from which the file was loaded.

On Linux, open a terminal and type: java -jar /installation/directory/weka.jar.

It is common to encounter an insufficient-memory error, which is resolved by specifying a larger heap in the setup files, for example 2 GB with "-Xmx2048m". Further information can be found at weka.wikispaces.com/OutOfMemoryException. The memory can be controlled with the -Xms and -Xmx parameters, indicating the minimum and maximum RAM respectively.

On Windows, you can edit the RunWeka.bat or RunWeka.ini file in the Weka installation directory, changing the maxheap line from maxheap=128m to maxheap=1024m. Note that more than about 1.4 GB cannot be assigned to a 32-bit JVM. You can also assign memory to the virtual machine with the command:

                          java -Xms<minimum-memory>m -Xmx<maximum-memory>m -jar weka.jar

[32] Garcia, D., (2006).

On Linux, the -Xmx<MemorySize>m option is used, replacing <MemorySize> with the required size in megabytes. For instance:

                           java -Xmx512m -jar /installation/directory/weka.jar

 

Execution

The first screen Weka shows is a chooser with a section called "Applications", where in this version the Explorer, Experimenter, KnowledgeFlow, and Simple CLI tools are offered. Explorer is responsible for exploration operations on a data set. Experimenter performs experiments and statistical tests, running different algorithms on different data sets in an automated manner. KnowledgeFlow shows Weka's operation graphically as a workflow panel. Simple CLI is a simple client that provides a command line interface for entering commands.

The main user interface, "Explorer", consists of six panels. Preprocess is the first window opened in this interface. In this window, the data are loaded. Weka accepts loading the data set from a URL, a database, or CSV and ARFF files. The ARFF file is the primary format for any classification task in WEKA.

Input data.

As previously described, three data inputs are considered in data mining: concepts, instances, and attributes. An Attribute-Relation File Format (ARFF) file is a text file that describes a list of instances of a concept with their respective attributes. These files are used by Weka for text classification and clustering applications.

    ARFF files.

These files have two parts: the header information and the data information. The header contains the name of the relation and the attributes (name and type). The name of the relation is defined in the first line of the ARFF file, where <relation-name> is a string, with the following format:

                           @relation <relation-name>

The next section is the attribute declarations. This is an ordered sequence of declarations, one per attribute. These declarations uniquely define the attribute name and its data type. The order in which the attributes are declared indicates their position in the instances. For example, for the attribute declared first, all instances are expected to list the value of this attribute in the first position. The format of the declaration is:

                           @attribute <attribute-name> <datatype>

 

Weka supports several <datatype> options:

i) NUMERIC or numeric: all real numbers, where the separator between the integer and decimal parts is a point, not a comma.

ii) INTEGER or integers: treated as numeric.

iii) NOMINAL: provides a list of possible values, for example {good, bad}. These express the possible values the attribute can take, with the following format:

              @attribute attr_name {<nominal1>, <nominal2>, <nominal3>, …}

iv) STRING: a text value. These attributes are declared as follows:

@attribute attr_name string.

v) DATE: Dates are declared as:

@attribute <name> Date [<date format>].

Where <name> is the name of the attribute and <date format> is an optional string, consisting of characters, hyphens, spaces, and time units, that specifies how date values should be parsed. The default format is the ISO-8601 combined date and time format: yyyy-MM-dd'T'HH:mm:ss. Example:

@attribute timestamp DATE "yyyy-MM-dd HH:mm:ss"

vi) RELATIONAL: attributes that contain instances themselves (multi-instance data), declared in the following way:

@attribute <name> relational

  <further attribute definitions>

@end <name>
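
For instance, a small multi-instance declaration might look as follows (a sketch; the attribute names are invented for illustration):

    @attribute bag relational
      @attribute temperature numeric
      @attribute humidity numeric
    @end bag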

 

There are rules for the names used in attribute declarations:

a) Relation names given as strings must be enclosed in double quotes (") if they include spaces.

b) Attribute and relation names cannot start with a character below ASCII \u0021, nor with '{', '}', ',', or '%'.

c) Values that contain spaces must be quoted.

d) The keywords numeric, real, integer, string, and date are case insensitive.

e) Relational data must be enclosed in double quotes (").

 

The second section is the data declaration. It starts with @data on its own line. Each following line represents an instance, with attribute values separated by commas. The attribute values must appear in the same order in which the attributes were declared in the header section. Missing values are represented with a question mark "?". String and nominal attribute values are case sensitive. Any value that contains a space must be quoted. Comments start with the delimiter character "%" and run to the end of the line.

In text classification, ARFF files represent the entire document as a single text attribute of type string. The second attribute to consider is the class attribute, which defines the class the instance belongs to. This attribute can be of type string or nominal. An example of the resulting text file, with the document as a string and a nominal class of two values, is:

                           @relation language

                           @attribute DocumentText string

                           @attribute class {English, Spanish}

                           @data

                           'texto a clasificar aquí…', Spanish

                           'Classify text here…', English

 

 

Data preprocessing.

In this window, data are loaded and may be edited, either manually or by filtering. Filters are methods that modify the data set. Weka has a variety of filters structured hierarchically into supervised and unsupervised, where the root of the hierarchy is weka. These filters are further divided into two categories, attribute filters and instance filters, according to the way they operate on the data.

As pointed out earlier, these techniques are classified according to the input data relationships. Unsupervised learning techniques, as descriptive inductive models, do not know the correct classification; this means the instances do not require an attribute that declares the class. Supervised learning techniques, as predictive inductive models, depend on the class values; this means the instances must contain a class attribute stating the class to which they belong.

The Current relation panel describes the loaded dataset: its name and the number of instances and attributes. The Attributes panel allows attributes to be selected using the All, None, and Invert options, and further provides the option to enter a regular expression. The Selected attribute panel displays information about the selected attribute. At the bottom, a histogram of the attribute selected in Attributes is displayed.

 

Preprocessing for classifying documents

In Weka it is possible to create models that classify documents into previously defined categories. Documents usually need to be converted into "text vectors" before machine learning techniques are applied. The easiest way to represent text for this purpose is as a bag of words, or word vector. [34] Namee, B. (2012). The StringToWordVector filter performs the process of converting a string attribute into a set of attributes that represent the occurrence of words from the full text. The document itself is represented as a text string in a single attribute of type string.

 

StringToWordVector Filter

This is WEKA's fundamental text analysis filter. The class offers abundant natural language processing choices, including stemming of the corpus, custom tokenizers, and various stopword lists. It also calculates term weights such as term frequency and TF-IDF.

StringToWordVector places the class attribute at the top of the attribute list. To change the order, the Reorder filter can be used. All of the filter's natural language processing techniques can be configured. The StringToWordVector filter can be applied in batch mode from the command line as follows:

java -cp /Aplicaciones/weka-3-6-2/weka.jar weka.filters.unsupervised.attribute.StringToWordVector -b -i datos_entrenamiento.arff -o vector_datos_entrenamiento.arff -r datos_prueva.arff -s vector_datos_prueva.arff

The file datos_entrenamiento.arff is the training set, vector_datos_entrenamiento.arff is the training set vector, datos_prueva.arff is the test set, and vector_datos_prueva.arff is the test set vector. The -cp option puts the Weka jar in the class path, -b indicates batch mode, -i specifies the training data file, -o the output file after processing the first file, -r the test file, and -s the output file for the processed test file.
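
The same conversion can also be scripted through Weka's Java API. Below is a minimal sketch, assuming the datos_entrenamiento.arff file from the example above; it relies on the standard DataSource, Filter, and StringToWordVector classes:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class VectorizeDocuments {
        public static void main(String[] args) throws Exception {
            // Load the training documents (a string attribute plus the class attribute)
            Instances train = DataSource.read("datos_entrenamiento.arff");
            train.setClassIndex(train.numAttributes() - 1);

            // Configure the filter on the input format, then transform the data
            StringToWordVector filter = new StringToWordVector();
            filter.setInputFormat(train);
            Instances vector = Filter.useFilter(train, filter);

            System.out.println(vector.numAttributes() + " word attributes created");
        }
    }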

Options can be modified in the user interface by clicking on the filter name beside the Choose button, having previously selected the filter with that button.

The weka.filters.unsupervised.attribute.StringToWordVector window shows the following options, which can be modified according to the needs of the documents to be classified. The options are:

IDFTransform

TFTransform

attributeIndices

attributeNamePrefix

doNotOperateOnPerClassBasis

invertSelection

lowerCaseTokens

minTermFreq

normalizeDocLength

outputWordCounts

periodicPruning

stemmer

stopwords

tokenizer

useStoplist

wordsToKeep

 

Weka.sourcearchive.com [39] presents a mind map of these Weka options, shown in the following illustration:

 

wordsToKeep

Defines the limit N of words to keep per class, if there is a class attribute. Only the N most common terms among all the attribute's string values will be kept. Higher values mean lower efficiency, because learning the model will take more time.

doNotOperateOnPerClassBasis

Flag to keep all relevant words for all classes. When set to true, the maximum number of words and the minimum term frequency are not applied per class attribute value; instead they are based on all classes together.

TFTransform

Term frequency (TF) transformation: when the flag is set to true, this filter transforms word counts into term-frequency scores, used when representing textual data in a vector space. TF is a numerical measure of the relevance of a word in a text; combined with the inverse document frequency (IDF) described below, it considers not only the relevance of a term in a single document but also its relevance across the entire collection of documents.

Mathematically, the transformation is represented as the function TF(t, d), which for term t in document d computes log(1 + f_td), where f_td is the frequency of word t in the instance (document) d.

IDFTransform

Inverse document frequency (IDF) transformation: setting the flag to "true" applies the following equation, writing the frequency of word t in instance (document) d as f_td:

                           f_td * log(nº of documents / nº of documents with word t)

This is explained by considering the set D of all documents in the collection, represented as D = {d1, d2, …, dn}. The transformation weights each word by how rare it is across the collection: fij * log(nº of documents / nº of documents with word i), where fij is the frequency of word i in document j.

Multiplying TF by IDF assigns more weight to terms with high frequency in a document that are at the same time relatively rare in the collection of documents. [33] Salton, G., Wong, A., & Yang, C. (1975).

outputWordCounts

Counts word occurrences in the string; the default setting reports only presence or absence as 0/1. The result is a vector where each dimension is a different word, and the value in each dimension is a binary 0 or 1 saying whether or not the word appears in that document.

The frequency of the word in the document is represented as an integer by setting the IDFTransform and TFTransform options to "False" and outputWordCounts to "True".

This enables an explicit word count. It is set to "false" when only the presence of a term matters, not its frequency.

To calculate tf * IDF, IDFTransform must be set to True, TFTransform to False, and outputWordCounts to True.

To achieve log(1 + tf) * IDF, TFTransform must also be set to True.
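
As a sketch, the combinations above map onto the filter's setters (standard StringToWordVector methods; the filter variable is the instance being configured, as in the earlier example):

    // tf * IDF weighting: counts on, IDF on, TF transform off
    filter.setOutputWordCounts(true);
    filter.setIDFTransform(true);
    filter.setTFTransform(false);
    // for log(1 + tf) * IDF, additionally: filter.setTFTransform(true);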

 

normalizeDocLength

Set to true to determine whether the word frequencies in an instance should be normalized. Normalization is calculated as: actual value * average document length / document length. This option has three settings. The first is "No normalization". The second is "Normalize all data", which puts the measures taken over the various documents on a common scale. The third is "Normalize test data only". With IDFTransform and TFTransform set to "True" and normalizeDocLength set to "Normalize all data", each word carries a real value: the TF-IDF score of the word in that document.

 

Stemmer

Selects the stemming algorithm to apply to the words. Weka supports four stemmers by default: LovinsStemmer, its iterated version IteratedLovinsStemmer, NullStemmer, and a SnowballStemmer wrapper. LovinsStemmer is a set of transformation rules for changing word endings, covering present participles, irregular plurals, and other English morphology; IteratedLovinsStemmer applies the same rules repeatedly. NullStemmer performs no stemming at all. SnowballStemmer uses standard vocabularies of words and their equivalent roots.

New stemming algorithms can easily be added to Weka because it provides a wrapper class for Snowball stemmers, such as the Spanish one. Weka does not ship the Snowball algorithms themselves, but they can easily be included at the location of the weka.core.stemmers.SnowballStemmer class.

Snowball is a string processing language designed for creating stemmers. There are three ways to get these algorithms: the first is to install the unofficial package; the second is to add the pre-compiled snowball-20051019.jar to the class location; the third is to compile the latest stemmers yourself from snowball-20051019.zip. The algorithms are at snowball.tartarus.org, which has a stemmer in Spanish. At the following link you can see examples and download this stemmer: http://snowball.tartarus.org/algorithms/spanish/stemmer.html

The Snowball Spanish stemming algorithm comes from snowball.tartarus.org. It defines the usual R1 and R2 regions. Furthermore, RV is defined as the region after the next vowel if the second letter is a consonant; as the region after the next consonant if the first two letters are vowels; or otherwise as the region after the third letter. If none of these positions exist, RV is the end of the word.

Step 0: Search for the longest attached pronoun among the following suffixes: me, se, sela, selo, selas, selos, la, le, lo, las, les, los, nos, and remove it if it comes after one of the gerund or infinitive endings: iéndo, ándo, ár, ér, ír, ando, iendo, ar, er, ir.

Step 1: Search for the longest standard suffix and delete it.

Step 2: If no suffix was removed in step 1, search for verb suffixes and remove them.

Step 3: Search for the longest among the residual suffixes (os, a, o, á, í, ó, e, é) in RV and delete it.

Step 4: Remove acute accents. [36]

For more information about the suffixes in steps 1 and 2, see the Snowball page: http://snowball.tartarus.org/algorithms/spanish/stemmer.html.

 

The previous algorithm is made available in Weka by launching it with the following command on Windows:

java -classpath "weka.jar;snowball-20051019.jar" weka.gui.GUIChooser

For Linux:

java -classpath "weka.jar:snowball-20051019.jar" weka.gui.GUIChooser

[37] Weka.wikispaces.com ,. (2015).

The snowball-20051019.jar must previously be compiled and stored in the location where Weka is installed on the computer.

The installation may be confirmed with the command:

java weka.core.SystemInfo

As shown in the figure below.
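
Once the Snowball jar is on the classpath, the Spanish stemmer can be tried directly from Java. A minimal sketch (the sample words are only illustrative):

    import weka.core.stemmers.SnowballStemmer;

    public class SpanishStemDemo {
        public static void main(String[] args) {
            // Select the Spanish Snowball algorithm by name
            SnowballStemmer stemmer = new SnowballStemmer("spanish");
            // Print the stemmed (root) forms of two sample words
            System.out.println(stemmer.stem("clasificaciones"));
            System.out.println(stemmer.stem("comiendo"));
        }
    }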

 

 

Stopwords

These are terms that are widespread, appear very frequently, and do not provide information about a text. This option determines whether a substring in the text is a stopword. Stopword terms come from a predefined list. This option converts all words to lowercase before removal. Removing stopwords is pertinent for eliminating meaningless words from the text and keeping frequent but uninformative words out of decision trees. Weka's default stopwords are based on the Rainbow lists found at the following link: http://www.cs.cmu.edu/~mccallum/bow/rainbow/.

Rainbow is a program that performs statistical text classification, based on the Bow library. [38] Cs.cmu.edu, (2015). The format of these lists is one word per line; comment lines must start with '#' to be omitted. WEKA comes configured with an English stopword list, but different stopword lists can be set. The list can be changed from the user interface by clicking on the stopwords option: Weka by default uses the list in weka-3-6, but any location pointing to a desired list can be chosen. Rainbow has separate lists for English and Spanish; to handle both languages, the "ES-stopwords" list used here adds both Rainbow lists together.

 

useStoplist:

Flag to use the stopword list. If set to "True", the filter ignores the words that appear in the predefined stopword list from the previous option.
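
A brief sketch of wiring the combined list into the filter, assuming an ES-stopwords file as described above and the Weka 3.6 API:

    import java.io.File;

    // filter is the StringToWordVector instance being configured
    filter.setUseStoplist(true);                   // enable stopword removal
    filter.setStopwords(new File("ES-stopwords")); // combined English/Spanish list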

 

Tokenizer:

Chooses the unit used to split each text attribute of the ARFF file into tokens. This has three sub-options. The first is AlphabeticTokenizer, where tokens are continuous sequences of alphabetic characters, with no configurable options; it only considers the English alphabet when tokenizing. The second is the WordTokenizer option, which establishes a list of delimiters. As referenced previously, punctuation in Spanish is ";:.¿?¡! – () [] ' " « »"; Spanish, unlike English, uses an opening sign as well as a closing one for questions and exclamations.

The third is NGramTokenizer, which divides the original text string into subsets of consecutive words that form patterns with a meaning of their own. Its parameters are the "delimiters" to use, with default ' \r\n\t.,;:"()?!', NGramMaxSize, the maximum size of the n-gram, with a default value of 3, and NGramMinSize, the minimum size of the n-gram, with a default value of 1. N-grams can help uncover patterns of words that together represent a meaningful context. An example configuration is sketched below.
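
For example, a trigram tokenizer with custom delimiters could be attached to the filter as follows (a sketch; the delimiter string is abbreviated):

    import weka.core.tokenizers.NGramTokenizer;

    NGramTokenizer tokenizer = new NGramTokenizer();
    tokenizer.setNGramMinSize(1);                 // shortest pattern: single words
    tokenizer.setNGramMaxSize(3);                 // longest pattern: trigrams
    tokenizer.setDelimiters(" \r\n\t.,;:\"()?!"); // delimiters, including punctuation
    filter.setTokenizer(tokenizer);               // filter is the StringToWordVector instance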

minTermFreq:

Sets the minimum frequency a word or term must have to be kept as an attribute; the default is 1. When there is a class attribute and the doNotOperateOnPerClassBasis flag has not been set to true, the entire string text for each particular class value is tokenized, and the frequency of each token is calculated based on its frequency within that class. In contrast, if there is no class attribute, the filter builds a single dictionary, and the frequency is calculated over the entire set of string values of the chosen attribute, not only those related to a particular class value.

periodicPruning

Periodically removes low-frequency words from the dictionary. Its numerical value is specified as a percentage of the input data set and sets how often the dictionary is pruned; the default value is -1, meaning no periodic pruning. For example, a value of 15 specifies that the dictionary is pruned after each 15% of the input data is processed. This is an alternative to building a comprehensive dictionary first and pruning it afterwards, an approach for which there may not be enough memory.

 

attributeNamePrefix

Sets the prefix for the names of the created attributes; by default it is the empty string "". This simply provides a prefix added to the names of the attributes that the StringToWordVector filter creates when the document is fragmented.

 

lowerCaseTokens

When this flag is set to "True", all words in the document are converted to lowercase before being added to the dictionary. Setting it to true removes the distinction between words that begin with an uppercase letter, such as names or sentence-initial words, and the rest. Acronyms can be taken into account when this option is set to "False".

attributeIndices

Sets the range of attributes the filter acts on. The default is first-last, which ensures all attributes are treated, from first to last. The range is specified as a comma-separated list of attribute indices or index ranges.

invertSelection

Flag to invert the attribute selection. Set to "True", the filter works only with the attributes not selected in the range. The default value is "False", which works with the selected attributes.

After cleaning the data in the "Preprocess" tab, the vector attributes are analyzed in the "Classify" tab to obtain the desired knowledge.

 

Classification

The second panel of Explorer is "Classify", where a model is generated by machine learning from the training data. These models serve as a clear explanation of the structure found in the analyzed information. Weka is especially known for the J48 decision tree model, the most popular for text classification. J48 is the Java implementation of the C4.5 algorithm, previously described as the algorithm in which each branch represents one of the possible choices, in if-then format, and each leaf represents a result. The C4.5 algorithm can be summarized as measuring the amount of information contained in a data set and grouping attributes by importance: the idea of how important a given attribute is in a dataset. J48 recursively prints the tree structure into a variable of type string by accessing the information stored in each attribute node.
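
The information measure referred to here is the standard entropy-based gain of C4.5. As a reference, for a set S with class proportions p_i and a candidate attribute A:

                           Entropy(S) = - Σ p_i * log2(p_i)

                           Gain(S, A) = Entropy(S) - Σ (|S_v| / |S|) * Entropy(S_v)

where S_v is the subset of S taking value v for attribute A; C4.5 normalizes this gain by the split information, producing the gain ratio used to rank attributes.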

To create a classification, you must first choose the classifier algorithm with the "Choose" button located in the upper left side of the window. This button displays a tree whose root is weka and which contains the subfolder "classifiers". Within weka.classifiers.trees, tree models such as J48 and REPTree are found; REPTree is a fast decision tree learner based on reduced-error pruning. To access a classifier's options, double-click the name of the selected classifier.

“Test Options”.

The classification has four main modes, plus further options to manage the training data. These are found in the "Test Options" section, with the following choices:

  a) Use training set: trains the method with all available data and then applies the results on the same data collection.
  b) Supplied test set: selects a test data set from a file or URL. This set must be compatible with the initial data and is selected by pressing the "Set" button.
  c) Cross-validation: performs a cross-validation depending on the number of "Folds" selected. Cross-validation specifies a number of partitions, which determines how many temporary models will be created (Folds). One part is selected, then a classifier is built from all the other parts, and the selected part is kept for testing. [32] Garcia, D., (2006). A minimal Java sketch of this mode is shown after this list.
  d) Percentage Split: defines the percentage of the total input from which the classifier model is built; the remaining part is used for testing.
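
As mentioned in option c), cross-validation can also be run through the Java API. A minimal sketch using the standard Evaluation class (the file name is the vector set from the earlier example):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CrossValidate {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("vector_datos_entrenamiento.arff");
            data.setClassIndex(0); // StringToWordVector places the class attribute first

            // 10-fold cross-validation of J48 with random seed 1
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));

            System.out.println(eval.toSummaryString());
            System.out.println(eval.toMatrixString()); // confusion matrix
        }
    }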

 

Weka allows several further options for defining the test method through the "More Options" button. These are:

Output model: displays the classifier model in the output window.

Output per-class stats: displays statistics for each class.

Output entropy evaluation measures: displays entropy-based evaluation measures in the results.

Output confusion matrix: displays the confusion matrix resulting from the classifier.

Store predictions for visualization: Weka keeps the classifier's predictions on the test data. With this option, the J48 classifier can visualize its classification errors on the tree.

Output predictions: shows a table of the actual and predicted values for each instance of the test data, stating how the classifier treats each instance.

Output additional attributes: set to display the values of other attributes, not only the class. A range of attributes is specified to be included alongside the actual and predicted class values.

Cost-sensitive evaluation: produces additional information in the evaluation output: the total cost and the average cost of misclassification.

Random seed for xval / % Split: specifies the random seed used to shuffle the data before they are divided for evaluation purposes.

Preserve order for % Split: retains the order of the data for the percentage split instead of shuffling them first with the random seed, whose default value is 1.

Output source code: generates the Java source code of the model produced by the classifier.

 

When there is no independent evaluation data set, it is necessary to select the correct option to obtain a reasonably accurate idea of the quality of the generated model. For classifying documents, it is recommended to select at least 10 "Folds" for the cross-validation approach, or to allocate a large percentage to "Percentage Split".

Below the "Test Options" section there is a menu listing all the attributes. It allows you to select the attribute that acts as the classification target; for document classification, this is the class to which each instance belongs.

The classification starts by pressing the "Start" button. The image of the Weka bird at the bottom right will dance until the classifier completes.

WEKA creates a graphical representation of the J48 classification tree. The tree can be viewed by right-clicking on the last result set in the "Result list" and selecting the "Visualize tree" option. The window size can be adjusted by right-clicking and selecting "Fit to screen".

 

The J48 classifier for classifying documents

The J48 model uses the C4.5 decision tree algorithm to build a model from the selected training data. This algorithm is found in weka.classifiers.trees. The J48 classifier has several parameters that can be edited by double-clicking on the name of the selected classifier.

J48 employs two pruning methods; it does not perform reduced-error pruning by default. The main objectives of pruning are to make the tree easier to understand and to reduce the risk of overfitting the training data, where the tree classifies the training data almost perfectly by learning its specific properties rather than the underlying concept.

The first J48 pruning method is known as subtree replacement. Nodes in a decision tree are replaced with a leaf, reducing the number of nodes in a branch. This process starts from the fully formed leaves and works upwards towards the root.

The second is subtree raising. A node is moved up towards the tree root, replacing other nodes in the branch. Normally this process has a non-negligible cost, and it is wise to turn it off when the induction process takes too long.

Clicking on the name of the J48 classifier, located right next to the "Choose" button, displays a window with the following editable options:

confidenceFactor: sets the confidence threshold used for pruning. Lower values produce more pruning. Reducing this value may reduce the size of the trees and also helps remove irrelevant nodes that generate misclassification. [40] Drazin, S., & Montag, M. (2015).

minNumObj: sets the minimum number of instances per leaf, useful to control trees with many branches.

unpruned: flag to disable pruning. When true, the tree is not pruned. The default is "False", which means pruning is carried out.

reducedErrorPruning: flag to use reduced-error pruning instead of standard C4.5 pruning. This post-pruning method estimates errors on a held-out set. Subtree raising may still be applied, but the confidence level is not used for pruning.

seed: the seed used to shuffle the data randomly for reduced-error pruning. It is considered when the reducedErrorPruning flag is set to "True". The default seed is 1.

numFolds: sets the number of folds used for reduced-error pruning: one fold is retained for pruning and the rest are used for training. For these folds to be used, the reducedErrorPruning flag must be set to "True".

 

binarySplits: when this flag is set to "True", only two branches are created for nominal attributes with multiple values, instead of one branch per value. When the nominal attribute is binary there is no difference, except in how the attribute is shown in the output. The default is "False".

saveInstanceData: flag set to “True” to store training data for its visualization. The default is “False”.

subtreeRaising: flag to prune with the subtree raising method, which moves a node up towards the tree root, replacing other nodes. When "True", Weka considers subtree raising during pruning.

useLaplace: flag to smooth the counts at the leaves. When set to "True", Weka smooths the class probability estimates at the leaves using Laplace smoothing, a popular complement to probability estimation.

debug: flag to add information to the console. When "True", the classifier writes additional information to the console.
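
These options map directly onto J48's setters. A sketch matching the adjustment used later in this chapter (minNumObj = 1):

    import weka.classifiers.trees.J48;

    J48 tree = new J48();
    tree.setConfidenceFactor(0.25f); // default pruning confidence
    tree.setMinNumObj(1);            // minimum number of instances per leaf
    tree.setUnpruned(false);         // keep pruning enabled
    tree.setSubtreeRaising(true);    // allow the subtree raising method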

 

 

Results Evaluation

Weka describes the proportion of correctly and erroneously predicted instances with the Fβ score. This value combines precision and recall. Precision measures the percentage of positive predictions that are truly positive; recall is the ability to detect positive cases out of the total of all positive cases.
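
In terms of true positives (TP), false positives (FP), and false negatives (FN), the standard definitions are:

                           Precision = TP / (TP + FP)

                           Recall = TP / (TP + FN)

                           F = 2 * Precision * Recall / (Precision + Recall)

The Fβ score weights recall β times as much as precision; the F-measure reported by Weka corresponds to β = 1.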

With these percentages, the best model is expected to be the one whose F-measure value is closest to 1. The following table shows some significant combinations of preprocessing options for model generation. This comparison table lists their precision and recall measures as well as their F-measure.

First, the filter options are analyzed with default values for the J48 classifier, and the best parameters are selected. Then the best settings for the J48 classifier algorithm are chosen using the best configuration of the StringToWordVector filter.

 

Comparison table: Documents classification models.

Features | Precision | Recall | F-Measure
Word Tokenizer English Spanish (E&S) | 0.810 | 0.803 | 0.800
Word Tokenizer E&S + Lower Case Conversion | 0.863 | 0.859 | 0.860
Trigrams E&S + Lower Case Conversion | 0.823 | 0.775 | 0.754
Stemming + Word Tokenizer E&S + Lower Case Conversion | 0.864 | 0.817 | 0.823
Stopwords + Word Tokenizer E&S + Lower Case Conversion | 0.976 | 0.972 | 0.972
Stopwords + Stemming + Word Tokenizer E&S + Lower Case Conversion | 0.974 | 0.972 | 0.971
Stopwords + Word Tokenizer E&S + Lower Case Conversion + J48 minNumObj = 1 | 1 | 1 | 1

 

 

In conclusion, the best model is the combination of the options Stopwords + Word Tokenizer E&S + Lower Case Conversion applied to the filter during data preprocessing, further adjusting minNumObj to 1 on the J48 classifier algorithm.

The following confusion matrix results from the combination of Stopwords + Word Tokenizer E&S + Lower Case Conversion, with minNumObj adjusted to 1 on the J48 algorithm. This generates the following values in the confusion matrix.

 

a b c d e f <-- classified as
11 0 0 0 0 0 a = Hemodialysis
0 12 0 0 0 0 b = Nutrition
0 0 12 0 0 0 c = Cancer
0 0 0 12 0 0 d = Obesity
0 0 0 0 12 0 e = Diet
0 0 0 0 0 12 f = Diabetes

 

The matrix shows every class with precision and recall at 100%. The accuracy values for each class are as follows:

 

 Class TP Rate FP Rate Precision Recall F-Measure
Hemodialysis 1 0 1 1 1
Nutrition 1 0 1 1 1
Cancer 1 0 1 1 1
Obesity 1 0 1 1 1
Diet 1 0 1 1 1
Diabetes 1 0 1 1 1
Weighted Avg. 1 0 1 1 1

 

References

Witten, I. H., Frank, E., & Hall, M. A. (2011). Data Mining: Practical Machine Learning Tools and Techniques (3rd ed.). s.l.: Elsevier.

Cs.waikato.ac.nz. (2015). Weka 3 – Data Mining with Open Source Machine Learning Software in Java. Retrieved 5 May 2015, from http://www.cs.waikato.ac.nz/~ml/weka/

Shams, R. (2015). Weka Tutorial 31: Document Classification 1 (Application). YouTube. Retrieved 15 May 2015, from https://www.youtube.com/watch?v=jSZ9jQy1sfE

Shams, R. (2015). Weka Tutorial 32: Document Classification 2 (Application). YouTube. Retrieved 15 May 2015, from https://www.youtube.com/watch?v=zlVJ2_N_Olo

Rodríguez, J., Calot, E., & Merlino, H. (2014). Clasificación de prescripciones médicas en español. Sedici.unlp.edu.ar. Retrieved 15 May 2015, from http://sedici.unlp.edu.ar/handle/10915/42402

Weinberg, B. (2015). Weka Text Classification for First Time & Beginner Users. YouTube. Retrieved 15 May 2015, from https://www.youtube.com/watch?v=IY29uC4uem8

Nlm.nih.gov. (2015). PubMed Tutorial – Building the Search – How It Works – Stopwords. Retrieved 18 May 2015, from http://www.nlm.nih.gov/bsd/disted/pubmedtutorial/020_170.html

 



By Valeria Guevara