Text Mining



Knowledge is stored primarily in text documents such as books, magazines, newspapers, journals, articles, emails, web pages and letters. This has created the need to classify documents and organize their information in new and relevant ways instead of traditional sorted lists. When researching a specific topic, you will find a host of information that has been ranked as relevant. However, it takes many hours to identify the documents with the right context among millions of candidates. To ensure accurate organization, documents used to be categorized manually, which makes it interesting to know how to organize texts automatically with excellent classification performance. As a result, text mining opened new opportunities to understand and analyze documents. Text mining emerged with the purpose of extracting, analyzing and processing texts from large data sets, and of presenting the results so that new knowledge can be understood. Manuel Montes-y-Gómez defines text mining as the process of discovering interesting patterns and new knowledge in a collection of texts; that is, text mining is the process responsible for discovering knowledge that is not explicitly stated in any text of the collection but that arises when their contents are related ([3] Hearst, 1999; [4] Kodratoff, 1999). In summary, text mining looks for patterns and extracts new ideas from the analysis of large collections of documents in order to gain new knowledge. Its purpose is to discover interesting groups, trends, associations and deviation patterns, and to display them so that new findings can be derived.

Relation of text mining to other disciplines.
Based on the above definition, text mining aims to learn from text data, just as other techniques such as data mining seek knowledge from a data set. The terms KDD, data mining and text mining are often confused because of the similarity of their definitions. Knowledge discovery in databases (KDD) has been defined as the process of identifying usable patterns in data. It handles the data in several steps in order to extract their relationships, and the first steps are to understand and analyze the raw data. Text mining needs these same steps in order to convert text data into a format suitable for analysis. Data mining is the extraction of patterns from data that produce information from which new knowledge is generated. Text mining, in contrast, looks for patterns in text data in order to generate knowledge about a text, so text mining can be considered a subset of data mining. Data mining relies on machine learning techniques. Machine learning is described by Witten, Frank and Hall as the process of abstraction that takes the data and infers a structure that represents them [1]. It therefore provides the methods that define data classification algorithms. Text mining uses these algorithms, which learn from a group of examples or “training set” and then classify new texts into the categories analyzed. These algorithms are assumed to express the text as numerical values in a vector whose components are the weights of the terms found in the text ([2] Berry, M. W., & Kogan, J., 2010). In conclusion, these terms may look like synonyms, but they are in fact mutually dependent processes.
There are other related techniques based on text processing, such as computational linguistics and information retrieval. Information retrieval finds the documents relevant to a query and establishes mechanisms to satisfy the user's information needs. Unlike text mining, it does not facilitate the analysis process or the extraction of new knowledge. Computational linguistics, in turn, studies natural language with computational methods in order to make it understandable to the computer. This discipline uses syntactic and grammatical analysis to comprehend language. Text is processed into an electronic format that allows, for example, the identification of similar texts written in different languages. Although text mining has different objectives from computational linguistics, it adopts some of its techniques.

Text mining applications
Text mining tools are important because they support the analysis of information collected in large volumes of documents. The purpose of these tools is to provide new knowledge. Their functions include:

  • Feature extraction: the automatic recognition of facts in documents. It attempts to identify references to names of people, institutions, events and existing authorities, and the relationships among them.
  • Grouping or clustering: groups similar documents without prior knowledge of the clusters. That means the groups are defined by the software and not taken from a list of predefined classes. Similarity is established from the terminology found in each text, which is what allows classes or categories to be formed. This automatic grouping facilitates the understanding of documents by giving an overview of the texts. It can also be used to evaluate the relevance of the documents in each group, to identify unknown relationships and potential duplicates, and to optimize the organization of results. [7] (Brun & Senso, 2004).
  • Automatic categorization: determines the subject matter of a document collection. Unlike clustering, it chooses the class to which a document belongs from a list of predefined classes. Examples include detecting spam emails and automatically tagging streams of articles. Each category is trained through a previous manual categorization process. Classification starts with a training set of previously classified documents and creates a classification model, based on that set of examples, that is able to assign the right class to a new document. [8] Hotho, A., Nürnberger, A., & Paaß, G. (2005). The model bases the assignment on the similarity between the new document and the training documents, which is commonly calculated from the terms that the new document shares with the training documents.
  • Discovery of associations and deviations: detects associations at different levels of generalization, as well as deviations, depending on subsets of the collection. They are based on conceptual clustering within the text. [5] Montes-y-Gómez, M. The objective is to find implications between the characteristics of a text and its membership of a class. To find deviations, rare or unusual implications within the analyzed text are identified.
  • Trend analysis: refers to the detection of emerging topics in the texts. [9] Streibel, (2010). This analysis observes patterns of change based on specific variables, which are defined by numbers, words, people or places. Emerging trends are the topics found in the text that gain interest and utility over a period of time.
  • Strategic or competitive intelligence: identifies competitors' advantages to support decision making. It analyzes data and discovers the patterns, strengths and strategies revealed in documents about competing businesses. It allows us to anticipate the activities of competitors and visualize potential areas of action. The use of competitive intelligence, with data mining tools that analyze the social media of companies and their competition, can produce findings that help companies make decisions that will improve their competitive advantage. [10] Gémar, G., & Jiménez-Quintero, J. A. (2015).
  • Identification of main ideas: recognizes and extracts the main ideas or themes addressed by the document collection. Unlike document categorization, it extracts the terms that are representative of the text without assigning them to a class. An idea is identified by looking at the occurrence of terms and combinations of terms in the documents. As each idea is identified, conceptual networks are created across the documents dealing with the same theme.
  • Automatic summarization: summaries are generated by extracting sentences from the original document without editing them. The extraction is based on the statistical frequency of the terms found, as well as on the position of the sentences in the text. It facilitates the analysis of large document collections.
  • Document visualization: an interface that displays text in a format that facilitates the interpretation and navigation of text collections. It allows the user to navigate between the results obtained from the analyzed documents.
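
To make one of the functions listed above more concrete, here is a minimal Python sketch of extractive summarization that scores sentences by term frequency plus a small bonus for position. The function names, the naive sentence splitter and the scoring formula are assumptions made for illustration; they are not taken from any of the cited tools.

```python
import re
from collections import Counter

def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (a deliberately naive rule)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def summarize(text, max_sentences=1):
    """Return the highest-scoring sentences, unedited and in their original order."""
    sentences = split_sentences(text)
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(index):
        tokens = re.findall(r"[a-z]+", sentences[index].lower())
        # Average term frequency of the sentence (assumed scoring rule).
        frequency_score = sum(freq[t] for t in tokens) / max(len(tokens), 1)
        # Earlier sentences get a slightly higher score (assumed position rule).
        position_bonus = 1.0 / (index + 1)
        return frequency_score + position_bonus

    ranked = sorted(range(len(sentences)), key=score, reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return [sentences[i] for i in chosen]

text = ("Text mining finds patterns in text. The weather was nice. "
        "Mining text collections yields knowledge about text.")
summary = summarize(text, max_sentences=2)
```

Because the sentences are copied verbatim, the summary stays faithful to the source, at the cost of possibly reading less smoothly than a written abstract.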

Techniques of text mining.

As mentioned above, data mining uses machine learning techniques. Text mining, as a subset of data mining, adopts these techniques to identify and understand patterns of new information. Learning techniques are classified according to how they relate to the input data. The learning styles an algorithm can adopt fall into abductive and inductive. Abductive methods comprise the explanatory methods, also known as analytical learning, which aim to explain the context. Inductive methods are subdivided into descriptive and predictive. Descriptive methods include unsupervised learning and exploratory analysis. Unsupervised learning, or segmentation, aims to detect groups and label the entries of a set of observations without knowing the correct classification, for example which groups exist and how many there are. Exploratory analysis detects correlations, associations and dependencies, for example anomalous values. Predictive inductive methods include interpolation, sequential prediction and supervised learning. Interpolation estimates a continuous function of several dimensions, for example f(2,3) = ?. Sequential prediction takes sequentially ordered observations and predicts the next value of the sequence, e.g. 1, 2, 3, 5, 7, ?. Supervised learning learns a classifier from observations paired with their class values, e.g. 1,3 -> Yes; 2,5 -> Yes; 4,1 -> No; 3,9 -> ?. [13] Hernández, J., Ramírez, M. J., & Ferri, C. (2004).

For automatic knowledge extraction, predictive supervised learning is employed, in which the knowledge base consists of labeled examples. These techniques are subdivided depending on whether the information is qualitative or quantitative. A function is estimated when the desired values correspond to the labels of each class; this is called classification, because the information is qualitative and the classes are disjoint. A correspondence is estimated when the information is quantitative and the classes can overlap; this is known as categorization, and its output may belong to one class or more. Classification techniques use various methods. These include k-NN (nearest neighbor), k-means (competitive learning), decision tree learning, Bayes classifiers and Support Vector Machines, among others.
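
As an illustration of supervised classification from labeled examples, the following Python sketch implements a very small k-NN classifier. The training points are a toy data set of two-dimensional observations labeled Yes or No, and the function name `knn_classify` and the use of Euclidean distance are assumptions made for this example.

```python
from collections import Counter
from math import dist  # Euclidean distance, available from Python 3.8

def knn_classify(training, query, k=1):
    """Return the majority label among the k training points closest to query."""
    neighbors = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy labeled observations: (features, class).
training = [((1, 3), "Yes"), ((2, 5), "Yes"), ((4, 1), "No")]

# The unlabeled observation (3, 9) is closest to (2, 5), so it is labeled "Yes".
prediction = knn_classify(training, (3, 9), k=1)
```

k-NN keeps the whole training set and defers all work to prediction time, which makes it easy to sketch but costly on large collections.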

Classification trees, also known as decision trees, represent the knowledge about a classification problem using a tree structure. They are often used in decision analysis because they help to identify the strategy most likely to succeed. Most classification tree algorithms start from a data set containing labeled patterns. Each labeled pattern is characterized by several predictor variables and a class; the variables hold the values of the attributes in the data. The algorithm places the assigned class values in the leaves of the tree. The resulting set of rules begins at the root node, which asks for the value of one variable. The branches leaving the root node correspond to all the possible values that this variable can take. Classification descends through the answers to each rule until it reaches a child node, and a path can follow only a single branch at each node. In the subtree that is reached, the next rule, or parent node, is evaluated in the same way. When a childless node (a leaf) is reached, the class to which the instance belongs is defined. The result is a tree that can be represented as a set of rules. One of the most popular algorithms, ID3, was introduced by Quinlan, a computer engineer, in 1986 [11]. In 1993 John Quinlan [12] proposed the C4.5 algorithm to improve on his previous work with ID3. C4.5 removes branches that do not provide conclusive decisions, manages attributes with different costs, and handles unknown values by treating them as missing attributes.
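
The text does not say how a tree algorithm chooses which variable to ask about at each node. ID3 is commonly described as picking the attribute with the highest information gain, that is, the largest reduction in class entropy. The Python sketch below computes that quantity; the attribute names and the four sample rows are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on attribute."""
    total = len(rows)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

# Hypothetical labeled patterns: two predictor variables and a class.
rows = [
    {"contains_offer": "yes", "has_greeting": "yes"},
    {"contains_offer": "yes", "has_greeting": "no"},
    {"contains_offer": "no", "has_greeting": "yes"},
    {"contains_offer": "no", "has_greeting": "no"},
]
labels = ["spam", "spam", "ham", "ham"]

# contains_offer separates the classes perfectly; has_greeting tells us nothing.
best = max(rows[0], key=lambda a: information_gain(rows, labels, a))
```

Here `best` is the attribute that would be placed at the root node under this criterion.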



Text Mining Methodology

Nong Ye, in “The Handbook of Data Mining”, describes data mining in four stages. The first stage is the collection of data. The second is the preparation of the data. The third is to measure the quality of the data and evaluate the results. Finally, the knowledge generated is displayed. [14] (Ye, 2003). Meanwhile, Gary Miner, in “Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications”, states that it is crucial to collect, organize, analyze and assimilate information. Miner proposes three activities, each with subtasks that go deeper into the information generated, and offers a detailed diagram of text mining. The first activity is document collection: texts from a specific domain are collected and organized, and the corpus of the collection is established. The second activity is data preprocessing, or data structuring, which is responsible for giving structure to the corpus from the first activity. Finally, knowledge is extracted: this last activity is responsible for discovering patterns in the previously processed data. At this stage feedback can be given to the first and second activities in the form of corrections and/or adjustments. Patterns and associations are then represented and displayed. [15] (Miner, 2012).

The diagram that Gary Miner presents describes the methodology of text mining in a way that is relevant for practice. Uysal and Gunal describe a text mining framework with the following stages: preprocessing, feature extraction, feature selection and classification. [17] Uysal, & Gunal, (2014). They do not consider data collection as an activity, as Gary Miner does for text mining and Nong Ye does for data mining. For the purposes of this project, a text mining methodology for text classification that includes data collection is adopted. This framework has four stages: data acquisition, document preprocessing, information extraction and evaluation of results. Witten, Frank and Hall mention these steps in their work on the use of the WEKA tool. Witten, I. H., Frank, E., & Hall, M. A. (2011).

Data acquisition.

This first stage begins by creating a mechanism for collecting the texts. Data should be collected in a way that makes it possible to build a training set. The right texts must be selected by deciding how relevant they are to the problem at hand and to the purpose of generating knowledge. This relevance depends on the needs of the algorithm and on the business problem. The collected input data should be stored in a form that allows it to be processed.

Input data.

Machine learning techniques operate on different forms of input data. Witten, Frank and Hall describe three kinds of input in data mining: concepts, instances and attributes. The concept specifies what is to be learned; it is derived from the group of classified examples that represent the learning problem. An instance is a data item that is to be classified, associated or clustered; it is an individual, independent example. Instances are characterized by a set of specific features called attributes, and the value of an attribute for an instance is a measurement of that feature for that instance. Given the different nature of the data, the possible values of attributes are usually classified as nominal or ordinal. Nominal attributes, also known as categorical, take values from a finite set of distinct symbols, for example labels, names or places. Ordinal attributes, also referred to as continuous or numeric, represent measurements with a meaningful order; they make it possible to handle an order but not a distance, for example low < medium < high. S. S. Stevens proposed in 1946 a partition of measurement scales for statistical classification processes. [16] Stevens, S. (1946). Data mining adopts this hierarchy to classify each attribute type correctly. The four measurement levels are the nominal and ordinal levels already mentioned, plus interval and ratio. Interval attributes are measured on metric scales with equal, constant distances between values. They are measured on a linear scale where zero is arbitrary and values can be positive or negative, for example temperatures in Fahrenheit or Celsius. Finally, ratio attributes are interval attributes in which the zero position represents null or nothing, for example weight, height or pulse. Text mining, as a subset of data mining, uses this classification of input data to meet its objectives.

Text mining needs to gather and specify a data collection (a set of instances). These examples have to contain nominal attributes to represent text, and the instances should be integrated cleanly and clearly into a format made up of a set of nominal attributes. Sets of instances are represented as a matrix of the instances and attributes of the concept. These matrices, also called “datasets”, contain all the examples selected as relevant documents. In particular, the WEKA tool uses a standard format for representing collections of text documents. ARFF files represent instances that share a set of attributes. ARFF (Attribute-Relation File Format) files are divided into three sections: relation, attributes and data. These files will be described later in detail.
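
As a rough illustration of the three ARFF sections, the Python sketch below writes a small labeled text collection as ARFF text, with one string attribute for the document and one nominal attribute for the class. The function name `to_arff`, the attribute names `text` and `class`, and the quoting and escaping rule are assumptions for this example; the WEKA documentation should be consulted for the authoritative format rules.

```python
def to_arff(relation, documents, classes):
    """Build ARFF text with a relation header, two attributes and a data section."""
    lines = [
        f"@relation {relation}",
        "",
        "@attribute text string",                      # the document itself
        "@attribute class {" + ",".join(classes) + "}",  # nominal class attribute
        "",
        "@data",
    ]
    for text, label in documents:
        # Assumed escaping: backslash-escape backslashes and single quotes.
        escaped = text.replace("\\", "\\\\").replace("'", "\\'")
        lines.append(f"'{escaped}',{label}")
    return "\n".join(lines) + "\n"

# Hypothetical two-document collection.
arff = to_arff(
    "documents",
    [("text mining is fun", "tech"), ("the match ended in a draw", "sports")],
    ["tech", "sports"],
)
```

Each row in the data section is one instance, and every instance shares the attributes declared in the header.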

Preprocessing of data.

This step is based on preparing the text, which consists of selecting, cleaning and preprocessing it with a focus on the concept. It provides the basis for applying text mining methods. The step is carried out through a series of operations on the text that generate some kind of structured or semi-structured information for analysis. Text representation is essential for preprocessing. The step is performed on all the documents previously collected, which are cleaned, compressed and transformed into meaningful fragments that provide information.

Representations of text.

Documents need to be represented in a structured way in order to be preprocessed. Natural language texts can be seen as a set of lexical items (words) that, joined by grammar rules, make it possible to build fragments with a meaning (semantics), whose union (coherence) provides knowledge. [21] Muñoz, A., & Alvarez, I. (2014). The most popular way of representing documents is as a vector. The vector contains all the words found in the text and indicates their occurrence in it. This representation is typically used for very large text collections, which generate a large number of values.

Vector Space Model (VSM).

The most widely used vector representation is the vector space model (VSM) proposed by Salton, Wong and Yang. This model represents each document as a sequence (or ordered list) of n nonnegative real numbers. Each term in the text is represented by one coordinate. The coordinate measures the value (weight) of the importance of the term: a higher value represents a term that is very important in the document, and the coordinate with the lowest value represents the least important term. The model also defines the similarity between vectors. [18] (Salton, G., Wong, A., & Yang, C. S., 1975).
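
A minimal Python sketch of the vector space idea follows. It assumes raw term counts as weights and the cosine of the angle between vectors as the similarity measure; both are common choices but are assumptions here, since the text does not fix either. The helper names are hypothetical.

```python
from collections import Counter
from math import sqrt

def build_vocabulary(documents):
    """Sorted list of all distinct terms; each term gets one coordinate."""
    return sorted({word for doc in documents for word in doc.lower().split()})

def to_vector(document, vocabulary):
    """Represent a document as nonnegative term weights (raw counts here)."""
    counts = Counter(document.lower().split())
    return [float(counts[term]) for term in vocabulary]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

documents = [
    "text mining finds patterns",
    "data mining finds patterns",
    "the weather is nice",
]
vocabulary = build_vocabulary(documents)
vectors = [to_vector(doc, vocabulary) for doc in documents]

# The first two documents share three of four terms; the third shares none.
similar = cosine_similarity(vectors[0], vectors[1])
unrelated = cosine_similarity(vectors[0], vectors[2])
```

In practice the raw counts are often replaced by a weighting scheme that discounts terms appearing in most documents.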

Tensor Space Model (TSM).

Besides the vector space model, the tensor space model (TSM) is used. Unlike VSM, TSM represents a text document using higher-order tensors instead of vectors [19].


Text mining makes considerable use of n-grams, or compound terms, which are sequences of n words. They are basically sets of consecutive words in a text, and are also known as statistical phrases. A statistical phrase is a group of two or more words that occur next to each other with sufficient frequency in the documents of the collection. For instance, “text mining” consists of two words, “text” and “mining”, each with its own meaning, that take on a different meaning when they are read together. N-grams have their own names depending on the number n of connected words: n = 1 is called a unigram, n = 2 a bigram and n = 3 a trigram. The algorithm has three parts: separation into tokens, generation of the n-grams, and addition of the n-grams to a data structure, generally a list. [22] Ramesh, B., Xiang, C., & Lee, T. H. (2015). Google and Microsoft, for example, have developed Web-scale n-gram models that are used for spell checking, text summarization and word breaking.
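
The three parts of the n-gram procedure (tokenizing, generating the n-grams, and collecting them in a list) can be sketched in a few lines of Python. The function names and the simple word-character tokenizer are assumptions made for illustration.

```python
import re

def tokenize(text):
    """Part 1: separate the text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())

def ngrams(tokens, n):
    """Parts 2 and 3: generate every run of n consecutive tokens and collect them in a list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("Text mining finds patterns")
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

A text with fewer than n tokens simply yields an empty list.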

Bag of words.

Bag of words is a model that represents a document as a container holding the words found in it. It takes the single words directly as indexing terms. The bag maps terms to the concepts they represent regardless of word order and of grammatical or semantic dependence between the terms. Document classification uses this method because word frequency serves as a feature for training the classifier.
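
A bag of words can be sketched in Python with a counter that maps each word to its frequency. The function name and the letters-only tokenizer are assumptions for this example.

```python
import re
from collections import Counter

def bag_of_words(document):
    """Count each word, ignoring order, grammar and letter case."""
    return Counter(re.findall(r"[a-z]+", document.lower()))

bag = bag_of_words("Text mining mines text")

# Word order is discarded: both phrases produce the same bag.
same_bag = bag_of_words("mining text") == bag_of_words("text mining")
```

The resulting counts are the kind of word-frequency features that a classifier can be trained on.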

Munková, Munk and Vozár conclude that, besides the VSM and TSM vector representations, there are other representations of text: n-grams, Natural Language Processing, Bag of Words and Distributional Word Clusters. However, all these methods consider only the frequency with which words occur in the texts and ignore the significance of the order in which they occur. [20] Munková, D., Munk, M., & Vozár, M. (2013).


Linguistic processing of natural language.

The type of patterns that will be discovered in the collection depends on the type of operations used in this preprocessing of the data. Each application must preprocess the data differently to meet its objective. In the case of automatic document categorization, preprocessing also depends on the knowledge to be discovered. The important preprocessing tasks for automatically categorizing text documents are described next.


A stemming algorithm strips affixes (morphemes) from a word in order to reduce it to its root, and so finds the relationship between inflected words and their lexeme. It reports the linguistic root to which the word belongs.
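
To show the idea of reducing words to a shared root, here is a deliberately naive suffix-stripping sketch in Python. It is not one of the published stemming algorithms; the suffix list, the minimum stem length and the function name are all assumptions chosen for illustration, and the result is often not a real word.

```python
# Hypothetical, far from complete list of English suffixes, longest first.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix, provided enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[:-len(suffix)]
    return word

# Different inflections collapse to the same (not necessarily valid) root.
stems = [naive_stem(w) for w in ("mining", "mined", "mines")]
```

Real stemmers apply many more rules and exceptions, but the effect is the same: several surface forms are indexed under one term.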


Lemmatization is the part of language processing that tries to determine the lemma of each word that appears in the documents. Words are stripped of gender, number, adjectival forms and verb tense. The lemmas are used as indexing terms instead of the original words, which has the advantage of reducing the number of units in the dictionary. Variants of the same term should be standardized to a single form; unlike stemming, lemmatization reports the base form of the word before it is inflected to express tense, mood, person, number, case and gender. [23] Ferilli, Esposito, Grieco (2014). For example: student, studying, study. Lemmatization reduces all words with the same root by means of a knowledge base of the different inflections.

Stop Words.

Empty words, or stop words, are terms that are common and abundant in any type of text and are not informative about its content, for example articles, prepositions and pronouns. Stop words must be eliminated in order to remove terms that do not help to generate knowledge from the text. Stop word removal is a natural language processing technique at the lexical level. [23] Ferilli, Esposito, Grieco (2014). There are predefined stop word lists for each corpus.
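
Stop word removal amounts to filtering tokens against a list, as in the Python sketch below. The short stop word list and the function name are illustrative assumptions; a real project would use a fuller predefined list for its language.

```python
# Illustrative English stop word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "is", "are", "to", "it"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word list."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "The analysis of large collections".split()
content_tokens = remove_stop_words(tokens)
```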

Repeated segments identification.

Repeated segments are sequences of words that are used together with a special meaning. These sets of words recur throughout the texts, and because they lose their meaning when they are split up, separating them takes them out of context. Examples are “economic engineering”, “international marketing”, “text mining” and “machine learning”. Text mining for document classification extracts these terms to find concepts that represent the content of the text. Once these segments are identified, statistics are applied to select the most frequent ones. [7] Brun, R. E., & Senso, J. A. (2004).
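
One simple way to find candidate repeated segments is to count word sequences across the collection and keep those that occur often enough. The Python sketch below does this for two-word segments; the function name, the segment length and the frequency threshold are assumptions for illustration.

```python
from collections import Counter

def repeated_segments(tokenized_documents, n=2, min_count=2):
    """Return (segment, count) pairs for n-word sequences seen at least min_count times."""
    counts = Counter()
    for tokens in tokenized_documents:
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return [(segment, c) for segment, c in counts.most_common() if c >= min_count]

# Hypothetical tokenized documents.
documents = [
    ["text", "mining", "uses", "machine", "learning"],
    ["machine", "learning", "helps", "text", "mining"],
]
segments = repeated_segments(documents)
```

Only "text mining" and "machine learning" appear twice here, so they are the segments kept.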


Tokenization separates the text into words, commonly called tokens. The process must take into account that words can be broken across a line terminator, that they may be attached to punctuation, that they are not always separated by spaces, and that blanks do not always separate words. Punctuation marks in Spanish include ; . : ? ! – - .. ( ) [ ] ' " << >>, of which the period and the dash are ambiguous. Unlike English, Spanish uses both an opening and a closing sign in an exclamation. Tokenization in Spanish should also consider multi-word expressions, called locutions, as part of preprocessing.
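
A minimal Python sketch of tokenization follows. It rejoins words hyphenated across a line break, keeps internal hyphens inside a word, and emits each punctuation mark, including the Spanish opening exclamation sign, as its own token. The regular expressions and the function name are assumptions chosen for this example and do not handle multi-word locutions.

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens."""
    # Rejoin words broken by a hyphen at the end of a line.
    text = re.sub(r"-\s*\n\s*", "", text)
    # A token is a word (optionally with internal hyphens) or one punctuation mark.
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)

spanish_tokens = tokenize("¡Hola, mundo!")
rejoined_tokens = tokenize("pre-\nprocessing step")
```

Treating punctuation as separate tokens leaves later steps free to drop it or to use it, for instance when splitting sentences.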


Sentence and paragraph segmentation breaks the text down into sentences and/or paragraphs using cues such as punctuation marks, abbreviations, acronyms or numbers.
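
Sentence splitting of this kind can be sketched in Python as follows. The sketch ends a sentence at a token that finishes with a period, exclamation mark or question mark, unless the token is a known abbreviation; decimal numbers such as 3.5 are left intact because they do not end in punctuation. The abbreviation list and the function name are illustrative assumptions.

```python
# Illustrative abbreviation list (an assumption, far from complete).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    """Group whitespace-separated tokens into sentences."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token[-1] in ".!?" and token.lower() not in ABBREVIATIONS
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences

sentences = split_sentences("Dr. Smith paid 3.5 dollars. He left!")
```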

Lowercase Conversion.

Uppercase letters play an important role. They are found at the beginning of a sentence and may also mark names. When they do not mark names, it is convenient to convert them to lowercase for later processing.

Name identification.

Proper names are names of people, institutions, companies, events and functions, as well as amounts of money and dates. Name identifiers are based on heuristic rules that identify the fragments corresponding to a given name. Text mining seeks to identify the relationships between the proper names found in the text. [7] Brun, R. E., & Senso, J. A. (2004).


Low-frequency words are removed from the document.
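
A minimal Python sketch of removing low-frequency words is shown below. It assumes that frequency is counted over the whole collection and that words seen fewer than `min_count` times are dropped; both the counting scope and the threshold are assumptions, as is the function name.

```python
from collections import Counter

def remove_rare_words(tokenized_documents, min_count=2):
    """Drop tokens whose total count over all documents is below min_count."""
    counts = Counter(token for doc in tokenized_documents for token in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in tokenized_documents]

# Hypothetical tokenized documents.
documents = [["text", "mining", "rocks"], ["text", "mining", "tools"]]
pruned = remove_rare_words(documents)
```

Words that appear only once rarely help a classifier generalize, and dropping them shrinks the vocabulary.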


Knowledge extraction

This stage is responsible for analyzing the text to provide knowledge, and the component that does so is also known as the classifier. After preprocessing, the data are analyzed to obtain the desired results. As previously mentioned, text mining is used for different purposes, such as document classification or the creation of summaries. In document classification, the classifier seeks to learn a function that relates the attributes of each instance to a predefined class. For documents, the classes are nominal attributes of the instance, because the categories need not have an order among them (as ordinal attributes would). This function is known as the classification model, and it can be descriptive or predictive. Descriptive models are tools that explain the differences between classes. Predictive models are used to predict the class to which an instance belongs. As previously mentioned, machine learning techniques include k-NN (nearest neighbor), k-means (competitive learning), decision trees, Bayes classifiers, Support Vector Machines and rule-based classifiers, among others. Classification is carried out with whichever machine learning algorithm is most convenient. Decision trees are among the simplest and most complete for classification tasks. Decision tree models have different uses, among them variable selection (choosing the most important variables), assessing the relative importance of variables, handling of missing values, prediction and data manipulation [25] (Yan-yan and Ying, 2015).

Decision Tree Model

As described previously, decision trees represent the knowledge about a classification problem using a tree structure. The components of a decision tree are the root node, the internal nodes, the leaf nodes and the branches. The root node represents a choice that generates the first subdivision. The internal nodes are chance nodes, each representing one of the options available at that level of the tree. The leaf nodes represent the end result of the combination of decisions taken along the way. The branches are the possible combinations of decisions, which the tree provides in if-then format.
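
The components just described can be sketched in Python with nested dictionaries: nodes that test an attribute, branches keyed by attribute value, and leaves that hold a class label. The tree below is hand-written and hypothetical, as are the attribute names and the helper functions; it shows how an instance is classified and how the tree reads as if-then rules.

```python
# Hypothetical hand-built tree: the root tests contains_offer.
tree = {
    "attribute": "contains_offer",
    "branches": {
        "yes": {"label": "spam"},                      # leaf node
        "no": {                                        # internal node
            "attribute": "known_sender",
            "branches": {"yes": {"label": "ham"}, "no": {"label": "spam"}},
        },
    },
}

def classify(node, instance):
    """Follow one branch per node until a leaf is reached, then return its label."""
    while "label" not in node:
        node = node["branches"][instance[node["attribute"]]]
    return node["label"]

def to_rules(node, conditions=()):
    """List every root-to-leaf path as an if-then rule."""
    if "label" in node:
        premise = " AND ".join(conditions) or "TRUE"
        return [f"IF {premise} THEN {node['label']}"]
    rules = []
    for value, child in node["branches"].items():
        rules.extend(to_rules(child, conditions + (f"{node['attribute']} = {value}",)))
    return rules

label = classify(tree, {"contains_offer": "no", "known_sender": "yes"})
rules = to_rules(tree)
```

Each rule corresponds to exactly one leaf, so this tree with three leaves yields three rules.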

Steps to build a decision tree model.

Yan-yan and Ying identify splitting, stopping and pruning as the important steps in building a decision tree model. Splitting means that, when the model is created, the most important attribute must be identified; based on it, the records are separated at the root node and at the corresponding internal nodes. Stopping prevents the model from becoming too complex or too deep by applying stopping rules based on parameters. Pruning is the alternative when stopping is not applied: a deep tree is grown first and then pruned by removing the nodes that do not provide relevant information. [25] (Yan-yan and Ying, 2015).

As noted earlier, Quinlan, a computer engineer, introduced ID3, one of the most popular algorithms, in 1986 [11], and in 1993 proposed the C4.5 algorithm [12] to improve on that work. The C4.5 algorithm removes branches that do not provide conclusive decisions, manages attributes with different costs, and handles unknown values by treating them as missing attributes.

Cuartas, Anzola and Tarazona describe in their article a methodology for building a C4.5 decision tree in four steps. The first is to analyze the list of attributes. The second is to divide the information into subsets. The third is to identify the attribute with the most relevant information and take it as the decision parameter. Finally, the information is classified according to the decision parameter [24] (2015). This algorithm is also known as J48, after its Java implementation in the WEKA software.





By Valeria Guevara