Background

Understanding text mining helps explain its importance. Text mining is a discovery process that can analyze the hidden information inside huge amounts of text. A large body of text-mining research spans many fields. One of those fields addresses one of the main concerns of human beings: living in good health. The growing importance of global health in international affairs has generated innovative mechanisms, and to support these objectives, text mining provides an opportunity to discover relevant information.

Medical Text Mining References

Currently, there is a large amount of important research that supports medicine. Data mining is applied not only in medicine but also in business, climate, transportation, government, insurance, and other industries. Applying data mining within a specific industry requires attention to privacy, security, and accuracy.

“Text mining of cancer-related information: Review of current status and future directions”

The International Journal of Medical Informatics published a review of the current status and future improvements of text mining with respect to cancer, titled “Text mining of cancer-related information: Review of current status and future directions” [4] (Spasić, Livsey, Keane & Nenadić, 2014). The purpose of this article is to explain how the text mining field, through computational methods, generates useful medical information related to cancer.

Text mining, natural language processing, and machine learning are techniques used to represent cancer information in structured form. The extensive documentation connected to cancer and related terms is processed for classification by assigning each document to a category. Another strategy is information extraction, which identifies determined relationships between specified entities. Similarly, terminology extraction collects the relevant terms of a domain-specific corpus, and named entity recognition identifies the names of predefined entities.

The sources of information are collected from different types of text. Some come from databases such as Medline, the IEEE Xplore digital library, and the ACM digital library. This information needs to be organized to create useful knowledge. However, these texts are so complex that they require a higher level of analysis and knowledge of specialized terms. For example, there are over 200 different types of cancer, each with its own symptoms and treatments. To distinguish this vocabulary, the authors mention the UMLS. The Unified Medical Language System integrates medical language so that synonyms and lexical information for these terms can be found. They also mention SNOMED, a system that provides general clinical terms.

In order to find an appropriate document classification technique, comparative measures must be taken. A document classification system is evaluated based on its results. Automatic predictions for instances with known classes are represented in a confusion matrix. The evaluation of a single prediction is a set of attributes where the predicted class is either true or false, and predictions that match the known class are counted as correct.
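
The measures derived from such a confusion matrix can be sketched as follows; this is a minimal illustration with invented counts, not an evaluation from the reviewed article:

```python
# Precision, recall, and F-measure from a binary confusion matrix.
# tp/fp/fn are true positives, false positives, and false negatives.
def f_measure(tp, fp, fn):
    precision = tp / (tp + fp)   # fraction of predicted positives that are correct
    recall = tp / (tp + fn)      # fraction of actual positives that were found
    return 2 * precision * recall / (precision + recall)

# Suppose a classifier produces 40 true positives, 10 false positives,
# and 20 false negatives (hypothetical numbers).
print(round(f_measure(40, 10, 20), 3))  # 0.727
```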

The processing of the texts uses NLP techniques that are noted in this article. Named Entity Recognition identifies words and phrases and classifies them into categories. The domain of biomedical terms is extensive, and a particular entity can vary in many ways; the article notes that approximately one-third of the occurrences of each term are variants of it. The analysis of these texts therefore depends on the ability to recognize each variation of a word and map it to the same entity. Some of these variations are acronyms that require domain-specific dictionaries, since the general UMLS does not store this kind of overly specialized jargon. For example, RxNorm provides clinical drug names and their common names, and the UMLS SPECIALIST lexicon provides synonyms.
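
The idea of mapping term variants to a single entity can be sketched with a simple dictionary lookup; the entries below are invented for illustration and are not taken from UMLS or RxNorm:

```python
# Map surface variants of a term to one canonical entity name.
# The synonym table is a hand-built toy, standing in for a
# domain-specific dictionary such as those the article describes.
VARIANTS = {
    "nsclc": "non-small cell lung cancer",
    "non small cell lung cancer": "non-small cell lung cancer",
    "non-small-cell lung carcinoma": "non-small cell lung cancer",
}

def normalize(term):
    key = term.lower().strip()
    return VARIANTS.get(key, key)  # unknown terms pass through unchanged

print(normalize("NSCLC"))  # non-small cell lung cancer
```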

Information Extraction in the clinical setting specifies a relationship of interest. The extracted vocabulary of interest is linked by a relationship to another extracted entity. These relationships become structured forms containing entity names, relationships, and negations. Regular expressions are used to model such structures. Predictions must correctly determine the meaning of the text: a classification may depend on the frequency of an attribute, so it matters whether that attribute appears inside a negation. Handling negation is therefore relevant in this area. MedTAS/P is a system that deals with the problem of negation. It applies a linguistic preprocessing pipeline that includes tokenization, phrase discovery, part-of-speech tagging, and shallow parsing adapted to the medical context.
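
The use of regular expressions to capture negated findings can be sketched as below; the pattern and example sentence are illustrative only and do not reproduce the MedTAS/P implementation:

```python
import re

# A toy negation pattern: a negation cue followed by the negated phrase.
NEGATION = re.compile(r"\b(no|denies|without|negative for)\s+(\w[\w\s-]*)", re.I)

def negated_findings(sentence):
    # Return the phrases that appear under a negation cue.
    return [m.group(2).strip() for m in NEGATION.finditer(sentence)]

print(negated_findings("Patient denies chest pain, no fever."))
```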

Text classification adds meaning to text by finding other relevant information in it. This technique uses both Named Entity Recognition and Information Extraction to fill a predefined scheme. There are different classification techniques; inductive logic programming (ILP) is used for the automatic construction of a classification model, creating a logical model based on if-then rules. Additionally, Support Vector Machines (SVM) extract elements and combine them into a global decision. The article mentions an experiment with different types of machine learning to classify cancer findings in reports. The study by Martinez et al. found that the best results were obtained using Bayesian methods, while concluding that for high-dimensional text SVM has higher performance than the Bayesian BNI model. Other important aspects are keywords, stemmed tokens, and the removal of stop words.
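
The Bayesian side of that comparison can be illustrated with a minimal multinomial Naive Bayes classifier; the documents, labels, and vocabulary below are invented, and this is a sketch of the general technique, not the authors' implementation:

```python
import math
from collections import Counter, defaultdict

# Train: count classes and per-class word frequencies.
def train(docs):
    class_counts, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for words, label in docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

# Predict: pick the class with the highest log-posterior,
# using Laplace smoothing over the shared vocabulary.
def predict(model, words):
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total)
        n = sum(word_counts[label].values())
        for w in words:
            score += math.log((word_counts[label][w] + 1) / (n + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

docs = [
    (["tumor", "malignant", "biopsy"], "cancer"),
    (["lesion", "carcinoma", "tumor"], "cancer"),
    (["fracture", "cast", "bone"], "other"),
    (["sprain", "bone", "xray"], "other"),
]
model = train(docs)
print(predict(model, ["tumor", "carcinoma"]))  # cancer
```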

Information retrieval classifies documents as relevant or irrelevant to a search. The article shows a study on obtaining information about specific cases, expressing how one should search and how much detail should be used to find the desired information.

Paper Importance: 

  • Reviews the current status and future directions of the text mining field related to cancer.
  • Points out that sources of information are collected from different types of text.
  • Provides evaluation measures derived from a confusion matrix, such as the F-measure.
  • Describes NLP techniques: Named Entity Recognition, Information Extraction, Text Classification (with its different classification techniques), and Information Retrieval over classified documents.

 

“A Lexical Approach for Text Categorization of Medical Documents”

The authors propose a lexical approach to text categorization in the medical domain (Jindal & Taneja, 2015) [42].

The algorithm proposed in this article reduces the size of the documents. Additionally, it helps categorize documents by identifying the tokens or lexemes that represent medical documents. MeSH (Medical Subject Headings) is the standard list of keywords used for this purpose.

The KNN approach considers tokens as the major source of information. Each document is expressed as a vector of tokens and their respective weights. The approach calculates the distance from each of the K nearest neighbors to the candidate classes and then computes the predicted class. Stop words and special characters are removed. A lexical analyzer scans the characters of the data and groups them into a set of tokens (keywords or synonyms). Finally, a function computes the similarity between an article and its neighbors using token-weight approaches.
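
The steps above can be sketched as a KNN classifier over token-weight vectors; the tiny corpus, labels, and the choice of cosine similarity are illustrative assumptions, not the authors' exact design:

```python
import math
from collections import Counter

# Cosine similarity between two sparse token-weight vectors.
def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Rank training documents by similarity and let the K nearest vote.
def knn_predict(train, tokens, k=3):
    query = Counter(tokens)
    ranked = sorted(train, key=lambda d: cosine(Counter(d[0]), query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

train = [
    (["tumor", "biopsy", "oncology"], "cancer"),
    (["carcinoma", "tumor", "staging"], "cancer"),
    (["insulin", "glucose", "diabetes"], "endocrine"),
    (["thyroid", "hormone", "glucose"], "endocrine"),
]
print(knn_predict(train, ["tumor", "staging"]))  # cancer
```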

Paper Importance: 

  • Brings forward a lexical approach to text categorization in the medical domain.
  • Exemplifies the KNN algorithm, computing the similarity between an article and its neighbors using token-weight approaches.
  • Proposes MeSH (Medical Subject Headings) as the standard list of keywords for this purpose.

 

“Learning regular expressions for clinical text classification.”

The main goal of this article is to automate the creation of regular expressions and their use in text classification (Bui & Zeng-Treitler, 2014) [43].

Paper Importance: 

  • Introduces the materials and methods for regular expressions. Describes the RED algorithm, which learns sequences of characters carrying semantic information for text classification.
  • The algorithm generates regular expressions applied to different clinical text classification tasks.

 

Spanish Text Classification References

“Computer aided classification of diagnostic terms in Spanish”

Pérez, Gojenola, Casillas, Oronoz and Díaz de Ilarraza [44] classify medical records by their diagnostic terms, a large-scale text classification problem (Pérez, Gojenola, Casillas, Oronoz, & Díaz de Ilarraza, 2015). They explore computer-aided approaches to classify Spanish medical records into their diagnostic terms. The project faces several challenges, such as the processing of natural language and the need for efficient methods for large-scale classification.

 Paper Importance: 

  • Introduces diagnostic terms according to the International Classification of Diseases, Clinical Modification (ICD-9-CM). Searches for computer-aided approaches to classify Spanish medical records. Exemplifies standard text categorization techniques using unigrams and points out the use of n-grams for medical classification.
  • Discusses the Levenshtein distance to provide lexical transformations and an approximation to spelling errors. Illustrates lexical seeds as the set that defines acceptable diagnostic terms, and lexical transformations that convert diagnostic terms between lower and upper case and handle singulars and plurals, gender, abbreviations, synonyms, acronyms, etc.
  • Applies these operations to the Spanish language in its clinical domain.
  • Provides lexical resources such as the ICD-9-CM list, a manually generated corpus of medical records in Spanish, the Documentation Service of the GUH (synonyms and misspellings), and SNOMED CT (Systematized Nomenclature of Medicine - Clinical Terms).
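
The Levenshtein distance mentioned above can be sketched as follows; this is the standard dynamic-programming formulation, shown here with an invented spelling-variant example rather than data from the paper:

```python
# Levenshtein edit distance: minimum number of insertions,
# deletions, and substitutions to turn string a into string b.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

# A one-character accent variant, as found in Spanish clinical text.
print(levenshtein("neumonia", "neumonía"))  # 1
```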

 

“A syntactic approach for opinion mining on Spanish reviews”

The paper proposes a technique for natural language processing (NLP) [45] (Vilares, Alonso, & Gómez, 2015). The technique covers the preprocessing of data, tokenization, etc., and produces a syntactic structure for each sentence so that the sentence becomes the unit of meaning for the analyzer.

Opinion mining (OM), also known as sentiment analysis, focuses on the automatic processing of subjective information in texts. The article presents a prototype sentiment analyzer for Spanish that uses syntactic dependencies to define the polarity of the text.

Polarity classification is approached from different perspectives: supervised machine learning (ML) and unsupervised semantic-based methods. ML involves solutions such as bag of words or related linguistic features. Semantic methods involve dictionaries whose words are related based on their semantic orientations (SO). Machine learning approaches can handle independent domains, although performance varies between domains. Hybrid approaches are a different strategy that uses natural language processing (NLP) for automatic approximation of the semantic method.

Opinion mining solutions are either lexicon-based or ML-based. These solutions cannot interpret the syntactic structure of the text and therefore do not take into account the relationships between words. This article proposes dependency-based parsing, a method that determines the semantic orientation of texts in Spanish.

The first step is preprocessing of the sentence structures, covering special cases such as the unification of compound expressions and the normalization of punctuation marks. The second step separates words and sentences at a generic level; this tokenizer handles abbreviations and punctuation marks. The third step detects accents in order to distinguish meanings. Finally, dependencies are analyzed with the Nivre arc-eager algorithm, which generates a dependency tree for each sentence. For the sentiment analysis itself, the system relies on SODictionariesV1.11Spa (Brooke et al., 2009) and the Spanish and English dictionaries of the SO-CAL calculator.

The semantic orientation of each word is calculated taking into account the common nouns, adjectives, adverbs, and verbs found in the dictionary. The classification treatment also establishes shifters for words that modify semantic orientation, and exclamation marks are considered to mark emphasized ideas. The treatment of adversative subordinate clauses can increase or decrease the interpreted sentiment, in some cases completely ignoring the sentiment expressed in the main clause; the algorithm for calculating semantic orientations considers the SO of both clauses.

The treatment of negation is important because negation is common in Spanish, expressed as a sequence of words or as double negatives in the same sentence. As a first step, this treatment depends on the verb used in the sentence, and the extent of a negation is found by searching through the identified tokens. The next step takes into account the change of polarity in the text, accomplished by unifying the negation word and the negated term into a single token.
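
The combination of dictionary-based semantic orientation with a negation shift can be sketched as below; the lexicon values and the shift rule are invented for illustration and are not those of SO-CAL or SODictionariesV1.11Spa:

```python
# Toy SO lexicon (Spanish words, invented scores) and negation cues.
LEXICON = {"bueno": 2, "excelente": 4, "malo": -2, "terrible": -4}
NEGATORS = {"no", "nunca"}

def semantic_orientation(tokens):
    # Sum word scores, flipping the sign of a sentiment word
    # that follows a negation cue.
    score, negate = 0, False
    for tok in tokens:
        if tok in NEGATORS:
            negate = True
        elif tok in LEXICON:
            so = LEXICON[tok]
            score += -so if negate else so
            negate = False
    return score

print(semantic_orientation(["no", "es", "bueno"]))  # -2
```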

Dictionaries of common words are inadequate in several specific areas; dictionaries must be specific to the required context. Besides being semantically robust, they must recognize the polarity of subjective words in context. The article mentions attribute search methods contained in the Weka software; the InfoGainAttributeEval evaluator scores attributes by how much information they provide about the class.
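
The information-gain score behind evaluators such as Weka's InfoGainAttributeEval can be sketched as follows; the toy data is invented, and this is the underlying formula rather than Weka's code:

```python
import math
from collections import Counter

# Shannon entropy of a list of class labels.
def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

# Information gain of one attribute: class entropy minus the
# weighted entropy of the subsets induced by the attribute values.
def info_gain(values, labels):
    gain = entropy(labels)
    for value in set(values):
        subset = [l for v, l in zip(values, labels) if v == value]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

# An attribute that perfectly separates the two classes.
print(info_gain([1, 1, 0, 0], ["pos", "pos", "neg", "neg"]))  # 1.0
```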

Paper Importance:  

  • The article describes techniques for natural language processing.
  • Describes the preprocessing of data, tokenization, etc.
  • Describes opinion mining (OM).
  • Shows a prototype sentiment analyzer for Spanish.
  • Exemplifies polarity classification based on different approaches: supervised machine learning (ML) and unsupervised semantic-based methods.
  • Describes ML solutions such as bag of words or related linguistic features.
  • Semantic methods involve a dictionary whose words are related based on their semantic orientations (SO).
  • Describes opinion mining based on lexicons or on ML.
  • Under these terms, describes dependency-based parsing to determine the orientation of texts in Spanish.
  • Mentions preprocessing of special cases such as the unification of compound expressions and the normalization of punctuation marks.
  • Indicates the separation of words and sentences at a generic level; this tokenizer handles abbreviations and punctuation marks.
  • Considers the detection of accents to distinguish meanings.
  • Mentions the SODictionariesV1.11Spa and SO-CAL Spanish and English dictionaries used for sentiment analysis.
  • Defines how the semantic orientation of each word is calculated, taking into account the common nouns, adjectives, adverbs, and verbs found in the dictionary. The calculation also considers punctuation, the sentiment expressed in subordinate clauses, and the treatment of negation, which is common in Spanish.
  • Exemplifies attribute search methods contained in the Weka software.
  • Proposes machine learning techniques to obtain better precision than the semantic approaches, given the limits of generic semantic orientations.

 

“Minería de texto para la categorización automática de documentos” (Text mining for the automatic categorization of documents)

This article describes the implementation of a semantic search engine [46] (Pérez & Cardoso, 2010). The search engine uses metadata obtained automatically through the UIMA architecture. Through learning algorithms, the search engine can automatically categorize documents.

UIMA uses components that contain the analysis logic, called scorers. Each scorer can be grouped with other related scorers. The techniques for recognizing entities use regular expression matching, dictionaries, and templates. All analyzers are structured under a standard, in this case XML.

Cardoso and Pérez define text categorization as labeling natural language texts with thematic categories from a defined set. Text classification consists of three phases: data preprocessing, classifier construction, and categorization of new documents.

The first stage is data preprocessing, which includes various techniques. The first is stemming, which reduces a word to its root; Weka implements the Snowball algorithm (snowball.tartarus.org). The second technique cited is the removal of empty words (stopwords), irrelevant words such as articles, prepositions, and conjunctions, both domain-free and domain-dependent. The third technique is attribute selection, which reduces the number of attributes by eliminating the irrelevant ones. The last technique assigns different or equal weights to attributes to represent their importance. The result is a proper, compact representation of the collection to which learning algorithms can be applied. In Weka, this is done with the StringToWordVector filter.
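
The preprocessing stage can be sketched in a few lines; the stopword list and the crude suffix-stripping "stemmer" below are invented stand-ins for Weka's Snowball stemmer and StringToWordVector filter:

```python
from collections import Counter

# Toy English stopword list (illustrative, not Weka's).
STOPWORDS = {"the", "a", "of", "and", "to", "in"}

def stem(word):
    # Crude suffix stripping, standing in for a real Snowball stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Remove stopwords, stem, and return term-frequency weights.
    tokens = [w for w in text.lower().split() if w not in STOPWORDS]
    return Counter(stem(w) for w in tokens)

print(preprocess("the patients responded to the treatments"))
```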

The second stage is the construction of the classifier from the values obtained in the previous step. This process uses semi-supervised algorithms, because they learn from both classified and unclassified examples; as examples the authors mention SMO, Naive Bayes, and Co-training. In the last stage, the quality of each model's classification is evaluated against a real corpus.

Paper Importance: 

  • The article describes the implementation of a semantic search engine that, through learning algorithms, can automatically categorize documents.
  • Divides classification into three phases: data preprocessing, classifier construction, and categorization of new documents.
  • Mentions the different data preprocessing techniques: stemming, removal of empty words (stopwords), attribute selection, and the assignment of different or equal weights to the attributes to reflect their importance.
  • All this is done using the StringToWordVector filter in Weka.
  • Defines the semi-supervised algorithms used to build the classifier.
  • Mentions SMO, Naive Bayes, and Co-training as examples. In the last stage, the quality of each model's classification is evaluated against a real corpus.

In this project, the use of the Weka tool for document classification will be described, applying these medical concepts to two corpora: one in English and one in Spanish.


By Valeria Guevara