
Improving Machine Learning Input for Automatic Document Classification with Natural Language Processing

Jan C. Scholtes, Tim H.W. van Cann

University of Maastricht, Department of Knowledge Engineering.

P.O. Box 616, 6200 MD Maastricht

Abstract

Given a number of documents, we are interested in automatically classifying documents or document sections into a number of predefined classes as efficiently as possible and with as few computational resources as possible. This is done by using Natural Language Processing (NLP) techniques in combination with traditional high-dimensional document representation techniques such as Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF), and machine learning techniques such as Support Vector Machines (SVM).

Despite the availability of various statistical feature-selection techniques, the high dimensionality of the feature spaces causes computational problems, especially in collections containing old spelling and Optical Character Recognition (OCR) errors, which lead to exploding feature spaces. As a result, feature extraction, feature selection, training a supervised machine learning algorithm, and clustering can no longer be used in practice because they are too slow and their memory requirements are too large.

We show that by applying a variety of Natural Language Processing (NLP) techniques as pre-processing, it is possible to significantly increase the discrimination between the classes. In this paper, we report F1-measures that are up to 11.3% higher than those of a baseline model which does not use NLP techniques. At the same time, the dimensionality of the feature space is reduced by up to 54%, leading to highly reduced computational requirements and better response times in building the model of the feature space as well as in the machine learning and classification. Further experiments resulted in vector reductions of up to 80%, with results being only 4% worse than the baseline model.

1 Introduction

In this paper, we describe a Natural Language Processing (NLP) based approach to the problem of reducing the dimensionality of the feature space when applying feature extraction such as Bag of Words (BoW) or Term Frequency-Inverse Document Frequency (TF-IDF). BoW and TF-IDF are commonly used for document classification, document clustering and relevance ranking [16]. Despite the fact that the feature space is very sparse (i.e. it contains many dimensions holding zero values), the dimensionality of the feature space is often very high (in the 100,000's or more). When the text of the documents also contains Optical Character Recognition (OCR) errors or old-spelling variations, the dimensionality can easily double or triple. The high dimensionality of these feature spaces causes computational problems during feature extraction, feature selection, machine learning and classification. Although many OCR errors are quite unique in their occurrences (there are many OCR variations, but these occur with very low individual frequencies [32]), and although the TF-IDF algorithm automatically grants very low feature values to rare words, this still causes huge computational delays in the feature extraction process. This is currently causing many problems, especially in the humanities, where digital-heritage collections are the basis of research and investigation [10, 18].


Statistical feature selection methods such as, but not limited to, Principal Component Analysis, Chi-square, Maximum-Likelihood and Maximum Entropy models reduce the feature space by selecting the best features to increase the inner-class similarity and inter-class difference [4, 6, 8, 17, 27]. Obviously, these algorithms also suffer from the high dimensionality of the initial feature extraction process. An additional problem exists: after feature selection, the feature space is highly optimized to the documents used in this process and, as a result, new documents that were not used in building the feature selection will be classified at much lower quality levels [21]. Since feature selection is computationally very expensive, it is also not convenient to repeat the feature selection calculations every time new documents are added to the collection.

During the research, we also found that statistical feature selection methods such as Chi-square did not work well due to the sparseness of the data and because of the strong dependence of the features on each other: after all, language is more than a bag of words. Models that presume statistical independence between features (e.g. word occurrences in text), or that look for features with the largest statistical independence, have to overcome the fact that words do not occur independently in natural language.

Based on our initial experience in [14], where co-reference resolution and synonym normalization led to significant improvement of classification results for sentence classification, we propose a different method in which the inner-class similarity and inter-class difference are increased by using a number of NLP techniques that pre-process the text of a document in such a way that the initial feature extraction and selection result in much smaller vectors than with the original text [8, 15]. This yields highly reduced calculation times while, at the same time, better classification results are expected.

2 Overview of the Classification Pipeline

First, a Baseline Performance is created. Next, the impact of a number of different NLP techniques on the quality as well as on the computational complexity will be discussed. In this research, the following techniques are used in the machine learning process:

1. A number of Natural Language Processing techniques, and

2. Two document feature-extraction techniques known as (i) bag-of-words (BoW) and (ii) Term Frequency-Inverse Document Frequency (TF-IDF), and

3. Basic document feature-selection techniques such as logarithmic normalization and selection of the relevant features by vector cut-off, and

4. A supervised machine-learning algorithm based on Support Vector Machines (SVM) to build binary classifiers for each document category.

We will discuss the setup of these different components of the classification pipeline in more detail in the following paragraphs.

2.1 Natural Language Processing

By using a number of Natural Language Processing techniques, the original text is modified in such a way that the text that is presented to the feature selection process contains words that improve the machine-learning inner-class similarity and inter-class dissimilarity. This goal is reached with the following means:


(i) increasing the number of relevant words for a class by using named-entity recognition, co-reference and anaphora resolution, and

(ii) consolidating the different textual occurrences of words which are caused by for instance synonyms, abbreviations, spelling variations, spelling errors, or OCR errors.

In order to use these methods as well as possible, Part of Speech (POS) tagging from the NLTK library [35] was used [8, 15, 16, 24, 28]. First, the additional POS information is used for boundary and conjunction detection in the named-entity recognition process with the CoreNLP library [36]. Second, the POS tags are used for co-reference and pronoun resolution, replacing co-references and pronouns with their corresponding named-entity values [12, 26].
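To make this step concrete, the following minimal Python sketch runs tokenization, POS tagging and named-entity chunking with NLTK on a fragment of the sample text used below. It is a sketch only: the paper uses CoreNLP for named-entity recognition and co-reference resolution, while NLTK's ne_chunk is used here merely as a simplified stand-in, and the sketch assumes the required NLTK models (punkt, averaged_perceptron_tagger, maxent_ne_chunker, words) have been downloaded.

import nltk

text = ("In 2003, James Williams decided to become an atcor in Hollywood L.A. "
        "He moved there last week to pursue his livelong dream of acting glory and fame.")

tokens = nltk.word_tokenize(text)   # tokenization
tagged = nltk.pos_tag(tokens)       # POS tags, also used for boundary and conjunction detection
tree = nltk.ne_chunk(tagged)        # named-entity chunking on top of the POS tags

# Collect the detected named entities; a full pipeline would next replace
# co-references and pronouns ('He', 'his') with the entity they refer to.
entities = [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees()
            if subtree.label() in ("PERSON", "GPE", "ORGANIZATION")]
print(entities)   # e.g. ['James Williams', 'Hollywood']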

Next, abbreviations are resolved by using the official English abbreviations list from the Oxford English Dictionary [5], and synonyms are resolved by using the lexical database WordNet [29]. In addition, WordNet, in combination with the POS tag, is used for lemmatization to reduce words to their lemma. Hereafter, by using the Jaro-Winkler [11, 31] and Levenshtein [22] distance methods, spelling variations due to prefixes and suffixes [25], international spelling variations, spelling errors, and OCR errors are resolved, and all occurrences of such words are normalized to and replaced by one common token in the text. The process is concluded with the removal of stop words, for which a predefined list is used [33].
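The sketch below illustrates the lemmatization, stop-word removal and spelling-normalization steps. It is a simplified sketch: the paper uses the Jaro-Winkler and Levenshtein distances, whereas difflib's similarity ratio is used here as a stand-in for that string-similarity step; the vocabulary and the 0.8 similarity threshold are illustrative assumptions, and the NLTK WordNet data and stop-word list are assumed to be installed.

import difflib
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

STOP = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def to_wordnet_pos(penn_tag):
    # Map a Penn Treebank POS tag onto the WordNet POS categories.
    if penn_tag.startswith("V"):
        return wordnet.VERB
    if penn_tag.startswith("J"):
        return wordnet.ADJ
    if penn_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

def normalize(tagged_tokens, vocabulary, threshold=0.8):
    # Lemmatize, drop stop words, and map near-duplicate spellings
    # (OCR errors, spelling variants) onto one common token.
    out = []
    for word, tag in tagged_tokens:
        word = word.lower()
        if word in STOP or not word.isalpha():
            continue
        lemma = LEMMATIZER.lemmatize(word, to_wordnet_pos(tag))
        # Replace the lemma by the closest known vocabulary entry, if any.
        match = difflib.get_close_matches(lemma, vocabulary, n=1, cutoff=threshold)
        out.append(match[0] if match else lemma)
    return out

print(normalize([("atcor", "NN"), ("acting", "VBG"), ("his", "PRP$")],
                vocabulary=["actor", "act"]))
# -> ['actor', 'act']: 'atcor' is mapped onto 'actor', 'acting' is lemmatized
#    to 'act', and the stop word 'his' is dropped.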

For reasons of clarity, the process is explained using the following sample text (please note that atcor is a deliberate typo):

In 2003, James Williams decided to become an atcor in Hollywood L.A. He moved there last week to pursue his livelong dream of acting glory and fame. His brother, a celebrated writer, wished him good luck.

After tokenization, the document looks like this:

['in', '2003', 'james', 'williams', 'decided', 'to', 'become', 'an', 'atcor', 'in', 'hollywood', 'l.a.', 'he', 'moved', 'there', 'last', 'week', 'to', 'pursue', 'his', 'livelong', 'dream', 'of', 'acting', 'glory', 'and', 'fame.', 'his', 'brother,', 'a', 'celebrated', 'writer,', 'wished', 'him', 'good', 'luck.']

After applying co-reference and pronoun resolution, lemmatization, word similarity and resolving abbreviations and synonyms, the document has changed to:

['James', 'Williams', 'decide', 'actor', 'Hollywood', 'los angeles', 'James', 'Williams', 'travel', 'week', 'prosecute', 'livelong', 'James', 'Williams', 'dream', 'act', 'glory', 'fame', 'James', 'Williams', 'brother', 'lionize', 'writer', 'wish', 'good', 'fortune']

This output is then used as the basis for the feature extraction process. As one can see, on the one hand, the number of unique words is much smaller than in the original text, leading to a significantly smaller feature vector. On the other hand, the remaining words are all more distinguishing for the specific content of the document: there are fewer word variations for similar content words and fewer common, high-frequency words such as stop words, pronouns and co-references.

2.2 Feature extraction

For feature extraction, the process of extracting relevant information from the data to create feature vectors, the Bag of Words (BoW) and the Term Frequency - Inverse Document Frequency (TF-IDF) methods are used [16].
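A minimal sketch of these two feature-extraction variants is given below, using scikit-learn's vectorizers as an assumed implementation (the paper does not prescribe a specific library); each input string is the space-joined output of the NLP pre-processing step, and the two example documents are illustrative.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Each document is the space-joined token list produced by the NLP pre-processing.
docs = ["james williams decide actor hollywood los_angeles dream act glory fame",
        "court sentence crime violence james williams brother writer"]

bow = CountVectorizer()                 # Bag of Words: raw term counts per document
X_bow = bow.fit_transform(docs)         # sparse matrix of shape (n_documents, n_features)

tfidf = TfidfVectorizer()               # Term Frequency - Inverse Document Frequency weights
X_tfidf = tfidf.fit_transform(docs)

print(X_bow.shape, X_tfidf.shape)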


2.3 Normalization & Basic Feature Selection

In order to reduce the possible large gaps between feature values, normalization can be applied. Each feature is normalized between 0 and 1. Since this normalization can still create very small values, a second normalization is applied by taking the logarithm (base 10).

In order to further decrease the number of features present in the document vectors and to select the best possible features for the machine learning process, a very basic feature selection is applied by removing the dimensions that contain only very low values. We have called this the vector cut-off approach.

The vector cut-off is defined as:

Cut-off = min + (max - min) * (perc / 100)

where max is the maximum value in the m x n document matrix (the collection of all document vectors, where m is the number of documents and n the number of features) and min is the minimum value in the document matrix. The parameter perc needs to be determined empirically. If a feature has no value higher than or equal to the cut-off, the feature is removed from the vector; otherwise it is kept. We used a very small value for perc: 1 in our case.
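The sketch below gives one possible NumPy implementation of this normalization and cut-off step. The exact placement of the base-10 logarithm (applied here to the non-zero min-max-normalized values and rescaled back into (0, 1]) is our reading of the description above, and the example matrix is purely illustrative.

import numpy as np

def normalize_and_cut(X, perc=1.0):
    # Scale every feature (column) to the range [0, 1]; constant columns stay at 0.
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    X = (X - col_min) / span
    # Second normalization: a base-10 logarithm on the non-zero values,
    # rescaled so that (0, 1] is mapped back onto (0, 1].
    X = np.where(X > 0, np.log10(1.0 + 9.0 * X), 0.0)
    # Vector cut-off: Cut-off = min + (max - min) * (perc / 100).
    cutoff = X.min() + (X.max() - X.min()) * (perc / 100.0)
    keep = (X >= cutoff).any(axis=0)   # keep a feature only if some document reaches the cut-off
    return X[:, keep], keep

X = np.array([[0.0, 3.0, 0.0],
              [0.1, 9.0, 0.0]])
X_reduced, kept = normalize_and_cut(X, perc=1.0)
print(X_reduced.shape, kept)   # (2, 2) [ True  True False]: the all-zero feature is dropped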

2.4 Supervised Machine Learning and Automatic Document Classification with SVM

Supervised machine learning based on a linear Support Vector Machine (SVM) was used to construct a system that can be trained with tagged data [1, 3, 7]. Because of the sparseness of the data, the linear model was good enough; using the Gaussian kernel or the sigmoid kernel did not lead to better results, only to longer calculation times. LIBSVM was used for the implementation of the experiments [2]. Each training document was tagged with the appropriate classification category. To support multi-category classification, a separate binary model was trained for each category, which predicts with a certain probability whether a document belongs to that category or not. Single-class classification was implemented by taking the category with the maximum returned value; multi-class classification was implemented by selecting all classifiers returning a value higher than a certain threshold.
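The sketch below shows this per-category binary setup in Python with scikit-learn's SVC, which wraps LIBSVM; the paper uses LIBSVM directly, so the wrapper, the probability outputs, the category dictionary and the 0.5 threshold are illustrative assumptions rather than the original code.

import numpy as np
from sklearn.svm import SVC

def train_binary_models(X_train, doc_labels, categories):
    # One linear-kernel binary classifier per category (one-vs-rest).
    models = {}
    for cat in categories:
        y = np.array([1 if cat in labels else 0 for labels in doc_labels])
        models[cat] = SVC(kernel="linear", probability=True).fit(X_train, y)
    return models

def classify(models, x, threshold=0.5):
    # Single-class: the category with the highest probability.
    # Multi-class: every category whose probability exceeds the threshold.
    probs = {cat: model.predict_proba(x.reshape(1, -1))[0, 1]
             for cat, model in models.items()}
    single = max(probs, key=probs.get)
    multi = [cat for cat, p in probs.items() if p > threshold]
    return single, multi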

3 Experiments and Results

3.1 Corpus and Evaluation Method

In this research, the fully annotated Reuters RCV1 corpus is used [13, 19, 23]. For the evaluation, we have used the same evaluation method as the TREC Legal Track, which is based on best-practice principles for measuring the quality of document classification. We used the F1 score and 11-point precision graphs representing the quality of the classifiers [9]. In a so-called 11-point precision graph, the precision is calculated for each recall value corresponding to a certain threshold value, and both are plotted as coordinates in a graph.
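A minimal sketch of this evaluation is given below: the F1 score on thresholded predictions and an interpolated 11-point precision curve computed from the classifier scores. scikit-learn's metrics are used here as an assumed implementation, and the gold labels and scores are illustrative.

import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve

def eleven_point_precision(y_true, scores):
    # Interpolated precision at the recall levels 0.0, 0.1, ..., 1.0:
    # at each level, take the best precision reachable with recall >= that level.
    precision, recall, _ = precision_recall_curve(y_true, scores)
    return [precision[recall >= r].max() if (recall >= r).any() else 0.0
            for r in np.linspace(0.0, 1.0, 11)]

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                    # illustrative gold labels
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1])   # classifier scores

print("F1:", f1_score(y_true, (scores >= 0.5).astype(int)))
print("11-point precision:", eleven_point_precision(y_true, scores))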

3.2 Baseline Performance Creation

The following experiments were implemented. First we created a baseline performance by selecting labeled documents from the RCV1 corpus:

1. First, training and validation sets are randomly generated per Reuters category from RCV1 by randomly choosing 500 documents inside the Reuters category (positive instances) and 500 documents outside the Reuters category (negative instances). The test set consists of 10,000 additional documents from RCV1, of which 25% are from within the class and 75% are outside the class, randomly selected from the entire RCV1 corpus (see the sampling sketch after this list). We tested several different RCV1 classes related to WAR, CRIME and VIOLENCE. The results for the different classes were about the same.

2. Next, feature extraction was done using BoW and TF-IDF. Feature selection was done using vector cut-off, and the features were normalized using logarithmic normalization.

3. For training, an SVM with a linear kernel was used.
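The balanced training set and skewed test set can be sampled as in the sketch below, where in_class and out_class are hypothetical lists of RCV1 document identifiers inside and outside the target category; the fixed seed is only for reproducibility of the illustration.

import random

def build_sets(in_class, out_class, seed=0):
    # Balanced training set: 500 positive and 500 negative documents.
    rng = random.Random(seed)
    train = rng.sample(in_class, 500) + rng.sample(out_class, 500)
    train_set = set(train)
    # Skewed test set of 10,000 unseen documents: 25% inside, 75% outside the class.
    remaining_pos = [d for d in in_class if d not in train_set]
    remaining_neg = [d for d in out_class if d not in train_set]
    test = rng.sample(remaining_pos, 2500) + rng.sample(remaining_neg, 7500)
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test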

The results of the baseline performance can be found in Table 1.

Table 1: F1 Scores for the Baseline Performance

Feature Extraction Used    F1 Score (Test Set)
Bag of Words               0.767
TF-IDF                     0.757

In the baseline performance, the dimensionality of the feature vectors was 142,722 after the basic feature extraction. If n_features is the dimensionality of the feature vector and n_samples is the number of training samples, then the time complexity of the LIBSVM implementation scales between O(n_features x n_samples^2) and O(n_features x n_samples^3), depending on how efficiently the LIBSVM cache is used in practice (which is dataset dependent). If the data is very sparse, n_features should be replaced by the average number of non-zero features in a sample vector. By reducing the length of the feature vector, the time and space complexity can be significantly reduced.

3.3 The Effects of NLP on the Results

The results of the additional NLP steps are listed in Table 2. In these experiments, we used the same training and test sets as in the baseline performance. Even though BoW has a slightly higher baseline, BoW in combination with NLP techniques reached a maximum F-score of only 0.795, 0.028 higher than its baseline of 0.767. TF-IDF in combination with NLP techniques led to better results, up to 0.820, so we will focus on those:

Table 2: Results after NLP on TF-IDF Feature Extraction

NLP techniques applied (in chronological sequence)   Vector size TF-IDF   % vector reduction vs. baseline   F-score TF-IDF
Baseline (tokenization)                              142,722              0%                                0.757
Anaphora and NER                                     94,985               33%                               0.788
Abbreviations                                        80,906               43%                               0.809
Lemmatization                                        66,008               54%                               0.820
Jaro-Winkler                                         28,140               80%                               0.720

Figure 1 shows the 11-point precision-recall graphs for the NLP pre-processing experiments in which the results were better than the original baseline. In addition, the best results, in the upper-right corner, are circled, as these are the results from the experiments with both precision and recall larger than 0.8.

Figure 1: 11 Point precision for NLP Experiments on TF-IDF Feature Extraction

4 Conclusions

In this research, we have shown that by applying a variety of Natural Language Processing pre-processing techniques, it is possible to significantly increase the inner-class similarity and the inter-class difference, resulting in up to 11.3% better machine-learning quality at the lemmatization step. At the same time, the dimensionality of the feature space is reduced by up to 54%, leading to highly reduced computational and memory requirements and better response times in building the model of the feature space as well as in the machine learning and classification. Further pre-processing with synonyms and Jaro-Winkler leads to even larger vector-size reductions (60% and 80%), but also to lower machine-learning quality compared to lemmatization, although the synonym results are still better than the baseline. One can even say that the highly reduced vector length of the Jaro-Winkler processing leads to 80% smaller vectors at the cost of only 4% less machine-learning quality. In order to investigate the stability of the results, we also tested the same model for several other classes in the RCV1 corpus with up to 5 varying random training and test sets; the results were similar to the ones presented here.

There is of course a price to the NLP processing, but each technique is applied only once per document. High-dimensional feature spaces lead to more problems, especially in cases where the machine learning process has to be repeated several times due to the addition of new documents or improved sets of training documents.


5 Future Research

A number of improvements can be made to the current setup. First, by using Dependency Grammars, we expect to reach a much better quality of co-reference and pronoun resolution. The currently available information from the POS tagging is not always sufficient to resolve the co-references and pronouns. Recent research shows that applying Dependency Grammars to this problem is very promising [34]. Second, the synonym replacement did not improve the results. We expect this to be caused by the fact that the context of the synonym usage is not taken into consideration. A better, context-sensitive synonym replacement could result in maintaining or improving the quality of the machine learning, as shown in [14]. Finally, the Jaro-Winkler algorithm caused problems for words that contain a negating prefix, such as democratic and un-democratic. These contradicting words were replaced by the same common token, which resulted in quality loss. By detecting certain negating prefixes and excluding them from the Jaro-Winkler similarity detection, we also expect the machine learning results to improve.

6 References

[1] Bernhard E Boser, Isabelle M Guyon, and Vladimir N Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152. ACM, 1992.

[2] Chih-Chung Chang and Chih-Jen Lin. LIBSVM. http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

[3] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.

[4] Devijver, P.A. and Kittler, J. (1982). Pattern Recognition: A Statistical Approach. Prentice Hall.

[5] The Oxford English Dictionary. List of abbreviations. http://www.indiana.edu/~letrs/help-services/QuickGuides/oed-abbr.html.

[6] Duda, R.O. and Hart, P.E. (2001). Pattern Classification (2nd Edition). John Wiley and Sons.

[7] Kai-Bo Duan and S Sathiya Keerthi. Which is the best multiclass SVM method? An empirical study. In Multiple Classifier Systems, pages 278–285. Springer, 2005.

[8] Feldman, R., and Sanger, J. (2006). The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press.

[9] Maura R. Grossman, Gordon V. Cormack, Bruce Hedin, and Douglas W. Oard. Overview of the TREC 2011 legal track. In TREC, 2011.

[10] Martha van den Hoven, Antal van den Bosch, Kalliopi Zervanou. Beyond Reported History: Strikes That Never Happened. ILK Technical Report Series 10-05. August 2010.

[11] Matthew A Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.

[12] Heeyoung Lee, Angel Chang, Yves Peirsman, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. Deterministic co-reference resolution based on entity-centric, precision-ranked rules. Computational Linguistics, 1–54, 2013.

[13] David D Lewis, Yiming Yang, Tony G Rose, and Fan Li. RCV1: A new bench- mark collection for text categorization research. The Journal of Machine Learning Research, 5:361–397, 2004.

[14] Aisan Maghsoodi, Merlijn Sevenster, Jan C. Scholtes and Georgi Nabaltov (2012), Automatic Sentence-based Classification of Free-text Breast Cancer Radiology Reports, 25th IEEE International Symposium on Computer-Based Medical Systems (CBMS 2012).


[15] Manning, Christopher D. and Schütze, Hinrich, (1999). Foundations of Statistical Natural Language Processing. MIT Press.

[16] Manning, C.D., Raghavan, P. and Schütze, H. Introduction to Information Retrieval Cambridge University Press, 2008.

[17] Michalski, R.S., Carbonell, J.G. and Mitchell, T.M. (Editors), (1986). Machine Learning, an Artificial Intelligence Approach. Volume 1 & 2. Morgan Kaufmann.

[18] http://www.nwo.nl/catch

[19] Reuters RCV1 Corpus: http://trec.nist.gov/data/reuters/reuters.html

[20] Rijsbergen, C.J. van (1979). Information Retrieval. Butterworths, London.

[21] Scholtes, J.C., Cann, T. van, and Mack, M. (2013). The Impact of Incorrect Training Sets and Rolling Collections on Technology-Assisted Review. International Conference on Artificial Intelligence in Law 2013, DESI V Workshop. June 14, 2013, Consiglio Nazionale delle Ricerche, Rome, Italy.

[22] LVI Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 1966.

[23] David D Lewis, Yiming Yang, Tony G Rose, and Fan Li. Rcv1: A new benchmark collection for text categorization research. The Journal of Machine Learning Research, 5:361–397, 2004.

[24] Ann Bies, Constance Cooper, Mark Ferguson, Alyson Littman, Mitchell Marcus, and Ann Taylor. The Penn tree bank project. http://www.cis.upenn.edu/~treebank/.

[25] Martin F Porter. An algorithm for suffix stripping. Program: electronic library and information systems, 14(3):130–137, 1980.

[26] Karthik Raghunathan, Heeyoung Lee, Sudarshan Rangarajan, Nathanael Chambers, Mihai Surdeanu, Dan Jurafsky, and Christopher Manning. A multi-pass sieve for co-reference resolution. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 492–501. Association for Computational Linguistics, 2010.

[27] Claude Elwood Shannon and Warren Weaver. A mathematical theory of communication, 1948.

[28] Kristina Toutanova, Dan Klein, Christopher D Manning, and Yoram Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 173–180. Association for Computational Linguistics, 2003.

[29] Princeton university. About WordNet. http://wordnet.princeton.edu.

[30] Stanford university. English stop list. http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop.

[31] William E Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. 1990.

[32] Scholtes, J.C. (1995): Artificial Neural Networks in Information Retrieval in a Libraries Context. PROLIB/ANN, EUR 16264 EN, European Commission, DG XIII-E3.

[33] http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop.

[34] Anders Bjorkelund and Jonas Kuhn. Phrase Structures and Dependencies for End-to-End Coreference Resolution. Proceedings of COLING 2012: Posters, pages 145–154, COLING 2012, Mumbai, December 2012.

[35] NLTK: Edward Loper Bird, Steven and Ewan Klein. Natural Language Processing with Python. O’Reilly Media Inc, 2009.
