Nihil Obstat: Language Identification as Text Classification with WEKA

Language Identification, which consists of guessing the natural language in which a text is written (or an utterance is spoken), is not one of the hardest problems in Natural Language Processing. I therefore believe it is a good starting point for learning about the text analysis capabilities available in WEKA.

Others have used this problem in the same way, for example in this tutorial on using LingPipe for Language Identification, or in Alejandro Nolla's post on Detecting Text Language With Python and NLTK. Moreover, you can find a large number of language identification programs, APIs and demos in the Wikipedia article on Language Identification. We may even consider this function a natural language commodity, as you can see from how Google Translate does it by default in the next figure:

The most typical (and rather simple) approach to Language Identification is to store a list of the most frequent character 3-grams in each language, and to check how much the target text overlaps with each of the lists. Alternatively, you can use stop word lists. Of course, the accuracy depends on how you compute the overlap, but even simple distances can make it rather effective.
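To make the 3-gram idea concrete, here is a minimal sketch in plain Python. It is not taken from LingPipe, NLTK or WEKA: the function names, the toy one-sentence "corpora" and the overlap measure (size of the set intersection) are all assumptions made for illustration, and real profiles would be built from far more text.

```python
from collections import Counter


def top_trigrams(text, k=300):
    """Return the set of the k most frequent character 3-grams in text."""
    text = text.lower()
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return {gram for gram, _ in counts.most_common(k)}


def guess_language(text, profiles):
    """Pick the language whose trigram profile shares most 3-grams with the text."""
    grams = top_trigrams(text)
    return max(profiles, key=lambda lang: len(grams & profiles[lang]))


# Toy profiles (hypothetical data): one sentence per language.
samples = {
    "EN": "the quick brown fox jumps over the lazy dog and the cat",
    "FR": "le renard brun saute par-dessus le chien paresseux et le chat",
    "SP": "el zorro marrón salta sobre el perro perezoso y el gato",
}
profiles = {lang: top_trigrams(text) for lang, text in samples.items()}

print(guess_language("the dog and the fox", profiles))
print(guess_language("el perro y el gato", profiles))
```

Counting shared 3-grams is about the simplest overlap measure possible; a rank-based distance between the two frequency lists is a common refinement.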

However, I will not follow this approach here. Instead, I will show how to build a standard text classifier using WEKA, in order to show the options of the StringToWordVector filter (and how to apply it), which is the main tool for text analysis in WEKA.

We have to follow these steps:

To collect data from different languages in order to build a basic dataset.
To prepare the data for learning, which involves transforming it by using the StringToWordVector filter.
To analyze the resulting dataset, and hopefully, to improve it by using attribute selection.
To test over an independent test collection, which will give us a robust estimation of the accuracy of the approaches on real examples.
To learn the most accurate model as obtained from the previous step, and to use it for our classification program.

So this will be a rather long post. Be prepared for it.

Collecting the Data and Creating the Datasets

Following the LingPipe Language ID Tutorial, I collect the data from the Leipzig Corpora Home Page. In particular, I will address guessing among English (EN), French (FR) and Spanish (SP), so I have gone to the download page, completed the CAPTCHA to get the list of available corpora, and downloaded:

The 2005 English 10k corpus of news in text format.
The 2009 French 10k corpus of news in text format.
The 2001-2002 Spanish 10k corpus of news in text format -- which is no longer there as far as I can see.

For your convenience, I have put these corpora in my LangID GitHub demo page. The files have the following format:

1 I didn't know it was police housing," officers quoted Tsuchida as saying.
2 You would be a great client for Southern Indiana Homeownership's credit counseling but you are saying to yourself "Oh, we can pay that off."
3 He believes the 21st century will be the "century of biology" just as the 20th century was the century of IT.

So I have loaded them into an OpenOffice spreadsheet, and replaced the number column with the corresponding tag for each language: EN, FR, and SP. Then I have escaped the " and ' characters, because they are string delimiters in the WEKA Attribute-Relation File Format (ARFF). In order to build the datasets, I have split the data, keeping the first 9K sentences of each language for training and the remaining 1K for testing. As some learning algorithms may be sensitive to the instance order, I have interleaved the instances in batches of 1K texts, so the first 1K sentences are in English, the next 1K sentences are in French, and so on. The training data has the following header:

@relation langid_train
@attribute language_class {EN,FR,SP}
@attribute text String
@data
EN,'I didn\'t know it was police housing,\" officers quoted Tsuchida as saying.'
EN,'You would be a great client for Southern Indiana Homeownership\'s credit counseling but you are saying to yourself \"Oh, we can pay that off.\"'
EN,'He believes the 21st century will be the \"century of biology\" just as the 20th century was the century of IT.'
../..
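The spreadsheet work described above can also be scripted. The following is a minimal sketch in plain Python, not the procedure actually used for the post. The helper names `arff_quote` and `to_arff` are hypothetical, and also escaping the backslash itself is an extra precaution assumed here (the text only mentions escaping " and ').

```python
def arff_quote(text):
    """Quote a string for ARFF, backslash-escaping \\, ' and " (assumed scheme)."""
    escaped = text.replace("\\", "\\\\").replace("'", "\\'").replace('"', '\\"')
    return "'" + escaped + "'"


def to_arff(relation, rows, classes):
    """Build ARFF text from (label, sentence) pairs with a nominal class first."""
    lines = [
        "@relation " + relation,
        "@attribute language_class {" + ",".join(classes) + "}",
        "@attribute text string",
        "@data",
    ]
    lines += [label + "," + arff_quote(sentence) for label, sentence in rows]
    return "\n".join(lines) + "\n"


rows = [
    ("EN", 'I didn\'t know it was police housing," officers quoted Tsuchida as saying.'),
    ("SP", "Esto es un texto de prueba."),
]
print(to_arff("langid_train", rows, ["EN", "FR", "SP"]))
```

The first data line printed by this sketch has the same escaping as the first EN line of the header shown above.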

The ARFF files for training and testing are available at the GitHub repository for the demo as well. If you open the training file (langid.collection.train.arff) in the WEKA Explorer and set the class to be the first attribute, you should get something like the following figure:

So we have a training collection with 9K instances per class (language), and a test collection with 1K instances per class.

Data Transformation

As in previous posts about text classification with WEKA, we need to transform the text strings into term vectors to enable learning. This is done by applying the StringToWordVector filter, which is the most remarkable text mining function in WEKA. In previous posts, I have applied this filter with default options, but it offers a wide range of possibilities that can be seen when opening it in the WEKA Explorer. If you click on the Filter button, browse the tree to "weka > filters > unsupervised > attribute > StringToWordVector", and then click on the filter name, you get the next window:

Those are a lot of options, aren't they? So let us focus on the minimum set of options we need in order to be productive with this example of Language Identification. Those are:

doNotOperateOnPerClassBasis - we set this option to True in order to make the filter collect word tokens over the classes as a whole. This should be the standard setting in nearly all text classification problems.
lowerCaseTokens - we set this option to True because we are interested in the words regardless of whether they are written in upper or lower case. In other problems, e.g. when processing Social Networks text, keeping the capitalization may be critical for getting good accuracy.
tokenizer - WEKA provides several tokenizers, intended to break the original texts into tokens according to a number of rules. The simplest tokenizer is weka.core.tokenizers.WordTokenizer, which splits the string into tokens by using a list of separators that can be set by clicking on the tokenizer name. It is a good idea to take a look at the texts we have before setting up the list of separating characters. In our case, we have several languages and the default punctuation symbols may not fit our problem -- we need to add opening question and exclamation marks, as well as symbols coming from HTML markup like &. So our delimiters string will be " \r\n\t.,;:\"\'()?!-¿¡+*&#$%\\/=<>[]_`@" (backslash is escaped).
wordsToKeep - we set this option to keep as many words as we can, in order to include the full vocabulary of the dataset. An appropriate value may be one million or more.

We leave the rest of the options at their defaults. Most notably, we are not using sophisticated weighting schemes (like TF or TF.IDF), stop words or stemming. These options are very common in Information Retrieval systems like Apache Lucene/SOLR, and they often lead to nice accuracy improvements in search systems.
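To see what the lowerCaseTokens option and the delimiters string do to a sentence, here is a small sketch in plain Python. It only mimics the behaviour described above (lowercase, then split on any delimiter character). It is not WEKA's WordTokenizer code, and the names `DELIMITERS` and `tokenize` are made up for the example.

```python
# The delimiters listed above, written as a Python string literal.
DELIMITERS = " \r\n\t.,;:\"'()?!-¿¡+*&#$%\\/=<>[]_`@"


def tokenize(text, delimiters=DELIMITERS, lower=True):
    """Split text on any delimiter character and drop empty tokens."""
    if lower:
        text = text.lower()
    # Map every delimiter to a plain space, then split on spaces.
    table = {ord(ch): " " for ch in delimiters}
    return [tok for tok in text.translate(table).split(" ") if tok]


print(tokenize("¿Qué tal? I didn't know!"))
print(tokenize("Tom & Jerry", lower=False))
```

Note how the apostrophe, being a delimiter, splits "didn't" into "didn" and "t". That is harmless here, but it is worth keeping in mind when reading the resulting attribute list.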

We need to have the same vocabulary both in the training and the testing datasets, so we can apply this filter in the command line by using the batch (-b) option:

$> java weka.filters.unsupervised.attribute.StringToWordVector -O -L -tokenizer "weka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\"\\'()?!-¿¡+*&#$%\\\\/=<>[]_\`@\"" -W 10000000 -b -i langid.collection.train.arff -o langid.collection.train.vector.arff -r langid.collection.test.arff -s langid.collection.test.vector.arff

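The options -O, -L, -tokenizer and -W correspond to the options above. The delimiter string is escaped because it is included in the specification of the tokenizer. The resulting files are also in the GitHub repository for the LangID example, along with the script stwv.sh (String To Word Vector), which includes this command. If you paste the command into an interactive bash session and get an "event not found" error, bash is treating the ! in the delimiter string as a history expansion; running set +H first disables it.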
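The point of batch mode is that the vocabulary is fixed by the training data and then reused, column for column, on the test data. The sketch below shows that idea in plain Python with whitespace tokenization and 0/1 presence values. It is an illustration with hypothetical helper names and toy sentences, not a description of the filter's internals.

```python
def build_vocabulary(train_texts):
    """Collect the sorted set of lowercased tokens seen in the training texts."""
    return sorted({tok for text in train_texts for tok in text.lower().split()})


def vectorize(text, vocabulary):
    """Binary presence vector over a fixed vocabulary; unseen words are ignored."""
    tokens = set(text.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]


train = ["the house is big", "la maison est grande", "la casa es grande"]
test = ["the casa is nueva"]

vocabulary = build_vocabulary(train)  # built from the training data only
train_vectors = [vectorize(t, vocabulary) for t in train]
test_vectors = [vectorize(t, vocabulary) for t in test]  # same columns as training

print(vocabulary)
print(test_vectors[0])
```

The word "nueva" never appears in the training texts, so it has no column and is silently dropped from the test vector.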

Data Analysis and Improvement

If we take a quick look at the terms or tokens we have got, e.g.:

@attribute archival numeric
@attribute archivarlos numeric
@attribute archivas numeric
@attribute archives numeric
@attribute archiving numeric
@attribute archivo numeric
@attribute archivos numeric

We can imagine that most of them will be useless for Language Identification. This motivates a more precise analysis of the tokens by using some kind of quality metric, like Information Gain. To do so, I apply the weka.filters.supervised.attribute.AttributeSelection filter, as I did in my posts on selecting attributes by chaining filters and on command line functions for text mining. So I issue the following command:

$> java weka.filters.supervised.attribute.AttributeSelection -c 1 -E weka.attributeSelection.InfoGainAttributeEval -S "weka.attributeSelection.Ranker -T 0.0" -b -i langid.collection.train.vector.arff -o langid.collection.train.vector.ig0.arff -r langid.collection.test.vector.arff -s langid.collection.test.vector.ig0.arff

We apply the filter in batch mode as well, in order to get the same attributes both in the training and in the test collections. We also set the first attribute as the class (with the option -c), and set the threshold for keeping attributes to 0.0 in the weka.attributeSelection.Ranker search method. This means that we keep only those attributes with an Information Gain score over 0, sorted according to their score. This command is included in the asig.sh (Attribute Selection by Information Gain) script of the GitHub repository for the LangID example, along with the data files.

From the original 65,429 word attributes we got in the previous step, we have kept only 16,840 (25.74% of the original ones). We can be more aggressive by setting the threshold to a bigger value (e.g. 0.2).
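For readers who want to see what is being scored, here is a minimal sketch of Information Gain for a binary "word occurs / does not occur" attribute, followed by a ranking with a threshold. It is written in plain Python with toy data. The function names are hypothetical and it is not the code of InfoGainAttributeEval or Ranker.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(present, labels):
    """Class entropy minus the entropy left after splitting on word presence."""
    with_term = [lab for p, lab in zip(present, labels) if p]
    without_term = [lab for p, lab in zip(present, labels) if not p]
    remainder = sum(
        len(part) / len(labels) * entropy(part)
        for part in (with_term, without_term)
        if part
    )
    return entropy(labels) - remainder


def select_terms(term_presence, labels, threshold=0.0):
    """Keep terms scoring over the threshold, best first."""
    scores = {t: information_gain(p, labels) for t, p in term_presence.items()}
    kept = [t for t in scores if scores[t] > threshold]
    return sorted(kept, key=lambda t: scores[t], reverse=True)


labels = ["EN", "EN", "FR", "FR", "SP", "SP"]
term_presence = {  # toy data: does the term occur in each of the six texts?
    "2005": [True, False, True, False, True, False],
    "archivo": [False, False, False, False, True, False],
    "the": [True, True, False, False, False, False],
}
print(select_terms(term_presence, labels))
```

In this toy data "the" separates English perfectly and ranks first, "archivo" helps a little, and "2005" is spread evenly over the classes, so it scores zero and is dropped.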

The first twenty attributes are the following:

As we can see, all of them are very frequent words (in each language) that would appear in the stop word lists of those languages. In consequence, our "pure" data mining approach is quite close to the traditional one based on stop words.

It makes sense to learn a J48 tree to get an idea of the complexity of the term relations. The weka.classifiers.trees.J48 algorithm implements Quinlan's popular C4.5 learner, and as it outputs a decision tree, it can give us valuable insights into the term relations, e.g. which co-occurring terms are more predictive. We can train that classifier on our new training dataset with the following command:

$> java weka.classifiers.trees.J48 -t langid.collection.train.vector.ig0.arff -no-cv

We get a quite complex decision tree, with 273 nodes of which 137 are leaves. All the tests in the tree look like "word > 0" or "word <= 0". This means that the algorithm finds that only the occurrence of a word is important, not its weight. The root of the tree is, unsurprisingly, a test on "the", and the smallest side of the tree (its right hand side, with "the > 0") is the following one:

the > 0
| de <= 0: EN (5945.0/8.0)
| de > 0
| | el <= 0
| | | and <= 0
| | | | for <= 0
| | | | | to <= 0: FR (24.0/3.0)
| | | | | to > 0: EN (2.0)
| | | | for > 0: EN (3.0)
| | | and > 0: EN (7.0)
| | el > 0: SP (3.0)

This means, for instance, that the word "the" is an excellent predictive feature: if it occurs in a text and the word "de" (from French or Spanish) does not, that text is most likely written in English (with an estimated likelihood of 99.87% on the training collection, as 5,937 of the 5,945 training texts reaching that leaf are English). The overall accuracy of J48 over the training collection is 98.3963%.
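The branch printed above can be read as a chain of if-tests, and each leaf annotation such as (5945.0/8.0) gives the number of training texts reaching the leaf and how many of them are misclassified. The following plain Python sketch transcribes only that branch by hand. The function names are hypothetical and this is not code generated by WEKA.

```python
def leaf_purity(covered, misclassified):
    """Fraction of the training texts at a leaf that carry the leaf's label."""
    return (covered - misclassified) / covered


def classify_the_branch(words):
    """Follow the 'the > 0' side of the tree; words is the set of tokens in a text."""
    if "the" not in words:
        raise ValueError("this sketch only covers the 'the > 0' side of the tree")
    if "de" not in words:
        return "EN"  # leaf EN (5945.0/8.0)
    if "el" in words:
        return "SP"  # leaf SP (3.0)
    if "and" in words or "for" in words or "to" in words:
        return "EN"  # leaves EN (7.0), EN (3.0) and EN (2.0)
    return "FR"      # leaf FR (24.0/3.0)


print(leaf_purity(5945, 8))
print(classify_the_branch({"the", "house", "is", "big"}))
print(classify_the_branch({"the", "de", "la", "maison"}))
```

The FR leaf is a reminder that "the" does turn up in a few French training sentences, for instance inside quoted names or titles, which is why the tree needs the extra tests.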

Training and then Evaluating on the Test Collection

Before we start training and evaluating, we have to decide which algorithms are most appropriate for the problem. In my experience with text learning, it is wise to test at least the following ones:

The Naive Bayes probabilistic approach, which is quick and gets good results on average text learning problems. In WEKA, it is implemented in the weka.classifiers.bayes.NaiveBayes class.
The rule learner PART, which induces a list of rules by learning partial decision trees. It is a symbolic algorithm that produces rules which can be very valuable as they are easy to understand. This algorithm is implemented by the weka.classifiers.rules.PART class.
Of course, the J48 algorithm, because of its visualization capabilities.
The lazy learner k-Nearest Neighbors (kNN), which occasionally gives excellent results in text classification problems. The WEKA class that implements this algorithm is weka.classifiers.lazy.IBk.
The Support Vector Machines algorithm, which is probably the most effective on text classification problems because of its ability to focus on the most relevant examples in order to separate the classes. It is a very good learning algorithm for sparse datasets, and it is available in WEKA via the weka.classifiers.functions.SMO class or through the LibSVM library. I choose the Sequential Minimal Optimization (SMO) implementation embedded in WEKA.

Also, when Naive Bayes or J48 are effective, I usually get accuracy improvements, from small to rather big, by using boosting, implemented by the weka.classifiers.meta.AdaBoostM1 class in WEKA. Boosting takes as input a weak classifier, and builds a classifier committee by iteratively training that weak learner on those dataset subsets on which the previous learners are not effective. In this case, I will not apply boosting because the weak learners get rather high levels of accuracy, and it is most likely that boosting will only achieve a marginal improvement (if any) at the cost of a much longer training time.

I have written a script named test.sh to execute all these algorithms with default options, available at the GitHub repository for the LangID demo. The results obtained by the algorithms are included in the repository as well, and summarized in the next table:

The different versions of the lazy algorithm kNN tested here appear to be very weak. It is likely that we can improve their performance by changing the way the distance among examples is computed (from the Euclidean distance to one more appropriate for text, such as the cosine similarity), but their performance is so low that they are unlikely to score better than the rest of the algorithms.
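Why the distance matters for text can be seen with three tiny sparse vectors. The sketch below is plain Python with hypothetical function names and toy data. It does not reproduce how IBk computes distances, it only contrasts the two measures mentioned above.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two sparse vectors stored as {term: value}."""
    keys = set(a) | set(b)
    return sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


short_en = {"the": 1.0}
long_en = {"the": 1.0, "of": 1.0, "and": 1.0, "to": 1.0}
short_sp = {"el": 1.0}

# Euclidean distance: the short English text looks closer to the Spanish one.
print(euclidean(short_en, long_en), euclidean(short_en, short_sp))
# Cosine similarity: it only shares something with the long English text.
print(cosine_similarity(short_en, long_en), cosine_similarity(short_en, short_sp))
```

With the Euclidean distance, differences in text length dominate, so two short texts in different languages can end up as nearest neighbours. The cosine similarity normalizes length away and only rewards shared words.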

The top algorithms in this test are Naive Bayes and Support Vector Machines. There is a trade-off between both algorithms: SVMs are more effective (in fact, they are very effective) but they take quite a lot of time to train, while Naive Bayes is less effective but quicker to train. In terms of classification time, both algorithms are linear in the number of attributes.

Even though we have used a large number of attributes, there are some examples with rather weak representations. For instance, let us check the following instances or texts:

{58 1,94 1,313 1,1663 1}
{119 1,361 1,2644 1,16840 FR}
{2 1,16840 SP}

The first and second examples have only four and three occurring words respectively (the class value for the first text is EN, which is omitted in the sparse format used by WEKA in this example), and the third example has only one word ("el"). The attribute numbers of the first two examples (58 or over) mean that their words are not among the most informative ones, while in the third example we find a very informative word. If we apply a more aggressive selection using Information Gain, we will be left with a lot of examples with null representations, which will fall to the most likely class. As the classes have a balanced distribution, the language chosen in that case will be EN, which is the default value for the class attribute.
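The three sparse instances can be decoded with a few lines of code, which also makes it easy to see what a more aggressive selection would do to them. This is a hedged sketch in plain Python, not WEKA's ARFF reader. The function names are hypothetical, and the cut-off of 50 attributes is an arbitrary value chosen for illustration, relying on the attributes being sorted by Information Gain as described earlier.

```python
def parse_sparse_instance(line, class_index, default_class):
    """Parse a sparse row like '{2 1,16840 SP}' into (word_indices, label)."""
    body = line.strip().strip("{}")
    word_indices, label = [], default_class
    for entry in body.split(","):
        if not entry.strip():
            continue
        index, value = entry.split()
        if int(index) == class_index:
            label = value  # class stored explicitly
        else:
            word_indices.append(int(index))
    return word_indices, label


def surviving_words(word_indices, keep_top):
    """Word attributes left if only the keep_top best-ranked ones are kept."""
    return [i for i in word_indices if i < keep_top]


rows = ["{58 1,94 1,313 1,1663 1}", "{119 1,361 1,2644 1,16840 FR}", "{2 1,16840 SP}"]
for row in rows:
    words, label = parse_sparse_instance(row, class_index=16840, default_class="EN")
    print(label, words, surviving_words(words, keep_top=50))
```

With the illustrative cut-off of 50, the first two texts are left with no words at all, while the third keeps its single, highly ranked word.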

Learning the Best Classifier and Using it Programmatically

After our experiments, we know that the best classifier in our tests is the SVM. So it is time to learn it and store the classifier in a file for further programmatic use. For this purpose, I have written a script that trains the classifier and stores the model in a file, using the following command-line call:

$> java weka.classifiers.meta.FilteredClassifier -t langid.collection.train.arff -c first -no-cv -d smo.model.dat -v -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.StringToWordVector -O -L -tokenizer \\\"weka.core.tokenizers.WordTokenizer -delimiters \\\\\\\" \\\\\\\r\\\\\\\n\\\\\\\t.,;:\\\\\\\\\\\\\\\"'()?!-¿¡+*&#$%/=<>[]_\`@\\\\\\\"\\\" -W 10000000\" -F \"weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S \\\"weka.attributeSelection.Ranker -T 0.0\\\"\"" -W weka.classifiers.functions.SMO

This call is rather painful because of the nested, and nested, and nested, and nested quotes. So I have pretty-printed it in the learn.sh script at the GitHub repository for the LangID example. For dealing with nested quotes, follow the advice in the Wikipedia article about nested quotation.

With this call, we have stored a model in the file smo.model.dat, which chains the StringToWordVector filter, the AttributeSelection filter, and an SMO classifier by using the weka.classifiers.meta.FilteredClassifier and the weka.filters.MultiFilter classes, as I have explained in the post on Command Line Functions for Text Mining in WEKA.

One good point of WEKA is that we can learn a model in the command line and use it in a program. I have modified the MyFilteredClassifier.java program I used in my post describing A Simple Text Classifier in Java with WEKA, and I have committed it to the GitHub repository with the name LanguageIdentifier.java. I have created three sample test files as well: test_en.txt, test_fr.txt and test_sp.txt. The program works as follows:

$> javac LanguageIdentifier.java
$> java LanguageIdentifier
Usage: java LanguageIdentifier <fileData> <fileModel>
$> java LanguageIdentifier test_en.txt smo.model.dat
===== Loaded text data: test_en.txt =====
This is a sample test for the language identifier demo.
===== Loaded model: smo.model.dat =====
===== Instance created with reference dataset =====
@relation 'Test relation'
@attribute language_class {EN,FR,SP}
@attribute text string
@data
?,' This is a sample test for the language identifier demo.'
===== Classified instance =====
Class predicted: EN
$> java LanguageIdentifier test_fr.txt smo.model.dat
===== Loaded text data: test_fr.txt =====
Ceci est un test de l'échantillon pour la démonstration de l'identificateur de langue.
===== Loaded model: smo.model.dat =====
===== Instance created with reference dataset =====
@relation 'Test relation'
@attribute language_class {EN,FR,SP}
@attribute text string
@data
?,' Ceci est un test de l'échantillon pour la démonstration de l'identificateur de langue.'
===== Classified instance =====
Class predicted: FR
$> java LanguageIdentifier test_sp.txt smo.model.dat
===== Loaded text data: test_sp.txt =====
Esto es un texto de prueba para la demostración del identificador de idioma.
===== Loaded model: smo.model.dat =====
===== Instance created with reference dataset =====
@relation 'Test relation'
@attribute language_class {EN,FR,SP}
@attribute text string
@data
?,' Esto es un texto de prueba para la demostración del identificador de idioma.'
===== Classified instance =====
Class predicted: SP

So the program is correct on the three examples. Remember that you have to learn the model before using the program. As a side note, as the program only uses a FilteredClassifier object, you can change the script to accommodate a different algorithm. For instance, you can just replace the text "weka.classifiers.functions.SMO" with "weka.classifiers.bayes.NaiveBayes" in the learn.sh script, and the program will work the same way -- but with a different model.

Concluding Remarks

While relatively simple, the Language Identification problem helps to identify the essential tasks we have to perform when building text classifiers with WEKA. It is a complete example in the sense that we have not only collected the dataset and learnt on it, but we have also dug a bit into the most suitable representation, by playing with attribute selection and by using a tentative classifier to visualize the data. It also demonstrates some basic configurations of the StringToWordVector filter, which is the most remarkable tool in WEKA for text mining.

If you have had the time to read this whole post, and even tried the program: thank you! I hope it has been a valuable time investment. I am tempted to suggest that you modify the dataset to include more languages, as the problem I have addressed is relatively simple -- only three, quite different, languages.

Finally, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or if you have questions or suggestions for further articles on these topics!

7 comments:

j1c1m1b1 said...: Excellent blog post, quite useful. Thank you very much :); 5:45 p. m.
kashif said...: Hi Jose, I appreciate your work for providing expert opinion about problems beginners in weka and i learned alot from this blog..

I want to ask a question about StringToWordVector that why it requires us to fill the number of words to keep in vocabulary , should'nt it do it by itself depending on how many words founded in vocabulary?

Secondly if we set to 100000 and we have only 10,000 features would it still search for 1 million in dataset ?

Thanx in advance for help; 3:46 p. m.
Unknown said...: Thanks for the detailed explanation. Is there a way to use an identifier on the txt documents? I am in need of this for model evaluation. Thanks.; 12:55 p. m.
Jose Maria Gomez Hidalgo said...: Dear Laritza

Thanks for reading the post. I am afraid I do not understand you very well. Please, can you be more specific about what you need?

Regards and thanks again,

- JM; 2:18 p. m.
Anonymous said...: Images broken.; 7:24 p. m.
Anonymous said...: I am trying to run learn.sh, but it gives the following error: bash: !-¿¡+*: event not found
I have tried several ways to solve the problem, but I could not create "smo.model.dat". Could you make it available?; 3:32 a. m.
ivanr said...: This comment has been removed by the author.; 10:04 p. m.

20.5.13
