26.4.13

URL Text Classification with WEKA, Part 1: Data Analysis

I recently came across a website named SquidBlackList.org, which features a number of URL lists for safe web browsing using the open source proxy Squid. In particular, it features quite a big list of porn domains, so I wondered: is it possible to build a text classification system with WEKA that detects porn domains using the text of the URLs?

Just to note that the Squidblacklist porn list (and most of the other lists they provide) is licensed under the Creative Commons Attribution 3.0 Unported License: Blacklists (Squidblacklist.org) / CC BY 3.0.

The Filtering Problem

Most web filtering systems work by using a list of URLs manually classified into a set of categories, which are then used to define filtering profiles (e.g. block porn but allow press). The URL list or database must be manually maintained, and it has to be quite comprehensive regarding user browsing behaviour. As (aggregated) web browsing follows a Zipfian distribution (that is, relatively few URLs accumulate most of the traffic), you can provide a rather effective service by ensuring that your URL database covers the most popular URLs. URL-based filtering is rather efficient (if your database is well implemented), and it can easily cover around 95% of the web traffic (in terms of #requests, not in terms of #URLs).

However, covering the remaining 5% requires performing some kind of analysis. My target here is to dynamically classify that 5% of web requests (which may account for millions of URLs, or even just domains) into two classes: notporn and porn. This way we can cover 100% of the traffic, and we are likely to concentrate our classification mistakes (which may occur in the URL database as well) only in that small 5%, so our filter can be 98% effective or more.

Why analyze the URL text? As a matter of efficiency: you do not have to go to the Internet and fetch the actual Web content in order to analyze it, so all the processing is local to the proxy, and you avoid performing unnecessary Web requests at the proxy itself.

Collecting the Dataset

So we start with an 880k porn domains list; but although it is possible to learn only from positive examples, we may expect better effectiveness if we collect negative examples (not-porn domains) as well. A handy resource is the Top 1M Sites list by Alexa, a Web research company that provides this ranked list on a daily basis. Having 1M negative examples and 880k positive examples makes for a good class balance and a well-populated dataset -- nice for learning, especially when its instances are relatively short text sequences (e.g. google.com vs. porn.com).

First we have to make both lists comparable. The format of the Alexa list is <rank>,<domain>, while the format of the Squid blacklist is <dot><domain> (in order to match the Squid URL list format). A couple of cut and sed commands will do the trick.
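For instance, something along these lines (a sketch; the file names are just illustrative):

$ cut -d',' -f2 top-1m.csv > alexa.csv      # keep only the domain, dropping the rank
$ sed 's/^\.//' squid-porn.acl > porn.csv   # strip the leading dot from the Squid list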

Then we can just add the class label and mix the lists (we will do both below, in The Dataset section).

Cleaning the Dataset, first step

But... hey, the Internet is for porn! -- we should expect that some of the URLs in the Alexa ranking are pornographic. In fact, a simple search demonstrates it:

$ grep porn alexa.csv | more
pornhub.com
youporn.com
...
$ grep porn alexa.csv | wc -l
5719

We can just subtract the porn list from the Alexa list with a handy grep:

grep -f porn.csv -v alexa.csv > alexaclean.csv

But it takes a loooooong time, so I prefer to sort the Alexa list, transform it to Unix format (as the original one has DOS line endings), and use comm (which requires both input files to be sorted):

$ sort alexa.csv > alexasorted.csv
$ fromdos alexasorted.csv
$ comm -23 alexasorted.csv porn.csv > alexaclean.csv
$ wc -l alexaclean.csv
975088 alexaclean.csv

Good, so only about 25k URLs were pornographic... Well, let's check:

$ grep porn alexaclean.csv | head
001porno.com
0dayporn.org
1000porno.net
...

So we still have some porn in there.

Cleaning the Dataset, second step

Cleaning the Alexa list of porn is a bit more complex. How can we find those popular porn sites, if they are not even in a list as comprehensive as the Squidblacklist one? Another resource comes to our help: the sex-related search engine PornMD. This engine has recently published a list of popular porn searches in the form of a dynamic infographic named Global Internet Porn Habits:

So, if you collect a list of the top searches in five of the biggest English-speaking countries, you get:

After cleaning the list of duplicated words, adding "porn", "sex" and "xxx" (a rule of thumb), and computing the number of domains in which they occur in the Alexa (cleaned) and the Squidblacklist lists, we get:

Looking at the list, a relatively safe ratio between the number of occurrences in Squid's list versus Alexa's (clean) one is 9 -- this way, we keep the most obvious words and remove the most ambiguous ones (although there are some borderline examples, such as "asian"). We can see the effects:

$ grep "amateur\|anal\|asian\|creampie\|hentai\|lesbian\|mature\|milf\|squirt\|teen\|porn\|sex\|xxx" alexaclean.csv | wc -l
17389
$ grep "porn\|sex\|xxx" alexaclean.csv | wc -l
12342
$ grep -v "amateur\|anal\|asian\|creampie\|hentai\|lesbian\|mature\|milf\|squirt\|teen\|porn\|sex\|xxx" alexaclean.csv > alexacleanfinal.csv
$ wc -l alexacleanfinal.csv
964735 alexacleanfinal.csv

You can see that just "porn", "sex" and "xxx" account for 70.97% of the matched domains, so there is some domain knowledge in the process. I must note that I could have used another, much more extensive list of porn-related searches, like the one featured on the PornMD Most Popular page.

Additional Analysis

To get a feeling of how the previous porn-related keywords are distributed across the original Alexa ranking, I have computed the number of lines (domains) in which they occur per 100k interval, obtaining the following chart:

Here #query1 represents the number of occurrences of "porn\|sex\|xxx" and #query2 represents the full list of keywords. The growth is nearly linear, with an average of 1234.2 URLs per interval for #query1 and 1738.9 URLs per interval for #query2. The curves are smooth, and there are more domains in the first intervals (e.g. 1482 hits in the first 100k Alexa URLs for #query1) than in the last ones (e.g. 1077 hits in the last 100k Alexa URLs for #query1).

There are other dataset statistics that may provide better insights regarding the classification problem, or in other words, that may be more informative or predictive in terms of classification accuracy. For instance:

  • What is the average length of a domain name in each category?
  • How many dots and/or dashes do domains have on average in each category? (For a quick take on these first two, see the awk sketch after this list.)
  • What is the distribution of the different TLDs (Top Level Domains) across both categories?
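Once the labelled urllist.csv from The Dataset section below is built, an awk one-liner along these lines would answer the first two questions (a sketch; the -2 discounts the quotes around each domain, and dashes would be counted analogously):

$ awk -F',' '{ len[$2] += length($1) - 2; dots[$2] += gsub(/\./, ".", $1); n[$2]++ }
  END { for (c in n) printf "%s: avg length %.1f, avg dots %.1f\n", c, len[c]/n[c], dots[c]/n[c] }' urllist.csv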

Can you imagine any other interesting statistics?

The Dataset

Once we have the original Squidblacklist and the cleaned Alexa list (after the subtraction and the removal of the keyword-matching lines), we add some formatting to get a WEKA ARFF file. For instance, 0000free.com must be transformed into '0000free.com',safe. A bit of sed trickery does the job.
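For instance, something along these lines (a sketch using GNU sed's in-place editing, where & stands for the whole matched line):

$ sed -i "s/.*/'&',safe/" alexacleanfinal.csv
$ sed -i "s/.*/'&',porn/" porn.csv

Then we mix the lists with the following command: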

$ paste -d '\n' alexacleanfinal.csv porn.csv > urllist.csv

The rationale behind mixing the lists is that some learning algorithms are dependent on the order of the examples, and for those algorithms it is wise not to present all the examples of one class first and then all of the other class' ones. As the paste command adds empty lines when one of the lists finishes, we have to remove the resulting blank lines (\n\n) with another sed call.
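A classic one-liner does the job (a sketch):

$ sed -i '/^$/d' urllist.csv

We finally add the ARFF header to get a file starting the following way: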

@relation URLs

@attribute urltext String
@attribute class {safe,porn}

@data
'0000free.com',safe
'0000000000000000000sex.com',porn
'0000.jp',safe
'000000000gratisporno.ontheweb.nl',porn
...

I have left that file, named urllist.arff, in my GitHub folder for your convenience, so you can start playing with it. Beware: it is over 40 MB.

So that is all for the moment. Stay tuned for my next steps if you liked this post.

Thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on this topic!

8.4.13

A Simple Text Classifier in Java with WEKA

In previous posts [1, 2, 3], I have shown how to make use of the WEKA classes FilteredClassifier and MultiFilter in order to properly build and evaluate a text classifier using WEKA. For this purpose, I have made use of the Explorer GUI provided by WEKA, and its command-line interface.

In my opinion, it is a good idea to get familiar with both the Explorer and the command-line interface if you want to get a feeling of the amazing power of this data mining library. However, it is in your own Java programs where you can take full advantage of its power. Now it is time to deal with that.

Following Salton, and Belkin and Croft, the process of text classification involves two main steps:

  • Representing your text database in order to enable learning, and training a classifier on it.
  • Using the classifier to predict text labels of new, unseen documents.

The first step is a batch process, in the sense that you can do it periodically (as long as your labelled data set improves with time -- bigger sizes, new labels or categories, predictions corrected via user feedback). The second step is the moment in which you actually take advantage of the knowledge distilled by the learning process, and it is online in the sense that it is done on demand (when new documents arrive). This distinction is conceptual; I mean that modern text classifiers may retrain on the added documents as soon as they get them, in order to keep or improve accuracy over time.

In consequence, what we need to demonstrate the text classification process is two programs: one to learn from the text dataset, and another one to use the learnt model to classify new documents. Let us start by showing a very simple text learner in Java, using WEKA. The class is named MyFilteredLearner.java, and its main() method demonstrates its usage, which involves:

  1. Loading the text dataset.
  2. Evaluating the classifier.
  3. Training the classifier.
  4. Storing the classifier.

The most interesting parts of the process are:

  • We read the dataset by simply using the getData() method of an ArffReader object that wraps a BufferedReader (see the sketch below).
  • We programmatically create the classifier by combining a StringToWordVector filter (in order to represent the texts as feature vectors) and a NaiveBayes classifier (for learning), using the FilteredClassifier class discussed in previous posts.
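As a reference, the dataset loading step could look like this minimal sketch (assuming that fileName holds the path to the ARFF file):

import java.io.BufferedReader;
import java.io.FileReader;
import weka.core.Instances;
import weka.core.converters.ArffLoader.ArffReader;

BufferedReader reader = new BufferedReader(new FileReader(fileName));
ArffReader arff = new ArffReader(reader);
// getData() returns the whole dataset as an Instances object
Instances trainData = arff.getData();
reader.close();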

The process of creating the classifier is demonstrated in the next code snippet:

trainData.setClassIndex(0);
filter = new StringToWordVector();
filter.setAttributeIndices("last");
classifier = new FilteredClassifier();
classifier.setFilter(filter);
classifier.setClassifier(new NaiveBayes());

So we set the class of the dataset to be the first attribute, then we create the filter and set the attribute to be transformed from text into a feature vector (the last one), and then we create the FilteredClassifier object and add to it the previous filter and a new NaiveBayes classifier. Given the attributes above, the dataset has to have the class as its first attribute, and the text as its second (and last) one, as in my typical SMS spam subset example (smsspam.small.arff).

You can execute this class with the following commands to get the following output:

$>javac MyFilteredLearner.java
$>java MyFilteredLearner smsspam.small.arff myClassifier.dat
===== Loaded dataset: smsspam.small.arff =====

Correctly Classified Instances 187 93.5 %
Incorrectly Classified Instances 13 6.5 %
Kappa statistic 0.7277
Mean absolute error 0.0721
Root mean squared error 0.2568
Relative absolute error 25.8792 %
Root relative squared error 69.1763 %
Coverage of cases (0.95 level) 94 %
Mean rel. region size (0.95 level) 51.75 %
Total Number of Instances 200

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0,636 0,006 0,955 0,636 0,764 0,748 0,943 0,858 spam
0,994 0,364 0,933 0,994 0,962 0,748 0,943 0,986 ham
Weighted Avg. 0,935 0,305 0,936 0,935 0,930 0,748 0,943 0,965
===== Evaluating on filtered (training) dataset done =====
===== Training on filtered (training) dataset done =====
===== Saved model: myClassifier.dat =====

The evaluation has been performed with default values, except for the number of folds, which has been set to 4, as shown in the next code snippet:

Evaluation eval = new Evaluation(trainData);
eval.crossValidateModel(classifier, trainData, 4, new Random(1));
System.out.println(eval.toSummaryString());

In case you do not want to evaluate the classifier on the training data, you can just omit the call to the evaluate() method.
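Regarding the training and storing steps, the core of the learner boils down to something like the following sketch (the actual class may well rely on a plain ObjectOutputStream instead of SerializationHelper):

// Build the model on the full training dataset
classifier.buildClassifier(trainData);
// Serialize the trained FilteredClassifier to the file passed as second argument
weka.core.SerializationHelper.write("myClassifier.dat", classifier);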

Now let us deal with the classification program, which is a bit more complex, but only regarding the process of creating an instance. The class is named MyFilteredClassifier.java, and its main() method demonstrates its usage, which involves:

  1. Reading the text to be classified from a file.
  2. Reading the model or classifier from a file (see the sketch after this list).
  3. Creating the instance.
  4. Classifying it.
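For the second step, reading a serialized model back can be as simple as the following sketch (again, the actual class may use a plain ObjectInputStream instead):

// Deserialize the trained model; the cast must match the stored classifier type
FilteredClassifier classifier =
    (FilteredClassifier) weka.core.SerializationHelper.read("myClassifier.dat");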

The instance is created in the makeInstance() method, and its code is the following:

// Create the attributes, class and text
FastVector fvNominalVal = new FastVector(2);
fvNominalVal.addElement("spam");
fvNominalVal.addElement("ham");
Attribute attribute1 = new Attribute("class", fvNominalVal);
Attribute attribute2 = new Attribute("text",(FastVector) null);
// Create list of instances with one element
FastVector fvWekaAttributes = new FastVector(2);
fvWekaAttributes.addElement(attribute1);
fvWekaAttributes.addElement(attribute2);
instances = new Instances("Test relation", fvWekaAttributes, 1);
// Set class index
instances.setClassIndex(0);
// Create and add the instance
DenseInstance instance = new DenseInstance(2);
instance.setValue(attribute2, text);
// instance.setValue((Attribute)fvWekaAttributes.elementAt(1), text);
instances.add(instance);

The classifier learnt with MyFilteredLearner.java expects instances with two attributes: the first one is the class, a nominal attribute with the values "spam" or "ham"; the second one is a String, the text to be classified. Instead of creating a single instance, we create a whole new dataset whose first instance is the one that we want to classify. This is required in order to let the classifier know the schema of the dataset, which is stored in the Instances object (and not in each individual instance).

So first we create the attributes by using the FastVector class provided by WEKA. The case of the nominal attribute ("class") is relatively simple, but the case of the String one is a bit more complex, because it requires the second argument of the constructor to be null, but cast to FastVector. Then we create an Instances object by using a FastVector to store the two previous attributes, and set the class index to 0 (which means that the first attribute will be the class). As a note, the FastVector class is deprecated in the WEKA development version.

The last step is to create the actual instance. I am using the WEKA development version in this code (as of the date of this post), so we have to use a DenseInstance object. However, if you make use of the stable version, then you can use Instance instead (see the stable version documentation), and must change this code to:

Instance instance = new Instance(2);

As a note, I have left commented in the code a different way of setting the value of the second attribute. I must also note that we do not set the value of the first attribute (the class), as it is unknown.

The rest of the methods are (more or less) straightforward if you follow the documentation (weka - Programmatic Use, and weka - Use WEKA in your Java code). You get the class prediction for your text with the following lines:

double pred = classifier.classifyInstance(instances.instance(0));
System.out.println("Class predicted: " + instances.classAttribute().value((int) pred));

And if you feed this classifier with a file (smstest.txt) that stores the text "this is spam or not, who knows?", and the model learnt with MyFilteredLearner.java (that is stored in myClassifier.dat), then you get the following result:

$>javac MyFilteredClassifier.java
$>java MyFilteredClassifier smstest.txt myClassifier.dat
===== Loaded text data: smstest.txt =====
this is spam or not, who knows?
===== Loaded model: myClassifier.dat =====
===== Instance created with reference dataset =====
@relation 'Test relation'

@attribute class {spam,ham}
@attribute text string

@data
?,' this is spam or not, who knows?'
===== Classified instance =====
Class predicted: ham

It is interesting to see that the class assigned to the instance before classifying it is "?", which means undefined or unknown.

For those interested in using the classifiers discussed in my previous posts (I mean including AttributeSelection, and using PART and SMO as classifiers), the only parts of this code that you have to change are the learn() and evaluate() methods in MyFilteredLearner.java. Just play with it, and have fun.

Thanks for reading, and please feel free to leave a comment if you think I can improve this article, or if you have questions or suggestions for further articles on this topic!

UPDATE (June 26th, 2013): Since I wrote this post, I have moved my code examples and other stuff to a GitHub repository. I have just updated the links.

1.4.13

Command Line Functions for Text Mining in WEKA

In previous posts I have explained how to chain filters and classifiers in WEKA in order to avoid incorrect results when evaluating text classifiers with cross-validation, and how to integrate feature selection into the text classification process. For this purpose, I have used the FilteredClassifier and the MultiFilter in the Explorer GUI provided by WEKA. Now it is time to do the same in the command line.

WEKA essentially provides three usage modes:

  1. Using the Explorer, and other GUIs like the Experimenter, which allow you to set up experiments and examine the results graphically.
  2. Using the command-line functions, which allow you to set up filters, classifiers and clusterers with plenty of configuration options.
  3. Using the classes programmatically, that is, in your own Java programs.

One major difference between modes 1 and 2 is that in the first one you spend some of the memory on the GUI, while in the second one you do not. That can be a significant difference when you load big datasets. In both cases you can control the memory assigned to WEKA using Java command-line options like -Xmx, -Xms and so on, but it may be interesting to save the memory used by the graphic elements in order to be able to deal with bigger datasets.
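For instance, something like the following (the heap size is an illustrative value; adjust it to your machine) gives WEKA up to 1 GB of heap memory:

$ java -Xmx1024m weka.classifiers.rules.PART -t smsspam.small.arff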

I will deal with the usage of WEKA in your own programs in the future; in this post I focus on the command line. Before trying the following examples, please ensure weka.jar is added to your CLASSPATH. The first thing we must know is that WEKA filters and classifiers can be called from the command line, and that a call without arguments will show their configuration options. For instance, when you call a rule learner like PART (which I used in my previous posts), you get the following options:

$>java weka.classifiers.rules.PART
Weka exception: No training file and no object input file given.
General options:
-h or -help
Output help information.
-synopsis or -info
Output synopsis for classifier (use in conjunction with -h)
-t <name of training file>
Sets training file.
-T <name of test file>
Sets test file. If missing, a cross-validation will be performed
on the training data.
...
Options specific to weka.classifiers.rules.PART:
-C <pruning confidence>
Set confidence threshold for pruning.
(default 0.25)
...

I omit the full list of options. Options are divided into two groups: those that are accepted by any classifier, and those specific to the PART classifier. The general options include three usage modes:

  • Evaluating the classifier on the training collection itself, possibly using cross-validation, or on a test collection.
  • Training a classifier and storing the model in a file for further use (see the quick example after this list).
  • Training a classifier and getting its output (the classification of the instances) on a test collection.
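As a quick taste of the second mode (the file names are illustrative; more on this in upcoming posts), the general option -d stores the learnt model in a file:

$ java weka.classifiers.rules.PART -t mydata.arff -d part.model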

However, when calling a filter from the command line, the input file (the dataset) is read from the standard input, so you have to redirect the input from your file by using the appropriate operator (<), or use the option -h to get the options of the filter.
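For instance, something like the following (the output file name is illustrative) applies the StringToWordVector filter to a dataset read from the standard input:

$ java weka.filters.unsupervised.attribute.StringToWordVector < smsspam.small.arff > smsspam.small.vector.arff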

In my previous post on chaining filters and classifiers, I performed an experiment running a PART classifier on an ARFF-formatted subset of the SMS Spam Collection, namely the smsspam.small.arff file. As every instance is of the form [spam|ham],"message text", we have to transform the text of the message into a term-weight vector by using the StringToWordVector filter. You can combine the filter and the classifier evaluation into one command by using the FilteredClassifier class, as in the following command:

$>java weka.classifiers.meta.FilteredClassifier -t smsspam.small.arff -c 1 -x 3 -v -o -F weka.filters.unsupervised.attribute.StringToWordVector -W weka.classifiers.rules.PART

To get the following output:

=== Stratified cross-validation ===
Correctly Classified Instances 173 86.5 %
Incorrectly Classified Instances 27 13.5 %
Kappa statistic 0.4181
Mean absolute error 0.1625
Root mean squared error 0.3523
Relative absolute error 58.2872 %
Root relative squared error 94.9031 %
Total Number of Instances 200

=== Confusion Matrix ===

a b <-- classified as
13 20 | a = spam
7 160 | b = ham

This is exactly the output I showed in my previous post. I have used the following general options:

  • -t smsspam.small.arff to specify the dataset to train on (and, by default, to evaluate on by using cross-validation).
  • -c 1 to specify the first attribute as the class.
  • -x 3 to specify that the number of folds to be used in the cross-validation evaluation is 3.
  • -v and -o to avoid outputting the evaluation statistics on the training data and the classifier model itself, respectively.

Plus the options specific to the FilteredClassifier: -F to define the filter, and -W to define the classifier.

In my subsequent post on chaining filters, I proposed making use of attribute selection to improve the representation of our learning problem. This can be done by issuing the following command:

$>java weka.classifiers.meta.FilteredClassifier -t smsspam.small.arff -c 1 -x 3 -v -o -F "weka.filters.MultiFilter -F weka.filters.unsupervised.attribute.StringToWordVector -F \"weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S \\\"weka.attributeSelection.Ranker -T 0.0\\\"\"" -W weka.classifiers.rules.PART

To get the following output:

=== Stratified cross-validation ===
Correctly Classified Instances 167 83.5 %
Incorrectly Classified Instances 33 16.5 %
Kappa statistic 0.1959
Mean absolute error 0.1967
Root mean squared error 0.38
Relative absolute error 70.53 %
Root relative squared error 102.3794 %
Total Number of Instances 200

=== Confusion Matrix ===

a b <-- classified as
6 27 | a = spam
6 161 | b = ham

This is, in turn, the same result I got in that post. If we replace PART with the SMO implementation of Support Vector Machines included in WEKA (by changing weka.classifiers.rules.PART to weka.classifiers.functions.SMO), we get an accuracy figure of 91%, as described in that post.

While most of the options are the same as in the previous command, two things deserve special attention in this one:

  • We chain the StringToWordVector and the AttributeSelection filters by using the MultiFilter described in the previous post. The order of the calls is obviously relevant, as we first need to tokenize the messages into words, and then select the most informative words. Moreover, while we apply StringToWordVector with its default options, the AttributeSelection filter makes use of the InfoGainAttributeEval function as the quality metric, and the Ranker class as the search method. The Ranker class is applied with the option -T 0.0 in order to specify that the filter has to rank the attributes (words or tokens) according to the quality metric, but keep only those whose score is over the threshold defined by -T, that is, 0.0.
  • As the order of the options is not relevant, it is required to bind each option to the appropriate class by using quotation marks ("). Unfortunately, we have three nested expressions:
    • The whole MultiFilter filter, enclosed by the isolated quotation marks (").
    • The AttributeSelection filter, enclosed by the escaped quotation mark (\").
    • The Ranker search method, enclosed by the double escaped quotation mark (\\\"). Here we escape the escape symbol itself (\) along with the quotation mark.
So many escaping symbols make the command a bit dirty, but it is still functional.

So I have shown how we can chain filters and classifiers, and apply several chained filters as well, in the command line. In the next posts I will explain how to train, store and then evaluate a classifier by using the command line, and how to make use of WEKA filters and classifiers in your own Java programs.

Thanks for reading, and please feel free to leave a comment if you think I can improve this article!

NOTE: You can find the collection I used in this post, along with other stuff related to WEKA and text mining in my Text Mining in WEKA page.