URL Text Classification with WEKA, Part 1: Data Analysis

I recently came across a website named SquidBlackList.org, which features a number of URL lists for safe web browsing with the open source proxy Squid. In particular, it features a quite big list of porn domains, so I wondered: is it possible to build a text classification system with WEKA that detects porn domains using only the text of the URLs?

Just to note that the SquidBlackList porn list (and most of the other lists they provide) is licensed under the Creative Commons Attribution 3.0 Unported License: Blacklists (Squidblacklist.org) / CC BY 3.0.

The Filtering Problem

Most web filtering systems rely on a list of URLs manually classified into categories, which are then used to define filtering profiles (e.g. block porn but allow press). The URL database must be manually maintained, and it has to be quite comprehensive with respect to user browsing behaviour. As (aggregated) web browsing follows a Zipfian distribution (that is, relatively few URLs accumulate most of the traffic), you can provide a rather effective service by ensuring that your URL database covers the most popular URLs. URL-based filtering is rather efficient (if your database is well implemented), and it can easily cover around 95% of web traffic (in terms of #requests, not in terms of #URLs).

However, covering the remaining 5% requires some kind of analysis. My target here is to dynamically classify that 5% of web requests (which may account for millions of URLs, or even just domains) into two classes: notporn and porn. This way we can cover 100% of the traffic, and we are likely to concentrate our classification mistakes (which may occur in the URL database as well) in that small 5% - so our filter can be 98% effective or more.

Why analyze the URL text? For efficiency: you do not have to go out to the Internet and fetch the actual Web content in order to analyze it, so all the processing is local to the proxy and you avoid performing unnecessary Web requests from the proxy itself.

Collecting the Dataset

So we start with an 880k porn domains list. Although it is possible to learn from positive examples only, we may expect better effectiveness if we also collect negative examples (not-porn domains). A handy resource is the Top 1M Sites list by Alexa, a Web research company that publishes this ranked list on a daily basis. Having 1M negative examples and 880k positive examples makes for a good class balance and a well-populated dataset -- nice for learning, especially when its instances are relatively short text sequences (e.g. google.com vs. porn.com).

First we have to make both lists comparable. The format of the Alexa list is <rank>,<domain>, while the format of the Squid blacklist is <dot><domain> (in order to match the Squid URL list format). A couple of cut and sed commands will do the trick.
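The trick could look like this sketch (the file names are assumptions, and the printf lines just create toy stand-ins for the real downloads):

```shell
# Toy stand-ins for the real files (Alexa: "<rank>,<domain>", Squid: ".<domain>")
printf '1,google.com\n2,youtube.com\n' > top-1m.csv
printf '.porn.com\n.xxx.com\n' > porn.dat

# Drop the rank column from the Alexa list
cut -d ',' -f 2 top-1m.csv > alexa.csv

# Strip the leading dot from the Squid blacklist
sed 's/^\.//' porn.dat > porn.csv

head -1 alexa.csv   # google.com
head -1 porn.csv    # porn.com
```

After this step, both files contain one bare domain per line.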

Then we can just add the class and mix the lists.

Cleaning the Dataset, first step

But... hey, the Internet is for porn! -- we should expect that some of the URLs in the Alexa ranking are pornographic. In fact, a simple search demonstrates it:

$ grep porn alexa.csv | more
$ grep porn alexa.csv | wc -l

We can just subtract the porn list from the Alexa list with a handy grep:

$ grep -v -f porn.csv alexa.csv > alexaclean.csv

But it takes a loooooong time, so I prefer to sort the Alexa list, convert it to Linux format (the original has DOS line endings), and use comm (which requires both input files to be sorted):

$ sort alexa.csv > alexasorted.csv
$ fromdos alexasorted.csv
$ comm -23 alexasorted.csv porn.csv > alexaclean.csv
$ wc -l alexaclean.csv
975088 alexaclean.csv

Good, only 25k of the domains were pornographic... Well, let's check:

$ grep porn alexaclean.csv | head

So we still have some porn in there.

Cleaning the Dataset, second step

Cleaning the porn out of the Alexa list is a bit more complex. How do we find those popular porn sites, if they are not even in a list as comprehensive as the Squidblacklist one? Another resource comes to the rescue: the sex-related search engine PornMD, which has recently published a list of popular porn searches in the form of a dynamic infographic named Global Internet Porn Habits:

So, if you collect a list of the top searches in five of the biggest English-speaking countries, you get:

After removing duplicated words, adding "porn", "sex" and "xxx" (rule of thumb), and computing the number of domains each word occurs in within the Alexa (cleaned) and Squidblacklist lists, we get:

Looking at the list, a relatively safe ratio between the number of occurrences in Squid's list and in Alexa's (clean) list is 9 -- this way, we keep the most obvious words and remove the most ambiguous ones (although there are some borderline examples, such as "asian"). We can see the effects:

$ grep "amateur\|anal\|asian\|creampie\|hentai\|lesbian\|mature\|milf\|squirt\|teen\|porn\|sex\|xxx" alexaclean.csv | wc -l
$ grep "porn\|sex\|xxx" alexaclean.csv | wc -l
$ grep -v "amateur\|anal\|asian\|creampie\|hentai\|lesbian\|mature\|milf\|squirt\|teen\|porn\|sex\|xxx" alexaclean.csv > alexacleanfinal.csv
$ wc -l alexacleanfinal.csv
964735 alexacleanfinal.csv
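The per-keyword counts behind that ratio can be reproduced with a small loop. This is a sketch: keywords.txt is a hypothetical file with one candidate word per line, and the printf lines create toy stand-ins for the two cleaned lists:

```shell
# Toy stand-ins for the two cleaned lists and the candidate keywords
printf 'teensupport.org\nsexandthecity.fan\nnews.com\n' > alexaclean.csv
printf 'teenxxx.com\nteenporn.net\nfreesex.org\n' > porn.csv
printf 'teen\nsex\n' > keywords.txt

# For each candidate word, count the domains it occurs in on each list
while read -r word; do
  squid=$(grep -c "$word" porn.csv)
  alexa=$(grep -c "$word" alexaclean.csv)
  echo "$word $squid $alexa"
done < keywords.txt
```

A word is then kept as a filtering keyword when the Squid count is at least 9 times the Alexa count.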

You can see that just "porn", "sex" and "xxx" account for 70.97% of the matched domains, so there is some domain knowledge in the process. I must note that I could have used another, much more extensive list of porn-related searches, like the one featured on the PornMD Most Popular page.

Additional Analysis

To get a feeling for how the previous porn-related keywords are distributed across the original Alexa ranking, I have computed the number of lines (domains) they occur in per 100k interval, producing the following chart:

Here #query1 represents the number of occurrences of "porn\|sex\|xxx" and #query2 represents the full list of keywords. The growth is nearly linear, with an average of 1234.2 URLs per interval for #query1 and 1738.9 URLs per interval for #query2. The curves are smooth, and there are more matching domains in the first intervals (e.g. 1482 hits in the first 100k Alexa URLs for #query1) than in the last ones (e.g. 1077 hits in the last 100k Alexa URLs for #query1).
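The per-interval counts were computed along these lines. A sketch with a toy four-line list and two-line chunks stands in for the real run, which slices the 1M-line Alexa file into 100000-line chunks:

```shell
# Toy ranked list; the real computation uses the full Alexa file
printf 'porn.com\nnews.com\nsexsite.net\nblog.org\n' > ranked.csv

# Slice the ranked list into fixed-size chunks (100000 lines in the real run)
split -l 2 -d ranked.csv chunk_

# Count keyword hits per chunk
for f in chunk_*; do
  printf '%s %s\n' "$f" "$(grep -c 'porn\|sex\|xxx' "$f")"
done
```

Plotting one count per chunk gives the interval chart above.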

There are other dataset statistics that may provide better insight into the classification problem, or in other words, that may be more informative or predictive in terms of classification accuracy. For instance:

  • What is the length of the average domain name in each category?
  • How many dots and/or dashes do domains have on average per category?
  • What is the distribution of the different TLDs (Top Level Domains) across the two categories?

Can you imagine any other interesting statistics?
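The first two statistics in the list above are awk one-liners; here they are run against a toy two-line file standing in for one of the class lists (run them on each list and compare):

```shell
# Toy list standing in for one of the class files
printf 'ab.com\nabcd.net\n' > sample.csv

# Average domain-name length
awk '{ total += length($0) } END { print total / NR }' sample.csv   # 7

# Average number of dots and dashes per domain
awk '{ n = gsub(/[.-]/, ""); total += n } END { print total / NR }' sample.csv   # 1
```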

The Dataset

Once we have the original Squidblacklist and the cleaned Alexa list (after the subtraction and the removal of the keyword-hitting lines), we add some format to get a WEKA ARFF file. For instance, 0000free.com must be transformed into '0000free.com',safe. A bit of sed trickery does the job, and then we mix the lists with the following command:

$ paste -d '\n' alexacleanfinal.csv porn.csv > urllist.csv
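The sed trickery mentioned above might look like this sketch (the file names and the toy printf data are illustrative, not the real lists):

```shell
# Toy stand-ins for the two cleaned domain lists
printf '0000free.com\nnews.com\n' > safe-domains.csv
printf 'badsite.com\n' > porn-domains.csv

# Quote each domain and append its class label ('&' is the matched line)
sed "s/.*/'&',safe/" safe-domains.csv     # 'news.com',safe ...
sed "s/.*/'&',porn/" porn-domains.csv     # 'badsite.com',porn
```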

The rationale behind mixing the lists is that some learning algorithms depend on the order of the examples, and for those algorithms it is wise not to expose all the examples of one class first and then all of the other's. As the paste command adds empty lines when one of the lists runs out, we have to remove the resulting blank lines (\n\n) with another sed call, and we finally add the ARFF header, so the file starts the following way:

@relation URLs

@attribute urltext string
@attribute class {safe,porn}

@data

I have left that file, named urllist.arff, in my GitHub folder for your convenience, so you can start playing with it. Beware, it is over 40 MB.

So that is all for the moment. Stay tuned for my next steps if you liked this post.

Thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on this topic!

3 comments:

Aho said...

The command "grep -f porn.csv" is slow because grep does not use the "fancy" algorithms that Aho included in fgrep in the 70s.

The command fgrep -f porn.csv will complete the job 10 to 100 times faster.

Maria said...


I want to perform text mining using the Weka Explorer, and I'm new to this area. I do not know how to perform text mining, i.e., what are the topics of text mining that would allow me to write papers, and what is the ideal way to do so?
What are the recommended settings for the StringToWordVector filter?

Any assistance would be greatly appreciated.


Jose Maria Gomez Hidalgo said...

Dear Maria

Text mining involves extracting valuable, previously unknown knowledge from large amounts of text. It includes many kinds of tasks in terms of the kind of Machine Learning involved (classification, clustering), the granularity of the text items (term classification such as POS-Tagging or Word Sense Disambiguation, versus document classification such as Text Categorization or Text Retrieval), and the applications (search engines, sentiment analysis, topic labelling, spam filtering, etc.). It is impossible to summarize all open research questions in a single comment - or blog post!

Moreover, I find it very disappointing to think about research as "doing papers". My suggestion is that you first get some background by reading a book and some papers.

For the book, this can help:

Foundations of Statistical Natural Language Processing, Manning & Schutze, 1999. (http://nlp.stanford.edu/fsnlp/)

Regarding papers, never miss this one:

Machine learning in automated text categorization, F. Sebastiani, 2002. (http://dl.acm.org/citation.cfm?id=505283)

Alternatively, you can join a MOOC, although I cannot recommend any particular one. There are two on WEKA, one introductory and one advanced.

Regarding the parameters of the STWV filter, they greatly depend on the task. For instance, in spam filtering, using a stoplist often hurts performance; in topic labelling, however, it is a must. In any case, I recommend always making a quick test with the default parameters and checking the results; this will probably guide you in the selection of tokenization, stemming, stoplisting, weighting, etc.

Hope this helps. Best regards and good luck!