A beautiful piece of Data Mining criticism

As Data Mining enters the mainstream, fills the market with more and more applications, and reaches the average user through a range of online services, people become more and more aware of what can be done and, more importantly, of how it is being done. And the lack of information about the data, the methods, and so on lets applications show things you did not expect (and that are sometimes far from reality, too).

"You have to feel it yourself," Aaron Zinman, a researcher at the MIT Sociable Media Group, may have thought. "And you will, with Personas." So he has prepared an online piece, Personas, as part of the Sociable Media Group installation at the MIT Museum. You can get a feeling for the installation by watching the MIT TV video:

The philosophy of the installation is (quoting Aaron):

In a world where fortunes are sought through data-mining vast information repositories, the computer is our indispensable but far from infallible assistant. Personas demonstrates the computer's uncanny insights and its inadvertent errors, such as the mischaracterizations caused by the inability to separate data from multiple owners of the same name. It is meant for the viewer to reflect on our current and future world, where digital histories are as important if not more important than oral histories, and computational methods of condensing our digital traces are opaque and socially ignorant.

In Personas, the user enters his/her name and gets a set of categories that are meant to describe what the Web says about him/her. The application runs in two steps:

  1. Collect information about the name by querying Yahoo! with specially crafted queries, and post-process the hits to filter out hate speech and other irrelevant material.
  2. Apply an unsupervised categorization process named Latent Dirichlet Allocation, which assigns a number of keywords and weights (shown as the size of the final bars) to the name. The basic data for this categorization has been collected from 2 million queries.
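
To make the two steps more concrete, here is a minimal sketch in Python. It is not the code behind Personas: the snippets are invented toy data standing in for search hits, the blocklist is a hypothetical stand-in for the post-processing filter, and the topic model is a small collapsed Gibbs sampler for Latent Dirichlet Allocation fitted directly on the snippets (the real system relies on a model built beforehand from a much larger collection). The names `lda_gibbs` and `persona_bars` are my own.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]             # tokens of doc d in topic k
    topic_word = [[0] * n_words for _ in range(n_topics)]  # count of word w in topic k
    topic_total = [0] * n_topics                           # tokens assigned to topic k
    assignments = []
    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)
    # Resample each token's topic given all the other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, k = word_id[w], assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1
    # Smoothed topic proportions per document; each row sums to 1.
    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + n_topics * alpha)
         for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    # The three most frequent words of each topic act as its "keywords".
    top_words = [
        [vocab[i] for i in sorted(range(n_words), key=lambda i: -topic_word[t][i])[:3]]
        for t in range(n_topics)
    ]
    return theta, top_words

def persona_bars(snippets, n_topics=2, blocklist=frozenset()):
    """Step 1: filter the hits. Step 2: one (keywords, weight) bar per topic."""
    docs = [s.lower().split() for s in snippets]
    docs = [d for d in docs if d and not blocklist.intersection(d)]
    if not docs:
        raise ValueError("no usable snippets left after filtering")
    theta, top_words = lda_gibbs(docs, n_topics)
    # Bar size (an assumption of this sketch): mean topic proportion over the hits.
    sizes = [sum(row[t] for row in theta) / len(theta) for t in range(n_topics)]
    return list(zip(top_words, sizes))

# Invented snippets for one name that, on the Web, belongs to two different people.
snippets = [
    "football match goal team football league",
    "team wins league match after late goal",
    "paper on text mining and topic models",
    "research paper topic models text data",
    "spamword buy now",
]
bars = persona_bars(snippets, n_topics=2, blocklist=frozenset({"spamword"}))
for words, size in bars:
    print(f"{size:.2f}  {' '.join(words)}")
```

Nothing in the sketch knows that the snippets come from two different people: everything found under the name is pooled into a single profile, which is exactly the kind of mischaracterization the installation wants you to notice.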

I strongly recommend reading the explanation behind the "read more" link inside Personas.

For instance, Personas starts with the query field:

The process is fully "visual":

And you get your Personas characterization.

To what extent does this information show a real picture of me? Well, at least almost all the hits found by the system are mine, but the correlation is, eh... let's say a bit strange. Sports? Genealogy?... FAME?

OK, for me the goal is achieved. What is wrong with the process (if something is wrong at all, and I am not just unsettled)? Try to guess without precise information about the process. That is the goal. And it is achieved.