14.3.11

CFP: First International CNCCS Workshop on Security Aspects for Online Social Networks (CNCCS - SAONS)

First International CNCCS Workshop on Security Aspects for Online Social Networks (CNCCS - SAONS)
Vienna, Austria, end of August

To be held in conjunction with the Sixth International Conference on Availability, Reliability and Security (ARES 2011)

Online social networks are among the most used Internet services, and they consume most of the time users spend connected to the Internet. These sites allow users to share knowledge, help them find and join communities, and provide tools for carrying out activities together. However, they are prone to misuse such as identity theft, malware, spam and data leaking. The Spanish National Advisory Council on Cyber-Security (CNCCS) is the main sponsor of this workshop, which aims to bring together security researchers and industry with innovative and practical ideas for securing online social networks.

Suggested topics include, but are not limited to:

  • Data security in Online Social Networks
  • Data privacy in Online Social Networks
  • Fraud and Scam in Online Social Networks
  • Identity in Online Social Networks
  • Spam filtering in Online Social Networks
  • Malware in Online Social Networks
  • Harassing in Online Social Networks
  • Anonymity on Online Social Networks
  • Propagation Models for Online Social Networks

Important Dates

  • Submission Deadline: April 18th 2011
  • Author Notification: May 18th 2011
  • Camera-ready Version submission: June 1st
  • Author Registration: June 1st
  • Conference Dates: August 22nd-26th
  • Workshop Date: TBD

11.3.11

Recent community works

A short list of recent community works:

In addition, the Proceedings of the Workshop NLP in the Enterprise: Envisioning the Next 10 Years have been published at CEUR with the following reference:

José Carlos Cortizo, José María Gómez, Francisco Manuel Rangel, Victor Peinado, Hugo Zaragoza, Francisco M. Carrero (eds.): Proceedings of the Workshop NLP in the Enterprise: Envisioning the Next 10 Years, Valencia, Spain, September 7, 2010, CEUR-WS.org, ISSN 1613-0073, online urn:nbn:de:0074-697-0

Also, the Proceedings of the First Workshop on Mining Social Media (2009), from which some authors have extended their work for the Special Issue on Mining Social Media above, are available at Bubok.


9.3.11

Why (a lot of) data helps

Over my years of addressing heterogeneous text mining problems, I have often faced a lack of data. The question I ask myself is: where can I find the labelled data I need for the problem at hand? Sometimes I find an answer, but many other times I do not.

This is my fault: the data is out there.

It is just not the data I was seeking. And it can be better!

Think about the size of the Web. Think about the size of the Project Gutenberg corpus. Think about the trillions of public tweets. Let me stress that I am not saying "think about that TREC or Reuters corpus", which can now reach billions of documents and hundreds of billions of words. I am thinking about purely unstructured data, available with no purpose and no scientific goal, but full of examples of real language usage.

And here comes the paper by Halevy, Norvig, and Pereira in IEEE Intelligent Systems:

Alon Halevy, Peter Norvig, Fernando Pereira, "The Unreasonable Effectiveness of Data," IEEE Intelligent Systems, pp. 8-12, March/April, 2009.

If you face the same problem, you must read this paper.

And think about it.

Some pearls you should think about:

(...) invariably, simple models and a lot of data trump more elaborate models based on less data.

For those with experience in small-scale machine learning who are worried about the curse of dimensionality and overfitting of models to data, note that all the experimental evidence from the last decade suggests that throwing away rare events is almost always a bad idea, (...)

And do not miss the final recommendations:

  • Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data.
  • Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail.
  • For natural language applications, trust that human language has already evolved words for the important concepts.
  • See how far you can go by tying together the words that are already there, rather than by inventing new concepts with clusters of words.
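To make the "simple models and a lot of data" point concrete, here is a minimal sketch (my own illustration, not from the paper): a bigram next-word predictor built from nothing but raw, unlabeled text. The corpus, the `next_word` helper and its behavior are hypothetical toy assumptions; the idea is only that plain counting over plentiful raw text already yields a usable model, with no labels and no elaborate modeling.

```python
from collections import Counter

# Toy "unlabeled" corpus: in practice this would be web-scale raw text.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# A simple model: just count bigrams in raw text. No labels, no parameters
# to tune; more data directly means better counts.
bigrams = Counter()
for line in corpus:
    tokens = line.split()
    bigrams.update(zip(tokens, tokens[1:]))

def next_word(word):
    """Predict the most frequent follower of `word` seen in the data."""
    candidates = {b: c for (a, b), c in bigrams.items() if a == word}
    return max(candidates, key=candidates.get) if candidates else None

print(next_word("sat"))  # "on" is the only follower of "sat" in this corpus
```

With three toy sentences the predictions are trivial, but the same counting scheme, fed trillions of real sentences, is essentially what the paper argues can trump more elaborate models trained on less data.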

Now search for the title and read others' opinions as well.