In my years of working on heterogeneous text mining problems, I have often faced a lack of data. The question I keep asking myself is: where can I find the labelled data I need for the problem at hand? Sometimes I find the answer, but many other times I do not.
This is my fault: the data is out there.
It was just not the data I was looking for. And it can be even better!
Think about the size of the Web. Think about the size of the Project Gutenberg corpus. Think about the trillions of public tweets. Let me stress that I am not saying "think about that TREC or Reuters corpus", which can now run to billions of documents and hundreds of billions of words. I am thinking about purely unstructured data, available with no purpose and no scientific goal, but full of examples of real language usage.
And here comes the paper by Halevy, Norvig, and Pereira in IEEE Intelligent Systems:
Alon Halevy, Peter Norvig, Fernando Pereira, "The Unreasonable Effectiveness of Data," IEEE Intelligent Systems, pp. 8-12, March/April, 2009.
If you face the same problem, you must read this paper.
And think about it.
Some pearls you should think about:
(...) invariably, simple models and a lot of data trump more elaborate models based on less data.
For those with experience in small-scale machine learning who are worried about the curse of dimensionality and overfitting of models to data, note that all the experimental evidence from the last decade suggests that throwing away rare events is almost always a bad idea, (...)
And do not miss the final recommendations:
- Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data.
- Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail.
- For natural language applications, trust that human language has already evolved words for the important concepts.
- See how far you can go by tying together the words that are already there, rather than by inventing new concepts with clusters of words.
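To make the first two recommendations a bit more concrete, here is a minimal sketch that is not taken from the paper. It assumes a toy corpus of raw, unlabeled sentences and builds a plain table of bigram counts: no labels, no parametric summary, and no pruning of rare events. The function names, the tokenizer, and the example sentences are all hypothetical, chosen only for illustration.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train_bigram_counts(sentences):
    """Count every observed (previous word, next word) pair.

    Nothing is thrown away: pairs seen only once stay in the table.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = tokenize(sentence)
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probabilities(counts, word):
    """Relative frequency of each word observed right after `word`."""
    followers = counts.get(word, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}
    return {w: c / total for w, c in followers.items()}


# Hypothetical unlabeled corpus; in practice this would be far larger.
corpus = [
    "simple models and a lot of data",
    "a lot of data beats a clever model",
    "a lot of unlabeled text is available",
]

model = train_bigram_counts(corpus)
probs = next_word_probabilities(model, "of")
print(probs)
```

The "model" here is just the data itself, stored as counts, so it gets more detailed as the corpus grows. At real scale one would likely need smoothing and a more compact storage scheme, which this sketch leaves out.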
Now search for the title and read others' opinions as well.