The focus of the ECML/PKDD 2010 Discovery Challenge is Web Content Quality.
High quality is not simply the opposite of Web spam. The recent Web Spam Challenges explored filtering as a binary decision. This year's Discovery Challenge targets more and different aspects: we want to develop site-level classification of the genre of Web sites (editorial, news, commercial, educational, "deep Web", Web spam, and more) as well as of their readability, authoritativeness, trustworthiness and neutrality.
The data set will consist of sample Web hosts from Europe in three languages (English, French and German). The training and testing samples will be biased towards the interesting aspects and manually cleansed of mixed sites, Web hosting, and adult content. Host-level features based on content and linkage information, similar to those used to filter Web spam, will be provided, along with natural language processing annotations of a large set of sample pages.
Preliminary description of tasks.
1. Classification task
For the English documents, a ranked list is required for each category (news, educational, spam; readability, authority, neutrality). Evaluation is in terms of average NDCG.
2. Quality task
Quality is measured as an aggregate function of all content types and quality properties. A single ranked list is required and is evaluated in terms of NDCG.
3. Multilingual quality task
Quality predictions for the non-English language sites are required as in Task 2. Evaluation is in terms of average NDCG.
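All three tasks are evaluated with NDCG over ranked lists. The following is a minimal sketch of how such a score could be computed, not the official evaluation script. It assumes the common formulation with gain equal to the label value and a log2 rank discount; the exact gain and discount used by the challenge are not specified here, and the function names and sample host names are hypothetical.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(ranked_hosts, relevance):
    """NDCG of one ranked list of hosts.

    ranked_hosts: host names, best first (the submitted ranking).
    relevance: dict mapping host name to a non-negative label
               (assumed format; unlabeled hosts count as 0).
    """
    gains = [relevance.get(host, 0.0) for host in ranked_hosts]
    ideal_dcg = dcg(sorted(relevance.values(), reverse=True))
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def average_ndcg(rankings, labels):
    """Mean NDCG over categories (e.g. news, educational, spam)."""
    scores = [ndcg(rankings[category], labels[category]) for category in labels]
    return sum(scores) / len(scores)


# Hypothetical toy data: two categories, three hosts.
labels = {
    "news": {"a.example": 1, "b.example": 0, "c.example": 1},
    "spam": {"a.example": 0, "b.example": 1, "c.example": 0},
}
rankings = {
    "news": ["a.example", "c.example", "b.example"],  # ideal ordering
    "spam": ["a.example", "b.example", "c.example"],  # spam host ranked second
}
print(average_ndcg(rankings, labels))
```

Under these assumptions, a ranking that places all positively labeled hosts first scores 1.0 for that category, and the average over categories corresponds to the "average NDCG" named in Tasks 1 and 3.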
Important dates (tentative).
- Mid-February : Description of the content types (news, educational, spam, etc.) and properties (readability, authoritativeness, neutrality), along with the assessor guidelines, is out.
- Mid-March : Data set and training labels are out.
- Early June : Result submission deadline
- Early June : Results and testing labels are out.
- End of June : Paper submission deadline.
- Early July : Notification of acceptance.
- End of July : Workshop proceedings (camera-ready) deadline
- September 20 : ECML PKDD Discovery Challenge Workshop on Web Content Quality.
Cash prizes and travel grants for the best submissions will be provided by major Internet companies. Details will be available soon.
Publications describing the design of the best systems will be peer reviewed and published.