Some datasets and resources I have recently found (although they may be old):
- MSH WSD: a Word Sense Disambiguation (WSD) test collection built with a method for automatically deriving such collections from the Unified Medical Language System (UMLS) Metathesaurus and the manual MeSH indexing of MEDLINE.
- BioNOT: a searchable database of negated biomedical sentences, containing more than 32 million negated sentences drawn from PubMed.
- Gazetiki: a geographical database of 8,323,702 place names drawn from Geonames and from other Web sources, with the latter contributing over 1 million entries. Each name carries a popularity score computed from how often it appears in a geotagged dataset.
- DBpedia Spotlight: a tool for automatically annotating mentions of DBpedia resources in text, linking unstructured information sources to the Linked Open Data cloud through DBpedia. Spotlight performs named entity extraction, including entity detection and name resolution.
- Google BigQuery Service: a web service for analyzing massive datasets with SQL-like queries, enabling interactive analysis of datasets up to billions of rows.
- Common Crawl: a freely accessible index of 5 billion web pages, with their page rank, link graphs, and other metadata, hosted on Amazon EC2 and announced by the Common Crawl Foundation.
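To make the DBpedia Spotlight entry above a bit more concrete, here is a minimal sketch of consuming the JSON its annotation endpoint returns. The sample response is hand-written for illustration; the field names follow the format of Spotlight's `/annotate` endpoint when called with `Accept: application/json`, but may differ between deployments and versions.

```python
import json

# Hand-written sample in the shape of a DBpedia Spotlight /annotate
# JSON response (field names follow the public API; values invented).
sample_response = json.dumps({
    "@text": "Berlin is the capital of Germany.",
    "Resources": [
        {"@URI": "http://dbpedia.org/resource/Berlin",
         "@surfaceForm": "Berlin", "@offset": "0"},
        {"@URI": "http://dbpedia.org/resource/Germany",
         "@surfaceForm": "Germany", "@offset": "25"},
    ],
})

def extract_annotations(response_text):
    """Map each annotated surface form to its DBpedia resource URI."""
    data = json.loads(response_text)
    return {r["@surfaceForm"]: r["@URI"] for r in data.get("Resources", [])}

annotations = extract_annotations(sample_response)
print(annotations["Berlin"])   # http://dbpedia.org/resource/Berlin
```

The offsets let you map each detected entity back into the original text span, which is what makes the output usable for linking rather than just tagging.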
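And for a flavor of the interactive, SQL-style aggregation BigQuery is built for, here is a local sketch using SQLite as a stand-in. The `pageviews` table and its columns are invented for illustration, and BigQuery's SQL dialect differs in details; only the query style carries over.

```python
import sqlite3

# Tiny in-memory stand-in: BigQuery runs this style of aggregate query
# interactively over billions of rows. Table and data are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pageviews (country TEXT, views INTEGER)")
conn.executemany("INSERT INTO pageviews VALUES (?, ?)",
                 [("DE", 120), ("US", 300), ("DE", 80), ("FR", 50)])

query = """
    SELECT country, SUM(views) AS total
    FROM pageviews
    GROUP BY country
    ORDER BY total DESC
"""
for country, total in conn.execute(query):
    print(country, total)   # US 300, then DE 200, then FR 50
```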