In a recent thread on the SentimentAI group (list), contributors shared a number of links to datasets for training and testing opinion mining and sentiment classifiers on Twitter data. I list them here in case anyone finds them useful:
- Three datasets provided by Hassan Saif, including an annotated subset of the Stanford Twitter Sentiment Corpus, and two for the specific topics of the Health Care Reform and the Obama-McCain Debate.
- The Stanford Twitter Corpus itself, provided by Alec Go and others at Sentiment140. You can download the ST Corpus directly (70 MB).
- The Sanders Analytics Twitter Sentiment Corpus, provided by Niek Sanders.
- The mejaj datasets, provided by Nibir Bora and others.
- The SemEval-2013: Sentiment Analysis in Twitter evaluation campaign (or competition) dataset. Note that the competition is still active, so you can join it! Check the dates at the SemEval-2013 website.
- The RepLab 2012 Profiling task dataset. The profiling task is a bit different from the standard sentiment classification task. For instance, factual tweets can imply a bad reputation ("Lehman Brothers goes bankrupt") and negative-sentiment tweets can imply a good reputation ("R.I.P. Michael Jackson. We'll miss you").
- UPDATE (8/10/2013): Contributed by Eugenio Martínez Cámara (thanks!), the Spanish-language dataset used in the TASS workshop organized at the annual meeting of the SEPLN.
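
As a starting point for working with the Stanford Twitter Corpus listed above, here is a minimal Python sketch for reading labeled tweets. It assumes the file is a header-less CSV with six columns (polarity, tweet id, date, query, user, text) and polarity coded as 0 = negative, 2 = neutral, 4 = positive; check this against the documentation that ships with the download. The function name `read_labeled_tweets` and the sample rows are made up for illustration, and the other datasets in the list use their own formats.

```python
import csv
import io
from collections import Counter

# Assumed label coding for a Sentiment140-style file (verify against the download).
POLARITY_NAMES = {"0": "negative", "2": "neutral", "4": "positive"}


def read_labeled_tweets(handle):
    """Yield (label, text) pairs from a six-column, header-less CSV stream."""
    for row in csv.reader(handle):
        if len(row) != 6:
            continue  # skip rows that do not match the assumed layout
        polarity, _tweet_id, _date, _query, _user, text = row
        label = POLARITY_NAMES.get(polarity)
        if label is not None:
            yield label, text


# Made-up rows in the assumed layout, so the sketch runs without the real file.
SAMPLE = (
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","my phone died again"\n'
    '"4","2","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","user_b","loving the new album, really"\n'
    '"2","3","Mon Apr 06 22:20:01 PDT 2009","NO_QUERY","user_c","train leaves at 9"\n'
)

tweets = list(read_labeled_tweets(io.StringIO(SAMPLE)))
label_counts = Counter(label for label, _ in tweets)
print(label_counts)
```

To read the real corpus, pass an open file handle instead of the `io.StringIO` sample; you may need to set the file encoding explicitly when opening it.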