MASC corpus by ANC, a resource for linguistic analysis and opinion mining

The American National Corpus (ANC) project has released the Manually Annotated Sub-Corpus (MASC), a subset of its texts that spans several genres and carries a number of annotation layers. It is free for research and commercial purposes and comprises roughly 25K words:

  • Genres: Court transcript, Debate transcript, Email, Essay, Newspaper/newswire, Technical, Twitter, Blog, Spam, etc.
  • Annotations: Token, Part of speech, Sentence boundary, Shallow parse (noun chunk, verb chunk), Named entities (person, location, organization, date), Penn Treebank syntax, and Opinion annotation. All of these annotations are distributed in GrAF format.
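
GrAF is a standoff format: the annotations live in separate XML files and point back into the primary text by character offset. The sketch below illustrates that idea only. The element and attribute names are a simplified, hypothetical stand-in loosely modelled on GrAF (real MASC files use XML namespaces and a richer schema), and the sample sentence, the `pos` feature name, and the `tokens_with_pos` helper are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Primary text, kept separate from its annotations (standoff style).
TEXT = "Alice visited Paris."

# Simplified, hypothetical standoff markup; NOT the exact GrAF schema.
# Regions give character offsets, nodes link to regions, and each
# annotation <a> attaches a part-of-speech feature to a node.
STANDOFF = """
<graph>
  <region id="r1" anchors="0 5"/>
  <region id="r2" anchors="6 13"/>
  <region id="r3" anchors="14 19"/>
  <node id="n1"><link targets="r1"/></node>
  <node id="n2"><link targets="r2"/></node>
  <node id="n3"><link targets="r3"/></node>
  <a label="tok" ref="n1"><f name="pos" value="NNP"/></a>
  <a label="tok" ref="n2"><f name="pos" value="VBD"/></a>
  <a label="tok" ref="n3"><f name="pos" value="NNP"/></a>
</graph>
"""


def tokens_with_pos(text, standoff_xml):
    """Resolve each annotation to its text span and return (token, pos) pairs."""
    root = ET.fromstring(standoff_xml)

    # Map region id -> (start, end) character offsets.
    regions = {}
    for region in root.iter("region"):
        start, end = map(int, region.get("anchors").split())
        regions[region.get("id")] = (start, end)

    # Map node id -> list of region ids it covers.
    node_regions = {
        node.get("id"): node.find("link").get("targets").split()
        for node in root.iter("node")
    }

    pairs = []
    for ann in root.iter("a"):
        pos = ann.find("f[@name='pos']").get("value")
        for region_id in node_regions[ann.get("ref")]:
            start, end = regions[region_id]
            pairs.append((text[start:end], pos))
    return pairs


if __name__ == "__main__":
    for token, pos in tokens_with_pos(TEXT, STANDOFF):
        print(token, pos)
```

Because the text itself is never modified, several annotation layers (tokens, named entities, opinion annotation) can refer to the same offsets independently.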

