It is not strange that, with the overload of user-generated content, there is an increasing interest in processing chat/SMS-like language. Social networks, virtual worlds, MMORPGs and chat rooms are plagued with emoticons, abbreviations, typos and channel codes that make the task of processing user-generated text a nightmare. In this post I list a number of resources and approaches that may be useful for researchers and practitioners of Natural Language Processing who face this problem, which, following the course by Richard Sproat and Steven Bedrick, I call Text Normalization.
Text Normalization can be seen as translation from informal language to standard English, Spanish, or whatever language you work with. The simplest approach you can follow is a word-by-word translation using a dictionary. This approach is followed by online lingo translators like Lingo2Word and Transl8it!. In fact, you can reproduce this work using the Lingo2Word dictionary (click on the header links). I have used this approach as a baseline in several projects, such as WENDY - WEb-access coNfidence for chilDren and Young (the web page is in Spanish; the paper, "Combining Predation Heuristics and Chat-Like Features in Sexual Predator Identification", is in English).
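To make the dictionary baseline concrete, here is a minimal sketch in Python. The `LINGO` lexicon is a tiny hypothetical sample that I made up for illustration (it is not taken from Lingo2Word or Transl8it!), and the function name and tokenizer are my own assumptions; in practice you would load a much larger dictionary from a file.

```python
import re

# Hypothetical toy lexicon, for illustration only. A real system would
# load thousands of entries from an external dictionary resource.
LINGO = {
    "u": "you",
    "r": "are",
    "gr8": "great",
    "2moro": "tomorrow",
    "thx": "thanks",
}

# Split into word tokens, punctuation runs and whitespace runs, so that
# joining the tokens back together preserves the original spacing.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+|\s+")


def normalize_with_dictionary(text, lexicon=LINGO):
    """Replace each token found in the lexicon; leave other tokens as-is."""
    tokens = TOKEN_RE.findall(text)
    # Lookup is case-insensitive; the replacement is whatever the lexicon stores.
    return "".join(lexicon.get(tok.lower(), tok) for tok in tokens)


print(normalize_with_dictionary("u r gr8, thx!"))
```

Note that this lookup ignores context, so an ambiguous token (for instance "u" used as a real letter) is always translated the same way. That is one reason why it is only a baseline.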
Another knowledge-based alternative is manually coding normalization rules. An example is Deflog, a program that decodes the expressions commonly used in the picture-oriented social network Fotolog. In this network, the majority of (Spanish-language) users make use of specific language codes like repeating vowels ("I liiiiiiiiiiiiike iiiiiiit"), alternating upper and lowercase ("YoU WiLL LiKe It"), and so on. The program encodes a number of functions that "correct" word tokens, each function handling a particular code. While the functions mostly apply to Spanish and Fotolog, a linguist may derive their own rules for another domain (e.g. Twitter).
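As an illustration of the rule-based idea, the following Python sketch applies one small function per code to each token. It is not Deflog's actual code; the function names, the thresholds and the order of the rules are my own assumptions, and the two rules cover only the two examples mentioned above.

```python
import re


def collapse_repeated_letters(token):
    """Collapse any run of three or more identical letters into one."""
    return re.sub(r"([A-Za-z])\1{2,}", r"\1", token)


def fix_alternating_case(token):
    """Lowercase tokens that mix cases, e.g. 'WiLL' becomes 'will'.

    Tokens that are all lowercase, all uppercase, or capitalized only
    on the first letter are left untouched.
    """
    if token != token.lower() and token != token.upper() and not token.istitle():
        return token.lower()
    return token


# Each rule handles one particular code; they are applied in this order.
RULES = [collapse_repeated_letters, fix_alternating_case]


def normalize_with_rules(text):
    """Apply every rule to every whitespace-separated token."""
    normalized = []
    for token in text.split():
        for rule in RULES:
            token = rule(token)
        normalized.append(token)
    return " ".join(normalized)


print(normalize_with_rules("I liiiiiiiiiiiiike iiiiiiit"))
print(normalize_with_rules("YoU WiLL LiKe It"))
```

These rules are deliberately crude. Collapsing a run to a single letter breaks words whose standard spelling has a double letter (a lengthened "cool" would lose one "o"), and lowercasing mixed-case tokens also affects legitimate names like brand names. A practical rule set would check candidate corrections against a word list.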
These are obviously baselines. There are much more sophisticated approaches, mostly based on statistical methods; I provide a list here that complements the reading list in the course by Sproat and Bedrick:
- Bo Han, Paul Cook and Timothy Baldwin, Lexical Normalisation of Short Text Messages, In ACM Transactions on Intelligent Systems and Technology (TIST) 4(1), pp. 5:1-5:27, 2013.
- Tim Schlippe, Chenfei Zhu, Daniel Lemcke, and Tanja Schultz. Statistical Machine Translation based Text Normalization with Crowdsourcing. In Proceedings of The 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2013), Vancouver, Canada, 26-31 May 2013.
- Bo Han, Paul Cook and Timothy Baldwin, Automatically Constructing a Normalisation Dictionary for Microblogs, In EMNLP-CoNLL 2012, 421-432, Jeju, Republic of Korea.
- Bo Han and Timothy Baldwin, Lexical normalisation of short text messages: Makn sens a #twitter, In ACL 2011, 368-378, Portland, OR, USA.
- Tim Schlippe, Chenfei Zhu, Jan Gebhardt, Tanja Schultz. Text Normalization based on Statistical Machine Translation and Internet User Support. In Proceedings of The 11th Annual Conference of the International Speech Communication Association (Interspeech 2010), Makuhari, Japan, 26-30 September 2010.
- Carlos Henriquez, Adolfo Hernández H., A ngram-based statistical machine translation approach for text normalization on chat-speak style communications. Proceedings of the CAW2 (Content Analysis in Web 2.0) Workshop, April 2009.
You can find more papers by following the literature these papers reference, or by searching for later works that cite them.
As a final note, remember that text normalization is not always a good idea. For some problems it is better to keep the original abbreviations, emoticons and so on, as they can be representative of a style, a genre, an author or a particular age group.
I hope these works will suggest other methods for your problem at hand. As always, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or if you have questions or suggestions for further articles on these topics!