Chat or What: Approaching Text Normalization in Chats and Social Networks

It is not strange that, with the overload of user-generated content, there is an increasing interest in processing chat/SMS-like language. Social Networks, virtual worlds, MMORPGs and chat rooms are plagued with emoticons, abbreviations, typos and channel codes that make the task of processing user-generated text a nightmare. In this post I list a number of resources and approaches that may be useful for researchers and practitioners of Natural Language Processing regarding this problem, which, following the course by Richard Sproat and Steven Bedrick, I call Text Normalization.

Text Normalization can be seen as translation from informal language to standard English-Spanish-whatever. The simplest approach you can follow is a word-by-word translation using a dictionary. This approach is followed by online lingo translators like Lingo2Word and Transl8it!. In fact, you can reproduce this work using the Lingo2Word dictionary (click on the header links). I have followed this approach as a baseline in several projects and works, like WENDY - WEb-access coNfidence for chilDren and Young (web page in Spanish; the paper, "Combining Predation Heuristics and Chat-Like Features in Sexual Predator Identification", is in English).
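As a rough illustration of the word-by-word dictionary approach, here is a minimal Python sketch. The `LINGO` lexicon is a tiny hypothetical example I made up for this snippet (it is not the Lingo2Word dictionary), and the function and variable names are my own; a real baseline would load a much larger dictionary from a file.

```python
import re

# Hypothetical toy lexicon, for illustration only. A real baseline would
# load thousands of entries from an external lingo dictionary.
LINGO = {
    "u": "you",
    "r": "are",
    "gr8": "great",
    "2moro": "tomorrow",
    "thx": "thanks",
}

# Split text into word tokens, punctuation runs and whitespace runs,
# so the original spacing and punctuation can be restored afterwards.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+|\s+")


def normalize(text, lexicon=LINGO):
    """Replace every token found in the lexicon; leave the rest untouched."""
    out = []
    for tok in TOKEN_RE.findall(text):
        # Lookup is case-insensitive; unknown tokens are copied as they are.
        out.append(lexicon.get(tok.lower(), tok))
    return "".join(out)


print(normalize("thx, u r gr8!"))
print(normalize("See u 2moro"))
```

Note that a lookup like this is context-blind: it will happily rewrite a legitimate single-letter "u" or "r", which is one of the reasons why it should only be considered a baseline.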

Another knowledge-based alternative is manually coding normalization rules. An example is the tool Deflog, a program that decodes the usual expressions used in the picture-oriented social network Fotolog. In this network, the majority of (Spanish-language) users make use of specific language codes like repeating vowels ("I liiiiiiiiiiiiike iiiiiiit"), alternating uppercase and lowercase ("YoU WiLL LiKe It"), and so on. The program encodes a number of functions that "correct" word tokens, each function handling a particular code. While the functions mostly apply to Spanish and Fotolog, a linguist may derive their own rules for another domain (e.g. Twitter).
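To give an idea of what such rule functions may look like, here is a small Python sketch with one function per code. This is not Deflog's actual code, which I am not reproducing here; the function names, the "three or more repetitions" threshold and the casing heuristic are all assumptions of mine, and they only cover unaccented vowels.

```python
import re

# A run of three or more identical vowels (assumed threshold), any case.
REPEATED_VOWEL_RE = re.compile(r"([aeiou])\1{2,}", re.IGNORECASE)


def collapse_repeated_vowels(token):
    """Reduce a run of 3+ identical vowels to a single vowel."""
    return REPEATED_VOWEL_RE.sub(r"\1", token)


def fix_alternating_case(token):
    """Lowercase tokens with irregular casing such as 'WiLL'.

    Tokens that are fully lowercase, fully uppercase (possible acronyms)
    or capitalized (possible proper nouns) are left untouched.
    """
    if token.islower() or token.isupper() or token.istitle():
        return token
    return token.lower()


# Each rule handles one particular code; rules are applied in order.
RULES = [collapse_repeated_vowels, fix_alternating_case]


def normalize(text):
    """Apply every rule to every whitespace-separated token."""
    tokens = []
    for tok in text.split():
        for rule in RULES:
            tok = rule(tok)
        tokens.append(tok)
    return " ".join(tokens)


print(normalize("I liiiiiiiiiiiiike iiiiiiit"))  # I like it
print(normalize("YoU WiLL LiKe It"))             # you will like It
```

These heuristics are crude on purpose: collapsing vowels turns "cooool" into "col" instead of "cool", and the casing rule would lowercase a name like "iPhone". In practice you would probably check each candidate against a dictionary before accepting it.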

These are obviously baselines. There are much more sophisticated methods, mostly statistical; I provide a list here that complements the reading list in the course by Sproat and Bedrick:

You can get some more papers by tracking the referenced literature or by searching these papers for citations.

As a final note, remember that text normalization is not always a good idea. For some problems it is better to keep the original abbreviations, emoticons and so on, as they can be representative of a style, a genre, an author or a particular age group.

I hope these works will suggest other methods for your problem at hand. As always, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or if you have questions or suggestions for further articles on these topics!

3 comments:

Mariana Soffer said...

Great post thanks :)

Real Vision said...

Sharing this on Facebook groups on AI

Jose Maria Gomez Hidalgo said...

Thank you. Which groups do you mean?