Comments on Nihil Obstat: URL Text Classification with WEKA, Part 1: Data Analysis
Blog author: Jose Maria Gomez Hidalgo (http://www.blogger.com/profile/17053588779560658723)
Feed updated: 2024-01-22T09:48:10.802+01:00

Jose Maria Gomez Hidalgo, 2015-01-28T13:48:49.443+01:00:

Dear Maria

Text mining involves extracting valuable, previously unknown knowledge from large amounts of text. It covers many kinds of tasks, depending on the kind of Machine Learning involved (classification, clustering), the granularity of the text items (term classification like POS-Tagging or Word Sense Disambiguation, or document classification like Text Categorization, Text Retrieval), and the application (search engines, sentiment analysis, topic labelling, spam filtering, etc.). It is impossible to summarize all open research questions in a single comment, or blog post!

Moreover, I find it very disappointing to think of research as "doing papers". My suggestion is that you first get some background by reading a book and some papers.

For the book, this can help:

Foundations of Statistical Natural Language Processing, Manning & Schütze, 1999. (http://nlp.stanford.edu/fsnlp/)

Regarding papers, never miss this one:

Machine learning in automated text categorization, F. Sebastiani, 2002. (http://dl.acm.org/citation.cfm?id=505283)

Alternatively, you can join a MOOC, although I cannot recommend any in particular. There are two on WEKA, one introductory and one advanced.

Regarding the parameters for the StringToWordVector (STWV) filter, they depend greatly on the task. For instance, in spam filtering, using a stoplist often hurts performance; however, in topic labelling it is a must. At the very least, I recommend always running a quick test with the default parameters and checking the results; this will probably guide you in the selection of tokenization, stemming, stoplisting, weighting, etc.

Hope this helps. Best regards and good luck!

JM
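
To make the stoplist point concrete, here is a minimal sketch in plain Python of what a word-vector conversion with and without a stoplist looks like. It is not WEKA code and does not reproduce the StringToWordVector filter; the names `STOPLIST`, `tokenize` and `to_word_vector`, the toy stoplist and the two example documents are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical toy stoplist; real stoplists are much longer.
STOPLIST = {"the", "a", "of", "to", "and", "is", "in", "you", "are", "have"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def to_word_vector(text, use_stoplist=False):
    """Return a term-frequency vector (a Counter) for one document."""
    tokens = tokenize(text)
    if use_stoplist:
        # Drop very frequent function words before counting.
        tokens = [t for t in tokens if t not in STOPLIST]
    return Counter(tokens)


if __name__ == "__main__":
    # Made-up example documents, not taken from any dataset.
    docs = [
        "You have won a prize, claim the prize now",
        "The minutes of the meeting are attached",
    ]
    for doc in docs:
        print("default :", sorted(to_word_vector(doc).items()))
        print("stoplist:", sorted(to_word_vector(doc, use_stoplist=True).items()))
```

Comparing the two printed vectors for each document shows which attributes the stoplist removes. Whether losing them helps or hurts is the part that depends on the task, so it is worth checking against classifier results rather than assuming.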
Maria (https://www.blogger.com/profile/12771006811809081071), 2015-01-27T10:17:53.610+01:00:

Hi,

I want to perform text mining using Weka Explorer, and I'm new to this area. I do not know how to perform text mining, i.e., which text mining topics would allow me to write papers, and what the ideal way to do so is.
What are the recommended settings for the StringToWordVector filter?

Any assistance would be greatly appreciated.

Thanks.
Maria

Aho, 2014-02-11T20:28:31.366+01:00:

The command "grep -f porn.csv"
is slow because grep doesn't use the "fancy" algorithms that Aho included in "fgrep" in the 70's.

The command fgrep -f porn.csv will complete the job 10 to 100 times faster.

Aho (http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm)
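
For readers curious about the algorithm behind the linked Wikipedia page, the following is a minimal Python sketch of Aho-Corasick multi-pattern matching. It only illustrates the idea of searching for many fixed strings in one pass over the text; it makes no claim about what any particular grep or fgrep version does internally. The function names `build_automaton`, `find_all` and `matching_lines` and the example URLs and patterns are hypothetical. On current systems, `grep -F -f porn.csv` is the usual spelling of the fixed-string search.

```python
from collections import deque


def build_automaton(patterns):
    """Build the trie, failure links and output sets for fixed patterns."""
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # fail[state] -> state to fall back to on a mismatch
    out = [set()]  # out[state] -> patterns that end at this state

    for pattern in patterns:
        if not pattern:
            continue  # ignore empty patterns
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(pattern)

    # Breadth-first traversal: depth-1 states keep failure link 0.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the fallback state.
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, pattern) for every match, scanning text once."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            matches.append((i - len(pattern) + 1, pattern))
    return matches


def matching_lines(lines, patterns):
    """Return the lines containing at least one pattern (fixed strings)."""
    automaton = build_automaton(patterns)
    return [line for line in lines if find_all(line, automaton)]


if __name__ == "__main__":
    # Made-up URLs and patterns, for illustration only.
    urls = ["example.com/news", "example.com/shop", "example.org/blog"]
    print(matching_lines(urls, ["shop", "blog"]))
```

The automaton is built once from the pattern list, and each line is then scanned a single time regardless of how many patterns there are. That is the property that matters when the pattern file is large.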