
Learning Data Mining with Python

24.07.2016 Views

Chapter 9: Authorship Attribution

The use of function words is defined less by the content of the document and more by the decisions made by the author. This makes them good candidates for separating authorship traits between different users. For instance, while many Americans are particular about the difference in usage between that and which in a sentence, people from other countries, such as Australia, are less particular about this. This means that some Australians will lean towards almost exclusively using one word or the other, while others may use which much more. This difference, combined with thousands of other nuanced differences, makes a model of authorship.

Counting function words

We can count function words using the CountVectorizer class we used in Chapter 6, Social Media Insight Using Naive Bayes. This class can be passed a vocabulary, which is the set of words it will look for. If a vocabulary is not passed (we didn't pass one in the code of Chapter 6), then it will learn this vocabulary from the dataset: all the words that appear in the training set of documents (depending on the other parameters, of course).

First, we set up our vocabulary of function words, which is just a list containing each of them. Exactly which words are function words and which are not is up for debate.
I've found this list, from published research, to be quite good:

function_words = ["a", "able", "aboard", "about", "above", "absent",
                  "according", "accordingly", "across", "after", "against",
                  "ahead", "albeit", "all", "along", "alongside", "although",
                  "am", "amid", "amidst", "among", "amongst", "amount", "an",
                  "and", "another", "anti", "any", "anybody", "anyone",
                  "anything", "are", "around", "as", "aside", "astraddle",
                  "astride", "at", "away", "bar", "barring", "be", "because",
                  "been", "before", "behind", "being", "below", "beneath",
                  "beside", "besides", "better", "between", "beyond", "bit",
                  "both", "but", "by", "can", "certain", "circa", "close",
                  "concerning", "consequently", "considering", "could",
                  "couple", "dare", "deal", "despite", "down", "due",
                  "during", "each", "eight", "eighth", "either", "enough",
                  "every", "everybody", "everyone", "everything", "except",
                  "excepting", "excluding", "failing", "few", "fewer",
                  "fifth", "first", "five", "following", "for", "four",
                  "fourth", "from", "front", "given", "good", "great", "had",
                  "half", "have", "he", "heaps", "hence", "her", "hers",
                  "herself", "him", "himself", "his", "however", "i", "if",
                  "in", "including", "inside",
                  "instead", "into", "is", "it", "its", "itself", "keeping",
                  "lack", "less", "like", "little", "loads", "lots", "majority",
                  "many", "masses", "may", "me", "might", "mine", "minority",
                  "minus", "more", "most", "much", "must", "my", "myself",
                  "near", "need", "neither", "nevertheless", "next", "nine",
                  "ninth", "no", "nobody", "none", "nor", "nothing",
                  "notwithstanding", "number", "numbers", "of", "off", "on",
                  "once", "one", "onto", "opposite", "or", "other", "ought",
                  "our", "ours", "ourselves", "out", "outside", "over", "part",
                  "past", "pending", "per", "pertaining", "place", "plenty",
                  "plethora", "plus", "quantities", "quantity", "quarter",
                  "regarding", "remainder", "respecting", "rest", "round",
                  "save", "saving", "second", "seven", "seventh", "several",
                  "shall", "she", "should", "similar", "since", "six", "sixth",
                  "so", "some", "somebody", "someone", "something", "spite",
                  "such", "ten", "tenth", "than", "thanks", "that", "the",
                  "their", "theirs", "them", "themselves", "then", "thence",
                  "therefore", "these", "they", "third", "this", "those",
                  "though", "three", "through", "throughout", "thru", "thus",
                  "till", "time", "to", "tons", "top", "toward", "towards",
                  "two", "under", "underneath", "unless", "unlike", "until",
                  "unto", "up", "upon", "us", "used", "various", "versus",
                  "via", "view", "wanting", "was", "we", "were", "what",
                  "whatever", "when", "whenever", "where", "whereas",
                  "wherever", "whether", "which", "whichever", "while",
                  "whilst", "who", "whoever", "whole", "whom", "whomever",
                  "whose", "will", "with", "within", "without", "would", "yet",
                  "you", "your", "yours", "yourself", "yourselves"]

Now, we can set up an extractor to get the counts of these function words. We will fit this using a pipeline later:

from sklearn.feature_extraction.text import CountVectorizer
extractor = CountVectorizer(vocabulary=function_words)
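
As a rough illustration of how a fixed vocabulary behaves, the sketch below swaps the full function_words list for a four-word stand-in (small_vocabulary and the two sample sentences are hypothetical, chosen only for this example) and assumes scikit-learn's default settings. One detail worth checking in your own setup: the default token_pattern only matches tokens of two or more characters, so single-character entries such as "a" and "i" would receive a count of zero unless a pattern that allows one-character tokens is supplied.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the full function_words list.
small_vocabulary = ["a", "that", "the", "which"]

# Hypothetical sample documents, used only for illustration.
documents = [
    "The book that I read was a gift, which the author signed.",
    "Which one is a classic?",
]

# With a fixed vocabulary, only these words are counted, and the
# columns follow the order of the list that was passed in.
fixed = CountVectorizer(vocabulary=small_vocabulary)
default_counts = fixed.fit_transform(documents).toarray()

# Without a vocabulary, every token seen during fitting becomes a feature.
learned = CountVectorizer().fit(documents)

# Assumption: allowing one-character tokens is wanted here. The default
# token_pattern requires at least two word characters, so "a" is skipped.
single_char = CountVectorizer(vocabulary=small_vocabulary,
                              token_pattern=r"(?u)\b\w+\b")
single_char_counts = single_char.fit_transform(documents).toarray()

print(default_counts)      # column for "a" stays at zero
print(single_char_counts)  # column for "a" is now counted
```

Whether to change token_pattern is a judgment call; if the single-character words are not important for your data, the default extractor shown earlier may be enough.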

