Automated Marketing Research Using Online Customer Reviews

Borrowing from the information retrieval community, our phrase × word matrix is a representation of the vector-space model (VSM). More formally, $j \in J$ is a word in the set of all words; $i \in I$ is a phrase. A phrase is simply a finite sequence of words, and $J$ is a subset of the set of finite word sequences I = {| j ∈ J}. We define an initial phrase × word matrix as a simple variation on the term-frequency inverse-document-frequency (TF-IDF) VSM (Salton and McGill 1983):

\[ \mathrm{Matrix}(i,j) = TF_{ij} \times IPF_j \tag{A1.1} \]

where the term frequency $TF_{ij}$ counts the total number of occurrences of word $j$ in the instances of phrase $i$. The inverse phrase frequency $IPF_j = \log(|I|/n_j)$ is a weighting factor that favors words that are more helpful in distinguishing between different product attributes because they appear in only a fraction of the total number of unique phrases. Here $|I|$ is the total number of unique phrases in the review collection, and $n_j$ is the number of unique phrases containing word $j$.
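As an illustration only, the weighting in (A1.1) can be sketched in a few lines of Python. The sketch assumes that phrases have already been extracted from the reviews and tokenized into words, and it uses the natural logarithm because the base of the logarithm is not specified above. The function name `tf_ipf_matrix` and the sample phrases are hypothetical and are not taken from the original study.

```python
import math
from collections import Counter, defaultdict


def tf_ipf_matrix(phrase_instances):
    """Return {phrase: {word: TF_ij * IPF_j}} as a sparse phrase x word matrix.

    phrase_instances: iterable of tokenized phrase occurrences, e.g.
    [("great", "zoom"), ("great", "zoom"), ("long", "battery", "life")].
    Repeated occurrences of the same phrase are its "instances".
    """
    # TF_ij: total occurrences of word j across all instances of phrase i.
    tf = defaultdict(Counter)
    for instance in phrase_instances:
        phrase = tuple(instance)
        tf[phrase].update(phrase)

    # |I|: number of unique phrases in the collection.
    num_phrases = len(tf)

    # n_j: number of unique phrases that contain word j.
    n = Counter()
    for phrase in tf:
        n.update(set(phrase))

    # IPF_j = log(|I| / n_j); natural log is an assumption.
    ipf = {word: math.log(num_phrases / count) for word, count in n.items()}

    return {
        phrase: {word: count * ipf[word] for word, count in counts.items()}
        for phrase, counts in tf.items()
    }


# Hypothetical sample data: four phrase instances, three unique phrases.
sample = [
    ("great", "zoom"),
    ("great", "zoom"),
    ("great", "battery", "life"),
    ("long", "battery", "life"),
]
matrix = tf_ipf_matrix(sample)
for phrase, row in matrix.items():
    print(phrase, sorted(row.items()))
```

In this toy input, "zoom" occurs in only one of the three unique phrases and so receives a larger IPF than "great", which occurs in two. A word that appeared in every unique phrase would receive an IPF of zero.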

A limitation of the TF-IPF weighting is that there are still some terms (e.g., sentiment words like "great" or "good") that are neither stop words nor product attributes yet appear with product attributes in the TF-IPF matrix. As an additional discount factor beyond IPF, we automatically gather words from a second set of K phrases using online reviews for an unrelated product domain. Intuitively, words appearing in the reviews for unrelated products are less likely to represent relevant product attributes for the focal product. For example, words describing digital camera attributes are less likely to also appear in vacuum cleaner reviews.

Formally, for a set $I'$ of phrases drawn from the set of finite word sequences over $j \in J$, we calculate $\mathrm{rank}(j) = \mathrm{rank}(TF'_{ij} \times IPF'_j)$, where higher weighted frequencies correspond to higher rank. Note that multiple words may share the same rank; if we define words that do not appear in any phrase as having $IPF'_j = 0$, then we may say:

\[ \mathrm{Matrix}(i,j) = TF_{ij}\,\mathrm{rank}(j) \times \left( IPF_j \times IPF'_j \right) \tag{A1.2} \]

Thus, we scale $TF_{ij}$ by the rank of the word in the unrelated product domain and scale $IPF_j$ by $IPF'_j$
