Automated Marketing Research Using Online Customer Reviews

propagate_bounds(phrase, tok_candidates, tok_exclusion, tok_assign, schema)
[1] # marshal prior assignments
[2] unassigned_tok = {t | t ∈ phrase ∧ t ∉ assign_d}
[3] unassigned_attr = {a | a ∈ schema ∧ ∄t (t ∈ phrase ∧ a ∈ tok_assign[t])}
[4] for each t in unassigned_tok:
[5]     tok_exclusion[t] = (t × (unassigned_tok – t)) ⋃ tok_exclusion[t]
[6]     possible_assign = {a | a ∈ (unassigned_attr ⋂ tok_candidates[t])}
[7]     boundary_list = {(t, [possible_assign])} ⋃ boundary_list
[8] recurse_boundary(boundary_list, tok_exclusion, tok_assign)

Figure WA1.2 Propagate boundary constraints

The clustering of phrases into product attributes and the subsequent assignment of words to attribute properties are inherently imperfect. Inconsistencies may emerge for any number of reasons, including poor parsing, the legitimate appearance of one word multiple times within a single phrase (e.g. the phrase 'digital zoom and optical zoom' duplicates the word 'zoom'), or even "inaccuracies" by the human reviewers who write the text that is being automatically processed. This could result in a single attribute property divided over multiple table columns. For example, some reviews might write "SmartMedia" as a single word and others might use "Smart" and "media" as two separate words. Alternatively, multiple product attributes may appear in the same cluster. '[C]ompact flash' and 'compact camera' are clustered together based upon their common use of the word 'compact,' yet refer to distinct attributes.

To address the problem of robustness in the face of noisy clusters that include references to additional product attributes or have different properties for the same attributes, we extend our CLP approach to simultaneously cluster phrases and assign words. By modeling reviews as a graph of phrases, we can apply the same CLP in a pre-assignment step to filter a single (noisy) cluster of phrases.
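As a rough illustration, the set constructions in steps [2] to [7] of Figure WA1.2 might be sketched in Python as follows. This is only a sketch under several assumptions that are not stated in the source: tok_assign is taken to map each already-assigned token to a single attribute, the exclusion set for a token is stored as the set of other unassigned tokens rather than as explicit pairs, and the function name build_boundary_list is hypothetical. The recursive step recurse_boundary in line [8] is not defined in this section and is deliberately left out.

```python
from typing import Dict, List, Set, Tuple


def build_boundary_list(
    phrase: List[str],
    tok_candidates: Dict[str, Set[str]],
    tok_exclusion: Dict[str, Set[str]],
    tok_assign: Dict[str, str],
    schema: Set[str],
) -> List[Tuple[str, List[str]]]:
    """Sketch of steps [2]-[7]; the recursion in step [8] is omitted."""
    # [2] Tokens of the phrase with no prior assignment (order kept, duplicates dropped).
    unassigned_tok = list(dict.fromkeys(t for t in phrase if t not in tok_assign))

    # [3] Attributes of the schema not yet taken by any token of this phrase.
    # Assumption: tok_assign maps a token to exactly one attribute.
    taken = {tok_assign[t] for t in phrase if t in tok_assign}
    unassigned_attr = schema - taken

    boundary_list: List[Tuple[str, List[str]]] = []
    for t in unassigned_tok:
        # [5] t may not share an attribute with the other unassigned tokens.
        # Assumption: the exclusion is stored as a set of tokens, not pairs.
        others = set(unassigned_tok) - {t}
        tok_exclusion[t] = others | tok_exclusion.get(t, set())

        # [6] Attributes that are still free and are candidates for t.
        possible_assign = sorted(unassigned_attr & tok_candidates.get(t, set()))

        # [7] Record the remaining options for t.
        boundary_list.append((t, possible_assign))
    return boundary_list


# Hypothetical example: "lens" is already assigned, the other two tokens are not.
exclusions: Dict[str, Set[str]] = {}
bounds = build_boundary_list(
    phrase=["optical", "zoom", "lens"],
    tok_candidates={"optical": {"type", "component"}, "zoom": {"feature", "type"}},
    tok_exclusion=exclusions,
    tok_assign={"lens": "component"},
    schema={"component", "type", "feature"},
)
print(bounds)
```

In this toy phrase, "component" is already taken by "lens", so "optical" is left with a single option while "zoom" keeps two; a constraint solver would then resolve the remaining choice using the exclusion sets.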
As alluded to in Appendix B.2, we generate a graph where phrases are nodes, and edges represent the co-occurrence of two phrases within the same review. The extended CLP then prunes phrases by recursively applying co-occurrence constraints; two phrases in the same review cannot describe the same attribute just as two words in the same phrase cannot describe the same attribute dimension. The same assignment representation removes phrases that are not central to the product attribute at the heart of a particular phrase cluster. Phrases that are not "connected" in the graphical sense of a connected component or represent conflicting constraints are simply excluded from the subcluster.

Unfortunately, even the extended CLP approach is imperfect. Some of the tables will represent distinct product attributes. Others will simply constitute random noise. Individual tables are supposed to represent distinct product attributes, so we assume that meaningful tables should contain minimal word overlap. With this in mind, we apply a two-stage statistical filter to further filter noisy clusters. First, because each table itself separates tokens into attribute properties (columns), meaningful tables will not hold too small a percentage of the overall number of tokens. Second, we assume that meaningful tables comprise a (predominantly) disjoint token subset. If the tokens in a table appear in no other table, then the intra-table token frequency should match the frequency of the initial k-means cluster; likewise, the table's tokens, when ordered by frequency, should match the relative frequency-based order of the same tokens within the initial cluster.

The first stage of our statistical filter is evaluation of a χ² statistic, comparing each table to its corresponding initial cluster. Although there is no hypothesis to be tested per se, there is a history of applying the χ² statistic in linguistics research to compare different sets of text with a measure that weights higher-frequency tokens with greater significance than lower-frequency tokens (Kilgarriff 2001). In our case, we set a minimum threshold on the χ² statistic to ensure that individual tables reflect an appropriate percentage of tokens from the initial cluster. After filtering out tables that do not satisfy the χ² threshold, we use the same cluster token counts to calculate rank order statistics.
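The exact form of the χ² computation is not given in this section, so the following Python sketch shows only one plausible reading: a Pearson-style statistic in which the expected count of each token in a table is the table's total token count scaled by that token's share of the initial cluster. The function name chi_square_vs_cluster, the toy counts, and the decision to ignore table tokens that are absent from the cluster are all assumptions made for illustration; no threshold value is implied.

```python
from typing import Mapping


def chi_square_vs_cluster(
    table_counts: Mapping[str, int],
    cluster_counts: Mapping[str, int],
) -> float:
    """Pearson-style chi-square of a table's token counts against its initial cluster.

    Assumption: expected counts come from the cluster's token proportions,
    scaled to the table's total; tokens absent from the cluster are ignored.
    """
    table_total = sum(table_counts.values())
    cluster_total = sum(cluster_counts.values())
    if table_total == 0 or cluster_total == 0:
        raise ValueError("both count tables must contain at least one token")

    statistic = 0.0
    for token, cluster_n in cluster_counts.items():
        if cluster_n == 0:
            continue  # no expectation can be formed for this token
        expected = table_total * cluster_n / cluster_total
        observed = table_counts.get(token, 0)
        # High-frequency tokens contribute through larger observed/expected counts.
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical token counts for one initial cluster and two candidate tables.
cluster = {"zoom": 6, "optical": 3, "digital": 1}
same_profile = {"zoom": 6, "optical": 3, "digital": 1}
shifted_profile = {"zoom": 5, "optical": 5}

print(chi_square_vs_cluster(same_profile, cluster))
print(chi_square_vs_cluster(shifted_profile, cluster))
```

A table whose token profile mirrors the cluster yields a statistic of zero, while a table that over-represents some tokens and drops others yields a larger value; how the resulting value is compared against the threshold follows the description above, not this sketch.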
We compare the token rank order from each constituent table to that in the corresponding initial cluster using a modified Spearman rank correlation coefficient (r_s). As a minor extension, we use the relative token rank, meaning that we maintain order but keep only tokens that are in both the initial and the iterated CLP cluster(s). We select as significant those tables that maximize r_s. In the event that two or more tables maximize r_s, we promote all such subclusters either as a noisy cluster or as synonymous words for the same product attribute, as determined by a manual reading.
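A minimal Python sketch of the relative-rank idea follows. It assumes that "relative rank" means re-ranking the shared tokens from 1 to n within each source after discarding tokens that are not in both, and it then applies the textbook Spearman formula r_s = 1 − 6Σd²/(n(n² − 1)). Breaking frequency ties alphabetically is an assumption made only to keep the example deterministic, and the function names and counts are hypothetical.

```python
from typing import Dict, Mapping, Set


def relative_ranks(counts: Mapping[str, int], shared: Set[str]) -> Dict[str, int]:
    """Rank the shared tokens 1..n by descending frequency (ties broken alphabetically)."""
    ordered = sorted(shared, key=lambda tok: (-counts[tok], tok))
    return {tok: rank for rank, tok in enumerate(ordered, start=1)}


def relative_spearman(
    table_counts: Mapping[str, int],
    cluster_counts: Mapping[str, int],
) -> float:
    """Spearman r_s computed over tokens present in both the table and the cluster."""
    shared = set(table_counts) & set(cluster_counts)
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared tokens to compare rank order")

    table_rank = relative_ranks(table_counts, shared)
    cluster_rank = relative_ranks(cluster_counts, shared)

    # Sum of squared rank differences over the shared tokens.
    d_squared = sum((table_rank[t] - cluster_rank[t]) ** 2 for t in shared)
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical counts: "lens" is dropped because it appears only in the cluster.
cluster_tokens = {"zoom": 9, "optical": 5, "digital": 3, "lens": 1}
table_same_order = {"zoom": 4, "optical": 2, "digital": 1}
table_reversed = {"zoom": 1, "optical": 2, "digital": 3}

print(relative_spearman(table_same_order, cluster_tokens))
print(relative_spearman(table_reversed, cluster_tokens))
```

Tables whose shared tokens keep the cluster's frequency order score close to 1, and tables that invert it score close to −1, which is what selecting the tables that maximize r_s relies on.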