22.01.2013 Views

Automated Marketing Research Using Online Customer Reviews


phrase cluster. Phrases that are not “connected” in the graph-theoretic sense of a connected component, or that represent conflicting constraints, are simply excluded from the subcluster.

Unfortunately, even the extended CLP approach is imperfect. Some of the resulting tables will represent distinct product attributes; others will simply constitute random noise. Because each table is supposed to represent a distinct product attribute, we assume that meaningful tables should exhibit minimal word overlap with one another. With this in mind, we apply a two-stage statistical filter to remove noisy clusters.

First, because each table itself separates tokens into attribute properties (columns), meaningful tables should not hold too small a percentage of the overall number of tokens. Second, we assume that meaningful tables comprise a (predominantly) disjoint token subset. If the tokens in a table appear in no other table, then the intra-table token frequency should match the frequency of the initial k-means cluster; likewise, the table's tokens, when ordered by frequency, should match the relative frequency-based order of the same tokens within the initial cluster. The first stage of our statistical filter is the evaluation of a χ² statistic, comparing each table to its corresponding initial cluster. Although there is no hypothesis to be tested per se, there is a history of applying the χ² statistic in linguistics research to compare different sets of text with a measure that gives higher-frequency tokens greater weight than lower-frequency tokens (Kilgarriff 2001). In our case, we set a minimum threshold on the χ² statistic to ensure that individual tables reflect an appropriate percentage of tokens from the initial cluster.
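As an illustration only, the first filter stage might be sketched as follows. The text does not give the exact form of the statistic, so this sketch assumes the two-corpus χ² in the spirit of Kilgarriff (2001), with expected counts derived from the pooled token frequencies. The function names, the `min_chi2` parameter, and the choice to keep tables whose statistic is at or above the threshold (following the "minimum threshold" wording) are all assumptions, not the authors' implementation.

```python
def chi_square(table_counts, cluster_counts):
    """Two-corpus chi-square over the union vocabulary (assumed form).

    Both arguments map token -> count. Expected counts come from the
    pooled frequency of each token, scaled by each corpus size.
    """
    n_table = sum(table_counts.values())
    n_cluster = sum(cluster_counts.values())
    if n_table == 0 or n_cluster == 0:
        raise ValueError("both token-count maps must be non-empty")
    total = n_table + n_cluster

    stat = 0.0
    for token in set(table_counts) | set(cluster_counts):
        observed_table = table_counts.get(token, 0)
        observed_cluster = cluster_counts.get(token, 0)
        pooled = observed_table + observed_cluster
        expected_table = n_table * pooled / total
        expected_cluster = n_cluster * pooled / total
        stat += (observed_table - expected_table) ** 2 / expected_table
        stat += (observed_cluster - expected_cluster) ** 2 / expected_cluster
    return stat


def filter_tables(tables, cluster_counts, min_chi2):
    """Keep tables whose statistic meets the (hypothetical) minimum threshold.

    `tables` maps a table name -> token-count map. The direction of the
    comparison is an assumption based on the text's wording.
    """
    return {
        name: counts
        for name, counts in tables.items()
        if chi_square(counts, cluster_counts) >= min_chi2
    }
```

Because each squared deviation is divided by an expected count that grows with the pooled frequency of the token, frequent tokens contribute most to the statistic, which is the weighting property the text attributes to this measure.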

After filtering out tables that do not satisfy the χ² threshold, we use the same cluster token counts to calculate rank order statistics. We compare the token rank order from each constituent table to that in the corresponding initial cluster using a modified Spearman rank correlation coefficient (r_s). As a minor extension, we use the relative token rank, meaning that we maintain order but keep only tokens that are in both the initial and the iterated CLP cluster(s). We select as significant those tables that maximize r_s. If two or more tables maximize r_s, we promote all such subclusters and determine by a manual reading whether each is a noisy cluster or a set of synonymous words for the same product attribute.
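The second stage can be sketched in the same hedged way. The code below restricts both rankings to the tokens shared by a table and its initial cluster (the "relative token rank"), ranks them by descending frequency, and computes Spearman's coefficient as the Pearson correlation of those ranks. The handling of ties by average rank, the `None` return for degenerate inputs, and all function names are assumptions made for illustration; the text does not specify these details.

```python
def average_ranks(counts, tokens):
    """Rank tokens by descending count; tied tokens share their mean rank."""
    ordered = sorted(tokens, key=lambda t: -counts[t])
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and counts[ordered[j + 1]] == counts[ordered[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[ordered[k]] = mean_rank
        i = j + 1
    return ranks


def relative_spearman(table_counts, cluster_counts):
    """Spearman correlation over tokens present in both count maps.

    Returns None when fewer than two tokens are shared or when either
    ranking is constant, since the coefficient is undefined there.
    """
    shared = sorted(set(table_counts) & set(cluster_counts))
    if len(shared) < 2:
        return None
    table_ranks = average_ranks(table_counts, shared)
    cluster_ranks = average_ranks(cluster_counts, shared)
    x = [table_ranks[t] for t in shared]
    y = [cluster_ranks[t] for t in shared]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / (var_x * var_y) ** 0.5


def best_tables(tables, cluster_counts):
    """Return the names of every table tied for the maximum coefficient."""
    scores = {}
    for name, counts in tables.items():
        r_s = relative_spearman(counts, cluster_counts)
        if r_s is not None:
            scores[name] = r_s
    if not scores:
        return []
    top = max(scores.values())
    return sorted(name for name, r_s in scores.items() if r_s == top)
```

Returning every tied table rather than a single winner mirrors the text: ties are promoted together and resolved by manual reading, not by the code.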

