Automated Marketing Research Using Online Customer Reviews
phrase cluster. Phrases that are not “connected” in the graph-theoretic sense of a connected component, or
that represent conflicting constraints, are simply excluded from the subcluster.
Unfortunately, even the extended CLP approach is imperfect. Some of the tables will represent
distinct product attributes; others will simply constitute random noise. Because individual tables are
supposed to represent distinct product attributes, we assume that meaningful tables should contain minimal
word overlap. With this in mind, we apply a two-stage statistical filter to remove noisy clusters.
First, because each table itself separates tokens into attribute properties (columns), meaningful
tables will not hold too small a percentage of the overall number of tokens. Second, we assume that
meaningful tables comprise a (predominantly) disjoint token subset. If the tokens in a table appear in no
other table, then the intra-table token frequency should match the frequency of the initial k-means cluster;
likewise, the table's tokens, when ordered by frequency, should match the relative frequency-based order
of the same tokens within the initial cluster. The first stage of our statistical filter is the evaluation of a χ²
statistic, comparing each table to its corresponding initial cluster. Although there is no hypothesis to be
tested per se, there is a history of applying the χ² statistic in linguistics research to compare different sets
of text with a measure that weights higher-frequency tokens with greater significance than
lower-frequency tokens (Kilgarriff 2001). In our case, we set a minimum threshold on the χ² statistic to ensure
that individual tables reflect an appropriate percentage of tokens from the initial cluster.
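As an illustration of the first filter stage, the sketch below computes a χ² statistic between the token counts of a table and those of its initial cluster. It assumes the two-sample form used for corpus comparison by Kilgarriff (2001), in which the expected count of each token is proportional to the size of each token set; the text above does not state the exact form used, and the function name `chi_square` and its list-of-tokens inputs are hypothetical.

```python
from collections import Counter


def chi_square(table_tokens, cluster_tokens):
    """Two-sample chi-square statistic over token counts.

    Assumption: expected counts follow the corpus-comparison form,
    E = size_of_set * combined_count / combined_size. This is a sketch,
    not necessarily the exact statistic used in the paper.
    """
    table = Counter(table_tokens)
    cluster = Counter(cluster_tokens)
    n_table = sum(table.values())
    n_cluster = sum(cluster.values())
    if n_table == 0 or n_cluster == 0:
        raise ValueError("both token lists must be non-empty")
    total = n_table + n_cluster

    stat = 0.0
    for token in set(table) | set(cluster):
        combined = table[token] + cluster[token]
        # One contribution per token per token set (table, then cluster).
        for observed, size in ((table[token], n_table),
                               (cluster[token], n_cluster)):
            expected = size * combined / total
            stat += (observed - expected) ** 2 / expected
    return stat
```

Because each term is a squared deviation in raw counts divided by the expected count, a given proportional deviation contributes more for a frequent token than for a rare one, which is the weighting property noted above. A threshold would then be applied to the returned value; its magnitude is not given here and would have to be chosen empirically.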
After filtering out tables that do not satisfy the χ² threshold, we use the same cluster token counts
to calculate rank-order statistics. We compare the token rank order from each constituent table to that in
the corresponding initial cluster using a modified Spearman rank correlation coefficient (rs). As a minor
extension, we use the relative token rank, meaning that we maintain order but keep only tokens that are in
both the initial and the iterated CLP cluster(s). We select as significant those tables that maximize rs. In
the event that two or more tables maximize rs, we promote all such subclusters; a manual reading then
determines whether each is a noisy cluster or a set of synonymous words for the same product attribute.
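The second filter stage can be sketched in the same way. The code below restricts both token sets to their shared tokens, ranks those tokens by frequency within each set, and computes Spearman's coefficient on the resulting relative ranks. The handling of ties (average ranks) and the names `average_ranks` and `relative_spearman` are assumptions made for illustration; the text above does not specify them.

```python
from collections import Counter


def average_ranks(counts, tokens):
    """Rank tokens by descending count; tied tokens share their mean rank."""
    ordered = sorted(tokens, key=lambda t: -counts[t])
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and counts[ordered[j + 1]] == counts[ordered[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for token in ordered[i:j + 1]:
            ranks[token] = mean_rank
        i = j + 1
    return ranks


def relative_spearman(table_tokens, cluster_tokens):
    """Spearman correlation over tokens present in both sets.

    Returns None when fewer than two tokens are shared or when either
    ranking is constant, since the coefficient is undefined in those cases.
    """
    table = Counter(table_tokens)
    cluster = Counter(cluster_tokens)
    shared = sorted(set(table) & set(cluster))
    if len(shared) < 2:
        return None

    table_ranks = average_ranks(table, shared)
    cluster_ranks = average_ranks(cluster, shared)
    x = [table_ranks[t] for t in shared]
    y = [cluster_ranks[t] for t in shared]

    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / (var_x * var_y) ** 0.5
```

Given one coefficient per surviving table, the tables attaining the maximum value would be retained, with ties passed on for manual reading as described above.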