Automated Marketing Research Using Online Customer Reviews


Thomas Y. Lee and Eric T. Bradlow 

Web Appendix A. Algorithmic Details 

In this Appendix, we elaborate on specific details of the text-processing algorithms. 

[1] Data collection and pre-processing 

After identifying the source of reviews, we wrote a program in the Python programming language to collect them. For each review, we extract the product identifier that Epinions.com uses to uniquely identify a product, the list of Pros, and the list of Cons. Product brand names are excerpted from the Epinions.com product identifier and inserted into a MySQL database. Note that one could separate the selected reviews by pre-defined segments (e.g., demographic clusters) and hence produce segment-level Pro-Con lists to be analyzed separately, or attempt to infer latent segments and market structure simultaneously; the latter is beyond the scope of this research.

To construct the matrix of word vectors, we focus on each phrase. For now, we do not distinguish whether a phrase appears as a Pro or a Con, focusing only on grouping together those phrases that discuss a common product attribute. As a standard preprocessing step in text mining, we normalize words in two steps (Salton and McGill 1983). First, we delete all stop-words. Stop-words such as grammatical articles, conjunctions, and prepositions carry no meaning for purposes of product attribute identification, so they are removed. For example, after pruning, the phrase "Only 8 mb Smart media card included" becomes "8 mb Smart media card included." Second, we reduce the remaining words to their root form by stemming. We use the Porter stemmer to find equivalences between singular, plural, past, and present tense forms of individual words used by customers. Thus, "includes" and "included" are both reduced to the root "includ."
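The two normalization steps can be illustrated with a minimal Python sketch. The stop-word list and the suffix-stripping rule below are hypothetical stand-ins chosen for illustration; in particular, `toy_stem` is not the Porter stemmer, only a crude approximation that happens to reproduce the "includ" example, and lower-casing is an added assumption.

```python
# Illustrative stop-word list (an assumption; the source does not give its list).
STOP_WORDS = {"only", "a", "an", "the", "and", "or", "of", "to", "in", "is", "with"}

# Suffixes tried in order; a crude stand-in for the Porter stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(phrase):
    """Lower-case, drop stop-words, and stem the remaining tokens."""
    tokens = phrase.lower().split()
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


if __name__ == "__main__":
    print(normalize("Only 8 mb Smart media card included"))
    print(toy_stem("includes"), toy_stem("included"))
```

A production pipeline would substitute a real Porter stemmer implementation and a standard stop-word list for these toy helpers.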

[2] Vector space model and word importance

Borrowing from the information retrieval community, our phrase × word matrix is a representation of the vector-space model (VSM). More formally, $j \in J$ is a word in the set of all words; $i \in I$ is a phrase. A phrase is simply a finite sequence of words, and J is a subset of the set of finite word sequences $I = \{\langle j_1, \ldots, j_n \rangle \mid j_k \in J\}$. We define an initial phrase × word matrix as a simple variation on the term-frequency inverse-document-frequency (TF-IDF) VSM (Salton and McGill 1983):

$$\mathrm{Matrix}(i,j) = TF_{ij} \times IPF_j \qquad \text{(A1.1)}$$

where the term frequency $TF_{ij}$ counts the total number of occurrences of word j in the instances of phrase i. The inverse phrase frequency $IPF_j = \log(|I|/n_j)$ is a weighting factor that favors words that are more helpful in distinguishing between different product attributes because they appear in only a fraction of the total number of unique phrases. Here $|I|$ is the total number of unique phrases in the review collection, and $n_j$ counts the unique phrases containing word j.
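As a rough illustration of Equation A1.1, the following Python sketch builds the TF-IPF weights as a nested dictionary. It assumes that $TF_{ij}$ sums word occurrences over every instance of phrase i and that the logarithm is natural; neither detail is stated in the text, and the function name `tf_ipf` is hypothetical.

```python
import math
from collections import Counter


def tf_ipf(phrase_instances):
    """Return {phrase: {word: TF_ij * IPF_j}} for a list of tokenized phrases."""
    # Count how many times each distinct phrase occurs in the collection.
    instance_counts = Counter(tuple(p) for p in phrase_instances)
    unique_phrases = list(instance_counts)
    total = len(unique_phrases)  # |I|

    # n_j: number of unique phrases containing word j.
    n_j = Counter()
    for phrase in unique_phrases:
        n_j.update(set(phrase))

    matrix = {}
    for phrase in unique_phrases:
        word_counts = Counter(phrase)
        matrix[phrase] = {
            # TF_ij (occurrences over all instances) times IPF_j = log(|I| / n_j).
            word: count * instance_counts[phrase] * math.log(total / n_j[word])
            for word, count in word_counts.items()
        }
    return matrix


if __name__ == "__main__":
    reviews = [["8", "mb", "card"], ["8", "mb", "card"], ["card", "slot"]]
    for phrase, row in tf_ipf(reviews).items():
        print(phrase, row)
```

In this toy collection "card" appears in every unique phrase, so its weight is zero, which is the discounting behavior the inverse phrase frequency is meant to provide.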

A limitation of the TF-IPF weighting is that there are still some terms (e.g., sentiment words like "great" or "good") that are neither stop-words nor product attributes yet appear with product attributes in the TF-IPF matrix. As an additional discount factor beyond IPF, we automatically gather words from a second set of phrases using online reviews for an unrelated product domain. Intuitively, words appearing in the reviews for unrelated products are less likely to represent relevant product attributes for the focal one. For example, words describing digital camera attributes are less likely to also appear in vacuum cleaner reviews.

Formally, for a set of phrases $I'$ drawn from the set of finite word sequences over $j \in J$, we calculate $\mathrm{rank}(j) = \mathrm{rank}(TF'_{ij} \times IPF'_j)$, where higher weighted frequencies correspond to higher rank. Note that multiple words may share the same rank; if we define words that do not appear in any phrase as having $IPF'_j = 0$, then we may say:

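$$\mathrm{Matrix}(i,j) = \big[\,TF_{ij} \text{ scaled by } \mathrm{rank}(j)\,\big] \times \big[\,IPF_j \text{ scaled by } IPF'_j\,\big] \qquad \text{(A1.2)}$$

Thus, we scale $TF_{ij}$ by the rank of word j in the unrelated product domain and scale $IPF_j$ by $IPF'_j$.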

[3] Phrase clustering 

We cluster the vectors (the matrix rows) so that all phrases describing the same product attribute are grouped together. More formally, given the phrase × word Matrix(i,j) over the set of phrases I and the set of words J, we seek to separate phrases into a set C of k mutually exclusive and exhaustive clusters. We use the cosine measure of angular distance between vectors to calculate similarity. The cosine measure is then applied to the phrase × word matrix using the K-means clustering algorithm. As noted earlier, while any number of clustering algorithms is acceptable, we selected K-means for its simplicity and its familiarity to both the text-mining and marketing communities.

The quality, $Q_C$, of a K-means clustering, C, is calculated by summing the cosine similarity between each vector in a cluster and that cluster's centroid. Following Zhao and Karypis (2002), this metric is more simply defined as the sum of the lengths of the composite vectors:

$$Q_C = \sum_{c_i \in C} \sum_{v \in c_i} \cos(v, \mathrm{centroid}_{c_i}) = \sum_{c_i \in C} \lVert \mathrm{composite}_{c_i} \rVert, \quad \text{where } \mathrm{composite}_{c_i} = \sum_{v \in c_i} v \qquad (1)$$

Because K-means is known to be extremely sensitive to its initial conditions, we repeat the algorithm ten 

times, beginning with a new, random set of k centers and pick the solution that maximizes QC. 
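The restart procedure can be sketched in a few lines of standard-library Python. This is a simplified illustration and not the authors' code: it assumes the row vectors are first scaled to unit length (so that the composite-vector form of $Q_C$ applies), it keeps the previous center when a cluster empties, and the function names and the fixed seed are hypothetical.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def composite(vectors, dim):
    """Component-wise sum of the vectors in one cluster."""
    total = [0.0] * dim
    for v in vectors:
        for idx, x in enumerate(v):
            total[idx] += x
    return total


def spherical_kmeans(vectors, k, rng, max_iter=50):
    """One K-means run with cosine similarity; returns (labels, Q_C)."""
    dim = len(vectors[0])
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [max(range(k), key=lambda c: dot(v, centers[c])) for v in vectors]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = normalize(composite(members, dim))
    quality = 0.0
    for c in range(k):
        members = [v for v, lab in zip(vectors, labels) if lab == c]
        s = composite(members, dim)
        quality += math.sqrt(dot(s, s))  # length of the composite vector
    return labels, quality


def best_of_restarts(rows, k, restarts=10, seed=0):
    """Repeat K-means from random centers and keep the run with maximal Q_C."""
    rng = random.Random(seed)
    vectors = [normalize(r) for r in rows]
    runs = (spherical_kmeans(vectors, k, rng) for _ in range(restarts))
    return max(runs, key=lambda run: run[1])


if __name__ == "__main__":
    rows = [[2.0, 0.0], [1.0, 0.0], [0.0, 3.0], [0.0, 1.0]]
    print(best_of_restarts(rows, k=2))
```

For unit-length vectors, the summed cosine similarity to each cluster's normalized centroid equals the length of that cluster's composite vector, which is why only the composite lengths need to be computed.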

[4] Attributes and their dimensions 

A critical step in our approach is to discover not only what product attributes customers are 

discussing but also the granularity with which those attributes are discussed. Specifically, we seek to 

elicit the attribute dimensions that customers use in their reviews. To discover attributes, we assume that 

each phrase corresponds to a distinct product attribute. To discover attribute dimensions, we will assume 

that each word in the phrase corresponds to a distinct dimension. Discovering attributes then reduces to 

the assignment of particular words to attribute dimensions. 


Conceptually, we model this process as a constrained optimization problem. Abusing our previous notation slightly, assume a set of phrases I composed from the set of words J and a set of attribute dimensions D. We have $|J| \times |D|$ binary decision variables $X_{jd}$, where $X_{jd}$ is 1 if word j is assigned to dimension d. There are $|I| \times |J|$ {0, 1} values representing a constraint matrix, where $Y_{ij}$ is 1 or 0 depending upon whether word j appears in phrase i. Thus, our objective is to:

$$\max \sum_{j,d} X_{jd} \quad \text{s.t.} \quad \sum_{j \in J} Y_{ij}\, X_{jd} \le 1 \;\; \forall i, \qquad X_{jd} \ \text{binary}$$

The graph partitioning algorithm used to set the parameters I, J, and D and the constrained logic 

program (CLP) by which we solve the optimization are implemented in Python and detailed next. 
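To make the optimization concrete, here is a small Python sketch that evaluates a candidate assignment against the stated constraint. It only checks feasibility and computes the objective; it is not the CLP solver, and it assumes the constraint is applied to every phrase i and every dimension d. The function name and the example data are hypothetical.

```python
from itertools import product


def objective_and_feasible(Y, X):
    """Evaluate an assignment.

    Y[i][j] = 1 if word j appears in phrase i.
    X[j][d] = 1 if word j is assigned to dimension d.
    Returns (objective value, True if no phrase places two words in one dimension).
    """
    n_phrases, n_words, n_dims = len(Y), len(X), len(X[0])
    feasible = all(
        sum(Y[i][j] * X[j][d] for j in range(n_words)) <= 1
        for i, d in product(range(n_phrases), range(n_dims))
    )
    objective = sum(sum(row) for row in X)
    return objective, feasible


if __name__ == "__main__":
    # Words (columns of Y): "8", "16", "mb", "card"
    Y = [
        [1, 0, 1, 1],  # phrase "8 mb card"
        [0, 1, 1, 1],  # phrase "16 mb card"
    ]
    # "8" and "16" share dimension 0; "mb" -> 1; "card" -> 2
    good = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
    # "8" and "mb" both in dimension 0 although they co-occur in a phrase
    bad = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 0, 1]]
    print(objective_and_feasible(Y, good))
    print(objective_and_feasible(Y, bad))
```

The first assignment places "8" and "16" in the same column because they never co-occur in a phrase; the second violates the constraint because "8" and "mb" appear together in the first phrase.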

[5] Graph representations 

To discover attributes, we assume that each customer review phrase corresponds to a distinct 

product attribute. To discover attribute dimensions, we assume that each word in the phrase corresponds 

to a distinct dimension. Discovering attribute dimensions then reduces to the assignment of particular 

words to attribute dimensions. But how do we know how many dimensions there are in the assignment 

problem? Is it possible that the assignment optimization has no feasible solution because of conflicting 

constraints due to noise from the vagaries of human language? To solve this problem, we generate a 

graph of all words in the cluster. Each word is a node and arcs are defined by the co-occurrence of two 

words in the same phrase. We partition the graph into (possibly overlapping) sub-graphs by searching for 

maximal cliques. Intuitively, each sub-graph represents a maximal subset of words and phrases for which an optimal solution exists. The size of the maximal clique sets the number of attribute dimensions |D|. The sub-graph (words J and phrases I) defines the optimization.
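The word graph and its maximal cliques can be sketched as follows. The clique search uses the basic Bron-Kerbosch recursion, which is one standard way to enumerate maximal cliques; the source does not say which algorithm the authors used, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_graph(phrases):
    """Build an adjacency map: words are nodes, co-occurrence in a phrase is an edge."""
    adj = defaultdict(set)
    for phrase in phrases:
        for word in phrase:
            adj[word]  # register the word even if it has no neighbors
        for a, b in combinations(set(phrase), 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


if __name__ == "__main__":
    phrases = [["8", "mb", "card"], ["smart", "media", "card"], ["mb", "flash"]]
    cliques = maximal_cliques(cooccurrence_graph(phrases))
    print(cliques)
    print("number of dimensions |D| =", max(len(c) for c in cliques))
```

In this toy cluster the largest clique has three words, so a table with three columns would be used for the assignment step.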


More formally, we assume that phrases and words are preprocessed and normalized into words as before. A graph G = (V,E) is a pair of the set of vertices V and the set of edges E. An edge in E is a connection between two vertices and may be represented as a pair $(v_i, v_j)$ with $v_i, v_j \in V$. Each phrase (word) represents a vertex v in the graph; edges are defined by phrase pairs within a review (word pairs within a phrase). An N-partite graph is a connected graph whose vertices are divided into N sets $V_i$ such that there are no edges within any set $V_i$. A clique of size N plays the role of a relational schema and can be extended to an N-partite graph by substituting each vertex $v_i$ of the clique with a set of vertices $V_i$. A database table with disjoint columns thus represents an N-partite graph where the size of the clique defines the number of columns and each word in the clique "names" a column. A maximal-complete-N-partite graph is a complete-N-partite graph not contained in any other such graph; in other words, the initial clique is maximal. The corresponding database table of phrases represents the existing product attribute space, and the maximal-complete-N-partite graph includes possibly novel combinations of previously unpaired attributes and/or attribute properties.

To relate the graph back to customer reviews, we say that a product attribute is constructed from k 

dimensions. Each dimension names a domain (D). Each domain D is defined by a finite set of words that 

includes the value NULL for review phrases where customers fail to mention one or more attribute 

dimension(s). The Cartesian product of domains D1 … Dk is the set of all k-tuples {(t1, …, tk) | ti ∈ Di}. Each

phrase is simply one such k-tuple and the set of all phrases in the cluster simply defines a finite subset of 

the Cartesian product. A relational schema is simply a mapping of attribute properties A1 …Ak to domains 

D1 … Dk. Note the strong, implicit assumption that a maximal clique, taken over a word graph, is a proxy 

for the proper number of attribute dimensions. Under this assumption, it is easy to see how searching for 

cliques within the graph results in a table. 


[6] Constrained Logic Programming 

To align words into their corresponding attribute dimensions, we frame the task as a 

mathematical assignment problem and resolve the problem using a bounds consistency approach. We 

define the assignment using the maximal clique that corresponds to the schema for each product attribute 

table (see Figure WA1.1). In the bounds consistency approach, we invert the constraints (tok_exclusion) 

to express the complementary set of candidate assignments (tok_candidates) for each attribute dimension. 

If the phrase constraints, taken together, are internally consistent, then the candidate assignments 

(tok_assign) for a given token are simply the intersection of all candidate assignments as defined by all

phrases in the cluster containing that token. 

We transform the mutual exclusivity constraint represented by each phrase into a set of candidate 

assignments using the algorithm in Figure WA1.2. Note that we need only propagate the mutual 

exclusivity of words that are previously unassigned. Accordingly, for each unassigned token in a given 

phrase, the set of candidate assignments is the intersection of the possible assignments based upon the 

current phrase and all candidate assignments from earlier phrases containing the same token. We 

maintain a list of active tokens (boundary_list) to avoid rescanning the set of all tokens every time the possible assignments for a given token are updated.
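The idea of narrowing candidate assignments by propagating mutual exclusivity can be sketched in Python. This is a deliberately simplified illustration under stated assumptions, not the authors' algorithm: each clique word is assumed to seed its own dimension, candidates are plain sets of column indices, and an empty candidate set is taken to signal conflicting constraints. The function name `assign_dimensions` is hypothetical.

```python
def assign_dimensions(phrases, schema):
    """Propagate 'two words in one phrase cannot share a dimension'.

    phrases: list of token lists belonging to one cluster.
    schema:  clique words, one per dimension (column index = position).
    Returns {token: set of candidate dimensions}; a singleton set is an
    assignment and an empty set marks an inconsistency.
    """
    dims = set(range(len(schema)))
    candidates = {tok: set(dims) for phrase in phrases for tok in phrase}
    for d, word in enumerate(schema):
        candidates[word] = {d}

    queue = list(schema)  # tokens whose assignment was just fixed
    while queue:
        tok = queue.pop()
        if len(candidates[tok]) != 1:
            continue  # emptied by a conflict after it was queued
        (d,) = candidates[tok]
        for phrase in phrases:
            if tok not in phrase:
                continue
            for other in phrase:
                if other != tok and d in candidates[other]:
                    candidates[other].discard(d)
                    if len(candidates[other]) == 1:
                        queue.append(other)
    return candidates


if __name__ == "__main__":
    cluster = [["8", "mb", "card"], ["16", "mb", "card"], ["16", "mb", "flashcard"]]
    result = assign_dimensions(cluster, schema=["8", "mb", "card"])
    for tok, cands in sorted(result.items()):
        print(tok, sorted(cands))
```

In the example, "16" is forced into the same column as "8" because it co-occurs with both "mb" and "card", and "flashcard" then lands in the column named by "card".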

process_phrases(p_list)
    schema = find_maximal_clique(p_list)
    order phrases by length
    for each phrase p:
        # initialize data structures
        tok_exclusion – for each tok, mutually exclusive tokens
        tok_candidates – for each tok, valid candidate assignments
        tok_assign – for each tok, the dimension assignment
        # propagate the constraints for each successive phrase
        tok_candidates, tok_exclusion, tok_assign =
            propagate_bounds(p, tok_candidates,
                             tok_exclusion, tok_assign, schema)

Figure WA1.1 Logical Assignment

propagate_bounds(phrase, tok_candidates, tok_exclusion, tok_assign, schema)
    # marshal prior assignments
    unassigned_tok = {t | t ∈ phrase ∧ t ∉ assign_d}
    unassigned_attr = {a | a ∈ schema ∧ ¬∃t (t ∈ phrase ∧ a ∈ tok_assign[t])}
    for each t in unassigned_tok:
        tok_exclusion[t] = (t × (unassigned_tok – t)) ⋃ tok_exclusion[t]
        possible_assign = {a | a ∈ (unassigned_attr ⋂ tok_candidates[t])}
        boundary_list = {(t, [possible_assign])} ⋃ boundary_list
    recurse_boundary(boundary_list, tok_exclusion, tok_assign)

Figure WA1.2 Propagate boundary constraints

Finally, the K-means clustering used to separate review phrases into distinct product attributes is a noisy process. The clustering can easily result in the inclusion of spurious phrases. Both the initial clustering of phrases into product attributes and the subsequent assignment of words to attribute properties are inherently imperfect. Inconsistencies may emerge for any number of reasons, including poor parsing, the legitimate appearance of one word multiple times within a single phrase (e.g., the phrase 'digital zoom and optical zoom' duplicates the word 'zoom'), or even "inaccuracies" by the human reviewers who write the text that is being automatically processed. This could result in a single attribute property divided over multiple table columns. For example, some reviewers might write "SmartMedia" as a single word and others might use "Smart" and "media" as two separate words. Alternatively, multiple product attributes may appear in the same cluster. '[C]ompact flash' and 'compact camera' are clustered together based upon their common use of the word 'compact,' yet refer to distinct attributes.

To address the problem of robustness in the face of noisy clusters that include references to additional 

product attributes or have different properties for the same attributes, we extend our CLP approach to 

simultaneously cluster phrases and assign words. By modeling reviews as a graph of phrases, we can 

apply the same CLP in a pre-assignment step to filter a single (noisy) cluster of phrases. As alluded to in 

Appendix B.2, we generate a graph where phrases are nodes, and edges represent the co-occurrence of 

two phrases within the same review. The extended CLP then prunes phrases by recursively applying co- 

occurrence constraints; two phrases in the same review cannot describe the same attribute just as two 

words in the same phrase cannot describe the same attribute dimension. The same assignment 

representation removes phrases that are not central to the product attribute at the heart of a particular 


phrase cluster. Phrases that are not “connected” in the graphical sense of a connected component or 

represent conflicting constraints are simply excluded from the subcluster. 

Unfortunately, even the extended CLP approach is imperfect. Some of the tables will represent 

distinct product attributes. Others will simply constitute random noise. Individual tables are supposed to 

represent distinct product attributes, so we assume that meaningful tables should contain minimal word 

overlap. With this in mind, we apply a two-stage statistical filter to further filter noisy clusters. 

First, because each table itself separates tokens into attribute properties (columns), meaningful 

tables will not hold too small a percentage of the overall number of tokens. Second, we assume that 

meaningful tables comprise a (predominately) disjoint token subset. If the tokens in a table appear in no 

other table, then the intra-table token frequency should match the frequency of the initial k-means cluster; 

likewise, the table's tokens, when ordered by frequency, should match the relative frequency-based order 

of the same tokens within the initial cluster. The first stage of our statistical filter is evaluation of a χ² statistic, comparing each table to its corresponding initial cluster. Although there is no hypothesis to be tested per se, there is a history of applying the χ² statistic in linguistics research to compare different sets of text with a measure that weights higher-frequency tokens with greater significance than lower-frequency tokens (Kilgarriff 2001). In our case, we set a minimum threshold on the χ² statistic to ensure that individual tables reflect an appropriate percentage of tokens from the initial cluster.

After filtering out tables that do not satisfy the χ² threshold, we use the same cluster token counts

to calculate rank order statistics. We compare the token rank order from each constituent table to that in 

the corresponding initial cluster using a modified Spearman rank correlation co-efficient (rs). As a minor 

extension, we use the relative token rank, meaning that we maintain order but keep only tokens that are in 

both the initial and the iterated CLP cluster(s). We select as significant those tables that maximize rs. In 

the event that two or more tables maximize rs we promote all such subclusters either as a noisy cluster or 

as synonymous words for the same product attribute as determined by a manual reading. 
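The second filtering stage can be illustrated with a short Python sketch of a Spearman rank correlation computed over relative ranks, that is, ranks recomputed on only those tokens present in both the table and the initial cluster. The exact modification the authors apply is not specified, so this uses the textbook formula $r_s = 1 - 6\sum d^2 / (n(n^2-1))$ and breaks frequency ties alphabetically, both of which are assumptions; the function names are hypothetical.

```python
def relative_ranks(counts, shared):
    """Rank the shared tokens by descending count (ties broken alphabetically)."""
    ordered = sorted(shared, key=lambda tok: (-counts[tok], tok))
    return {tok: rank for rank, tok in enumerate(ordered, start=1)}


def spearman_relative(table_counts, cluster_counts):
    """Spearman correlation between a table and its initial cluster.

    Only tokens present in both are ranked. Returns None when fewer than
    two tokens are shared, since the coefficient is then undefined.
    """
    shared = set(table_counts) & set(cluster_counts)
    n = len(shared)
    if n < 2:
        return None
    r_table = relative_ranks(table_counts, shared)
    r_cluster = relative_ranks(cluster_counts, shared)
    d2 = sum((r_table[tok] - r_cluster[tok]) ** 2 for tok in shared)
    return 1 - 6 * d2 / (n * (n * n - 1))


if __name__ == "__main__":
    cluster = {"zoom": 10, "lens": 7, "optical": 5, "nice": 2}
    table = {"zoom": 4, "optical": 3, "lens": 1}
    print(spearman_relative(table, cluster))
```

Here "nice" is ignored because it does not appear in the table, and the swap of "lens" and "optical" between the two orderings lowers the coefficient from 1.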


Web Appendix B. Automatically generated attributes and their corresponding properties and levels 

1. Picture (what/where) 

tough picture 

wildlife, watersports, vivid, underexposure, unclear, unattractive, took, surprisingly, sun, streaky, crisp, sharpest, color 

regret, recommend, quality, printable, proof, outside, noisy, minimum, maximum, lost, kid, immediately, hit, 

guess, fully, frame, far, excellent, even, definitely, daytime, counter, construct, base, adequate, absolutely 

vibrant underexpose 

plenty, gallery 

consist 

stong, inaccuracy, blotchy 

2. Flash (memory, card, photo) 

1 flash type card 

slot 2 immediate 

meg bigger 

use cheap 

compact fragile 

hard 

3. Slow (start-up, turn on, recovery) 

shutter reaction, slow, fast, mode, set, control, active, auto 

exposure long, no, high quality, wide, action 

us speed automatic bit top 

recovery lag, second feature little 

wonder sluggish turbo button, operate time, adjust, manual, take, range 

1 need release delay 

virtual shot hard 2, response press 

4. Resolution 

capacity megapixel low effect 

resolution cost really high 

lcd 1.5 

5. Megapixel 

mpix 4.0 access life 

megapixel 5 set 

pixel re mega 

pixl improve meg 

stick 6.3 map 

4.10, 4.1, 3.4, 2.6, 2.3, 2.0, 11, 1.1, .8, 0 come 

6. Lens (cap, quality, manufacturer) 

easily cap pro lost lens quality 

leash avail variable no leica 

telephoto option average damage amazing 

loose nikon 

attach famous, distance 

canon 


7. Optical (zoom, viewfinder) 

low optical viewfinder chintzy zoom digital 

blurry wide telephoto 

16 variance lens 

x3 option 

x10 correction 

smallest, power, bigger, 7, large, lack, 

feature, decent, cheat, available, 2.8, 2.5x 

8. Body (design, construction) 

body feel fragile plastic 

look flimsy ugly 

built small nice 

compact solid metal 

camera fairly indo 

9. Print (size, quality, output) 

print photo hard quality average 

produce beautiful match image 

film 4x6 high 

perfect copies 

adjust, accuracy, misleading, 

inaccurate 

10. Zoom 

capable clear, digit, lack, fuzzy, limit, efficient zoom 2.5 

definitely incredible lens useless 

us, nice, fast loose tricky stink 

non feature no seamless 

little, function, somewhat benefit 2.5x 1.6x, distance, crap, 32 

true, really 3 lack, fuzziness, limit, efficiently 12 

11. Feel (mfr, construction) 

solid, camera look, old little design 

film feel point nice 

thing, simpler want, supp, resembles, bulkier, hard right build 

35 way, finish kind, somewhat us 

cheap, plentiful built 

starter, get extra, faulty awesome hold 

12. Menu 

control, menu versus us control option slow 

tough relative screen read manual, , 

cumbersome, bury, 

plain, inexpensive 

quirky, min sensitive 

easy manual up set, multiple, lot, difficult simple, complicated, scatter 

full, extensive, 

lack, no 

hidden, awkward 

custom, exposure preview option hidden function, scheme navigate petroglyph 

complicated 

decent familiar feature nice, layout, larger, 

imbed, hopeless, 

clear, camera, maze, 

ergonomic, annoy 

compress minimum interface, graphic design 

13. Shoot 

shoot adjust mode fast camera 

stitch slew, rapid, preset, multiple, lot, lack, numerous sound digital repetitive 

movie microphone point 

function clip, need look 

option infinitely just 


14. Support (service) 

support product terrible custom 

service tech inadequate 

need 

indifferent 

organize 

quick 

15. LCD 

see light screen hard lcd bright unimpressive, twist, stripe, dull, sharper, lackluster, innovative, 

deceptively, placement, location, flexible, clarity, accurate, 1.5 

sunlight soft difficult rotate 

night panel dim unprotected, articulated, huge, clear 

display, automatic outside little 

pretty 

review exposure 

daylight 

relatively 

16. Movie (audio, visual) 

control sound set unlimited length clip video 

second 30 take price length 

movie quicktime no avi 

lot audio vga mpeg mode 

quality capture useless 60 

capability feature 

17. USB 

platform, computer upload easy, weird given speed 

dock transfer file use connect 

usb, convenient charge brainless, flawless picture 

sound video, come, station mean 

archive large option pc 

cable travel slow camera 

serial download quick tv 

av link problem 

18. Focus 

focus auto fast 

soft 

nice 

19. Software 

interface benefit image no, slideshow, 

include, bundle, 

download, view, 

easyshare, zoom, browse 

useful, capable, 

improve, 

tv, external, provided, 

bundled, use, tedious, 

compatible, slightly, twain 

easy 

method, easily, 

custom 

software 

computer clunky computer, user friendly, average possible 

proprietary complicated special window retain 

fun manage clumsy tricky, function contrast 

mac, somewhat, 5 set weak, wonder, capture, suck, include, basic, hookup 

option, little, package, kodak decent, lack, 

creative 

low, install, say, reload program 

difficult xp 


output, interface, box, way, 

pretty, extremely, 

particularly, generous

20. Cover (lens, LCD, battery) 

lcd cover screen no close step, interfere, care 

camera protect display slide tricky 

strap, built, auto, automatic lenses side fully zoom 

integrated lens open 

rubber afterthought lock 

retract batteries flimsy mechanical 

21. Price 

price place little higher, nice, reason 

long easy, ease point, fair use 

speed autofocus range time 

fast perfect super high 

cheap startup 

cost sluggish 

22. Memory/Screen 

need memory large lcd 

fairly screen travel view, review, 

use window additional protect 

proprietary panel sony dirty 

money upgrade, removable 

built 

23. Floppy (storage media) 

floppy access disk versatile easy 

use no 

storage 

operate 

cheap 

24. Disk 

disc computer use, regular 

easier disk load floppy 

media easi transfer 1.44mb 

diskette medium, space removable 

storage limited, choice 

cheap 

25. Battery (life, use, type) 

sure pretty, little, aa, 

eats, run, 4, 

power, ion, 

supply, built, 

take, back 

tomorrow 

recharge, extra, 

infolithium 

up use preview batteries 

lot, quick, nimh, 2, 

pack, charger, right, 

wonder 

included, 

inexpensive, keep, 

drain, suck 

fast, quickyi, alkaline, 

really, regular, lithium, , 

hour, no, need 


battery 

life, time display thing cruddy just give 

ease remove, camera price weigh because socket 

bring expensive long last easy liion 

terrible real low charge lose 

system

26. Size 

small, broke, convenient, durable, 

intuitive, mechanical , near, big, 

problem, pleasant, case, bigger, 

smaller, larger, large, versatile, 

cumbersome, money, status, heavy, 

guide, nice, no, clearer, easily, dim 

design, tad, weight, slightly, kinda, lot, 

frustrating expensive, icon, capable, 

incredible, extremely, stick, chassis, 

function, mirror, possible, perfect, little, 

right, require, hard, cheap, somewhat, bit, 

2.2, digit, pretty, compare, damage, 

pocket, flip 

non, awkward a101 stuffed 

couple time finepix 

true, nearly, jean wish drop 

expect unusual, separate software 

crack unit, shape, moveable preset 


sleek, lightweight, cam, , size, display, 

quirk, button, turn, color, complicated, 

ton, fairly, pro, ergonomic, solid, bulky, 

ergo, rugged, viewfinder, basic, highly, 

carry, pressure 

27. Photo quality 

photo loose 

unfocused wash 

min trait 

touch, suburb, white, tint, suction, stun, stitch, soft, share, landscape, rupture, retouch, result, realist, quality, pretty, 

plan, noisy, move, manage, lost, length, generous, fuzzy, fully, file, fabulous, eras, hard, amicable, alter, actual, 

accurate, 250, floppy, downtime 

unexpected 

0 

28. Low light 

auto lowlight little focus low difficult, option bit reliable 

dim light touchy set certain mix situation capable 

object hunt, finicky conditions auto, lot question, night grainy 

use lamp blind assist judge harder accurate 

aid hard, long, 

inability, 

difficult 

annoy, 

awkward, 

inadequate, 

laser 

fast, take, average, 

assist, slightly, range, 

system, level 

adjustable no, problem room condition, trouble, 

iffy, motion, way, 

lag, need, set, 

relatively, occasion 

imperfect 

especially, limit, 

sensitive, focus, dark, 

dismal, fare, cost, 

noisy, weak, finicky 

29. Control 

control obvious, parallax 

dist, underexposure, solid, ergonomic, unintuitive, tiny, simpler, sensitive, place, pad, odd, finicky, familiar, dummy, 

overexposure 

color, clear, basic, analog, clumsy, individual 

stigma casio 

frame, refund, size slight 

slightly full 

guess, time, big 

30. Macro (lens) 

macro design, zero, unimpressed, built, average, lack, awesome mode 

no, amazing, fussy, real, function 

super, great. fantastic, poor stupendous 

nice possible, ability 

incredible feature 

closeup, unsurpassed

31. MB (memory) 

mb media, picture included 8 smart card 

flash small lowly quality memory 

no usb 4, 32 16 flashcard 

compactflash 2, 7, 11, provided stick, take onboard, internal, run, wimpy, pricey 

come, skimpy little 128 way smartcard 

need avail size measly recommended 

32. Edit (in camera) 

edit software capable package 

effect onboard image limited 

no 

average 

33. Red eye 

problem flash eye red hue 

built massive indoors low background 

pic occasion poor yellow look, 

anti lot catch light 

show tint 

unbearable, frequent, easier, deflect, appear, eliminate, control 

34. Shutter (delay, lag) 

shutter lag turbo button, operate delay 

us speed automat bit top 

recovery sluggish second feature little 

wonderful reaction, slow, fast, long, no, mode, set, control, active, quality, wide, time, adjust, manual, take, 

exposure high, 

action 

rang 

virtual 1 need release auto 

shot hard 2, responsive press 

35. Features 

feature manual, newer, g2, extend, , switch, change, readily lot, additional, 6, need, avail, expert, document, lack 

place practice, neat, cool, extra range, array 

point wide, incredible, hard, access easy, ton, hard, difficult 

small, large, huge, sell, limit, impress, basic set, want, readable 

pro semi, level 

long, competitive list 

36. Instructions 

instruction clear 

unique from 

lot 

lose 

lack 

answer 

37. Adapter (AC) 

separate ac purchase no available adapter 

need Level 1: included optional power external 

50 comes case 

charger 

usb 


38. Picture quality 

extremely, switch poor picture quality 

surprisingly, printable, mediocre, amazingly image cap 

print design 

zoom distance 

durable 

lens 

39. Image (quality) 

inconsistent image quality 

webcam, super, profession, margin, hard, over, ok, nikon, mediocre, lcd, indifferent, wonder, unacceptable, 

satisfactory, terrible, problem, overprocessed, nice, mean, fair, expect, class, awful, average, astounding, 

astonishing, accept, medium, addicted, horrible, generally, control, case, before 

sharp 

incredible need 

color 

produce 

detail 

color 


Web Appendix C. User Survey 

In this Appendix, we list the digital camera product attributes (with duplicates eliminated) that are 

found exclusively in one or more online buying guides (Expert Only), learned automatically from the 

product reviews (VOC Only), or in both (Expert + VOC). The means for Familiarity and Importance as 

collected from our survey are reported on a 1 to 7 scale. To help align the attribute names used here with 

those in Table 2, we include a mapping from the automatically derived attribute clusters (auto) to those 55 

attributes used in the consumer survey. Note that in some cases, an automatically derived attribute is 

mapped to more than one survey (expert) attribute name and vice versa due to inconsistencies between the 

granularity with which an attribute is discussed in the expert guides and/or by the Voice of the Consumer. 

Expert Only

Survey attribute      Familiarity   Importance
battery source        6.42          6.05
flash ext             4.91          3.09
flash range           4.17          3.90
image compress        4.80          4.62
image sensor          2.37          3.25
image stab            4.88          5.28
man light sens        3.69          3.69
manual exp            2.69          3.10
manual light meter    2.38          2.86
manual shut           3.09          3.24
mem qty built-in      5.68          5.05
movie fps             4.43          4.35
movie output          3.90          4.20
music play            4.57          3.80
num sensors           2.25          3.18
power adapt           6.05          4.87
time lapse            4.45          3.76
wide angle            3.57          3.57


VOC Only

Survey attribute      Auto                     Familiarity   Importance
camera size           size                     6.35          5.87
body (design)         body                     5.82          5.48
download time         USB                      5.68          4.75
feel (durability)     feel; support service    5.30          5.43
instructions          instruction              6.07          4.18
lcd brightness        screen                   4.62          4.57
shutter lag           slow; shutter            3.87          3.80
twist lcd             cover                    3.80          3.62

Expert + VOC

Survey attribute      Auto                                      Familiarity   Importance
battery life          battery                                   6.50          6.30
cam soft              edit                                      5.32          3.57
cam type              shoot                                     4.65          5.06
comp cxn              USB                                       6.13          5.75
comp soft             software                                  5.77          4.60
ergonomic feel        feel                                      5.15          4.83
flash built-in        red-eye                                   6.45          6.26
flash mode            low-light; red-eye                        5.48          5.12
lcd viewfinder        lcd                                       5.00          5.15
lens cap              cover; lens                               4.92          4.25
lens type             macro                                     3.80          4.20
manual aper           control                                   2.40          2.86
manual focus          control; focus                            5.14          3.72
mem capacity          mb                                        5.30          5.68
mem stor type         disk; floppy; flash (drive)               5.63          5.47
movie audio           movie                                     4.85          4.48
movie length          movie                                     5.73          5.47
movie res             mpeg                                      4.97          5.15
navigation            menu                                      5.93          5.40
optical viewfinder    optical                                   4.50          4.33
picture quality       photo qual; picture; print; image qual    6.35          6.60
price                 price                                     6.31          6.45
resolution            resolution; megapixel                     5.35          5.56
shot modes            features                                  5.55          4.67
shutter delay         slow; shutter                             4.27          3.95
shutter speed         control                                   4.81          4.60
white bal             features                                  3.40          3.74
zoom dig              zoom                                      5.07          4.47
zoom opt              zoom; optical                             5.52          5.21


Web Appendix D. Interpreting the Correspondence Analysis dimensions. 

Correspondence Analysis is an approach to dimension reduction when analyzing high- 

dimensional, two-mode, two-way count data (Everitt and Dunn 2001). To interpret the dimensions in the 

reduced space, we regressed each brand’s factor scores on the derived attributes. For each figure in the 

paper, we report the results of the stepwise regression on F1 and F2. The reported results assume a 

probability for entry of .05 and a probability of removal of .1. 

To make the dimensions both interpretable and actionable, we follow the marketing literature in 

relating customer needs, as represented by user generated reviews, to actionable manufacturer 

specifications, as in the "House of Quality" (Hauser and Clausing 1988). Specifically, product attributes 

elicited from online reviews include different granularities ("zoom" versus "10x optical zoom") or reflect 

different levels of technical sophistication ("low-light settings" vs. "iso settings"). Consulting the set of 

professional buying guides, we manually mapped our 39 automatically generated attributes onto a coarser 

but actionable categorization of specifications. The visualizations are generated and interpreted using 

these meta-attributes. 

lens cap/cover     resolution       manual focus     lens     rechargeable battery      image resolution/print qual
feel durability    price            movie/video      zoom     image stabil              manual controls
start-up time      pc connect       navigation       weight   software                  picture types
shutter-lag        flash built-in   storage type     size     low-light/light control   service/warranty
shot-delay         camera type      white balance    lcd      autofocus

Table WD4.1 Actionable meta-attributes for explaining CA dimensions

Figure 3. Mapping the market using customer reviews

Summary of attribute selection for dimension F1.

No. of variables   Attribute      MSE    R²      Adjusted R²
1                  navigation     .000   .897    .882
2                  storage type   .000   .992    .989
3                  movie/video    .000   1.000   .999

Summary of attribute selection for dimension F2.

No. of variables   Attribute      MSE    R²      Adjusted R²
1                  low light      .000   .913    .900
2                  lens           .000   .988    .983
3                  price          .000   1.000   .999

Figure 5. Market structure by Cons with Pros as supplementary points

Summary of attribute selection for dimension F1.

No. of variables   Attribute                     MSE    R²      Adjusted R²
1                  image resolution/print qual   .000   .929    .918
2                  start-up time                 .000   .972    .963
3                  lens/cap cover                .000   .989    .982
4                  shutter-lag                   .000   .996    .992
5                  service-warranty              .000   1.000   .999

Summary of attribute selection for dimension F2.

No. of variables   Attribute                     MSE    R²      Adjusted R²
1                  lens                          .001   .689    .644
2                  navigation                    .000   .933    .911
3                  size                          .000   .973    .956
