Thomas Y. Lee and Eric T. Bradlow<br />

Web Appendix A. Algorithmic Details<br />

In this Appendix, we elaborate on specific details of the text-processing algorithms.<br />

[1] Data collection and pre-processing<br />

After identifying the source of reviews, we wrote a program in the Python programming language to<br />

collect them. For each review, we extract the product identifier that Epinions.com uses to uniquely<br />

identify a product, the list of Pros, and the list of Cons. Product brand names are excerpted from the<br />

Epinions.com product identifier and inserted into a MySQL database. One could also separate the<br />

selected reviews by pre-defined segments (e.g. demographic clusters) and hence produce segment-level<br />

Pro-Con lists that would be analyzed separately, or attempt to infer latent segments and market<br />

structure simultaneously; the latter is beyond the scope of this research.<br />

To construct the matrix of word vectors, we focus on each phrase. For now, we do not<br />

distinguish between whether a phrase appears as a Pro or a Con, focusing only on grouping together those<br />

phrases that discuss a common product attribute. As a standard preprocessing step in text mining, we<br />

normalize words in two steps: we delete all stop-words and stem the remaining text (Salton and McGill 1983).<br />

Stop-words, such as grammatical articles, conjunctions, and prepositions, are meaningless for purposes of<br />

product attribute identification, so they are removed. For example, after pruning, the phrase "Only 8 mb<br />

Smart media card included" becomes "8 mb Smart media card included." We then reduce words to their<br />

root form by stemming. We use the Porter stemmer to find equivalences between singular, plural, past and<br />

present tense forms of individual words used by customers. Thus, "includes" and "included" are both<br />

reduced to the root "includ."<br />
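
The two normalization steps can be illustrated with a minimal sketch. The stop-word list and the `crude_stem` suffix stripper below are illustrative stand-ins (assumptions, not the list or the Porter stemmer used in this research), and lowercasing is likewise an assumption; a real pipeline would call an actual Porter stemmer implementation.

```python
# Illustrative stop-word list (an assumption; the appendix does not enumerate one).
STOP_WORDS = {"a", "an", "and", "of", "only", "or", "the", "with"}

# Suffixes stripped by the toy stemmer. This is NOT the Porter algorithm,
# only a stand-in that behaves similarly on the example words.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip one common suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize_phrase(phrase):
    """Lowercase, drop stop-words, and stem the remaining tokens."""
    tokens = phrase.lower().split()
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


print(normalize_phrase("Only 8 mb Smart media card included"))
print(crude_stem("includes"), crude_stem("included"))
```

Both "includes" and "included" map to the same root here, which is the equivalence the stemming step is meant to provide.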

[2] Vector space model and word importance

Borrowing from the information retrieval community, our phrase × word matrix is a<br />

representation of the vector-space model (VSM). More formally, $j \in J$ is a word in the set of all words;<br />

$i \in I$ is a phrase. A phrase is simply a finite sequence of words and J is a subset of the set of finite word<br />

sequences $I = \{\langle j_1, \ldots, j_n \rangle \mid j_k \in J\}$. We define an initial phrase × word matrix as a simple variation on the<br />

term-frequency inverse-document-frequency (TF-IDF) VSM (Salton and McGill 1983):<br />

$$\mathrm{Matrix}(i,j) = TF_{ij} \times IPF_j \qquad \text{(A1.1)}$$

where the term frequency $TF_{ij}$ counts the total number of occurrences of word j in the instances of<br />

phrase i. The inverse phrase frequency $IPF_j = \log(|I|/n_j)$ is a weighting factor that favors words that are more<br />

helpful in distinguishing between different product attributes because they appear in only a fraction of the<br />

total number of unique phrases. Here |I| represents the total number of unique phrases in the review<br />

collection, and $n_j$ counts the total number of unique phrases containing word j.<br />
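
As a rough illustration of the TF-IPF weighting, the following sketch builds the phrase × word matrix from a handful of already-normalized phrase instances. The function name `tf_ipf_matrix`, the input format (a list of token tuples, repeated once per instance), and the toy data are assumptions made for the example.

```python
import math
from collections import Counter


def tf_ipf_matrix(phrase_instances):
    """Build a phrase x word TF-IPF matrix from normalized phrase instances.

    phrase_instances: list of token tuples; a phrase that occurs in several
    reviews appears several times (hypothetical input format).
    """
    instance_counts = Counter(phrase_instances)
    phrases = sorted(instance_counts)                 # unique phrases, the set I
    vocab = sorted({w for p in phrases for w in p})   # words, the set J

    # n_j: number of unique phrases containing word j
    n = {w: sum(1 for p in phrases if w in p) for w in vocab}
    ipf = {w: math.log(len(phrases) / n[w]) for w in vocab}

    matrix = []
    for p in phrases:
        per_instance = Counter(p)
        # TF_ij: occurrences of word j summed over all instances of phrase i
        row = [per_instance[w] * instance_counts[p] * ipf[w] for w in vocab]
        matrix.append(row)
    return phrases, vocab, matrix


instances = [("optic", "zoom"), ("digit", "zoom"), ("lcd", "screen"), ("optic", "zoom")]
phrases, vocab, matrix = tf_ipf_matrix(instances)
for phrase, row in zip(phrases, matrix):
    print(phrase, [round(x, 3) for x in row])
```

In this toy data "zoom" appears in two of the three unique phrases, so it receives a smaller IPF weight than words confined to a single phrase.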

A limitation of the TF-IPF weighting is that there are still some terms (e.g. sentiment words like<br />

"great" or "good") that are neither stop words nor product attributes yet appear with product attributes in<br />

the TF-IPF matrix. As an additional discount factor beyond IPF, we automatically gather words from a<br />

second set of K phrases using online reviews for an unrelated product domain. Intuitively, words<br />

appearing in the reviews for unrelated products are less likely to represent relevant product attributes for<br />

the focal one. For example, words describing digital camera attributes are less likely to also appear in<br />

vacuum cleaner reviews.<br />

Formally, for a set of phrases $I'$ drawn from the set of finite word sequences over $j \in J$, we<br />

calculate $rank(j) = rank(TF'_{ij} \times IPF'_j)$, where higher weighted frequencies correspond to higher rank. Note<br />

that multiple words may share the same rank; if we define words that do not appear in any phrase as<br />

having $IPF'_j = 0$, then we may say:<br />

$$\mathrm{Matrix}(i,j) = \bigl(TF_{ij} \times rank(j)\bigr) \times \bigl(IPF_j \times IPF'_j\bigr) \qquad \text{(A1.2)}$$

Thus, we scale $TF_{ij}$ by the rank of the word in the unrelated product domain and scale $IPF_j$ by $IPF'_j$.<br />


[3] Phrase clustering<br />

We cluster the vectors (the matrix rows) so that all phrases describing the same product attribute<br />

are grouped together. More formally, given the phrase × word matrix(i,j) over the set of I phrases and the<br />

set of words J, we seek to separate phrases into a set C of k mutually exclusive and exhaustive clusters.<br />

We use the cosine measure of angular distance between vectors to calculate similarity. The cosine<br />

measure is then applied to the phrase × word matrix using the K-means clustering algorithm. As noted<br />

earlier, while any number of clustering algorithms is acceptable, we selected K-means for its simplicity<br />

and its familiarity to both the text-mining and marketing communities.<br />

The quality, $Q_C$, of a K-means clustering, $C$, is calculated as the sum of the cosine measures between each<br />

vector in a cluster and that cluster's centroid. Following Zhao and Karypis (2002), this metric is more<br />

simply defined as the sum of the lengths of the composite vectors:<br />

$$Q_C = \sum_{c_i \in C} \sum_{v \in c_i} \cos\bigl(v, \mathrm{centroid}(c_i)\bigr) = \sum_{c_i \in C} \lVert \mathrm{composite}(c_i) \rVert, \quad \text{where } \mathrm{composite}(c_i) = \sum_{v \in c_i} v \qquad (1)$$

Because K-means is known to be extremely sensitive to its initial conditions, we repeat the algorithm ten<br />

times, beginning with a new, random set of k centers and pick the solution that maximizes QC.<br />
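
A minimal sketch of the restart procedure follows, assuming unit-normalized row vectors (under which the summed cosine to the centroid equals the composite-vector length). The function names, the iteration cap, and the fixed random seed are illustrative assumptions rather than details taken from this research.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def quality(clusters):
    """Q_C: sum over clusters of the length of the composite (summed) vector."""
    total = 0.0
    for members in clusters:
        if members:
            composite = [sum(col) for col in zip(*members)]
            total += math.sqrt(dot(composite, composite))
    return total


def cosine_kmeans(vectors, k, rng, iterations=20):
    """One K-means run using cosine similarity, started from k random rows."""
    unit = [normalize(v) for v in vectors]
    centroids = rng.sample(unit, k)
    labels = None
    for _ in range(iterations):
        new_labels = [max(range(k), key=lambda c: dot(v, centroids[c])) for v in unit]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(unit, labels) if lab == c]
            if members:
                centroids[c] = normalize([sum(col) for col in zip(*members)])
    clusters = [[v for v, lab in zip(unit, labels) if lab == c] for c in range(k)]
    return labels, quality(clusters)


def best_of_restarts(vectors, k, restarts=10, seed=0):
    """Repeat the clustering and keep the run with the largest Q_C."""
    rng = random.Random(seed)
    runs = [cosine_kmeans(vectors, k, rng) for _ in range(restarts)]
    return max(runs, key=lambda run: run[1])


vectors = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
labels, q = best_of_restarts(vectors, k=2)
print(labels, round(q, 4))
```

Keeping the run with the largest quality value mirrors the selection rule described above; the number of restarts is exposed as a parameter so it can be varied.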

[4] Attributes and their dimensions<br />

A critical step in our approach is to discover not only what product attributes customers are<br />

discussing but also the granularity with which those attributes are discussed. Specifically, we seek to<br />

elicit the attribute dimensions that customers use in their reviews. To discover attributes, we assume that<br />

each phrase corresponds to a distinct product attribute. To discover attribute dimensions, we will assume<br />

that each word in the phrase corresponds to a distinct dimension. Discovering attributes then reduces to<br />

the assignment of particular words to attribute dimensions.<br />



Conceptually, we model this process as a constrained optimization problem. Abusing our<br />

previous notation slightly, assume a set of phrases I composed from the set of words J and a set of<br />

attribute dimensions D. We have J × D binary decision variables $X_{jd}$, where $X_{jd}$ is 1 if word j is<br />

assigned to dimension d. There are I × J {0, 1} variables representing a constraint matrix, where $Y_{ij}$ is<br />

1 or 0 depending upon whether word j appears in phrase i. Thus, our objective is to:<br />

$$\max \sum_{j \in J} \sum_{d \in D} X_{jd} \quad \text{s.t.} \quad \sum_{j \in J} Y_{ij} X_{jd} \le 1 \;\; \forall i, d, \qquad X_{jd} \text{ binary}$$

The graph partitioning algorithm used to set the parameters I, J, and D and the constrained logic<br />

program (CLP) by which we solve the optimization are implemented in Python and detailed next.<br />

[5] Graph representations<br />

To discover attributes, we assume that each customer review phrase corresponds to a distinct<br />

product attribute. To discover attribute dimensions, we assume that each word in the phrase corresponds<br />

to a distinct dimension. Discovering attribute dimensions then reduces to the assignment of particular<br />

words to attribute dimensions. But how do we know how many dimensions there are in the assignment<br />

problem? Is it possible that the assignment optimization has no feasible solution because of conflicting<br />

constraints due to noise from the vagaries of human language? To solve this problem, we generate a<br />

graph of all words in the cluster. Each word is a node and arcs are defined by the co-occurrence of two<br />

words in the same phrase. We partition the graph into (possibly overlapping) sub-graphs by searching for<br />

maximal cliques. Intuitively, each sub-graph represents a maximal subset of words and phrases for which<br />

an optimal solution exists. The size of the maximal clique sets the number of attribute dimensions |D|. The<br />

sub-graph (words J and phrases I) defines the optimization.<br />
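
The word graph and the clique search can be sketched as follows. The appendix does not say which clique-finding algorithm is used; this example uses the basic Bron-Kerbosch recursion as one possible choice, and the function names and toy phrases are assumptions.

```python
from itertools import combinations


def cooccurrence_graph(phrases):
    """Nodes are words; an edge joins two words that share a phrase."""
    graph = {}
    for phrase in phrases:
        words = set(phrase)
        for w in words:
            graph.setdefault(w, set())
        for a, b in combinations(words, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def maximal_cliques(graph):
    """Enumerate all maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidate extensions, x: already-explored nodes
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques


phrases = [("8", "mb", "card"), ("16", "mb", "card"), ("mb", "stick")]
cliques = maximal_cliques(cooccurrence_graph(phrases))
largest = max(cliques, key=len)
print(sorted(sorted(c) for c in cliques), len(largest))
```

The size of the largest clique found this way would play the role of |D| in the assignment problem.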



More formally, we assume that phrases and words are preprocessed and normalized into words as<br />

before. A graph G = (V,E) is a pair of the set of vertices V and the set of edges E. An edge in E is a<br />

connection between two vertices and may be represented as a pair (vi,vj) of vertices in V. Each phrase (word)<br />

represents a vertex v in the graph; edges are defined by phrase pairs within a review (word pairs within a<br />

phrase). An N-partite graph is a connected graph where there are no edges within any set of vertices Vi. A<br />

clique of size N plays the role of a relational schema and can be extended to an N-partite<br />

graph by substituting each vertex vi of the clique with a set of vertices Vi. A database table with disjoint<br />

columns thus represents an N-partite graph where the size of the clique defines the number of columns<br />

and each word in the clique “names” a column. A maximal-complete-N-partite graph is a complete-N-<br />

partite graph not contained in any other such graph; in other words, the initial clique is maximal. The<br />

corresponding database table of phrases represents the existing product attribute space, and the maximal-<br />

complete-N-partite graph includes possibly novel combinations of previously unpaired attributes and/or<br />

attribute properties.<br />

To relate the graph back to customer reviews, we say that a product attribute is constructed from k<br />

dimensions. Each dimension names a domain (D). Each domain D is defined by a finite set of words that<br />

includes the value NULL for review phrases where customers fail to mention one or more attribute<br />

dimension(s). The Cartesian product of domains D1 …Dk is the set of all k-tuples {t1…tk | ti ∈ Di}. Each<br />

phrase is simply one such k-tuple and the set of all phrases in the cluster simply defines a finite subset of<br />

the Cartesian product. A relational schema is simply a mapping of attribute properties A1 …Ak to domains<br />

D1 … Dk. Note the strong, implicit assumption that a maximal clique, taken over a word graph, is a proxy<br />

for the proper number of attribute dimensions. Under this assumption, it is easy to see how searching for<br />

cliques within the graph results in a table.<br />


[6] Constrained Logic Programming<br />

To align words into their corresponding attribute dimensions, we frame the task as a<br />

mathematical assignment problem and resolve the problem using a bounds consistency approach. We<br />

define the assignment using the maximal clique that corresponds to the schema for each product attribute<br />

table (see Figure WA1.1). In the bounds consistency approach, we invert the constraints (tok_exclusion)<br />

to express the complementary set of candidate assignments (tok_candidates) for each attribute dimension.<br />

If the phrase constraints, taken together, are internally consistent, then the candidate assignments<br />

(tok_assign) for a given token are simply the intersection of all candidate assignments as defined by all<br />

phrases in the cluster containing that token.<br />

We transform the mutual exclusivity constraint represented by each phrase into a set of candidate<br />

assignments using the algorithm in Figure WA1.2. Note that we need only propagate the mutual<br />

exclusivity of words that are previously unassigned. Accordingly, for each unassigned token in a given<br />

phrase, the set of candidate assignments is the intersection of the possible assignments based upon the<br />

current phrase and all candidate assignments from earlier phrases containing the same token. We<br />

maintain a list of active tokens, boundary_list, to avoid rescanning the set of all tokens every time the<br />

possible assignments for a given token are updated.<br />
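
The intersection of candidate assignments can be illustrated with a simplified fixed-point sketch. It is not the procedure of Figures WA1.1 and WA1.2: it omits tok_exclusion, the boundary list, and recurse_boundary, and it assumes (hypothetically) that each word of the maximal clique names its own dimension. The function name `assign_dimensions` and the toy data are likewise assumptions.

```python
def assign_dimensions(phrases, schema):
    """Assign words to dimensions by intersecting per-phrase candidates.

    schema: words of the maximal clique; each is assumed to name one dimension.
    Returns (assign, candidates): final assignments and remaining candidate sets.
    """
    assign = {w: w for w in schema}   # clique words are pre-assigned
    candidates = {}
    changed = True
    while changed:
        changed = False
        for phrase in phrases:
            # Dimensions already used by assigned words in this phrase
            taken = {assign[t] for t in phrase if t in assign}
            free = set(schema) - taken
            for t in phrase:
                if t in assign:
                    continue
                # Intersect with candidates inherited from earlier phrases
                new = candidates.get(t, set(schema)) & free
                if new != candidates.get(t):
                    candidates[t] = new
                    changed = True
                if len(new) == 1:
                    assign[t] = next(iter(new))
                    changed = True
    return assign, candidates


schema = ["8", "mb", "card"]
phrases = [("8", "mb", "card"), ("16", "mb", "card"), ("16", "mb", "stick")]
assign, candidates = assign_dimensions(phrases, schema)
print(assign)
```

A token whose candidate set becomes empty signals conflicting constraints; this sketch simply leaves such a token unassigned.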

Finally, the K-means clustering used to separate review phrases into distinct product attributes is<br />

a noisy process. The clustering can easily result in the inclusion of spurious phrases. Both the initial<br />

process_phrases(p_list)<br />
  schema = find_maximal_clique(p_list)<br />
  order phrases by length<br />
  for each phrase p:<br />
    # initialize data structures<br />
    tok_exclusion – for each tok, mutually exclusive tokens<br />
    tok_candidates – for each tok, valid candidate assignments<br />
    tok_assign – for each tok, the dimension assignment<br />
    # propagate the constraints for each successive phrase<br />
    tok_candidates, tok_exclusion, tok_assign =<br />
      propagate_bounds(phrase, tok_candidates,<br />
      tok_exclusion, tok_assign, schema)<br />

Figure WA1.1 Logical Assignment<br />


propagate_bounds(phrase, tok_candidates, tok_exclusion, tok_assign, schema)<br />
  # marshall prior assignments<br />
  unassigned_tok = {t | t ∈ phrase ∧ t ∉ assign_d}<br />
  unassigned_attr = {a | a ∈ schema ∧ ∄t (t ∈ phrase ∧ a ∈ tok_assign[t])}<br />
  for each t in unassigned_tok:<br />
    tok_exclusion[t] = (t × (unassigned_tok – t)) ⋃ tok_exclusion[t]<br />
    possible_assign = {a | a ∈ (unassigned_attr ⋂ tok_candidates[t])}<br />
    boundary_list = {(t,[possible_assign])} ⋃ boundary_list<br />
  recurse_boundary(boundary_list, tok_exclusion, tok_assign)<br />

Figure WA1.2 Propagate boundary constraints<br />

clustering of phrases into product attributes and the subsequent assignment of words to attribute<br />

properties are inherently imperfect. Inconsistencies may emerge for any number of reasons, including<br />

poor parsing, the legitimate appearance of one word multiple times within a single phrase (e.g. the phrase<br />

‘digital zoom and optical zoom’ duplicates the word ‘zoom’), or even “inaccuracies” by the human<br />

reviewers who write the text that is being automatically processed. This could result in a single attribute<br />

property divided over multiple table columns. For example, some reviewers might write "SmartMedia" as a<br />

single word and others might use "Smart" and "media" as two separate words. Alternatively, multiple<br />

product attributes may appear in the same cluster. '[C]ompact flash' and 'compact camera' are clustered<br />

together based upon their common use of the word 'compact,' yet refer to distinct attributes.<br />

To address the problem of robustness in the face of noisy clusters that include references to additional<br />

product attributes or have different properties for the same attributes, we extend our CLP approach to<br />

simultaneously cluster phrases and assign words. By modeling reviews as a graph of phrases, we can<br />

apply the same CLP in a pre-assignment step to filter a single (noisy) cluster of phrases. As alluded to in<br />

Appendix B.2, we generate a graph where phrases are nodes, and edges represent the co-occurrence of<br />

two phrases within the same review. The extended CLP then prunes phrases by recursively applying co-<br />

occurrence constraints; two phrases in the same review cannot describe the same attribute just as two<br />

words in the same phrase cannot describe the same attribute dimension. The same assignment<br />

representation removes phrases that are not central to the product attribute at the heart of a particular<br />


phrase cluster. Phrases that are not “connected” in the graphical sense of a connected component or<br />

represent conflicting constraints are simply excluded from the subcluster.<br />

Unfortunately, even the extended CLP approach is imperfect. Some of the tables will represent<br />

distinct product attributes. Others will simply constitute random noise. Individual tables are supposed to<br />

represent distinct product attributes, so we assume that meaningful tables should contain minimal word<br />

overlap. With this in mind, we apply a two-stage statistical filter to further filter noisy clusters.<br />

First, because each table itself separates tokens into attribute properties (columns), meaningful<br />

tables will not hold too small a percentage of the overall number of tokens. Second, we assume that<br />

meaningful tables comprise a (predominately) disjoint token subset. If the tokens in a table appear in no<br />

other table, then the intra-table token frequency should match the frequency of the initial k-means cluster;<br />

likewise, the table's tokens, when ordered by frequency, should match the relative frequency-based order<br />

of the same tokens within the initial cluster. The first stage of our statistical filter is evaluation of a χ²<br />

statistic, comparing each table to its corresponding initial cluster. Although there is no hypothesis to be<br />

tested per se, there is a history of applying the χ² statistic in linguistics research to compare different sets<br />

of text with a measure that weights higher-frequency tokens with greater significance than<br />

lower-frequency tokens (Kilgarriff 2001). In our case, we set a minimum threshold on the χ² statistic to ensure<br />

that individual tables reflect an appropriate percentage of tokens from the initial cluster.<br />

After filtering out tables that do not satisfy the χ² threshold, we use the same cluster token counts<br />

to calculate rank order statistics. We compare the token rank order from each constituent table to that in<br />

the corresponding initial cluster using a modified Spearman rank correlation coefficient (rs). As a minor<br />

extension, we use the relative token rank, meaning that we maintain order but keep only tokens that are in<br />

both the initial and the iterated CLP cluster(s). We select as significant those tables that maximize rs. In<br />

the event that two or more tables maximize rs we promote all such subclusters either as a noisy cluster or<br />

as synonymous words for the same product attribute as determined by a manual reading.<br />
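
The relative-rank comparison can be sketched as below. The appendix does not spell out the exact modification to Spearman's coefficient, so this example makes an assumption: it restricts both rankings to the shared tokens, assigns average ranks to ties, and computes the ordinary Spearman coefficient on those ranks. The function names and token counts are invented for illustration.

```python
import math


def average_ranks(counts, tokens):
    """Rank tokens by descending count; tied tokens share the average rank."""
    ordered = sorted(tokens, key=lambda t: -counts[t])
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and counts[ordered[j + 1]] == counts[ordered[i]]:
            j += 1
        for t in ordered[i : j + 1]:
            ranks[t] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def relative_spearman(table_counts, cluster_counts):
    """Spearman correlation over tokens present in both table and cluster."""
    shared = sorted(set(table_counts) & set(cluster_counts))
    if len(shared) < 2:
        return None
    a = average_ranks(table_counts, shared)
    b = average_ranks(cluster_counts, shared)
    mean = (len(shared) + 1) / 2
    cov = sum((a[t] - mean) * (b[t] - mean) for t in shared)
    var_a = sum((a[t] - mean) ** 2 for t in shared)
    var_b = sum((b[t] - mean) ** 2 for t in shared)
    if var_a == 0 or var_b == 0:
        return None
    return cov / math.sqrt(var_a * var_b)


cluster = {"zoom": 10, "optic": 6, "digit": 4, "lens": 2}
same_order = {"zoom": 5, "optic": 3, "digit": 1}
reversed_order = {"zoom": 1, "optic": 3, "digit": 5}
print(relative_spearman(same_order, cluster))
print(relative_spearman(reversed_order, cluster))
```

A table whose shared tokens keep the cluster's frequency order scores near 1, which is the property the second filter stage rewards.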


Web Appendix B. Automatically generated attributes and their corresponding properties and levels<br />

1. Picture (what/where)<br />

tough picture<br />

wildlife, watersports, vivid, underexposure, unclear, unattractive, took, surprisingly, sun, streaky, crisp, sharpest, color<br />

regret, recommend, quality, printable, proof, outside, noisy, minimum, maximum, lost, kid, immediately, hit,<br />

guess, fully, frame, far, excellent, even, definitely, daytime, counter, construct, base, adequate, absolutely<br />

vibrant underexpose<br />

plenty, gallery<br />

consist<br />

stong, inaccuracy, blotchy<br />

2. Flash (memory, card, photo)<br />

1 flash type card<br />

slot 2 immediate<br />

meg bigger<br />

use cheap<br />

compact fragile<br />

hard<br />

3. Slow (start-up, turn on, recovery)<br />

shutter reaction, slow, fast, mode, set, control, active, auto<br />

exposure long, no, high quality, wide, action<br />

us speed automatic bit top<br />

recovery lag, second feature little<br />

wonder sluggish turbo button, operate time, adjust, manual, take, range<br />

1 need release delay<br />

virtual shot hard 2, response press<br />

4. Resolution<br />

capacity megapixel low effect<br />

resolution cost really high<br />

lcd 1.5<br />

5. Megapixel<br />

mpix 4.0 access life<br />

megapixel 5 set<br />

pixel re mega<br />

pixl improve meg<br />

stick 6.3 map<br />

4.10, 4.1, 3.4, 2.6, 2.3, 2.0, 11, 1.1, .8, 0 come<br />

6. Lens (cap, quality, manufacturer)<br />

easily cap pro lost lens quality<br />

leash avail variable no leica<br />

telephoto option average damage amazing<br />

loose nikon<br />

attach famous, distance<br />

canon<br />


7. Optical (zoom, viewfinder)<br />

low optical viewfinder chintzy zoom digital<br />

blurry wide telephoto<br />

16 variance lens<br />

x3 option<br />

x10 correction<br />

smallest, power, bigger, 7, large, lack,<br />

feature, decent, cheat, available, 2.8, 2.5x<br />

8. Body (design, construction)<br />

body feel fragile plastic<br />

look flimsy ugly<br />

built small nice<br />

compact solid metal<br />

camera fairly indo<br />

9. Print (size, quality, output)<br />

print photo hard quality average<br />

produce beautiful match image<br />

film 4x6 high<br />

perfect copies<br />

adjust, accuracy, misleading,<br />

inaccurate<br />

10. Zoom<br />

capable clear, digit, lack, fuzzy, limit, efficient zoom 2.5<br />

definitely incredible lens useless<br />

us, nice, fast loose tricky stink<br />

non feature no seamless<br />

little, function, somewhat benefit 2.5x 1.6x, distance, crap, 32<br />

true, really 3 lack, fuzziness, limit, efficiently 12<br />

11. Feel (mfr, construction)<br />

solid, camera look, old little design<br />

film feel point nice<br />

thing, simpler want, supp, resembles, bulkier, hard right build<br />

35 way, finish kind, somewhat us<br />

cheap, plentiful built<br />

starter, get extra, faulty awesome hold<br />

12. Menu<br />

control, menu versus us control option slow<br />

tough relative screen read manual, ,<br />

cumbersome, bury,<br />

plain, inexpensive<br />

quirky, min sensitive<br />

easy manual up set, multiple, lot, difficult simple, complicated, scatter<br />

full, extensive,<br />

lack, no<br />

hidden, awkward<br />

custom, exposure preview option hidden function, scheme navigate petroglyph<br />

complicated<br />

decent familiar feature nice, layout, larger,<br />

imbed, hopeless,<br />

clear, camera, maze,<br />

ergonomic, annoy<br />

compress minimum interface, graphic design<br />

13. Shoot<br />

shoot adjust mode fast camera<br />

stitch slew, rapid, preset, multiple, lot, lack, numerous sound digital repetitive<br />

movie microphone point<br />

function clip, need look<br />

option infinitely just<br />


14. Support (service)<br />

support product terrible custom<br />

service tech inadequate<br />

need<br />

indifferent<br />

organize<br />

quick<br />

15. LCD<br />

see light screen hard lcd bright unimpressive, twist, stripe, dull, sharper, lackluster, innovative,<br />

deceptively, placement, location, flexible, clarity, accurate, 1.5<br />

sunlight soft difficult rotate<br />

night panel dim unprotected, articulated, huge, clear<br />

display, automatic outside little<br />

pretty<br />

review exposure<br />

daylight<br />

relatively<br />

16. Movie (audio, visual)<br />

control sound set unlimited length clip video<br />

second 30 take price length<br />

movie quicktime no avi<br />

lot audio vga mpeg mode<br />

quality capture useless 60<br />

capability feature<br />

17. USB<br />

platform, computer upload easy, weird given speed<br />

dock transfer file use connect<br />

usb, convenient charge brainless, flawless picture<br />

sound video, come, station mean<br />

archive large option pc<br />

cable travel slow camera<br />

serial download quick tv<br />

av link problem<br />

18. Focus<br />

focus auto fast<br />

soft<br />

nice<br />

19. Software<br />

interface benefit image no, slideshow,<br />

include, bundle,<br />

download, view,<br />

easyshare, zoom, browse<br />

useful, capable,<br />

improve,<br />

tv, external, provided,<br />

bundled, use, tedious,<br />

compatible, slightly, twain<br />

easy<br />

method, easily,<br />

custom<br />

software<br />

computer clunky computer, user friendly, average possible<br />

proprietary complicated special window retain<br />

fun manage clumsy tricky, function contrast<br />

mac, somewhat, 5 set weak, wonder, capture, suck, include, basic, hookup<br />

option, little, package, kodak decent, lack,<br />

creative<br />

low, install, say, reload program<br />

difficult xp<br />


output, interface, box, way,<br />

pretty, extremely,<br />

particularly, generous

20. Cover (lens, LCD, battery)<br />

lcd cover screen no close step, interfere, care<br />

camera protect display slide tricky<br />

strap, built, auto, automatic lenses side fully zoom<br />

integrated lens open<br />

rubber afterthought lock<br />

retract batteries flimsy mechanical<br />

21. Price<br />

price place little higher, nice, reason<br />

long easy, ease point, fair use<br />

speed autofocus range time<br />

fast perfect super high<br />

cheap startup<br />

cost sluggish<br />

22. Memory/Screen<br />

need memory large lcd<br />

fairly screen travel view, review,<br />

use window additional protect<br />

proprietary panel sony dirty<br />

money upgrade, removable<br />

built<br />

23. Floppy (storage media)<br />

floppy access disk versatile easy<br />

use no<br />

storage<br />

operate<br />

cheap<br />

24. Disk<br />

disc computer use, regular<br />

easier disk load floppy<br />

media easi transfer 1.44mb<br />

diskette medium, space removable<br />

storage limited, choice<br />

cheap<br />

25. Battery (life, use, type)<br />

sure pretty, little, aa,<br />

eats, run, 4,<br />

power, ion,<br />

supply, built,<br />

take, back<br />

tomorrow<br />

recharge, extra,<br />

infolithium<br />

up use preview batteries<br />

lot, quick, nimh, 2,<br />

pack, charger, right,<br />

wonder<br />

included,<br />

inexpensive, keep,<br />

drain, suck<br />

fast, quickyi, alkaline,<br />

really, regular, lithium, ,<br />

hour, no, need<br />


battery<br />

life, time display thing cruddy just give<br />

ease remove, camera price weigh because socket<br />

bring expensive long last easy liion<br />

terrible real low charge lose<br />


26. Size<br />

small, broke, convenient, durable,<br />

intuitive, mechanical , near, big,<br />

problem, pleasant, case, bigger,<br />

smaller, larger, large, versatile,<br />

cumbersome, money, status, heavy,<br />

guide, nice, no, clearer, easily, dim<br />

design, tad, weight, slightly, kinda, lot,<br />

frustrating expensive, icon, capable,<br />

incredible, extremely, stick, chassis,<br />

function, mirror, possible, perfect, little,<br />

right, require, hard, cheap, somewhat, bit,<br />

2.2, digit, pretty, compare, damage,<br />

pocket, flip<br />

non, awkward a101 stuffed<br />

couple time finepix<br />

true, nearly, jean wish drop<br />

expect unusual, separate software<br />

crack unit, shape, moveable preset<br />


sleek, lightweight, cam, , size, display,<br />

quirk, button, turn, color, complicated,<br />

ton, fairly, pro, ergonomic, solid, bulky,<br />

ergo, rugged, viewfinder, basic, highly,<br />

carry, pressure<br />

27. Photo quality<br />

photo loose<br />

unfocused wash<br />

min trait<br />

touch, suburb, white, tint, suction, stun, stitch, soft, share, landscape, rupture, retouch, result, realist, quality, pretty,<br />

plan, noisy, move, manage, lost, length, generous, fuzzy, fully, file, fabulous, eras, hard, amicable, alter, actual,<br />

accurate, 250, floppy, downtime<br />

unexpected<br />

0<br />

28. Low light<br />

auto lowlight little focus low difficult, option bit reliable<br />

dim light touchy set certain mix situation capable<br />

object hunt, finicky conditions auto, lot question, night grainy<br />

use lamp blind assist judge harder accurate<br />

aid hard, long,<br />

inability,<br />

difficult<br />

annoy,<br />

awkward,<br />

inadequate,<br />

laser<br />

fast, take, average,<br />

assist, slightly, range,<br />

system, level<br />

adjustable no, problem room condition, trouble,<br />

iffy, motion, way,<br />

lag, need, set,<br />

relatively, occasion<br />

imperfect<br />

especially, limit,<br />

sensitive, focus, dark,<br />

dismal, fare, cost,<br />

noisy, weak, finicky<br />

29. Control<br />

control obvious, parallax<br />

dist, underexposure, solid, ergonomic, unintuitive, tiny, simpler, sensitive, place, pad, odd, finicky, familiar, dummy,<br />

overexposure<br />

color, clear, basic, analog, clumsy, individual<br />

stigma casio<br />

frame, refund, size slight<br />

slightly full<br />

guess, time, big<br />

30. Macro (lens)<br />

macro design, zero, unimpressed, built, average, lack, awesome mode<br />

no, amazing, fussy, real, function<br />

super, great. fantastic, poor stupendous<br />

nice possible, ability<br />

incredible feature<br />

closeup, unsurpassed

31. MB (memory)<br />

mb media, picture included 8 smart card<br />

flash small lowly quality memory<br />

no usb 4, 32 16 flashcard<br />

compactflash 2, 7, 11, provided stick, take onboard, internal, run, wimpy, pricey<br />

come, skimpy little 128 way smartcard<br />

need avail size measly recommended<br />

32. Edit (in camera)<br />

edit software capable package<br />

effect onboard image limited<br />

no<br />

average<br />

33. Red eye<br />

problem flash eye red hue<br />

built massive indoors low background<br />

pic occasion poor yellow look,<br />

anti lot catch light<br />

show tint<br />

unbearable, frequent, easier, deflect, appear, eliminate, control<br />

34. Shutter (delay, lag)<br />

shutter lag turbo button, operate delay<br />

us speed automat bit top<br />

recovery sluggish second feature little<br />

wonderful reaction, slow, fast, long, no, mode, set, control, active, quality, wide, time, adjust, manual, take,<br />

exposure high,<br />

action<br />

rang<br />

virtual 1 need release auto<br />

shot hard 2, responsive press<br />

35. Features<br />

feature manual, newer, g2, extend, , switch, change, readily lot, additional, 6, need, avail, expert, document, lack<br />

place practice, neat, cool, extra range, array<br />

point wide, incredible, hard, access easy, ton, hard, difficult<br />

small, large, huge, sell, limit, impress, basic set, want, readable<br />

pro semi, level<br />

long, competitive list<br />

36. Instructions<br />

instruction clear<br />

unique from<br />

lot<br />

lose<br />

lack<br />

answer<br />

37. Adapter (AC)<br />

separate ac purchase no available adapter<br />

need Level 1: included optional power external<br />

50 comes case<br />

charger<br />

usb<br />


38. Picture quality<br />

extremely, switch poor picture quality<br />

surprisingly, printable, mediocre, amazingly image cap<br />

print design<br />

zoom distance<br />

durable<br />

lens<br />

39. Image (quality)<br />

inconsistent image quality<br />

webcam, super, profession, margin, hard, over, ok, nikon, mediocre, lcd, indifferent, wonder, unacceptable,<br />

satisfactory, terrible, problem, overprocessed, nice, mean, fair, expect, class, awful, average, astounding,<br />

astonishing, accept, medium, addicted, horrible, generally, control, case, before<br />

sharp<br />

incredible need<br />

color<br />

produce<br />

detail<br />

color<br />


Web Appendix C. User Survey<br />

In this Appendix, we list the digital camera product attributes (with duplicates eliminated) that are<br />

found exclusively in one or more online buying guides (Expert Only), learned automatically from the<br />

product reviews (VOC Only), or in both (Expert + VOC). The means for Familiarity and Importance as<br />

collected from our survey are reported on a 1 to 7 scale. To help align the attribute names used here with<br />

those in Table 2, we include a mapping from the automatically derived attribute clusters (auto) to those 55<br />

attributes used in the consumer survey. In some cases, an automatically derived attribute is<br />

mapped to more than one survey (expert) attribute name, and vice versa, because the expert guides and<br />

the Voice of the Consumer discuss attributes at different levels of granularity.<br />
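The three-way split and the many-to-many mapping described above can be sketched in a few lines of Python. This is only an illustration: the entries are a small subset copied from the tables in this appendix, and the names `survey_to_auto`, `expert_attrs`, and `classify` are hypothetical helpers, not part of the original analysis.

```python
from collections import defaultdict

# Subset of the tables in this appendix: survey attribute -> automatically
# derived clusters (auto). An empty tuple means no review cluster maps to it.
survey_to_auto = {
    "flash range": (),
    "camera size": ("size",),
    "download time": ("USB",),
    "comp cxn": ("USB",),
    "flash mode": ("low-light", "red-eye"),
    "movie audio": ("movie",),
    "movie length": ("movie",),
}

# Survey attributes that also appear in at least one expert buying guide
# (taken from the Expert Only and Expert + VOC tables).
expert_attrs = {"flash range", "comp cxn", "flash mode",
                "movie audio", "movie length"}


def classify(attr):
    """Return the table an attribute belongs to."""
    in_expert = attr in expert_attrs
    in_voc = bool(survey_to_auto.get(attr))
    if in_expert and in_voc:
        return "Expert + VOC"
    if in_expert:
        return "Expert Only"
    if in_voc:
        return "VOC Only"
    raise ValueError(f"{attr!r} appears in neither source")


# Invert the mapping to see which auto clusters feed several survey attributes.
auto_to_survey = defaultdict(list)
for survey_name, autos in survey_to_auto.items():
    for auto_name in autos:
        auto_to_survey[auto_name].append(survey_name)
```

In this subset, the auto cluster "USB" maps to two survey attributes ("download time" and "comp cxn"), while the survey attribute "flash mode" draws on two auto clusters, which is the many-to-many pattern noted in the text.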

Expert Only<br />

Survey attribute Familiarity Importance<br />

battery source 6.42 6.05<br />

flash ext 4.91 3.09<br />

flash range 4.17 3.90<br />

image compress 4.80 4.62<br />

image sensor 2.37 3.25<br />

image stab 4.88 5.28<br />

man light sens 3.69 3.69<br />

manual exp 2.69 3.10<br />

manual light meter 2.38 2.86<br />

manual shut 3.09 3.24<br />

mem qty built-in 5.68 5.05<br />

movie fps 4.43 4.35<br />

movie output 3.90 4.20<br />

music play 4.57 3.80<br />

num sensors 2.25 3.18<br />

power adapt 6.05 4.87<br />

time lapse 4.45 3.76<br />

wide angle 3.57 3.57<br />


VOC Only<br />

Survey attribute Auto Familiarity Importance<br />

camera size size 6.35 5.87<br />

body (design) body 5.82 5.48<br />

download time USB 5.68 4.75<br />

feel (durability) feel; support service 5.30 5.43<br />

instructions instruction 6.07 4.18<br />

lcd brightness screen 4.62 4.57<br />

shutter lag slow; shutter 3.87 3.80<br />

twist lcd cover 3.80 3.62<br />

Expert + VOC<br />

Survey attribute Auto Familiarity Importance<br />

battery life battery 6.50 6.30<br />

cam soft edit 5.32 3.57<br />

cam type shoot 4.65 5.06<br />

comp cxn USB 6.13 5.75<br />

comp soft software 5.77 4.60<br />

ergonomic feel feel 5.15 4.83<br />

flash built-in red-eye 6.45 6.26<br />

flash mode low-light; red-eye 5.48 5.12<br />

lcd viewfinder lcd 5.00 5.15<br />

lens cap cover; lens 4.92 4.25<br />

lens type macro 3.80 4.20<br />

manual aper control 2.40 2.86<br />

manual focus control; focus 5.14 3.72<br />

mem capacity mb 5.30 5.68<br />

mem stor type disk; floppy; flash (drive) 5.63 5.47<br />

movie audio movie 4.85 4.48<br />

movie length movie 5.73 5.47<br />

movie res mpeg 4.97 5.15<br />

navigation menu 5.93 5.40<br />

optical viewfinder optical 4.50 4.33<br />

picture quality photo qual; picture; print; image qual 6.35 6.60<br />

price price 6.31 6.45<br />

resolution resolution; megapixel 5.35 5.56<br />


shot modes features 5.55 4.67<br />

shutter delay slow; shutter 4.27 3.95<br />

shutter speed control 4.81 4.60<br />

white bal features 3.40 3.74<br />

zoom dig zoom 5.07 4.47<br />

zoom opt zoom; optical 5.52 5.21<br />


Web Appendix D. Interpreting the Correspondence Analysis dimensions.<br />

Correspondence Analysis is an approach to dimension reduction when analyzing high-<br />

dimensional, two-mode, two-way count data (Everitt and Dunn 2001). To interpret the dimensions in the<br />

reduced space, we regressed each brand’s factor scores on the derived attributes. For each figure in the<br />

paper, we report the results of the stepwise regression on F1 and F2. The reported results assume a<br />

probability for entry of .05 and a probability of removal of .1.<br />
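As a rough illustration of stepwise selection with these entry and removal thresholds, the sketch below alternates a forward step and a backward step using t-test p-values from ordinary least squares. It assumes NumPy and SciPy are available; the function names, the `max_steps` guard, and the synthetic data are our own assumptions, and this is not the software used to produce the tables that follow.

```python
import numpy as np
from scipy import stats


def ols_pvalues(X, y):
    """Two-sided t-test p-values for the OLS slopes (intercept added here)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = n - A.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(A.T @ A)
    t = beta / np.sqrt(np.diag(cov))
    return 2 * stats.t.sf(np.abs(t), dof)[1:]  # drop the intercept


def stepwise(X, y, names, p_enter=0.05, p_remove=0.10, max_steps=100):
    """Return the names of the selected columns of X."""
    n, m = X.shape
    selected = []
    for _ in range(max_steps):  # guard against cycling
        changed = False
        # Forward step: add the most significant excluded variable.
        if n - (len(selected) + 2) >= 1:  # keep one residual degree of freedom
            trials = [(ols_pvalues(X[:, selected + [j]], y)[-1], j)
                      for j in range(m) if j not in selected]
            if trials:
                p_best, j_best = min(trials)
                if p_best < p_enter:
                    selected.append(j_best)
                    changed = True
        # Backward step: drop the least significant included variable.
        if selected:
            p = ols_pvalues(X[:, selected], y)
            worst = int(np.argmax(p))
            if p[worst] > p_remove:
                del selected[worst]
                changed = True
        if not changed:
            break
    return [names[j] for j in selected]


# Synthetic data (not from the paper): y depends on x0 and x2 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=40)
chosen = stepwise(X, y, ["x0", "x1", "x2", "x3", "x4"])
```

Setting the entry threshold below the removal threshold, as in the text, makes it unlikely that a variable is added and dropped in consecutive steps.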

To make the dimensions both interpretable and actionable, we follow the marketing literature in<br />

relating customer needs, as represented by user generated reviews, to actionable manufacturer<br />

specifications, as in the "House of Quality" (Hauser and Clausing 1988). Specifically, product attributes<br />

elicited from online reviews include different granularities ("zoom" versus "10x optical zoom") or reflect<br />

different levels of technical sophistication ("low-light settings" vs. "iso settings"). Consulting the set of<br />

professional buying guides, we manually mapped our 39 automatically generated attributes onto a coarser<br />

but actionable categorization of specifications. The visualizations are generated and interpreted using<br />

these meta-attributes.<br />
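For readers unfamiliar with how the factor scores F1 and F2 arise, the following is a minimal sketch of Correspondence Analysis computed from the singular value decomposition of the standardized residuals of a count table. It assumes NumPy; the function name and the small brand-by-attribute count table are made up for illustration and are not data from the paper.

```python
import numpy as np


def correspondence_analysis(counts, n_dims=2):
    """Return row coordinates, column coordinates, and principal inertias."""
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    rows = (U * sv) / np.sqrt(r)[:, None]    # principal row coordinates
    cols = (Vt.T * sv) / np.sqrt(c)[:, None] # principal column coordinates
    return rows[:, :n_dims], cols[:, :n_dims], sv ** 2


# Hypothetical counts: 4 brands (rows) by 3 meta-attributes (columns).
counts = [
    [20, 5, 8],
    [6, 18, 7],
    [9, 7, 21],
    [12, 11, 10],
]
brand_scores, attribute_scores, inertia = correspondence_analysis(counts)
```

The first two columns of `brand_scores` play the role of F1 and F2: each brand's position on these dimensions is what is regressed on the meta-attributes.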

lens cap/cover resolution manual focus lens rechargeable battery image resolution/print qual<br />

feel durability price movie/video zoom image stabil manual controls<br />

start-up time pc connect navigation weight software picture types<br />

shutter-lag flash built-in storage type size low-light/light control service/warranty<br />

shot-delay camera type white balance lcd autofocus<br />

Table WD4.1 Actionable meta-attributes used as regressors to explain the CA dimensions<br />

Figure 3. Mapping the market using customer reviews<br />

Summary of attribute selection for dimension F1.<br />

No. of variables Attribute MSE R² Adjusted R²<br />

1 navigation .000 .897 .882<br />

2 storage type .000 .992 .989<br />

3 movie/video .000 1.000 .999<br />

Summary of attribute selection for dimension F2.<br />


No. of variables Attribute MSE R² Adjusted R²<br />

1 low-light/light control .000 .913 .900<br />

2 lens .000 .988 .983<br />

3 price .000 1.000 .999<br />

Figure 5. Market structure by Cons with Pros as supplementary points<br />

Summary of attribute selection for dimension F1.<br />

No. of variables Attribute MSE R² Adjusted R²<br />

1 image resolution/print qual .000 .929 .918<br />

2 start-up time .000 .972 .963<br />

3 lens cap/cover .000 .989 .982<br />

4 shutter-lag .000 .996 .992<br />

5 service/warranty .000 1.000 .999<br />

Summary of attribute selection for dimension F2.<br />

No. of variables Attribute MSE R² Adjusted R²<br />

1 lens .001 .689 .644<br />

2 navigation .000 .933 .911<br />

3 size .000 .973 .956<br />
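The R² and Adjusted R² columns in the tables above differ because the adjusted value penalizes each additional regressor. A minimal sketch of the standard definition follows; we assume this is the definition behind the reported values, and the sample size used in the example is hypothetical because the number of observations is not stated in this appendix.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a model with k regressors plus an intercept.

    r2: ordinary R-squared; n: number of observations; k: number of regressors.
    """
    if n - k - 1 <= 0:
        raise ValueError("need more observations than estimated parameters")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Hypothetical example: R-squared of 0.90 with 12 observations and 2 regressors.
example = adjusted_r2(0.90, n=12, k=2)
```

Because the penalty factor (n - 1) / (n - k - 1) grows with k, a variable raises the adjusted value only if it improves the fit by more than the lost degree of freedom costs.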


