22.01.2013 Views

Automated Marketing Research Using Online Customer Reviews


Thomas Y. Lee and Eric T. Bradlow

Web Appendix A. Algorithmic Details

In this Appendix, we elaborate on specific details of the text-processing algorithms.

[1] Data collection and pre-processing

After identifying the source of reviews, we wrote a program in the Python programming language to collect them. For each review, we get the product identifier used by Epinions.com to uniquely identify a product, the list of Pros, and the list of Cons. Product brand names are excerpted from the Epinions.com product identifier and inserted into a MySQL database. It is also important to note that one could separate the selected reviews by pre-defined segments (e.g., demographic clusters) and hence produce segment-level Pro-Con lists that would be analyzed separately, or attempt to infer latent segments and market structure simultaneously, although the latter is beyond the scope of this research.

To construct the matrix of word vectors, we focus on each phrase. For now, we do not distinguish between whether a phrase appears as a Pro or a Con, focusing only on grouping together those phrases that discuss a common product attribute. As a standard preprocessing step in text mining, we normalize words in two steps: we delete all stop-words and then stem the remaining text (Salton and McGill 1983). Stop-words, such as grammatical articles, conjunctions, and prepositions, are meaningless for purposes of product attribute identification, so they are removed. For example, after pruning, the phrase "Only 8 mb Smart media card included" becomes "8 mb Smart media card included." We then reduce words to their root form by stemming. We use the Porter stemmer to find equivalences between singular, plural, past and present tense forms of individual words used by customers. Thus, "includes" and "included" are both reduced to the root "includ."
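The two normalization steps can be illustrated with a minimal sketch. The stop-word list `STOP_WORDS` and the suffix rules in `toy_stem` are hypothetical stand-ins chosen for this example: the appendix does not enumerate its stop-words, and `toy_stem` is a crude suffix stripper, not the Porter stemmer itself.

```python
# Hypothetical stop-word list; the one actually used is not given in the text.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "in", "on", "with", "only"}

# Toy suffix rules standing in for the Porter stemmer (NOT the real algorithm).
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping a root of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize_phrase(phrase):
    """Lower-case, drop stop-words, then stem the remaining tokens."""
    tokens = phrase.lower().split()
    return [toy_stem(tok) for tok in tokens if tok not in STOP_WORDS]


print(normalize_phrase("Only 8 mb Smart media card included"))
```

In practice one would substitute a full Porter implementation for `toy_stem`; the sketch only shows where stop-word removal and stemming sit in the pipeline.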

[2] Vector space model and word importance

Borrowing from the information retrieval community, our phrase × word matrix is a representation of the vector-space model (VSM). More formally, $j \in J$ is a word in the set of all words and $i \in I$ is a phrase. A phrase is simply a finite sequence of words, so $I$ is a set of finite word sequences over $J$. We define an initial phrase × word matrix as a simple variation on the term-frequency inverse-document-frequency (TF-IDF) VSM (Salton and McGill 1983):

$$\mathrm{Matrix}(i,j) = TF_{ij} \times IPF_j \qquad \text{(A1.1)}$$

where the term frequency $TF_{ij}$ counts the total number of occurrences of word $j$ in the instances of phrase $i$. The inverse phrase frequency $IPF_j = \log(|I|/n_j)$ is a weighting factor that favors words that are more helpful in distinguishing between different product attributes because they appear in only a fraction of the total number of unique phrases. Here $|I|$ represents the total number of unique phrases in the review collection, and $n_j$ counts the total number of unique phrases containing word $j$.
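As an illustration of Equation (A1.1), the following sketch computes TF-IPF weights from a list of phrase instances. It assumes that a unique phrase is identified by its exact token sequence and that the logarithm is natural; neither choice is stated in the text, and the function name `tf_ipf` is ours.

```python
import math
from collections import Counter


def tf_ipf(phrase_instances):
    """phrase_instances: list of token lists, one per phrase occurrence.

    Returns {unique_phrase: {word: TF_ij * IPF_j}}.
    """
    # TF_ij: occurrences of word j across all instances of unique phrase i.
    tf = {}
    for tokens in phrase_instances:
        tf.setdefault(tuple(tokens), Counter()).update(tokens)

    # n_j: number of unique phrases containing word j.
    n = Counter(word for phrase in tf for word in set(phrase))
    total = len(tf)  # |I|, the number of unique phrases

    return {
        phrase: {w: count * math.log(total / n[w]) for w, count in counts.items()}
        for phrase, counts in tf.items()
    }


instances = [
    ["8", "mb", "card"],
    ["8", "mb", "card"],
    ["16", "mb", "card"],
    ["optic", "zoom"],
]
weights = tf_ipf(instances)
print(weights[("8", "mb", "card")])
```

In this toy input, "8" appears in only one of the three unique phrases, so it receives a larger weight than "mb" or "card", which appear in two.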

A limitation of the TF-IPF weighting is that there are still some terms (e.g., sentiment words like "great" or "good") that are neither stop-words nor product attributes yet appear alongside product attributes in the TF-IPF matrix. As an additional discount factor beyond IPF, we automatically gather words from a second set of K phrases using online reviews for an unrelated product domain. Intuitively, words appearing in the reviews for unrelated products are less likely to represent relevant product attributes for the focal one. For example, words describing digital camera attributes are less likely to also appear in vacuum cleaner reviews.

Formally, for a set of phrases $I'$ drawn from the set of finite word sequences over $j \in J$, we calculate $\mathrm{rank}(j) = \mathrm{rank}(TF'_{ij} \times IPF'_j)$, where higher weighted frequencies correspond to higher rank. Note that multiple words may share the same rank; if we define words that do not appear in any phrase in $I'$ as having $IPF'_j = 0$, then we may say:

$$\mathrm{Matrix}(i,j) = \left(\frac{TF_{ij}}{\mathrm{rank}(j)}\right) \times \left(IPF_j - IPF'_j\right) \qquad \text{(A1.2)}$$

Thus, we scale $TF_{ij}$ by the rank of the word in the unrelated product domain and scale $IPF_j$ by $IPF'_j$.


[3] Phrase clustering

We cluster the vectors (the matrix rows) so that all phrases describing the same product attribute are grouped together. More formally, given the phrase × word $\mathrm{Matrix}(i,j)$ over the set of phrases $I$ and the set of words $J$, we seek to separate phrases into a set $C$ of $k$ mutually exclusive and exhaustive clusters. We use the cosine measure of angular distance between vectors to calculate similarity. The cosine measure is then applied to the phrase × word matrix using the K-means clustering algorithm. As noted earlier, while any number of clustering algorithms is acceptable, we selected K-means for its simplicity and its familiarity to both the text-mining and marketing communities.

The quality, $Q_C$, of a K-means clustering, $C$, is calculated by summing the cosine similarity between each vector in a cluster and that cluster's centroid. Following Zhao and Karypis (2002), this metric is more simply defined as the sum of the lengths of the composite vectors:

$$Q_C = \sum_{c_i \in C} \sum_{v \in c_i} \cos\left(v, \mathrm{centroid}(c_i)\right) = \sum_{c_i \in C} \left\lVert \mathrm{composite}(c_i) \right\rVert, \quad \text{where } \mathrm{composite}(c_i) = \sum_{v \in c_i} v \qquad \text{(1)}$$

Because K-means is known to be extremely sensitive to its initial conditions, we repeat the algorithm ten times, each time beginning with a new, random set of $k$ centers, and pick the solution that maximizes $Q_C$.
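The restart procedure can be sketched as follows. This is a minimal pure-Python illustration, not the implementation used in the study: it assumes the row vectors are scaled to unit length (so that dot products equal cosines and the composite-vector form of $Q_C$ applies), and the function names, `max_iter`, and `seed` are our own.

```python
import math
import random


def unit(v):
    """Scale v to unit length so that dot products equal cosines."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def composite(vectors, dim):
    """Component-wise sum of the vectors in one cluster."""
    total = [0.0] * dim
    for v in vectors:
        for d, x in enumerate(v):
            total[d] += x
    return total


def quality(clusters, dim):
    """Q_C: sum over clusters of the length of the composite vector."""
    return sum(math.sqrt(sum(x * x for x in composite(c, dim))) for c in clusters)


def cosine_kmeans(rows, k, rng, max_iter=50):
    """One K-means run with cosine similarity from random initial centers."""
    rows = [unit(r) for r in rows]
    dim = len(rows[0])
    centers = rng.sample(rows, k)
    for _ in range(max_iter):
        labels = []
        for r in rows:
            sims = [sum(a * b for a, b in zip(r, c)) for c in centers]
            labels.append(sims.index(max(sims)))
        clusters = [[r for r, lab in zip(rows, labels) if lab == i] for i in range(k)]
        # Keep the old center if a cluster happens to be empty.
        updated = [unit(composite(c, dim)) if c else centers[i]
                   for i, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return labels, quality(clusters, dim)


def best_clustering(rows, k, restarts=10, seed=0):
    """Repeat K-means from new random centers and keep the run maximizing Q_C."""
    rng = random.Random(seed)
    runs = [cosine_kmeans(rows, k, rng) for _ in range(restarts)]
    return max(runs, key=lambda run: run[1])


rows = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
labels, q = best_clustering(rows, k=2)
print(labels, round(q, 3))
```

Here `rows` would be the rows of the phrase × word matrix; the four toy vectors separate into two clusters along the two axes.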

[4] Attributes and their dimensions

A critical step in our approach is to discover not only what product attributes customers are discussing but also the granularity with which those attributes are discussed. Specifically, we seek to elicit the attribute dimensions that customers use in their reviews. To discover attributes, we assume that each phrase corresponds to a distinct product attribute. To discover attribute dimensions, we assume that each word in the phrase corresponds to a distinct dimension. Discovering attribute dimensions then reduces to the assignment of particular words to attribute dimensions.


Conceptually, we model this process as a constrained optimization problem. Abusing our previous notation slightly, assume a set of phrases $I$ composed from the set of words $J$ and a set of attribute dimensions $D$. We have $|J| \times |D|$ binary decision variables $X_{jd}$, where $X_{jd}$ is 1 if word $j$ is assigned to dimension $d$. There are $|I| \times |J|$ $\{0, 1\}$ values $Y_{ij}$ forming a constraint matrix, where $Y_{ij}$ is 1 or 0 depending upon whether word $j$ appears in phrase $i$. Thus, our objective is to:

$$\begin{aligned} \max \quad & \sum X_{jd} \\ \text{s.t.} \quad & \sum_{j \in J} Y_{ij} \, X_{jd} \le 1 \quad \forall i \\ & X_{jd} \ \text{binary} \end{aligned}$$

The graph partitioning algorithm used to set the parameters $I$, $J$, and $D$ and the constrained logic program (CLP) by which we solve the optimization are implemented in Python and detailed next.
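To make the constraint concrete, the sketch below solves a tiny instance by exhaustive search rather than by the CLP described in the following sections. It assumes, in addition to the stated constraint, that each word takes at most one dimension (encoded as a single integer or `None`); that restriction and the function names are our own illustration.

```python
from itertools import product


def feasible(phrases, assign):
    """No two distinct words of one phrase may share a dimension."""
    for phrase in phrases:
        dims = [assign[w] for w in set(phrase) if assign[w] is not None]
        if len(dims) != len(set(dims)):
            return False
    return True


def best_assignment(phrases, num_dims):
    """Exhaustively search for the assignment that places the most words."""
    words = sorted({w for p in phrases for w in p})
    options = [None] + list(range(num_dims))
    best, best_count = None, -1
    for choice in product(options, repeat=len(words)):
        count = sum(d is not None for d in choice)
        if count <= best_count:
            continue
        assign = dict(zip(words, choice))
        if feasible(phrases, assign):
            best, best_count = assign, count
    return best, best_count


phrases = [["digital", "zoom"], ["optical", "zoom"]]
assignment, placed = best_assignment(phrases, num_dims=2)
print(assignment, placed)
```

With two dimensions, all three words can be placed: "zoom" occupies one dimension while "digital" and "optical" share the other, since they never co-occur in a phrase. Exhaustive search is exponential in the number of words and is shown only to clarify the objective.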

[5] Graph representations

To discover attributes, we assume that each customer review phrase corresponds to a distinct product attribute. To discover attribute dimensions, we assume that each word in the phrase corresponds to a distinct dimension. Discovering attribute dimensions then reduces to the assignment of particular words to attribute dimensions. But how do we know how many dimensions there are in the assignment problem? Is it possible that the assignment optimization has no feasible solution because of conflicting constraints due to noise from the vagaries of human language? To solve this problem, we generate a graph of all words in the cluster. Each word is a node, and arcs are defined by the co-occurrence of two words in the same phrase. We partition the graph into (possibly overlapping) sub-graphs by searching for maximal cliques. Intuitively, each sub-graph represents a maximal subset of words and phrases for which an optimal solution exists. The size of the maximal clique sets the number of attribute dimensions $|D|$. The sub-graph (words $J$ and phrases $I$) defines the optimization.
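The word graph and the clique search can be sketched as follows. The enumeration uses the basic Bron-Kerbosch recursion, which is our choice for illustration; the text does not say which clique-finding procedure was used, and the function names are hypothetical.

```python
from itertools import combinations


def cooccurrence_graph(phrases):
    """Nodes are words; an edge joins two words that share a phrase."""
    adj = {}
    for phrase in phrases:
        for w in phrase:
            adj.setdefault(w, set())
        for a, b in combinations(set(phrase), 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques (basic Bron-Kerbosch, no pivoting)."""
    found = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            found.append(frozenset(clique))
            return
        for v in list(candidates):
            expand(clique | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return found


phrases = [["8", "mb", "card"], ["16", "mb", "card"], ["optic", "zoom"]]
cliques = maximal_cliques(cooccurrence_graph(phrases))
num_dimensions = max(len(c) for c in cliques)  # |D| from the largest clique
print(sorted(sorted(c) for c in cliques), num_dimensions)
```

In this toy cluster the largest cliques contain three words, so the corresponding table would have three columns.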


More formally, we assume that phrases are preprocessed and normalized into words as before. A graph $G = (V, E)$ is a pair consisting of the set of vertices $V$ and the set of edges $E$. An edge in $E$ is a connection between two vertices and may be represented as a pair $(v_i, v_j)$ with $v_i, v_j \in V$. Each phrase (word) represents a vertex $v$ in the graph; edges are defined by phrase pairs within a review (word pairs within a phrase). An N-partite graph is a connected graph whose vertices are divided into N sets $V_i$ such that there are no edges within any set $V_i$. A clique of size N plays the role of a relational schema and can be extended to an N-partite graph by substituting each vertex $v_i$ of the clique with a set of vertices $V_i$. A database table with disjoint columns thus represents an N-partite graph where the size of the clique defines the number of columns and each word in the clique "names" a column. A maximal-complete-N-partite graph is a complete-N-partite graph not contained in any other such graph; in other words, the initial clique is maximal. The corresponding database table of phrases represents the existing product attribute space, and the maximal-complete-N-partite graph includes possibly novel combinations of previously unpaired attributes and/or attribute properties.

To relate the graph back to customer reviews, we say that a product attribute is constructed from $k$ dimensions. Each dimension names a domain ($D$). Each domain $D$ is defined by a finite set of words that includes the value NULL for review phrases where customers fail to mention one or more attribute dimension(s). The Cartesian product of domains $D_1 \ldots D_k$ is the set of all k-tuples $\{t_1 \ldots t_k \mid t_i \in D_i\}$. Each phrase is simply one such k-tuple, and the set of all phrases in the cluster defines a finite subset of the Cartesian product. A relational schema is simply a mapping of attribute properties $A_1 \ldots A_k$ to domains $D_1 \ldots D_k$. Note the strong, implicit assumption that a maximal clique, taken over a word graph, is a proxy for the proper number of attribute dimensions. Under this assumption, it is easy to see how searching for cliques within the graph results in a table.


[6] Constrained Logic Programming

To align words into their corresponding attribute dimensions, we frame the task as a mathematical assignment problem and resolve the problem using a bounds consistency approach. We define the assignment using the maximal clique that corresponds to the schema for each product attribute table (see Figure WA1.1). In the bounds consistency approach, we invert the constraints (tok_exclusion) to express the complementary set of candidate assignments (tok_candidates) for each attribute dimension. If the phrase constraints, taken together, are internally consistent, then the candidate assignments (tok_assign) for a given token are simply the intersection of all candidate assignments as defined by all phrases in the cluster containing that token.

We transform the mutual exclusivity constraint represented by each phrase into a set of candidate assignments using the algorithm in Figure WA1.2. Note that we need only propagate the mutual exclusivity of words that are previously unassigned. Accordingly, for each unassigned token in a given phrase, the set of candidate assignments is the intersection of the possible assignments based upon the current phrase and all candidate assignments from earlier phrases containing the same token. We maintain a list of active tokens, boundary_list, to avoid rescanning the set of all tokens every time the possible assignments for a given token are updated.
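The idea of narrowing candidate assignments by mutual exclusion can be sketched in a few lines. This is a simplified illustration under our own assumptions, not a transcription of the authors' program: the words of the maximal clique (`schema`) seed the dimensions, every other token starts with all dimensions as candidates, and a worklist (playing the role of boundary_list) removes a settled token's dimension from every token that co-occurs with it. An empty candidate list in the result signals conflicting constraints.

```python
from collections import defaultdict


def propagate(phrases, schema):
    """schema: words of the maximal clique; word number d names dimension d."""
    dims = set(range(len(schema)))
    candidates = {tok: set(dims) for phrase in phrases for tok in phrase}
    for d, word in enumerate(schema):
        candidates[word] = {d}

    # Tokens that share a phrase are mutually exclusive.
    exclusion = defaultdict(set)
    for phrase in phrases:
        for tok in phrase:
            exclusion[tok] |= set(phrase) - {tok}

    # Worklist of tokens whose dimension is settled.
    boundary = [tok for tok, cand in candidates.items() if len(cand) == 1]
    while boundary:
        tok = boundary.pop()
        if len(candidates[tok]) != 1:
            continue  # became inconsistent after it was queued
        (d,) = candidates[tok]
        for other in exclusion[tok]:
            if d in candidates[other]:
                candidates[other].discard(d)
                if len(candidates[other]) == 1:
                    boundary.append(other)
    return {tok: sorted(cand) for tok, cand in candidates.items()}


phrases = [["8", "mb", "card"], ["16", "mb", "card"], ["16", "mb", "smartmedia"]]
print(propagate(phrases, schema=["8", "mb", "card"]))
```

In this toy cluster, "16" is excluded from the dimensions of "mb" and "card" and so falls into the same column as "8", while "smartmedia" ends up in the column named by "card".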

process_phrases(p_list)
    schema = find_maximal_clique(p_list)
    order phrases by length
    for each phrase p:
        # initialize data structures
        tok_exclusion: for each tok, mutually exclusive tokens
        tok_candidates: for each tok, valid candidate assignments
        tok_assign: for each tok, the dimension assignment
        # propagate the constraints for each successive phrase
        tok_candidates, tok_exclusion, tok_assign =
            propagate_bounds(p, tok_candidates,
                             tok_exclusion, tok_assign, schema)

Figure WA1.1 Logical Assignment

propagate_bounds(phrase, tok_candidates, tok_exclusion, tok_assign, schema)
    # marshal prior assignments
    unassigned_tok = {t | t ∈ phrase ∧ t ∉ assign_d}
    unassigned_attr = {a | a ∈ schema ∧ ∀t (t ∈ phrase → a ∉ tok_assign[t])}
    for each t in unassigned_tok:
        tok_exclusion[t] = (t × (unassigned_tok - t)) ⋃ tok_exclusion[t]
        possible_assign = {a | a ∈ (unassigned_attr ⋂ tok_candidates[t])}
        boundary_list = {(t, [possible_assign])} ⋃ boundary_list
    recurse_boundary(boundary_list, tok_exclusion, tok_assign)

Figure WA1.2 Propagate boundary constraints

Finally, the K-means clustering used to separate review phrases into distinct product attributes is a noisy process. The clustering can easily result in the inclusion of spurious phrases. Both the initial clustering of phrases into product attributes and the subsequent assignment of words to attribute properties are inherently imperfect. Inconsistencies may emerge for any number of reasons, including poor parsing, the legitimate appearance of one word multiple times within a single phrase (e.g., the phrase 'digital zoom and optical zoom' duplicates the word 'zoom'), or even "inaccuracies" by the human reviewers who write the text that is being automatically processed. This could result in a single attribute property divided over multiple table columns. For example, some reviewers might write "SmartMedia" as a single word and others might use "Smart" and "media" as two separate words. Alternatively, multiple product attributes may appear in the same cluster. '[C]ompact flash' and 'compact camera' are clustered together based upon their common use of the word 'compact,' yet refer to distinct attributes.

To address the problem of robustness in the face of noisy clusters that include references to additional product attributes or have different properties for the same attributes, we extend our CLP approach to simultaneously cluster phrases and assign words. By modeling reviews as a graph of phrases, we can apply the same CLP in a pre-assignment step to filter a single (noisy) cluster of phrases. As alluded to in Appendix B.2, we generate a graph where phrases are nodes, and edges represent the co-occurrence of two phrases within the same review. The extended CLP then prunes phrases by recursively applying co-occurrence constraints; two phrases in the same review cannot describe the same attribute just as two words in the same phrase cannot describe the same attribute dimension. The same assignment representation removes phrases that are not central to the product attribute at the heart of a particular phrase cluster. Phrases that are not "connected" in the graphical sense of a connected component or that represent conflicting constraints are simply excluded from the subcluster.

Unfortunately, even the extended CLP approach is imperfect. Some of the tables will represent distinct product attributes. Others will simply constitute random noise. Individual tables are supposed to represent distinct product attributes, so we assume that meaningful tables should contain minimal word overlap. With this in mind, we apply a two-stage statistical filter to further filter noisy clusters.

First, because each table itself separates tokens into attribute properties (columns), meaningful tables will not hold too small a percentage of the overall number of tokens. Second, we assume that meaningful tables comprise a (predominantly) disjoint token subset. If the tokens in a table appear in no other table, then the intra-table token frequency should match the frequency of the initial K-means cluster; likewise, the table's tokens, when ordered by frequency, should match the relative frequency-based order of the same tokens within the initial cluster. The first stage of our statistical filter is evaluation of a $\chi^2$ statistic, comparing each table to its corresponding initial cluster. Although there is no hypothesis to be tested per se, there is a history of applying the $\chi^2$ statistic in linguistics research to compare different sets of text with a measure that weights higher-frequency tokens with greater significance than lower-frequency tokens (Kilgarriff 2001). In our case, we set a minimum threshold on the $\chi^2$ statistic to ensure that individual tables reflect an appropriate percentage of tokens from the initial cluster.

After filtering out tables that do not satisfy the $\chi^2$ threshold, we use the same cluster token counts to calculate rank order statistics. We compare the token rank order from each constituent table to that in the corresponding initial cluster using a modified Spearman rank correlation coefficient ($r_s$). As a minor extension, we use the relative token rank, meaning that we maintain order but keep only tokens that are in both the initial and the iterated CLP cluster(s). We select as significant those tables that maximize $r_s$. In the event that two or more tables maximize $r_s$, we promote all such subclusters either as a noisy cluster or as synonymous words for the same product attribute, as determined by a manual reading.
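The two statistics can be sketched as follows. The exact formulas are not given in the text, so both functions are assumptions: `chi_square` uses the usual two-sample form in which expected counts come from the pooled token frequencies, and `spearman_relative` applies the textbook Spearman formula to ranks recomputed over the shared tokens only, breaking frequency ties alphabetically.

```python
def chi_square(table_counts, cluster_counts):
    """Two-sample chi-square over token counts (assumed form)."""
    n_t = sum(table_counts.values())
    n_c = sum(cluster_counts.values())
    stat = 0.0
    for tok in set(table_counts) | set(cluster_counts):
        o_t = table_counts.get(tok, 0)
        o_c = cluster_counts.get(tok, 0)
        # Expected counts if both samples shared one token distribution.
        e_t = n_t * (o_t + o_c) / (n_t + n_c)
        e_c = n_c * (o_t + o_c) / (n_t + n_c)
        stat += (o_t - e_t) ** 2 / e_t + (o_c - e_c) ** 2 / e_c
    return stat


def relative_ranks(counts, shared):
    """Rank shared tokens by descending count; ties broken alphabetically."""
    ordered = sorted(shared, key=lambda tok: (-counts[tok], tok))
    return {tok: rank for rank, tok in enumerate(ordered)}


def spearman_relative(table_counts, cluster_counts):
    """Spearman r_s on the relative ranks of tokens present in both sets."""
    shared = set(table_counts) & set(cluster_counts)
    n = len(shared)
    if n < 2:
        return 0.0  # assumed convention when too few tokens overlap
    r_table = relative_ranks(table_counts, shared)
    r_cluster = relative_ranks(cluster_counts, shared)
    d2 = sum((r_table[t] - r_cluster[t]) ** 2 for t in shared)
    return 1 - 6 * d2 / (n * (n * n - 1))


cluster = {"zoom": 50, "optical": 20, "digital": 10, "lens": 5}
table = {"zoom": 5, "optical": 3, "digital": 1}
print(chi_square(table, cluster), spearman_relative(table, cluster))
```

A table whose shared tokens follow the same frequency order as the initial cluster yields $r_s = 1$, and a fully reversed order yields $r_s = -1$.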


Web Appendix B. Automatically generated attributes and their corresponding properties and levels

1. Picture (what/where)
tough picture
wildlife, watersports, vivid, underexposure, unclear, unattractive, took, surprisingly, sun, streaky, crisp, sharpest, color
regret, recommend, quality, printable, proof, outside, noisy, minimum, maximum, lost, kid, immediately, hit,
guess, fully, frame, far, excellent, even, definitely, daytime, counter, construct, base, adequate, absolutely
vibrant underexpose
plenty, gallery
consist
stong, inaccuracy, blotchy

2. Flash (memory, card, photo)
1 flash type card
slot 2 immediate
meg bigger
use cheap
compact fragile
hard

3. Slow (start-up, turn on, recovery)
shutter reaction, slow, fast, mode, set, control, active, auto
exposure long, no, high quality, wide, action
us speed automatic bit top
recovery lag, second feature little
wonder sluggish turbo button, operate time, adjust, manual, take, range
1 need release delay
virtual shot hard 2, response press

4. Resolution
capacity megapixel low effect
resolution cost really high
lcd 1.5

5. Megapixel
mpix 4.0 access life
megapixel 5 set
pixel re mega
pixl improve meg
stick 6.3 map
4.10, 4.1, 3.4, 2.6, 2.3, 2.0, 11, 1.1, .8, 0 come

6. Lens (cap, quality, manufacturer)
easily cap pro lost lens quality
leash avail variable no leica
telephoto option average damage amazing
loose nikon
attach famous, distance
canon


7. Optical (zoom, viewfinder)
low optical viewfinder chintzy zoom digital
blurry wide telephoto
16 variance lens
x3 option
x10 correction
smallest, power, bigger, 7, large, lack,
feature, decent, cheat, available, 2.8, 2.5x

8. Body (design, construction)
body feel fragile plastic
look flimsy ugly
built small nice
compact solid metal
camera fairly indo

9. Print (size, quality, output)
print photo hard quality average
produce beautiful match image
film 4x6 high
perfect copies
adjust, accuracy, misleading,
inaccurate

10. Zoom
capable clear, digit, lack, fuzzy, limit, efficient zoom 2.5
definitely incredible lens useless
us, nice, fast loose tricky stink
non feature no seamless
little, function, somewhat benefit 2.5x 1.6x, distance, crap, 32
true, really 3 lack, fuzziness, limit, efficiently 12

11. Feel (mfr, construction)
solid, camera look, old little design
film feel point nice
thing, simpler want, supp, resembles, bulkier, hard right build
35 way, finish kind, somewhat us
cheap, plentiful built
starter, get extra, faulty awesome hold

12. Menu
control, menu versus us control option slow
tough relative screen read manual, ,
cumbersome, bury,
plain, inexpensive
quirky, min sensitive
easy manual up set, multiple, lot, difficult simple, complicated, scatter
full, extensive,
lack, no
hidden, awkward
custom, exposure preview option hidden function, scheme navigate petroglyph
complicated
decent familiar feature nice, layout, larger,
imbed, hopeless,
clear, camera, maze,
ergonomic, annoy
compress minimum interface, graphic design

13. Shoot
shoot adjust mode fast camera
stitch slew, rapid, preset, multiple, lot, lack, numerous sound digital repetitive
movie microphone point
function clip, need look
option infinitely just


14. Support (service)
support product terrible custom
service tech inadequate
need
indifferent
organize
quick

15. LCD
see light screen hard lcd bright unimpressive, twist, stripe, dull, sharper, lackluster, innovative,
deceptively, placement, location, flexible, clarity, accurate, 1.5
sunlight soft difficult rotate
night panel dim unprotected, articulated, huge, clear
display, automatic outside little
pretty
review exposure
daylight
relatively

16. Movie (audio, visual)
control sound set unlimited length clip video
second 30 take price length
movie quicktime no avi
lot audio vga mpeg mode
quality capture useless 60
capability feature

17. USB
platform, computer upload easy, weird given speed
dock transfer file use connect
usb, convenient charge brainless, flawless picture
sound video, come, station mean
archive large option pc
cable travel slow camera
serial download quick tv
av link problem

18. Focus
focus auto fast
soft
nice

19. Software
interface benefit image no, slideshow,
include, bundle,
download, view,
easyshare, zoom, browse
useful, capable,
improve,
tv, external, provided,
bundled, use, tedious,
compatible, slightly, twain
easy
method, easily,
custom
software
computer clunky computer, user friendly, average possible
proprietary complicated special window retain
fun manage clumsy tricky, function contrast
mac, somewhat, 5 set weak, wonder, capture, suck, include, basic, hookup
option, little, package, kodak decent, lack,
creative
low, install, say, reload program
difficult xp
output, interface, box, way,
pretty, extremely,
particularly, generous


20. Cover (lens, LCD, battery)
lcd cover screen no close step, interfere, care
camera protect display slide tricky
strap, built, auto, automatic lenses side fully zoom
integrated lens open
rubber afterthought lock
retract batteries flimsy mechanical

21. Price
price place little higher, nice, reason
long easy, ease point, fair use
speed autofocus range time
fast perfect super high
cheap startup
cost sluggish

22. Memory/Screen
need memory large lcd
fairly screen travel view, review,
use window additional protect
proprietary panel sony dirty
money upgrade, removable
built

23. Floppy (storage media)
floppy access disk versatile easy
use no
storage
operate
cheap

24. Disk
disc computer use, regular
easier disk load floppy
media easi transfer 1.44mb
diskette medium, space removable
storage limited, choice
cheap

25. Battery (life, use, type)
sure pretty, little, aa,
eats, run, 4,
power, ion,
supply, built,
take, back
tomorrow
recharge, extra,
infolithium
up use preview batteries
lot, quick, nimh, 2,
pack, charger, right,
wonder
included,
inexpensive, keep,
drain, suck
fast, quickyi, alkaline,
really, regular, lithium, ,
hour, no, need
battery
life, time display thing cruddy just give
ease remove, camera price weigh because socket
bring expensive long last easy liion
terrible real low charge lose
system


26. Size
small, broke, convenient, durable,
intuitive, mechanical , near, big,
problem, pleasant, case, bigger,
smaller, larger, large, versatile,
cumbersome, money, status, heavy,
guide, nice, no, clearer, easily, dim
design, tad, weight, slightly, kinda, lot,
frustrating expensive, icon, capable,
incredible, extremely, stick, chassis,
function, mirror, possible, perfect, little,
right, require, hard, cheap, somewhat, bit,
2.2, digit, pretty, compare, damage,
pocket, flip
non, awkward a101 stuffed
couple time finepix
true, nearly, jean wish drop
expect unusual, separate software
crack unit, shape, moveable preset
sleek, lightweight, cam, , size, display,
quirk, button, turn, color, complicated,
ton, fairly, pro, ergonomic, solid, bulky,
ergo, rugged, viewfinder, basic, highly,
carry, pressure

27. Photo quality
photo loose
unfocused wash
min trait
touch, suburb, white, tint, suction, stun, stitch, soft, share, landscape, rupture, retouch, result, realist, quality, pretty,
plan, noisy, move, manage, lost, length, generous, fuzzy, fully, file, fabulous, eras, hard, amicable, alter, actual,
accurate, 250, floppy, downtime
unexpected
0

28. Low light
auto lowlight little focus low difficult, option bit reliable
dim light touchy set certain mix situation capable
object hunt, finicky conditions auto, lot question, night grainy
use lamp blind assist judge harder accurate
aid hard, long,
inability,
difficult
annoy,
awkward,
inadequate,
laser
fast, take, average,
assist, slightly, range,
system, level
adjustable no, problem room condition, trouble,
iffy, motion, way,
lag, need, set,
relatively, occasion
imperfect
especially, limit,
sensitive, focus, dark,
dismal, fare, cost,
noisy, weak, finicky

29. Control
control obvious, parallax
dist, underexposure, solid, ergonomic, unintuitive, tiny, simpler, sensitive, place, pad, odd, finicky, familiar, dummy,
overexposure
color, clear, basic, analog, clumsy, individual
stigma casio
frame, refund, size slight
slightly full
guess, time, big

30. Macro (lens)
macro design, zero, unimpressed, built, average, lack, awesome mode
no, amazing, fussy, real, function
super, great. fantastic, poor stupendous
nice possible, ability
incredible feature
closeup, unsurpassed


31. MB (memory)
mb media, picture included 8 smart card
flash small lowly quality memory
no usb 4, 32 16 flashcard
compactflash 2, 7, 11, provided stick, take onboard, internal, run, wimpy, pricey
come, skimpy little 128 way smartcard
need avail size measly recommended

32. Edit (in camera)
edit software capable package
effect onboard image limited
no
average

33. Red eye
problem flash eye red hue
built massive indoors low background
pic occasion poor yellow look,
anti lot catch light
show tint
unbearable, frequent, easier, deflect, appear, eliminate, control

34. Shutter (delay, lag)
shutter lag turbo button, operate delay
us speed automat bit top
recovery sluggish second feature little
wonderful reaction, slow, fast, long, no, mode, set, control, active, quality, wide, time, adjust, manual, take,
exposure high,
action
rang
virtual 1 need release auto
shot hard 2, responsive press

35. Features
feature manual, newer, g2, extend, , switch, change, readily lot, additional, 6, need, avail, expert, document, lack
place practice, neat, cool, extra range, array
point wide, incredible, hard, access easy, ton, hard, difficult
small, large, huge, sell, limit, impress, basic set, want, readable
pro semi, level
long, competitive list

36. Instructions
instruction clear
unique from
lot
lose
lack
answer

37. Adapter (AC)
separate ac purchase no available adapter
need Level 1: included optional power external
50 comes case
charger
usb


38. Picture quality
extremely, switch poor picture quality
surprisingly, printable, mediocre, amazingly image cap
print design
zoom distance
durable
lens

39. Image (quality)
inconsistent image quality
webcam, super, profession, margin, hard, over, ok, nikon, mediocre, lcd, indifferent, wonder, unacceptable,
satisfactory, terrible, problem, overprocessed, nice, mean, fair, expect, class, awful, average, astounding,
astonishing, accept, medium, addicted, horrible, generally, control, case, before
sharp
incredible need
color
produce
detail
color


Web Appendix C. User Survey<br />

In this Appendix, we list the digital camera product attributes (with duplicates eliminated) that are<br />

found exclusively in one or more online buying guides (Expert Only), learned automatically from the<br />

product reviews (VOC Only), or in both (Expert + VOC). The means for Familiarity and Importance as<br />

collected from our survey are reported on a 1 to 7 scale. To help align the attribute names used here with<br />

those in Table 2, we include a mapping from the automatically derived attribute clusters (auto) to those 55<br />

attributes used in the consumer survey. Note that in some cases, an automatically derived attribute is<br />

mapped to more than one survey (expert) attribute name, and vice versa, because the expert guides and the<br />

Voice of the Consumer discuss attributes at different levels of granularity.<br />
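
The three groupings that follow amount to set operations on attribute names, plus a many-to-many mapping<br />

between survey attributes and automatically derived clusters. A minimal sketch in Python, using a handful of<br />

rows copied from the tables in this Appendix; the helper `partition_attributes` and the variable names are<br />

illustrative assumptions, not part of the original analysis:<br />

```python
from collections import defaultdict


def partition_attributes(expert, voc_map):
    """Split survey attributes by source.

    expert:  set of survey attribute names found in the buying guides
    voc_map: dict of survey attribute name -> set of auto-derived clusters
    """
    voc = set(voc_map)
    return {
        "Expert Only": expert - voc,
        "VOC Only": voc - expert,
        "Expert + VOC": expert & voc,
    }


# A small subset of the rows listed in this Appendix (not the full 55 attributes).
expert_attrs = {
    "battery source", "flash range",                            # Expert Only rows
    "battery life", "price", "manual focus", "shutter delay",   # Expert + VOC rows
}
voc_to_auto = {
    "camera size": {"size"},
    "shutter lag": {"slow", "shutter"},
    "battery life": {"battery"},
    "price": {"price"},
    "manual focus": {"control", "focus"},
    "shutter delay": {"slow", "shutter"},
}

groups = partition_attributes(expert_attrs, voc_to_auto)

# Invert the mapping to see which auto clusters feed more than one survey attribute.
auto_to_survey = defaultdict(set)
for survey_attr, clusters in voc_to_auto.items():
    for cluster in clusters:
        auto_to_survey[cluster].add(survey_attr)

for name, members in groups.items():
    print(name, sorted(members))
print("shutter ->", sorted(auto_to_survey["shutter"]))
```

In this subset, the auto cluster "shutter" maps to both "shutter lag" and "shutter delay", which illustrates the<br />

one-to-many case noted above.<br />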

Expert Only<br />

Survey attribute Familiarity Importance<br />

battery source 6.42 6.05<br />

flash ext 4.91 3.09<br />

flash range 4.17 3.90<br />

image compress 4.80 4.62<br />

image sensor 2.37 3.25<br />

image stab 4.88 5.28<br />

man light sens 3.69 3.69<br />

manual exp 2.69 3.10<br />

manual light meter 2.38 2.86<br />

manual shut 3.09 3.24<br />

mem qty built-in 5.68 5.05<br />

movie fps 4.43 4.35<br />

movie output 3.90 4.20<br />

music play 4.57 3.80<br />

num sensors 2.25 3.18<br />

power adapt 6.05 4.87<br />

time lapse 4.45 3.76<br />

wide angle 3.57 3.57<br />



VOC Only<br />

Survey attribute Auto Familiarity Importance<br />

camera size size 6.35 5.87<br />

body (design) body 5.82 5.48<br />

download time USB 5.68 4.75<br />

feel (durability) feel; support service 5.30 5.43<br />

instructions instruction 6.07 4.18<br />

lcd brightness screen 4.62 4.57<br />

shutter lag slow; shutter 3.87 3.80<br />

twist lcd cover 3.80 3.62<br />

Expert + VOC<br />

Survey attribute Auto Familiarity Importance<br />

battery life battery 6.50 6.30<br />

cam soft edit 5.32 3.57<br />

cam type shoot 4.65 5.06<br />

comp cxn USB 6.13 5.75<br />

comp soft software 5.77 4.60<br />

ergonomic feel feel 5.15 4.83<br />

flash built-in red-eye 6.45 6.26<br />

flash mode low-light; red-eye 5.48 5.12<br />

lcd viewfinder lcd 5.00 5.15<br />

lens cap cover; lens 4.92 4.25<br />

lens type macro 3.80 4.20<br />

manual aper control 2.40 2.86<br />

manual focus control; focus 5.14 3.72<br />

mem capacity mb 5.30 5.68<br />

mem stor type disk; floppy; flash (drive) 5.63 5.47<br />

movie audio movie 4.85 4.48<br />

movie length movie 5.73 5.47<br />

movie res mpeg 4.97 5.15<br />

navigation menu 5.93 5.40<br />

optical viewfinder optical 4.50 4.33<br />

picture quality photo qual; picture; print; image qual 6.35 6.60<br />

price price 6.31 6.45<br />

resolution resolution; megapixel 5.35 5.56<br />



shot modes features 5.55 4.67<br />

shutter delay slow; shutter 4.27 3.95<br />

shutter speed control 4.81 4.60<br />

white bal features 3.40 3.74<br />

zoom dig zoom 5.07 4.47<br />

zoom opt zoom; optical 5.52 5.21<br />



Web Appendix D. Interpreting the Correspondence Analysis dimensions.<br />

Correspondence Analysis is an approach to dimension reduction when analyzing<br />

high-dimensional, two-mode, two-way count data (Everitt and Dunn 2001). To interpret the dimensions in the<br />

reduced space, we regressed each brand’s factor scores on the derived attributes. For each figure in the<br />

paper, we report the results of the stepwise regression on F1 and F2. The reported results assume a<br />

probability for entry of .05 and a probability of removal of .1.<br />
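
The stepwise procedure described above can be sketched as follows. This is a minimal illustration under stated<br />

assumptions, not the software used for the reported results: it assumes entry and removal are decided by partial<br />

F-tests, and the function names (`rss`, `partial_f_pvalue`, `stepwise`) and the synthetic data are hypothetical.<br />

Only the thresholds (.05 to enter, .1 to remove) come from the text.<br />

```python
import numpy as np
from scipy import stats


def rss(X, y, cols):
    """Residual sum of squares for OLS of y on an intercept plus X[:, cols]."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


def partial_f_pvalue(X, y, base, extra):
    """p-value of the partial F-test for adding column `extra` to columns `base`."""
    df_resid = len(y) - len(base) - 2  # n minus (intercept + base + extra)
    if df_resid <= 0:
        return 1.0  # no residual degrees of freedom left to test the term
    rss_small = rss(X, y, base)
    rss_big = rss(X, y, base + [extra])
    if rss_big <= 0.0:
        return 0.0  # perfect fit once the term is added
    f_stat = max(rss_small - rss_big, 0.0) / (rss_big / df_resid)
    return float(stats.f.sf(f_stat, 1, df_resid))


def stepwise(X, y, p_enter=0.05, p_remove=0.10, max_steps=100):
    """Alternate forward entry and backward removal until nothing changes."""
    selected = []
    for _ in range(max_steps):
        changed = False
        # Forward step: add the candidate with the smallest p-value below p_enter.
        candidates = [c for c in range(X.shape[1]) if c not in selected]
        pvals = {c: partial_f_pvalue(X, y, selected, c) for c in candidates}
        if pvals:
            best = min(pvals, key=pvals.get)
            if pvals[best] < p_enter:
                selected.append(best)
                changed = True
        # Backward step: drop the selected term with the largest p-value above p_remove.
        pvals = {
            c: partial_f_pvalue(X, y, [s for s in selected if s != c], c)
            for c in selected
        }
        if pvals:
            worst = max(pvals, key=pvals.get)
            if pvals[worst] > p_remove:
                selected.remove(worst)
                changed = True
        if not changed:
            break
    return selected


# Synthetic data for illustration only: y stands in for a brand's factor score
# and the columns of X for attribute measures.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = 2.0 * X[:, 1] - 3.0 * X[:, 3] + 0.1 * rng.normal(size=40)
selected = stepwise(X, y)
print("selected columns:", selected)
```

Setting the removal threshold above the entry threshold, as in the text, makes it unlikely that a variable is<br />

entered and then immediately dropped.<br />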

To make the dimensions both interpretable and actionable, we follow the marketing literature in<br />

relating customer needs, as represented by user generated reviews, to actionable manufacturer<br />

specifications, as in the "House of Quality" (Hauser and Clausing 1988). Specifically, product attributes<br />

elicited from online reviews include different granularities ("zoom" versus "10x optical zoom") or reflect<br />

different levels of technical sophistication ("low-light settings" vs. "iso settings"). Consulting the set of<br />

professional buying guides, we manually mapped our 39 automatically generated attributes onto a coarser<br />

but actionable categorization of specifications. The visualizations are generated and interpreted using<br />

these meta-attributes.<br />

lens cap/cover resolution manual focus lens rechargeable battery image resolution/print qual<br />

feel durability price movie/video zoom image stabil manual controls<br />

start-up time pc connect navigation weight software picture types<br />

shutter-lag flash built-in storage type size low-light/light control service/warranty<br />

shot-delay camera type white balance lcd autofocus<br />

Table WD4.1 Actionable meta-attributes used in the regressions explaining the CA dimensions<br />
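
Once each automatically generated attribute is assigned to a meta-attribute, mention counts can be rolled up to<br />

the coarser level before analysis. A minimal sketch in Python: the mapping below is hypothetical (the<br />

cluster-to-meta-attribute assignments are not listed here), the counts are invented for illustration, and<br />

`to_meta_counts` is an assumed helper name.<br />

```python
from collections import Counter

# HYPOTHETICAL mapping from auto-derived clusters to meta-attributes.
# The names are taken from this document; the assignments are assumptions.
AUTO_TO_META = {
    "shutter (delay, lag)": "shutter-lag",
    "picture quality": "image resolution/print qual",
    "image (quality)": "image resolution/print qual",
}


def to_meta_counts(auto_counts, mapping):
    """Sum per-cluster mention counts into per-meta-attribute counts.

    Clusters with no entry in `mapping` are returned separately so that
    nothing is dropped silently.
    """
    meta = Counter()
    unmapped = []
    for cluster, count in auto_counts.items():
        if cluster in mapping:
            meta[mapping[cluster]] += count
        else:
            unmapped.append(cluster)
    return dict(meta), unmapped


# Invented counts for a single brand, for illustration only.
brand_counts = {
    "picture quality": 3,
    "image (quality)": 2,
    "shutter (delay, lag)": 4,
    "instructions": 1,
}
meta_counts, unmapped = to_meta_counts(brand_counts, AUTO_TO_META)
print(meta_counts)
print("unmapped:", unmapped)
```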

Figure 3. Mapping the market using customer reviews<br />

Summary of attribute selection for dimension F1.<br />

No. of variables Attribute MSE R² Adjusted R²<br />

1 navigation .000 .897 .882<br />

2 storage type .000 .992 .989<br />

3 movie/video .000 1.000 .999<br />

Summary of attribute selection for dimension F2.<br />



No. of variables Attribute MSE R² Adjusted R²<br />

1 low light .000 .913 .900<br />

2 lens .000 .988 .983<br />

3 price .000 1.000 .999<br />

Figure 5. Market structure by Cons with Pros as supplementary points<br />

Summary of attribute selection for dimension F1.<br />

No. of variables Attribute MSE R² Adjusted R²<br />

1 image resolution/print qual .000 .929 .918<br />

2 start-up time .000 .972 .963<br />

3 lens cap/cover .000 .989 .982<br />

4 shutter-lag .000 .996 .992<br />

5 service/warranty .000 1.000 .999<br />

Summary of attribute selection for dimension F2.<br />

No. of variables Attribute MSE R² Adjusted R²<br />

1 lens .001 .689 .644<br />

2 navigation .000 .933 .911<br />

3 size .000 .973 .956<br />
