njit-etd2003-081 - New Jersey Institute of Technology


archives.njit.edu

These ideas are easily extended to the case of P variables x1, x2, ..., xP. Each principal component is a linear combination of the x variables. The coefficients of these linear combinations are chosen to satisfy the following three requirements:

1. Var C1 ≥ Var C2 ≥ ... ≥ Var CP.
2. The values of any two principal components are uncorrelated.
3. For any principal component, the sum of the squares of the coefficients is one.

In other words, C1 is the linear combination with the largest variance. C2 is the linear combination with the largest variance subject to the condition that it is uncorrelated with C1. Similarly, C3 has the largest variance subject to the condition that it is uncorrelated with C1 and C2, and so on. The Var Ci are the eigenvalues. These P variances add up to the original total variance. In some literature the set of coefficients of the linear combination for the ith principal component is called the ith eigenvector (also known as the characteristic or latent vector).
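The three requirements can be checked numerically. The following is a minimal sketch, not the procedure used in this work: it assumes the components are taken from the sample covariance matrix of the x variables, uses a small made-up data set with P = 3, and finds the eigenvalues with a plain Jacobi rotation routine. The names `covariance_matrix`, `jacobi_eigen`, `principal_components`, and `scores` are hypothetical helpers introduced only for illustration.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix (divisor n - 1) of a list of observations."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]

def jacobi_eigen(sym, sweeps=50, tol=1e-22):
    """Eigenvalues and eigenvectors (as columns) of a symmetric matrix."""
    p = len(sym)
    a = [row[:] for row in sym]
    v = [[float(i == j) for j in range(p)] for i in range(p)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(p) for j in range(p) if i != j)
        if off < tol:
            break
        for i in range(p - 1):
            for j in range(i + 1, p):
                if a[i][j] == 0.0:
                    continue
                # Rotation angle that zeroes the (i, j) off-diagonal entry.
                theta = 0.5 * math.atan2(2.0 * a[i][j], a[i][i] - a[j][j])
                c, s = math.cos(theta), math.sin(theta)
                for m in (a, v):  # column update: M <- M G
                    for k in range(p):
                        mi, mj = m[k][i], m[k][j]
                        m[k][i], m[k][j] = c * mi + s * mj, -s * mi + c * mj
                for k in range(p):  # row update: A <- G^T A
                    ai, aj = a[i][k], a[j][k]
                    a[i][k], a[j][k] = c * ai + s * aj, -s * ai + c * aj
    return [a[i][i] for i in range(p)], v

def principal_components(rows):
    """Return (variances, coeffs), ordered so Var C1 >= Var C2 >= ..."""
    cov = covariance_matrix(rows)
    eigvals, vecs = jacobi_eigen(cov)
    p = len(cov)
    order = sorted(range(p), key=lambda i: eigvals[i], reverse=True)
    variances = [eigvals[i] for i in order]
    coeffs = [[vecs[k][i] for k in range(p)] for i in order]
    return variances, coeffs

def scores(rows, coeffs):
    """Values of each principal component for every (centered) observation."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum(c[j] * (r[j] - means[j]) for j in range(p)) for c in coeffs]
            for r in rows]

# Made-up illustrative data: 10 observations of P = 3 variables.
data = [
    [2.5, 2.4, 1.2], [0.5, 0.7, 0.3], [2.2, 2.9, 0.9], [1.9, 2.2, 1.1],
    [3.1, 3.0, 1.6], [2.3, 2.7, 0.8], [2.0, 1.6, 1.0], [1.0, 1.1, 0.4],
    [1.5, 1.6, 0.9], [1.1, 0.9, 0.5],
]
cov = covariance_matrix(data)
variances, coeffs = principal_components(data)

print("Var C_i (eigenvalues):", variances)
print("sum of Var C_i:", sum(variances))
print("total variance:", sum(cov[i][i] for i in range(len(cov))))
print("sum of squared coefficients:", [sum(x * x for x in c) for c in coeffs])
```

Applying `covariance_matrix` to the output of `scores(data, coeffs)` should give a matrix that is diagonal up to rounding error, with the Var Ci on the diagonal, which is requirement 2 in numerical form.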

3.15 Cluster Analysis

The term cluster analysis (first used by Tryon, 1939) actually encompasses a number of different classification algorithms. A general question facing researchers in many areas of inquiry is how to organize observed data into meaningful structures, that is, to develop taxonomies. For example, biologists have to organize the different species of animals before a meaningful description of the differences between animals is possible. According to the modern system employed in biology, man belongs to the primates, the mammals, the amniotes, the vertebrates, and the animals. Note how, in this classification, the higher the level of aggregation, the less similar the members of the respective class are. Man has more in common with all other primates (e.g., apes) than with the more "distant" members of the mammals (e.g., dogs), etc.

Note that one talks about clustering algorithms and does not mention anything about statistical significance testing. In fact, cluster analysis is not so much a typical statistical test as it is a "collection" of different algorithms that "put objects into clusters." The point here is that, unlike many other statistical procedures, cluster analysis methods are mostly used when a priori hypotheses are not available and the research is still in the exploratory phase. In a sense, cluster analysis finds the "most significant solution possible." Therefore, statistical significance testing is really not appropriate here, even in cases when p-levels are reported (as in ANOVA).

Clustering techniques have been applied to a wide variety of research problems. Hartigan (1975) provides an excellent summary of the many published studies reporting the results of cluster analyses [59]. For example, in the field of medicine, clustering diseases, cures for diseases, or symptoms of diseases can lead to very useful taxonomies.
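As a rough illustration of how an algorithm can "put objects into clusters" and build a taxonomy level by level, the sketch below implements single-linkage agglomerative clustering, which is only one of the many algorithms the term covers and is not necessarily the one used in this work. The points are made-up two-dimensional data, and `single_linkage` is a hypothetical helper name. Each merge is recorded with the distance at which it happened, so later (higher) levels of aggregation join less similar members.

```python
import math

def single_linkage(points):
    """Repeatedly merge the two closest clusters until one remains.

    Returns a list of merges (members_a, members_b, distance), where the
    distance between clusters is the smallest point-to-point distance.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []

    def cluster_distance(a, b):
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters) - 1):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return merges

# Made-up objects: two tight groups and one isolated point.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 5.2), (9.0, 0.5)]
merges = single_linkage(points)

for level, (a, b, d) in enumerate(merges, start=1):
    print(f"level {level}: merge {a} with {b} at distance {d:.3f}")
```

No significance test appears anywhere in the procedure: the algorithm always returns a hierarchy, whether or not the data contain meaningful groups, which is why the result is best read as exploratory.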

