Choice of distance matrices in cluster analysis: defining regions

Mimmack, Gillian M.; Mason, Simon J.; Galpin, Jacqueline S.

Cluster analysis is a technique frequently used in climatology for grouping cases to define classes (synoptic types or climate regimes, for example), or for grouping stations or grid points to define regions. Cluster analysis is based on some form of distance matrix, and the most commonly used metric in climatology has been the Euclidean distance. Arguments for the use of Euclidean distances are in some ways similar to arguments for using a covariance matrix in principal components analysis: the use of the metric is valid if all data are measured on the same scale. When Euclidean distances are used for cluster analysis, however, the additional assumption is made that all the variables are uncorrelated, and this assumption is frequently ignored. Two possible methods of dealing with the correlation between the variables are considered: performing a principal components analysis before calculating Euclidean distances, and calculating Mahalanobis distances using the raw data. Under certain conditions, calculating Mahalanobis distances is equivalent to calculating Euclidean distances from the principal components. It is suggested that when cluster analysis is used for defining regions, Mahalanobis distances are inappropriate, and that Euclidean distances should be calculated using the unstandardized principal component scores based on only the major principal components.
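
The relationship between the distance measures described above can be illustrated with a minimal sketch. It is not the authors' code: the synthetic data, the mixing matrix, the helper names (pairwise_euclidean, pairwise_mahalanobis, pc_scores) and the choice of k = 2 retained components are all assumptions made for illustration. It shows one standard setting in which the equivalence holds, namely when all principal components are retained and each is standardized to unit variance, and then computes Euclidean distances from the unstandardized scores of only the leading components, as the abstract recommends for defining regions.

```python
import numpy as np

def pairwise_euclidean(X):
    """Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def pairwise_mahalanobis(X):
    """Mahalanobis distance between every pair of rows of X,
    using the sample covariance matrix of the columns."""
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X[:, None, :] - X[None, :, :]
    d2 = np.einsum("ijk,kl,ijl->ij", diff, S_inv, diff)
    return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negative round-off

def pc_scores(X):
    """Unstandardized principal component scores (covariance-based PCA)
    and the corresponding eigenvalues, in decreasing order of variance."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    return Xc @ eigvecs[:, order], eigvals[order]

# Hypothetical data: 50 cases, 4 correlated variables on a common scale.
rng = np.random.default_rng(0)
mixing = np.array([[1.0, 0.8, 0.3, 0.0],
                   [0.0, 1.0, 0.5, 0.2],
                   [0.0, 0.0, 1.0, 0.6],
                   [0.0, 0.0, 0.0, 1.0]])
X = rng.normal(size=(50, 4)) @ mixing

scores, eigvals = pc_scores(X)

d_raw = pairwise_euclidean(X)                             # Euclidean, raw data
d_mahal = pairwise_mahalanobis(X)                         # Mahalanobis, raw data
d_std_all = pairwise_euclidean(scores / np.sqrt(eigvals)) # all PCs, standardized
d_unstd_all = pairwise_euclidean(scores)                  # all PCs, unstandardized

k = 2  # assumed number of "major" components, chosen only for illustration
d_major = pairwise_euclidean(scores[:, :k])               # leading PCs, unstandardized

print(np.allclose(d_mahal, d_std_all))
print(np.allclose(d_raw, d_unstd_all))
```

In this sketch, standardizing every component gives the minor components the same weight as the major ones, which is what Mahalanobis distances do implicitly. Keeping only the leading unstandardized components instead discards the minor components and leaves the major ones weighted by their variance.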


  • Mimmack_GM_etal_2001_JClim_14_2790.pdf (application/pdf, 524 KB)

More About This Work

Academic Units
International Research Institute for Climate and Society
Published Here
March 23, 2020