Choropleth map



A choropleth map is a type of statistical thematic map that uses pseudocolor, meaning color corresponding with an aggregate summary of a geographic characteristic within spatial enumeration units, such as population density or per-capita income.

Choropleth maps provide an easy way to visualize how a variable varies across a geographic area or show the level of variability within a region. A heat map or isarithmic map is similar but uses regions drawn according to the pattern of the variable, rather than the a priori geographic areas of choropleth maps. The choropleth is likely the most common type of thematic map because published statistical data (from government or other sources) is generally aggregated into well-known geographic units, such as countries, states, provinces, and counties, which makes such maps relatively easy to create using GIS, spreadsheets, or other software tools.

History
The earliest known choropleth map was created in 1826 by Baron Pierre Charles Dupin, depicting the availability of basic education in France by department. More "cartes teintées" ("tinted maps") were soon produced in France to visualize other "moral statistics" on education, disease, crime, and living conditions. Choropleth maps quickly gained popularity in several countries due to the increasing availability of demographic data compiled from national censuses, starting with a series of choropleth maps published in the official reports of the 1841 Census of Ireland. When chromolithography became widely available after 1850, color was increasingly added to choropleth maps.

The term "choropleth map" was introduced in 1938 by the geographer John Kirtland Wright, and was in common usage among cartographers by the 1940s. Also in 1938, Glenn Trewartha reintroduced them as "ratio maps", but this term did not survive.

Structure
A choropleth map brings together two datasets: spatial data representing a partition of geographic space into distinct districts, and statistical data representing a variable aggregated within each district. There are two common conceptual models of how these interact in a choropleth map: in one view, which may be called "district dominant", the districts (often existing governmental units) are the focus, in which a variety of attributes are collected, including the variable being mapped. In the other view, which may be called "variable dominant", the focus is on the variable as a geographic phenomenon (say, the Latino population), with a real-world distribution, and the partitioning of it into districts is merely a convenient measurement technique.



Geometry: aggregation districts
In a choropleth map, the districts are usually previously defined entities such as governmental or administrative units (e.g., counties, provinces, countries), or districts created specifically for statistical aggregation (e.g., census tracts), and thus have no expectation of correlation with the geography of the variable. That is, boundaries of the colored districts may or may not coincide with the location of changes in the geographic distribution being studied. This is in direct contrast to chorochromatic and isarithmic maps, in which region boundaries are defined by patterns in the geographic distribution of the subject phenomenon.

Using pre-defined aggregation regions has a number of advantages, including: easier compilation and mapping of the variable (especially in the age of GIS and the Internet with its many sources of data), recognizability of the districts, and the applicability of the information to further inquiry and policy tied to the individual districts. A prime example of this would be elections, in which the vote total for each district determines its elected representative.

However, it can result in a number of issues, generally due to the fact that the constant color applied to each aggregation district makes it look homogeneous, masking an unknown degree of variation of the variable within the district. For example, a city may include neighborhoods of low, moderate, and high family income, but be colored with one constant "moderate" color. Thus, real-world spatial patterns may not conform to the regional unit symbolized. Because of this, issues such as the ecological fallacy and the modifiable areal unit problem (MAUP) can lead to major misinterpretations of the data depicted, and other techniques are preferable if one can obtain the necessary data.

These issues can be somewhat mitigated by using smaller districts, because they show finer variations in the mapped variable, and their smaller visual size and increased number reduce the likelihood that the map user makes judgments about the variation within a single district. However, smaller districts can make the map overly complex, especially if there is not a meaningful geographic pattern in the variable (i.e., the map looks like randomly scattered colors). Although representing specific data in large regions can be misleading, the familiar district shapes can make the map clearer and easier to interpret and remember. The choice of regions will ultimately depend on the map's intended audience and purpose. Alternatively, the dasymetric technique can sometimes be employed to refine the region boundaries to more closely match actual changes in the subject phenomenon.

Because of these issues, for many variables, one may prefer an isarithmic (for a quantitative variable) or chorochromatic map (for a qualitative variable), in which the region boundaries are based on the data itself. However, in many cases such detailed information is simply not available, and the choropleth map is the only feasible option.



Property: aggregate statistical summaries
The variable to be mapped may come from a wide variety of disciplines in the human or natural world, although human topics (e.g., demographics, economics, agriculture) are generally more common because of the role of governmental units in human activity, which often leads to the original collection of the statistical data. The variable can also be in any of Stevens' levels of measurement: nominal, ordinal, interval, or ratio, although quantitative (interval/ratio) variables are more commonly used in choropleth maps than qualitative (nominal/ordinal) variables. The level of measurement of the individual datum may be different from that of the aggregate summary statistic. For example, a census may ask each individual for his or her "primary spoken language" (nominal), but this may be summarized over all of the individuals in a county as "percent primarily speaking Spanish" (ratio) or as "predominant primary language" (nominal).
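The language example above can be sketched in a few lines of Python. The response list and the function names `percent_speaking` and `predominant_language` are hypothetical, chosen only to show how the same nominal responses yield either a ratio-level or a nominal-level district summary.

```python
from collections import Counter

# Hypothetical individual-level responses for one district (nominal data).
responses = ["Spanish", "English", "Spanish", "Vietnamese", "Spanish", "English"]


def percent_speaking(responses, language):
    """Ratio-level summary: percentage of respondents reporting `language`."""
    matches = sum(1 for r in responses if r == language)
    return 100.0 * matches / len(responses)


def predominant_language(responses):
    """Nominal-level summary: the most frequently reported language (the mode)."""
    return Counter(responses).most_common(1)[0][0]


pct_spanish = percent_speaking(responses, "Spanish")
top_language = predominant_language(responses)
print(pct_spanish, top_language)
```

Either summary could be mapped, but each calls for a different kind of color scheme because the levels of measurement differ.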

Broadly speaking, a choropleth map may represent two types of variables, a distinction common to physics and chemistry as well as geostatistics and spatial analysis:
 * A spatially extensive variable (sometimes called a global property) is one that can apply only to the entire district, commonly in the form of total counts or amounts of a phenomenon (akin to mass or weight in physics). Extensive variables are said to be accumulative over space; for example, if the population of the United Kingdom is 65 million, it is not possible that the populations of England, Wales, Scotland, and Northern Ireland could also be 65 million. Instead, their total populations must sum (accumulate) to calculate the total population of the collective entity. However, while it is possible to map an extensive variable in a choropleth map, this is almost universally discouraged because patterns can be easily misinterpreted. For example, if a choropleth map assigned a particular shade of red to total populations between 60 and 70 million, a situation in which the United Kingdom (as a single district) has 65 million inhabitants would be indistinguishable from a situation in which the four constituent countries each had 65 million inhabitants, even though these are vastly different geographic realities. Another source of interpretation error is that if a large district and a small district have the same value (and thus the same color), the larger one will naturally appear to represent more. Other types of thematic maps, especially proportional symbols and cartograms, are designed to represent extensive variables and are generally preferred.
 * A spatially intensive variable, also known as a field, statistical surface, or localized variable, represents a property that could be measured at any location (a point or small area, depending on its nature) in space, independent of any boundaries, although its variation over a district can be summarized as a single value. Common intensive variables include densities, proportions, rates of change, mean allotments (e.g., GDP per capita), and descriptive statistics (e.g., mean, median, standard deviation). Intensive variables are said to be distributive over space; for example, if the population density of the United Kingdom is 250 people per square kilometer, then it would be reasonable to estimate (in the absence of any other data) that the most likely (if not actually correct) density of each of the four constituent countries is also 250/km2. Traditionally in cartography, the predominant conceptual model for this kind of phenomenon has been the statistical surface, in which the variable is imagined as a third-dimension "height" above the two-dimensional space that varies continuously. In geographic information science, the more common conceptualization is the field, adopted from physics and usually modeled as a scalar function of location. Choropleth maps are better suited to intensive variables than extensive; if a map user sees the United Kingdom filled with a color for "100-200 people per square km", estimating that Wales and England may each have 100-200 people per square km may not be accurate, but it is possible and a reasonable estimate.
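The accumulative versus distributive distinction can be illustrated with a small sketch. The three districts and their figures below are made up for illustration and are not real statistics.

```python
# Made-up districts: (name, population, area in km^2). Not real statistics.
districts = [("A", 6000, 20.0), ("B", 3000, 30.0), ("C", 1000, 50.0)]

# Extensive variables accumulate: the parts sum to the whole.
total_population = sum(pop for _, pop, _ in districts)
total_area = sum(area for _, _, area in districts)

# Intensive variable: density of each district and of the whole region.
densities = {name: pop / area for name, pop, area in districts}
overall_density = total_population / total_area

# Densities do not sum to the overall density; the overall value is
# instead the area-weighted mean of the district densities.
sum_of_densities = sum(densities.values())
area_weighted_mean = sum(densities[name] * area for name, _, area in districts) / total_area

print(total_population, overall_density, sum_of_densities, area_weighted_mean)
```

With these numbers the populations add up to the regional total, while the regional density (100 per km^2) lies within the range of the district densities rather than being their sum.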



Normalization
Normalization is the technique of deriving a spatially intensive variable from one or more spatially extensive variables, so that it can be appropriately used in a choropleth map. It is similar, but not identical, to the technique of normalization or standardization in statistics. Typically, it is accomplished by computing the ratio between two spatially extensive variables. Although any such ratio will result in an intensive variable, only a few are especially meaningful and commonly used in choropleth maps:
 * Density = total / area. Example: population density
 * Proportion = subgroup total / grand total. Example: Wealthy households as a percentage of all households.
 * Mean allocation = total amount / total individuals. Example: gross domestic product per capita (total GDP / total population)
 * Rate of change = total at later time / total at earlier time. Example: annual population growth rate.

These are not equivalent, nor is one better than another. Rather, they tell different aspects of a geographic narrative. For example, a choropleth map of the population density of the Latino population in Texas visualizes a narrative about the spatial clustering and distribution of that group, while a map of the percent Latino visualizes a narrative of composition and predominance. Failure to employ proper normalization will lead to an inappropriate and potentially misleading map in almost all cases. This is one of the most common mistakes in cartography, with one study finding that at one point, more than half of United States COVID-19 dashboards hosted by state governments were not applying normalization to their choropleth maps. This is one of many issues that contributed to the infodemic surrounding the COVID-19 pandemic, and "might also be a subtle facilitator of the extreme political polarization surrounding measures to combat COVID that has occurred in the United States".
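The four ratios listed above can be written as one-line functions. This is a minimal sketch: the function names and the sample county record are hypothetical, and the rate of change follows the simple later/earlier ratio given in the list.

```python
def density(total, area):
    """Density = total / area."""
    return total / area


def proportion(subgroup_total, grand_total):
    """Proportion = subgroup total / grand total."""
    return subgroup_total / grand_total


def mean_allocation(total_amount, total_individuals):
    """Mean allocation = total amount / total individuals."""
    return total_amount / total_individuals


def rate_of_change(later_total, earlier_total):
    """Rate of change = total at later time / total at earlier time."""
    return later_total / earlier_total


# Hypothetical county record made of spatially extensive totals.
county = {
    "population_later": 110_000,
    "population_earlier": 100_000,
    "area_km2": 500.0,
    "latino_population": 44_000,
    "total_income": 3_300_000_000,
}

pop_density = density(county["population_later"], county["area_km2"])
latino_share = proportion(county["latino_population"], county["population_later"])
income_per_capita = mean_allocation(county["total_income"], county["population_later"])
growth_ratio = rate_of_change(county["population_later"], county["population_earlier"])
print(pop_density, latino_share, income_per_capita, growth_ratio)
```

Each result is an intensive value derived from extensive totals, so any of them could be mapped as a choropleth, though each tells a different story about the same county.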

Classification
Every choropleth map has a strategy for mapping values to colors. A classified choropleth map separates the range of values into classes, with all of the districts in each class being assigned the same color. An unclassed map (sometimes called n-class) directly assigns a color proportional to the value of each district. Starting with Dupin's 1826 map, classified choropleth maps have been far more common. It is likely that this was originally due to the greater simplicity of applying a limited set of tints; only in the age of computerized cartography have unclassed choropleth maps even been feasible, and until recently, they were still not easy to create in most mapping software. Waldo R. Tobler, in formally introducing the unclassed scheme in 1973, asserted that it was a more accurate depiction of the original data, and stated that the primary argument in favor of classification, that it is more readable, needed to be tested. The debate and experiments that followed came to the general conclusion that the primary advantage of unclassed choropleth maps, in addition to Tobler's assertion of raw accuracy, was that they allowed readers to see subtle variations in the variable, without leading them to believe that the districts that fell into the same class had identical values. Thus, readers are better able to see the general patterns in the geographic phenomenon, but not the specific values. The primary argument in favor of classed choropleth maps is that they are easier for readers to process, due to the smaller number of distinct shades to recognize, which reduces cognitive load and allows readers to precisely match the colors in the map to the values listed in the legend.
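As an illustration of the unclassed approach, the sketch below assigns each value a color by linear interpolation between two endpoint colors. The function name `unclassed_color`, the choice of RGB interpolation, and the default white-to-dark-blue endpoints are assumptions made for this example; actual mapping software may interpolate in other color spaces.

```python
def unclassed_color(value, vmin, vmax, low_rgb=(255, 255, 255), high_rgb=(0, 0, 139)):
    """Return an (r, g, b) tuple proportional to where `value` lies in [vmin, vmax]."""
    if vmax == vmin:
        raise ValueError("vmin and vmax must differ")
    # Position of the value within the data range, clamped to [0, 1].
    t = (value - vmin) / (vmax - vmin)
    t = min(1.0, max(0.0, t))
    # Interpolate each channel separately between the two endpoint colors.
    return tuple(round(lo + t * (hi - lo)) for lo, hi in zip(low_rgb, high_rgb))


lowest = unclassed_color(0, 0, 100)
highest = unclassed_color(100, 0, 100)
middle = unclassed_color(50, 0, 100)
print(lowest, middle, highest)
```

Because every distinct value gets its own shade, two districts with nearly equal values receive nearly equal colors, which is the behavior the unclassed scheme is meant to provide.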

Classification is performed by establishing a classification rule, a series of thresholds that partitions the quantitative range of variable values into a series of ordered classes. For example, if a dataset of annual median income by U.S. county includes values between US$20,000 and $150,000, it could be broken into three classes at thresholds of $45,000 and $83,000. To avoid confusion, any classification rule should be mutually exclusive and collectively exhaustive, meaning that any possible value falls into exactly one class. For example, if a rule establishes a threshold at the value 6.5, it needs to be clear about whether a district with a value of exactly 6.5 will be classified into the lower or upper class (i.e., whether the definition of the lower class is <6.5 or ≤6.5 and whether the upper class is >6.5 or ≥6.5). A variety of types of classification rules have been developed for choropleth maps:
 * Exogenous rules import thresholds without regard for patterns in the data at hand.
   * Established rules are those already in common use due to past scientific research or official policy. An example would be using government tax brackets or a standard poverty threshold when classifying income levels.
   * Ad hoc or common sense strategies are essentially invented by the cartographer using thresholds that have some intuitive sense. An example would be classifying incomes according to what the cartographer believes to be "rich," "middle class," and "poor." These strategies are generally not advised unless no other method is feasible.
 * Endogenous rules are based on patterns in the dataset itself.
   * Natural breaks rules look for natural clusters in the data, in which large numbers of districts have similar values with large gaps between them. If this is the case, such clusters are probably geographically meaningful.
     * The Jenks natural breaks optimization, developed by George F. Jenks, is a heuristic algorithm for automatically identifying such clusters if they exist; it is essentially a one-dimensional form of the k-means clustering algorithm. If natural clusters do not exist, the breaks it generates are often recognized as a good compromise between the other methods, and it is commonly the default classifier used in GIS software.
   * Equal intervals or an arithmetic progression divides the range of values so that each class has an equal range of values: (max - min)/n. For example, the income range above ($20,000 - $150,000) would be divided into four classes at $52,500, $85,000, and $117,500.
   * A standard deviation rule also generates equal ranges of value, but rather than starting with the minimum and maximum values, it starts at the arithmetic mean of the data and establishes a break at each multiple of a constant number of standard deviations above and below the mean.
   * Quantiles divide the dataset so each class has an equal number of districts. For example, if the 3,141 counties of the United States were divided into four quantile classes (i.e., quartiles), then the first class would include the 785 poorest counties, the second class the next 785, and so on. Adjustments may need to be made when the number of districts does not divide evenly, or when identical values straddle the threshold.
   * A geometric progression rule divides the range of values so the ratio of thresholds is constant (rather than their interval as in an arithmetic progression). For example, the income range above would be divided using a ratio of 2 with thresholds at $40,000 and $80,000. This type of rule is commonly used when the frequency distribution of the data has a very high positive skew, especially if it is geometric or exponential.
   * A nested means or head/tail breaks rule is an algorithm that recursively divides the data set by setting a threshold at the arithmetic mean, then subdividing each of the two created classes at their respective means, and so on. Thus, the number of classes is not arbitrary, but must be a power of two (2, 4, 8, etc.). It has been suggested that this also works well for highly skewed distributions.

Because calculated thresholds can often be at precise values that are not easily interpretable by map readers (e.g., $74,326.9734), it is common to create a modified classification rule by rounding threshold values to a similar simple number. A common example is a modified geometric progression that subdivides powers of ten, such as [1, 2.5, 5, 10, 25, 50, 100, ...] or [1, 3, 10, 30, 100, ...].
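The sketch below shows one way rounded thresholds might be applied while keeping the classes mutually exclusive. The helper names `round_to_step` and `classify` and the `boundary_goes_to_lower` flag are hypothetical; the point is only that the treatment of a value exactly on a threshold has to be stated explicitly.

```python
from bisect import bisect_left, bisect_right


def round_to_step(value, step):
    """Round a computed threshold to the nearest multiple of `step`."""
    return round(value / step) * step


def classify(value, thresholds, boundary_goes_to_lower=False):
    """Return the 0-based class index of `value` for ascending `thresholds`.

    By default a value equal to a threshold falls in the upper class
    (lower class is < t). With boundary_goes_to_lower=True it falls in
    the lower class (lower class is <= t).
    """
    if boundary_goes_to_lower:
        return bisect_left(thresholds, value)
    return bisect_right(thresholds, value)


rounded = round_to_step(74_326.9734, 1_000)
income_thresholds = [45_000, 83_000]
print(rounded)
print(classify(6.5, [6.5]), classify(6.5, [6.5], boundary_goes_to_lower=True))
print([classify(v, income_thresholds) for v in (20_000, 45_000, 100_000)])
```

Either boundary convention works, provided the legend labels the classes the same way the classifier treats them.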

Color progression
The final element of a choropleth map is the set of colors used to represent the different values of the variable. There are a variety of different approaches to this task, but the primary principle is that any order in the variable (e.g., low to high quantitative values) should be reflected in the perceived order of the colors (e.g., light to dark), as this will allow map readers to intuitively make "more vs. less" judgements and see trends and patterns with minimal reference to the legend. A second general guideline, at least for classified maps, is that the colors should be easily distinguishable, so the colors on the map can be unambiguously matched to those in the legend to determine the represented values. This requirement limits the number of classes that can be included; tests have shown that when value alone is used (e.g., light to dark, whether gray or any single hue), it is difficult in practice to use more than seven classes. If differences in hue and/or saturation are incorporated, that limit increases significantly to as many as 10-12 classes. The need for color discrimination is further affected by color vision deficiencies; for example, color schemes that use red and green to distinguish values will not be useful for a significant portion of the population.

The most common types of color progressions used in choropleth (and other thematic) maps include:


 * A sequential progression represents variable values as color value.
   * A grayscale progression uses only shades of gray.
   * A single-hue progression fades from a dark shade of the chosen color (or gray) to a very light or white shade of relatively the same hue. This is a common method used to map magnitude. The darkest shade represents the greatest number in the data set, and the lightest shade represents the least.
   * A partial-spectral progression uses a limited range of hues to add more contrast to the value contrast, enabling a larger number of classes to be used. Yellow is commonly used for the lighter end of the progression due to its natural apparent lightness. Common hue ranges are yellow-green-blue and yellow-orange-red.
 * A divergent or bi-polar progression is essentially two sequential color progressions (of the types above) joined with a common light color or white. They are normally used to represent positive and negative values or divergence from a central tendency, such as the mean of the variable being mapped. For example, a typical progression when mapping temperatures is from dark blue (for cold) to dark red (for hot) with white in the middle. These are often used when the two extremes are given value judgements, such as showing the "good" end as green and the "bad" end as red.
 * A spectral progression uses a wide range of hues (possibly the entire color wheel) without intended differences in value. This is most commonly used when there is an order to the values, but it is not a "more vs. less" order, such as seasonality. It is frequently used by non-cartographers in situations where other color progressions would be much more effective.
 * A qualitative progression uses a scattered set of hues in no particular order, with no intended difference in value. This is most commonly used with nominal categories in a qualitative choropleth map, such as "most prevalent religion."
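A single-hue and a divergent progression of the kinds described above can be generated with Python's standard `colorsys` module. This is only a sketch: the function names, the lightness range, and the saturation value are arbitrary assumptions, and published color schemes are usually tuned perceptually rather than computed this simply.

```python
import colorsys


def to_hex(r, g, b):
    """Convert RGB components in [0, 1] to a '#rrggbb' string."""
    return "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))


def single_hue(n, hue, light=0.9, dark=0.25, saturation=0.7):
    """Return n colors of one hue (0-1 on the color wheel), lightest first."""
    if n < 2:
        raise ValueError("need at least two classes")
    step = (light - dark) / (n - 1)
    # Hold hue and saturation fixed and vary only the lightness.
    return [to_hex(*colorsys.hls_to_rgb(hue, light - i * step, saturation)) for i in range(n)]


def diverging(n_per_side, low_hue, high_hue, middle="#ffffff"):
    """Join two single-hue ramps at a shared light middle color."""
    low = single_hue(n_per_side, low_hue)    # lightest first
    high = single_hue(n_per_side, high_hue)  # lightest first
    # Dark low end ... light low, middle, light high ... dark high end.
    return low[::-1] + [middle] + high


blues = single_hue(5, 0.6)
blue_to_red = diverging(3, 0.6, 0.0)
print(blues)
print(blue_to_red)
```

The diverging ramp runs from dark blue through white to dark red, matching the temperature example in the list.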

Bivariate choropleth maps
It is possible to represent two (and sometimes three) variables simultaneously on a single choropleth map by representing each with a single-hue progression and blending the colors of each district. This technique was first published by the U.S. Census Bureau in the 1970s, and has been used many times since, to varying degrees of success. This technique is generally used to visualize the correlation and contrast between two variables hypothesized to be closely related, such as educational attainment and income. Contrasting but not complementary colors are generally used, so that their combination is intuitively recognized as "between" the two original colors, such as red+blue=purple. The technique works best when the geography of the variable has a high degree of spatial autocorrelation, so that there are large regions of similar colors with gradual changes between them; otherwise the map can look like a confusing mix of random colors. Bivariate maps have been found to be easier to use if the map includes a carefully designed legend and an explanation of the technique.
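The blending idea can be sketched as a small grid of colors, one per combination of classes. The averaging of RGB channels used here is an assumption made for simplicity, not the method of any particular published map, and the function name `bivariate_palette` is hypothetical.

```python
def bivariate_palette(n=3):
    """Return an n x n grid of (r, g, b) colors.

    palette[i][j] is the color for class i of variable A (white-to-red ramp)
    combined with class j of variable B (white-to-blue ramp).
    """
    def ramp_a(t):
        return (1.0, 1.0 - t, 1.0 - t)  # white -> red

    def ramp_b(t):
        return (1.0 - t, 1.0 - t, 1.0)  # white -> blue

    palette = []
    for i in range(n):
        row = []
        for j in range(n):
            a = ramp_a(i / (n - 1))
            b = ramp_b(j / (n - 1))
            # Blend by averaging each channel of the two ramp colors.
            row.append(tuple(round(255 * (x + y) / 2) for x, y in zip(a, b)))
        palette.append(row)
    return palette


palette = bivariate_palette(3)
for row in palette:
    print(row)
```

Districts low on both variables come out white, districts high on only one variable come out reddish or bluish, and districts high on both come out purple, so the grid itself can double as the map legend.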

Legend
A choropleth map uses ad hoc symbols to represent the mapped variable. While the general strategy may be intuitive if a color progression is chosen that reflects the proper order, map readers cannot decipher the actual value of each district without a legend. A typical choropleth legend for a classed choropleth map includes a series of sample patches of the symbol for each class, with a text description of the corresponding range of values. On an unclassed choropleth map, it is common for the legend to show a smooth color gradient between the minimum and maximum values, with two or more points along it labeled with corresponding values.

An alternative approach is the histogram legend, which includes a histogram showing the frequency distribution of the mapped variable (i.e., the number of districts in each class). Each class may be represented by a single bar with its width determined by its minimum and maximum threshold values and its height calculated such that the box area is proportional to the number of districts included, then colored with the map symbol used for that class. Alternatively, the histogram may be divided into a large number of bars, such that each class includes one or more bars, symbolized according to its symbol in the map. This form of legend not only shows the threshold values for each class, but also gives some context for the source of those values, especially for endogenous classification rules that are based on the frequency distribution, such as quantiles. However, histogram legends are not currently supported in GIS and mapping software, and must typically be constructed manually.
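Since such legends are typically built by hand, the bar geometry for the one-bar-per-class variant can be computed as in the following sketch. The function name, the dictionary layout, and the sample values are hypothetical, and the thresholds are assumed to lie strictly between the minimum and maximum values.

```python
def histogram_legend_bars(values, thresholds):
    """Return one bar per class, with area proportional to the district count.

    Width spans the class's lower and upper bounds; height = count / width.
    Classes are lower-inclusive, except the last, which includes the maximum.
    """
    edges = [min(values)] + sorted(thresholds) + [max(values)]
    bars = []
    for k in range(len(edges) - 1):
        left, right = edges[k], edges[k + 1]
        is_last = k == len(edges) - 2
        count = sum(1 for v in values if left <= v < right or (is_last and v == right))
        width = right - left
        bars.append({"left": left, "width": width, "count": count, "height": count / width})
    return bars


# Hypothetical district values and two class thresholds.
bars = histogram_legend_bars([1, 2, 2, 3, 5, 8, 9, 10], [4, 6])
for bar in bars:
    print(bar)
```

Each bar would then be filled with the color used for its class on the map, so the legend shows both the class ranges and how many districts fall in each.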