
# Statistical Data

Authored by: Nicolas Lambert, Christine Zanin

# Practical Handbook of Thematic Cartography

Print publication date:  May  2020
Online publication date:  May  2020

Print ISBN: 9780367261290
eBook ISBN: 9780429291968

10.1201/9780429291968-4

#### Abstract

Geographic information has a semantic dimension that corresponds to the subject developed by the map. This semantic information can be quantitative or qualitative. This chapter describes how to consider statistical data, how to identify them, and how to represent them in order to avoid misinterpretation between the data and its representation.

#### Objectives

• How to construct a data table
• How to identify the nature of statistical indicators
• Getting to know the different discretization methods
• Choosing the right discretization method

Statistical data is either quantitative or qualitative, and either collected or constructed; in all cases it is what makes a cartographic representation possible. We consider this data to be the semantic dimension of geographical information. While the basemap can be seen as the “container”, the data is the “content”. It forms the basis of the geographical information represented and delivers the geographical message by spatially organizing what the map reveals. This is the data to which the rules of symbolization (graphic semiology) apply (see Part 2).

#### 2.1  Data Tables

Statistical data is stored in the form of elementary tables in which each line corresponds to a spatial unit (or a geographical unit) and each column corresponds to a variable (or an indicator) that characterizes the object. The data is captured, modified, and handled by way of a spreadsheet. Producing a map requires these data tables to be very thorough and accurate so as to avoid any ambiguity when integrating the data into the basemap.

The first line in the table identifies the names of the different variables. They should be as short and explicit as possible. For reasons of compatibility with certain software programs, it is preferable to avoid special characters, spaces, and accents. The first column serves to identify each single territorial unit by a specific code. These codes are often related to coding systems from the different data suppliers (US Census Bureau, Eurostat, World Bank, etc.). However, in the case of a completely new dataset an intelligible coding system will need to be devised. The units or “boxes” in the table formed in this manner correspond to the different values taken on by the spatial units for each of the variables.

Data tables can include missing values. These can be referenced by way of an empty cell in the table, or by “NA” or “N/A”, meaning not applicable or not available. In certain tables, for historical reasons relating to the data formats in some software, missing data are sometimes coded -9999. Because such a code can be mistaken for a real value, this practice should be avoided. Missing data then appear as blanks on the map and are referenced as such in the legend.
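As an illustration of the point above, here is a minimal Python sketch of how missing-value markers might be normalized when a table is loaded. The table content, the column names `code` and `pop`, and the helper names are hypothetical; the only assumption taken from the text is that blanks, “NA”, “N/A”, and the legacy -9999 code all stand for missing data.

```python
import csv
import io

# Hypothetical table: the unit codes and values are made up for illustration.
RAW = """code,pop
A01,1200
A02,
A03,NA
A04,-9999
A05,N/A
"""

# Markers treated as "missing" (assumption: -9999 is a legacy sentinel, not a value).
MISSING = {"", "NA", "N/A", "-9999"}


def parse_value(text):
    """Return a float, or None when the cell holds a missing-value marker."""
    text = text.strip()
    return None if text in MISSING else float(text)


def load_table(raw):
    """Map each unit code to its value (None means missing)."""
    reader = csv.DictReader(io.StringIO(raw))
    return {row["code"]: parse_value(row["pop"]) for row in reader}


table = load_table(RAW)
# Units to draw as blanks on the map and to list as "no data" in the legend.
missing_units = [code for code, value in table.items() if value is None]
```

Converting every marker to a single explicit "no value" at load time keeps a code such as -9999 from leaking into sums, means, or class boundaries later on.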

Figure 2.01   Data tables.

A nomenclature is a set or system of names or terms, such as those used in a particular science or art, by an individual or a community. Country codes form one such nomenclature: they are short alphabetic or numeric geographical codes (geocodes) developed to represent countries and dependent areas for use in data processing and communications. Several different systems have been developed to do this. The term “country code” frequently refers to ISO 3166-1. However, each international organization has its own country codes, for example, WIS codes for the World Bank and the M49 standard (“Standard Country or Area Codes for Statistical Use”) for the United Nations system.

ISO 3166-1 is part of the ISO 3166 standard published by the International Organization for Standardization (ISO). The alphabetic country codes were first included in ISO 3166 in 1974, and the numeric country codes were first included in 1981. The country codes have been published as ISO 3166-1 since 1997, when ISO 3166 was expanded into three parts, with ISO 3166-2 defining codes for subdivisions and ISO 3166-3 defining codes for former countries.

For the UN, the list of countries or areas contains the names of countries or areas in alphabetical order, their three-digit numerical codes used for statistical processing purposes by the Statistics Division of the United Nations Secretariat, and their three-letter alphabetical codes assigned by the ISO.

#### 2.2  Data Types

Designing a map is not possible without knowing and understanding the nature of the data to be represented. Choices in the area of representation concern the graphic expression of the information (see Part 2). It is therefore essential to know how to characterize the data so as to process and represent it adequately. Below are a few practical elements to guide you.

#### 2.2.1  Statistical Data Expresses Either a Quality or a Quantity

Qualitative data is not measurable; it involves names, acronyms, and codes. Qualitative attributes cannot be summed, and averages cannot be calculated. Qualitative data can be divided into two categories: ordinal qualitative data which can be classified in a given or chosen order, and nominal qualitative data which cannot be ordered. For instance, a hierarchical classification of European towns and cities – capital cities, regional capitals, secondary cities, etc. – is a form of ordinal qualitative data. Data on the official language of countries – French, German, Spanish, etc. – is nominal qualitative data that cannot be hierarchized.

Quantitative data is always numerical. By definition, the data is ordered, and the average has meaning. There are two types of quantitative data. Absolute quantitative data expresses concrete quantities, and the sum has a meaning. It falls into two categories: absolute quantitative “stock” data, corresponding to counts at instant t (e.g., the number of inhabitants on January 1), and absolute quantitative “flow” data (not to be confused with flow data expressing relationships between places), corresponding to counts over a period of time (e.g., the number of births in the year). Relative quantitative data is derived from the calculation of a relationship between two values (for instance, the unemployment rate or population density). The sum of relative quantitative data, often expressed in percentages, has no meaning, and only the mean can be significant. By extension, composite numerical indicators that combine several simple types of data (e.g., indices) can be grouped with relative quantitative data.
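The difference between absolute and relative data can be checked with a few lines of Python. The sketch below uses two made-up spatial units with hypothetical populations and areas: summing the populations gives a meaningful total, whereas summing the densities does not give the density of the merged territory.

```python
# Hypothetical units: population (absolute "stock" data) and area in km2.
units = {
    "U1": {"pop": 1000, "area": 10},
    "U2": {"pop": 3000, "area": 100},
}


def density(pop, area):
    """Relative data: a ratio between two absolute quantities."""
    return pop / area


# Absolute data can be summed: the totals describe the merged territory.
total_pop = sum(u["pop"] for u in units.values())
total_area = sum(u["area"] for u in units.values())

# Relative data cannot: the sum of the two densities describes nothing.
densities = [density(u["pop"], u["area"]) for u in units.values()]
sum_of_densities = sum(densities)

# The density of the merged territory must be recomputed from the totals.
merged_density = density(total_pop, total_area)
```

Here the two densities are 100 and 30 inhabitants per km2, their sum is 130, and the density of the merged territory is about 36: a relative indicator has to be recalculated from its absolute components whenever units are aggregated.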

Figure 2.02   Characterizing data.

#### 2.3  Data Processing

The statistical information contained in a data table cannot always be mapped directly. Most often, the information needs to be converted and collated, or reduced to render it intelligible (creation of indices, typologies, or classifications, etc.). The task is to order the information and to retain only what is useful for the cartographic representation. This work on the data, which is an integral part of cartographic construction, would merit a book in itself. We suggest readers refer to the bibliography to find the relevant references. Here, we discuss only what is essential in a book about cartography and map design. Complementary or interpretative information should be sought in the references provided.

In cartography, the graphic transcription of data cannot always be direct, since this could result in an unmanageable, illegible map. The data that always needs to be simplified is relative quantitative data (rates, indices, etc.), which needs to be subdivided into classes of values. This procedure, known as discretization, is based on specific methods and characteristic values. The aim is to simplify the statistical series observed. To do this, there are several stages.

#### 2.3.1  Summarizing

Reviewing and summarizing a statistical series is a way of becoming acquainted with it: what are the minimum and maximum values observed? How was the phenomenon measured? Is the value derived from a calculation? What calculation? Can the set of values be expressed by one or several characteristic values? Can spatial or statistical comparisons be made?

#### 2.3.1.1  Position Parameters

Position parameters make it possible to sum up a statistical series in a single value. This can be formed by a specific value or by a value that is considered to be “central”.

Specific values (minimum, maximum, or any other) are considered as being representative of a domain (for instance, the number of children required to renew the population), or they are fixed by a law or regulation (for instance, the occupation coefficient fixed by an urbanism document). Central values are calculated or determined from all the values in a series. There are three central values: the mean, the median, and the mode. The choice of the best central value depends on both the objective of the summary and the shape of the distribution.

The mean value indicated x̄ (x bar) is the simplest statistical value expressing the magnitude of a statistical series. It is the sum of the values divided by the number of statistical units observed (or in cartography the number of geographical units). The mean is the gravitational center of the distribution: the sum of the deviations from the mean value is zero.

The median (indicated Q2) is the value that divides a statistical series into two parts comprising equal numbers. In other words, half of the values are above the median and the other half below. The median is the value that is nearest to all the values in the distribution, in the sense that it minimizes the sum of the absolute deviations.

The mode (or dominant value) is the most frequent value in a distribution. It is found by counting how often each value occurs. A distribution can be unimodal (a single mode) or multimodal (several modes). In the latter case, it is usual to distinguish a main mode and one or several secondary modes.
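The three central values described above can be computed with Python's standard `statistics` module. The series below is made up for illustration; it also checks the property that deviations from the mean sum to zero.

```python
from statistics import mean, median, multimode

# Made-up series with one very high value.
values = [2, 3, 3, 4, 5, 7, 18]

m = mean(values)            # sum of the values divided by their number
q2 = median(values)         # half of the values below, half above
modes = multimode(values)   # most frequent value(s); several if multimodal

# The mean is the center of gravity: deviations from it cancel out.
deviation_sum = sum(v - m for v in values)
```

In this series the single high value pulls the mean (6) above the median (4) and the mode (3), which is the situation in which the choice of central value starts to matter.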

#### Definition

Unit of Measurement and Order of Magnitude

The unit of measurement is the unit serving to count or calculate the values in a series, for instance, the number of inhabitants per square kilometer for a population density, the percentage for urbanization rate, the hectare for a surface area, etc. Knowing the measurement unit enables the data on which one is working to be understood, and makes it possible to ascertain whether comparisons are possible with other data.

The order of magnitude determines the variation or the extent of a series. It is provided by the minimum and maximum values in the series observed. It offers important information about the meaning of the data and the limitations for comparison with other data.

#### 2.3.1.2  Dispersion Parameters

The notion of dispersion refers to the degree to which values in a distribution spread out or scatter one in relation to another or on either side of a central value. The assessment of dispersion is always linked to a central value. It indicates how far values in a distribution generally deviate or diverge from the reference central value.

The standard deviation (indicated σ) is an absolute dispersion parameter linked to the mean. To calculate it, an intermediate calculation is required: the variance. The variance is the mean of the squared deviations from the average value, and the standard deviation is its square root. The standard deviation has a probabilistic meaning. Probability theory enables the estimation of the likelihood of a value to be distant from the mean by more than a certain number of standard deviations. Indeed, when a distribution is Gaussian (also referred to as “normal” and characteristic of symmetrical distributions), the probability of finding values at a given distance from the mean is known. This property is very useful in cartography because it enables a rational subdivision of values in a distribution.
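A short Python sketch of the calculation, on a made-up series. It assumes the spatial units form the whole population of interest, so the population formulas (`pvariance`, `pstdev`) are used rather than the sample versions.

```python
from statistics import fmean, pstdev, pvariance

# Made-up series.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mu = fmean(values)             # mean
variance = pvariance(values)   # mean of the squared deviations from the mean
sigma = pstdev(values)         # square root of the variance

# Share of the units lying within one standard deviation of the mean.
within_one_sigma = sum(1 for v in values if abs(v - mu) <= sigma) / len(values)
```

For a Gaussian distribution, roughly 68% of values lie within one standard deviation of the mean and roughly 95% within two; comparing the observed share with these figures is one quick way to judge how close a series is to the normal case.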

Figure 2.03   Normal distribution.

The interquantile interval is an absolute dispersion parameter linked to the median. It is defined as the extent of the distribution concentrating the central elements, those that differ the least from the median. A chosen percentage of the highest and lowest values in the distribution are excluded. This parameter is linked to the notion of the quantile, which defines the limits of a subdivision into classes with equal numbers. Thus, different intervals are described according to the desired subdivision, into 4 (quartiles), into 5 (quintiles), or into 10 (deciles), etc. For instance, the interquartile interval is the part of the distribution concentrating half of the elements for which the values are the least different from the median. Thus, 25% of the lowest values and 25% of the highest values are excluded from the distribution.
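The interquartile case can be sketched in Python as follows. The series is made up, and the `"inclusive"` interpolation convention of `statistics.quantiles` is an assumption: other conventions give slightly different quartiles on small series.

```python
from statistics import quantiles

# Made-up series.
values = [1, 2, 4, 5, 6, 8, 9, 12, 20]

# Three cut points split the series into four groups of (nearly) equal size.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")

# Interquartile interval: extent of the central half of the distribution.
iqr = q3 - q1

# Units kept once the lowest and highest quarters are set aside.
central = [v for v in values if q1 <= v <= q3]
```

Unlike the full range (1 to 20 here), the interquartile interval is not affected by the isolated maximum, which is why it is paired with the median rather than the mean.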

The comparison of the absolute dispersion parameters of two characteristics is only meaningful if these characteristics are of the same order of magnitude. For instance, two standard deviations can only be compared if the distributions have similar means; otherwise, the comparison mostly reflects the difference in magnitude (a structure effect). When the means differ, the comparison is only possible by resorting to measures of relative dispersion.

A relative dispersion parameter is a measure of the relative deviation of values in a distribution in relation to a central value. It corresponds to an absolute dispersion parameter divided by a central value. One thus obtains a number with no dimension (the mean differences, i.e., differences in order of magnitude, have been removed). The most common relative dispersion values are the coefficient of variation (CV) = standard deviation/mean and the relative interquartile coefficient = (3rd quartile – 1st quartile)/median, or (Q3 – Q1)/Q2.
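The two coefficients can be sketched in Python on two made-up series that have the same shape but different orders of magnitude. The function names are hypothetical, and the quartiles again rely on the `"inclusive"` convention as an assumption.

```python
from statistics import fmean, pstdev, quantiles


def coefficient_of_variation(values):
    """Standard deviation divided by the mean (dimensionless)."""
    return pstdev(values) / fmean(values)


def relative_interquartile_coefficient(values):
    """(Q3 - Q1) / Q2, with quartiles from the 'inclusive' convention."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (q3 - q1) / q2


# Same shape, orders of magnitude a factor of 100 apart.
small = [10, 20, 30, 40, 50]
large = [1000, 2000, 3000, 4000, 5000]

sigma_small, sigma_large = pstdev(small), pstdev(large)  # not comparable as such
cv_small = coefficient_of_variation(small)
cv_large = coefficient_of_variation(large)               # same relative dispersion
```

The standard deviations differ by a factor of 100 while the coefficients of variation coincide, which illustrates why relative measures are needed when the means differ.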

Figure 2.04   Statistical and geographical dispersions.

#### 2.3.2  Analyzing

The next stage is an understanding of the intrinsic characteristics of a distribution by exploring its shape and the dispersion of values. These two elements enable a subdivision into classes that are suited to the dispersion observed, and thus enable the right cartographic choices. The shape of the distribution can be determined from the observation of the distribution diagram or from the comparisons of the central values.

#### 2.3.2.1  The Distribution Diagram

This enables the values in a series to be positioned along an axis that is oriented and graduated. The concentration of values on this axis reflects the concentration or dispersion of values.

If the values are clustered around a single concentration zone, the distribution is said to be unimodal. If the values are grouped around the mean value, the distribution is symmetrical. If they are concentrated around the low values, the distribution is asymmetrical, with its bulk to the left and a tail stretching towards the high values (in standard statistical terminology, positive or right skew). If they are concentrated around the high values, the bulk lies to the right and the tail stretches towards the low values (negative or left skew).

Figure 2.05   Position, dispersion, histograms, and frequencies.

If the values present two or several concentration zones, the distribution is skewed and bimodal or multimodal. In this case, the mean is not an appropriate means of summarizing, since it may well “fall” within a dispersion zone.

When the values to be observed are too numerous, their distribution can also be observed in aggregated form. This operation involves a calculation of the frequency of values within classes of equal extent. The shape thus formed by the height of the different bars (histogram) reflects the distribution of the values.
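The frequency calculation behind a histogram can be sketched in a few lines of Python. The function name and the series are hypothetical, and the sketch assumes the series is not constant (the maximum must exceed the minimum).

```python
def frequency_table(values, n_classes):
    """Count the values falling in each of n_classes classes of equal extent."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_classes  # assumes hi > lo
    counts = [0] * n_classes
    for v in values:
        # The maximum is placed in the last class instead of opening a new one.
        idx = min(int((v - lo) / width), n_classes - 1)
        counts[idx] += 1
    return counts


# Made-up series concentrated around the low values.
values = [1, 2, 2, 3, 3, 3, 4, 6, 9, 11]
counts = frequency_table(values, 5)

# Text histogram: one bar per class, bar length = frequency.
bars = ["#" * c for c in counts]
```

Printing `bars` line by line gives a rough picture of the shape: here most units fall in the first two classes and an empty class separates the highest values from the rest.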

#### 2.3.2.2  Comparing Central Values

If the three central values (mode, median, and mean) are equal or nearly equal, the distribution is said to be symmetrical (or unskewed). All the central values are relevant summaries.

If the central values differ widely from one another, the distribution is asymmetrical or skewed. The mean is drawn towards the zone of dispersion; the mode is drawn towards the concentration of values. In this case, it is the median that provides the best compromise.

If the distribution presents no mode, the comparison can be made, on the same principle, comparing only the mean and the median values.
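The comparison of mean and median can be turned into a rough diagnostic. The Python sketch below is only an illustration: the function name is hypothetical, and the tolerance (5% of the range) is an arbitrary assumption, not a threshold given in the text.

```python
from statistics import mean, median


def describe_shape(values, tolerance=0.05):
    """Rough shape diagnosis from the gap between mean and median.

    The gap is judged against the range of the series; the 5% default
    is an arbitrary choice made for this sketch.
    """
    m, q2 = mean(values), median(values)
    spread = max(values) - min(values)
    if spread == 0 or abs(m - q2) <= tolerance * spread:
        return "roughly symmetrical"
    if m > q2:
        return "mean above median: tail of high values"
    return "mean below median: tail of low values"


shape_even = describe_shape([1, 2, 3, 4, 5])
shape_tail = describe_shape([1, 1, 2, 2, 3, 20])
```

Such a test cannot detect a bimodal series, for which the distribution diagram remains necessary.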

Figure 2.06   Determining the shape of a distribution from positioning parameters.

If the values present two or several zones of concentration, the distribution is skewed and bimodal or multimodal. Neither the mean nor the median is a suitable way to summarize the values, since they often fall within a dispersion zone. The principal mode, alongside the secondary mode, provides the best suited means of summarizing a bimodal distribution.

#### 2.3.3  Determining Class Intervals

Discretization consists in subdividing a statistical series into classes of values. This operation needs to take account of the different characteristics of the distribution. It is part of the prior processing of information, the aim of which is a simplification of the information with a view to analyzing and/or representing it.

#### Focus: Principles of Data Classification

Principle no. 1: the classes must be homogeneous and distinct (no overlap and no gap between classes), and the geographical objects in a given class should resemble one another more than they resemble objects in the other classes.

Principle no. 2: the number of classes is necessarily smaller than the number of observations, the classes taken together should cover the whole domain of variation of the characteristic under study, and the classes are ordered. The number of classes always depends on the number of statistical units observed, on the objective, and on the intended future use (with or without a map).

Principle no. 3: the essential characteristics of the distribution should be preserved so as to lose as little information as possible. Three dimensions of the data series should be taken into account: the order of magnitude, the dispersion, and the shape of the distribution.

Principle no. 4: to facilitate the reading of the map, it is recommended to round off the boundary values for the classes, and if possible to use boundary values linked to relevant orders of magnitude. These boundary values should be easy to read and memorize.

Figure 2.07   How many classes?

There is no optimum classification. Each method will yield a different map, reflecting the actual distribution more or less efficiently. The aggregation of data into classes, in other words the reduction of the useful information, introduces an error or distortion in the perception of this distribution. In addition, the distribution pattern affects the choice of a method of classification.

This choice of a method of classification depends on the properties of the distribution and also on the cartographic objectives fixed. For instance, to put emphasis on very high values, the tendency will be to place them in a category apart so as to individualize them. Conversely, to represent a homogeneous image of a territory, the tendency will be to choose a small number of classes.

Figure 2.08   Types of distribution and classification.

Ultimately, the classification procedure is guided by two sometimes contradictory objectives, which the cartographer will have to weigh up: preserving the characteristics of the statistical distribution so that the data will not be misleading, but at the same time allowing leeway to enable the delivery of an efficient cartographic message. When choosing a method of classification, the aim is thus at once to reflect the statistical series as accurately as possible, to give meaning to the different classes, to facilitate memorization, and to produce a clear message. A compromise between the statistics and the requirements of cartography is thus required. Choices must be made, but excessive manipulation should be avoided.

#### 2.3.3.1  Classification Methods

The observed threshold (also called natural threshold) method is conducted visually on the graphic representation of the distribution (distribution diagram), by identifying each “trough” or “hump” to define the class boundaries. This manual method enables a focus on discontinuities in a statistical series. The numbers in the different classes can be markedly unequal, and the subdivisions are subjective. As a result, this method cannot be used to compare two distributions.

Figure 2.09   Method #1 observed thresholds.

The equal amplitude method is constructed by dividing up the extent of a statistical series (min/max) into the desired number of classes. This method, which yields simple, easily apprehended thresholds, is used with even or symmetrical distributions. It should be avoided for strongly skewed distributions. This method does not enable the comparison of several maps.

Figure 2.10   Method #2 equal amplitude.
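The equal amplitude method described above can be sketched in Python as follows. The function name and the series are hypothetical.

```python
def equal_amplitude_breaks(values, n_classes):
    """Class boundaries dividing the range (min to max) into equal steps."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_classes
    return [lo + i * step for i in range(n_classes + 1)]


# Made-up series ranging from 2 to 22.
values = [2, 5, 7, 12, 18, 22]
breaks = equal_amplitude_breaks(values, 4)
```

With four classes the range of 20 is cut into steps of 5, giving boundaries at 2, 7, 12, 17, and 22. Because the boundaries depend only on the minimum and maximum, a single extreme value can leave most units crowded into one class, which is the reason for avoiding the method on strongly skewed series.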

The equal numbers (or quantile) method constructs classes in which there is the same number of statistical units. The classes formed in this way are known as quantiles. When there are 4 classes, the term used is quartiles (one quarter of the total number in each class), when there are 10 classes, they are known as deciles, and for 100 classes, they will be centiles. This classification method can be used with any type of distribution, and it enables the comparison of maps one to another.

Figure 2.11   Method #3 equal numbers.
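A minimal Python sketch of the equal numbers method, on a made-up series of eight units. The function names are hypothetical, and the `"inclusive"` interpolation convention of `statistics.quantiles` is an assumption; other conventions place the boundaries slightly differently.

```python
from statistics import quantiles


def quantile_breaks(values, n_classes):
    """Class boundaries such that each class holds about the same number of units."""
    inner = quantiles(values, n=n_classes, method="inclusive")
    return [min(values)] + inner + [max(values)]


def class_counts(values, breaks):
    """Number of units per class; each class includes its lower boundary,
    and the last class also includes the maximum."""
    counts = [0] * (len(breaks) - 1)
    for v in values:
        idx = sum(1 for b in breaks[1:-1] if v >= b)
        counts[idx] += 1
    return counts


# Made-up series of eight units, cut into quartiles.
values = [1, 2, 4, 5, 6, 8, 9, 12]
breaks = quantile_breaks(values, 4)
counts = class_counts(values, breaks)
```

Each of the four classes holds two units here. When the number of units is not a multiple of the number of classes, or when values are tied, the counts can only be approximately equal.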

The standardized classification method uses significant values (mean and standard deviation). The value of the mean can appear, depending on the purpose of the map, either as a class boundary or as the center of a class. This type of classification is ideal for symmetrical (normal, Gaussian) distributions and should be used solely in this instance. When the distribution is skewed, it is preferable to use another method.

Figure 2.12   Method #4 standardized classification.
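The standardized classification can be sketched in Python as below. The function name, the one-standard-deviation class width, and the choice to drop boundaries falling outside the observed range are assumptions made for this illustration; the flag switches between the mean as a class boundary and the mean as the center of a class.

```python
from statistics import fmean, pstdev


def standardized_breaks(values, mean_as_boundary=True):
    """Class boundaries at fixed numbers of standard deviations from the mean.

    mean_as_boundary=True  -> boundaries at mean - 1s, mean, mean + 1s
    mean_as_boundary=False -> boundaries at mean +/- 0.5s and mean +/- 1.5s,
                              so that the mean sits in the middle of a class
    Boundaries outside the observed range are dropped.
    """
    mu, sigma = fmean(values), pstdev(values)
    offsets = [-1, 0, 1] if mean_as_boundary else [-1.5, -0.5, 0.5, 1.5]
    lo, hi = min(values), max(values)
    inner = [mu + k * sigma for k in offsets if lo < mu + k * sigma < hi]
    return [lo] + inner + [hi]


# Made-up series with mean 5 and standard deviation 2.
values = [2, 4, 4, 4, 5, 5, 7, 9]
breaks_boundary = standardized_breaks(values)
breaks_center = standardized_breaks(values, mean_as_boundary=False)
```

With the mean as a boundary the classes are cut at 3, 5, and 7; with the mean as a class center they are cut at 4, 6, and 8, so that the class from 4 to 6 is centered on the mean.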

The geometric progression method is suited to highly skewed distributions. It consists in constructing classes whose extent increases (or decreases) with each class, which allows the classes to follow the shape of the statistical series closely. The task is to find a number (common ratio for growth) which by multiplication will give the amplitude of the class. This method assumes that the minimum is greater than 0.

Figure 2.13   Method #5 geometric progression.
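One common way of implementing the geometric progression method is sketched below in Python. It assumes the boundaries themselves follow the progression, with common ratio r = (max/min)^(1/n) for n classes, so that class amplitudes also grow by the factor r. The function name and the series are hypothetical, and other formulations of the method exist.

```python
def geometric_breaks(values, n_classes):
    """Class boundaries lo, lo*r, lo*r**2, ... up to the maximum."""
    lo, hi = min(values), max(values)
    if lo <= 0:
        raise ValueError("geometric progression needs a strictly positive minimum")
    ratio = (hi / lo) ** (1 / n_classes)
    breaks = [lo * ratio ** i for i in range(n_classes)]
    breaks.append(hi)  # pin the last boundary to the observed maximum
    return breaks


# Made-up, strongly skewed series ranging from 1 to 81.
values = [1, 2, 2, 3, 5, 8, 13, 30, 81]
breaks = geometric_breaks(values, 4)
```

With four classes between 1 and 81 the ratio is 3, so the boundaries fall at about 1, 3, 9, 27, and 81: the classes are narrow where the values are dense and wide where they are sparse.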

The Jenks (or Fisher) method is an automatic classification method based on the principle of resemblance or non-resemblance among individuals. The method functions via iteration. It groups together the individuals that most resemble one another and separates those that least resemble one another. In statistical terms, the method aims to minimize intra-class variance and to maximize inter-class variance. This method is available in most software and is a good “general” method adapted to all types of distribution. It should be noted, however, that it does not enable comparisons of maps one to another.
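The objective of minimizing intra-class variance can be illustrated with the Python sketch below. It is not the routine found in any particular software: instead of iterating, it searches exhaustively with dynamic programming for the contiguous partition of the sorted values with the smallest total within-class sum of squared deviations. The function name and the series are hypothetical.

```python
def fisher_jenks_classes(values, n_classes):
    """Split sorted values into contiguous classes minimizing the total
    within-class sum of squared deviations (exact dynamic programming)."""
    data = sorted(values)
    n = len(data)
    if not 1 <= n_classes <= n:
        raise ValueError("n_classes must be between 1 and len(values)")

    # Prefix sums give the squared deviation of any slice in constant time.
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, v in enumerate(data):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def ssd(i, j):
        """Sum of squared deviations from the mean for data[i:j]."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[k][j]: best cost of splitting data[:j] into k classes.
    cost = [[inf] * (n + 1) for _ in range(n_classes + 1)]
    start = [[0] * (n + 1) for _ in range(n_classes + 1)]
    cost[0][0] = 0.0
    for k in range(1, n_classes + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = cost[k - 1][i] + ssd(i, j)
                if c < cost[k][j]:
                    cost[k][j] = c
                    start[k][j] = i  # last class is data[i:j]

    # Walk back through the table to recover where each class starts.
    classes = []
    j = n
    for k in range(n_classes, 0, -1):
        i = start[k][j]
        classes.append(data[i:j])
        j = i
    classes.reverse()
    return classes


# Made-up series with three visible groups.
values = [1, 2, 3, 10, 11, 12, 50, 52]
classes = fisher_jenks_classes(values, 3)
```

On this series the three classes coincide with the three visible groups. The cost grows with the square of the number of units, so this exhaustive version is only practical for small series.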

Figure 2.14   Same data and several maps.

#### Focus: Dividing Qualitative Data into Classes

The subdivision of qualitative information into classes consists in a predefined, straightforward grouping of elements in smaller numbers so as to obtain a reasoned typology or classification. The operation is governed by the same principles as above for quantitative information. If the information relates to an order (ordinal qualitative data) the hierarchy of the information must be strictly complied with. If the information is mainly nominal, the information is grouped according to resemblance to form a typology. The formation of the classes is then specific to the objectives of the chosen simplification.

Figure 2.15   Grouping qualitative data.

Corine Land Cover is a land-use database produced by the European Environment Agency. It covers 38 European states and provides a nomenclature according to three hierarchical levels (44 types in level 3, 15 in level 2, and 5 in level 1).

#### 2.4  Can Data Be Trusted?

“Geographical data are not supplied by God, but by a given geographer who, not content to apprehend the data on a certain scale, has also chosen and ordered the elements in the dataset; another geographer studying the same region or addressing the same issue on another scale will probably come up with rather different data” (Lacoste, 1976).

Constructing statistical data and making it available was for a long time the exclusive domain of nation states. Data, a central component of national sovereignty, was often kept secret, and states only issued information in accordance with their strategic interests. But today, nation states are not the only official providers of data. At the international level, numerous bodies collect or harmonize statistical data on different subjects of variable complexity. Thus, the OECD (Organisation for Economic Co-operation and Development), the UN (United Nations), the IMF (International Monetary Fund), Eurostat, the World Bank, and even the CIA (Central Intelligence Agency) produce official data each year on different countries across the world. Whatever the scale, it is also possible to create one’s own statistical data from surveys or measurements in the field.

Thus, data suppliers can involve a whole range of agents, both public and private. What matters is knowing who constructed the dataset and who is circulating it. This is never innocent. Your analyses and results will be validated within the scope of the framework in which the data was developed and circulated. It is therefore always important to refer to the metadata.

#### Definition

Metadata is data that describes other data: data about the data. It is a marker that can be applied to any type of resource, enabling its description: where does it come from, how was it created, and by whom? In fact, it ensures the traceability and the quality of the data (authenticating and assessing the data or the source). It also facilitates the search for information by describing its content, thus improving its referencing. It favors interoperability by way of data sharing and exchange, improves data management and storage, and helps to manage and protect intellectual property rights. Statistical data cannot do without metadata.

A lot of data is used because it covers the desired space and topic exhaustively. For instance, data can answer a simple question such as “what are the unemployment trends in the different European regions?” In this setting, it can seem of little importance what methods were used to collect the data and collate it, since the data required is available and official. Yet it is essential to remain critical with regard to the data so as to avoid misinterpretation or mistaken conclusions. This means that reflection is required beforehand on the intrinsic quality of the data used (accuracy, credibility, objectivity), on the context in which it is published (survey, census, GIS procedure, estimate, etc.) and the meaning (complex indicators, typologies, etc.). A sound approach requires the data supplier to be considered – who supplied the data? What other data does the supplier produce? For what purpose? What is the data intended to tell us?

It is also important to fully understand that any figure or value is the result of a construction. Statistical data used to produce maps is not a set of objective measures of a reality, but indicators enabling that reality to be approached, according to a specific construction and a certain aspect of that construction. What is behind the figures? Does a low unemployment rate in a country necessarily mean that access to jobs is easier than in a country with a high rate of unemployment? And even if the unemployment rates are similar, are the lives of individuals who are unemployed comparable? How precarious is employment? What access do workers have to healthcare? What are their rights? What is their place in society? And, in some countries, are there not people who are not accounted for by the official figures (on sick leave or after being struck off unemployment benefits)? In the words of Mark Twain (1835–1910), “Facts are stubborn things, but statistics are more pliable”.

#### Focus: What Do Data Tell Us?

Data is not innocent. Data contains a message. While relative quantitative data will tend to refer to a hierarchical and ordered view of geographical space, absolute data is in contrast closely linked to the notions of power and power balance.

For instance, while the GDP per capita of China ranks the country 120th in the world (according to the 2014 CIA World Factbook), the absolute value of its GDP ranks China 2nd among the main economic players internationally. Similarly, while it is useful to know the size of the armed forces per inhabitant when comparing two countries, it is equally useful to know the absolute size of the armed forces (or their military equipment), since in case of actual conflict, it is the numbers that will matter.

The choice between absolute data and relative data thus involves two different views of the world: one that is more ordered, and another that is more conflictual. Thus, producing a map means taking account of what the data has to tell us.

#### Quiz

• What are the two parameters that must be taken into account when data is processed?
• What is a distribution diagram?
• What is the mean? The median? The standard deviation?
• How can a symmetrical statistical series be discretized? And an asymmetrical statistical series?