Exploratory data analysis with each data type (part 1)
In this article, I will share examples of how to explore empirical data using statistics and different data visualizations. This article is a continuation of the previous Data Types article.
There are four examples:
- Nominal data example.
- Ordinal data example.
- Interval data example.
- Ratio data example.
The nominal and ordinal examples have two sections, data and visualization; the statistical section is combined with the data section. For the interval and ratio examples, the statistical section is separate due to its significance.
Let’s get started
The table below displays the number of flags containing a particular color in 2012:
The data in the table are clearly nominal since I can change the order of any line without altering the meaning of the data. I am using the frequency to describe the data statistically.
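To make the frequency idea concrete, here is a minimal Python sketch. The color list is a made-up sample, not the 2012 flag data, and `frequency_table` is a name I chose for illustration:

```python
from collections import Counter

# Hypothetical sample: one entry per (flag, color) pair.
# These are NOT the 2012 flag figures from the table.
flag_colors = ["red", "white", "blue", "red", "green", "white", "red", "yellow"]


def frequency_table(values):
    """Return (category, count, percent) rows, most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [
        (category, count, round(100 * count / total, 1))
        for category, count in counts.most_common()
    ]


for category, count, percent in frequency_table(flag_colors):
    print(f"{category:<8}{count:>3}{percent:>7}%")
```

The counts are what a bar chart would plot, and the percentages are what a pie or donut chart would plot. Sorting by count is only a presentation choice, since nominal categories have no order of their own.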
Column/bar, pie, and donut charts are often used to visualize nominal data. The bar chart below visualizes the number of flags that use each color (frequency), while the donut chart displays the percentage of each color across all the flags (links to the live bar demo and pie demo):
From these charts, red and white are the most used colors on flags. Red is often seen as the color of revolution and war, while white is the symbol of peace and harmony.
The least used colors are gray and purple, probably because they have a specific meaning for some countries instead of having a universal meaning.
The table below shows the educational degrees of Stack Overflow users (2019 survey):
Educational degrees are ordinal data because they have an order, but the distance between consecutive degrees is not the same. For example, a bachelor’s degree takes about 3 to 4 years, whereas a master’s degree takes 1 to 2 years.
As with the nominal data, I use the frequency as a statistical tool, this time counting the responses for each degree while keeping the degrees in their natural order.
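Here is a small sketch of a frequency count that respects the order of the categories. The degree labels, their ordering, and the responses are illustrative assumptions, not the 2019 survey figures:

```python
from collections import Counter

# Assumed ordering of degree levels, lowest to highest (illustrative labels).
DEGREE_ORDER = ["Secondary school", "Bachelor's", "Master's", "Doctorate"]

# Hypothetical responses, NOT the survey data.
responses = [
    "Bachelor's", "Master's", "Bachelor's", "Secondary school",
    "Doctorate", "Bachelor's", "Master's",
]


def ordered_frequencies(values, order):
    """Return (level, count, cumulative_count) rows in the given order."""
    counts = Counter(values)
    rows, running = [], 0
    for level in order:
        running += counts[level]  # a Counter returns 0 for missing levels
        rows.append((level, counts[level], running))
    return rows


for level, count, cumulative in ordered_frequencies(responses, DEGREE_ORDER):
    print(f"{level:<18}{count:>3}{cumulative:>4}")
```

The cumulative column only makes sense because the categories are ordered; with nominal data such as flag colors it would have no meaning.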
From both charts, it is clear that the bachelor’s degree is the most dominant educational degree among StackOverflow users, followed by a master’s degree.
Temperature is classified as interval data: the unit is the same across the whole scale, but ratios have no meaning, which means 4° is not four times as warm as 1°.
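A quick way to see why ratios are meaningless on an interval scale is to change the unit. The sketch below assumes the readings are in Celsius and converts them to Fahrenheit with the standard formula:

```python
def celsius_to_fahrenheit(celsius):
    """Convert a Celsius reading to Fahrenheit."""
    return celsius * 9 / 5 + 32


# Ratio of the two readings on each scale.
ratio_c = 4 / 1
ratio_f = celsius_to_fahrenheit(4) / celsius_to_fahrenheit(1)

# Difference between the same two readings on each scale.
diff_c = 4 - 1
diff_f = celsius_to_fahrenheit(4) - celsius_to_fahrenheit(1)

print(f"ratio in Celsius:    {ratio_c:.2f}")
print(f"ratio in Fahrenheit: {ratio_f:.2f}")
print(f"difference: {diff_c} °C = {diff_f:.1f} °F")
```

The ratio drops from 4 to roughly 1.16 when only the unit changes, so "four times" says nothing about the temperature itself. A 3° difference, on the other hand, always maps to the same 5.4 °F, which is why differences are meaningful on an interval scale.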
Let’s explore the table using the statistical tools for interval data:
Histograms and box plots are often used to explore interval data sets because they display the statistical results visually. Other chart types are also used to visualize interval data. For example, as the temperature is continuous data, a line chart is an appropriate chart (see chart below). If I have discrete data, I will probably choose a bar or a column chart:
This line chart displays the average temperature, but it doesn’t allow us to get valuable information about the distribution of the data.
The chart below is a histogram. The data in the histogram are divided into class intervals of 5°.
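The grouping into class intervals can be sketched as follows. The readings are hypothetical, and starting the first class at -6° is an assumption I make so that the classes line up with the ranges discussed in this article:

```python
import math
from collections import Counter

# Hypothetical monthly averages in degrees, NOT the Halifax data.
temperatures = [-5.2, -3.0, -1.0, 2.5, 6.1, 8.9, 9.0, 15.4, 18.2]


def histogram_counts(values, start, width):
    """Count values per class interval [lower, lower + width)."""
    counts = Counter()
    for value in values:
        index = math.floor((value - start) / width)
        lower = start + index * width  # lower bound of the class
        counts[lower] += 1
    return dict(sorted(counts.items()))


for lower, count in histogram_counts(temperatures, start=-6, width=5).items():
    print(f"{lower:>4}° to {lower + 5:>3}°: {'#' * count}")
```

Each class includes its lower bound and excludes its upper bound, so a reading of exactly 9° falls in the 9° to 14° class. Charting libraries may use a different convention for values that sit on a boundary.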
There are four main points to describe distributions using a histogram chart: shape, outliers, center, and spread, also referred to as SOCS (nice way to remember it ¯\_(ツ)_/¯ )
Knowing the shape of the distribution helps us to choose the right statistical tools.
In this example, the histogram has a symmetric shape and represents a multimodal distribution, as it displays three peaks. The first peak is in the range from -6° to -1°, the second from 4° to 9°, and the last from 14° to 19°. It looks like the town of Halifax has longer winters and summers, and shorter autumns and springs. The high peaks at the extremities could also point to an issue with the device(s) used to collect the data.
Outliers, like the shape, guide us toward the right statistical tools. If a data set doesn’t have outliers and the shape of the distribution is symmetric, then I use the mean and the standard deviation; otherwise, the median and IQR are used to describe the data set.
Any value less than Q1 - 1.5 × IQR is called a lower outlier, and any value above Q3 + 1.5 × IQR is called an upper outlier (see table below):
It looks like there are no outliers in this data set because all the values lie between the lower and the upper outlier limits.
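The outlier limits can be computed in a few lines. In this sketch, the Halifax quartiles (Q1 = -1.75°, Q3 = 14.32°) are the ones quoted in the box plot comparison at the end of this article; the list of readings is hypothetical and only there to show the filter working:

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier limits for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, q1, q3):
    """Return the values that fall outside the outlier limits."""
    lower, upper = outlier_fences(q1, q3)
    return [v for v in values if v < lower or v > upper]


# Halifax quartiles as quoted in this article.
lower, upper = outlier_fences(-1.75, 14.32)
print(f"lower limit: {lower:.3f}°, upper limit: {upper:.3f}°")

# Hypothetical readings, NOT the Halifax data.
readings = [-30.0, -6.0, 6.3, 19.0, 40.0]
print(find_outliers(readings, -1.75, 14.32))
```

With an IQR of 16.07°, the limits sit at about -25.86° and 38.43°, so only extreme monthly averages would be flagged.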
3. Center (average or median)
By knowing the central tendency, we can get a good data set representation using only a single value, such as the mean and the median.
Now, I know that the shape of the histogram is symmetric, and there are no outliers; in this case, I use the mean to describe the center. The mean is equal to 6.29°; in other words, the average temperature of Halifax is around 6.29°.
The main purpose of the spread is to check how spread out our data is. For that, I can use either the standard deviation or the IQR. I use the standard deviation because my distribution is symmetric with no outliers. The standard deviation, which is the typical distance from the mean, is equal to 9.05°.
Use the mean and the standard deviation if the distribution has a symmetric shape with no outliers; use the median and IQR for any other distribution, such as a skewed distribution.
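The rule above can be written as a small helper. This is only a sketch under a few assumptions: whether the shape is symmetric is passed in by hand (I read it off the histogram), the quartiles use the "inclusive" method of Python's `statistics.quantiles`, and `statistics.stdev` is the sample standard deviation, which may not be the variant behind the 9.05° figure:

```python
import statistics


def summarize(values, symmetric):
    """Pick the center and spread measures that suit the distribution."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    has_outliers = any(
        v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr for v in values
    )
    if symmetric and not has_outliers:
        return {
            "tools": "mean and standard deviation",
            "center": statistics.mean(values),
            "spread": statistics.stdev(values),
        }
    return {
        "tools": "median and IQR",
        "center": statistics.median(values),
        "spread": iqr,
    }


# Hypothetical data sets, NOT the Halifax temperatures.
print(summarize([1, 2, 3, 4, 5], symmetric=True))
print(summarize([1, 2, 3, 4, 100], symmetric=True))
```

The second data set is declared symmetric but contains an outlier (100), so the helper falls back to the median and IQR, which are far less sensitive to a single extreme value.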
The next chart is the box plot chart (link to box plot demo):
The best way to analyze a box plot is by comparing it to another box plot. In the chart below, I am comparing the average temperature of Halifax to the average temperature of The Pas, Manitoba during the same period from 1971 to 200 (link to the box plot demo):
The monthly average temperatures in Halifax are compact compared to The Pas. In Halifax, 50% of the values are between -1.75° (Q1) and +14.32° (Q3), whereas in The Pas, 50% of the values fall in a larger range, from -10.7° (Q1) to +11.2° (Q3); the temperatures in Halifax are more predictable than in The Pas. The average temperature in Halifax (6.29°) is higher than in The Pas (0.11°). Overall, it looks like it is warmer in Halifax, and the temperature is more predictable compared to The Pas.
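To put a number on "compact", the width of each box (the IQR) can be computed from the quartiles quoted above. The helper name is mine, and the rounding to two decimals is only for display:

```python
# Quartiles (Q1, Q3) quoted in the paragraph above, in degrees.
quartiles = {
    "Halifax": (-1.75, 14.32),
    "The Pas": (-10.7, 11.2),
}


def interquartile_range(q1, q3):
    """Width of the middle 50% of the data."""
    return round(q3 - q1, 2)


iqr_by_town = {
    town: interquartile_range(q1, q3) for town, (q1, q3) in quartiles.items()
}

for town, iqr in iqr_by_town.items():
    print(f"{town}: IQR = {iqr}°")
```

The middle half of the Halifax values spans about 16.07°, against about 21.9° for The Pas, which matches the narrower box in the chart.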