We have all heard of the normal distribution (often also referred to as the “Bell Curve” or “Gaussian distribution”), and most people have at least a vague idea of what it is.
A little history
Galileo (XVII century) was the first to have an intuition of such a distribution: he realized that the measurement errors made by the instruments he was using were symmetric, and that small errors occurred more frequently than large ones.
Laplace laid the groundwork for the Central Limit Theorem (which is strictly intertwined with the general concept of normal distribution) in his work on probability in the late XVIII century.
Gauss (and, independently, Adrain) were instead the first to formalize the exact mathematical nature of such a distribution in the early XIX century.
Normal distribution – What is it?
In this post we simply want to illustrate the nature of the Normal distribution in clear terms, and show the main R functions used in relation to this concept.
The Normal distribution is a probability density function, often encountered in nature.
Common examples include:
– height (or weight) of a human population
– marks obtained in an IQ test
– people’s salaries in a nation/region/city.
The measurements related to each of the above-mentioned quantities have something in common:
– they all tend to concentrate around a mean (in other words, the mean tends to be the most frequent value)
– each distribution is symmetrical around its own mean
– the exact shape of each curve is determined by how the individual recorded values are spread around that mean
The general shape of a Normal distribution is then the following:
This makes clear at a glance why it is called the “Bell curve” (notice as well how it is symmetrical around the mean, μ).
Now, while the general shape will be similar for any measurement following a normal distribution, the precise shape of any particular dataset will differ on the basis of two parameters:
– the mean around which each specific distribution is centered
– the general variation of the individual data points around the mean of each specific dataset.
You can think of the latter as how far, on average, data points are from the dataset mean.
The following image (thank you, Wikipedia) should clarify this concept:
The image above makes clearer how different datasets (though all bell-shaped and centered around a mean) differ in their precise shape, depending on how “spread” their values are.
Notice as well how the position of the curve in relation to the axes is different, depending on where the mean of the dataset is positioned.
Without digging into any mathematical technicality, you can see how the two parameters of mean (μ) and standard deviation (σ) are the only variables defining each specific normal distribution in the general Gaussian formula below (everything else is a constant):
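For reference, the formula in question is the standard Gaussian probability density function:

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
```

Here μ and σ are the only free parameters; π and e are constants.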
Knowing and understanding this simple fact paves the way for the next (extremely important) notion linked to the normal distribution, which we’ll see in the next paragraph.
Standard Normal Distribution
As mentioned above, the normal distribution of any dataset/measurement is strictly defined by its specific and unique mean (μ) and standard deviation (σ).
For statistical analysis, though, it would be useful to have a single, uniform function (with well-known characteristics) to facilitate generalized statistical assessments.
That function exists: it is the “Standard normal distribution”, and any (normally distributed) dataset can be mapped onto it through a simple mathematical transformation (which will be described later).
The characteristics of the Standard normal distribution are:
- its mean is zero -> μ=0
- its standard deviation is one -> σ=1
- its values will be distributed as follows:
This standard distribution of values implies that we find:
– 68.2% of values within 1 (positive or negative) standard deviation
– 95.4% within 2 (positive or negative) standard deviations
– 99.7% within 3 (positive or negative) standard deviations
These precise values are used in statistics for many purposes (calculating confidence intervals and testing hypotheses, among others).
That’s done (as mentioned above) through a simple transformation that can be applied to any point of a normal distribution, in order to translate it into a standardized one.
Data points/values can be standardized as follows:
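For reference, the standardization formula is:

```latex
z = \frac{x - \mu}{\sigma}
```

where x is the data point, μ the mean, and σ the standard deviation of the original distribution.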
which gives us what is called the z-score of a data point in standardized terms.
The z-score simply indicates how far a data point is from the mean, measured in standard deviations, on the standard normal distribution.
Knowing the z-score of a data point not only gives us a general idea of where it sits in the distribution (with a simple glance at the graph above), but also allows us to determine with statistical precision the percentile where the point under examination sits.
This can easily be done the “old-fashioned” way, using z-score tables, or with statistical software (such as R).
Useful R functions
Now that we know the main concepts related to the normal distribution, especially in its standardized form, we can have a look at the most common R functions related to this subject.
First of all, you might want to load a dataset. You can do that using a .csv file, for example (or some datasets available online).
In our case, we’ll create it ourselves, as it gives us the chance to show a useful function as well. The function is rnorm().
The following code creates a dataset of 1000 data points, normally distributed, with mean = 20 and standard deviation = 5
(defining such parameters is extremely simple, as shown below):
set.seed(1) # data are randomly generated, but setting a seed will always give the same values
NormDistr1000 <- rnorm(1000, mean=20, sd=5)
Also, the View() function visualizes the whole dataset in a new tab.
Two basic but fundamental functions are the ones needed to calculate the mean and the standard deviation of a dataset.
These functions are (simply) mean() and sd().
Let’s use those on our randomly generated dataset:
mean(NormDistr1000) # 19.94176
sd(NormDistr1000) # 5.174579
We indicated the values obtained in the comments. Since the data are randomly generated, they will obviously not be exactly the ones we set (mean = 20, sd = 5), but they are clearly pretty close.
To standardize a dataset, we can use the scale() function, for example on the dataset we just created above:
StandNormDistr1000 <- scale(NormDistr1000)
Thanks to the last line of code, we can see how each data point has been transformed into its equivalent z-score.
We can as well double check our mean and standard deviation, this time on the freshly created standardized version of our dataset, to verify their values:
mean(StandNormDistr1000) # 9.781455e-17 (~0)
sd(StandNormDistr1000) # 1
As we clearly see, they are basically the ones expected.
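As a cross-check, scale() is doing nothing more than applying the z-score transformation seen earlier. A minimal sketch (the dataset is regenerated with the same seed so the snippet runs on its own; the names x, z_manual and z_scale are ours):

```r
set.seed(1)
x <- rnorm(1000, mean = 20, sd = 5)   # same dataset as above
z_manual <- (x - mean(x)) / sd(x)     # z-score formula applied by hand
z_scale  <- as.numeric(scale(x))      # scale() performs the same transformation
all.equal(z_manual, z_scale)          # TRUE: both methods agree
```
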
We can use the pnorm() function to calculate the probability that a value falls below (or above) a given point, i.e. the cumulative probability.
Using the same parameters used for our randomly generated dataset above, mean=20 and sd=5:
P_POINT15 <- pnorm(15, mean=20, sd=5, lower.tail=TRUE)
# 15.8655% of data points expected to sit to the LEFT of 15
# notice that lower.tail=TRUE is the default value, anyway
# if we wanted instead to see the amount of data points sitting to the RIGHT of our point, we would just set this parameter to lower.tail=FALSE
This means that roughly 15.87% of the data points are expected to sit to the left of the value 15.
Notice that (15 being exactly 1 standard deviation to the left of the mean) this value is consistent with the image already shown above, where roughly 15.9% of the data points are expected to sit to the left of such a point.
The exact same function can be used with a z-score.
As 15 (as we said) is exactly 1 negative standard deviation from the mean:
P_MINUS1SD <- pnorm(-1, lower.tail = TRUE) # 0.158655
which is the exact same value obtained in the case above (as expected).
Notice that we didn’t need to specify the mean and standard deviation here: their default values are mean = 0 and sd = 1, which is exactly what we need when working with z-scores on the standard normal distribution.
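As a further check, the same z-score form of pnorm() reproduces the 68.2% / 95.4% / 99.7% figures mentioned earlier (the variable names are ours):

```r
# proportion of a standard normal distribution within 1, 2 and 3 standard deviations
within_1sd <- pnorm(1) - pnorm(-1)   # ~0.6827
within_2sd <- pnorm(2) - pnorm(-2)   # ~0.9545
within_3sd <- pnorm(3) - pnorm(-3)   # ~0.9973
```
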
In a complementary fashion, we can use the qnorm() function to calculate the quantiles (or percentiles) of a normal variable.
For example, calculating the 25th percentile (or 1st quartile) using the same mean and standard deviation used so far would look like:
qnorm(p=0.25, mean=20, sd=5, lower.tail=TRUE) # 16.62755
This means that the first quartile is expected to sit at x = 16.62755 in our normal distribution with mean = 20 and sd = 5.
In other words, 25% of the values are expected to sit to the left of this value (since lower.tail=TRUE).
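Note that qnorm() is the inverse of pnorm(), which provides an easy sanity check (the names q1 and p1 are ours):

```r
q1 <- qnorm(0.25, mean = 20, sd = 5)   # first quartile, ~16.62755
p1 <- pnorm(q1, mean = 20, sd = 5)     # feeding it back returns 0.25
```
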
Finally, we can plot the density function of (for example) our initial randomly generated dataset. The dnorm() function computes the density values, which we then pass to plot():
dens <- dnorm(NormDistr1000, mean=20, sd=5)
plot(NormDistr1000, dens)
which produces this graph as an output:
Still on the subject of distribution visualization, be aware that the hist() function also lets you visualize your data:
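For example, a histogram of a dataset generated as before (regenerated here with the same seed so the snippet runs on its own) clearly shows the bell shape:

```r
set.seed(1)
x <- rnorm(1000, mean = 20, sd = 5)
h <- hist(x, breaks = 30,
          main = "Simulated normal data",
          xlab = "Value")   # draws the histogram and returns its bins
```
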
We hope that this post has provided a clear explanation of the concepts of normal distribution and standard normal distribution, and an introduction to the functions needed to start tackling these subjects in the R statistical environment.
Thank you for reading.