We have all heard of the **normal distribution** (often also referred to as the **“Bell Curve”** or **“Gaussian Distribution”**), and most people have at least a vague idea of what it is.

#### A little history

**Galileo** (XVII century) was the **first to have an intuition of such a distribution**: he realized that the measurement errors made by the instruments he was using were symmetric, and that small errors occurred more frequently than large ones.

**Laplace defined** the **Central Limit Theorem** in 1778 (a result strictly intertwined with the general concept of the normal distribution).

**Gauss** (and **Adrain**) were instead the **first to formalize** the exact **mathematical nature of such a distribution**, in the early XIX century.

#### Normal distribution – What is it?

In this post we simply want to illustrate the nature of the Normal distribution in clear terms, and show the main R functions used in relation to this concept.

The **Normal distribution** is a **probability distribution** (described by its density function) that is **often encountered in nature**.

**Common examples** include:

– **height** (or **weight**) of a human population

– **scores** obtained in an **IQ test**

– **people’s salaries** in a nation/region/city.

The measurements related to **each of the above-mentioned values have something in common**:

– they all tend to **concentrate around a mean** (in other words: the frequency of the mean value tends to be the highest)

– each of those **distributions is symmetrical in shape** (**around its own mean**)

– their **shape is influenced by how** the individual **values** recorded **are distributed around their mean** (in other words, the *exact shape of each curve is determined by how much its data values vary around the mean*)

The **general shape** of a **Normal distribution** is a symmetric, bell-shaped curve centered on the mean (μ), which makes clear at a glance why it is called the “**Bell curve**”.

Now, while the **general shape** will be **similar for any measurement following a normal distribution**, the **actual precise shape of any particular dataset** will **differ** on the basis of **2 parameters**:

– the **mean** around which **each specific distribution** is **centered**

– the **general variation** of the **individual data points** (around the mean) of each specific dataset.

*You can think about it as “how far” (on average) data points are from the dataset mean (in relative terms).*

The following image (*thank you, Wikipedia*) should clarify this concept:

The image above makes clearer how **different datasets** (though all commonly bell-shaped and centered around a mean) **differ in their precise shape**, depending on **how “spread” their values are**.

Notice as well how the **position of the curve** in relation to the axes is **different, depending on** where the **mean of the dataset** is positioned.

Without digging into any mathematical technicality, you can see how the **2 parameters** of **mean (μ)** and **standard deviation (σ)** are the **only variables defining each specific normal distribution** in the general **Gaussian formula** below (*everything else is a constant*):
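For reference, this is the standard form of the Gaussian density, where μ and σ are the only parameters (x is the variable; π and e are constants):

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$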

Knowing and understanding this simple fact paves the way for the next (extremely important) concept linked to the normal distribution, which we’ll see in the next paragraph.

#### Standard Normal Distribution

As mentioned above, the normal distribution of any dataset/measurement is strictly defined by its specific and unique mean (μ) and standard deviation (σ).

For statistical analysis though, it would be useful to have a **single, uniform function** (*with defined and well-known characteristics*) to **facilitate generalized statistical assessments**.

**That function exists**: it is the **“Standard normal distribution”**, and it *can fit any (normally distributed) dataset through a simple mathematical transformation* (which will be described later).

The **characteristics** of the **Standardized Distribution** are:

- its **mean** is **zero** -> **μ=0**
- its **standard deviation** is **one** -> **σ=1**
- its **values** will be **distributed as follows**:

The **properties** deriving from such a standard distribution of values are:

– **68.2% within 1** (positive or negative) **standard deviation**

– **95.4% within 2** (positive or negative) **standard deviations**

– **99.7% within 3** (positive or negative) **standard deviations**

These precise values are **used in statistics for many purposes** (calculating confidence intervals and testing hypotheses, among others).
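As a quick sanity check (a minimal sketch using R’s pnorm() function, covered in more detail further below), these percentages can be reproduced directly:

```r
# probability mass within k standard deviations of the mean,
# for the standard normal distribution
within_k_sd <- function(k) pnorm(k) - pnorm(-k)
within_k_sd(1)  # 0.6826895
within_k_sd(2)  # 0.9544997
within_k_sd(3)  # 0.9973002
```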

That’s done (as mentioned above) through a **simple transformation** that **can be applied to any point of a normal distribution**, in order **to translate it into a standardized one**.

**Data points/values** **can be standardized as follows**:
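In formula terms, each value x is expressed in standard deviations from the mean:

$$z = \frac{x - \mu}{\sigma}$$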

which gives us what is called the **z-score** of a data point in standardized terms.

The z-score **simply indicates, in standard deviations, how far the data point is from the mean**, *in a standard normal distribution*.

Knowing the z-score of a data point **not only gives us a general idea of where it sits in the distribution** (with a simple glance at the graph above), but also **allows us to infer with statistical precision the related percentile** where the point under exam sits.

This can be easily done (in the “old-fashioned” way) using **z-score tables**, or instead using **statistical software (such as R)**.
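For example (a quick preview of the software route, using the pnorm() function described in the next section), the percentile of a z-score of 1.5 under the standard normal distribution:

```r
# pnorm() returns the cumulative probability up to a given z-score
pnorm(1.5)  # 0.9331928 -> roughly the 93rd percentile
```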

#### Useful R functions

Knowing now the main concepts related to the **normal distribution**, especially in its standardized form, we can have a look at the **most common R functions** *related to this subject*.

First of all, you might want to load a dataset. You can do that using a .csv file, for example (or some datasets available online).

In our case, **we’ll create it ourselves**, as it gives us the chance to show a useful function as well. The function is **rnorm()**.

The following code creates a dataset of **1000 data points**, distributed normally, with **mean=20** and **standard deviation=5** (defining such parameters is extremely simple, as shown below):

```r
set.seed(1)  # data are randomly generated, but setting a seed will always give the same values
NormDistr1000 <- rnorm(1000, mean=20, sd=5)
View(NormDistr1000)
```

Also, the **View()** function **visualizes the whole dataset** in a **new tab**.

**Two basic but fundamental functions** are the ones needed to calculate the **mean** and the **standard deviation** of a dataset.

These functions are (simply) **mean()** and **sd()**.

Let’s use those on our randomly generated dataset:

```r
mean(NormDistr1000)  # 19.94176
sd(NormDistr1000)    # 5.174579
```

We indicated the values obtained as comments on the side. Being **randomly generated**, the data will **obviously not match exactly the parameters we set (mean=20, sd=5)**, but they are clearly **pretty close**.

To **standardize a dataset**, we can use the **scale()** function, for example on the dataset we just created above:

```r
StandNormDistr1000 <- scale(NormDistr1000)
View(StandNormDistr1000)
```

Thanks to the last line of code, we can see how **each data point** has been **transformed into its equivalent z-score**.
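As a side note, scale() with its default settings is equivalent to applying the z-score formula manually. A minimal sketch (recreating the same dataset so the snippet is self-contained):

```r
set.seed(1)
NormDistr1000 <- rnorm(1000, mean = 20, sd = 5)  # same dataset as above

# scale() centers on the sample mean and divides by the sample sd,
# i.e. it computes each data point's z-score
manual_z <- (NormDistr1000 - mean(NormDistr1000)) / sd(NormDistr1000)
all.equal(as.vector(scale(NormDistr1000)), manual_z)  # TRUE
```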

We can as well **double check** our **mean** and **standard deviation**, this time on the freshly created **standardized version of our dataset**, to verify their values:

```r
mean(StandNormDistr1000)  # 9.781455e-17 (~0)
sd(StandNormDistr1000)    # 1
```

As we clearly see, they are **basically the ones expected**.

**Probability calculation**

We can use the **pnorm()** function to calculate the **probability** that a value **falls at or below a defined point** (the cumulative probability).

Using the **same parameters** used for our randomly generated dataset above, **mean=20** and **sd=5**:

```r
P_POINT15 <- pnorm(15, mean=20, sd=5, lower.tail=TRUE)
# 0.158655
# 15.8655% of data points are expected to sit to the LEFT of 15
# notice that lower.tail=TRUE is the default value anyway;
# to see the proportion of data points sitting to the RIGHT
# of our point, we would just set lower.tail=FALSE
```

Which means that **a data point having value=15** is **expected to have 15.86% of the data points sitting to its left**.

**Notice** that (**15 being exactly 1 standard deviation to the left of the mean**) this **value is consistent with the image already shown above**, where **roughly 15.9%** of the **data points** are **expected to sit to the left of such a point**.

The exact same function can be used with a z-score.

As **15** (as we said) is **exactly 1 negative standard deviation** from the mean:

```r
P_MINUS1SD <- pnorm(-1, lower.tail = TRUE)  # 0.158655
```

which is the **exact same value obtained** in the **case above** (**as expected**).

Notice that we **didn’t need to specify the mean and standard deviation as parameters here**, as they are given **default values** of **mean=0** and **sd=1**: we are using a **z-score**, under the assumption of a **standard normal** distribution.
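To make the equivalence explicit (a small sketch reusing the same parameters):

```r
# converting the raw value 15 to its z-score by hand...
z <- (15 - 20) / 5  # -1
# ...gives the same probability whether we query the original
# distribution or the standard normal one
pnorm(15, mean = 20, sd = 5)  # 0.1586553
pnorm(z)                      # 0.1586553
```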

**Percentile calculation**

In a complementary fashion, we can use the **qnorm() function** to calculate the **quantiles** (or **percentiles**) of a normal variable.

For example, calculating the **25th percentile** (or **1st quartile**) using the same mean and standard deviation used so far, would look like:

```r
qnorm(p=0.25, mean=20, sd=5, lower.tail=TRUE)  # 16.62755
```

This means that the **first quartile** is **expected to sit at X=16.62755** in our normal distribution with mean=20 and sd=5.

In other words, **25% of the values are expected to sit to the left of this value** (as lower.tail=TRUE).
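A quick way to see that qnorm() and pnorm() are inverses of each other (under the same mean and sd):

```r
# qnorm() maps a probability to a value; pnorm() maps it back
x <- qnorm(0.25, mean = 20, sd = 5)  # 16.62755
pnorm(x, mean = 20, sd = 5)          # 0.25
```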

**Density Function**

Finally, we can plot the **density function** of (for example) our initial randomly generated dataset using the **dnorm()** command, as follows:

```r
dens <- dnorm(NormDistr1000, mean=20, sd=5)
plot(NormDistr1000, dens)
```

Which produces this graph as an output:

Still in relation to **general distribution visualization**, be aware that the **hist() function** allows you to visualize your data as well:

```r
hist(NormDistr1000, nclass=100)
```
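As a variation (a sketch recreating the same dataset so it runs on its own), the histogram can be drawn on a density scale, with the theoretical normal curve overlaid using curve() and dnorm():

```r
set.seed(1)
NormDistr1000 <- rnorm(1000, mean = 20, sd = 5)  # same dataset as above

# freq=FALSE puts the histogram on a density scale,
# so the theoretical curve is directly comparable
hist(NormDistr1000, nclass = 100, freq = FALSE)
curve(dnorm(x, mean = 20, sd = 5), add = TRUE, col = "red", lwd = 2)
```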

#### Conclusions

We hope that this post has provided a clear explanation of the concepts of **normal distribution** and **standardized normal distribution**, and an introduction to the **functions needed** to **start tackling these subjects** in the **R** statistical environment.

Thank you for reading.