# Introduction

Let’s start from a **basic definition**: **ANOVA** stands for “**Analysis of Variance**”, and it’s a **test** used to **assess the statistical equality** (or lack thereof) **among different groups’ means**.

Notice as well that this test is typically used when you **have at least 3 different groups** (the means of which you intend to compare); with only 2 groups it reduces to a two-sample t-test.

As the name itself partly suggests, **it does so** by **analysing** the **relationship** between the **VARIANCE BETWEEN GROUPS** (*the larger it is, the more likely it is that the group means are different*) and the **VARIANCE WITHIN GROUPS** (*the larger it is, the less likely it is that the group means are different*).

This is **clearly visible in the formula used to run our ANOVA**:

**F = (variance between groups) / (variance within groups)**

**The larger the value of F**, the **more likely it is that the groups have different means** (which typically results in rejecting our H_0, or null, hypothesis, namely that all group means are equal).

This can be **intuitively understood** with a **simple practical example**.

Imagine having **3 groups (A, B and C)** of **10 people each**, sampled from 3 different university classes, and measuring everyone’s height.

Imagine that the **mean heights** are, **respectively** (in cm):

– **187 cm (A)**

– **180 cm (B)**

– **173 cm (C)**

You’d be **naturally inclined to think that the means are indeed different**, and not just affected by some sampling randomness.

Why so? **Because the differences among the sample means (A, B and C) are quite large in relative terms**.

In other words: **the variance between the groups is large**.

At the same time, **before stating that such means are indeed different**, **you might want to assess what goes on within each of the groups**.

You probably want to see **how** the **individual heights are distributed**. The **more concentrated they are around some value**, the **more you’ll agree that they are representative of the population they come from** (in this case, each university class).

**If instead** you **notice that group A is skewed because it has one semi-professional basketball player** in it, and **group C is skewed because it has a successful jockey in it**, you *might be tempted to say that, after all, the means of the populations are not necessarily so different*.

In other words, **you would be more cautious** in judging the group means as different because of a **larger variance within (certain) groups**.
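The effect of the within-group spread can be checked numerically. The sketch below uses made-up heights (not real measurements) for groups A, B and C, built so that the group means are exactly 187, 180 and 173 cm in both scenarios. `f_ratio()` is a helper written only for this illustration, not a built-in R function.

```r
# Made-up heights (cm): 3 groups of 10 people, means fixed at 187, 180 and 173.
group  <- factor(rep(c("A", "B", "C"), each = 10))
centre <- rep(c(187, 180, 173), each = 10)
dev    <- rep(seq(-4.5, 4.5, by = 1), times = 3)  # deviations averaging 0 in each group

tight  <- centre + 0.5 * dev  # heights concentrated around each group mean
spread <- centre + 5 * dev    # same means, much larger spread within groups

# Illustrative helper: F = (variance between groups) / (variance within groups)
f_ratio <- function(y, g) {
  k <- nlevels(g)                 # number of groups
  n <- length(y)                  # total number of observations
  group_mean <- ave(y, g)         # each observation's own group mean
  ms_between <- sum((group_mean - mean(y))^2) / (k - 1)
  ms_within  <- sum((y - group_mean)^2) / (n - k)
  ms_between / ms_within
}

f_tight  <- f_ratio(tight, group)
f_spread <- f_ratio(spread, group)
print(c(tight = f_tight, spread = f_spread))
```

With these invented numbers, F is roughly 214 when the heights are concentrated and roughly 2.1 when they are spread out, even though the three group means are identical in both cases.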

*This logic is summarized in the measure used in an ANOVA test to determine whether the means of different groups are indeed different or not, which is (again):*

**F = (variance between groups) / (variance within groups)**

This ratio tries to make sense of such differences using a statistical approach (which includes accounting for how many groups and observations there are, through the use of the “degrees of freedom”).
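Written out in the usual textbook form, the ratio looks as follows. The notation is introduced here only for illustration and is not used elsewhere in this post: k is the number of groups, n_i the number of observations in group i, N the total number of observations, ȳ_i the mean of group i and ȳ the overall mean.

```latex
F = \frac{\text{variance between groups}}{\text{variance within groups}}
  = \frac{\displaystyle \sum_{i=1}^{k} n_i \, (\bar{y}_i - \bar{y})^2 \,/\, (k - 1)}
         {\displaystyle \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 \,/\, (N - k)}
```

Here k − 1 and N − k are the degrees of freedom of the numerator and of the denominator. This is the classical form of the statistic, which assumes that the variance is the same within every group.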

#### One-way ANOVA

The **most basic form of this test** is the one we are addressing in this post: the **one-way ANOVA**.

By **“one-way”** we simply **mean that the test is run on a single independent variable**, or factor (in the example above, the university class each person belongs to, with height as the measured outcome).

Let’s run a **simple example using the R statistical software**, to see how it works.

First, we **load the data** from a webpage (mind the “data.table” library).

    library(data.table)  # provides the fread() function

    sardines_data = fread("http://users.stat.ufl.edu/~winner/data/sardine.dat")

Then we **assign names to the 2 columns** of the dataset, **for clarity**:

    colnames(sardines_data) = c("Location", "Vertebrae")
    View(sardines_data)

Since we ran the **View() function**, we can see that there are **6 values for “Location”** and **12,858 data points**.

For general interest, we know from the references of the dataset that the “Location” values indicate the following:

    1=Alaska
    2=British Columbia
    3=San Francisco
    4=Monterey
    5=San Pedro
    6=San Diego

We can then run a **boxplot() function** to assess **how the number of vertebrae is distributed**, in **each location**:

    boxplot(Vertebrae ~ Location, data = sardines_data,
            ylab = "Vertebrae", xlab = "Location",
            main = "Vertebrae N. by Location")

Which gives us the **following plot**:

The boxplot allows us to **see at a glance** details such as:

– **54 vertebrae** is a value **found only** in **locations 2, 3 and 5**

– **49 vertebrae** is a value **found only** in **locations 2 and 4**

– There is a **prevalence of 52 in all locations, except location 1**

In any case, *this plot doesn’t show us the number of samples for each location, nor does it allow us to decide whether the means can be considered equal or not, from the statistical point of view*.
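The group sizes and group means that the boxplot leaves out can be tabulated directly. The sketch below uses a tiny made-up data frame, `toy_data`, standing in for `sardines_data` (same two column names, invented values), so that it runs without downloading anything. The same calls should apply to the real table.

```r
# Made-up stand-in for sardines_data: same column names, invented values.
toy_data <- data.frame(
  Location  = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  Vertebrae = c(52, 53, 51, 51, 52, 51, 52, 53, 52)
)

# Number of observations in each location
group_sizes <- table(toy_data$Location)

# Mean and standard deviation of the vertebrae count in each location
group_means <- tapply(toy_data$Vertebrae, toy_data$Location, mean)
group_sds   <- tapply(toy_data$Vertebrae, toy_data$Location, sd)

print(group_sizes)
print(round(cbind(mean = group_means, sd = group_sds), 2))
```

Very unequal group sizes or standard deviations in such a table are a first hint that the equal-variance assumption deserves a formal check.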

#### ANOVA test

As we should now have a fair idea about the nature of our data, **we can finally run our ANOVA test**.

To do so, **we have 2 options**:

**1.** Taking a **relaxed approach to the equality of variances among groups**. In this case, you can use the **oneway.test()** function, which is **more lenient in relation to the equality of variance assumption**.

**2.** First **verifying whether the variances are equal among the groups using the Levene Test**, and only then **using** the **aov()** function, **if the test gives no evidence that the variances differ**.
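For reference, the two calls can be put side by side. The sketch below runs on a small made-up data frame (`toy_data`, invented values, not the sardine data). It converts the numeric group code to a factor before calling `aov()`, which would otherwise treat the code as a numeric predictor and fit a straight line instead of comparing group means.

```r
# Made-up data: 3 locations coded as numbers, 5 observations each.
toy_data <- data.frame(
  Location  = rep(c(1, 2, 3), each = 5),
  Vertebrae = c(51, 52, 52, 53, 52,
                52, 52, 53, 51, 52,
                53, 52, 53, 54, 53)
)

# Option 1: Welch's one-way test, which does not assume equal variances
welch_fit <- oneway.test(Vertebrae ~ Location, data = toy_data, var.equal = FALSE)

# Option 2: classical ANOVA, which assumes equal variances.
# The group code must be a factor, otherwise aov() treats it as a number.
classic_fit <- aov(Vertebrae ~ as.factor(Location), data = toy_data)

print(welch_fit)
print(summary(classic_fit))
```

Both outputs report an F statistic and a p-value; they differ in how the denominator degrees of freedom are computed.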

**Let’s try this 2nd, stricter approach**:

**Levene Test:**

    library(car)  # provides the leveneTest() function
    leveneTest(Vertebrae ~ as.factor(Location), data = sardines_data)

Which gives us the **following result**:

    > leveneTest(Vertebrae~as.factor(Location), data=sardines_data)
    Levene's Test for Homogeneity of Variance (center = median)
             Df F value    Pr(>F)
    group     5  4.1577 0.0008937 ***
          12852
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The **p-value** of the **Levene Test** tells us that the **variances of the 6 groups definitely cannot be considered equal**.

**We then need to proceed with** the more lenient version of the ANOVA test, using the **oneway.test() function**:

    oneway.test(Vertebrae ~ Location, var.equal = FALSE, data = sardines_data)

Which gives us the following:

    One-way analysis of means (not assuming equal variances)

    data:  Vertebrae and Location
    F = 4.7549, num df = 5.00, denom df = 232.62, p-value = 0.0003717

**Given the minuscule p-value** (0.0003717), the **means of the different groups CAN be considered NOT EQUAL**.

We **definitely have enough support to reject the null hypothesis**.
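The p-value in the output is the upper tail of an F distribution with the reported degrees of freedom. As a sanity check, it can be recomputed from the rounded figures printed above, so a tiny rounding difference is to be expected:

```r
# Figures copied from the oneway.test() output above (already rounded)
f_value  <- 4.7549
num_df   <- 5
denom_df <- 232.62

# Probability of observing an F at least this large if all group means were equal
p_value <- pf(f_value, df1 = num_df, df2 = denom_df, lower.tail = FALSE)
print(signif(p_value, 4))
```

The result should land very close to the 0.0003717 reported by `oneway.test()`.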

As a last word, please **note that the ANOVA test only tells us whether the means of at least 2 groups are different**, but **falls short** of **indicating which groups actually have different means**.

There are **several post-hoc tests** able to **assess which pairs of groups have different means**, a **common one being** the “**Tukey test**”.

Unfortunately, *this test assumes equality of variance among groups, hence cannot be used in our case, following our previous findings.*
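One possible alternative, which is not part of the analysis above, is to compare the groups pair by pair with t-tests that do not pool the variances, adjusting the p-values for multiple comparisons. Base R's `pairwise.t.test()` does this when called with `pool.sd = FALSE`. The sketch below uses made-up data (`toy_data`, invented values) and the Holm adjustment, both chosen only for illustration.

```r
# Made-up data: 3 locations, 6 observations each, with unequal spreads.
toy_data <- data.frame(
  Location  = factor(rep(c("1", "2", "3"), each = 6)),
  Vertebrae = c(51, 52, 52, 53, 52, 52,
                50, 54, 52, 49, 53, 51,
                53, 54, 53, 54, 53, 55)
)

# Pairwise t-tests without pooling the standard deviations (Welch-type tests),
# with Holm-adjusted p-values
posthoc <- pairwise.t.test(toy_data$Vertebrae, toy_data$Location,
                           p.adjust.method = "holm", pool.sd = FALSE)
print(posthoc)
```

The printed table has one adjusted p-value per pair of locations; small values point to the pairs whose means differ.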

*This was meant to be a basic explanation of the nature of the Anova test.*

*We hope that the example is clear enough to allow anybody to try a first approach to such a statistical test, in relation to their data of interest.*