Bayes’ Theorem – An introduction

Bayes’ theorem is a widely used concept in both statistics and probability theory.

As per the Wikipedia page: “[It] describes the probability of an event, based on prior knowledge of conditions that might be related to the event“.

In other words, it calculates the probability of an event occurring within a defined scenario (or scenarios), where the scenario itself has its own likelihood of happening.

Another common equivalent definition is that Bayes’ theorem deals with the conditional probability of events.

How it works

Bayes’ theorem deals with likelihoods of actual recorded events.

A first simplistic and intuitive example is:

you see a guy getting off his Volvo S60, and you are asked to guess if his salary is (say) north or south of 30,000€.
I bet most of us would guess that it is higher than that mark

(In this case we don’t have the precise data, but it is not far fetched to assume that 90% of Volvo S60 owners are above -or well above- the 30,000€ salary, for example).

Given our assumption above (and not forgetting that Bayes’ theorem deals with actual, factual probabilities), we just made an educated guess to minimize the chances of being wrong.

Largely, Bayes is based on this logic (though it definitely applies it with better statistical and numerical precision).

The statistics behind it

The formula of Bayes’ theorem is the one below:

P(A|B) = [P(B|A) * P(A)] / P(B)

Where A and B are two events that can happen simultaneously (otherwise said: they are not mutually exclusive).

The formula above reads as follows:

– the probability of A happening, given that B has happened

is equal to

– the probability of B happening, given that A has happened, multiplied by the probability of A happening, and divided by the probability of B happening.

Example

A simple example, (taken from our past Statistics module) might help to clarify this statement.

Problem:

You have an automatic monitoring system, created to detect intruders, and it does so with a probability of 90%.

The system automatically records the weather, and in a series of controlled tests it has shown that, when the intruder was successfully detected:
– 75% of the times the weather was clear
– 20% of the times the weather was cloudy
– 5% of the times the weather was rainy

When instead the system failed to detect the intruder:
– 60% of the times the weather was clear
– 30% of the times the weather was cloudy
– 10% of the times the weather was rainy

Find the probability of detecting the intruder, given that the weather is rainy (assuming an intruder actually entered the plant).

Solution

Defining D as the event that the intruder is detected
(and DC as its complementary event, that the intruder is NOT detected):

P(D) = 0.9
P(Clear|D) = 0.75
P(Cloudy|D) = 0.20
P(Rainy|D) = 0.05

And:

P(DC) = 0.1
P(Clear|DC) = 0.60
P(Cloudy|DC) = 0.30
P(Rainy|DC) = 0.10

One way to look at the problem (which also helps us understand the logic behind the theorem) is by using the following tree:

Realizing then that the previously shown formula can be broken down as explained here:

We can proceed by calculating (remembering that D and DC are mutually exclusive and exhaustive events):

P(D|Rainy) = [P(D) * P(Rainy|D)] / [P(D) * P(Rainy|D) + P(DC) * P(Rainy|DC)]

Which is:

(0.9)(0.05) / [(0.9)(0.05) + (0.10)(0.10)] = 0.045 / 0.055 = 0.818 = 81.8%

Under rainy conditions, the system can detect an intruder with a probability of 0.818 (a value lower than the designed probability of 0.9).
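For readers who want to double check the arithmetic, the same calculation can be reproduced in a few lines of R (the variable names below are our own, not part of the original problem):

p_d       <- 0.90   # P(D): intruder detected
p_dc      <- 0.10   # P(DC): intruder not detected
p_rain_d  <- 0.05   # P(Rainy|D)
p_rain_dc <- 0.10   # P(Rainy|DC)

# Bayes' theorem: P(D|Rainy)
(p_d * p_rain_d) / (p_d * p_rain_d + p_dc * p_rain_dc)   # 0.8181818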

Conclusion

We hope that these definitions and this example can serve as a first approach to the nature and purpose of Bayes’ theorem.

Netflix – Big data and new technologies

It is quite impressive to think that Netflix, a company nowadays worth 19 billion in total assets, would not even have been conceivable 15-20 years ago.
The proof of this is that Netflix itself appeared on the market in 1997 as a DVD rental and sales company (though it almost immediately started focusing exclusively on the DVD-rental-by-mail business).
It was not until 2007 that it started providing media streaming as a product, and it did not produce any series before 2012.

What made it possible to create a business (in its current form) able not only to wipe out a company like Blockbuster from the market, but also to compete with TV channels and movie productions as a content creator?

Of course, the maturity of web technologies (in the form of interactive and quick-responding websites) and telecommunications (in the form of fast internet connections) was absolutely necessary to achieve some success, but it does not explain (nor would it probably have been sufficient to reach) THIS kind of success.

The decisive factor in seeing Netflix operate at the level it does today, with countless series that are part of popular culture (House of Cards, Stranger Things, Orange Is the New Black, just to name a few), a brand recognition of 65% (and a stable position among the top 100) in the US market and a ubiquitous presence around the planet, was its ability to translate the massive amount of data available about its users’ behaviour and preferences into a better, continuously improved offer and user experience.

In other words, Netflix is a prime example of a data-driven company making use of big data.

The predictive algorithm used by Netflix to suggest to users the next content to watch (partly based on “association rules”, for example) is quite important (it ends up suggesting about 80% of the content viewers watch), but it is only one part of many multi-faceted processes, and it depends on the data and metadata it is fed with, which brings us to the next two aspects.

First and foremost, a big part of the data used to select the content (the type of series/movies offered) and the general form of the offer (interface, technical specs) comes from simply analyzing customers’ behaviour (gathered in a somewhat “passive” way).

Besides the above, though, there is quite a bit going on within Netflix itself in order to create data (or, more appropriately, “metadata”): internal “taggers” are in charge of watching every minute of each series, marking its actual content with precision (genre, presence of an ensemble cast, main themes and much more), allowing Netflix to capture the nature of the content in the most nuanced fashion possible.
This side of the process can be seen as a more “active” way for Netflix to generate useful data, and it is just as fundamental.

But how does all of that translate into an actually better, more successful product, able to improve customer retention and to obtain better revenues by offering something that the average customer is more willing to pay for, or to keep paying for?

Let’s see some practical examples.

A first, high-level example is the concept of the “micro-genre” (something largely created by Netflix itself), which is simply the result of machine-learning processes creating “buckets” of shows by discovering commonalities among them and their viewers.
Put simply, the algorithms working behind the scenes at Netflix created a countless number of micro-categories, allowing Netflix to do 2 things:
– to tailor the offer of existing shows to users
– to produce content that is very likely to be successful among at least some demographic of their audience.

Another example is the interface (cyclically reviewed) which is optimized to maximise the success rate of the series prompted.

Netflix interface 2014

Netflix interface 2018

We can see above how the interface changed in the last 4 years.
It’s fair to assume that data suggested a less dispersive visualization in the menu (fewer shows prompted at first), better interactivity (shows are now easily browsable horizontally, by category), a more cinematic layout (darker tones, one single colour throughout the webpage), and a better focus on the selected show (a bigger prompt, a more visible description and image, the rating visible at a glance).

A third example is something that many (or all) Netflix users might have noticed, especially at the beginning of their experience as customers.
It is not uncommon that, when you launch a series/movie, the first few seconds are of lower image quality, which then quickly adapts to offer a stable, high-quality picture for the rest of the show.
This is another decision relying on data analysis. Netflix noticed 2 things:
– users would switch off within a very few seconds if the show didn’t start streaming (hence, offering a lower quality at first ensures that this doesn’t happen, improving retention and user experience)
– it allows Netflix to optimize the streaming to ensure there is no buffering for the duration of the show (another factor that users can find extremely irritating and that could lead them to drop the show, or even Netflix’s services in general)

Deciding (in particular) to trade lower quality for better response times was a choice deriving from looking at customers’ behaviour data.

A last example is how Netflix can (thanks to the use of big data) micro-target advertising depending on the precise demographic it is aimed at (something that a TV channel cannot do with such precision).
For example, for the first season of House of Cards it created 10 different trailers, each aimed at a specific segment of viewers, to maximise the potential interest of customers, when launching the series.

It should be clear by now that Netflix is a textbook example of how data analytics, big data and data-driven decision-making should be run.

It is surely not an easy task to create the infrastructure, the internal know-how and the culture needed to make the best use of big data, but it is a fact that, when done properly, such an approach allows a company to optimize its resources while offering the best possible product to customers, ensuring the success and growth of the business from every perspective.

The Normal Distribution – An introduction and some related R-Studio functions

We have all heard of the normal distribution (often also referred to as the “Bell Curve” or “Gaussian distribution”), and most people have at least a vague idea of what it is.

A little history

Galileo (XVII century) was the first to have an intuition of such a distribution, as he realized that the measurement errors made by the instruments he was using were symmetric, and that small errors occurred more frequently than large ones.

Laplace defined the Central Limit Theorem in 1778 (which is strictly intertwined with the general concept of normal distribution).

Gauss (and Adrain) were instead the first ones to formalize the exact mathematical nature of such distribution in the early XIX century.

Normal distribution – What is it?

In this post we simply want to illustrate the nature of the Normal distribution in clear terms, and show the main R functions used in relation to this concept.

The Normal distribution is a probability distribution (described by a density function) that is often encountered in nature.

Common examples can be:
– height (or weight) of a human population
– marks obtained in an IQ test
– people’s salaries in a nation/region/city.

The measurements related to each one of the above-mentioned values have something in common:

– they all tend to concentrate around a mean (in other words: the frequency of the mean value tends to be the highest)

– each of those distributions’ shape is symmetrical (around their distinct mean)

– their exact shape is influenced by how the individual recorded values are distributed around their mean (in other words, the shape of each curve depends on how its specific data values vary around the mean)

The general shape of a Normal distribution is then the following:


This makes it clear at a glance why it is called a “Bell curve” (notice as well how it is symmetrical around the mean, μ).

Now, while the general shape will be similar for any measurement following a normal distribution, the precise shape for any particular dataset will differ based on 2 parameters:

– the mean around which each specific distribution is centered

– the general variation of the individual data points (around the mean) of each specific dataset.
You can think of it as “how far” (on average) data points are from the dataset mean (in relative terms).

The following image (thank you, Wikipedia) should clarify this concept:


The image above makes it clearer how different datasets (though all bell-shaped and centered around a mean) differ in their precise shape, depending on how “spread out” their values are.

Notice as well how the position of the curve in relation to the axes is different, depending on where the mean of the dataset is positioned.

Without digging into any mathematical technicality, you can see how the 2 parameters of mean (μ) and standard deviation (σ) are the only variables defining each specific normal distribution in the general Gaussian formula below (everything else is a constant):

f(x) = (1 / (σ √(2π))) * e^(−(x − μ)² / (2σ²))

Knowing and understanding this simple fact paves the way for the next (extremely important) notion linked to the notion of normal distribution, which we’ll see in the next paragraph.

Standard Normal Distribution

As mentioned above, the normal distribution of any dataset/measurement is strictly defined by its specific and unique mean (μ) and standard deviation (σ).

For statistical analysis, though, it would be useful to have a single, uniform function (with defined and well-known characteristics) to facilitate generalized statistical assessments.

That function exists: it is the “Standard normal distribution”, and any (normally distributed) dataset can be fitted to it through a simple mathematical transformation (which will be described later).

The characteristics of the standard normal distribution are:

  • its mean is zero -> μ=0
  • its standard deviation is one -> σ=1
  • its values will be distributed as follows:

Such a standardized distribution of values has the following properties:

– 68.2% of the values fall within 1 (positive or negative) standard deviation of the mean
– 95.4% fall within 2 (positive or negative) standard deviations
– 99.7% fall within 3 (positive or negative) standard deviations
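As a quick sanity check, these percentages can be recovered directly in R with the pnorm() function (introduced in more detail later in this post):

pnorm(1) - pnorm(-1)   # 0.6826895, share of values within 1 standard deviation
pnorm(2) - pnorm(-2)   # 0.9544997, within 2 standard deviations
pnorm(3) - pnorm(-3)   # 0.9973002, within 3 standard deviations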

These precise values are used in statistics for many purposes (calculating confidence intervals and testing hypotheses, among others).

Standardization is achieved (as mentioned above) through a simple transformation that can be applied to any point of a normal distribution, in order to translate it into a standardized one.

Data points/values can be standardized as follows:

z = (x − μ) / σ

which gives us what is called the z-score of a data point in standardized terms.

The z-score simply indicates how many standard deviations away from the mean the data point sits, in standard normal terms.

Knowing the z-score of a data point not only gives us a general idea of where it sits in the distribution (with a simple glance at the graph above), but also allows us to infer with statistical precision the percentile at which the point under examination sits.

This can easily be done (the “old-fashioned” way) using z-score tables, or using statistical software (such as R-Studio).

R-Studio useful functions

Knowing now the main concepts related to the normal distribution, especially in its standardized form, we can have a look at the most common R-Studio functions related to this subject.

First of all, you might want to load a dataset. You can do that using a .csv file, for example (or some datasets available online).

In our case, we’ll create it ourselves, as it also gives us the chance to show a useful function: rnorm().
The following code creates a dataset of 1000 data points, distributed normally, with mean = 20 and standard deviation = 5
(and defining such parameters is extremely simple, as shown below):
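A minimal sketch of what this can look like (the variable name and the seed are our own choices, the latter added for reproducibility):

set.seed(123)                              # for reproducibility (our addition)
my_data <- rnorm(1000, mean = 20, sd = 5)  # 1000 normally distributed points
View(data.frame(value = my_data))          # inspect the dataset in a new tab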

Also, the View() function visualizes the whole dataset in a new tab.

Two basic but fundamental functions are the ones needed to calculate the mean and the standard deviation of a dataset.
These functions are (simply) mean() and sd().
Let’s use those on our randomly generated dataset:
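Assuming the dataset created above is stored in my_data:

mean(my_data)   # close to 20, but not exactly, because of random sampling
sd(my_data)     # close to 5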

Being the data randomly generated, the values obtained will obviously not be exactly the ones we set (mean = 20, sd = 5), but they will be pretty close to that.

To standardize a dataset, we can use the scale() function, for example on the dataset we just created above:
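A possible version of this step (again using the my_data object defined earlier):

my_data_std <- scale(my_data)   # scale() returns the z-score of each data point
head(my_data_std)               # show the first few standardized values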

Thanks to the last line of code, we can see how each data point has been transformed into its equivalent z-score.

We can as well double check our mean and standard deviation, this time on the freshly created standardized version of our dataset, to verify their values:
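Still using the my_data_std object created above:

mean(my_data_std)   # essentially 0 (up to floating-point error)
sd(my_data_std)     # 1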

As we clearly see, they are basically the ones expected.

Probabilities calculation

We can use the pnorm() function to calculate the probability of a value falling at or below a defined point.
Using the same parameters used for our randomly generated dataset above, mean = 20 and sd = 5:
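For example, for the value 15:

pnorm(15, mean = 20, sd = 5)   # 0.1586553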

This means that a data point with value = 15 is expected to have about 15.87% of the data points sitting to its left.

Notice that (15 being exactly 1 standard deviation to the left of the mean) this value is consistent with the image already shown above

where roughly 15.9% of the data points are expected to sit to the left of such a point.

The exact same function can be used with a z-score.
As 15 (as we said) is exactly 1 negative standard deviation from the mean:
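In that case:

pnorm(-1)   # 0.1586553, using the default mean = 0 and sd = 1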

which is exactly the same value obtained in the case above (as expected).
Notice that we didn’t need to specify the mean and standard deviation as parameters here, as they take the default values of mean = 0 and sd = 1: we are using a z-score under the assumption of a standard normal distribution.

Percentile calculation

In a somewhat complementary fashion, we can use the qnorm() function to calculate the quantiles (or percentiles) of a normal variable.

For example, calculating the 25th percentile (or 1st quartile) using the same mean and standard deviation used so far would look like this:
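One way to write that call:

qnorm(0.25, mean = 20, sd = 5, lower.tail = TRUE)   # 16.62755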

This means that the first quartile is expected to sit at X = 16.62755 in our normal distribution with mean = 20 and sd = 5.

In other words, 25% of the values are expected to sit to the left of this value (as lower.tail = TRUE).

Density Function

Finally, we can plot the density function of (for example) our initial randomly generated dataset using the dnorm() command, as follows:
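One possible way to draw it is to evaluate dnorm() over a grid of x values (the grid and the plotting options below are our own choices):

x <- seq(0, 40, length.out = 400)      # grid of x values around the mean
plot(x, dnorm(x, mean = 20, sd = 5),   # theoretical density with mean 20, sd 5
     type = "l", xlab = "x", ylab = "density")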

This produces a bell-shaped density curve as an output.

Still in relation to general distribution visualization, be aware that the hist() function also allows you to visualize your data:
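For example, on the dataset generated earlier:

hist(my_data)   # histogram of the raw dataset, roughly bell-shaped around 20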

Conclusions

We hope that this post has provided a clear explanation of the concepts of the normal distribution and the standardized normal distribution, as well as an introduction to the functions needed to start tackling these subjects in the R-Studio statistical environment.

Thank you for reading.

One-Way ANOVA – Intro and example

Introduction

Let’s start from a basic definition: ANOVA stands for “Analysis of Variance”, and it is a test used to assess the statistical equality (or lack thereof) of different groups’ means.

Notice as well that this test is typically run when you have at least 3 different groups (the means of which you intend to compare).

As the name itself partly suggests, it does so by analysing the relationship between the VARIANCE BETWEEN GROUPS (the larger this variation, the more likely it is that the group means are different) and the VARIANCE WITHIN GROUPS (the larger it is, the less likely it is that the group means are different).

This is well noticeable in the formula used to run our ANOVA:

F = (variance between groups) / (variance within groups)

The larger the value of F, the more likely it is that the groups have different means (which typically results in rejecting our H0, or null, hypothesis).

This can be intuitively understood with a simple practical example.

Imagine having 3 groups (A, B and C) of 10 people each, sampled from 3 different university classes, and measuring their heights.

Imagine that the mean heights are, respectively:
– 187 cm (A)
– 180 cm (B)
– 173 cm (C)

You would naturally be led to think that the means are indeed different, and do not just suffer from some sampling randomness.

Why so? Because the differences among the samples (A, B and C) are quite large in relative terms.
In other words: the variance between the groups is large.

At the same time, before stating that such means are indeed different, you might want to assess what goes on within each of the groups.
You probably want to see how the individual heights are distributed.
The more they are concentrated around some value, the more you will agree that they are representative of the population they come from (in this case, each university class).

If instead you notice that group A is skewed because it has a semi-professional basketball player in it, and group C is skewed because it has a successful jockey in it, you might be tempted to say that, after all, the means of the populations are not necessarily so different.

In other words, a larger variance within (certain) groups would make you more hesitant to judge the group means as truly different.

This logic is summarized in the measure used in an ANOVA test to determine whether the means of different groups are indeed different or not, which is (again):

F = (variance between groups) / (variance within groups)

This measure tries to make sense of such differences using a statistical approach (which includes accounting for how big the groups are, through the use of “degrees of freedom”).

One-way ANOVA

The most basic form of this test is the one we are addressing in this post: the one-way Anova.

Using the term “one-way” we simply mean that the test is run on a single factor (grouping variable), such as, in the example above, the university class each person was sampled from.

Let’s run a simple example using the R statistic software, to see how it works.

First, we load the data from a webpage (mind the “data.table” library).

Then we assign names to the 2 columns of the dataset, for clarity:
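A sketch of what these two steps might look like (the URL below is a placeholder, since the original address is not reported here, and the column names are our choice, based on the description that follows):

library(data.table)

my_anova_data <- fread("https://example.com/vertebrae_data.csv")  # placeholder URL
setnames(my_anova_data, c("Location", "Vertebrae"))               # name the 2 columns
View(my_anova_data)                                               # inspect the dataset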

Since we ran the View() function, we can see that “Location” takes 6 different values, and that there are 12,858 data points.

For general interest, we know from the references of the dataset that the “Location” values indicate the following:

We can then run the boxplot() function to assess how the number of vertebrae is distributed in each location:
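Assuming the column names chosen above:

boxplot(Vertebrae ~ Location, data = my_anova_data,
        xlab = "Location", ylab = "Number of vertebrae")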

Which gives us the following plot:

The boxplot allows us to see details at a glance, such as:
– 54 vertebrae is a value found only in locations 2, 3 and 5
– 49 vertebrae is a value found only in locations 2 and 4
– there is a prevalence of 52 in all locations, except location 1

In any case, this plot neither shows us the number of samples for each location, nor allows us to decide whether the means can be considered equal from a statistical point of view.

ANOVA test

As we should now have a fair idea about the nature of our data, we can finally run our Anova test.

To do so, we have 2 options:

1. Taking a relaxed approach to the equality of variances among groups. In this case, you can use the:

oneway.test() function

which is more lenient, in relation to the equality of variance assumption.

2. You first verify whether the variances are equal among the groups using the Levene Test, and only then use the:

aov() function

if variances are proven to be statistically equal.

Let’s try this 2nd, stricter approach:

Levene Test:
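The leveneTest() function is not part of base R; a common choice (and an assumption on our side, since the original code is not shown) is the version provided by the “car” package:

library(car)   # provides leveneTest()

leveneTest(Vertebrae ~ factor(Location), data = my_anova_data)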

Which gives us the following result:

The p-value of the Levene Test tells us that the variances of the 6 groups definitely cannot be considered equal.

We then need to proceed using the more lenient version of the ANOVA test, the oneway.test() function:
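For example, keeping the variable names used so far:

oneway.test(Vertebrae ~ factor(Location), data = my_anova_data, var.equal = FALSE)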

Which gives us the following:

Given the minuscule p-value (0.0003717), the means of the different groups CAN be considered NOT EQUAL.

We definitely have enough support to reject the null hypothesis.

As a last word, please note that the Anova test only tells us if the means of at least 2 groups are different, but falls short of indicating which groups actually have different means.

There are several post-hoc tests able to assess which pairs of groups have different means, a common one being the “Tukey test”.
Unfortunately, this test assumes equality of variance among groups, and hence cannot be used in our case, following our previous findings (a sketch of what it would look like is shown below anyway).
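For completeness, this is roughly how it would be run on a dataset where the variances could be considered equal:

fit <- aov(Vertebrae ~ factor(Location), data = my_anova_data)  # only valid with equal variances
TukeyHSD(fit)                                                   # pairwise comparisons between locations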

This was meant to be a basic explanation of the nature of the ANOVA test.
We hope that the example is clear enough to allow anybody to attempt a first approach to this statistical test with their own data of interest.