Bayes’ theorem is a widely used concept in both statistics and probability theory.
As per the Wikipedia page: “[It] describes the probability of an event, based on prior knowledge of conditions that might be related to the event”.
In other words, it calculates the probability of an event within a defined scenario (or scenarios), which itself has its own likelihood of happening.
Another common, equivalent definition is that Bayes’ theorem deals with the conditional probability of events.
How it works
Bayes’ theorem deals with likelihoods of actual recorded events.
A first simplistic and intuitive example is:
you see a guy getting out of his Volvo S60, and you are asked to guess whether his salary is (say) north or south of 30,000€.
I bet most of us would guess that it is higher than that mark
(in this case we don’t have precise data, but it is not far-fetched to assume that, for example, 90% of Volvo S60 owners earn above, or well above, the 30,000€ mark).
Given our assumption above (and not forgetting that Bayes’ theorem deals with actual, factual probabilities), we just made an educated guess to minimize the chances of being wrong.
Largely, Bayes is based on this logic (though it applies it with far better statistical and numerical precision).
The statistics behind it
The formula of Bayes’ theorem is the one below:

P(A|B) = P(B|A) × P(A) / P(B)

Where A and B are two events that may or may not happen simultaneously (otherwise said: they are not mutually exclusive).
The formula above reads as follows:
– The probability of A happening, given that B has happened
is equal to
– the probability of B happening, given that A has happened, multiplied by the probability of A happening, and divided by the probability of B happening.
A simple example (taken from our past Statistics module) might help to clarify this statement.
You have an automatic monitoring system, created to detect intruders, and it does so with a probability of 90%.
The system automatically records the weather, and in a series of controlled tests it has shown that, when the intruder was successfully detected:
– 75% of the times the weather was clear
– 20% of the times the weather was cloudy
– 5% of the times the weather was rainy
When instead the system failed to detect the intruder:
– 60% of the times the weather was clear
– 30% of the times the weather was cloudy
– 10% of the times the weather was rainy
Find the probability of detecting the intruder, given that the weather is rainy (assuming an intruder actually entered the plant).
Defining D as the event that the intruder is detected
(and DC as its complementary event that the intruder is NOT detected):
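Applying Bayes’ theorem with the numbers above (the weather percentages act as the conditional probabilities), the calculation can be sketched in R as follows; the variable names here are ours:

```r
p_D       <- 0.90   # P(D): intruder detected
p_DC      <- 0.10   # P(DC): intruder NOT detected
p_rain_D  <- 0.05   # P(rain | D): it was rainy when detection succeeded
p_rain_DC <- 0.10   # P(rain | DC): it was rainy when detection failed

# Bayes' theorem: P(D | rain) = P(rain | D) * P(D) / P(rain)
p_rain   <- p_rain_D * p_D + p_rain_DC * p_DC   # total probability of rain
p_D_rain <- p_rain_D * p_D / p_rain
p_D_rain   # ~0.818
```

So, given rainy weather, the probability that the system detects the intruder drops from 90% to roughly 82%.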
It is quite impressive to think that Netflix, a company nowadays worth 19 billion in total assets, wouldn’t even have been conceivable 15-20 years ago.
The proof of this is that Netflix appeared on the market in 1997 as a DVD rental and sales company (though it almost immediately started focusing exclusively on the DVD-rental-by-mail business).
It was not until 2007 that it started providing media streaming as a product, and it didn’t produce any series before 2012.
What made it possible to create a business (in its current form) able not only to wipe out a company like Blockbuster from the market, but also to compete with TV channels and movie productions as a content creator?
Of course, the maturity of web technologies (in the form of interactive, quick-responding websites) and telecommunications (in the form of speedy internet connections) was absolutely necessary to achieve some success, but it does not explain (nor would it probably have been sufficient to reach) THIS kind of success.
The decisive factor in seeing Netflix operate at the level it does today (with countless series that are part of popular culture, such as House of Cards, Stranger Things and Orange Is the New Black, a brand recognition of 65% and a stable position among the top 100 brands in the US market, and a ubiquitous presence around the planet) was the ability to translate the massive amount of data available about its users’ behaviour and preferences into a better, and continuously improved, offer and user experience.
In other words, Netflix is a prime example of a data-driven company making use of big data.
The predictive algorithm used by Netflix to suggest the next content to watch (partly based on “association rules”, for example) is quite important (it ends up suggesting about 80% of the content viewers watch), but it is only one part of many multi-faceted processes, and it depends on the data and metadata it is fed, which brings us to the next two aspects.
First and foremost, a big part of the data used to select the content (the type of series/movies offered) and the general form of the offer (interface, technical specs) comes from simply analyzing customers’ behaviour (gathered in a somewhat “passive” way).
Besides the above, though, there is quite a bit going on within Netflix itself in order to create data (or perhaps, more appropriately, “metadata”): internal “taggers” are in charge of watching every minute of each series, marking its actual content with precision (genre, presence of an ensemble cast, main themes and much more), allowing Netflix to capture the real nature of the content in the most nuanced fashion possible.
This side of the process can be seen as a somewhat more “active” way for Netflix to generate useful data, and it is just as fundamental.
But how does all that translate into an actually better, more successful product, able to improve customer retention and obtain better revenues by offering something that the average customer is more willing to pay for, or to keep paying for?
Let’s see some practical examples.
A first, high-level example is the concept of the “micro-genre” (something largely created by Netflix itself), which is simply the result of machine-learning processes creating “buckets” of shows by discovering commonalities among them and their viewers.
Put simply, the algorithms working behind the scenes for Netflix created a countless number of micro-categories, allowing Netflix to do 2 things:
– to tailor the offer of existing shows to users
– to produce content that is very likely to be successful among at least some demographic of their audience.
Another example is the interface (cyclically reviewed), which is optimized to maximise the success rate of the series prompted.
Netflix interface 2014
Netflix interface 2018
We can see above how the interface changed in the last 4 years. It’s fair to assume that data suggested a less dispersive visualization in the menu (fewer shows prompted at first), better interactivity (shows are now easily browsable, horizontally, by category), a more cinematic layout (darker tones, one single color throughout the webpage), and a better focus on the show selected (bigger prompt, more visible description and image, rating visible at a glance).
A third example is something that many (or all) Netflix users might have noticed, especially at the beginning of their experience as customers.
It is not uncommon that when you launch a series/movie, the first few seconds are of lower quality in terms of image, but the stream quickly adapts to then offer a stable, high-quality image for the rest of the show.
This is another decision relying on data analysis. Netflix noticed 2 things:
– users would switch off within a very few seconds if the show doesn’t start streaming (hence, offering lower quality at first ensures this doesn’t happen, improving retention and user experience)
– it allows the streaming to be optimized so that there is no buffering for the duration of the show (another factor that users can find extremely irritating, and which could lead them to drop the show, or even Netflix’s services in general)
Deciding (in particular) to trade lower quality for better response times was a choice deriving from looking at customers’ behaviour data.
A last example is how Netflix can (thanks to the use of big data) micro-target advertisements depending on the precise demographic they are aimed at (something that a TV channel cannot do with such precision).
For example, for the first season of House of Cards it created 10 different trailers, each aimed at a specific segment of viewers, to maximise the potential interest of customers, when launching the series.
It should be clear by now how Netflix is a textbook example of how data analytics, big data and data-driven decision-making should be run.
It is surely no easy task to create the infrastructure, the internal know-how and the culture needed to make the best use of big data, but it is a fact that, when done properly, such an approach allows a business to optimize resources while offering the best possible product to customers, ensuring success and growth from every perspective.
We have all heard of the normal distribution (often also referred to as the “Bell Curve” or “Gaussian Distribution”), and most people have at least a vague idea of what it is.
A little history
Galileo (XVII century) was the first to have an intuition of such a distribution, as he realized that the measurement errors made by the instruments he was using were symmetric, and that small errors occurred more frequently than large ones.
Laplace defined the Central Limit Theorem in 1778 (which is strictly intertwined with the general concept of normal distribution).
Gauss (and Adrain) were instead the first to formalize the exact mathematical nature of such a distribution, in the early XIX century.
Normal distribution – What is it?
In this post we simply want to illustrate the nature of the Normal distribution in clear terms, and show the main R functions used in relation to this concept.
The Normal distribution is a probability density function, often encountered in nature.
Common examples can be:
– height (or weight) of a human population
– marks obtained in an IQ test
– people’s salaries in a nation/region/city.
The measurements related to each one of the above-mentioned values have something in common:
– they all tend to concentrate around a mean (in other words: the frequency of the mean value tends to be the highest)
– each of those distributions’ shape is symmetrical (around their distinct mean)
– their shape is influenced by how the individual values recorded are distributed around their mean (in other words, the exact shape of each curve will be influenced by its specific individual data values variation around the mean)
The general shape of a Normal distribution is then the following:
Which makes clear at a glance why it is called “Bell curve” (notice as well how it is symmetrical around the mean, μ).
Now, while the general shape will be similar for any measurement following a normal distribution, the actual precise shape for any particular dataset will differ on the basis of 2 parameters:
– the mean around which each specific distribution is centered
– the general variation of individual data points (around the mean) of each specific dataset. You can think of it as “how far” (on average) data points are from the dataset mean (in relative terms).
The following image (thank you, Wikipedia) should clarify this concept:
The image above makes clearer how different datasets (though all commonly bell-shaped and centered around a mean) differ in terms of their precise shape, depending on how “spread out” their values are.
Notice as well how the position of the curve in relation to the axes is different, depending on where the mean of the dataset is positioned.
Without digging into any mathematical technicality, you can see how the 2 parameters of mean (μ) and standard deviation (σ) are the only variables defining each specific normal distribution, in the general Gaussian formula below (everything else is a constant):

f(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²))
Knowing and understanding this simple fact paves the way for the next (extremely important) notion linked to the notion of normal distribution, which we’ll see in the next paragraph.
Standard Normal Distribution
As mentioned above, the normal distribution of any dataset/measurement is strictly defined by its specific and unique mean (μ) and standard deviation (σ).
For statistical analysis, though, it would be useful to have a single, uniform function (with defined and well-known characteristics) to facilitate generalized statistical assessments.
That function exists: it is the “Standard normal distribution”, and any (normally distributed) dataset can be fitted to it through a simple mathematical transformation (which will be described later).
The characteristics of the Standardized Distribution are:
its mean is zero -> μ=0
its standard deviation is one -> σ=1
its values will be distributed as follows:
The properties deriving from such a standard distribution of values mean that:
– 68.3% of the values sit within 1 (positive or negative) standard deviation of the mean
– 95.4% within 2 (positive or negative) standard deviations
– 99.7% within 3 (positive or negative) standard deviations
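These percentages can be checked directly in R with the pnorm() function (which we will meet again below), which returns the area under the standard normal curve to the left of a given point:

```r
# area under the standard normal curve within 1, 2 and 3 standard deviations
pnorm(1) - pnorm(-1)   # ~0.683
pnorm(2) - pnorm(-2)   # ~0.954
pnorm(3) - pnorm(-3)   # ~0.997
```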
These precise values are used in statistics for many purposes (calculating confidence intervals and testing hypotheses, among others).
That’s done (as mentioned above) through a simple transformation that can be applied to any point of a normal distribution, in order to translate it into a standardized one.
Data points/values can be standardized as follows:

z = (x − μ) / σ
which gives us what is called the z-score of a data point in standardized terms.
The z-score simply indicates, in standard deviations, how far the data point is from the mean in a standard normal distribution.
Knowing the z-score of a data point not only gives us a general idea of where it sits in the distribution (with a simple glance at the graph above), but also allows us to infer with statistical precision the percentile in which the point under examination sits.
This can easily be done (the “old-fashioned” way) using z-score tables, or instead using statistical software (such as R-Studio).
R-Studio useful functions
Knowing now the main concepts related to the normal distribution, especially in its standardized form, we can have a look at the most common R-Studio functions related to this subject.
First of all, you might want to load a dataset. You can do that using a .csv file, for example (or some datasets available online).
In our case, we’ll create it ourselves, as it gives us the chance to show a useful function as well. The function is rnorm().
The following code creates a dataset of 1000 data points, distributed normally, with mean=20 and standard deviation =5
(and defining such parameters is extremely simple, as shown below):
set.seed(1) ## data are randomly generated, but setting a seed will always give the same values ##
NormDistr1000 <- rnorm(1000, mean = 20, sd = 5) ## dataset name is our assumption, chosen to match "StandNormDistr1000" below ##
Also, the View() function visualizes the whole dataset in a new tab.
Two basic but fundamental functions are the ones needed to calculate the mean and the standard deviation of a dataset.
These functions are (simply) mean() and sd().
Let’s use those on our randomly generated dataset (here assumed to be named NormDistr1000):

mean(NormDistr1000) # close to the 20 we set
sd(NormDistr1000) # close to the 5 we set

Being randomly generated, the values obtained will obviously not be exactly the ones we set (mean=20, sd=5), but they are clearly pretty close.
To standardize a dataset, we can use the scale() function, for example on the dataset we just created above (assumed name NormDistr1000):

StandNormDistr1000 <- scale(NormDistr1000)
head(StandNormDistr1000)
Thanks to the last line of code, we can see how each data point has been transformed in its equivalent z-score.
We can as well double check our mean and standard deviation, this time on the freshly created standardized version of our dataset, to verify their values:
mean(StandNormDistr1000) # 9.781455e-17 (~0)
sd(StandNormDistr1000) # 1
As we clearly see, they are basically the ones expected.
We can use the pnorm() function to calculate the probability of a data point sitting below (or above) a defined value.
Using the same parameters used for our randomly generated dataset above, mean=20 and sd=5:
pnorm(15, mean = 20, sd = 5, lower.tail = TRUE)
# 0.1586553 -> 15.8655% of data points expected to sit to the LEFT of 15
# notice that lower.tail=TRUE is the default value, anyway
# if we wanted instead to see the share of data points sitting to the RIGHT of our point, we would just set lower.tail=FALSE
Which means that a data point having value=15 is expected to have 15.86% of the data points sitting to its left.
Notice that (15 being exactly 1 standard deviation to the left of the mean) this value is consistent with the image already shown above
where roughly 15.9% of the data points are expected to sit on the left of such a point.
The exact same function can be used with a z-score.
As 15 (as we said) is exactly 1 negative standard deviation from the mean:

pnorm(-1)
# 0.1586553
which is the exact same value obtained in the case above (as expected).
Notice that we didn’t need to specify the mean and standard deviation as parameters here, as they are given default values of mean=0 and sd=1, as we are using a z-score under assumption of normal standard distribution.
In a somewhat complementary fashion, we can use the qnorm() function to calculate the quantiles (or percentiles) of a normal variable.
For example, calculating the 25th percentile (or 1st quartile) using the same mean and standard deviation used so far would look like:

qnorm(0.25, mean = 20, sd = 5)
# 16.62755

This means that the first quartile is expected to sit at X=16.62755 in our normal distribution with mean=20 and sd=5.
In other words, 25% of the values are expected to sit to the left of this value (as lower.tail=TRUE).
Finally, we can plot the density function of (for example) our initial randomly generated dataset using the dnorm() command, as follows:
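The original plotting code isn’t reproduced in this post; a possible version, assuming we plot the theoretical density for the same parameters (mean=20, sd=5), is:

```r
# plot the theoretical normal density curve for mean = 20, sd = 5
x <- seq(0, 40, by = 0.1)   # a range covering roughly 4 standard deviations either side
plot(x, dnorm(x, mean = 20, sd = 5),
     type = "l", xlab = "x", ylab = "Density",
     main = "Normal density (mean = 20, sd = 5)")
```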
Which produces this graph as an output:
Still in relation to general distribution visualization, be aware that the hist() function also allows you to visualize your data:

hist(NormDistr1000) # assumed dataset name
We hope that this post has provided a clear explanation of the concepts of the normal distribution and the standardized normal distribution, and an introduction to the functions needed to start tackling these subjects in the R-Studio statistical environment.
Let’s start from a basic definition: ANOVA stands for “Analysis of Variance”, and it’s a test used to assess the statistical equality (or lack thereof) of different groups’ means.
Notice as well that to run this test you need to have at least 3 different groups (whose means you intend to compare).
As the name itself partly suggests, it does so by analysing the relationship between the VARIANCE BETWEEN GROUPS (the larger this variation, the more likely it is that the means of the groups are different) and the VARIANCE WITHIN GROUPS (the larger it is, the less likely it is that the means of the groups are different).
This is well noticeable in the formula used to run our ANOVA:
F = (variance between groups) / (variance within groups)
The larger the value of F, the more likely it is that the groups have different means (which typically results in rejecting our null hypothesis, H0).
This can be intuitively understood with a simple practical example.
Imagine having 3 groups (A, B and C) of 10 people each, sampled from 3 different university classes, and measuring their heights.
Imagine that the mean heights are respectively: – 187cm (A) – 180cm (B) – 173cm (C)
You’d be naturally brought to think that the means are indeed different, and don’t just suffer from some sampling randomness.
Why so? Because the differences among the samples (A, B and C) are quite relevant in relative terms.
In other words: the variance between the groups is large.
At the same time, before stating that such means are indeed different, you might want to assess what goes on within each of the groups.
You probably want to see how the individual heights are distributed.
The more they are actually concentrated around some value, the more you’ll agree that they are representative of the population they come from (in this case, each university class).
If instead you notice that group A is skewed because it has a semi-professional basketball player in it, and group C is skewed because it has a successful jockey in it, you might be tempted to say that, after all, the means of the populations are not necessarily so different.
In other words, a larger variance within (certain) groups would make you more hesitant to judge the group means as truly different.
This logic is summarized in the measure used in an Anova test to determine whether the means of different groups are indeed different or not, which is (again):
F = (variance between groups) / (variance within groups)
Which tries to make sense of such differences using a statistical approach (which includes assessing how big the groups are, through the use of the “degrees of freedom”).
The most basic form of this test is the one we are addressing in this post: the one-way Anova.
Using the term “one-way” we simply mean that the test is run on a single independent variable (such as, for example, the height seen before).
Let’s run a simple example using the R statistic software, to see how it works.
First, we load the data from a webpage (mind the “data.table” library).
library(data.table) # allows us to use the fread() function
We can then run a boxplot() function to assess how the number of vertebrae is distributed, in each location:
boxplot(Vertebrae ~ Location, data = sardines_data, ylab = "Vertebrae", xlab = "Location", main = "Vertebrae N. by Location")
Which gives us the following plot:
The boxplot allows us to see at a glance details such as:
– 54 vertebrae is a value found only in locations 2, 3 and 5
– 49 vertebrae is a value found only in location 2 and 4
– There is a prevalence of 52 in all locations, except location 1
In any case, this plot doesn’t show us the number of samples for each location, nor does it allow us to decide whether the means can be considered equal or not from the statistical point of view.
As we should now have a fair idea about the nature of our data, we can finally run our Anova test.
To do so, we have 2 options:
1. Taking a relaxed approach to the equality of variances among groups. In this case, you can use the oneway.test() function,
which is more lenient in relation to the equality-of-variance assumption.
2. First verifying whether the variances are equal among the groups, using the Levene test, and only then using the aov() function,
if the variances are proven to be statistically equal.
One-way analysis of means (not assuming equal variances)
F = 4.7549, num df = 5.00, denom df = 232.62
Given the minuscule p-value (0.0003717), the means of the different groups CAN be considered NOT EQUAL.
We definitely have enough support to reject the null hypothesis.
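For reference, both testing options can be sketched end to end; the data below are simulated stand-ins, since the URL of the original sardine dataset isn’t reproduced in this post:

```r
set.seed(123)
# simulated stand-in for the sardine data: vertebrae counts at 3 locations
sim_data <- data.frame(
  Vertebrae = c(rnorm(40, 51.5, 1), rnorm(40, 52.0, 1), rnorm(40, 52.4, 1)),
  Location  = factor(rep(1:3, each = 40))
)

# Option 1: Welch's one-way test, which does NOT assume equal variances
oneway.test(Vertebrae ~ Location, data = sim_data)

# Option 2: classic one-way ANOVA, which assumes equal variances
summary(aov(Vertebrae ~ Location, data = sim_data))
```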
As a last word, please note that the Anova test only tells us that the means of at least 2 groups are different, but falls short of indicating which groups actually have different means.
There are several post-hoc tests able to assess which pairs of groups have different means, a common one being the “Tukey test”.
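In R, the Tukey test is available as TukeyHSD(), applied to an aov() fit; a minimal sketch with made-up data (and remembering that, like aov() itself, it assumes equal variances):

```r
set.seed(1)
# three made-up groups with different means but equal spread
df <- data.frame(
  y = c(rnorm(20, 10), rnorm(20, 11), rnorm(20, 13)),
  g = factor(rep(c("A", "B", "C"), each = 20))
)
TukeyHSD(aov(y ~ g, data = df))  # pairwise comparisons: which group means differ?
```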
Unfortunately, this test assumes equality of variance among groups, hence cannot be used in our case, following our previous findings.
This was meant to be a basic explanation of the nature of the Anova test.
We hope that the example is clear enough to allow anybody to try a first approach to such a statistical test, in relation to their data of interest.
“Multiple linear regression (MLR) is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. The goal of multiple linear regression (MLR) is to model the relationship between the explanatory and response variables“.
The general formula/model for multiple linear regression, given n observations, is yi = β0 + β1xi1 + β2xi2 + … + βpxip + εi, for i = 1, 2, … n.
In a nutshell, we want to build a model that gives us reasonable estimates of a target variable, on the base of the associated values of other variables (having some logical relationship with the response/target one).
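As a quick illustration of what such a model looks like in R (using the built-in mtcars dataset as a stand-in, since our airline data is only loaded later):

```r
# predict fuel consumption from weight and horsepower: two explanatory variables
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)   # coefficients, significance, R-squared
coef(fit)      # the estimated beta parameters of the model
```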
To perform our analysis, we’ll use a dataset that includes 30 items (airlines), for each of which 11 different variables (besides the company names) have been recorded.
We now have a fully readable version of our small dataset, and we are ready to dig into it.
The structure of our data can be visualized with the str() function:
The result confirms that the only categorical variable is “airline” (the company name), while all the remaining ones are numerical in nature (either integer or decimal), which is ideal for regression.
We also get confirmation that there is a total of 30 items/rows (airlines) and 12 variables (including the companies’ names), as initially stated.
The variable we want to build our model around is a measure of operating cost (“totOpsCost” in our table).
It is measured in “cents per revenue ton-mile”, and it is intended to give a uniform measure of costs across all companies.
We first have a look at the statistics and distribution of this variable (“totOpsCost”). The summary() command gives us these statistical results for totOpsCost:

Min. 1st Qu. Median Mean 3rd Qu. Max.
42.30 50.70 73.35 113.41 122.00 820.90
We also build a histogram of this variable, to get an idea of its distribution.
We can see at a glance how (for virtually all airlines) costs sit somewhere between our minimum value of 42.30 (as shown above) and something at or below 200.
There are 2 exceptions:
– Wiggins, a clear outlier with costs of 820.9
– Central, another noticeable exception (318.5), though perhaps not as clear an outlier, considering the mean and standard deviation of “totOpsCost” shown above.
To make sure we keep working on the dataset with no issues, we use the attach() command at this point (we probably could have done it earlier, but better late than never):
The natural next step is probably to check the correlation between the variables. To do so, we could run the cor() function on the whole dataset,
but the resulting matrix would probably be too dense and confusing to easily make sense of.
With some easy digging on the internet, we find references for obtaining the matrix below, using code that is easy to read and serves our purpose very directly
(notice that we exclude the airline names/1st column in the first line of that code):
And here is the matrix:
This matrix tells us straight away which variables have a correlation higher than desired (we set an absolute value of 0.75 as maximum acceptable).
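The original code isn’t reproduced in this post, but the same idea can be sketched in base R (with the built-in mtcars data standing in for the airline dataset, whose object name isn’t shown):

```r
# compute and round the correlation matrix for readability;
# with the real airline data we would first drop the non-numeric name column
M <- round(cor(mtcars), 2)
M   # values near +/-1 flag strongly correlated variable pairs
```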
Through a process of exclusion, we opt for including the variables indicated in the code below:
Residual standard error: 102.3 on 25 degrees of freedom
The model doesn’t show great results in terms of R-Squared, nor in terms of parameter significance.
We might then want to think about ways to improve it.
Remember how we have found out in the earlier stages that Wiggins was an outlier (with costs over 800)?
Well, we never removed it from the dataset.
Let’s then do so, to assess if Wiggins is the item in the dataset preventing us from acquiring better results, for our multiple linear model.
We then remove Wiggins from the initial dataset, adding a couple of lines to double-check that the resulting data is what we want. We also attach the new subset (“newdata”) to start working on it instead:
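The removal code isn’t shown in the post; a hedged sketch of what it might look like (the data-frame and column names here are our assumptions):

```r
# hypothetical names standing in for the real airline data
airlines <- data.frame(airline    = c("Wiggins", "Central", "OtherCo"),
                       totOpsCost = c(820.9, 318.5, 73.4))

newdata <- subset(airlines, airline != "Wiggins")  # drop the outlier row
nrow(newdata)                    # one row fewer than before
"Wiggins" %in% newdata$airline   # FALSE: the outlier is gone
```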
Now that we have eliminated Wiggins from the data used to build our model, we are ready to apply the exact same function used before; this time, obviously, the model will be built on all the previous data MINUS the one related to Wiggins.
We included the code to create the correlation matrix plot too. If you run it, you’ll see it’s extremely similar to the one we got before, so we can use the same variables chosen previously.
Residual standard error: 28.75 on 24 degrees of freedom
The improvement in terms of R-Squared, both multiple (0.787) and adjusted (0.7515), is clear, and supported by a minuscule p-value.
As we are now getting decent results for our model, in statistical terms, we might want to tweak it slightly.
For example (remembering how all the 3 financial attributes included in the data were highly correlated among them), we might try to swap “InvestmentsAndSpecialFunds” for “TotalAssets”.
Residual standard error: 28.21 on 24 degrees of freedom
It is very similar to model 2, but there is a very minor improvement in the R-Squared, so we see no reason not to use this model 3 instead.
Considering that (as we can check by running summary() and sd() on the variable) the actual variable “totOpsCost” shows the following statistical values in “newdata”: median=71.30, mean=89, sd=57.67,
we see how the 50% of our residuals (for model 3) located between 1Q and 3Q sit within a tolerable margin of error in relative terms [-13.215, 8.575].
We cannot see a diffuse high significance among the variables, though (in fact, they mostly show a dismal one), but given the overall results of the model, we can probably accept it as it is and proceed with our analysis.
We now want to have a look at the main plots that R can generate for our Multiple Linear Model, by simply calling the plot() function on the model object:
Residuals Vs Fitted
This graph shows us that the residuals are not too bad overall, as the line is fairly flat, though it does tend to diverge slightly at the 2 ends.
It must be pointed out as well that datapoint 5 (“Central”) seems to have the nature of an outlier, according to this plot.
Normal Q-Q plot of residuals
This Q-Q plot just seems to confirm what we have seen above, allowing us to assess the residuals in standardized terms.
Once again we see how datapoint 5 is clearly off-scale.
Finally, we see the graph visualizing Cook’s distance (a method used to identify influential data points, whose validity might need to be double-checked).
We see once more how datapoint 5 is considered definitely odd, in statistical terms.
What plots tell us
The plots above seem to indicate that:
– even though our model shows good R-squared values
– even though we used uncorrelated variables to build it
– even though we eventually removed a clear outlier
We still clearly have a datapoint that doesn’t seem adequate to be included in our linear regression.
By now we should be quite familiar with how to proceed, and from the previous steps of exploration and trial & error we should have a clear idea of what to do.
Hence, we simply proceed by using the same procedure already seen.
First, we remove the problematic datapoint left (datapoint 5: “Central” airlines), making sure that it’s done as intended, and attaching the new dataset to our R-Studio environment:
As we now have our new “newdata2” dataset ready to work on, we have a look to the correlation matrix to assess which variables now seem to be uncorrelated:
Which generates the plot below, confirming that the variables used in our last model (model 3) are still the best ones to pick, as they are all just as uncorrelated (and that we couldn’t add any more, without first removing some):
We then create our final model by running the usual code just once more (and using the same variables used for model 3):
Residual standard error: 13.2 on 23 degrees of freedom
This model not only shows clearly better Multiple and Adjusted R-Squared values (0.8961 and 0.878 respectively), but also better significance for most parameters.
By running the plot function on our final model (feel free to try, as we won’t, in order to avoid being too repetitive)…
… we see how this model (though maybe not perfect) is overall more apt at fitting/predicting the values of the operational costs of our airlines.
In particular, we see how our Cook’s distance graph this time doesn’t highlight any particular problem.
Our final model shows overall good results in terms of:
– R-Squared (multiple and adjusted)
– Parameters significance
– Residuals (as per summary print and plots)
– Cook’s distance (outliers)
We can then consider using it to infer the values of the airlines’ costs, with the caveat of applying it only within the range covering the cost levels of the airlines that didn’t turn out to be outliers (by our data, and following our analysis above), which is roughly between 0 and 200.