Multiple Linear Regression: an example

Introduction

We intend to run a multiple linear regression to describe an example of how to put in place this kind of statistical procedure (we’ll use R-Studio to do so).

Just for clarity, here is a definition of Multiple Linear Regression:

Multiple linear regression (MLR) is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. The goal of multiple linear regression (MLR) is to model the relationship between the explanatory and response variables“.

The general formula/model for multiple linear regression, given n observations, is 
yi = 0 + 1xi1 + 2xi2 + … pxip + i for i = 1,2, … n.

In a nutshell, we want to build a model that gives us reasonable estimates of a target variable, on the base of the associated values of other variables (having some logical relationship with the response/target one).

Our dataset

To perform our analysis, we’ll use a dataset that includes 30 items (airlines) , for each of which 11 different variables (beside the company names) have been recorded.

Most of the terms are quite technical, and we won’t dig excessively into their meaning, though we want to obviously have at least a rough idea about what they pertain and measure, in order to utilize them meaningfully.

The target is to create a model that, on the base of the selected variables, can tell us with some precision what the cost of running operations is.

A first look to our data

The first natural step is to load the data, so to start interacting with it in our R-Studio interface.

Please notice we needed to slightly clean the dataset by removing some empty space in the airlines names, in order to load it smoothly. You can download it from here.

Let’s then run the first lines of code (after having downloaded the dataset on your machine, from the hyperlink right above):

By running the code up here, we will notice that columns are named randomly, as they actually have no headings.

We surely want to add names to them, for better comprehension, hence we add some more explicative names to each of the columns:

We now have a fully readable version of our small dataset, and we are ready to dig into it.

The structure of our data can be visualized by using the command below:

The result confirms us that the only multinomial variable is “airline” (company name), while all the remaining ones are numerical in nature (either integer or decimal), which is ideal for regression.

We get as well confirmation that there is a total of 30 items/rows (airlines) and 12 variables (including the companies’ names), as initially stated.

The variable we want to build our model around is a measure of the operating cost (“totOpsCost“, in our table) .
It is measured in “cents per revenue ton-mile” and it intends to give a uniformed measure of the costs for all companies.

We first have a look to the statistics and distribution of this variable (“totOpsCost”). The command:

Gives us this statistical results, for totOpsCost:

Min.       1st Qu.     Median     Mean          3rd              Qu. Max.
42.30       50.70        73.35       113.41      122.00          820.90

SD    145.1433

We build as well an histogram of this variable, to have an idea about its distribution.

Distribution of operating costs for all the 30 airlines of the original dataset.

We can see at a glance how (for virtually all airlines) their costs sit somehwere between our minimum value (as shown above) of 42.30 and (something below or equal to) 200.

There are 2 exceptions:
Wiggins, a clear outlier with costs of 820.9
Central, another noticeable exception (318.5), but perhaps not an as clear of an outlier, considering the mean and standard deviation of “totOpsCost” visualized above.

To make sure we keep working on the dataset with no issues, we use tha attach command, at this point (we probably could have done it earlier, but better leater than never):

The natural next step is probably to check the correlation between the variables. To do so we can run a command like the one below:

But the resulting matrix would probably be too dense and confused, to easily make sense out of it.

With some easy digging  into the internet, we find references to obtain the matrix below, using the following code, which should be easy to read and serves our purpose very directly
(notice we exclude the airlines names/1st column, in the first line of code below):

And here is the matrix:

This matrix tells us straight away which variables have a correlation higher than desired (we set an absolute value of 0.75 as maximum acceptable).

Through a process of exclusion, we opt for including the variables indicated in the code below:

Giving us the following result:

The model doesn’t show great results in terms of R-Squared, nor in term of parameters significance.

We might then want to think about ways to improve it.

Remember how we have found out in the earlier stages that Wiggins was an outlier (with costs over 800)?
Well, we never removed it from the dataset.

Let’s then do so, to assess if Wiggins is the item in the dataset preventing us from acquiring better results, for our multiple linear model.
We then remove Wiggins from the initial dataset, and we add a couple of lines to double check that the resulting data is what we want. We attach as well the new subset (“newdata”) to start working on it, instead:

Now that we have eliminated Wiggins from the data used to build our model, we are ready to apply again the exact same function used before, obviously this time the model will be built on all previous data MINUS the ones related to Wiggins.

We included the code to create the correlation matrix plot too. If you run it you’ll see how it’s extremely similar to the one we got before, so we can use the same variables chosen previously.

Model 2 creation:

Model 2 results:

The improvement in terms of R-Squared, both multiple (0.787) and adjusted (0.7515), is clear and confirmed supported by a minuscule p-value.

As we are now getting decent results for our model, in statistical terms, we might want to tweak it slightly.
For example (remembering how all the 3 financial attributes included in the data were highly correlated among them), we might try to swap “InvestmentsAndSpecialFunds” for “TotalAssets”.

We then simply create a 3rd model:

Which results are (model 3):

It is very similar to model 2, but there is a very minor improvement for the R-Squared, so we see no reason for not using this model 3 instead, anyway.

Considering that (as we can check we can check running the code below):

the actual variable “totOpCost” shows the following statistical values, in “newdata”:        median=71.30       mean=89       sd=57.67

We see how the 50% of our residuals (for model 3) located between 1Q and 3Q are sitting within a tolerable margin of error in relative terms [-13.215, 8.575]

We can not see a diffused high significance among the variables though (in fact they mostly show a dismal one), but given the overall results of the model, we can probably decide to accept it as it is, and procede with our analysis.

We now want to have a look to the main plots that R can generate about our Multiple Linear Model, by using this simple command:

Residuals Vs Fitted

This graph shows us how the residuals are not too bad overall, as the line is fairly flat. Though, it does tends to slightly diverge at the the 2 ends.

It must be pointed out as well that datapoint 5 (“Central”) seems to have nature of outlier, by this plot.

Normal Q-Q plot of residuals

This Q-Q plot seems just confirms what we have seen above, allowing us to assess residuals in standardized terms.

Once again we see how datapoint 5 is clearly off-scale.

Cook’s Distance

Finally, we see the graph visualizing the Cook’s distance (a method used to identify influential data points, which validity might need to be double checked).

We see once more how datapoint 5 is considered definitely odd, in statistical measures.

What plots tell us

The plots above seem to indicate that:

– even though our model shows good R-squared values
– even though we used uncorrelated variables to build it
– even though we eventually removed a clear outlier

We still clearly have a datapoint that doesn’t seem adequate to be included in our linear regression.

Final model

By now we should be quite familiar on how to procede, and by the previous steps of exploration and trial & error we should have a clear idea about what to do.

Hence, we simply advance by using the same procedure already seen.

First, we remove the problematic datapoint left (datpoint 5: “Central” airlines), making sure that it’s done as intended and attaching the new dataset to our R-Studio environment:

As we now have our new “newdata2” dataset ready to work on, we have a look to the correlation matrix to assess which variables now seem to be uncorrelated:

Which generates the plot below, confirming that the variables used in our last model (model 3) are still the best ones to pick, as they are all just as uncorrelated (and that we couldn’t add any more, without first removing some):

We then create our final model by running the usual code just once more (and using the same variables used for model 3):

Which gives us the following results:

This model not only shows clearly better Multiple and Adjusted R-Squared (0.8961 and 0.878 respectively) , but as well better significance for most parameters.

By running the plot function on our final model (feel free to try, as we won’t, in order to avoid to be too repetitive)…

… we see how this model (though maybe not perfect), is overall more apt in fitting/predicting the values of the operational costs of our airlines.

In particular, we see how our Cook’s distance graph this time doesn’t highlight any particular problem.

Conclusions

Our final model shows overall good results in terms of:

R-Squared (multiple and adjusted)
Parameters significance
Residuals (as per summary print and plots)
Cook’s distance (outliers)

We can then consider it to infer the values of the airlines costs, with the caveat of using it just within what seems to be the range covering the level of costs of the airlines that didn’t turn out to be outliers on the market (by our data and following our analysis above), which is a value roughly sitting between 0 and 200.

Leave a Reply

Your email address will not be published. Required fields are marked *