**Introduction**

We are going to run a **multiple linear regression** as a worked example of how to carry out this kind of statistical procedure (we’ll use **R-Studio** to do so).

Just for clarity, here is a **definition** of Multiple Linear Regression:

“*Multiple linear regression (MLR) is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. The goal of multiple linear regression (MLR) is to model the relationship between the explanatory and response variables*”.

The **general formula/model** for multiple linear regression, given *n* observations, is

**y_{i} = β_{0} + β_{1}x_{i1} + β_{2}x_{i2} + … + β_{p}x_{ip} + ε_{i}, for i = 1, 2, …, n.**

In a nutshell, *we want to build a model that gives us reasonable estimates of a target variable, on the base of the associated values of other variables* (having some logical relationship with the response/target one).
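In R, fitting such a model is essentially a one-liner with `lm()`. As a minimal, self-contained sketch of the idea (using simulated data, not our airline dataset):

```r
# Minimal sketch with simulated data (not the airline dataset):
# y depends linearly on two explanatory variables, plus random noise
set.seed(1)
x1 = rnorm(50)
x2 = rnorm(50)
y  = 2 + 3 * x1 - 1.5 * x2 + rnorm(50, sd = 0.5)

toy_model = lm(y ~ x1 + x2)
coef(toy_model)   # estimates should land close to 2, 3 and -1.5
```

The fitted coefficients recover (up to noise) the "true" values used to generate the data, which is exactly what we hope the model will do for the unknown relationships in a real dataset.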

#### Our dataset

To perform our analysis, we’ll use a dataset that includes **30 items** (airlines), for each of which **11 different variables** (besides the company name) have been recorded.

Most of the terms are quite technical, and we won’t dig excessively into their meaning, though we obviously want at least a rough idea of what they pertain to and measure, in order to use them meaningfully.

The **target** is to create a model that, based on the selected variables, can tell us with some precision what the **cost of running operations** is.

#### A first look at our data

The first natural step is to load the data, so that we can start interacting with it in our R-Studio interface.

Please note that we needed to slightly clean the dataset by removing some empty spaces in the airline names, in order to load it smoothly. **You can download** it from **here**.

Let’s then run the first lines of code (after downloading the dataset to your machine from the hyperlink right above):

```r
link = file.choose()
dataset = read.table(link, header = T)
View(dataset)
```

By **running the code above**, we will notice that the **columns carry generic automatic names**, as the file actually has **no headers**.

We surely want to name them properly for better comprehension, hence **we assign a more descriptive name to each of the columns:**

```r
colnames(dataset) = c("airline", "flightLength", "speedMph", "dailyFlyTime",
                      "Popserved", "totOpsCost", "RevTonsPerAircraftMile",
                      "Ton_MileLoadFactor", "AvailableCapacity", "TotalAssets",
                      "InvestmentsAndSpecialFunds", "AdjustedAssets")
View(dataset)
```

We now have a fully readable version of our small dataset, and we are ready to dig into it.

The **structure of our data** can be visualized by using the command below:

```r
str(dataset)
```

The result confirms that the **only multinomial variable is “airline” (the company name)**, while **all the remaining ones are numerical** in nature (either integer or decimal), which is **ideal for regression**.

We also get confirmation that there is a total of **30 items/rows (airlines)** and **12 variables (including the companies’ names)**, as initially stated.

The **variable we want to build our model around** is a measure of the operating cost (“**totOpsCost**” in our table).

It is *measured in “cents per revenue ton-mile”* and is intended to give a uniform measure of the costs across all companies.

We first have a look at the **statistics** and **distribution** of **this variable** (“totOpsCost”). The command:

```r
summary(dataset$totOpsCost)
sd(dataset$totOpsCost)
```

gives us the following statistics for **totOpsCost**:

```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  42.30   50.70   73.35  113.41  122.00  820.90

SD: 145.1433
```

We also build a histogram of this variable, to get an idea of its distribution.

```r
hist(dataset$totOpsCost)
```

We can see at a glance how (for virtually all airlines) the **costs sit somewhere between** our **minimum value** (as shown above) **of 42.30** **and** (something below or equal to) **200**.

There are **2 exceptions**:

– **Wiggins**, a **clear outlier** with costs of **820.9**

– **Central**, another **noticeable exception (318.5), though perhaps not as clear an outlier**, considering the mean and standard deviation of “totOpsCost” shown above.
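Rather than eyeballing the histogram alone, we could also flag these candidates in code. A sketch, assuming the `dataset` object loaded above (the 200 cut-off is our own informal choice, and the 2-standard-deviation rule is just one common convention):

```r
# Flag airlines whose cost exceeds our informal 200 cut-off
dataset[dataset$totOpsCost > 200, c("airline", "totOpsCost")]

# A more conventional check: points more than 2 standard
# deviations away from the mean of totOpsCost
m = mean(dataset$totOpsCost)
s = sd(dataset$totOpsCost)
dataset$airline[abs(dataset$totOpsCost - m) > 2 * s]
```

With a mean of 113.41 and an SD of 145.14, the 2-SD rule would only flag Wiggins, which matches our intuition that Central is a milder case.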

To make sure we keep working on the **dataset** with no issues, we use the **attach command** at this point (we probably could have done it earlier, but better late than never):

```r
attach(dataset)
```

The **natural next step** is probably to **check the correlation** between the variables. To do so we can run a command like the one below:

```r
pairs(dataset)
```

But the resulting matrix would probably be too dense and cluttered to make sense of easily.

With some quick digging on the internet, we find references to **obtain the matrix below using the following code**, which should be easy to read and serves our purpose very directly

(notice that we exclude the airline names/1st column in the first line of code below):

```r
x = cor(dataset[, 2:12])
x
library(corrplot)
library(RColorBrewer)
corrplot(x, type = "upper", order = "hclust",
         col = brewer.pal(n = 8, name = "RdYlBu"))
```

And here is the matrix:

This **matrix tells us straight away** which **variables have a higher correlation than desired** (we set an **absolute value of 0.75** as the maximum acceptable).
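If we prefer not to read the pairs off the plot manually, a short snippet can list every variable pair exceeding the cut-off. A sketch, assuming the correlation matrix `x` computed above and our 0.75 threshold:

```r
# List every pair of variables whose absolute correlation
# exceeds 0.75, using only the upper triangle of the matrix
high = which(abs(x) > 0.75 & upper.tri(x), arr.ind = TRUE)
data.frame(var1 = rownames(x)[high[, 1]],
           var2 = colnames(x)[high[, 2]],
           cor  = x[high])
```

This produces the same information as the color plot, but in a form that is easy to scan when deciding which variables to exclude.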

Through a process of exclusion, we opt to include the variables indicated in the code below:

```r
airlines_model1 <- lm(totOpsCost ~ dailyFlyTime + AvailableCapacity +
                      Ton_MileLoadFactor + InvestmentsAndSpecialFunds)
airlines_model1
summary(airlines_model1)
```

Giving us the following result:

```
lm(formula = totOpsCost ~ dailyFlyTime + AvailableCapacity +
    Ton_MileLoadFactor + InvestmentsAndSpecialFunds)

Residuals:
    Min      1Q  Median      3Q     Max
-153.48  -34.66   -2.37   23.63  410.47

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                 567.0001    85.4618   6.635 5.95e-07 ***
dailyFlyTime                -31.8499    16.3842  -1.944   0.0632 .
AvailableCapacity            -9.7743    17.1205  -0.571   0.5732
Ton_MileLoadFactor         -469.2252   214.9630  -2.183   0.0387 *
InvestmentsAndSpecialFunds    0.4765     0.4713   1.011   0.3218
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 102.3 on 25 degrees of freedom
Multiple R-squared:  0.5717,	Adjusted R-squared:  0.5032
F-statistic: 8.343 on 4 and 25 DF,  p-value: 0.0002031
```

The **model doesn’t show great results in terms of R-Squared**, **nor** in terms of **parameter significance**.

We might then want to think about ways to improve it.

Remember how we found out in the earlier stages that **Wiggins was an outlier** (with costs over 800)?

Well, **we never removed it** from the dataset.

**Let’s then do so**, to **assess whether Wiggins is the item in the dataset preventing us from obtaining better results** for our multiple linear model.

We then **remove Wiggins** from the initial dataset, and add a line to double-check that the resulting data is what we want. We **also attach the new subset (“newdata”) to start working on it** instead:

```r
newdata = dataset[-29,]
View(newdata)
attach(newdata)
```

**Now that we have eliminated Wiggins** from the data used to build our model, we are **ready to apply the exact same function used before**; obviously, this time the model will be built on all the previous data **MINUS** the rows related to **Wiggins**.

We have included the code to create the correlation matrix plot too. If you run it, you’ll see that it is extremely similar to the one we got before, so we can keep the same variables chosen previously.

```r
# new correlation matrix
y = cor(newdata[, 2:12])
y
library(corrplot)
library(RColorBrewer)
corrplot(y, type = "upper", order = "hclust",
         col = brewer.pal(n = 8, name = "RdYlBu"))
```

**Model 2 creation**:

```r
airlines_model2 <- lm(totOpsCost ~ dailyFlyTime + AvailableCapacity +
                      Ton_MileLoadFactor + InvestmentsAndSpecialFunds)
airlines_model2
summary(airlines_model2)
```

**Model 2 results**:

```
lm(formula = totOpsCost ~ dailyFlyTime + AvailableCapacity +
    Ton_MileLoadFactor + InvestmentsAndSpecialFunds)

Residuals:
    Min      1Q  Median      3Q     Max
-49.373 -14.700  -3.524   9.484 104.795

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)
(Intercept)                 303.89235   28.52353  10.654 1.41e-10 ***
dailyFlyTime                 -5.67698    4.85245  -1.170    0.254
AvailableCapacity            -2.69410    4.82950  -0.558    0.582
Ton_MileLoadFactor         -347.48716   60.83352  -5.712 6.94e-06 ***
InvestmentsAndSpecialFunds    0.09688    0.13431   0.721    0.478
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 28.75 on 24 degrees of freedom
Multiple R-squared:  0.787,	Adjusted R-squared:  0.7515
F-statistic: 22.16 on 4 and 24 DF,  p-value: 9.128e-08
```

The **improvement** in terms of **R-Squared, both multiple (0.787) and adjusted (0.7515), is clear**, and it is supported by a **minuscule p-value**.

As we are now getting decent results for our model, in statistical terms, we might want to tweak it slightly.

For example (remembering how all three financial attributes included in the data were highly correlated with one another), we might try to **swap “InvestmentsAndSpecialFunds” for “TotalAssets”**.

We then simply **create a 3rd model**:

```r
airlines_model3 <- lm(totOpsCost ~ dailyFlyTime + AvailableCapacity +
                      Ton_MileLoadFactor + TotalAssets)
airlines_model3
summary(airlines_model3)
```

Its **results** are (**model 3**):

```
lm(formula = totOpsCost ~ dailyFlyTime + AvailableCapacity +
    Ton_MileLoadFactor + TotalAssets)

Residuals:
    Min      1Q  Median      3Q     Max
-49.158 -13.215  -1.345   8.575 102.799

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         306.78790   27.84358  11.018 7.16e-11 ***
dailyFlyTime         -4.39617    4.85019  -0.906    0.374
AvailableCapacity    -5.07579    5.33967  -0.951    0.351
Ton_MileLoadFactor -360.76700   61.02337  -5.912 4.23e-06 ***
TotalAssets           0.02358    0.01944   1.213    0.237
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 28.21 on 24 degrees of freedom
Multiple R-squared:  0.7949,	Adjusted R-squared:  0.7607
F-statistic: 23.26 on 4 and 24 DF,  p-value: 5.833e-08
```

It is very similar to model 2, but there is a **very minor improvement in R-Squared**, so **we see no reason not to use model 3** instead.
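Since models 2 and 3 are not nested (we swapped one variable for another rather than dropping one), a simple way to compare them numerically is via AIC. A sketch, assuming both model objects fitted above on the same 29-airline data:

```r
# Lower AIC indicates a better trade-off between goodness of fit
# and model complexity; both models use the same data and the same
# number of parameters, so the comparison is fair
AIC(airlines_model2, airlines_model3)
```

This is only a sanity check here, since both the R-Squared values and the AIC point the same way.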

**Considering that** (as we can check by running the code below):

```r
summary(newdata$totOpsCost)
sd(newdata$totOpsCost)
```

the **variable “totOpsCost”** shows the **following statistical values** in **“newdata”**: median = 71.30, mean = 89, sd = 57.67,

we see how the **50% of our residuals** *(for model 3)* that are **located between 1Q and 3Q** are **sitting within a tolerable margin of error in relative terms [-13.215, 8.575]**.

We **cannot see widespread high significance among the variables, though** (in fact, most of them show a dismal one), but given the overall results of the model, we can probably accept it as it is and **proceed with our analysis**.

We now want to have a look at the **main plots** that **R can generate** for **our Multiple Linear Model**, using this simple command:

```r
plot(airlines_model3)
```

#### Residuals vs Fitted

This graph shows us that the **residuals are not too bad overall**, as the **line is fairly flat**, though it does **tend to diverge slightly at the two ends**.

It **must be pointed out** as well that **datapoint 5 (“Central”)** looks like an **outlier** in this plot.

#### Normal Q-Q plot of residuals

This **Q-Q plot** seems to simply **confirm what** we have **seen above**, allowing us to assess the residuals **in standardized terms**.

**Once again** we see how **datapoint 5** is **clearly off-scale**.

#### Cook’s Distance

Finally, we see the graph visualizing **Cook’s distance** (a method used to identify influential data points, whose validity might need to be double-checked).

We see **once more** how **datapoint 5** is considered **definitely odd** in statistical terms.
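The same diagnostic can be extracted numerically rather than read off the plot. A sketch, assuming the `airlines_model3` and `newdata` objects from above (the 4/n cut-off is a common rule of thumb, not something fixed by R):

```r
# Cook's distance for every observation in model 3
cd = cooks.distance(airlines_model3)

# A common rule of thumb flags observations whose
# distance exceeds 4 divided by the sample size
n = nrow(newdata)
which(cd > 4 / n)
```

Seeing datapoint 5 appear in this list would confirm in numbers what the plot already suggests.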

#### What plots tell us

The **plots above** seem to **indicate that**:

– even though our model shows good R-squared values

– even though we used uncorrelated variables to build it

– even though we eventually removed a clear outlier

**we still clearly have a datapoint** that **doesn’t seem adequate to be included** in **our linear regression**.

#### Final model

By now we should be quite familiar with how to proceed, and the previous steps of exploration and trial & error should have given us a clear idea of what to do.

Hence, we simply advance by using the **same procedure already seen**.

**First, we remove the remaining problematic datapoint (datapoint 5: “Central” airlines)**, making sure it’s done as intended and attaching the new dataset to our R-Studio environment:

```r
newdata2 = newdata[-5,]
View(newdata2)
attach(newdata2)
```

As we now have our **new “newdata2” dataset** ready to work on, we take **a look at the correlation matrix** to assess which variables now appear uncorrelated:

```r
z = cor(newdata2[, 2:12])
z
library(corrplot)
library(RColorBrewer)
corrplot(z, type = "upper", order = "hclust",
         col = brewer.pal(n = 8, name = "RdYlBu"))
```

This **generates the plot below**, confirming that the **variables used in our last model** (model 3) **are still the best ones to pick**, as they are all just as **uncorrelated** (and we couldn’t add any more without first removing some):

We then **create our final model** by running the usual code just once more (and using the same variables used for model 3):

```r
airlines_modelx <- lm(totOpsCost ~ dailyFlyTime + AvailableCapacity +
                      Ton_MileLoadFactor + TotalAssets)
airlines_modelx
summary(airlines_modelx)
```

Which gives us the **following results**:

```
lm(formula = totOpsCost ~ dailyFlyTime + AvailableCapacity +
    Ton_MileLoadFactor + TotalAssets)

Residuals:
    Min      1Q  Median      3Q     Max
-25.321 -10.207   1.409   8.267  20.618

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         2.465e+02  1.455e+01  16.938 1.75e-14 ***
dailyFlyTime       -4.667e+00  2.270e+00  -2.056  0.05133 .
AvailableCapacity  -7.776e+00  2.516e+00  -3.090  0.00516 **
Ton_MileLoadFactor -2.210e+02  3.227e+01  -6.849 5.53e-07 ***
TotalAssets         1.428e-02  9.152e-03   1.560  0.13235
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.2 on 23 degrees of freedom
Multiple R-squared:  0.8961,	Adjusted R-squared:  0.878
F-statistic: 49.58 on 4 and 23 DF,  p-value: 5.559e-11
```

This model not only shows a **clearly better Multiple and Adjusted R-Squared** (**0.8961** and **0.878**, respectively), but **also better significance for most parameters**.

By running the **plot function on our final model** (feel free to try it yourself; we won’t show the plots again, to avoid being too repetitive)…

```r
plot(airlines_modelx)
```

… we see how **this model (though maybe not perfect) is overall better at fitting/predicting the values of the operational costs of our airlines**.

In particular, we see how the **Cook’s distance graph** this time **doesn’t highlight any** particular **problem**.

#### Conclusions

Our **final model** shows **overall good results** in terms of:

– **R-Squared** (multiple and adjusted)

– **Parameters significance**

– **Residuals** (as per summary print and plots)

– **Cook’s distance (outliers)**

We can then **use it to infer the values of the airlines’ costs**, with the **caveat of applying it only within** what seems to be the **range covering the cost levels of the airlines** that **did not turn out to be outliers** (by our data and the analysis above), which is **a value roughly sitting between 0 and 200**.
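As a closing sketch of how the final model could be used for such inference (the input values below are purely hypothetical, not taken from the dataset, and assume the `airlines_modelx` object from above):

```r
# Predict the operating cost for a hypothetical airline,
# with a prediction interval around the point estimate
# (all input values below are made up for illustration)
hypo_airline = data.frame(dailyFlyTime       = 7.5,
                          AvailableCapacity  = 10,
                          Ton_MileLoadFactor = 0.45,
                          TotalAssets        = 300)
predict(airlines_modelx, newdata = hypo_airline, interval = "prediction")
```

Per the caveat above, such predictions should only be trusted when the inputs resemble the non-outlier airlines the model was fitted on.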