Data Selection:

From the mlbench library, I am using the BostonHousing2 dataset, a data frame of 506 observations and 19 attributes. The project treats the cmedv variable as the response and models its relationship with the other predictor variables. The dataset has two categorical variables, town and chas, which will also be evaluated as candidate predictors.
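
A quick structural check (a minimal sketch, assuming the mlbench package is installed) confirms the dimensions and the two factor variables:

library(mlbench)
data("BostonHousing2")
dim(BostonHousing2)                               # expect 506 rows and 19 columns
names(which(sapply(BostonHousing2, is.factor)))   # expect "town" and "chas"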

Analysis:

a. Introduction:

Statement: Here, I attempt to analyse the dataset and select a suitable model for predicting house prices in Boston.

BostonHousing2 contains housing information for the Boston area from the 1970 census and reports median house prices in USD 1000’s. The data frame BostonHousing2 is the corrected version with additional spatial information and is available at: http://lib.stat.cmu.edu/datasets/

The original data comprise 506 observations on 14 variables; the corrected version adds \(cmedv\) (the corrected median value, used here as the response) together with town, tract, lon and lat. The following table summarizes the attributes and their types:

| Attribute | Description | Data Type |
|---|---|---|
| crim | per capita crime rate by town | num |
| zn | proportion of residential land zoned for lots over 25,000 sq.ft | num |
| indus | proportion of non-retail business acres per town | num |
| chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) | Factor |
| nox | nitric oxides concentration (parts per 10 million) | num |
| rm | average number of rooms per dwelling | num |
| age | proportion of owner-occupied units built prior to 1940 | num |
| dis | weighted distances to five Boston employment centres | num |
| rad | index of accessibility to radial highways | int |
| tax | full-value property-tax rate per USD 10,000 | int |
| ptratio | pupil-teacher ratio by town | num |
| b | 1000(B - 0.63)^2 where B is the proportion of blacks by town | num |
| lstat | percentage of lower status of the population | num |
| medv | median value of owner-occupied homes in USD 1000’s | num |
| cmedv | corrected median value of owner-occupied homes in USD 1000’s | num |
| town | name of town | Factor |
| tract | census tract | int |
| lon | longitude of census tract | num |
| lat | latitude of census tract | num |

With regard to predicting house prices, the attributes I expect to be most valuable are ptratio, town, rad, dis, rm, nox and crim, and I will use them as a starting point for fitting the best model for this project.

b. Methods:

#install.packages("mlbench")
#?BostonHousing2
library(mlbench)     # BostonHousing2 dataset
library(lmtest)      # bptest()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
library(magrittr)    # %>% pipe used with kable()

data("BostonHousing2")
BostonHousing2 = na.omit(BostonHousing2)

#dropping the attribute highly collinear with the response (medv), the location attributes (tract, lon, lat, town) and the industrial-area attribute (indus)
BostonHousing2 = subset(BostonHousing2, select = -c(medv, tract, lon, lat, indus, town))

#test-train split

split = round(nrow(BostonHousing2) * .80)
bos_trn = BostonHousing2[1:split,]
bos_tst = BostonHousing2[(split + 1):nrow(BostonHousing2),]
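
The split above simply takes the first 80% of the rows. Since BostonHousing2 is ordered by town, a randomized split is a reasonable alternative; a minimal sketch (the seed value and the _rnd names are illustrative choices introduced here):

set.seed(42)
trn_idx = sample(nrow(BostonHousing2), size = round(nrow(BostonHousing2) * .80))
bos_trn_rnd = BostonHousing2[trn_idx, ]
bos_tst_rnd = BostonHousing2[-trn_idx, ]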

#full model and its summary
bm = lm( cmedv ~ ., data = bos_trn)

#examining the coefficients, then filtering the predictors statistically significant for cmedv at alpha = 0.05
summary(bm)$coefficients
##                  Estimate  Std. Error    t value     Pr(>|t|)
## (Intercept)  29.808906211 6.133266581  4.8602006 1.699336e-06
## crim         -0.193469939 0.051212820 -3.7777638 1.828075e-04
## zn            0.043936792 0.014106433  3.1146636 1.976921e-03
## chas1         1.981703770 0.885448706  2.2380786 2.577685e-02
## nox         -13.793047578 4.345162054 -3.1743460 1.620347e-03
## rm            4.732345593 0.480602621  9.8466912 1.377141e-20
## age           0.001920516 0.014342940  0.1338997 8.935506e-01
## dis          -1.351139683 0.207124336 -6.5233265 2.125620e-10
## rad           0.447857750 0.084759443  5.2838685 2.103413e-07
## tax          -0.014814517 0.004245851 -3.4891748 5.395168e-04
## ptratio      -0.778802558 0.138585334 -5.6196607 3.636092e-08
## b            -0.002532068 0.006528362 -0.3878565 6.983327e-01
## lstat        -0.528111442 0.059019961 -8.9480141 1.460223e-17
subset(summary(bm)$coefficients,summary(bm)$coefficients[,4] < 0.05)
##                 Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept)  29.80890621 6.133266581  4.860201 1.699336e-06
## crim         -0.19346994 0.051212820 -3.777764 1.828075e-04
## zn            0.04393679 0.014106433  3.114664 1.976921e-03
## chas1         1.98170377 0.885448706  2.238079 2.577685e-02
## nox         -13.79304758 4.345162054 -3.174346 1.620347e-03
## rm            4.73234559 0.480602621  9.846691 1.377141e-20
## dis          -1.35113968 0.207124336 -6.523327 2.125620e-10
## rad           0.44785775 0.084759443  5.283869 2.103413e-07
## tax          -0.01481452 0.004245851 -3.489175 5.395168e-04
## ptratio      -0.77880256 0.138585334 -5.619661 3.636092e-08
## lstat        -0.52811144 0.059019961 -8.948014 1.460223e-17

From the summary of the fitted full model, the age and b attributes do not look useful: for each, the null hypothesis that its coefficient is zero (in the presence of the other predictors) fails to be rejected.

Next, I will fit a model with only the predictors that appear statistically significant at \(\alpha = 0.05\).

model1 = lm(cmedv ~ ptratio + rad + dis + rm + nox + crim + chas + lstat + zn + tax, data = bos_trn)
subset(summary(model1)$coefficients,summary(model1)$coefficients[,4] < 0.05)
##                Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept)  28.7529158 5.549121405  5.181526 3.521516e-07
## ptratio      -0.7804610 0.137459587 -5.677749 2.652009e-08
## rad           0.4423817 0.083504501  5.297698 1.955204e-07
## dis          -1.3557003 0.197862062 -6.851744 2.820316e-11
## rm            4.7221988 0.462404479 10.212269 7.049849e-22
## nox         -13.2320153 4.047733843 -3.268993 1.174126e-03
## crim         -0.1936400 0.050899070 -3.804392 1.647268e-04
## chas1         1.9931510 0.882624546  2.258209 2.447900e-02
## lstat        -0.5269432 0.054631407 -9.645426 6.643297e-20
## zn            0.0436920 0.013951289  3.131753 1.867381e-03
## tax          -0.0146461 0.004216056 -3.473887 5.699260e-04

This model appears to have a better predictor set, as the null hypothesis is rejected for every predictor. Next, I will find the predictors that are significant at \(\alpha = 0.01\).

#finding significant predictors at the lower significance level alpha = 0.01
subset(summary(model1)$coefficients, summary(model1)$coefficients[,4] < 0.01)
##                Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept)  28.7529158 5.549121405  5.181526 3.521516e-07
## ptratio      -0.7804610 0.137459587 -5.677749 2.652009e-08
## rad           0.4423817 0.083504501  5.297698 1.955204e-07
## dis          -1.3557003 0.197862062 -6.851744 2.820316e-11
## rm            4.7221988 0.462404479 10.212269 7.049849e-22
## nox         -13.2320153 4.047733843 -3.268993 1.174126e-03
## crim         -0.1936400 0.050899070 -3.804392 1.647268e-04
## lstat        -0.5269432 0.054631407 -9.645426 6.643297e-20
## zn            0.0436920 0.013951289  3.131753 1.867381e-03
## tax          -0.0146461 0.004216056 -3.473887 5.699260e-04
#refitting with only the predictors significant at alpha = 0.01 (chas turns out not to be significant)
model2 = lm(cmedv ~ ptratio + rad + dis + rm + nox + crim + lstat + zn + tax, data = bos_trn)
summary(model2)$r.squared
## [1] 0.7343554

Based on the above, I will use model2 for further analysis and verification.

Verification of model2

#backward model selection by AIC
model_selected_aic = step(bm, direction = "backward", trace = 0)
names(coef(model_selected_aic))
##  [1] "(Intercept)" "crim"        "zn"          "chas1"       "nox"        
##  [6] "rm"          "dis"         "rad"         "tax"         "ptratio"    
## [11] "lstat"
summary(model_selected_aic)$r.squared
## [1] 0.7377497
#backward AIC selection starting from a model with all two-way interactions
model_start = lm(cmedv ~ (.)^2, data = bos_trn)
model_selected_aic_int = step(model_start, direction = "backward", trace = 0)
names(coef(model_selected_aic_int))
##  [1] "(Intercept)"   "crim"          "zn"            "chas1"        
##  [5] "nox"           "rm"            "age"           "dis"          
##  [9] "rad"           "tax"           "ptratio"       "b"            
## [13] "lstat"         "crim:chas1"    "crim:nox"      "crim:age"     
## [17] "crim:dis"      "crim:rad"      "crim:tax"      "crim:lstat"   
## [21] "zn:chas1"      "zn:dis"        "zn:rad"        "zn:tax"       
## [25] "zn:b"          "zn:lstat"      "chas1:nox"     "chas1:rm"     
## [29] "chas1:dis"     "chas1:rad"     "chas1:tax"     "chas1:ptratio"
## [33] "chas1:b"       "nox:age"       "nox:rad"       "nox:tax"      
## [37] "nox:ptratio"   "nox:b"         "nox:lstat"     "rm:age"       
## [41] "rm:rad"        "rm:tax"        "rm:ptratio"    "rm:b"         
## [45] "rm:lstat"      "age:dis"       "age:rad"       "age:b"        
## [49] "age:lstat"     "dis:rad"       "dis:lstat"     "rad:tax"      
## [53] "rad:ptratio"   "rad:b"         "rad:lstat"     "tax:ptratio"  
## [57] "tax:lstat"     "ptratio:lstat"
summary(model_selected_aic_int)$r.squared
## [1] 0.9381225

Comparing the AIC-selected models with the full model and model2

#comparing model2 with full model
anova(model2, bm)[2,"Pr(>F)"]
## [1] 0.1566313

As the \(p\)-value is large, we fail to reject the null hypothesis that the predictors dropped from the full model are unnecessary, so the preferred model is the smaller model2.

#comparing model2, selected model from AIC and selected model from AIC with interactions.
anova(model_selected_aic,model2)[2,"Pr(>F)"]
## [1] 0.024479
anova(model_selected_aic, model_selected_aic_int)[2,"Pr(>F)"]
## [1] 7.259136e-83

Based on the ANOVA tests, the preferred model is model_selected_aic_int, which also has the highest \(R^2\) value.

Further Analysis of the preferred model

par(mfrow = c(2,2))
plot(model_selected_aic_int)

sqrt(mean(resid(model_selected_aic_int) ^ 2))
## [1] 2.303212

Checking the assumptions of constant variance and normality: from the fitted versus residuals plot and the Q-Q plot, there appear to be some violations. I will perform the Breusch-Pagan test and the Shapiro-Wilk test for further checks.

bptest(model_selected_aic_int)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_selected_aic_int
## BP = 114.1, df = 57, p-value = 1.086e-05

For BP test on model_selected_aic_int, we see a small p-value, so we reject the null of homoscedasticity. The constant variance assumption is violated. This matches our findings with a fitted versus residuals plot.

shapiro.test(resid(model_selected_aic_int))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(model_selected_aic_int)
## W = 0.97363, p-value = 1.037e-06

From the Shapiro-Wilk test for model_selected_aic_int, a small \(p\)-value indicates there is only a small probability the data could have been sampled from a normal distribution.

The model diagnostics confirm that model_selected_aic_int needs further analysis and modifications such as response and predictor transformations, predictor interactions and polynomial terms. Besides these, I will also test the effect of outliers, high-leverage points and influential points, and clean the dataset accordingly.

#finding outliers and influential points

sum(hatvalues(model_selected_aic_int) > 2 * mean(hatvalues(model_selected_aic_int)))
## [1] 45
sum(abs(rstandard(model_selected_aic_int)) > 2, na.rm = TRUE)
## [1] 20
inf_points = cooks.distance(model_selected_aic_int)
sum(inf_points > 4 / length(inf_points), na.rm = TRUE)
## [1] 40
rem_points = inf_points > 4 / length(inf_points)

model_selected_aic_int has 45 high-leverage points and 20 points with large standardized residuals. In addition, 40 points are flagged as influential by the Cook's distance heuristic (distance greater than 4/n).

#removing influential points to check model assumptions
model_aic_int_fix = lm(model_selected_aic_int, subset = inf_points <= 4 / length(inf_points))

Let's verify the plots and diagnostic tests again for model_aic_int_fix.

par(mfrow = c(2,2))
plot(model_aic_int_fix)

bptest(model_aic_int_fix)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_aic_int_fix
## BP = 80.099, df = 57, p-value = 0.02352
shapiro.test(resid(model_aic_int_fix))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(model_aic_int_fix)
## W = 0.99107, p-value = 0.02643
summary(model_aic_int_fix)$r.squared
## [1] 0.9567399

From the above checks, removing the influential points clearly improves the model diagnostics: both test \(p\)-values increase substantially, and at \(\alpha = 0.01\) neither null hypothesis is rejected (though both would still be rejected at \(\alpha = 0.05\)). As a next step, I will perform response and predictor transformations to analyse further.

Since I intend to build a prediction model, let's compare the models so far with respect to train RMSE, test RMSE and LOOCV-RMSE.

#rmse definition
rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}
# RMSE Calculations
result = data.frame('Model' = c('model2', 'model_selected_aic', 'model_selected_aic_int', 'model_aic_int_fix'),
                    'Train RMSE' = c(rmse(bos_trn$cmedv, predict(model2)), rmse(bos_trn$cmedv, predict(model_selected_aic)), rmse(bos_trn$cmedv, predict(model_selected_aic_int)),rmse(bos_trn$cmedv, predict(model_aic_int_fix))),
                    'Test RMSE' = c(rmse(bos_tst$cmedv, predict(model2, bos_tst)), rmse(bos_tst$cmedv, predict(model_selected_aic, bos_tst)), rmse(bos_tst$cmedv, predict(model_selected_aic_int, bos_tst)),  rmse(bos_tst$cmedv, predict(model_aic_int_fix, bos_tst))))


result %>%
  kable(digits = 2) %>%
  kable_styling()
| Model | Train.RMSE | Test.RMSE |
|---|---|---|
| model2 | 4.77 | 5.69 |
| model_selected_aic | 4.74 | 5.45 |
| model_selected_aic_int | 2.30 | 14.74 |
| model_aic_int_fix | 9.69 | 15.15 |
#LOOCV-RMSE

calc_loocv_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
calc_loocv_rmse(model2)
## [1] 4.964532
calc_loocv_rmse(model_selected_aic)
## [1] 4.960245
calc_loocv_rmse(model_selected_aic_int)
## [1] 2.942174
calc_loocv_rmse(model_aic_int_fix)
## [1] 2.296457
#actual vs predicted

par(mfrow = c(2,2))

plot(predict(model2, bos_tst),bos_tst$cmedv, xlab="predicted",ylab="actual",
     xlim = c( 0, 30), ylim = c(0, 30), col = "dodgerblue", main = "model2")
abline(a=0,b=1, col = "darkorange", lwd = 2)

plot(predict(model_selected_aic, bos_tst),bos_tst$cmedv, xlab="predicted",ylab="actual",
     xlim = c( 0, 30), ylim = c(0, 30), col = "dodgerblue", main = "model_selected_aic")
abline(a=0,b=1, col = "darkorange", lwd = 2)

plot(predict(model_selected_aic_int, bos_tst),bos_tst$cmedv, xlab="predicted",ylab="actual",
     xlim = c( 0, 30), ylim = c(0, 30), col = "dodgerblue", main = "model_selected_aic_int")
abline(a=0,b=1, col = "darkorange", lwd = 2)

plot(predict(model_aic_int_fix, bos_tst),bos_tst$cmedv, xlab="predicted",ylab="actual",
     xlim = c( 0, 30), ylim = c(0, 30), col = "dodgerblue", main = "model_aic_int_fix")
abline(a=0,b=1, col = "darkorange", lwd = 2)

From the RMSE values, LOOCV-RMSE values and actual-versus-predicted plots, none of the models predicts well: the interaction models in particular fit the training data far better than the test data, and the majority of the predicted median house prices are higher than the actual prices (a quick bias check follows below).
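
The average signed prediction error on the test set quantifies this over-prediction (a small sketch; positive values mean the model predicts too high on average):

mean(predict(model_selected_aic, bos_tst) - bos_tst$cmedv)
mean(predict(model_selected_aic_int, bos_tst) - bos_tst$cmedv)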

As next steps, I will check for transformations of the predictor and response and check for lowest RMSE values.
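
Before committing to a particular response transformation, the Box-Cox procedure offers guidance; a minimal sketch (assuming the MASS package is available), where an estimated \(\lambda\) near 0 would support a log transformation of cmedv:

library(MASS)  # boxcox()
#profile the Box-Cox log-likelihood over a range of lambda for the full additive model
boxcox(bm, lambda = seq(-0.5, 0.75, by = 0.05))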

#response transformation: log of cmedv, keeping the predictors presumed to be significant

model_res_trans = lm(log(cmedv) ~ ptratio + rm + crim + nox + b + lstat + age + dis + rad + tax, data = bos_trn)

par(mfrow = c(2,2))
plot(model_res_trans)

bptest(model_res_trans)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_res_trans
## BP = 112.26, df = 10, p-value < 2.2e-16
shapiro.test(resid(model_res_trans))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(model_res_trans)
## W = 0.92491, p-value = 2.283e-13

The response transformation alone does not produce a model that satisfies the assumptions.

#finding outliers and influential points

sum(hatvalues(model_res_trans) > 2 * mean(hatvalues(model_res_trans)))
## [1] 29
sum(abs(rstandard(model_res_trans)) > 2, na.rm = TRUE)
## [1] 19
inf_points2 = cooks.distance(model_res_trans)
sum(inf_points2 > 4 / length(inf_points2), na.rm = TRUE)
## [1] 27
rem_points2 = inf_points2 > 4 / length(inf_points2)

model_res_trans has 29 high-leverage points and 19 points with large standardized residuals. In addition, 27 points are flagged as influential by the Cook's distance heuristic.

#removing influential points to check model assumptions
model_res_trans_fix = lm(model_res_trans, subset = inf_points2 <= 4 / length(inf_points2))

#checking assumptions
par(mfrow = c(2,2))
plot(model_res_trans_fix)

bptest(model_res_trans_fix)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_res_trans_fix
## BP = 68.521, df = 10, p-value = 8.55e-11
shapiro.test(resid(model_res_trans_fix))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(model_res_trans_fix)
## W = 0.98455, p-value = 0.0004574
summary(model_res_trans_fix)$r.squared
## [1] 0.8873163
calc_loocv_rmse(model_res_trans_fix)
## [1] 0.1205345
rmse(bos_trn$cmedv, exp(predict(model_res_trans_fix)))
## [1] 9.926192
rmse(bos_tst$cmedv, exp(predict(model_res_trans_fix, bos_tst)))
## [1] 3.93472

There is an improvement in the normality diagnostics as well as in the test RMSE.

#Predictor transformation

pairs(~ cmedv + ptratio + b + lstat + dis + rm + crim, data = bos_trn, pch = 1, col = "darkorange")

# lstat and rm show the clearest relationships with cmedv, and the trends look curved rather than strictly linear. Adding polynomial terms to check the fitted model.

model_pred_trans = lm(cmedv ~ lstat + crim + rm + dis + b + chas + nox + rad + tax + ptratio + I(lstat^2) + I(rm^2) , data = bos_trn)

par(mfrow = c(2,2))
plot(model_pred_trans)

bptest(model_pred_trans)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_pred_trans
## BP = 80.095, df = 12, p-value = 3.959e-12
shapiro.test(resid(model_pred_trans))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(model_pred_trans)
## W = 0.83699, p-value < 2.2e-16

Model assumptions are violated.

#finding outliers and influential points

sum(hatvalues(model_pred_trans) > 2 * mean(hatvalues(model_pred_trans)))
## [1] 32
sum(abs(rstandard(model_pred_trans)) > 2, na.rm = TRUE)
## [1] 13
inf_points3 = cooks.distance(model_pred_trans)
sum(inf_points3 > 4 / length(inf_points3), na.rm = TRUE)
## [1] 17
rem_points3 = inf_points3 > 4 / length(inf_points3)

model_pred_trans has 32 high-leverage points and 13 points with large standardized residuals. In addition, 17 points are flagged as influential by the Cook's distance heuristic.

#removing influential points to check model assumptions
model_pred_trans_fix = lm(model_pred_trans, subset = inf_points3 <= 4 / length(inf_points3))

#checking assumptions
par(mfrow = c(2,2))
plot(model_pred_trans_fix)

bptest(model_pred_trans_fix)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_pred_trans_fix
## BP = 44.4, df = 12, p-value = 1.305e-05
shapiro.test(resid(model_pred_trans_fix))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(model_pred_trans_fix)
## W = 0.9834, p-value = 0.000196
summary(model_pred_trans_fix)$r.squared
## [1] 0.9082454
calc_loocv_rmse(model_pred_trans_fix)
## [1] 2.738876
rmse(bos_trn$cmedv, predict(model_pred_trans_fix))
## [1] 9.510312
rmse(bos_tst$cmedv, predict(model_pred_trans_fix, bos_tst))
## [1] 4.049455

**Experimenting with both predictor and response transformation**

model_mix = lm(log(cmedv) ~ ptratio + rm + crim + nox + b + lstat + age + dis + rad + tax + I(rm ^2) + I(lstat ^2), data = bos_trn)
inf_points4 = cooks.distance(model_mix)
model_mix_fix = lm(model_mix, subset = inf_points4 <= 4 / length(inf_points4) )

par(mfrow = c(2,2))
plot(model_mix_fix)

bptest(model_mix_fix)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_mix_fix
## BP = 45.423, df = 12, p-value = 8.72e-06
shapiro.test(resid(model_mix_fix))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(model_mix_fix)
## W = 0.99614, p-value = 0.4853
summary(model_mix_fix)$r.squared
## [1] 0.8966536
calc_loocv_rmse(model_mix_fix)
## [1] 0.1148345
rmse(bos_trn$cmedv, exp(predict(model_mix_fix)))
## [1] 9.698517
rmse(bos_tst$cmedv, exp(predict(model_mix_fix, bos_tst)))
## [1] 3.570833

As I only care about prediction, I do not need to worry about correlation versus causation, and strict adherence to the model assumptions is less critical. I will therefore look for the model with the lowest test RMSE and LOOCV-RMSE, since LOOCV-RMSE penalizes large, overfitted models in much the same spirit as AIC or BIC.
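
For reference, backward selection by BIC can be run the same way as the earlier AIC selection; a minimal sketch (step() with k = log(n) applies the BIC penalty; model_selected_bic is a new name and is not used in the comparison table below):

n = nrow(bos_trn)
model_selected_bic = step(bm, direction = "backward", k = log(n), trace = 0)
names(coef(model_selected_bic))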

result = data.frame('Model' = c('model2', 'model_selected_aic', 'model_selected_aic_int', 'model_pred_trans', 'model_res_trans', 'model_aic_int_fix', 'model_res_trans_fix', 'model_pred_trans_fix','model_mix_fix'),
                    
                    'Train RMSE' = c(rmse(bos_trn$cmedv, predict(model2)), rmse(bos_trn$cmedv, predict(model_selected_aic)), rmse(bos_trn$cmedv, predict(model_selected_aic_int)),rmse(bos_trn$cmedv, predict(model_pred_trans)),rmse(bos_trn$cmedv, exp(predict(model_res_trans))), rmse(bos_trn$cmedv, predict(model_aic_int_fix)),rmse(bos_trn$cmedv, exp(predict(model_res_trans_fix))), rmse(bos_trn$cmedv, predict(model_pred_trans_fix)),rmse(bos_trn$cmedv, exp(predict(model_mix_fix)))),
                    
                    'Test RMSE' = c(rmse(bos_tst$cmedv, predict(model2, bos_tst)), rmse(bos_tst$cmedv, predict(model_selected_aic, bos_tst)), rmse(bos_tst$cmedv, predict(model_selected_aic_int, bos_tst)), rmse(bos_tst$cmedv, predict(model_pred_trans, bos_tst)),rmse(bos_tst$cmedv, exp(predict(model_res_trans, bos_tst))), rmse(bos_tst$cmedv, predict(model_aic_int_fix, bos_tst)), rmse(bos_tst$cmedv, exp(predict(model_res_trans_fix, bos_tst))), rmse(bos_tst$cmedv, predict(model_pred_trans_fix, bos_tst)),rmse(bos_tst$cmedv, exp(predict(model_mix_fix, bos_tst)))),
                    
                    'LOOCV-RMSE' = c(calc_loocv_rmse(model2),calc_loocv_rmse(model_selected_aic),calc_loocv_rmse(model_selected_aic_int),calc_loocv_rmse(model_pred_trans),calc_loocv_rmse(model_res_trans), calc_loocv_rmse(model_aic_int_fix),calc_loocv_rmse(model_res_trans_fix), calc_loocv_rmse(model_pred_trans_fix),calc_loocv_rmse(model_mix_fix)),
                    
                    "adj.r.squared" = c(summary(model2)$adj.r.squared, summary(model_selected_aic)$adj.r.squared, summary(model_selected_aic_int)$adj.r.squared, summary(model_pred_trans)$adj.r.squared, summary(model_res_trans)$adj.r.squared, summary(model_aic_int_fix)$adj.r.squared, summary(model_res_trans_fix)$adj.r.squared, summary(model_pred_trans_fix)$adj.r.squared, summary(model_mix_fix)$adj.r.squared)
                    
)

kable(result, digits = 2) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(3:4, bold = T) %>%
  row_spec(9:9, bold = T, color = "white", background = "green")
| Model | Train.RMSE | Test.RMSE | LOOCV.RMSE | adj.r.squared |
|---|---|---|---|---|
| model2 | 4.77 | 5.69 | 4.96 | 0.73 |
| model_selected_aic | 4.74 | 5.45 | 4.96 | 0.73 |
| model_selected_aic_int | 2.30 | 14.74 | 2.94 | 0.93 |
| model_pred_trans | 4.05 | 3.49 | 4.36 | 0.80 |
| model_res_trans | 4.43 | 3.56 | 0.20 | 0.77 |
| model_aic_int_fix | 9.69 | 15.15 | 2.30 | 0.95 |
| model_res_trans_fix | 9.93 | 3.93 | 0.12 | 0.88 |
| model_pred_trans_fix | 9.51 | 4.05 | 2.74 | 0.91 |
| model_mix_fix | 9.70 | 3.57 | 0.11 | 0.89 |

c. Results:

Judged by LOOCV-RMSE, the best model appears to be model_mix_fix. Note, however, that the LOOCV-RMSE values for the log-response models (model_res_trans, model_res_trans_fix and model_mix_fix) are computed on the log scale and so are not directly comparable with the other rows; model_mix_fix also has one of the lowest test RMSE values, which supports the choice. A back-transformed LOOCV-RMSE is sketched below.
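
For the log-response models, a leave-one-out RMSE on the original scale can be sketched by back-transforming the leave-one-out predictions; calc_loocv_rmse_log is a helper name introduced here purely for illustration:

calc_loocv_rmse_log = function(model) {
  y_log = model.response(model.frame(model))                    # log(cmedv) values used in the fit
  loo_pred_log = y_log - resid(model) / (1 - hatvalues(model))  # leave-one-out predictions on the log scale
  sqrt(mean((exp(y_log) - exp(loo_pred_log)) ^ 2))              # back-transform before computing RMSE
}
calc_loocv_rmse_log(model_mix_fix)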

Next, I will plot the actual vs predicted median house prices to choose the best model for the predictions.

par(mfrow = c(2,2))

plot(predict(model_pred_trans, bos_tst),bos_tst$cmedv, xlab="predicted",ylab="actual",
     xlim = c( 0, 30), ylim = c(0, 30), col = "dodgerblue", main = "model_pred_trans")
abline(a=0,b=1, col = "darkorange", lwd = 2)

plot(exp(predict(model_res_trans, bos_tst)),bos_tst$cmedv, xlab="predicted",ylab="actual",
     xlim = c( 0, 30), ylim = c(0, 30), col = "dodgerblue", main = "model_res_trans")
abline(a=0,b=1, col = "darkorange", lwd = 2)

plot(predict(model_pred_trans_fix, bos_tst),bos_tst$cmedv, xlab="predicted",ylab="actual",
     xlim = c( 0, 30), ylim = c(0, 30), col = "dodgerblue", main = "model_pred_trans_fix")
abline(a=0,b=1, col = "darkorange", lwd = 2)

plot(exp(predict(model_res_trans_fix, bos_tst)),bos_tst$cmedv, xlab="predicted",ylab="actual",
     xlim = c( 0, 30), ylim = c(0, 30), col = "dodgerblue", main = "model_res_trans_fix")
abline(a=0,b=1, col = "darkorange", lwd = 2)

par(mfrow = c(2,2))

plot(predict(model2, bos_tst),bos_tst$cmedv, xlab="predicted",ylab="actual",
     xlim = c( 0, 30), ylim = c(0, 30), col = "dodgerblue", main = "model2")
abline(a=0,b=1, col = "darkorange", lwd = 2)

plot(predict(model_selected_aic, bos_tst),bos_tst$cmedv, xlab="predicted",ylab="actual",
     xlim = c( 0, 30), ylim = c(0, 30), col = "dodgerblue", main = "model_selected_aic")
abline(a=0,b=1, col = "darkorange", lwd = 2)

plot(predict(model_selected_aic_int, bos_tst),bos_tst$cmedv, xlab="predicted",ylab="actual",
     xlim = c( 0, 30), ylim = c(0, 30), col = "dodgerblue", main = "model_selected_aic_int")
abline(a=0,b=1, col = "darkorange", lwd = 2)

plot(predict(model_aic_int_fix, bos_tst),bos_tst$cmedv, xlab="predicted",ylab="actual",
     xlim = c( 0, 30), ylim = c(0, 30), col = "dodgerblue", main = "model_aic_int_fix")
abline(a=0,b=1, col = "darkorange", lwd = 2)

I will choose model_mix_fix, which has the lowest LOOCV-RMSE (on the log scale) and one of the lowest test RMSE values: lm(log(cmedv) ~ ptratio + rm + crim + nox + b + lstat + age + dis + rad + tax + I(rm^2) + I(lstat^2)), refit after removing the influential points.

#99% Confidence & Prediction Interval of the chosen model

which.max(bos_tst$cmedv)
## [1] 69
exp(predict(model_mix_fix, level = 0.99, interval = "confidence", bos_tst[69,] ))
##          fit      lwr      upr
## 474 22.01349 20.46585 23.67817
exp(predict(model_mix_fix, level = 0.99, interval = "prediction", bos_tst[69,] ))
##          fit      lwr      upr
## 474 22.01349 16.42219 29.50848
#plotting the best model for prediction.

plot(exp(predict(model_mix_fix, bos_tst)),bos_tst$cmedv, xlab="predicted",ylab="actual",
     xlim = c( 0, 30), ylim = c(0, 30), col = "dodgerblue", main = "model_mix_fix" )
abline(a=0,b=1, col = "darkorange", lwd = 2)

which.min(bos_tst$cmedv)
## [1] 1
bos_tst[1,]
##     cmedv    crim zn chas   nox    rm age    dis rad tax ptratio      b
## 406     5 67.9208  0    0 0.693 5.683 100 1.4254  24 666    20.2 384.97
##     lstat
## 406 22.98
predict(model_pred_trans, bos_tst[1,] )
##      406 
## 1.911767
exp(predict(model_res_trans, bos_tst[1,] ))
##      406 
## 6.446122
exp(predict(model_res_trans_fix, bos_tst[1,]))
##      406 
## 4.872437
exp(predict(model_mix_fix, bos_tst[1,]))
##      406 
## 4.398506

d. Discussion:

An important observation during model selection and diagnostics concerns the factor variable town, which I left out of the fitted models for simplicity. The fitted model therefore takes into account all significant predictors except the location factors, even though modern prediction tools do use location/town as one of the major criteria.

The challenge I faced when including town was in creating the train and test data sets: I could not calculate a test RMSE because the test dataset contained towns the model had not been trained on. A possible workaround is sketched below.
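
This is a minimal sketch of one workaround, assuming the un-subsetted BostonHousing2 (with the town column retained) and a randomized split; test rows whose town never appears in the training data are dropped, and the model formula is purely illustrative:

data("BostonHousing2")
bos2 = na.omit(BostonHousing2)
set.seed(42)                                            # arbitrary seed for reproducibility
idx = sample(nrow(bos2), size = round(0.8 * nrow(bos2)))
trn2 = bos2[idx, ]
tst2 = bos2[-idx, ]
tst2 = tst2[tst2$town %in% unique(trn2$town), ]         # keep only towns seen during training
m_town = lm(cmedv ~ town + rm + lstat + ptratio + dis, data = trn2)  # illustrative formula
rmse(tst2$cmedv, predict(m_town, tst2))                 # rmse() as defined earlier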

Another observation that confirms what we studied during the course is that, for prediction, we do not really need to worry about correlation versus causation. Although I fitted a model (model_aic_int_fix) that came close to satisfying the constant-variance and normality assumptions, it did not predict well, having a much higher test RMSE. The model chosen for prediction (model_mix_fix) still violated the constant-variance assumption, yet had the best LOOCV-RMSE and a better test RMSE.

None of the models predicts the actual prices perfectly, and I think including town would improve accuracy.

e. Appendix:

Reference for Statistics with R: https://daviddalpiaz.github.io/appliedstats/.

Kaggle data source: https://www.kaggle.com/andyxie/regression-with-r-boston-housing-price.

Actual data source: http://lib.stat.cmu.edu/datasets/, available via the mlbench library in R.