# MAE301 Homework 4

March 13, 2019

Please submit your homework as a single PDF through Canvas. You can take pictures of your hand-written work and attach them to the PDF.

## 1 Problem Description

You are given the data file `dat_train.txt`, which lists a set of cars with their attributes. Your goal is to build a linear model that treats the price of the cars (the first column in the data) as the response to the other car attributes. The homework will be graded in two parts: (50pt) the following report on model estimation and selection, and (20pt) the prediction performance of your model on a set of test data. Details are explained below.

## 2 Preparation

You will need numpy and matplotlib, as usual, as well as the standard machine learning packages from scikit-learn.

```python
# use this if you are using a jupyter notebook, otherwise comment it out
%matplotlib inline

# load all packages
import numpy as np
from sklearn import decomposition
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
```

Next, you will need to import the data as an array. Please make sure the .txt file is in the same directory/folder as your code.

```python
data = np.genfromtxt('dat_train.txt')
var = ['Price', 'Engine size', 'Cylinders', 'HP', 'City MPG',
       'Highway MPG', 'Weight', 'Wheel base', 'Length', 'Width']
```

In class we talked about data preprocessing: the data columns should have zero mean and unit variance. In other words, the data should be centered at the origin, and the scales of the attributes should not differ too much. One way to preprocess is to use an appropriate scaling function from the `sklearn.preprocessing` submodule; see the scikit-learn documentation for that submodule. In the following code, complete the import by erasing the "???" and replacing it with the proper function.

```python
from sklearn.preprocessing import ???
```
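For intuition about what the scaling step should accomplish, here is a minimal numpy-only sketch (deliberately *not* the sklearn function you are asked to find) that standardizes each column of a toy array by hand:

```python
import numpy as np

# toy data: two columns with very different means and scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# standardize each column: subtract its mean, divide by its standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # each column mean is (numerically) 0
print(X_std.std(axis=0))   # each column standard deviation is 1
```

The sklearn scaler you import should produce the same result on this toy array, with the added convenience that the fitted means and standard deviations are stored for reuse on new data.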
```python
# import the standard scaler
scaler_func = ???()                        # assign the standard scaler to "scaler_func"
scaler_func.fit(data)                      # train the scaler for later use
data_scaled = scaler_func.transform(data)  # preprocess the data
```

## 3 Ordinary Least Squares (20pt)

As before, find the appropriate function for linear regression (this time in the `sklearn.linear_model` submodule).

### 3.1 Complete the code below (5pt)

```python
from sklearn.linear_model import ???

# the response is the car price, i.e., the first column of the data matrix
y_train = data_scaled[:, 0]

# the covariates are the other columns; dealer cost is not included
# since it correlates too strongly with the price
X_train = data_scaled[:, 1:]

ncv = 10  # number of cross-validation folds *KEEP THIS CONSTANT*

lin_reg = ???()     # new "modelling" object
lin_reg.???         # do the regression on the above
yhat = lin_reg.???  # predict prices based on the model

# evaluate the model using cross-validation
scores = cross_val_score(lin_reg, X_train, y_train,
                         scoring="neg_mean_squared_error", cv=ncv)

#### Leave the following plotting code as-is ####
from matplotlib import gridspec

fig = plt.figure(figsize=(12, 7))
gs = gridspec.GridSpec(1, 2, width_ratios=[3, 1])
ax = plt.subplot(gs[0])
plt.scatter(y_train, yhat, label="Linear 1deg OLS", clip_on=False)
plt.title("CV avg score: {:.3}".format(abs(scores.mean())))
plt.xlabel("True car price")
plt.ylabel("Predicted car price")
plt.xlim((min(y_train), max(y_train)))
plt.ylim((min(yhat), max(yhat)))
plt.plot(range(100000), label="perfect prediction")
plt.legend()

ax = plt.subplot(gs[1])
ypos = np.arange(len(var[1:]))
plt.barh(ypos, lin_reg.coef_, align='center')
plt.yticks(ypos, var[1:])
ax.yaxis.tick_right()
plt.title("Coefficient Loadings")
plt.xlim([-1, 1])
fig.tight_layout(pad=1.9)
#################################################
```

### 3.2 Discussion (15pt)

Is the model reasonable? For what range of prices is the prediction best? What does the CV score mean (should it be large or small)?
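One detail worth keeping in mind for the discussion: with `scoring="neg_mean_squared_error"`, scikit-learn reports the *negative* MSE, so that larger is always better across different scorers. That is why the plotting code takes `abs()` of the mean score. A small sketch on synthetic data (all variable names here are hypothetical, not part of the homework):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X_toy = rng.randn(100, 3)  # 100 samples, 3 covariates
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(100)

scores = cross_val_score(LinearRegression(), X_toy, y_toy,
                         scoring="neg_mean_squared_error", cv=5)
print(scores)          # every entry is <= 0
print(-scores.mean())  # the average MSE; smaller means better predictions
```

So when comparing models, a CV "score" closer to zero (i.e., a smaller MSE after flipping the sign) indicates better out-of-fold prediction.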
## 4 Model Selection (30pt)

In this section we will discuss model selection using three criteria: cross-validation (CV), AIC, and BIC. We will investigate how model complexity affects the training error and these model selection criteria. To do so, we will make use of a neat API in sklearn called `Pipeline`. It lets us queue up a series of modeling tasks and perform all of them, as long as the same data is being used. Specifically, we will tell the pipeline to automatically generate the columns for models with different orders of polynomial terms, and then fit them using linear OLS. Once again, fill in the "???" by reading the docs on `Pipeline`. Use the linear OLS from the last problem, and find sklearn's function for automatically generating the polynomial columns in the `sklearn.preprocessing` submodule.

### 4.1 Complete the code (10pt)

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import ???  # find the function in sklearn.preprocessing
                                       # that generates polynomial terms

# a set of linear models with polynomial terms
degrees = [???]  # make a list of all degrees of polynomial models to try

scores = np.zeros((len(degrees), ncv))  # CV error for each model type
error = np.zeros(len(degrees))          # training error for each model type

fig = plt.figure(figsize=(16, 5))
for i in range(len(degrees)):
    polynomial = ???(degree=degrees[i], include_bias=True, interaction_only=True)
    lin_reg = ???(fit_intercept=False)  # our data has been scaled, no need for an intercept
    pipeline = Pipeline([("Polynomial Features", polynomial),
                         ("Linear OLS", lin_reg)])
    pipeline.???         # do the regression on the above
    yhat = pipeline.???  # predict prices based on the model

    # evaluate the models using cross-validation
    scores[i, :] = cross_val_score(pipeline, X_train, y_train,
                                   scoring="neg_mean_squared_error", cv=ncv)

    # calculate the training error
    error[i] = ((yhat - y_train)**2).sum()

    #### Leave the plotting code as-is ####
    n = yhat.shape[0]
    ax = plt.subplot(1, len(degrees), i + 1, sharey=ax)
    plt.setp(ax, xticks=(), yticks=())
    plt.scatter(y_train, yhat, label="Linear "+str(degrees[i])+"deg OLS", clip_on=False)
    plt.title("CV avg score: {:.3}\nAIC = {:.2e}\nBIC = {:.2e}".format(
        abs(scores[i, :].mean()),
        2*polynomial.n_output_features_ + n*np.log(error[i]/n),
        np.log(y_train.shape[0])*polynomial.n_output_features_ + n*np.log(error[i]/n)))
    plt.xlim((min(y_train), max(y_train)))
    plt.ylim((min(yhat), max(yhat)))
    plt.plot(range(100000), label="perfect prediction")
    plt.xlabel("True Car Price")
    plt.ylabel("Predicted Car Price")
    plt.legend()

fig.suptitle("Prediction and "+str(ncv)+"-fold cross-validation for polynomial fits (deg 1-4)")
```

### 4.2 Discussion (20pt)

Based on the above, which model seems to be the "best"? What is happening as the model complexity increases?

## 5 Design Your Own Model (20pt)

Now that you have learned the basics of how a model is built and selected, it is time to investigate by yourself an appropriate model for this data. The model does not have to contain only polynomial terms. Report the cross-validation, AIC, and BIC scores for your model. If you are struggling to find a good model, try plotting histograms of the individual variables. If there are ranges with unusually large counts, you may want to find a function that reshapes that variable accordingly. Lastly, please apply your model to the test data `dat_test.txt` and report the mean squared error using the following code.
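One caveat before the code: the snippet labels `np.sqrt(np.sum((yt - yhat)**2))` as the "residual", which is the square root of the residual sum of squares rather than the mean squared error. If you want the MSE itself, it is simply the average squared residual; a minimal numpy sketch with hypothetical arrays:

```python
import numpy as np

yt = np.array([10000.0, 20000.0, 30000.0])    # hypothetical true test prices
yhat = np.array([11000.0, 19000.0, 33000.0])  # hypothetical predictions

mse = np.mean((yt - yhat)**2)                # mean squared error
rss_root = np.sqrt(np.sum((yt - yhat)**2))   # what the snippet below calls "residual"
print(mse, rss_root)
```

Either quantity is fine for comparing models on the same test set, but be explicit about which one you report.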
```python
# import the test data
test = np.genfromtxt("dat_test.txt")
print(test.shape)
test_scaled = scaler_func.transform(test)
Xt = test_scaled[:, 1:]
# print(test[:, 1:].mean(axis=0))
yt_scale = test_scaled[:, 0]
yt = test[:, 0]
y = data[:, 0]  # unscaled training prices, used below to undo the scaling

fig, ax = plt.subplots(ncols=2, figsize=(12, 6))

# first the linear OLS
yhat_scale = lin_reg.predict(Xt)
yhat = yhat_scale*y.std() + y.mean()
ax[0].scatter(yt, yhat)
ax[0].plot(range(100000))
residual = np.sqrt(np.sum((yt - yhat)**2))
ax[0].set_title("OLS Residual: {:.3}".format(residual))
```

## 6 Read More: LASSO (Least Absolute Shrinkage and Selection Operator)

Here we introduce a commonly used yet more sophisticated linear regression method called LASSO. The key idea is that we prefer simple linear models, i.e., models with only a few non-zero weights, so that the model is easy to interpret and more generalizable (see Occam's Razor). To achieve this, LASSO penalizes the magnitudes of the weights.

```python
from sklearn.linear_model import LassoCV, Lasso

model = LassoCV(cv=ncv, fit_intercept=False)  # alpha chosen by CV, with ncv from above
model.fit(X_train, y_train)
yhat = model.predict(X_train)
print(model.alpha_)
scores = cross_val_score(model, X_train, y_train,
                         scoring="neg_mean_squared_error", cv=ncv)
print("CV avg MSE ", -1*scores.mean())
print(model.coef_)

from matplotlib import gridspec
plt.figure(figsize=(12, 7))
gs = gridspec.GridSpec(1, 2, width_ratios=[3, 1])
ax = plt.subplot(gs[0])
plt.setp(ax, xticks=(), yticks=())
plt.scatter(y_train, yhat, label="Linear 1deg LASSO", clip_on=False)
plt.title("CV avg score: {:.3}".format(abs(scores.mean())))
plt.xlabel("True car price")
plt.ylabel("Predicted Car Price")
plt.plot(range(6), label="perfect prediction")
# plt.xlim((min(y), max(y)))
# plt.ylim((min(yhat), max(yhat)))
plt.legend()

ax = plt.subplot(gs[1])
ypos = np.arange(len(var[1:]))
plt.barh(ypos, model.coef_, align='center')
plt.yticks(ypos, var[1:])
ax.yaxis.tick_right()
plt.title("Coefficient Loadings")
plt.xlim([-1, 1])
plt.tight_layout(pad=1.9)

# print(model.path(X_train, y_train)[1])
fig, ax = plt.subplots(ncols=2, figsize=(15, 7))
print(-np.log10(model.alpha_))
ax[0].plot(-np.log10(model.path(X_train, y_train)[0]),
           model.path(X_train, y_train)[1].T)
ax[0].legend([str(i) for i in range(9)], loc=0)
ax[0].axvline(x=-np.log10(model.alpha_), ls='--')
ax[0].set_title("Coef path for -log10(alpha)")
# ax[0].set_ylim((-5, 5))
ax[1].plot(-np.log10(model.alphas_), model.mse_path_)
ax[1].axvline(x=-np.log10(model.alpha_), ls='--')
ax[1].set_title("MSE on "+str(ncv)+" folds for -log10(alpha)")
```

The following code compares the prediction performance of LASSO against OLS.

```python
# import the test data
test = np.genfromtxt("dat_test.txt")
print(test.shape)
test_scaled = scaler_func.transform(test)
Xt = test_scaled[:, 1:]
yt_scale = test_scaled[:, 0]
yt = test[:, 0]
y = data[:, 0]  # unscaled training prices, used below to undo the scaling

fig, ax = plt.subplots(ncols=2, figsize=(12, 6))

# first the linear OLS
yhat_scale = lin_reg.predict(Xt)
yhat = yhat_scale*y.std() + y.mean()
ax[0].scatter(yt, yhat)
ax[0].plot(range(100000))
residual = np.sqrt(np.sum((yt - yhat)**2))
ax[0].set_title("OLS Residual: {:.3}".format(residual))

# now the LASSO
yhat_scale = model.predict(Xt)
yhat = yhat_scale*y.std() + y.mean()
ax[1].scatter(yt, yhat)
ax[1].plot(range(100000))
residual = np.sqrt(np.sum((yt - yhat)**2))
ax[1].set_title("LASSO Residual: {:.3}".format(residual))
fig.tight_layout()
```
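To see the "selection" part of LASSO in action, here is a small self-contained sketch on synthetic data (all names hypothetical, not the homework data): with enough regularization, coefficients of irrelevant covariates are driven exactly to zero, whereas OLS keeps them small but non-zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
# only the first two covariates actually matter
y = 3.0*X[:, 0] - 2.0*X[:, 1] + 0.1*rng.randn(200)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(ols.coef_)    # all five coefficients are non-zero, even the irrelevant ones
print(lasso.coef_)  # the three irrelevant coefficients are shrunk to exactly 0
```

This is why the coefficient-path plot above is informative: as alpha decreases (moving right on the -log10(alpha) axis), coefficients enter the model one by one, and the CV-selected alpha picks a point on that path.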