### Parameter Selection, Validation & Testing

#### Hyperparameters, Over-fitting, and Under-fitting

The issues associated with validation and cross-validation are some of the most important aspects of the practice of machine learning.

If our estimator is underperforming, how should we move forward?

- Use a simpler or a more complicated model?
- Add more features to each observed data point?
- Add more training samples?

Knowing which of these steps will actually improve a model is what separates successful machine learning practitioners from unsuccessful ones.

#### Bias-variance trade-off: illustration on a simple regression problem

We start with a simple one-dimensional regression problem: fitting a straight line through two points.

```
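import numpy as np
import matplotlib.pyplot as pl
from sklearn import linear_model

# Two training points lying exactly on the line y = x
X = np.c_[.5, 1].T
y = [.5, 1]
X_test = np.c_[0, 2].T

regr = linear_model.LinearRegression()
regr.fit(X, y)
pl.plot(X, y, 'o')
pl.plot(X_test, regr.predict(X_test))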
```

In real-life situations, the data contain noise.

```
np.random.seed(0)
for _ in range(6):
    # Perturb the inputs with Gaussian noise, then refit the line
    noise = np.random.normal(loc=0, scale=.1, size=X.shape)
    noisy_X = X + noise
    pl.plot(noisy_X, y, 'o')
    regr.fit(noisy_X, y)
    pl.plot(X_test, regr.predict(X_test))
```

Each pair of points drawn in the same color determines one fitted line.

As we can see, our linear model captures and amplifies the noise in the data. It displays a lot of variance.

We can instead use another linear estimator that applies regularization, `Ridge`:

```
regr = linear_model.Ridge(alpha=.1)
np.random.seed(0)
for _ in range(6):
    # Same noisy data as before, now fitted with an L2 penalty
    noise = np.random.normal(loc=0, scale=.1, size=X.shape)
    noisy_X = X + noise
    pl.plot(noisy_X, y, 'o')
    regr.fit(noisy_X, y)
    pl.plot(X_test, regr.predict(X_test))
```

As we can see, the estimator displays much less variance. However, it systematically underestimates the coefficient: its behavior is biased.
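To put numbers on the two behaviors described above, the following sketch repeats the noisy fit many times and summarizes the fitted slope for each estimator. The helper name `slope_stats`, the 200 trials, and the use of a local `RandomState` are illustrative assumptions, not part of the original example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Same two training points as above; the noise-free slope is 1
X = np.c_[.5, 1].T
y = [.5, 1]


def slope_stats(model, n_trials=200, scale=.1, seed=0):
    """Refit `model` on noisy copies of X; return mean and std of the slope."""
    rng = np.random.RandomState(seed)
    slopes = []
    for _ in range(n_trials):
        noisy_X = X + rng.normal(loc=0, scale=scale, size=X.shape)
        slopes.append(model.fit(noisy_X, y).coef_[0])
    return np.mean(slopes), np.std(slopes)


ols_mean, ols_std = slope_stats(LinearRegression())
ridge_mean, ridge_std = slope_stats(Ridge(alpha=.1))

print("LinearRegression: mean slope %.2f, std %.2f" % (ols_mean, ols_std))
print("Ridge(alpha=0.1): mean slope %.2f, std %.2f" % (ridge_mean, ridge_std))
```

A small standard deviation together with a mean slope well below 1 is the numerical signature of low variance and high bias; a large standard deviation is the signature of high variance.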

#### Learning Curves and the Bias/Variance Tradeoff

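Consider fitting a polynomial of degree d. With d=1 the model is too simple and tends to under-fit the data: it has high bias.

With d=6 on a small data set, the model tends to over-fit: it has high variance.

Next, we generate some data and show how the polynomial degree affects the training and validation errors.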

```
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation

def test_func(x, err=0.5):
    # Noisy samples around the curve 10 - 1 / (x + 0.1)
    return np.random.normal(10 - 1. / (x + 0.1), err)

def compute_error(x, y, p):
    # RMS error of the polynomial with coefficients p on the points (x, y)
    yfit = np.polyval(p, x)
    return np.sqrt(np.mean((y - yfit) ** 2))

N = 200
test_size = 0.4
error = 1.0

# randomly sample the data
np.random.seed(1)
x = np.random.random(N)
y = test_func(x, error)

# split into training and validation sets
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=test_size)

# show the training (red) and validation (blue) sets
plt.scatter(xtrain, ytrain, color='red')
plt.scatter(xtest, ytest, color='blue')
```

```
# suppress warnings from Polyfit
import warnings
warnings.filterwarnings('ignore', message='Polyfit*')

degrees = np.arange(21)
train_err = np.zeros(len(degrees))
validation_err = np.zeros(len(degrees))

# fit one polynomial per degree and record both errors
for i, d in enumerate(degrees):
    p = np.polyfit(xtrain, ytrain, d)
    train_err[i] = compute_error(xtrain, ytrain, p)
    validation_err[i] = compute_error(xtest, ytest, p)

fig, ax = plt.subplots()
ax.plot(degrees, validation_err, lw=2, label='cross-validation error')
ax.plot(degrees, train_err, lw=2, label='training error')
ax.plot([0, 20], [error, error], '--k', label='intrinsic error')
ax.legend(loc=0)
ax.set_xlabel('degree of fit')
ax.set_ylabel('rms error')
```

This figure compactly shows the reason that validation is important. On the left side of the plot, we have a very low-degree polynomial, which under-fits the data. This leads to a very high error for both the training set and the validation set. On the far right side of the plot, we have a very high-degree polynomial, which over-fits the data. This can be seen in the fact that the training error is very low, while the validation error is very high. Plotted for comparison is the intrinsic error (this is the scatter artificially added to the data). For this toy dataset, error = 1.0 is the best we can hope to attain. Choosing d=6 in this case gets us very close to the optimal error.
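The same reasoning can be turned into a simple selection rule: pick the degree with the lowest validation error. The sketch below is self-contained and regenerates similar toy data; the names `rms_error`, `xval`, and `best_degree`, and the fixed `random_state`, are assumptions made for illustration, so the selected degree may differ from the figure discussed above.

```python
import warnings

import numpy as np
from sklearn.model_selection import train_test_split

# High-degree fits are poorly conditioned; silence the Polyfit warning
warnings.filterwarnings('ignore', message='Polyfit*')


def rms_error(x, y, p):
    """RMS error of the polynomial with coefficients p on the points (x, y)."""
    return np.sqrt(np.mean((y - np.polyval(p, x)) ** 2))


# Toy data: the curve 10 - 1 / (x + 0.1) plus Gaussian scatter of width 1.0
rng = np.random.RandomState(1)
x = rng.random_sample(200)
y = rng.normal(10 - 1. / (x + 0.1), 1.0)
xtrain, xval, ytrain, yval = train_test_split(
    x, y, test_size=0.4, random_state=0)

degrees = np.arange(21)
fits = [np.polyfit(xtrain, ytrain, d) for d in degrees]
train_err = np.array([rms_error(xtrain, ytrain, p) for p in fits])
val_err = np.array([rms_error(xval, yval, p) for p in fits])

# Model selection: keep the degree that minimizes the validation error
best_degree = degrees[np.argmin(val_err)]
print("best degree:", best_degree)
print("validation RMS error at best degree: %.3f" % val_err.min())
```

Note that the training error is not used for the selection: it keeps shrinking as the model gets more flexible, so it cannot detect over-fitting.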

#### Learning Curves

A learning curve shows how the training and validation errors change with the size of the training set.

```
# suppress warnings from Polyfit
import warnings
warnings.filterwarnings('ignore', message='Polyfit*')

def plot_learning_curve(d, N=200):
    n_sizes = 50
    n_runs = 10
    sizes = np.linspace(2, N, n_sizes).astype(int)
    train_err = np.zeros((n_runs, n_sizes))
    validation_err = np.zeros((n_runs, n_sizes))
    for i in range(n_runs):
        for j, size in enumerate(sizes):
            xtrain, xtest, ytrain, ytest = train_test_split(
                x, y, test_size=test_size, random_state=i)
            # Train on only the first `size` points
            p = np.polyfit(xtrain[:size], ytrain[:size], d)
            # Validation error is on the *entire* validation set
            validation_err[i, j] = compute_error(xtest, ytest, p)
            # Training error is on only the points used for training
            train_err[i, j] = compute_error(xtrain[:size], ytrain[:size], p)

    # Average over the runs and plot both curves
    fig, ax = plt.subplots()
    ax.plot(sizes, validation_err.mean(axis=0), lw=2, label='mean validation error')
    ax.plot(sizes, train_err.mean(axis=0), lw=2, label='mean training error')
    ax.plot([0, N], [error, error], '--k', label='intrinsic error')
    ax.set_xlabel('training set size')
    ax.set_ylabel('rms error')
    ax.legend(loc=0)
    ax.set_xlim(0, N-1)
    ax.set_title('d = %i' % d)
```

Questions:

- As the number of training samples is increased, what do you expect to see for the training error? For the validation error?
- Would you expect the training error to be higher or lower than the validation error? Would you ever expect this to change?

**When the learning curves have converged to a high error, we have a high bias model.**

A high-bias model can be improved by:

- Using a more sophisticated model.
- Gathering more features for each sample.
- Decreasing regularization in a regularized model.

A high-variance model can be improved by:

- Gathering more training samples.
- Using a less sophisticated model.
- Increasing regularization.
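The diagnosis described above can be written down as a rough rule of thumb. The function below is a hypothetical helper, not a scikit-learn API: the name `diagnose_fit`, the `target_err` argument (the error we would be satisfied with), and the 10% tolerance are all assumptions chosen for illustration. It expects the training and validation errors measured at the largest available training set size.

```python
def diagnose_fit(train_err, val_err, target_err, tol=0.1):
    """Heuristic reading of the right-hand end of a learning curve.

    train_err, val_err: errors at the largest training set size.
    target_err: the error level considered acceptable (an assumption).
    tol: relative tolerance used for both comparisons (an assumption).
    """
    gap = val_err - train_err
    converged = gap <= tol * target_err
    if converged and train_err > target_err * (1 + tol):
        # Curves have met, but at an error that is too high
        return "high bias"
    if not converged:
        # Validation error is still well above the training error
        return "high variance"
    return "ok"


print(diagnose_fit(2.0, 2.05, target_err=1.0))   # curves converged at a high error
print(diagnose_fit(0.5, 2.0, target_err=1.0))    # large train/validation gap
print(diagnose_fit(0.98, 1.03, target_err=1.0))  # converged near the target
```

The thresholds are arbitrary; in practice the shape of the whole curve is more informative than a single pair of numbers.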

#### Summary

If our algorithm shows **high bias**, the following actions might help:

- **Add more features.** In our example of predicting home prices, it may be helpful to make use of information such as the neighborhood the house is in, the year the house was built, the size of the lot, etc. Adding these features to the training and test sets can improve a high-bias estimator.
- **Use a more sophisticated model.** Adding complexity to the model can help improve on bias. For a polynomial fit, this can be accomplished by increasing the degree d. Each learning technique has its own methods of adding complexity.
- **Use fewer samples.** Though this will not improve the classification, a high-bias algorithm can attain nearly the same error with a smaller training sample. For algorithms which are computationally expensive, reducing the training sample size can lead to very large improvements in speed.
- **Decrease regularization.** Regularization is a technique used to impose simplicity in some machine learning models, by adding a penalty term that depends on the characteristics of the parameters. If a model has high bias, decreasing the effect of regularization can lead to better results.

If our algorithm shows **high variance**, the following actions might help:

- **Use fewer features.** Using a feature selection technique may be useful, and decrease the over-fitting of the estimator.
- **Use a simpler model.** Model complexity and over-fitting go hand-in-hand.
- **Use more training samples.** Adding training samples can reduce the effect of over-fitting, and lead to improvements in a high-variance estimator.
- **Increase regularization.** Regularization is designed to prevent over-fitting. In a high-variance model, increasing regularization can lead to better results.

### Exercise: Cross Validation and Model Selection

```
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score  # formerly sklearn.cross_validation

data = load_diabetes()
X, y = data.data, data.target

# Compare two regularized models: Ridge regression (L2) and Lasso regression (L1)
for Model in [Ridge, Lasso]:
    model = Model()
    print(Model.__name__, cross_val_score(model, X, y).mean())
```

#### Basic Hyperparameters Optimization

```
# We'll choose 30 log-spaced values of alpha between 0.001 and 0.1:
alphas = np.logspace(-3, -1, 30)
for Model in [Lasso, Ridge]:
    # mean 3-fold cross-validation score for each alpha
    scores = [cross_val_score(Model(alpha=alpha), X, y, cv=3).mean() for alpha in alphas]
    plt.plot(alphas, scores, label=Model.__name__)
plt.legend(loc='lower left')
```
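Instead of reading the best value off the plot, the maximum of each score curve can be located directly. This is a minimal self-contained sketch of that idea; the dictionary `best` is an illustrative name, and the values it finds depend on the scikit-learn version and the cross-validation folds, so no specific numbers are claimed here.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
alphas = np.logspace(-3, -1, 30)

best = {}
for Model in [Lasso, Ridge]:
    # Mean 3-fold cross-validation score (R^2 by default) for each alpha
    scores = np.array([cross_val_score(Model(alpha=alpha), X, y, cv=3).mean()
                       for alpha in alphas])
    # Keep the alpha with the highest mean score
    best[Model.__name__] = (alphas[scores.argmax()], scores.max())
    print("%s: best alpha = %.4f, mean CV score = %.3f"
          % ((Model.__name__,) + best[Model.__name__]))
```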

#### Automatically Performing Grid Search

The last figure suggests that `alpha` ≈ 0.06 for Ridge and `alpha` ≈ 0.01 for Lasso are close to the best parameters. scikit-learn provides tools to perform this search automatically.

```
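from sklearn.model_selection import GridSearchCV  # formerly sklearn.grid_search

for Model in [Ridge, Lasso]:
    # try every alpha with 3-fold cross-validation and keep the best one
    gscv = GridSearchCV(Model(), dict(alpha=alphas), cv=3).fit(X, y)
    print(Model.__name__, gscv.best_params_)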
```
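A fitted `GridSearchCV` object exposes more than `best_params_`. The sketch below, which assumes a reasonably recent scikit-learn (where `cv_results_` is available), shows how to read the best score and the per-candidate scores, and how the search object can be used directly for prediction since it refits the best model on all the data by default.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)
alphas = np.logspace(-3, -1, 30)

gscv = GridSearchCV(Ridge(), {"alpha": alphas}, cv=3).fit(X, y)

# Best hyperparameter and its mean cross-validation score
print("best params:", gscv.best_params_)
print("best mean CV score: %.3f" % gscv.best_score_)

# One mean validation score per candidate alpha, in grid order
mean_scores = gscv.cv_results_["mean_test_score"]
print("number of candidates:", len(mean_scores))

# With the default refit=True, the search object predicts with the best model
predictions = gscv.predict(X[:3])
print("predictions for the first three samples:", predictions)
```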

For some models, scikit-learn includes a cross-validated version of the estimator. The cross-validated versions of Ridge and Lasso are RidgeCV and LassoCV, respectively. The search over `alpha` with these estimators can be performed as follows:

```
from sklearn.linear_model import RidgeCV, LassoCV

for Model in [RidgeCV, LassoCV]:
    # the *CV estimators select alpha internally by cross-validation
    model = Model(alphas=alphas, cv=3).fit(X, y)
    print(Model.__name__, model.alpha_)
```

#### Exercise: Learning curves

```
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import explained_variance_score, mean_squared_error
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation

# define a function that computes the learning curve (i.e. a score as a function
# of training set size, for both training and validation sets) and plots the result
def plot_learning_curve(model, err_func=explained_variance_score, N=300, n_runs=10, n_sizes=50, ylim=None):
    sizes = np.linspace(5, N, n_sizes).astype(int)
    train_err = np.zeros((n_runs, n_sizes))
    validation_err = np.zeros((n_runs, n_sizes))
    for i in range(n_runs):
        for j, size in enumerate(sizes):
            xtrain, xtest, ytrain, ytest = train_test_split(
                X, y, train_size=int(size), random_state=i)
            # Train on `size` points; validate on the remaining ones
            model.fit(xtrain, ytrain)
            validation_err[i, j] = err_func(ytest, model.predict(xtest))
            train_err[i, j] = err_func(ytrain, model.predict(xtrain))

    # Average over the runs and plot both curves
    plt.plot(sizes, validation_err.mean(axis=0), lw=2, label='validation')
    plt.plot(sizes, train_err.mean(axis=0), lw=2, label='training')
    plt.xlabel('training set size')
    plt.ylabel(err_func.__name__.replace('_', ' '))
    plt.grid(True)
    plt.legend(loc=0)
    plt.xlim(0, N-1)
    if ylim:
        plt.ylim(ylim)

plt.figure(figsize=(10, 8))
for i, model in enumerate([Lasso(alpha=0.01), Ridge(alpha=0.06)]):
    # top row: explained variance; bottom row: mean squared error
    plt.subplot(221 + i)
    plot_learning_curve(model, ylim=(0, 1))
    plt.title(model.__class__.__name__)
    plt.subplot(223 + i)
    plot_learning_curve(model, err_func=mean_squared_error, ylim=(0, 8000))
```
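Recent versions of scikit-learn also provide a `learning_curve` helper in `sklearn.model_selection` that computes these curves with cross-validation instead of repeated random splits. The following is a minimal sketch, assuming a Ridge model with `alpha=0.06`, 5-fold cross-validation, and ten training set sizes; these choices are illustrative and are not a replacement for the exercise above.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

X, y = load_diabetes(return_X_y=True)

# Scores are negated MSE, so that "higher is better" holds for every scorer
sizes, train_scores, val_scores = learning_curve(
    Ridge(alpha=0.06), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring="neg_mean_squared_error",
)

# Average over the folds and flip the sign to recover the MSE
train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)

for n, tr, va in zip(sizes, train_mse, val_mse):
    print("n = %3d  training MSE = %7.1f  validation MSE = %7.1f" % (n, tr, va))
```

The returned arrays have one row per training set size and one column per fold, so they can be plotted against `sizes` in the same way as the curves above.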