
04_Supervised Learning: Classification of Handwritten Digits

Supervised Learning: Classification of Handwritten Digits

from sklearn.datasets import load_digits
digits = load_digits()

%pylab inline

# plot the first 64 digits in an 8x8 grid, labeled with their target values
fig = plt.figure(figsize=(6, 6))
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
    ax.text(0, 7, str(digits.target[i]))

digits data

Visualizing the Data

A good first step for many problems is to visualize the data using a dimensionality-reduction technique such as PCA.

from sklearn.decomposition import RandomizedPCA
pca = RandomizedPCA(n_components=2)
proj = pca.fit_transform(digits.data)
plt.scatter(proj[:, 0], proj[:, 1], c=digits.target)
plt.colorbar()

visualize data with pca
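Note that RandomizedPCA has since been removed from scikit-learn; in newer versions the same randomized solver is reached through the PCA class, roughly like this:

from sklearn.decomposition import PCA

# randomized solver in newer scikit-learn versions (equivalent projection)
pca = PCA(n_components=2, svd_solver='randomized')
proj = pca.fit_transform(digits.data)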

A weakness of PCA is that it is a linear dimensionality reduction, which can miss some interesting relationships in the data. If we want to see a nonlinear mapping of the data, we can use one of the several methods in the manifold module (for example Isomap, a nonlinear dimensionality-reduction method).

from sklearn.manifold import Isomap
iso = Isomap(n_neighbors=5, n_components=2)
proj = iso.fit_transform(digits.data)
plt.scatter(proj[:, 0], proj[:, 1], c=digits.target)
plt.colorbar()

visualize data with Isomap

In any case, these visualizations show us that there is hope: even a simple classifier should be able to adequately identify the members of the various classes.

Gaussian Naive Bayes Classification

It’s nice to have a simple and fast algorithm as a baseline classifier. If this algorithm is sufficient, then we don’t have to waste CPU time building more complex models.

Gaussian Naive Bayes is a good method to keep in mind. It is a generative classifier which fits an axis-aligned multi-dimensional Gaussian distribution to each training label, and uses this to quickly give a rough classification. It is generally not sufficiently accurate for real-world data, but it can perform surprisingly well.

from sklearn.naive_bayes import GaussianNB
from sklearn.cross_validation import train_test_split
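# note: in newer scikit-learn versions, train_test_split has moved to sklearn.model_selection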
# split the data into training and validation sets
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target)

# train the model
clf = GaussianNB()
clf.fit(X_train, y_train)

# use the model to predict the labels of the test data
predicted = clf.predict(X_test)
expected = y_test
fig = plt.figure(figsize=(6, 6))  # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

# plot the digits: each image is 8x8 pixels
for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(X_test.reshape(-1, 8, 8)[i], cmap=plt.cm.binary, interpolation='nearest')

    # label the image with the target value
    if predicted[i] == expected[i]:
        ax.text(0, 7, str(predicted[i]), color='green')
    else:
        ax.text(0, 7, str(predicted[i]), color='red')

digits with generative naive bayes

Quantitative Measurement of Performance

matches = (predicted == expected)
print matches.sum()
print len(matches)
print matches.sum() / float(len(matches))  # the accuracy of the GaussianNB classifier
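Equivalently, scikit-learn can compute the accuracy for us, either with the classifier's score method or with metrics.accuracy_score (a quick check, reusing clf, X_test, y_test, expected and predicted from above):

from sklearn import metrics
print clf.score(X_test, y_test)                    # accuracy computed by the estimator itself
print metrics.accuracy_score(expected, predicted)  # same value, computed from the predictions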

There are some more sophisticated metrics (in sklearn.metrics) that can be used to judge the performance of a classifier.


from sklearn import metrics
print metrics.classification_report(expected, predicted)

digits metrics performance reports

Another enlightening metric for this kind of multi-class classification is the confusion matrix.

print metrics.confusion_matrix(expected, predicted)

digit performance with confusion matrix
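The confusion matrix can also be drawn as an image, which makes the off-diagonal mistakes easier to spot (a small sketch, reusing expected and predicted from above):

cm = metrics.confusion_matrix(expected, predicted)
plt.imshow(cm, cmap=plt.cm.Blues, interpolation='nearest')
plt.colorbar()
plt.xlabel('predicted label')
plt.ylabel('true label')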

Supervised Learning: Regression of Housing Data

Using the Boston house prices dataset:

Feature selection: print the score of each feature.

# feature selection
from sklearn.datasets import load_boston
from sklearn.feature_selection import SelectKBest, f_regression
from operator import itemgetter

data = load_boston()

# score every feature against the regression target
model = SelectKBest(f_regression, k='all')
data_new = model.fit_transform(data.data, data.target)
feature_score = zip(data.feature_names, model.scores_)
print feature_score
feature_score.sort(key=itemgetter(1))
print feature_score

This is a manual version of a technique called feature selection.
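To actually reduce the data instead of just ranking the features, the same estimator can be asked for the top k columns, for example (a sketch, with k=5 as an arbitrary choice):

from sklearn.feature_selection import SelectKBest, f_regression

selector = SelectKBest(f_regression, k=5)
X_selected = selector.fit_transform(data.data, data.target)
print X_selected.shape                            # only the 5 highest-scoring features remain
print data.feature_names[selector.get_support()]  # names of the selected features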

Predicting Home Price: a simple linear regression

# use train_test_split from sklearn.cross_validation to generate the training and testing data
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data.data, data.target)

# using the linear regression model.
from sklearn.linear_model import LinearRegression
clf = LinearRegression()
clf.fit(X_train, y_train)

# predict the data
predicted = clf.predict(X_test)
expected = y_test
# visualize the result
plt.scatter(expected, predicted)
plt.plot([0, 50], [0, 50], '--k')
plt.axis('tight')
plt.xlabel('True price ($1000s)')
plt.ylabel('Predicted price ($1000s)')
print "RMS:", np.sqrt(np.mean((predicted - expected) ** 2))

Boston house result
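The regressor's score method reports the coefficient of determination R^2 on the held-out data, which is another quick sanity check (reusing clf, X_test and y_test from above):

print "R^2 on test data:", clf.score(X_test, y_test)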

Exercise: Using Gradient Boosting Tree Regression

from sklearn.ensemble import GradientBoostingRegressor
clf = GradientBoostingRegressor()
clf.fit(X_train, y_train)

predicted = clf.predict(X_test)
expected = y_test

plt.scatter(expected, predicted)
plt.plot([0, 50], [0, 50], '--k')
plt.axis('tight')
plt.xlabel('True price ($1000s)')
plt.ylabel('Predicted price ($1000s)')
print "RMS:", np.sqrt(np.mean((predicted - expected) ** 2))

Boston house result with GBTR

Measuring prediction performance

Using the K-neighbors classifier

# using confusion matrix
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.data
y = digits.target

# Instantiate and train the classifier
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X, y)

# Check the results using metrics
from sklearn import metrics
y_pred = clf.predict(X)

print metrics.confusion_matrix(y_pred, y)

digits knn confusion matrix

Using DecisionTreeRegressor

from sklearn.datasets import load_boston
from sklearn.tree import DecisionTreeRegressor

data = load_boston()
clf = DecisionTreeRegressor().fit(data.data, data.target)
predicted = clf.predict(data.data)
expected = data.target

plt.scatter(expected, predicted)
plt.plot([0, 50], [0, 50], '--k')
plt.axis('tight')
plt.xlabel('True price ($1000s)')
plt.ylabel('Predicted price ($1000s)')

Boston with Decision Tree Regressor

Here again the predictions look almost perfect, but this is misleading: in both of these examples we evaluated the model on the same data it was trained on, so a model that simply memorizes the training set gets a perfect score.

A Better Approach: Using a validation set

from sklearn import cross_validation
X = digits.data
y = digits.target

# Using cross_validation.train_test_split to generate the validation set.
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.25, random_state=0)

print X.shape, X_train.shape, X_test.shape

clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
y_pred = clf.predict(X_test)


print metrics.f1_score(y_train, clf.predict(X_train))
print metrics.f1_score(y_test, y_pred)
print metrics.confusion_matrix(y_test, y_pred)
print metrics.classification_report(y_test, y_pred)

knn validation classification report
knn validation confusion matrix

Validation with a regression model

data = load_boston()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.25, random_state=0)

print X.shape, X_train.shape, X_test.shape

est = DecisionTreeRegressor().fit(X_train, y_train)

print "validation:", metrics.explained_variance_score(y_test, est.predict(X_test))
print "training:", metrics.explained_variance_score(y_train, est.predict(X_train))
# result: validation: 0.668390594439
# training: 1.0

This large spread between validation and training error is characteristic of a high variance model. Decision trees are not entirely useless, however: by combining many individual decision trees within ensemble estimators such as Gradient Boosted Trees or Random Forests, we can get much better performance:

# using the GradientBoostingRegressor
from sklearn.ensemble import GradientBoostingRegressor
est = GradientBoostingRegressor().fit(X_train, y_train)

print "validation:", metrics.explained_variance_score(y_test, est.predict(X_test))
print "training:", metrics.explained_variance_score(y_train, est.predict(X_train))
# validation: 0.82217747369
# training: 0.983165912763
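The paragraph above also mentions Random Forests; the same comparison can be made with them (a minimal sketch using sklearn.ensemble.RandomForestRegressor, not part of the original run, so no numbers are quoted):

from sklearn.ensemble import RandomForestRegressor
est = RandomForestRegressor().fit(X_train, y_train)

print "validation:", metrics.explained_variance_score(y_test, est.predict(X_test))
print "training:", metrics.explained_variance_score(y_train, est.predict(X_train))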

Model Selection and Validation

# suppress warnings from older versions of KNeighbors
import warnings
warnings.filterwarnings('ignore', message='kneighbors*')

from sklearn.svm import LinearSVC

X = digits.data
y = digits.target
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.25, random_state=0)

# compare three classifiers with their default settings
for Model in [LinearSVC, GaussianNB, KNeighborsClassifier]:
    clf = Model().fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print Model.__name__, metrics.f1_score(y_test, y_pred)

print '------------------'

# test SVC loss
for loss in ['l1', 'l2']:
    clf = LinearSVC(loss=loss).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print "LinearSVC(loss='{0}')".format(loss), metrics.f1_score(y_test, y_pred)

print '-------------------'

# test K-neighbors
for n_neighbors in range(1, 11):
    clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print "KNeighbors(n_neighbors={0})".format(n_neighbors), metrics.f1_score(y_test, y_pred)