
Basic principles of machine learning

By the end of this section you will
– Know the basic categories of supervised learning, including classification and regression problems.
– Know the basic categories of unsupervised learning, including dimensionality reduction and clustering.
– Know the basic syntax of the Scikit-learn estimator interface.
– Know how features are extracted from real-world data.

Problem setting

A simple definition of machine learning


Machine Learning (ML) is about building programs with tunable parameters that are adjusted automatically so as to improve their behavior by adapting to previously seen data.

Machine Learning can be broken into two broad regimes: supervised learning and unsupervised learning.


Introducing the scikit-learn estimator object


Every algorithm in scikit-learn is exposed via an estimator object. For instance, a linear regression model is instantiated as:

from sklearn.linear_model import LinearRegression
model = LinearRegression(normalize=True)
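
Hyperparameters are passed to the estimator's constructor and can be inspected afterwards with get_params(), which every scikit-learn estimator provides. A minimal sketch (using fit_intercept here, since the normalize argument above has been removed in recent scikit-learn releases):

from sklearn.linear_model import LinearRegression

# Constructor arguments are the model's hyperparameters;
# get_params() returns them as a dictionary.
model = LinearRegression(fit_intercept=True)
print(model.get_params())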

Supervised Learning: Classification and Regression

Data consists of both features and labels.
Some more complicated examples:
– Given a multicolor image of an object through a telescope, determine whether that object is a star, a quasar, or a galaxy.
– Given a photograph of a person, identify the person in the photo.
– Given a list of movies a person has watched and their personal ratings of those movies, recommend a list of movies they would like.

Supervised learning is further broken down into two categories: classification (the label is discrete) and regression (the label is continuous).

Classification: K-nearest neighbors (kNN)

from sklearn import neighbors, datasets
iris = datasets.load_iris()
X, y = iris.data, iris.target
knn = neighbors.KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)
# Predict the species of a flower with a 3cm x 5cm sepal and a 4cm x 2cm petal
print iris.target_names[knn.predict([[3, 5, 4, 2]])]

Visualize the sepal space:

from helpers import plot_iris_knn
plot_iris_knn()

The script for plot_iris_knn() can be found in helpers.
Figure: iris kNN result
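
The helper is not reproduced in this post; assuming it mirrors the SVC plotting function shown further below, a minimal sketch of such a helper might look like:

import numpy as np
import pylab as pl
from sklearn import neighbors, datasets

def plot_iris_knn():
    iris = datasets.load_iris()
    X = iris.data[:, :2]  # only the two sepal features, so the plane can be plotted
    y = iris.target

    knn = neighbors.KNeighborsClassifier(n_neighbors=1)
    knn.fit(X, y)

    # Evaluate the classifier on a grid covering the sepal plane
    x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
    y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    # Color the decision regions and overlay the training points
    pl.figure()
    pl.pcolormesh(xx, yy, Z)
    pl.scatter(X[:, 0], X[:, 1], c=y)
    pl.xlabel('sepal length (cm)')
    pl.ylabel('sepal width (cm)')
    pl.axis('tight')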

Exercise: now apply a different estimator to the same problem: sklearn.svm.SVC

from sklearn import datasets
from sklearn.svm import SVC
iris = datasets.load_iris()
X, y = iris.data, iris.target
svc = SVC()
svc.fit(X, y)
print iris.target_names[svc.predict([[3, 5, 4, 2]])]

Visualize the result of the SVC:

import numpy as np
import pylab as pl
from sklearn import datasets
from sklearn.svm import SVC

# plot_iris_SVC
def plot_iris_SVC():
    iris = datasets.load_iris()
    X = iris.data[:, :2]  # we only take the first two features in order to plot.
    y = iris.target

    svc = SVC()
    svc.fit(X, y)

    x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
    y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))
    Z = svc.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    pl.figure()
    # Color every point of the plotting range by the SVC's prediction
    pl.pcolormesh(xx, yy, Z)

    # Plot also the training points
    pl.scatter(X[:, 0], X[:, 1], c=y)
    pl.xlabel('sepal length (cm)')
    pl.ylabel('sepal width (cm)')
    pl.axis('tight')

plot_iris_SVC()

Figure: iris SVC result

Regression: the simplest regression model is linear regression, as shown below:

# Create some simple data
import numpy as np
np.random.seed(0)
X = np.random.random(size=(20, 1))
y = 3 * X.squeeze() + 2 + np.random.normal(size=20)

# Fit a linear regression to it
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)
model.fit(X, y)
print "Model coefficient: %.5f, and intercept: %.5f" % (model.coef_, model.intercept_)

# Plot the data and the model prediction
X_test = np.linspace(0, 1, 100)[:, np.newaxis]
y_test = model.predict(X_test)
import pylab as pl
pl.plot(X.squeeze(), y, 'o')
pl.plot(X_test.squeeze(), y_test)

Figure: linear regression result

Unsupervised Learning

Data has no labels. Unsupervised learning comprises tasks such as dimensionality reduction, clustering, and density estimation. Some examples:

  • given detailed observations of distant galaxies, determine which features or combinations of features best summarize the information
  • given a mixture of two sound sources (for example, a person talking over some music), separate the two (the so-called blind source separation problem)
  • given a video, isolate a moving object and categorize it in relation to other moving objects which have been seen

Sometimes the two may even be combined: e.g. unsupervised learning can be used to find useful features in heterogeneous data, and these features can then be used within a supervised framework, as in the sketch below.
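
As a minimal sketch of that combination (the specific choice of PCA feeding a kNN classifier is only for illustration, not something from the original post), scikit-learn's Pipeline can chain an unsupervised transformer with a supervised estimator:

from sklearn import datasets, neighbors
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

iris = datasets.load_iris()
X, y = iris.data, iris.target

# The unsupervised step (PCA) produces the features used by the supervised step (kNN)
pipe = Pipeline([('pca', PCA(n_components=2)),
                 ('knn', neighbors.KNeighborsClassifier(n_neighbors=1))])
pipe.fit(X, y)
print(pipe.predict([[3, 5, 4, 2]]))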

Dimensionality reduction: visualizing the iris data with PCA

X, y = iris.data, iris.target
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(X)
X_reduced = pca.transform(X)
print "Reduced dataset shape:", X_reduced.shape

import pylab as pl
pl.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y)

print "Meaning of the 2 components:"
for component in pca.components_:
    print " + ".join("%.3f x %s" % (value, name)
                     for value, name in zip(component, iris.feature_names))

Reduced dataset shape: (150L, 2L)
Meaning of the 2 components:
0.362 x sepal length (cm) + -0.082 x sepal width (cm) + 0.857 x petal length (cm) + 0.359 x petal width (cm)
-0.657 x sepal length (cm) + -0.730 x sepal width (cm) + 0.176 x petal length (cm) + 0.075 x petal width (cm)


Each of the two PCA components is a linear combination of the four original iris features, with the weights shown in the output above.
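
It is also worth checking how much of the total variance the two retained components capture; explained_variance_ratio_ is a standard attribute of a fitted PCA (this check is an addition, not part of the original output):

# Fraction of the total variance explained by each of the two components,
# using the pca object fitted above
print(pca.explained_variance_ratio_)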

Figure: iris data in the space of the first two PCA components

Clustering: use the PCA-reduced data for K-means clustering:

from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0) # Fixing the RNG in kmeans
k_means.fit(X_reduced)
y_pred = k_means.predict(X_reduced)

pl.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y_pred)

Figure: K-means clustering of the iris data in the two-component space
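
Because the true species labels happen to be available for the iris data, one optional sanity check (not in the original post) is to compare the discovered clusters against them with the adjusted Rand index:

from sklearn.metrics import adjusted_rand_score

# y and y_pred come from the blocks above;
# 1.0 means the clustering matches the labels perfectly, ~0.0 is chance level
print(adjusted_rand_score(y, y_pred))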

A recap on Scikit-learn’s estimator interface

Scikit-learn estimators expose a uniform interface across all methods; the main methods are listed below and illustrated in the short sketch after the list.

  • Available in all estimators
    • model.fit(): fit training data. For supervised methods, this accepts two arguments: the data X and the labels y. For unsupervised methods, this accepts only a single argument: the data X.
  • Available in supervised estimators
    • model.predict(): accepts one argument, the new data, and predicts a label for each sample.
    • model.predict_proba(): some models provide this method, which returns the probability that a new observation has each categorical label.
    • model.score(): for classification or regression, most models provide a score method. Scores are between 0 and 1, with a larger score indicating a better fit.
  • Available in unsupervised estimators
    • model.transform(): given an unsupervised model, accepts one argument X_new and returns the new representation of the data based on the unsupervised model.
    • model.fit_transform(): some models implement this method, which more efficiently performs a fit and a transform on the same input data.
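
A minimal sketch of this interface in action (LogisticRegression is used here only as an arbitrary choice of supervised estimator, not something prescribed by the post):

from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA

iris = datasets.load_iris()
X, y = iris.data, iris.target

clf = LogisticRegression(max_iter=1000)  # a supervised estimator
clf.fit(X, y)                            # fit(X, y)
print(clf.predict(X[:3]))                # predict(): labels for new data
print(clf.predict_proba(X[:3]))          # predict_proba(): per-class probabilities
print(clf.score(X, y))                   # score(): mean accuracy on (X, y)

pca = PCA(n_components=2)                # an unsupervised estimator
X_2d = pca.fit_transform(X)              # fit_transform(): fit(X) followed by transform(X)
print(X_2d.shape)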

Regularization: what it is and why it is necessary

An example of regularization

The core idea behind regularization is that we are going to prefer models that are simpler, for a certain definition of "simpler", even if they lead to more errors on the training set (Occam's razor?).

rng = np.random.RandomState(0)
x = 2 * rng.rand(100) - 1

f = lambda t: 1.2 * t ** 2 + .1 * t ** 3 - .4 * t ** 5 - .5 * t ** 9
y = f(x) + .4 * rng.normal(size=100)

pl.figure()
pl.scatter(x, y, s=4)

Figure: generated noisy 9th order polynomial data

x_test = np.linspace(-1, 1, 100)

pl.figure()
pl.scatter(x, y, s=4)

X = np.array([x**i for i in range(5)]).T
X_test = np.array([x_test**i for i in range(5)]).T
order4 = LinearRegression()
order4.fit(X, y)
pl.plot(x_test, order4.predict(X_test), label='4th order')

X = np.array([x**i for i in range(10)]).T
X_test = np.array([x_test**i for i in range(10)]).T
order9 = LinearRegression()
order9.fit(X, y)
pl.plot(x_test, order9.predict(X_test), label='9th order')

pl.legend(loc='best')
pl.axis('tight')
pl.title('Fitting a 4th and a 9th order polynomial')

Figure: 4th and 9th order polynomial fits

pl.figure()
pl.scatter(x, y, s=4)
pl.plot(x_test, f(x_test), label="truth")
pl.axis('tight')
pl.title('Ground truth (9th order polynomial)')

Figure: ground truth (9th order polynomial)
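
The plots above illustrate overfitting rather than a regularized fit itself. As a minimal sketch of actual regularization (the choice of Ridge and of alpha=0.1 is illustrative, not from the original post), the same 9th order features can be fit with a penalty on large coefficients:

import numpy as np
import pylab as pl
from sklearn.linear_model import Ridge

# Regenerate the same noisy polynomial data as above
rng = np.random.RandomState(0)
x = 2 * rng.rand(100) - 1
f = lambda t: 1.2 * t ** 2 + .1 * t ** 3 - .4 * t ** 5 - .5 * t ** 9
y = f(x) + .4 * rng.normal(size=100)
x_test = np.linspace(-1, 1, 100)

X = np.array([x**i for i in range(10)]).T
X_test = np.array([x_test**i for i in range(10)]).T

ridge = Ridge(alpha=0.1)  # alpha sets how strongly large coefficients are penalized
ridge.fit(X, y)

pl.figure()
pl.scatter(x, y, s=4)
pl.plot(x_test, ridge.predict(X_test), label='9th order (ridge)')
pl.legend(loc='best')
pl.title('Regularized 9th order fit (illustration)')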

Figure: scikit-learn algorithm cheat-sheet (ML map)
