

Example from Image Processing

Here we take a look at a simple facial recognition example, using the Labeled Faces in the Wild (LFW) dataset.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

lfw_people = datasets.fetch_lfw_people(min_faces_per_person=70, resize=0.4, data_home='datasets')
print(lfw_people.data.shape)
# Visualize these faces
fig = plt.figure(figsize=(8, 6))
# plot several images
for i in range(15):
    ax = fig.add_subplot(3, 5, i + 1, xticks=[], yticks=[])
    ax.imshow(lfw_people.images[i], cmap=plt.cm.bone)

plt.figure(figsize=(10, 2))

unique_targets = np.unique(lfw_people.target)
counts = [(lfw_people.target == i).sum() for i in unique_targets]

plt.xticks(unique_targets, lfw_people.target_names[unique_targets])
locs, labels = plt.xticks()
plt.setp(labels, rotation=45, size=14)
_ = plt.bar(unique_targets, counts)

# cross_validation was removed from scikit-learn; use model_selection instead
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(lfw_people.data, lfw_people.target, random_state=0)

print(X_train.shape, X_test.shape)


# Preprocessing: Principal Component Analysis

from sklearn import decomposition
# RandomizedPCA was removed from recent scikit-learn; use PCA with the
# randomized SVD solver instead
pca = decomposition.PCA(n_components=150, whiten=True, svd_solver='randomized')
pca.fit(X_train)

# Visualize the mean face
plt.imshow(pca.mean_.reshape((50, 37)), cmap=plt.cm.bone)
# Visualize the principal components (the "eigenfaces"):
print(pca.components_.shape)

fig = plt.figure(figsize=(16, 6))
for i in range(30):
    ax = fig.add_subplot(3, 10, i + 1, xticks=[], yticks=[])
    ax.imshow(pca.components_[i].reshape((50, 37)), cmap=plt.cm.bone)

# pca
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
print(X_train_pca.shape)
print(X_test_pca.shape)
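
A quick sanity check (not in the original post) is to ask how much of the total variance those 150 components retain; the fitted pca object exposes this directly:

# Fraction of the training-set variance captured by the 150 components
print(pca.explained_variance_ratio_.sum())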

# Doing the learning
from sklearn import svm
clf = svm.SVC(C=5., gamma=0.001)
clf.fit(X_train_pca, y_train)
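
The values C=5 and gamma=0.001 appear hand-picked; a small grid search over the PCA-projected training set is the usual way to choose them. A minimal sketch (the grid values here are illustrative, not tuned):

from sklearn.model_selection import GridSearchCV

# Illustrative grid; widen it for a real search
param_grid = {'C': [1., 5., 10., 50.], 'gamma': [0.0001, 0.0005, 0.001, 0.005]}
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(X_train_pca, y_train)
print(search.best_params_)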

fig = plt.figure(figsize=(8, 6))
for i in range(15):
    ax = fig.add_subplot(3, 5, i + 1, xticks=[], yticks=[])
    ax.imshow(X_test[i].reshape((50, 37)), cmap=plt.cm.bone)
    y_pred = clf.predict(X_test_pca[i:i + 1])[0]  # predict expects a 2D array
    color = 'black' if y_pred == y_test[i] else 'red'
    ax.set_title(lfw_people.target_names[y_pred], fontsize='small', color=color)

from sklearn import metrics
y_pred = clf.predict(X_test_pca)
print(metrics.classification_report(y_test, y_pred, target_names=lfw_people.target_names))

print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.f1_score(y_test, y_pred, average='weighted'))

**Note: I ran into a bug in fetch_lfw_people().**
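
The post does not say what the bug was; if it is the data download failing (a commonly reported issue with this loader), one workaround is to place the LFW archives under data_home by hand and disable downloading. A sketch, assuming the files are already in place:

# Assumes the LFW archives were downloaded by hand into datasets/lfw_home/
lfw_people = datasets.fetch_lfw_people(min_faces_per_person=70, resize=0.4,
                                       data_home='datasets',
                                       download_if_missing=False)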

Text Feature Extraction for Classification and Clustering

Outline of this section:

  • Turn a corpus of text documents into feature vectors using a Bag of Words representation
  • Train a simple text classifier on the feature vectors
  • Wrap the vectorizer and classifier with a pipeline
  • Cross-validation and model selection on the pipeline
  • Qualitative model inspection (the last two are sketched at the end of this section)

Text Classification in 20 lines of Python

  • the 20 newsgroups dataset: around 18,000 text posts from 20 newsgroup forums
  • Bag of Words feature extraction with TF-IDF weighting
  • Naive Bayes classifier or linear support vector machine for the classifier itself

    from sklearn.datasets import load_files
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Load the text data
    categories = [
        'alt.atheism',
        'talk.religion.misc',
        'comp.graphics',
        'sci.space',
    ]
    # Recent scikit-learn uses `encoding` instead of the old `charset` argument
    twenty_train_subset = load_files('datasets/20news-bydate-train/',
                                     categories=categories, encoding='latin-1')
    twenty_test_subset = load_files('datasets/20news-bydate-test/',
                                    categories=categories, encoding='latin-1')

    # Turn the text documents into vectors of word frequencies

    vectorizer = TfidfVectorizer(min_df=2)
    X_train = vectorizer.fit_transform(twenty_train_subset.data)
    y_train = twenty_train_subset.target

    # Fit a classifier on the training set

    classifier = MultinomialNB().fit(X_train, y_train)
    print("Training score: {0:.1f}%".format(
        classifier.score(X_train, y_train) * 100))

    # Evaluate the classifier on the testing set

    X_test = vectorizer.transform(twenty_test_subset.data)
    y_test = twenty_test_subset.target
    print("Testing score: {0:.1f}%".format(
        classifier.score(X_test, y_test) * 100))

The workflow diagram below summarizes what happened:

[Figure: text classification workflow]
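
Before moving on, it can help to peek at what the vectorizer actually produced; a small sketch using the vectorizer fitted above (get_feature_names_out requires scikit-learn >= 1.0):

# Size of the learned vocabulary and a few example terms
feature_names = vectorizer.get_feature_names_out()
print(len(feature_names))
print(feature_names[:10])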

In order to evaluate the impact of the feature extraction parameters, one can chain a configured feature extractor and a linear classifier (as an alternative to the naive Bayes model) in a pipeline:

from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vec', TfidfVectorizer(min_df=1, max_df=0.8, use_idf=True)),
    ('clf', PassiveAggressiveClassifier(C=1)),
])

Such a pipeline can then be used to evaluate the performance on the test set:

pipeline.fit(twenty_train_subset.data, twenty_train_subset.target)

print("Train score:")
print(pipeline.score(twenty_train_subset.data, twenty_train_subset.target))
print("Test score:")
print(pipeline.score(twenty_test_subset.data, twenty_test_subset.target))
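
The outline above also promised cross-validation / model selection on the pipeline and qualitative model inspection, which the post never reaches. A minimal sketch of both, using the pipeline just fitted (the grid values are illustrative assumptions, not tuned choices):

from sklearn.model_selection import GridSearchCV

# Parameters of pipeline steps are addressed as <step_name>__<param_name>
parameters = {
    'vec__min_df': [1, 2],
    'vec__max_df': [0.8, 1.0],
    'clf__C': [0.1, 1.0, 10.0],
}
grid = GridSearchCV(pipeline, parameters, cv=3)
grid.fit(twenty_train_subset.data, twenty_train_subset.target)
print(grid.best_params_)
print(grid.best_score_)

For qualitative inspection, the weights of the fitted linear classifier point at the terms each class relies on:

import numpy as np

# Top positively weighted terms per class
vec = pipeline.named_steps['vec']
clf = pipeline.named_steps['clf']
feature_names = vec.get_feature_names_out()
for i, category in enumerate(twenty_train_subset.target_names):
    top10 = np.argsort(clf.coef_[i])[-10:]
    print(category, ':', ' '.join(feature_names[top10]))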