02_Representation and Visualization of Data

By the end of this section, you should:


  • Know the internal data representation of sklearn
  • Know how to use sklearn dataset loaders to load example data
  • Know how to turn image and text data into data matrices for learning
  • Know how to use matplotlib to help visualize different types of data

Data in scikit-learn


Data in scikit-learn is assumed to be stored as a 2D array of shape [n_samples, n_features].

  • n_samples: the number of samples
  • n_features: the number of features
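
As a minimal sketch of this layout, the snippet below builds a small NumPy array by hand. The measurement values are made up for illustration; only the [n_samples, n_features] convention comes from the text above.

```python
import numpy as np

# Hypothetical measurements: 3 samples (rows), each with 2 features (columns).
X = np.array([
    [5.1, 3.5],
    [4.9, 3.0],
    [6.2, 2.9],
])

n_samples, n_features = X.shape
print(n_samples, n_features)

# A single sample is one row; a single feature is one column.
first_sample = X[0]
first_feature = X[:, 0]
print(first_sample)
print(first_feature)
```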

A simple example: the Iris Dataset


The data consists of measurements of 3 different species of irises.


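Quick Question
If we want to recognize iris species from these measurements, what would the data look like? In the [n_samples, n_features] layout, what would n_samples and n_features refer to?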

Loading the Iris Data with Scikit-learn

  • Features in the Iris dataset:
    1. Sepal length in cm
    2. Sepal width in cm
    3. Petal length in cm
    4. Petal width in cm
  • Target classes to predict (3 classes):
    1. Iris Setosa
    2. Iris Versicolour
    3. Iris Virginica

Quick Exercise
Change x_index and y_index to find the two features that are most useful for separating the classes.
[Figures: Iris 01 to Iris 06]
Which pair is best for classification?
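
The exercise refers to x_index and y_index, but the plotting code that defines them is not shown here. The following is a sketch of what such code might look like, assuming the two variables are column indices into iris.data; the figure size and colorbar are illustrative choices, not taken from the original.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)      # (n_samples, n_features)
print(iris.feature_names)
print(iris.target_names)

# Column indices (0 to 3) of the two features to plot against each other.
x_index = 0
y_index = 1

fig, ax = plt.subplots(figsize=(5, 4))
scatter = ax.scatter(iris.data[:, x_index], iris.data[:, y_index], c=iris.target)
ax.set_xlabel(iris.feature_names[x_index])
ax.set_ylabel(iris.feature_names[y_index])
# One color per target class (0, 1, 2).
fig.colorbar(scatter, ax=ax, ticks=[0, 1, 2], label="target class")
```

In a plain script (outside a notebook), add plt.show() at the end to display the figure. Trying each pair of indices should reproduce the six combinations the exercise asks about.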

Other Available Data


Scikit-learn makes a host of datasets available for testing your algorithms:


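  • Packaged Data: small datasets bundled with scikit-learn, loaded with sklearn.datasets.load_*
  • Downloadable Data: larger datasets downloaded on demand with sklearn.datasets.fetch_*
  • Generated Data: synthetic datasets generated with sklearn.datasets.make_*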
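
One way to see what each family contains is to list the matching names in sklearn.datasets. This is only a sketch: names_with_prefix is a hypothetical helper written for this example, and the exact lists depend on the installed scikit-learn version.

```python
from sklearn import datasets


def names_with_prefix(prefix):
    """Return the names in sklearn.datasets that start with the given prefix."""
    return sorted(name for name in dir(datasets) if name.startswith(prefix))


loaders = names_with_prefix("load_")     # packaged data
fetchers = names_with_prefix("fetch_")   # downloadable data
generators = names_with_prefix("make_")  # generated data

print(loaders)
print(fetchers)
print(generators)
```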

Loading Digits Data

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()
n_samples, n_features = digits.data.shape
print(n_samples, n_features)

# Visualize the first 64 digits in an 8x8 grid
fig = plt.figure(figsize=(6, 6))
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
    # Label each image with its target value
    ax.text(0, 7, str(digits.target[i]))
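
The digits dataset also illustrates how image data becomes a data matrix: each 8x8 image is flattened into a row of 64 features. The sketch below checks that relationship between digits.images and digits.data; the variable names flattened and same are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
print(digits.images.shape)  # one 8x8 image per sample
print(digits.data.shape)    # one row of 64 pixel values per sample

# Flatten each 8x8 image into a 64-element row and compare with digits.data.
flattened = digits.images.reshape(len(digits.images), -1)
same = np.array_equal(flattened, digits.data)
print(same)
```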

Generated Data: the S-Curve

import matplotlib.pyplot as plt
from sklearn.datasets import make_s_curve

data, colors = make_s_curve(n_samples=1000)
print(data.shape)
print(colors.shape)

# Visualize the data in 3D
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib
ax = plt.axes(projection='3d')
ax.scatter(data[:, 0], data[:, 1], data[:, 2], c=colors)
ax.view_init(10, -60)

[Figure: S-Curve]

Exercise: working with the faces dataset

import matplotlib.pyplot as plt
from sklearn.datasets import fetch_olivetti_faces

# Downloads the dataset on first use
face_data = fetch_olivetti_faces()

# Visualize the first 64 faces in an 8x8 grid
fig = plt.figure(figsize=(6, 6))
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(face_data.images[i], cmap=plt.cm.bone, interpolation='nearest')
    # Label each face with its target (the person's ID)
    ax.text(0, 7, str(face_data.target[i]))

Face result: each number identifies which person the image shows.

Feature Extraction

By the end of this section you will:

  • Know how features are extracted from real-world data
  • See an example of extracting numerical features from textual data

What are features?

Numerical Features

In the last section we used the iris dataset, which has 150 samples and 4 numerical features (sepal length, sepal width, petal length, petal width).

Categorical Features

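For example, suppose we also have data on the color of each iris, with color in [red, blue, purple]. A common approach is to assign a number to each category, e.g. red=1, blue=2, purple=3. In general this is a bad idea: estimators tend to assume that numerical features lie on a continuous scale, so under this encoding 1 and 2 are treated as more alike than 1 and 3, which is usually not true of categories.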

A better strategy is to give each category its own dimension:

  • color = purple (1.0 or 0.0)
  • color = blue (1.0 or 0.0)
  • color = red (1.0 or 0.0)
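
To make the idea concrete, here is a hand-rolled sketch of this encoding with NumPy. The color observations are made up for illustration, and the column order (purple, blue, red) simply follows the list above.

```python
import numpy as np

# Hypothetical color observations for five irises.
colors = ["red", "blue", "purple", "blue", "red"]

# One dimension per category, in a fixed order.
categories = ["purple", "blue", "red"]

# Each row has a 1.0 in the column of its category and 0.0 elsewhere.
one_hot = np.array(
    [[1.0 if color == category else 0.0 for category in categories] for color in colors]
)
print(one_hot)
```

With this encoding no category is numerically closer to one than to another, which avoids the ordering problem of the red=1, blue=2, purple=3 scheme.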

Using DictVectorizer (the second strategy above) to encode categorical features

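from sklearn.feature_extraction import DictVectorizer

measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Francisco', 'temperature': 18.},
]

vec = DictVectorizer()
# One column per city plus one numerical temperature column
print(vec.fit_transform(measurements).toarray())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vec.get_feature_names_out())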
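
The same one-dimension-per-category idea carries over to text, where each distinct word can get its own column. The sketch below uses scikit-learn's CountVectorizer for this; the two-sentence corpus is made up for illustration, and this is just one simple way to turn text into a numerical matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus.
documents = [
    "the iris is purple",
    "the iris is blue",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(documents)  # sparse matrix of word counts

# vocabulary_ maps each word to its column index.
print(sorted(vectorizer.vocabulary_.items(), key=lambda item: item[1]))
# Rows are documents, columns are words.
print(counts.toarray())
```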

Derived Features

Derived features may be based on dimensionality reduction (such as PCA or manifold learning), may be linear or nonlinear combinations of features (as in polynomial regression), or may be some transform of the features (as in image processing).
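
As a rough sketch of the first two kinds of derived features, the snippet below applies PCA and PolynomialFeatures to the iris data. The choice of 2 components and degree 2 is arbitrary and only meant to show how the number of columns changes.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures

X = load_iris().data  # 150 samples, 4 features

# Dimensionality reduction: project the 4 features onto 2 principal components.
X_pca = PCA(n_components=2).fit_transform(X)
print(X_pca.shape)

# Nonlinear combinations: all polynomial terms of the 4 features up to degree 2
# (a constant term, the 4 originals, and 10 products and squares).
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
print(X_poly.shape)
```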

scikit-image is a separate package designed for image data (see its skimage.feature submodule).
