Sklearn Normalization   2018-01-06


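Data normalization can improve how well machine learning models perform, particularly models that are sensitive to the scale of their input features.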

Normalization

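from sklearn import preprocessing  # data preprocessing (scaling) module
import numpy as np

# Build the array
a = np.array([[10, 2.7, 3.6],
              [-100, 5, -2],
              [120, 20, 40]], dtype=np.float64)

# Print a after scaling: each column is shifted to zero mean
# and divided by its standard deviation
print(preprocessing.scale(a))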
[[ 0.         -0.85170713 -0.55138018]
 [-1.22474487 -0.55187146 -0.852133  ]
 [ 1.22474487  1.40357859  1.40351318]]
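
To make the output above less of a black box, the same numbers can be reproduced by hand. The following is a minimal sketch using only NumPy; the variable names `col_mean`, `col_std` and `a_scaled` are illustrative and not part of the original post. It assumes the column-wise, population standard deviation (`ddof=0`), which matches the printed result.

```python
import numpy as np

a = np.array([[10, 2.7, 3.6],
              [-100, 5, -2],
              [120, 20, 40]], dtype=np.float64)

# Column-wise mean and population standard deviation (ddof=0)
col_mean = a.mean(axis=0)
col_std = a.std(axis=0)

# Standardize: subtract the column mean, divide by the column std
a_scaled = (a - col_mean) / col_std

print(a_scaled)               # same values as preprocessing.scale(a) above
print(a_scaled.mean(axis=0))  # approximately 0 for every column
print(a_scaled.std(axis=0))   # approximately 1 for every column
```

After this transformation every column is expressed in "standard deviations from its mean", so columns with very different raw ranges become comparable.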

The effect of normalization on results

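# Data preprocessing (scaling) module
from sklearn import preprocessing
import numpy as np

# Splits data into training and test sets
from sklearn.model_selection import train_test_split

# Generates data suitable for classification
# (the old sklearn.datasets.samples_generator path has been removed from scikit-learn)
from sklearn.datasets import make_classification

# Support Vector Classifier from the Support Vector Machine module
from sklearn.svm import SVC

# Plotting module for visualizing the data
import matplotlib.pyplot as plt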

Generating data suitable for classification

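# Generate 300 samples with 2 features each
X, y = make_classification(
    n_samples=300, n_features=2,
    n_redundant=0, n_informative=2,
    random_state=22, n_clusters_per_class=1,
    scale=100)

# n_features: total number of features = n_informative + n_redundant + n_repeated
#             (any remaining features are random noise)
# n_informative: number of informative features
# n_redundant: number of redundant features, random linear combinations of the informative ones
# n_classes: number of classes (defaults to 2)
# n_clusters_per_class: number of clusters that make up each class
# scale: factor by which every feature is multiplied (100 here)

# Visualize the data, colored by class label
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()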
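
Besides the scatter plot, it can help to look at the numbers directly. The sketch below regenerates the same data set and prints its shape, labels and per-feature range; the printed statistics are only there for inspection and are not part of the original post.

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=300, n_features=2,
    n_redundant=0, n_informative=2,
    random_state=22, n_clusters_per_class=1,
    scale=100)

print(X.shape)         # (number of samples, number of features)
print(np.unique(y))    # the class labels present in y
print(X.min(axis=0))   # smallest value of each feature
print(X.max(axis=0))   # largest value of each feature
print(X.std(axis=0))   # spread of each feature
```

Because of `scale=100`, the feature values are spread over a much wider range than they would be with the default `scale=1.0`.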


Before data normalization

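X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# gamma='auto' was the SVC default when this was written; newer scikit-learn
# releases default to gamma='scale', which already adjusts for feature variance
clf = SVC(gamma='auto')
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))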
0.477777777778
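
One plausible explanation for the low score is the RBF kernel that `SVC` uses by default, which computes exp(-gamma * squared distance) between samples. The sketch below is an illustration, not the author's analysis: the helper `rbf_kernel_value` and the two sample points are hypothetical, and `gamma = 0.5` assumes the `1 / n_features` setting with two features.

```python
import numpy as np

def rbf_kernel_value(x1, x2, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

gamma = 0.5  # assumed: 1 / n_features with n_features = 2

p = np.array([0.0, 0.0])
q = np.array([1.0, 1.0])

# Two points one unit apart in each feature
k_unit = rbf_kernel_value(p, q, gamma)

# The same two points after multiplying every feature by 100
k_large = rbf_kernel_value(100 * p, 100 * q, gamma)

print(k_unit)   # noticeably greater than zero
print(k_large)  # underflows to zero
```

When features are on the order of hundreds, almost every pair of distinct samples has a kernel value of zero, so the classifier has very little similarity information to generalize from.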

After data normalization

After scaling, the units of the data have changed, and the values in X are compressed into roughly the same range.

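# Standardize every feature, then split and train again
X = preprocessing.scale(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
clf = SVC(gamma='auto')  # the SVC default at the time of writing
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
# Roughly 0.9; the exact score varies because the split is random
0.933333333333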
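
Note that the code above scales all of X before splitting, so the test samples contribute to the mean and standard deviation used for scaling. A common variation, sketched here as an alternative rather than as the method used in this post, is to learn the scaling from the training set only by chaining `StandardScaler` and `SVC` in a pipeline. The fixed `random_state=0` for the split is an assumption added for repeatability.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(
    n_samples=300, n_features=2,
    n_redundant=0, n_informative=2,
    random_state=22, n_clusters_per_class=1,
    scale=100)

# Split first, so the test set stays untouched by the scaler
# (random_state=0 is an assumption, not in the original post)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# fit() learns mean and std from X_train only, then trains the SVC;
# score() applies that same training-set scaling to X_test
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)

score = model.score(X_test, y_test)
print(score)
```

The pipeline also guarantees that new data passed to `model.predict` is scaled in exactly the same way as the training data.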

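This article is published under the Attribution 4.0 International license; please keep the author attribution and a link to the article when reposting. For any licensing inquiries, please contact me by email.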
