Min-Max Scaling (Normalization)

Min-max scaling maps all values into the range 0 to 1. It suits data whose distribution has clear bounds, and it is strongly affected by outliers.

The formula for min-max scaling is

$$ x_{\text {scale}}=\frac{x-x_{\min }}{x_{\max }-x_{\min }} $$
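
As a quick worked example with made-up numbers (they are not taken from the data used in this post): if a feature has minimum 10 and maximum 50, the value 30 is mapped to

$$ x_{\text {scale}}=\frac{30-10}{50-10}=0.5 $$

By the same formula, the minimum itself maps to 0 and the maximum maps to 1.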

Implementing min-max scaling by hand

import numpy as np
import matplotlib.pyplot as plt

X = np.random.randint(0, 50, size=(50, 2))
X = np.array(X, dtype=float)
# Apply min-max scaling to each column separately
X[:,0] = (X[:,0] - np.min(X[:,0])) / (np.max(X[:,0]) - np.min(X[:,0]))
X[:,1] = (X[:,1] - np.min(X[:,1])) / (np.max(X[:,1]) - np.min(X[:,1]))
plt.scatter(X[:,0], X[:,1])
plt.show()

Both the x and y coordinates of the points now lie between 0 and 1.
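
The per-column steps above can also be written once for all columns. The following is a minimal sketch, not part of the original post: `min_max_scale` is a hypothetical helper name, the seed and the sample values are arbitrary, and the guard for constant columns is an added assumption. The second half illustrates the outlier sensitivity mentioned earlier.

```python
import numpy as np


def min_max_scale(X):
    """Scale every column of a 2-D array to [0, 1] (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Assumption: leave constant columns at 0 instead of dividing by zero.
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range


# Same kind of random data as above, scaled in one call.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 50, size=(50, 2)).astype(float)
X_scaled = min_max_scale(X_demo)

# Outlier sensitivity: one extreme value squeezes the rest toward 0.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.append(values, 1000.0)
scaled_clean = min_max_scale(values.reshape(-1, 1)).ravel()
scaled_outlier = min_max_scale(with_outlier.reshape(-1, 1)).ravel()

print(scaled_clean)         # evenly spread over [0, 1]
print(scaled_outlier[:5])   # the five ordinary values all end up below 0.01
```

Without the outlier, the five values spread evenly over the interval; with it, they are compressed into a tiny range near 0 while the outlier alone sits at 1.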

Standardization (Mean-Variance Normalization)

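When the data has no clear bounds, or contains extreme values, standardization can be used instead: it rescales the data to a distribution with mean 0 and variance 1. With $x_{\text {mean}}$ the mean of the feature and $s$ its standard deviation, the formula is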

$$ x_{\text {scale}}=\frac{x-x_{\text {mean}}}{s} $$

X2 = np.random.randint(0, 100, size=(50, 2))
X2 = np.array(X2, dtype=float)
# Standardize each column: subtract its mean, then divide by its standard deviation
X2[:,0] = (X2[:,0] - np.mean(X2[:,0])) / np.std(X2[:,0])
X2[:,1] = (X2[:,1] - np.mean(X2[:,1])) / np.std(X2[:,1])
plt.scatter(X2[:,0], X2[:,1])
plt.show()

# Check the mean and standard deviation of the first column
np.mean(X2[:,0])
# 1.2434497875801754e-16
np.std(X2[:,0])
# 0.9999999999999998
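
The same computation can be sketched for all columns at once. This is an illustrative version rather than the post's own code: `standardize` is a hypothetical helper name, the seed is arbitrary, and the guard for zero standard deviation is an added assumption. It uses the population standard deviation (`ddof=0`), which is the default of `np.std` used above.

```python
import numpy as np


def standardize(X):
    """Give every column of a 2-D array mean 0 and std 1 (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # population standard deviation, ddof=0
    # Assumption: leave constant columns at 0 instead of dividing by zero.
    std[std == 0] = 1.0
    return (X - mean) / std


rng = np.random.default_rng(0)
X_demo = rng.integers(0, 100, size=(50, 2)).astype(float)
X_std = standardize(X_demo)

# The means are 0 and the standard deviations are 1, up to floating-point error.
print(X_std.mean(axis=0))
print(X_std.std(axis=0))
```

Unlike min-max scaling, the result is not confined to a fixed interval, so a single extreme value does not compress all the other values into a narrow band.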

Scaling with the Scaler classes in scikit-learn

Everything above was implemented by hand; scikit-learn also provides ready-made tools for feature scaling.

Load the built-in iris dataset and split it into training and test sets before preprocessing

from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

StandardScaler in scikit-learn

from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()
# Learn the per-feature mean and standard deviation from the training set only
standardScaler.fit(X_train)
# StandardScaler(copy=True, with_mean=True, with_std=True)
standardScaler.mean_
# array([5.83416667, 3.08666667, 3.70833333, 1.17      ])
standardScaler.scale_
# array([0.81019502, 0.44327067, 1.76401924, 0.75317107])
# Apply the training-set statistics to both the training and test sets
X_train_standard = standardScaler.transform(X_train)
X_test_standard = standardScaler.transform(X_test)
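
For comparison, scikit-learn also offers `MinMaxScaler`, the counterpart to the min-max scaling shown at the start. The sketch below is an addition to the original walkthrough; the variable names (`std_scaler`, `minmax_scaler`, and so on) are chosen for illustration. It repeats the same split and checks what each scaler does to the training and test sets.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=666
)

# Standardization: statistics are learned from the training set only.
std_scaler = StandardScaler().fit(X_train)
X_train_std = std_scaler.transform(X_train)
X_test_std = std_scaler.transform(X_test)

# Min-max scaling: min and max are also learned from the training set only.
minmax_scaler = MinMaxScaler().fit(X_train)
X_train_mm = minmax_scaler.transform(X_train)
X_test_mm = minmax_scaler.transform(X_test)

# Training features have mean 0 and std 1 after standardization.
print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
# Test features are only approximately centered, because they reuse
# the training-set statistics.
print(X_test_std.mean(axis=0))
# Training features span exactly [0, 1] after min-max scaling.
print(X_train_mm.min(axis=0), X_train_mm.max(axis=0))
```

Fitting on the training set and reusing those statistics for the test set keeps information about the test data out of the preprocessing step. As a consequence, scaled test values may fall slightly outside the range or moments seen in training.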
Last modification: April 2nd, 2020 at 08:31 pm