The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.
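sklearn's preprocessing module provides a set of data preprocessing tools, covering standardization, min-max scaling, normalization, and feature binarization. Missing-value handling used to live here as well; in current releases it is in sklearn.impute.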
The basic workflow is:

    test_scaler = StandardScaler()
    # learn the scaling parameters from the data X
    test_scaler.fit(X)
    # apply the learned parameters
    test_scaler.transform(X)
    # or do both in one step
    test_scaler.fit_transform(X)
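The reason `fit` and `transform` are separate calls is that the statistics are usually learned on training data and then reused on new data. A minimal sketch, assuming a made-up train/test split (the arrays `x_train` and `x_test` are hypothetical and only for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data, for illustration only.
x_train = np.array([[1.0], [2.0], [3.0]])
x_test = np.array([[4.0]])

scaler = StandardScaler()
scaler.fit(x_train)                         # learn mean_ and var_ from the training data only
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)    # reuse the training statistics on new data

# fit_transform(x) is shorthand for fit(x) followed by transform(x).
x_train_again = StandardScaler().fit_transform(x_train)
```

Calling `fit_transform` on the test data instead would re-estimate the statistics from the test set, which is usually not what you want.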
1. Standardization
Using StandardScaler (mean = 0, std = 1)

    >>> from sklearn.preprocessing import StandardScaler
    >>> import numpy as np
    >>> test_array = np.arange(0, 12).reshape((3, 4))
    >>> test_array
    array([[ 0,  1,  2,  3],
           [ 4,  5,  6,  7],
           [ 8,  9, 10, 11]])
    >>> test_scaler = StandardScaler()
    >>> test_scaler.fit(test_array)
    >>> print(test_scaler.var_)    # per-column variance
    [10.66666667 10.66666667 10.66666667 10.66666667]
    >>> print(test_scaler.mean_)   # per-column mean
    [4. 5. 6. 7.]
    >>> test_scaler.transform(test_array, copy=True)
    array([[-1.22474487, -1.22474487, -1.22474487, -1.22474487],
           [ 0.        ,  0.        ,  0.        ,  0.        ],
           [ 1.22474487,  1.22474487,  1.22474487,  1.22474487]])
Using MinMaxScaler for range scaling (default range 0 to 1)

    >>> import numpy as np
    >>> from sklearn.preprocessing import MinMaxScaler
    >>> # random input, so the numbers below differ from run to run
    >>> test_array = np.random.uniform(low=-1, high=16, size=(4, 3))
    >>> print(test_array)
    [[10.45968472  6.52372993 12.08526458]
     [ 9.94529398  6.3849395   6.22910917]
     [12.02516025 11.50269044  6.380779  ]
     [14.67759022  0.36908024  1.13677392]]
    >>> test_scaler = MinMaxScaler(feature_range=(-1, 1))
    >>> test_scaler.fit(test_array)
    >>> print(test_scaler.scale_)   # (1 - (-1)) / (column max - column min)
    [0.42262781 0.17963625 0.18267358]
    >>> test_scaler.transform(test_array)
    array([[-0.78260417,  0.1055982 ,  1.        ],
           [-1.        ,  0.08066641, -0.06976488],
           [-0.12099067,  1.        , -0.04205881],
           [ 1.        , -1.        , -1.        ]])
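Using MaxAbsScaler to scale sparse data
Centering sparse data would turn its zeros into nonzeros and destroy the sparsity. To avoid that, use MaxAbsScaler, which divides each feature by its maximum absolute value and so maps the data into [-1, 1].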
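As a small sketch of the sparsity-preserving behaviour described above (the matrix `x_sparse` is made up for illustration and is not from the original notes):

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical sparse matrix, for illustration only.
x_sparse = sparse.csr_matrix(np.array([
    [0.0, -4.0, 0.0],
    [2.0,  0.0, 0.0],
    [0.0,  2.0, 5.0],
]))

scaler = MaxAbsScaler()
x_scaled = scaler.fit_transform(x_sparse)   # the result is still a sparse matrix

print(scaler.max_abs_)       # per-column maximum absolute value
print(x_scaled.toarray())    # zeros stay zero, nonzeros land in [-1, 1]
```

Because there is no centering step, the number of stored nonzero entries is the same before and after scaling.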
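Using RobustScaler to scale data with outliers
RobustScaler centers each feature on its median and scales it by the interquartile range, so outliers influence the result far less than they would through the mean and variance.

2. Non-linear transformations
These mainly comprise quantile transforms and power transforms, which map feature values from their original distribution onto another distribution.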
    >>> import numpy as np
    >>> from sklearn.datasets import load_iris
    >>> from sklearn.model_selection import train_test_split
    >>> from sklearn.preprocessing import QuantileTransformer
    >>> data_x, data_y = load_iris(return_X_y=True)
    >>> x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, test_size=0.2, random_state=42)
    >>> # the default n_quantiles=1000 exceeds the 120 training samples, so sklearn warns and clips it
    >>> quantileTransformer = QuantileTransformer()
    >>> x_train_trans = quantileTransformer.fit_transform(x_train)
    >>> # transform only: the test set must reuse the quantiles learned from the training set
    >>> x_test_trans = quantileTransformer.transform(x_test)
    >>> print(np.percentile(x_train[:, 0], [0, 25, 50, 75, 100]))
    [4.3  5.1  5.75 6.4  7.7 ]
    >>> print(np.percentile(x_train_trans[:, 0], [0, 25, 50, 75, 100]))
    [0.         0.24789916 0.5        0.7605042  1.        ]
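Power transforms come in two variants, the Yeo-Johnson transform and the Box-Cox transform (I don't fully understand these yet, so I won't walk through them in detail).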
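For reference only, here is a rough sketch of how the two variants can be invoked through `PowerTransformer`. The log-normal sample `x_skewed` is a made-up input chosen because it is skewed and strictly positive; it is not data from these notes.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
# Hypothetical right-skewed, strictly positive data, for illustration only.
x_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(200, 1))

# Yeo-Johnson (the default method) also accepts zero and negative values.
yeo_johnson = PowerTransformer(method="yeo-johnson")
x_yj = yeo_johnson.fit_transform(x_skewed)

# Box-Cox requires strictly positive input.
box_cox = PowerTransformer(method="box-cox")
x_bc = box_cox.fit_transform(x_skewed)

print(yeo_johnson.lambdas_, box_cox.lambdas_)   # fitted power parameter per feature
print(x_bc.mean(axis=0), x_bc.std(axis=0))      # about 0 and 1, since standardize=True by default
```

Both variants estimate one parameter per feature (stored in `lambdas_`) so that the transformed values look as Gaussian as possible.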
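3. Normalization
Use the normalize function directly. It supports three norms, {'l1', 'l2', 'max'}, default='l2'. The annoying part is that it works on row vectors by default (axis=1), so each sample is rescaled on its own; pass axis=0 to normalize each column instead.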
    >>> import numpy as np
    >>> from sklearn import preprocessing
    >>> test_array = np.arange(3, 15).reshape(2, 6)
    >>> print(test_array)
    [[ 3  4  5  6  7  8]
     [ 9 10 11 12 13 14]]
    >>> result = preprocessing.normalize(test_array, norm='l1', axis=0)
    >>> print(result)
    [[0.25       0.28571429 0.3125     0.33333333 0.35       0.36363636]
     [0.75       0.71428571 0.6875     0.66666667 0.65       0.63636364]]
    >>> print(result.sum(axis=0))   # every column now sums to 1
    [1. 1. 1. 1. 1. 1.]
Alternatively, use the Normalizer class.
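A minimal sketch of the class-based form, on a made-up two-row array. Unlike the `normalize` function, `Normalizer` has no `axis` parameter and always rescales each row (sample).

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical data, for illustration only.
test_array = np.array([[3.0, 4.0],
                       [6.0, 8.0]])

normalizer = Normalizer(norm="l2")
# fit() learns nothing here; it exists so the class can be used inside a Pipeline.
result = normalizer.fit_transform(test_array)

print(result)                           # each row scaled to unit L2 norm
print(np.linalg.norm(result, axis=1))   # row norms after scaling
```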
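4. Encoding categorical features
There are two main forms, one-hot encoding and ordinal (integer) encoding, provided by OneHotEncoder and OrdinalEncoder respectively.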
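To make the difference between the two encoders concrete, here is a small sketch on a made-up column of color names (the `colors` array is hypothetical):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical categorical column, for illustration only.
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

one_hot = OneHotEncoder()
# The default output is a SciPy sparse matrix; toarray() makes it dense for printing.
colors_one_hot = one_hot.fit_transform(colors).toarray()
print(one_hot.categories_)   # categories are sorted: blue, green, red
print(colors_one_hot)        # one column per category, a single 1 per row

ordinal = OrdinalEncoder()
colors_ordinal = ordinal.fit_transform(colors)
print(colors_ordinal)        # one column of codes: blue -> 0, green -> 1, red -> 2
```

Ordinal codes imply an order between categories, so one-hot encoding is usually the safer choice for unordered categories fed to linear models.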
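5. More to be added as I come across it...
Custom transformers
I haven't yet figured out how to pass in a function that takes multiple arguments.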
    >>> import numpy as np
    >>> from sklearn.preprocessing import FunctionTransformer
    >>> def my_power(x, power=2):
    ...     x = np.power(x, power)
    ...     return x
    >>> my_transformer = FunctionTransformer(my_power)
    >>> print(my_transformer)
    FunctionTransformer(func=<function my_power at 0x000001988F4543A0>)
    >>> test_array = np.array([1, 2, 3, 4])
    >>> print(my_transformer.fit_transform(test_array))
    [ 1  4  9 16]
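On the open question of passing extra arguments: one option, as far as I can tell, is the `kw_args` parameter of `FunctionTransformer`, which takes a dict of keyword arguments for the wrapped function. A minimal sketch that reuses `my_power` (the name `cube_transformer` is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def my_power(x, power=2):
    """Raise every element of x to the given power."""
    return np.power(x, power)

# kw_args is forwarded to func as keyword arguments on every transform call.
cube_transformer = FunctionTransformer(my_power, kw_args={"power": 3})

test_array = np.array([[1, 2], [3, 4]])
print(cube_transformer.fit_transform(test_array))
```

Wrapping the function with `functools.partial(my_power, power=3)` before handing it to `FunctionTransformer` should work as well; `kw_args` has the advantage that the extra arguments stay visible as a parameter of the transformer.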
    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin

    # column indices of the attributes used below
    rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

    class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
        def __init__(self, add_bedrooms_per_room=True):
            self.add_bedrooms_per_room = add_bedrooms_per_room

        def fit(self, X, y=None):
            return self  # nothing to learn

        def transform(self, X):
            rooms_per_house = X[:, rooms_ix] / X[:, households_ix]
            people_per_house = X[:, population_ix] / X[:, households_ix]
            if self.add_bedrooms_per_room:
                # the ratio the flag is named after: bedrooms per room
                bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
                return np.c_[X, rooms_per_house, people_per_house, bedrooms_per_room]
            else:
                return np.c_[X, rooms_per_house, people_per_house]