
General workflow for data preprocessing with sklearn

The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.

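sklearn's preprocessing module provides a set of data preprocessing tools, including standardization, min-max scaling, normalization, feature binarization, and missing-value handling (in current scikit-learn releases, missing-value imputation lives in the separate sklearn.impute module).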

The basic workflow is:

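# 1. Create the preprocessor (transformer)
test_scaler = StandardScaler()
# 2. Call fit to compute the statistics the preprocessor needs (StandardScaler, for example, computes mean, var, etc.)
test_scaler.fit(input_data)
# 3. Call transform to preprocess the data
test_scaler.transform(input_data)
# Or combine fit and transform into a single step
test_scaler.fit_transform(input_data)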
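
As a minimal runnable sketch of the steps above (the arrays `x_train` and `x_test` are made-up illustration data, not from the original notes), the scaler is fitted on the training data only, and the learned statistics are then reused for the test data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data for illustration only
x_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
x_test = np.array([[2.0, 40.0]])

scaler = StandardScaler()
scaler.fit(x_train)                          # learns mean_ and var_ from the training data
x_train_scaled = scaler.transform(x_train)   # applies the learned statistics
x_test_scaled = scaler.transform(x_test)     # reuses them; no refitting on the test data

# fit_transform gives the same result as fit followed by transform on the same data
same = np.allclose(StandardScaler().fit_transform(x_train), x_train_scaled)
print(same)
print(x_test_scaled)
```

Fitting on the training split and only transforming the test split keeps test-set statistics from leaking into the preprocessing step.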

1. Standardization

Using StandardScaler (mean = 0, std = 1)

>> from sklearn.preprocessing import StandardScaler
>> import numpy as np
>> test_array = np.arange(0,12).reshape((3, 4))
>> test_array
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>> test_scaler = StandardScaler()
>> test_scaler.fit(test_array)
>> print(test_scaler.var_)
[10.66666667 10.66666667 10.66666667 10.66666667]
>> print(test_scaler.mean_)
[4. 5. 6. 7.]
>> test_scaler.transform(test_array, copy = True)
array([[-1.22474487, -1.22474487, -1.22474487, -1.22474487],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 1.22474487,  1.22474487,  1.22474487,  1.22474487]])
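
To make clear what the scaler computes, here is a small sketch that redoes the same standardization by hand with NumPy and compares it with StandardScaler. The variable name `manual` is mine, just for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

test_array = np.arange(0, 12).reshape((3, 4))
scaler = StandardScaler().fit(test_array)

# Manual standardization: subtract the column mean, divide by the column std.
# StandardScaler uses the population std (ddof=0), which is also NumPy's default.
manual = (test_array - test_array.mean(axis=0)) / test_array.std(axis=0)

print(np.allclose(manual, scaler.transform(test_array)))
# scale_ holds the per-column std, i.e. the square root of var_
print(np.allclose(scaler.scale_, np.sqrt(scaler.var_)))
```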

Using MinMaxScaler to scale to a range (default 0 to 1)

>> import numpy as np
>> from sklearn.preprocessing import MinMaxScaler
>> test_array = np.random.uniform(low=-1, high=16, size=(4, 3))
>> print(test_array)
[[10.45968472  6.52372993 12.08526458]
 [ 9.94529398  6.3849395   6.22910917]
 [12.02516025 11.50269044  6.380779  ]
 [14.67759022  0.36908024  1.13677392]]
# Pass in the target range
>> test_scaler = MinMaxScaler((-1, 1))
>> test_scaler.fit(test_array)
>> print(test_scaler.scale_)
[0.42262781 0.17963625 0.18267358]
>> test_scaler.transform(test_array)
array([[-0.78260417,  0.1055982 ,  1.        ],
       [-1.        ,  0.08066641, -0.06976488],
       [-0.12099067,  1.        , -0.04205881],
       [ 1.        , -1.        , -1.        ]])
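
The mapping itself is simple enough to reproduce by hand. The sketch below uses made-up data (`x`, `lo`, `hi` are my own names) and checks a manual computation against MinMaxScaler; it also shows where the scale_ values printed above come from.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[1.0, -5.0], [3.0, 0.0], [5.0, 5.0]])  # made-up data
lo, hi = -1, 1                                        # target range

scaler = MinMaxScaler(feature_range=(lo, hi)).fit(x)

# Per column: rescale to [0, 1] first, then stretch and shift to [lo, hi]
x_std = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
manual = x_std * (hi - lo) + lo

print(np.allclose(manual, scaler.transform(x)))
# scale_ is the per-column factor (hi - lo) / (max - min)
print(scaler.scale_)
```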

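Using MaxAbsScaler to scale sparse data

To avoid destroying the sparsity of sparse data during scaling, use MaxAbsScaler. It divides each feature by that feature's maximum absolute value, which maps the data into [-1, 1] without shifting it, so zeros stay zero.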
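
A small sketch of this, assuming SciPy is available and using a made-up sparse matrix, shows that the output is still sparse and that no new non-zero entries appear:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Made-up sparse matrix for illustration: most entries are zero
x = sparse.csr_matrix(np.array([[0.0, 2.0, 0.0],
                                [-4.0, 0.0, 0.0],
                                [2.0, 0.0, 5.0]]))

scaler = MaxAbsScaler()
x_scaled = scaler.fit_transform(x)

print(scaler.max_abs_)             # per-column maximum absolute value
print(sparse.issparse(x_scaled))   # the result is still a sparse matrix
print(x_scaled.nnz == x.nnz)       # zeros stay zero, so sparsity is preserved
print(x_scaled.toarray())
```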

Using RobustScaler to scale data with outliers
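
With its default settings, RobustScaler centers each feature on its median and scales it by the interquartile range, so a few extreme values have little effect on the result. A rough sketch on a made-up single feature with one outlier, with StandardScaler shown for comparison:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Made-up single feature with one extreme outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

robust = RobustScaler()  # defaults: center on the median, scale by the IQR (25th to 75th percentile)
x_robust = robust.fit_transform(x)

print(robust.center_)    # median of the feature
print(robust.scale_)     # interquartile range of the feature
print(x_robust.ravel())  # the ordinary points stay in a small range around 0

# For comparison: the outlier dominates the mean and std used by StandardScaler
x_standard = StandardScaler().fit_transform(x)
print(x_standard.ravel())
```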

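2. Nonlinear transformation

This mainly covers quantile transforms and power transforms, which map feature values from their original distribution onto another distribution (uniform or normal).

Using QuantileTransformer to map to a uniform distribution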

>> import numpy as np
>> from sklearn.datasets import load_iris
>> from sklearn.model_selection import train_test_split
>> from sklearn.preprocessing import QuantileTransformer
>> data_x, data_y = load_iris(return_X_y=True)
>> x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, test_size = 0.2, random_state = 42)
>> quantileTransformer = QuantileTransformer()
>> x_train_trans = quantileTransformer.fit_transform(x_train)
# Reuse the quantiles learned from the training set; do not refit on the test set
>> x_test_trans = quantileTransformer.transform(x_test)
>> print(np.percentile(x_train[:, 0], [0, 25, 50, 75, 100]))
[4.3  5.1  5.75 6.4  7.7 ]
>> print(np.percentile(x_train_trans[:, 0], [0, 25, 50, 75, 100]))
[0.         0.24789916 0.5        0.7605042  1.        ]

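Using PowerTransformer to map to a normal distribution

Two methods are available: the Yeo-Johnson transform and the Box-Cox transform (I don't fully understand the details yet, so I won't go into them).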
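
For reference, a minimal usage sketch on made-up log-normal data (the data and variable names are my own). The one practical difference I'm sure of is that Box-Cox requires strictly positive input, while Yeo-Johnson also accepts zero and negative values.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
# Made-up right-skewed, strictly positive data (log-normal) for illustration
x = rng.lognormal(size=(200, 1))

# Yeo-Johnson (the default method) accepts any real values
yeo = PowerTransformer(method="yeo-johnson")
x_yeo = yeo.fit_transform(x)

# Box-Cox requires strictly positive input
boxcox = PowerTransformer(method="box-cox")
x_boxcox = boxcox.fit_transform(x)

print(yeo.lambdas_, boxcox.lambdas_)     # fitted power parameter, one per feature
print(x_boxcox.mean(), x_boxcox.std())   # close to 0 and 1, since standardize=True by default
```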

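3. Normalization

Use the normalize function directly. It supports three norms, {'l1', 'l2', 'max'}, default='l2'. Annoyingly, it works on row vectors by default (axis=1), so each sample rather than each feature is normalized.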

>> import numpy as np
>> from sklearn import preprocessing
>> test_array = np.arange(3, 15).reshape(2, 6)
>> print(test_array)
[[ 3  4  5  6  7  8]
 [ 9 10 11 12 13 14]]
# axis = 0 normalizes each column (feature) instead of each row
>> result = preprocessing.normalize(test_array, norm = 'l1', axis = 0)
>> print(result)
[[0.25       0.28571429 0.3125     0.33333333 0.35       0.36363636]
 [0.75       0.71428571 0.6875     0.66666667 0.65       0.63636364]]
>> print(result.sum(axis = 0))
[1. 1. 1. 1. 1. 1.]

Alternatively, use the Normalizer class.
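
A short sketch of the class form, on the same kind of array as in the normalize example. Note that Normalizer has no axis parameter: it always normalizes per sample (row), matching normalize with its default axis=1.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, normalize

test_array = np.arange(3, 15, dtype=float).reshape(2, 6)

# Normalizer always works per sample (row); it has no axis parameter
normalizer = Normalizer(norm="l1")
result = normalizer.fit_transform(test_array)   # fit learns nothing; the transformer is stateless

print(result.sum(axis=1))   # each row sums to 1 under the l1 norm
# Same result as the function form with its default axis=1
print(np.allclose(result, normalize(test_array, norm="l1", axis=1)))
```

The class form is mainly useful when the step has to sit inside a Pipeline.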

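4. Type conversion

This mainly covers two forms of encoding for categorical features, one-hot encoding and ordinal (integer) encoding, using OneHotEncoder and OrdinalEncoder.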
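
A minimal sketch of both encoders on a made-up column of color names (the data is mine, for illustration only):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Made-up categorical column for illustration
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

ordinal = OrdinalEncoder()
codes = ordinal.fit_transform(colors)
print(ordinal.categories_)   # categories are sorted: blue, green, red
print(codes.ravel())         # one number per category, in that sorted order

onehot = OneHotEncoder()
# The default output is a SciPy sparse matrix; toarray() makes it dense for printing
dense = onehot.fit_transform(colors).toarray()
print(dense)                 # one column per category, a single 1 in each row
```

Ordinal codes imply an order between categories, so one-hot encoding is usually the safer choice when the categories have no natural ranking.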

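5. More to be added as I come across it...

Custom transformers

1. Wrapping a function as a transformer with FunctionTransformer

I haven't figured out how to pass in a function that takes multiple arguments.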

>> import numpy as np
>> from sklearn.preprocessing import FunctionTransformer
>> def my_power(x, power = 2):
       x = np.power(x, power)
       return x
>> my_transformer = FunctionTransformer(my_power)
>> print(my_transformer)
FunctionTransformer(func=<function my_power at 0x000001988F4543A0>)
>> test_array = np.array([1,2,3,4])
>> print(my_transformer.fit_transform(test_array))
[ 1  4  9 16]
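
For functions that take extra arguments, one option that seems to fit is the kw_args parameter of FunctionTransformer: a dict that is forwarded to the wrapped function as keyword arguments. A small sketch, reusing my_power from above (`cube_transformer` is just an illustrative name):

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def my_power(x, power=2):
    return np.power(x, power)

# kw_args is forwarded to the function as keyword arguments on every call
cube_transformer = FunctionTransformer(my_power, kw_args={"power": 3})

test_array = np.array([[1, 2], [3, 4]])
cubed = cube_transformer.fit_transform(test_array)
print(cubed)
```

Another route would be functools.partial, but kw_args keeps the extra arguments visible as a parameter of the transformer.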

2. Subclassing BaseEstimator and TransformerMixin (fit_transform is provided automatically)

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Column indices of the source attributes
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

# Adds new attribute columns derived from existing ones (from the book 机器学习实战, Hands-On Machine Learning)
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True):
        self.add_bedrooms_per_room = add_bedrooms_per_room

    # fit learns whatever statistics transform needs (standardization, for example,
    # computes the mean and variance); this transformer has nothing to learn
    def fit(self, X, y = None):
        return self

    def transform(self, X):
        rooms_per_house = X[:, rooms_ix] / X[:, households_ix]
        people_per_house = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_house, people_per_house, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_house, people_per_house]
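
The transformer above learns nothing in fit. As a contrast, here is a sketch of a custom transformer that does keep state: a hypothetical `MeanCenterer` (my own name and data, not from the book) that stores the column means in fit and subtracts them in transform. Because it follows the fit/transform convention, it can also be chained in a Pipeline.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: subtracts the column means learned in fit."""

    def fit(self, X, y=None):
        # By convention, learned state is stored in attributes ending with an underscore
        self.mean_ = np.asarray(X).mean(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X) - self.mean_

x_train = np.array([[1.0, 10.0], [3.0, 30.0]])   # made-up data
x_test = np.array([[2.0, 40.0]])

centerer = MeanCenterer()
train_centered = centerer.fit_transform(x_train)   # fit_transform comes from TransformerMixin
test_centered = centerer.transform(x_test)         # uses the means learned from x_train
print(train_centered)
print(test_centered)

# Chain the custom transformer with a built-in one
pipeline = Pipeline([("center", MeanCenterer()), ("scale", MinMaxScaler())])
piped = pipeline.fit_transform(x_train)
print(piped)
```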