A general workflow for data preprocessing with sklearn

The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.

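sklearn's preprocessing module provides a range of data preprocessing tools, including standardization, min-max scaling, normalization, feature binarization, and missing-value handling (in recent sklearn versions, imputers such as SimpleImputer live in sklearn.impute).

The basic workflow is: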

# 1. Create the preprocessor (a transformer)
test_scaler = StandardScaler()
# 2. Call fit to compute the statistics the preprocessor needs (StandardScaler computes mean, var, etc.)
test_scaler.fit(X)
# 3. Call transform to preprocess the data
test_scaler.transform(X)
# Or merge the fit and transform steps into a single call
test_scaler.fit_transform(X)
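
The point of keeping fit and transform separate is that statistics learned from one dataset can be reused on another. Below is a minimal sketch of that idea; the toy arrays X_train and X_new are made up for illustration and do not come from the original notes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration: 4 training rows and 2 new rows
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_new = np.array([[2.5, 25.0], [5.0, 50.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean_ and var_ from X_train
X_new_scaled = scaler.transform(X_new)          # reuses the statistics learned above

print(scaler.mean_)   # column means of X_train: [ 2.5 25. ]
print(X_new_scaled)   # the first row equals the training mean, so it maps to zeros
```

Calling fit_transform on the new data instead would recompute the mean and variance from that data, so the two sets would no longer be on the same scale.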

1. Standardization

Using StandardScaler (mean = 0, std = 1)

>> from sklearn.preprocessing import StandardScaler
>> import numpy as np
>> test_array = np.arange(0, 12).reshape((3, 4))
>> test_array
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>> test_scaler = StandardScaler()
>> test_scaler.fit(test_array)
>> print(test_scaler.var_)   # per-column variance
[10.66666667 10.66666667 10.66666667 10.66666667]
>> print(test_scaler.mean_)  # per-column mean
[4. 5. 6. 7.]
>> test_scaler.transform(test_array, copy=True)
array([[-1.22474487, -1.22474487, -1.22474487, -1.22474487],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 1.22474487,  1.22474487,  1.22474487,  1.22474487]])

Scaling to a range with MinMaxScaler (default range 0 to 1)

>> from sklearn.preprocessing import MinMaxScaler
>> test_array = np.random.uniform(low=-1, high=16, size=(4, 3))
>> print(test_array)
[[10.45968472  6.52372993 12.08526458]
 [ 9.94529398  6.3849395   6.22910917]
 [12.02516025 11.50269044  6.380779  ]
 [14.67759022  0.36908024  1.13677392]]
# Pass in the target range
>> test_scaler = MinMaxScaler((-1, 1))
>> test_scaler.fit(test_array)
>> print(test_scaler.scale_)  # per-column factor: (1 - (-1)) / (column max - column min)
[0.42262781 0.17963625 0.18267358]
>> test_scaler.transform(test_array)
array([[-0.78260417,  0.1055982 ,  1.        ],
       [-1.        ,  0.08066641, -0.06976488],
       [-0.12099067,  1.        , -0.04205881],
       [ 1.        , -1.        , -1.        ]])

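Scaling sparse data with MaxAbsScaler

To avoid destroying the sparsity of sparse data during scaling, use MaxAbsScaler. It divides each feature by that feature's maximum absolute value, which maps the data into [-1, 1] without shifting the zeros.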
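
A minimal sketch of this behaviour, using a small made-up SciPy CSR matrix (the data is an assumption for illustration only):

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Small made-up sparse matrix; most entries are zero
X_sparse = sparse.csr_matrix(np.array([[1.0, 0.0, -4.0],
                                       [0.0, 2.0, 0.0],
                                       [-2.0, 0.0, 8.0]]))

max_abs_scaler = MaxAbsScaler()
X_scaled = max_abs_scaler.fit_transform(X_sparse)

print(max_abs_scaler.max_abs_)        # per-column maximum absolute value: [2. 2. 8.]
print(X_scaled.toarray())             # every entry now lies in [-1, 1]
print(X_scaled.nnz == X_sparse.nnz)   # zeros stay zeros, so sparsity is preserved
```

Because the data is only divided and never centered, zero entries stay zero and the result remains a sparse matrix.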

Scaling data that contains outliers with RobustScaler
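
With its default settings, RobustScaler centers each feature on its median and scales by the interquartile range, so a single extreme value has little influence on the learned statistics. A minimal sketch on a made-up column with one outlier:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Made-up single-feature data with one obvious outlier (100.0)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

robust_scaler = RobustScaler()  # defaults: center on the median, scale by the 25%-75% range
X_scaled = robust_scaler.fit_transform(X)

print(robust_scaler.center_)  # median of the column: [3.]
print(robust_scaler.scale_)   # interquartile range (4 - 2): [2.]
print(X_scaled.ravel())       # the outlier stays large, the other points are barely affected by it
```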

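2. Nonlinear transformations

These mainly include quantile transforms and power transforms (mapping toward a normal distribution), which map feature values from their original distribution onto another distribution.

Mapping to a uniform distribution with QuantileTransformer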

>> from sklearn.datasets import load_iris
>> from sklearn.model_selection import train_test_split
>> from sklearn.preprocessing import QuantileTransformer
>> data_x, data_y = load_iris(return_X_y=True)
>> x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, test_size=0.2, random_state=42)
>> quantileTransformer = QuantileTransformer()
>> x_train_trans = quantileTransformer.fit_transform(x_train)
# Reuse the quantiles learned from the training set; do not refit on the test set
>> x_test_trans = quantileTransformer.transform(x_test)
>> print(np.percentile(x_train[:, 0], [0, 25, 50, 75, 100]))
[4.3  5.1  5.75 6.4  7.7 ]
>> print(np.percentile(x_train_trans[:, 0], [0, 25, 50, 75, 100]))
[0.         0.24789916 0.5        0.7605042  1.        ]

Mapping to a normal distribution with PowerTransformer

PowerTransformer supports two methods, the Yeo-Johnson transform and the Box-Cox transform (I do not fully understand the details yet).
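
As a rough sketch of how the two methods are selected, the snippet below applies both to made-up log-normal data. The data and variable names are assumptions for illustration. Box-Cox only accepts strictly positive input, while Yeo-Johnson (the default) also handles zero and negative values.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Made-up right-skewed, strictly positive data
rng = np.random.RandomState(0)
X_skewed = rng.lognormal(size=(200, 1))

yeo_johnson = PowerTransformer(method="yeo-johnson")  # default; also accepts values <= 0
box_cox = PowerTransformer(method="box-cox")          # requires strictly positive input

X_yj = yeo_johnson.fit_transform(X_skewed)
X_bc = box_cox.fit_transform(X_skewed)

# The fitted power parameter for each feature is stored in lambdas_
print(yeo_johnson.lambdas_, box_cox.lambdas_)
# With the default standardize=True the output has zero mean and unit variance
print(X_bc.mean(), X_bc.std())
```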

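3. Normalization

Use the normalize function directly. It supports three norms, {'l1', 'l2', 'max'}, with default='l2'. Annoyingly, it works on row vectors by default (axis=1); pass axis=0 to normalize each column instead.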

>> from sklearn import preprocessing
>> test_array = np.arange(3, 15).reshape(2, 6)
>> print(test_array)
[[ 3  4  5  6  7  8]
 [ 9 10 11 12 13 14]]
>> result = preprocessing.normalize(test_array, norm='l1', axis=0)
>> print(result)
[[0.25       0.28571429 0.3125     0.33333333 0.35       0.36363636]
 [0.75       0.71428571 0.6875     0.66666667 0.65       0.63636364]]
>> print(result.sum(axis=0))
[1. 1. 1. 1. 1. 1.]

Alternatively, use the Normalizer class.
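
A minimal sketch of the class-based form on made-up data. Unlike the normalize function, Normalizer has no axis parameter and always rescales each row (sample) to unit norm. Its fit step learns nothing, so it mainly exists to let the operation be used as a transformer.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Made-up data: each row is one sample
X = np.array([[3.0, 4.0],
              [1.0, 0.0],
              [6.0, 8.0]])

normalizer = Normalizer(norm="l2")  # 'l1', 'l2' or 'max', same choices as normalize
X_norm = normalizer.fit_transform(X)

print(X_norm)                          # rows become [0.6, 0.8], [1.0, 0.0], [0.6, 0.8]
print(np.linalg.norm(X_norm, axis=1))  # every row now has L2 norm 1
```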

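4. Type conversion (categorical encoding)

This mainly covers two forms, one-hot encoding and ordinal (integer) encoding, provided by OneHotEncoder and OrdinalEncoder.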
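
A minimal sketch of both encoders on a made-up color column (the data is an assumption for illustration). By default both encoders sort the categories they find, and OneHotEncoder returns a sparse matrix, which is converted to a dense array here for printing.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Made-up categorical feature: one column, four samples
X_cat = np.array([["red"], ["green"], ["blue"], ["green"]])

ordinal_encoder = OrdinalEncoder()
X_ordinal = ordinal_encoder.fit_transform(X_cat)
print(ordinal_encoder.categories_)  # categories in sorted order: blue, green, red
print(X_ordinal.ravel())            # [2. 1. 0. 1.]

onehot_encoder = OneHotEncoder()
X_onehot = onehot_encoder.fit_transform(X_cat).toarray()  # sparse by default
print(X_onehot)                     # one column per category, in the same sorted order
```

Ordinal codes imply an order between categories, so one-hot encoding is usually the safer choice when the categories have no natural ranking.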

5. To be continued as I run into more cases...

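Custom transformers

1. Wrapping a function as a transformer with FunctionTransformer

At first I could not figure out how to wrap a function that takes extra arguments; FunctionTransformer accepts them as a dictionary through its kw_args parameter.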

>> import numpy as np
>> from sklearn.preprocessing import FunctionTransformer
>> def my_power(x, power=2):
    # Raise every element to the given power
    return np.power(x, power)
>> my_transformer = FunctionTransformer(my_power)
>> print(my_transformer)
FunctionTransformer(func=<function my_power at 0x000001988F4543A0>)
>> test_array = np.array([1, 2, 3, 4])
>> print(my_transformer.fit_transform(test_array))
[ 1  4  9 16]
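
For functions with extra arguments, FunctionTransformer forwards the contents of its kw_args dictionary to the wrapped function as keyword arguments. A minimal sketch, reusing my_power with a made-up 2D input:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def my_power(x, power=2):
    # Raise every element to the given power
    return np.power(x, power)

# kw_args is passed to my_power as keyword arguments, i.e. my_power(X, power=3)
cube_transformer = FunctionTransformer(my_power, kw_args={"power": 3})

test_array = np.array([[1, 2], [3, 4]])
result = cube_transformer.fit_transform(test_array)
print(result)  # [[ 1  8]
               #  [27 64]]
```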

2. Subclassing BaseEstimator and TransformerMixin (fit_transform is provided automatically)

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
# Column indices of the source attributes
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6
# Add new attribute columns derived from existing ones (from the book 机器学习实战, Hands-On Machine Learning)
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    # fit learns whatever transform needs (a standardizer would compute the mean and variance here);
    # this transformer has nothing to learn
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
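
The transformer above learns nothing in fit. As a complementary sketch, here is a hypothetical MeanCenterer (the class name and toy data are made up for illustration, not part of sklearn or the book) whose fit does store a statistic. It also shows what the two base classes contribute: fit_transform comes from TransformerMixin and get_params from BaseEstimator.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical example: subtract each column's training mean."""

    def __init__(self, with_centering=True):
        self.with_centering = with_centering

    def fit(self, X, y=None):
        # Learn the statistic needed later by transform
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return X - self.mean_ if self.with_centering else X

X = np.array([[1.0, 10.0], [3.0, 30.0]])
centerer = MeanCenterer()
X_centered = centerer.fit_transform(X)  # inherited from TransformerMixin
print(X_centered)                       # [[ -1. -10.]
                                        #  [  1.  10.]]
print(centerer.get_params())            # inherited from BaseEstimator: {'with_centering': True}
```

Keeping constructor arguments as plain attributes with the same names, as done here, is what lets get_params and set_params work without extra code.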