
Cross-validation with the model_selection module

To avoid overfitting, add a validation set: after training finishes, evaluate the model on the validation set first, and only then run the final evaluation on the test set.

Splitting the training and test sets

Use the train_test_split function:

>> import numpy as np
>> from sklearn.model_selection import train_test_split
>> from sklearn import datasets

# Load the iris dataset
>> x_data, y_data = datasets.load_iris(return_X_y=True)
# Split into training and test sets
>> x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=42)
>> print(x_train.shape)
>> print(x_test.shape)
(120, 4)
(30, 4)
>> from sklearn import svm
>> clf = svm.SVC(kernel='linear', C=1).fit(x_train, y_train)
>> clf.score(x_test, y_test)
1.0
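The train/validation/test split described above can be produced with two successive calls to train_test_split. The 60/20/20 proportions below are an illustrative choice, not something the text prescribes:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

x_data, y_data = datasets.load_iris(return_X_y=True)

# First carve off the test set (20%), then split the rest into train/validation
x_rest, x_test, y_rest, y_test = train_test_split(
    x_data, y_data, test_size=0.2, random_state=42)
x_train, x_val, y_train, y_val = train_test_split(
    x_rest, y_rest, test_size=0.25, random_state=42)  # 0.25 * 0.8 = 0.2 of the total

print(x_train.shape, x_val.shape, x_test.shape)  # (90, 4) (30, 4) (30, 4)
```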

Cross-validation


Evaluating a model with the cross_val_score function

Basic usage

>> from sklearn.model_selection import cross_val_score
>> clf = svm.SVC(kernel='linear', C=1, random_state=42)
>> scores = cross_val_score(clf, x_data, y_data, cv = 5)
>> print(scores)
[0.96666667 1. 0.96666667 0.96666667 1. ]
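The five fold scores above are usually summarized as a mean plus a standard deviation; a minimal sketch:

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

x_data, y_data = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1, random_state=42)
scores = cross_val_score(clf, x_data, y_data, cv=5)

# Summarize the per-fold accuracies as mean and standard deviation
print("%.2f accuracy with a standard deviation of %.2f" % (scores.mean(), scores.std()))
```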

The scoring parameter selects the evaluation metric; if it is not explicitly specified, the estimator's default score method is used.


>> scores = cross_val_score(clf, x_data, y_data, cv = 5, scoring = 'f1_macro')
>> print(scores)
[0.96658312 1. 0.96658312 0.96658312 1. ]
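Besides the string names like 'f1_macro', scoring also accepts a callable built with make_scorer. The sketch below rebuilds the macro-averaged F1 scorer by hand, which should produce the same values as scoring='f1_macro':

```python
from sklearn import datasets, svm
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_val_score

x_data, y_data = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1, random_state=42)

# A callable scorer equivalent to the built-in 'f1_macro' string
macro_f1 = make_scorer(f1_score, average='macro')
scores = cross_val_score(clf, x_data, y_data, cv=5, scoring=macro_f1)
print(scores)
```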

The cv parameter accepts not only the number of folds k, but also a cross-validation generator or an iterable.

# Pass a cross-validation generator
>> from sklearn.model_selection import ShuffleSplit
>> cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
>> cross_val_score(clf, x_data, y_data, cv=cv)
array([1. , 1. , 0.96666667, 0.93333333, 0.96666667])
# Pass an iterable that yields (train_index, test_index) pairs (code from the tutorial)
>>> def custom_cv_2folds(X):
...     n = X.shape[0]
...     i = 1
...     while i <= 2:
...         idx = np.arange(n * (i - 1) / 2, n * i / 2, dtype=int)
...         yield idx, idx
...         i += 1
...
>>> custom_cv = custom_cv_2folds(x_data)
>>> cross_val_score(clf, x_data, y_data, cv=custom_cv)
array([1. , 0.973...])
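Besides ShuffleSplit, KFold and StratifiedKFold are the generators most commonly passed as cv; StratifiedKFold preserves the class proportions in every fold, which matters for classification. A minimal sketch on the same data:

```python
from sklearn import datasets, svm
from sklearn.model_selection import StratifiedKFold, cross_val_score

x_data, y_data = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1, random_state=42)

# Each fold keeps roughly the same class distribution as the full dataset
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, x_data, y_data, cv=cv)
print(scores)
```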