To avoid overfitting, a validation set is added: after training, the model is first evaluated on the validation set, and only then is it tested on the test set.
Splitting into training and test sets
Using the train_test_split function
```python
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> from sklearn import datasets
>>> x_data, y_data = datasets.load_iris(return_X_y=True)
>>> x_train, x_test, y_train, y_test = train_test_split(
...     x_data, y_data, test_size=0.2, random_state=42)
>>> print(x_train.shape)
(120, 4)
>>> print(x_test.shape)
(30, 4)
>>> from sklearn import svm
>>> clf = svm.SVC(kernel='linear', C=1).fit(x_train, y_train)
>>> clf.score(x_test, y_test)
1.0
```
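The validation-set workflow from the introduction can be sketched with two consecutive `train_test_split` calls. The 20%/25% ratios below are illustrative assumptions, not from the original text:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets

x_data, y_data = datasets.load_iris(return_X_y=True)

# First carve out the final test set (20% of the data).
x_trainval, x_test, y_trainval, y_test = train_test_split(
    x_data, y_data, test_size=0.2, random_state=42)

# Then split the remainder into training and validation sets
# (25% of the remainder, i.e. 20% of the full data).
x_train, x_val, y_train, y_val = train_test_split(
    x_trainval, y_trainval, test_size=0.25, random_state=42)

print(x_train.shape, x_val.shape, x_test.shape)  # (90, 4) (30, 4) (30, 4)
```

The model is tuned against `x_val` and touches `x_test` only once, at the very end.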
Cross validation
Using the cross_val_score function to evaluate a model with cross-validation
Basic usage
```python
>>> from sklearn.model_selection import cross_val_score
>>> clf = svm.SVC(kernel='linear', C=1, random_state=42)
>>> scores = cross_val_score(clf, x_data, y_data, cv=5)
>>> print(scores)
[0.96666667 1.         0.96666667 0.96666667 1.        ]
```
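The per-fold scores are usually summarized as a mean with a spread; a minimal sketch recomputing the same 5-fold run:

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

x_data, y_data = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1, random_state=42)

# Same 5-fold run as above; mean/std summarize the five accuracies.
scores = cross_val_score(clf, x_data, y_data, cv=5)
print("mean accuracy %.3f (std %.3f)" % (scores.mean(), scores.std()))
```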
The scoring parameter lets you supply a custom evaluation metric; if it is not explicitly specified, the estimator's default score method is used.
```python
>>> scores = cross_val_score(clf, x_data, y_data, cv=5, scoring='f1_macro')
>>> print(scores)
[0.96658312 1.         0.96658312 0.96658312 1.        ]
```
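Besides string names like `'f1_macro'`, `scoring` also accepts a scorer object. A sketch using `make_scorer` to build the equivalent macro-F1 scorer:

```python
from sklearn import datasets, svm
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_val_score

x_data, y_data = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1, random_state=42)

# make_scorer wraps any metric function into a scorer object;
# this is equivalent to passing scoring='f1_macro'.
macro_f1 = make_scorer(f1_score, average='macro')
scores = cross_val_score(clf, x_data, y_data, cv=5, scoring=macro_f1)
print(scores)
```

This route is useful when the metric needs non-default keyword arguments that a plain string cannot express.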
The cv parameter accepts more than just the number of folds k: it can also be a cross-validation generator or an iterable yielding (train, test) index pairs.
```python
>>> from sklearn.model_selection import ShuffleSplit
>>> cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
>>> cross_val_score(clf, x_data, y_data, cv=cv)
array([1.        , 1.        , 0.96666667, 0.93333333, 0.96666667])

>>> def custom_cv_2folds(X):
...     n = X.shape[0]
...     i = 1
...     while i <= 2:
...         idx = np.arange(n * (i - 1) / 2, n * i / 2, dtype=int)
...         yield idx, idx
...         i += 1
...
>>> custom_cv = custom_cv_2folds(x_data)
>>> cross_val_score(clf, x_data, y_data, cv=custom_cv)
array([1.   , 0.973...])
```
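Another common choice for cv is a ready-made splitter such as StratifiedKFold, which preserves each class's proportion in every fold (the shuffle/random_state values below are illustrative):

```python
from sklearn import datasets, svm
from sklearn.model_selection import StratifiedKFold, cross_val_score

x_data, y_data = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1, random_state=42)

# Each of the 5 folds keeps the 1/3-1/3-1/3 class balance of iris.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, x_data, y_data, cv=cv)
print(scores.shape)  # (5,)
```

Note that for classification, passing an integer cv already uses stratified folds under the hood; constructing the splitter explicitly is mainly useful when you want shuffling or a fixed random_state.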