LikeLion X K-DIGITAL Training - 06.29
[Reference] 2021.06.30 - [python/k-digital] - [K-DIGITAL] Machine Learning Algorithms (3) Decision Trees
Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import ensemble
from sklearn import datasets
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error
- sklearn.ensemble : ensemble machine learning models
- sklearn.datasets : built-in datasets shipped with scikit-learn
- sklearn.utils.shuffle : method for shuffling data
- sklearn.metrics.mean_squared_error : computes the MSE
Built-in dataset - Boston Housing dataset
boston = datasets.load_boston()
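Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2. On newer versions, the data can still be loaded with the snippet suggested in the scikit-learn deprecation notice (feature names then have to be supplied by hand, since this returns bare arrays):

import numpy as np
import pandas as pd

# Boston housing data fetched from the original source, per the
# scikit-learn deprecation notice (use only if load_boston is unavailable)
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]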
train data 90% / test data 10%
X, y = shuffle(boston.data, boston.target, random_state=13)
X = X.astype(np.float32)
offset = int(X.shape[0] * 0.9) # the first 90% of the data becomes the train set
X_train, y_train = X[:offset], y[:offset]
X_test, y_test = X[offset:], y[offset:]
- sklearn.utils.shuffle() : shuffles several arrays in a consistent order, so each row of X stays aligned with its label in y (the code above uses this, not np.random.shuffle())
- ndarray.astype() : casts the array to the given dtype
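A minimal sketch with toy arrays (not from the course) showing that sklearn.utils.shuffle permutes both arrays in the same order:

import numpy as np
from sklearn.utils import shuffle

X_toy = np.array([[1, 2], [3, 4], [5, 6]])
y_toy = np.array([10, 20, 30])
# both arrays receive the same permutation, so row-label pairs survive
X_s, y_s = shuffle(X_toy, y_toy, random_state=13)
print(X_s, y_s)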
Training a GradientBoostingRegressor model
params = {'n_estimators': 500, 'max_depth': 4, 'min_samples_split': 2, 'learning_rate': 0.01, 'loss': 'ls'}
# create the model ('ls' was renamed to 'squared_error' in scikit-learn 1.0)
clf = ensemble.GradientBoostingRegressor(**params)
# ensemble.GradientBoostingRegressor(n_estimators=500, max_depth=4, min_samples_split=2, learning_rate=0.01, loss='ls')
# train
clf.fit(X_train, y_train)
MSE
mse = mean_squared_error(y_test, clf.predict(X_test))
print("MSE : %.4f" % mse)
>> MSE : 6.4166
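For reference, mean_squared_error is simply the mean of the squared residuals, so the same value can be reproduced directly from the definition:

# MSE from its definition: the mean of (actual - predicted)^2
y_pred = clf.predict(X_test)
mse_manual = np.mean((y_test - y_pred) ** 2)  # matches mean_squared_error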
Plot training deviance - deviance (error) versus the number of decision trees
# initialize test_score
test_score = np.zeros((params['n_estimators'],), dtype=np.float64)
# deviance between predictions and actual values at each boosting stage
for i, y_pred in enumerate(clf.staged_predict(X_test)):
    test_score[i] = clf.loss_(y_test, y_pred)
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title('Deviance')
plt.plot(np.arange(params['n_estimators']) + 1, clf.train_score_, 'b-', label='Training Set Deviance')
plt.plot(np.arange(params['n_estimators']) + 1, test_score, 'r-', label='Test Set Deviance')
plt.legend(loc='upper right')
plt.xlabel('Boosting Iterations')
plt.ylabel('Deviance')
- np.zeros() : returns an ndarray of the given shape filled with zeros
- staged_predict() : returns an iterator over the ensemble's predictions at each boosting stage (tree)
>> The deviance shrinks as the number of decision trees grows
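Note: the loss_ attribute used above was deprecated and later removed from scikit-learn. On newer versions, the same test curve can be sketched with mean_squared_error, which matches the squared-error loss configured here:

# per-stage test error without clf.loss_ (removed in newer scikit-learn);
# with squared-error loss, the deviance is just the MSE at each stage
test_score = np.zeros((params['n_estimators'],), dtype=np.float64)
for i, y_pred in enumerate(clf.staged_predict(X_test)):
    test_score[i] = mean_squared_error(y_test, y_pred)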
Feature Importance - importance of each feature
feature_importance = clf.feature_importances_
# make importances relative to the max importance
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance) # ascending sort (least to most important)
pos = np.arange(sorted_idx.shape[0]) + .5
plt.subplot(1, 2, 2)
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, boston.feature_names[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()
- feature_importances_ : the importance value of each feature (column)
>> Model Explainability & Interpretability : the model's decisions can be explained and interpreted
Feature selection : the important columns can be selected (see the sketch below)
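A hedged sketch of one way to do this with scikit-learn's SelectFromModel, reusing the regressor fitted above; the 'mean' threshold here is an assumption, not from the course:

from sklearn.feature_selection import SelectFromModel

# keep only the columns whose importance exceeds the mean importance
selector = SelectFromModel(clf, threshold='mean', prefit=True)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)
print(X_train.shape, '->', X_train_sel.shape)  # fewer columns remain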
Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets, ensemble
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
Built-in dataset - Breast Cancer dataset
cancer = datasets.load_breast_cancer()
X, y = cancer.data, cancer.target
train data 90% / test data 10%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=13)
* test_size = 0.25 (default)
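An optional refinement that is not in the course code: passing stratify=y keeps the class ratio identical in the train and test splits, which matters on small test sets:

# stratified split (an addition, not part of the original snippet)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=13, stratify=y)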
Training a GradientBoostingClassifier model
params = {'n_estimators': 1000,
'max_depth': 4,
'min_samples_split': 5,
'learning_rate': 0.01}
clf = ensemble.GradientBoostingClassifier(**params)
clf.fit(X_train, y_train)
Accuracy
acc = accuracy_score(y_test, clf.predict(X_test))
print("Accuracy : %.4f" % acc)
>> Accuracy : 0.9123
# initialize train_score
train_score = np.zeros((params['n_estimators'],), dtype=np.float64)
for i, y_pred in enumerate(clf.staged_predict(X_train)):
    train_score[i] = accuracy_score(y_train, y_pred)
# initialize test_score
test_score = np.zeros((params['n_estimators'],), dtype=np.float64)
for i, y_pred in enumerate(clf.staged_predict(X_test)):
    test_score[i] = accuracy_score(y_test, y_pred)
fig = plt.figure(figsize=(12, 6))
plt.subplot(1, 1, 1)
plt.title('Accuracy')
plt.plot(np.arange(params['n_estimators']) + 1, train_score, 'b-', label='Training Set Accuracy')
plt.plot(np.arange(params['n_estimators']) + 1, test_score, 'r-', label='Test Set Accuracy')
plt.legend(loc='upper right')
plt.xlabel('Boosting Iterations')
plt.ylabel('Accuracy')
fig.tight_layout()
plt.show()
- staged_predict() : returns an iterator over the ensemble's predictions at each boosting stage (tree)
>> Accuracy rises as more decision trees are added, then holds steady beyond a certain point
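Since the accuracy plateaus, the model does not necessarily need all 1000 trees. A sketch using scikit-learn's built-in early stopping (the parameter values here are assumptions, not from the course):

# early stopping: hold out 10% of the training data internally and stop
# once the validation score fails to improve for 20 consecutive stages
clf_es = ensemble.GradientBoostingClassifier(
    n_estimators=1000, max_depth=4, min_samples_split=5, learning_rate=0.01,
    validation_fraction=0.1, n_iter_no_change=20, tol=1e-4, random_state=13)
clf_es.fit(X_train, y_train)
print(clf_es.n_estimators_)  # number of stages actually fit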
Feature Importance - importance of each feature
feature_importance = clf.feature_importances_
sorted_idx = np.argsort(feature_importance) # ascending sort (least to most important)
pos = np.arange(sorted_idx.shape[0]) + .5
fig = plt.figure(figsize=(12, 6))
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, np.array(cancer.feature_names)[sorted_idx])
plt.title('Feature Importance (MDI)')
fig.tight_layout()
plt.show()
- feature_importances_ : the importance value of each feature (column)
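Impurity-based (MDI) importances can overstate high-cardinality features; permutation importance on the held-out test set is a common complement. A sketch using scikit-learn's permutation_importance (the parameter values are assumptions):

from sklearn.inspection import permutation_importance

# importance = average drop in test accuracy when one column is shuffled
result = permutation_importance(clf, X_test, y_test,
                                n_repeats=10, random_state=13)
perm_sorted_idx = result.importances_mean.argsort()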
AUC
from sklearn.metrics import roc_curve, auc
# ROC curve
fpr, tpr, _ = roc_curve(y_true=y_test, y_score=clf.predict_proba(X_test)[:,1])
roc_auc = auc(fpr, tpr) # scalar value of the area under the ROC curve
plt.figure(figsize=(10, 10))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc="lower right")
plt.title("ROC curve")
plt.show()
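The same area can be obtained in a single call with roc_auc_score, without building the curve first:

from sklearn.metrics import roc_auc_score

# AUC computed directly from the predicted probabilities
roc_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])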
Precision / Recall / F1-score
from sklearn.metrics import classification_report
predictions = clf.predict(X_test)
# check precision, recall, F1-score, etc.
print(classification_report(y_test, predictions))
print("Accuracy on Training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on Test set: {:.3f}".format(clf.score(X_test, y_test)))
++) Additional practice
RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
# prepare the data (train, feature, and label come from the practice dataset)
data = train[feature]
target = train[label]
# create the model
clf = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=0)
# define the folds (k_fold was undefined in the snippet; 10 splits assumed here)
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
# check model performance: mean accuracy across the folds
cross_val_score(clf, data, target, cv=k_fold, scoring='accuracy').mean()