멋쟁이사자처럼 X K-DIGITAL Training - 06.29
[Reference] 2021.06.30 - [python/k-digital] - [K-DIGITAL] Machine Learning Algorithms (3): Decision Trees
Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import ensemble
from sklearn import datasets
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error
- sklearn.ensemble: ensemble machine learning models
- sklearn.datasets: built-in datasets shipped with scikit-learn
- sklearn.utils.shuffle: method for shuffling data
- sklearn.metrics.mean_squared_error: computes the MSE
Built-in dataset - Boston housing dataset
boston = datasets.load_boston()
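Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2 (over ethical concerns about the dataset). On newer versions, a minimal workaround suggested by the library's own deprecation message is to load the raw data from the original source:

# Workaround for scikit-learn >= 1.2, where load_boston is no longer available
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])  # 13 feature columns
target = raw_df.values[1::2, 2]  # house prices (MEDV)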
train data 90% / test data 10%
X, y = shuffle(boston.data, boston.target, random_state=13)
X = X.astype(np.float32)
offset = int(X.shape[0] * 0.9) # use the first 90% of the data as train data
X_train, y_train = X[:offset], y[:offset]
X_test, y_test = X[offset:], y[offset:]
- sklearn.utils.shuffle(): shuffles the arrays with the same permutation, so X and y stay aligned (see the sketch below)
- ndarray.astype(): casts the array to the given dtype
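A minimal sketch of how sklearn.utils.shuffle keeps the arrays aligned (the same permutation is applied to both X and y):

X_demo = np.array([[1], [2], [3]])
y_demo = np.array([10, 20, 30])
X_s, y_s = shuffle(X_demo, y_demo, random_state=13)
# rows of X_s and entries of y_s are still paired, e.g. row [2] stays with label 20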
Training a GradientBoostingRegressor model
params = {'n_estimators': 500, 'max_depth': 4, 'min_samples_split': 2, 'learning_rate': 0.01, 'loss': 'ls'}  # 'ls' was renamed to 'squared_error' in scikit-learn 1.0
# create the model
clf = ensemble.GradientBoostingRegressor(**params)
# equivalent to: ensemble.GradientBoostingRegressor(n_estimators=500, max_depth=4, min_samples_split=2, learning_rate=0.01, loss='ls')
# train
clf.fit(X_train, y_train)
MSE
mse = mean_squared_error(y_test, clf.predict(X_test))
print('MSE : %.4f' % mse)
>> MSE : 6.4166
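For reference, mean_squared_error is just the mean of the squared residuals; a quick check by hand:

y_true = np.array([3.0, 5.0])
y_hat = np.array([2.0, 7.0])
np.mean((y_true - y_hat) ** 2)  # ((-1)**2 + 2**2) / 2 = 2.5, same as mean_squared_error(y_true, y_hat)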
Plot training deviance - deviance (error) by number of decision trees
# initialize test_score
test_score = np.zeros((params['n_estimators'],), dtype=np.float64)
# error between predictions and true values at each boosting stage
for i, y_pred in enumerate(clf.staged_predict(X_test)):
    test_score[i] = clf.loss_(y_test, y_pred)  # clf.loss_ is only available in older scikit-learn versions
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title('Deviance')
plt.plot(np.arange(params['n_estimators']) + 1, clf.train_score_, 'b-', label='Training Set Deviance')
plt.plot(np.arange(params['n_estimators']) + 1, test_score, 'r-', label='Test Set Deviance')
plt.legend(loc='upper right')
plt.xlabel('Boosting Iterations')
plt.ylabel('Deviance')
- np.zeros(): returns an ndarray of the given shape initialized to 0
- staged_predict(): returns an iterator over the ensemble's predictions at each stage (tree) of training
>> As the number of decision trees increases, the deviance (error) decreases
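Since test_score holds the test deviance at every boosting stage, the stage with the lowest test error can be read off directly; a minimal sketch on top of the code above:

best_n = int(np.argmin(test_score)) + 1  # stage (number of trees) with the lowest test deviance
print('best number of trees:', best_n)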
Feature Importance - importance of each feature
feature_importance = clf.feature_importances_
# make importances relative to the max importance
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance) # indices in ascending order of importance
pos = np.arange(sorted_idx.shape[0]) + .5
plt.subplot(1, 2, 2)
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, boston.feature_names[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()
- feature_importances_: importance value of each feature (column)
>> Enables model explainability & interpretability
Feature selection: only the important columns can be kept, as sketched below
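As an example of such feature selection, scikit-learn's SelectFromModel can keep only the columns whose importance clears a threshold; a minimal sketch using the already-trained model (the median threshold is an arbitrary choice here):

from sklearn.feature_selection import SelectFromModel

# keep the features whose importance is above the median importance
sfm = SelectFromModel(clf, threshold='median', prefit=True)
X_train_selected = sfm.transform(X_train)
print(X_train_selected.shape)  # same rows, fewer columns than X_train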
Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets, ensemble
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
Built-in dataset - Breast Cancer dataset
cancer = datasets.load_breast_cancer()
X, y = cancer.data, cancer.target
train data 90% / test data 10%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=13)
* test_size = 0.25 (default)
Training a GradientBoostingClassifier model
params = {'n_estimators': 1000,
'max_depth': 4,
'min_samples_split': 5,
'learning_rate': 0.01}
clf = ensemble.GradientBoostingClassifier(**params)
clf.fit(X_train, y_train)
Accuracy
acc = accuracy_score(y_test, clf.predict(X_test))
print('%.4f' % acc)
>> 0.9123
# initialize train_score
train_score = np.zeros((params['n_estimators'],), dtype=np.float64)
for i, y_pred in enumerate(clf.staged_predict(X_train)):
    train_score[i] = accuracy_score(y_train, y_pred)
# initialize test_score
test_score = np.zeros((params['n_estimators'],), dtype=np.float64)
for i, y_pred in enumerate(clf.staged_predict(X_test)):
    test_score[i] = accuracy_score(y_test, y_pred)
fig = plt.figure(figsize=(12, 6))
plt.subplot(1, 1, 1)
plt.title('Accuracy')
plt.plot(np.arange(params['n_estimators']) + 1, train_score, 'b-', label='Training Set Accuracy')
plt.plot(np.arange(params['n_estimators']) + 1, test_score, 'r-', label='Test Set Accuracy')
plt.legend(loc='upper right')
plt.xlabel('Boosting Iterations')
plt.ylabel('Accuracy')
fig.tight_layout()
plt.show()
- staged_predict(): returns an iterator over the ensemble's predictions at each stage (tree) of training
>> Accuracy increases as decision trees are added, but levels off beyond a certain point
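Because the accuracy plateaus, training can also be stopped automatically; GradientBoostingClassifier supports early stopping via validation_fraction and n_iter_no_change. A minimal sketch (the stopping parameters below are illustrative values, not from the original run):

clf_es = ensemble.GradientBoostingClassifier(
    n_estimators=1000, max_depth=4, min_samples_split=5, learning_rate=0.01,
    validation_fraction=0.1,  # hold out 10% of the training data internally
    n_iter_no_change=10,      # stop when the validation score stops improving
    random_state=13)
clf_es.fit(X_train, y_train)
print(clf_es.n_estimators_)   # number of trees actually fit, usually < 1000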
Feature Importance - importance of each feature
feature_importance = clf.feature_importances_
sorted_idx = np.argsort(feature_importance) # indices in ascending order of importance
pos = np.arange(sorted_idx.shape[0]) + .5
fig = plt.figure(figsize=(12, 6))
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, np.array(cancer.feature_names)[sorted_idx])
plt.title('Feature Importance (MDI)')
fig.tight_layout()
plt.show()
- feature_importances_: importance value of each feature (column)
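Since MDI importances (feature_importances_) are computed from the training data and can overstate high-cardinality features, permutation importance on the test set is a common cross-check; a minimal sketch:

from sklearn.inspection import permutation_importance

result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=13)
perm_sorted_idx = result.importances_mean.argsort()
print(np.array(cancer.feature_names)[perm_sorted_idx][-5:])  # five most influential features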
AUC
from sklearn.metrics import roc_curve, auc
# ROC curve
fpr, tpr, _ = roc_curve(y_true=y_test, y_score=clf.predict_proba(X_test)[:,1])
roc_auc = auc(fpr, tpr) # area under the ROC curve (a single number)
plt.figure(figsize=(10, 10))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc="lower right")
plt.title("ROC curve")
plt.show()
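The same AUC value can also be computed in a single call with roc_auc_score:

from sklearn.metrics import roc_auc_score
roc_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])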
Precision / Recall / F1-score
from sklearn.metrics import classification_report
predictions = clf.predict(X_test)
# check precision, recall, F1-score, etc.
print(classification_report(y_test, predictions))
print("Accuracy on Training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on Test set: {:.3f}".format(clf.score(X_test, y_test)))
++) Additional practice
RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
# prepare the data (train, feature, and label come from the practice dataset used in class)
data = train[feature]
target = train[label]
# create the model
clf = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=0)
# check model performance with k-fold cross-validation
# k_fold was not defined in the original snippet; a typical setup:
k_fold = KFold(n_splits=5, shuffle=True, random_state=0)
cross_val_score(clf, data, target, cv=k_fold, scoring='accuracy').mean()