[K-DIGITAL] Machine Learning Algorithms (3): Decision Trees - sklearn Practice


by ranlan 2021. 6. 30. 09:42


멋쟁이사자처럼 X K-DIGITAL Training - 06.29

 

 

[Reference] 2021.06.30 - [python/k-digital] - [K-DIGITAL] Machine Learning Algorithms (3): Decision Trees

 



 

 

Gradient Boosting Regression

Libraries

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from sklearn import ensemble
from sklearn import datasets
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error

- sklearn.ensemble  ensemble models for machine learning

- sklearn.datasets  built-in datasets shipped with scikit-learn

- sklearn.utils.shuffle  utility that shuffles data

- sklearn.metrics.mean_squared_error  computes the MSE (quick check below)
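
A quick sanity check with made-up numbers: mean_squared_error is simply the mean of the squared differences between true and predicted values.

from sklearn.metrics import mean_squared_error

y_true = [3.0, 2.0, 1.0]
y_pred = [2.5, 2.0, 2.0]

# (0.5^2 + 0^2 + 1^2) / 3 = 1.25 / 3
print(mean_squared_error(y_true, y_pred))  # 0.4166...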

 

Built-in dataset - Boston housing dataset (note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this example requires an older version)

boston = datasets.load_boston()

train data 90% / test data 10%

X, y = shuffle(boston.data, boston.target, random_state=13)
X = X.astype(np.float32)

offset = int(X.shape[0] * 0.9) # first 90% of the data is used as train data

X_train, y_train = X[:offset], y[:offset]
X_test, y_test = X[offset:], y[offset:]

- sklearn.utils.shuffle() : shuffles the given arrays in unison, so X and y stay paired (small demo below)

- ndarray.astype() : casts an array to the given dtype
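
A tiny illustration with toy arrays of how shuffle keeps features and targets aligned, and what astype does:

import numpy as np
from sklearn.utils import shuffle

a = np.array([[1], [2], [3]])
b = np.array([10, 20, 30])

# both arrays are permuted with the same random order, so rows stay paired
a_s, b_s = shuffle(a, b, random_state=13)

# astype casts to a new dtype without changing the values
print(a.astype(np.float32).dtype)  # float32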

 

Training a GradientBoostingRegressor model

params = {'n_estimators': 500, 'max_depth': 4, 'min_samples_split': 2, 'learning_rate': 0.01, 'loss': 'ls'}
# note: the 'ls' loss was renamed to 'squared_error' in scikit-learn 1.0 and 'ls' was removed in 1.2

# create the model
clf = ensemble.GradientBoostingRegressor(**params)
# ensemble.GradientBoostingRegressor(n_estimators=500, max_depth=4, min_samples_split=2, learning_rate=0.01, loss='ls')

# fit
clf.fit(X_train, y_train)

MSE

mse = mean_squared_error(y_test, clf.predict(X_test))
print('MSE : %.4f' % mse)

>> MSE : 6.4166
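
For reference, the same number can be reproduced directly from the definition of MSE, the mean of the squared residuals:

mse_manual = np.mean((y_test - clf.predict(X_test)) ** 2)
print('MSE : %.4f' % mse_manual)  # same value as mean_squared_error above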

 

Plot training deviance - deviance (error) as the number of decision trees grows

# initialize test_score with zeros
test_score = np.zeros((params['n_estimators'],), dtype=np.float64) 

# deviance between predicted and actual values at each boosting stage
for i, y_pred in enumerate(clf.staged_predict(X_test)):
    test_score[i] = clf.loss_(y_test, y_pred)
    # note: clf.loss_ was removed in newer scikit-learn;
    # mean_squared_error(y_test, y_pred) is a drop-in replacement for squared-error loss

plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title('Deviance')
plt.plot(np.arange(params['n_estimators']) + 1, clf.train_score_, 'b-', label='Training Set Deviance')
plt.plot(np.arange(params['n_estimators']) + 1, test_score, 'r-', label='Test Set Deviance')
plt.legend(loc='upper right')
plt.xlabel('Boosting Iterations')
plt.ylabel('Deviance')

- np.zeros() : returns an ndarray of the given shape initialized with zeros

- staged_predict() : returns an iterator over the ensemble's predictions after each stage (tree) of training

>> Deviance shrinks as the number of decision trees grows (one way to use this is sketched below)
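
Since test_score holds the test deviance at every stage, a small sketch (not part of the original exercise) is to read off the iteration where test deviance bottoms out:

# index of the minimum test deviance (+1 because stages are 1-based in the plot)
best_iter = int(np.argmin(test_score)) + 1
print('best number of trees: %d (test deviance %.4f)' % (best_iter, test_score.min()))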

 

Feature Importance - importance of each feature

feature_importance = clf.feature_importances_ 

# make importances relative to max importance (percentage of the largest)
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance) # ascending sort, so the most important feature is plotted at the top
pos = np.arange(sorted_idx.shape[0]) + .5

plt.subplot(1, 2, 2)
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, boston.feature_names[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()

- feature_importances_ : importance value of each feature (column)

>> Model Explainability & Interpretability - the model's behavior can be explained and interpreted
     Feature selection - the important features can be picked out (a small sketch follows below)
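
A sketch of importance-based feature selection, reusing sorted_idx from above (keeping the top 5 is an arbitrary choice for illustration):

# the last 5 entries of sorted_idx (ascending order) are the most important features
top5_idx = sorted_idx[-5:]
print(boston.feature_names[top5_idx])

# reduced feature matrix keeping only those columns
X_train_top5 = X_train[:, top5_idx]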

 

 

 

Gradient Boosting Classification

Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import datasets, ensemble
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

 

Built-in dataset - Breast Cancer dataset

cancer = datasets.load_breast_cancer()
X, y = cancer.data, cancer.target

train data 90% / test data 10%

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=13)

* test_size = 0.25 (default)
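
One optional refinement not used in the original code: for classification tasks, passing stratify=y keeps the class ratio the same in both splits.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=13)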

 

Training a GradientBoostingClassifier model

params = {'n_estimators': 1000,
          'max_depth': 4,
          'min_samples_split': 5,
          'learning_rate': 0.01}

clf = ensemble.GradientBoostingClassifier(**params)
clf.fit(X_train, y_train)

Accuracy

acc = accuracy_score(y_test, clf.predict(X_test))
print('Accuracy : %.4f' % acc)

>> 0.9123

# initialize train_score with zeros
train_score = np.zeros((params['n_estimators'],), dtype=np.float64)

for i, y_pred in enumerate(clf.staged_predict(X_train)):
    train_score[i] = accuracy_score(y_train, y_pred)

# initialize test_score with zeros
test_score = np.zeros((params['n_estimators'],), dtype=np.float64)

for i, y_pred in enumerate(clf.staged_predict(X_test)):
    test_score[i] = accuracy_score(y_test, y_pred)
    
fig = plt.figure(figsize=(12, 6))
plt.subplot(1, 1, 1)
plt.title('Accuracy')
plt.plot(np.arange(params['n_estimators']) + 1, train_score, 'b-', label='Training Set Accuracy')
plt.plot(np.arange(params['n_estimators']) + 1, test_score, 'r-', label='Test Set Accuracy')
plt.legend(loc='upper right')
plt.xlabel('Boosting Iterations')
plt.ylabel('Accuracy')
fig.tight_layout()
plt.show()

- staged_predict() : returns an iterator over the ensemble's predictions after each stage (tree) of training

>> Accuracy rises as more trees are added, then levels off past a certain point (early stopping, sketched below, takes advantage of this)
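
Because accuracy plateaus, training can be cut short. A sketch using scikit-learn's built-in early stopping (validation_fraction and n_iter_no_change are real GradientBoostingClassifier parameters; the values chosen here are arbitrary):

clf_es = ensemble.GradientBoostingClassifier(
    n_estimators=1000, max_depth=4, min_samples_split=5, learning_rate=0.01,
    validation_fraction=0.1,   # hold out 10% of the training data internally
    n_iter_no_change=20,       # stop if the validation score stalls for 20 rounds
    random_state=13)
clf_es.fit(X_train, y_train)
print('trees actually fit:', clf_es.n_estimators_)  # typically far fewer than 1000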

 

Feature Importance - importance of each feature

feature_importance = clf.feature_importances_
sorted_idx = np.argsort(feature_importance) # ascending sort, so the most important feature is plotted at the top
pos = np.arange(sorted_idx.shape[0]) + .5

fig = plt.figure(figsize=(12, 6))

plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, np.array(cancer.feature_names)[sorted_idx])
plt.title('Feature Importance (MDI)')

fig.tight_layout()
plt.show()

- feature_importances_ : importance value of each feature (column); a model-agnostic alternative is sketched below
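
Impurity-based (MDI) importances can be biased toward high-cardinality features. scikit-learn also provides permutation importance as an alternative; a sketch computed on the test set:

from sklearn.inspection import permutation_importance

result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=13)
for idx in result.importances_mean.argsort()[::-1][:5]:
    print('%s: %.4f' % (cancer.feature_names[idx], result.importances_mean[idx]))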

 

AUC

from sklearn.metrics import roc_curve, auc

# ROC curve 
fpr, tpr, _ = roc_curve(y_true=y_test, y_score=clf.predict_proba(X_test)[:,1]) 
roc_auc = auc(fpr, tpr) # area under the ROC curve (a single number)

plt.figure(figsize=(10, 10))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc="lower right")
plt.title("ROC curve")
plt.show()
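
The same area can be computed in a single call with roc_auc_score, skipping the explicit roc_curve step:

from sklearn.metrics import roc_auc_score

print('AUC : %.4f' % roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))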

 

Precision / Recall / F1-score

from sklearn.metrics import classification_report

predictions = clf.predict(X_test)

# print precision, recall, F1-score, etc.
print(classification_report(y_test, predictions)) 

print("Accuracy on Training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on Test set: {:.3f}".format(clf.score(X_test, y_test)))

 

 

++) Additional practice

 

RandomForestClassifier

from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

# prepare the data (train, feature, and label come from the exercise's DataFrame)
data = train[feature]
target = train[label]

# create the model
clf = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=0)

# check model performance with k-fold cross-validation (k_fold must be a KFold instance)
cross_val_score(clf, data, target, cv=k_fold, scoring='accuracy').mean()
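
The snippet above depends on a DataFrame named train and on feature, label, and k_fold being defined elsewhere in the exercise. A self-contained variant on the breast cancer data loaded earlier (the 5-split KFold here is an assumption):

from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

k_fold = KFold(n_splits=5, shuffle=True, random_state=0)
clf = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=0)
print(cross_val_score(clf, cancer.data, cancer.target, cv=k_fold, scoring='accuracy').mean())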