Learning Curves and Model Diagnostics - 배움 에이아이

A learning curve is a tool for visually diagnosing whether a model is overfitting or underfitting. SHAP and feature importance are used to interpret a model's predictions.

Learning Objectives

Plot and interpret a learning curve.
Use a validation curve to analyze the effect of a hyperparameter.
Use SHAP values to interpret a model's predictions.
Interpret feature importance correctly.

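Why It Matters

Performance metrics alone do not tell you why a model performs poorly or how to improve it. A learning curve gives you evidence for deciding which direction is more effective: adding data or adjusting model complexity.

Core Concepts

Learning Curve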

from sklearn.model_selection import learning_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import numpy as np
import matplotlib.pyplot as plt

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)

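# Compute the learning curve.
# shuffle=True matters here: the iris rows are ordered by class, so without
# shuffling the smallest training subsets would not contain all three classes.
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10),
    scoring="accuracy", n_jobs=-1,
    shuffle=True, random_state=42
)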

# Plot the mean score with a +/- 1 standard deviation band across folds
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
train_std = train_scores.std(axis=1)
val_std = val_scores.std(axis=1)

plt.fill_between(train_sizes, train_mean - train_std,
                 train_mean + train_std, alpha=0.1, color="blue")
plt.fill_between(train_sizes, val_mean - val_std,
                 val_mean + val_std, alpha=0.1, color="orange")
plt.plot(train_sizes, train_mean, "o-", label="Training score", color="blue")
plt.plot(train_sizes, val_mean, "o-", label="Validation score", color="orange")
plt.xlabel("Training set size")
plt.ylabel("Accuracy")
plt.title("Learning Curve")
plt.legend()
plt.show()

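Interpreting the learning curve:

Pattern	Diagnosis	What to try
Training and validation scores both low	Underfitting (high bias)	Increase model complexity, add features
Training score high, validation score low (large gap)	Overfitting (high variance)	Add data, regularize, simplify the model
Both curves converge at a high score	Well-fitted model	Keep the current setup
Validation curve still rising as data grows	Not enough data	Adding data is likely to help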
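The table above can be turned into a rough rule-based check. The sketch below is only an illustration: the function name `diagnose_learning_curve` and all three thresholds (`low_score`, `big_gap`, `rising`) are assumptions chosen for this example, not standard values, and they should be adjusted to the metric and task at hand.

```python
import numpy as np

def diagnose_learning_curve(train_mean, val_mean,
                            low_score=0.8, big_gap=0.1, rising=0.01):
    """Map the last points of a learning curve to a row of the table above.

    All three thresholds are illustrative assumptions, not standard values.
    """
    train_mean = np.asarray(train_mean, dtype=float)
    val_mean = np.asarray(val_mean, dtype=float)
    final_train, final_val = train_mean[-1], val_mean[-1]
    gap = final_train - final_val
    # Is the validation score still improving between the last two sizes?
    still_rising = len(val_mean) >= 2 and (val_mean[-1] - val_mean[-2]) > rising

    if final_train < low_score and final_val < low_score:
        return "underfitting"
    if gap > big_gap:
        return "overfitting"
    if still_rising:
        return "more data may help"
    return "good fit"

print(diagnose_learning_curve([1.0, 1.0, 0.99], [0.70, 0.72, 0.73]))
```

The function takes the per-size mean scores, such as `train_scores.mean(axis=1)` and `val_scores.mean(axis=1)` from `learning_curve`. A hard-coded rule like this is no substitute for looking at the plot, but it makes the decision logic of the table explicit.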

Validation Curve

from sklearn.model_selection import validation_curve

# max_depth values to evaluate (defined once and reused for the plot)
param_range = [1, 3, 5, 10, 20]

train_scores, val_scores = validation_curve(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X, y, param_name="max_depth",
    param_range=param_range,
    cv=5, scoring="accuracy", n_jobs=-1
)

# Mean score across the CV folds for each max_depth value
plt.plot(param_range, train_scores.mean(axis=1), "o-", label="Training")
plt.plot(param_range, val_scores.mean(axis=1), "o-", label="Validation")
plt.xlabel("max_depth")
plt.ylabel("Accuracy")
plt.title("Validation Curve: Effect of max_depth")
plt.legend()
plt.show()
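Besides reading the plot, the validation scores can be summarized numerically. The following is a minimal sketch that reruns the same `validation_curve` call and picks the `max_depth` with the highest mean validation accuracy. The names `depths`, `best_idx`, and `best_depth` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

X, y = load_iris(return_X_y=True)
depths = [1, 3, 5, 10, 20]

train_scores, val_scores = validation_curve(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X, y, param_name="max_depth", param_range=depths,
    cv=5, scoring="accuracy",
)

# One row per max_depth value, one column per CV fold
val_mean = val_scores.mean(axis=1)
val_std = val_scores.std(axis=1)

# np.argmax returns the first maximum, so ties go to the smaller depth
best_idx = int(np.argmax(val_mean))
best_depth = depths[best_idx]

for depth, mean, std in zip(depths, val_mean, val_std):
    print(f"max_depth={depth:>2}: validation accuracy {mean:.3f} +/- {std:.3f}")
print("selected max_depth:", best_depth)
```

Because the depths are listed in ascending order, a tie is resolved in favor of the simpler model. When the differences between means are smaller than the fold-to-fold standard deviation, the choice is not strongly supported by the data.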

Feature Importance

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Impurity-based feature importance from the fitted trees
importance = pd.DataFrame({
    "feature": load_iris().feature_names,
    "importance": model.feature_importances_
}).sort_values("importance", ascending=True)

importance.plot(x="feature", y="importance", kind="barh")
plt.title("Feature Importance")
plt.xlabel("Importance")
plt.tight_layout()
plt.show()
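When reading the bar chart, it helps to remember that a random forest's `feature_importances_` are normalized to sum to 1 and are an average over the individual trees. The sketch below, using the same iris setup, prints each importance together with its spread across trees. The variable names `forest`, `per_tree`, and `spread` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(data.data, data.target)

importances = forest.feature_importances_

# Importance reported by each individual tree: shape (n_trees, n_features)
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
spread = per_tree.std(axis=0)

# Print features from most to least important
rows = sorted(zip(data.feature_names, importances, spread),
              key=lambda row: row[1], reverse=True)
for name, mean, std in rows:
    print(f"{name:<20} {mean:.3f} +/- {std:.3f}")
print("sum of importances:", round(float(importances.sum()), 6))
```

A large spread relative to the mean suggests that the ranking between two features is not stable, so small differences in bar length should not be over-interpreted.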

SHAP (SHapley Additive exPlanations)

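import shap

# Build a SHAP explainer for the fitted tree ensemble
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# For a multiclass model, the return type depends on the installed shap version:
# either a list with one (n_samples, n_features) array per class, or a single
# (n_samples, n_features, n_classes) array. Normalize to the list form.
if not isinstance(shap_values, list):
    shap_values = [shap_values[:, :, k] for k in range(shap_values.shape[2])]

# Overall feature importance (summary plot)
shap.summary_plot(shap_values, X, feature_names=load_iris().feature_names)

# Explanation of a single prediction (force plot):
# shows how each feature pushes the first sample's output for class 0
# away from the base value. In a notebook, call shap.initjs() first.
shap.force_plot(
    explainer.expected_value[0],
    shap_values[0][0],
    X[0],
    feature_names=load_iris().feature_names
)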

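Uses in AI/ML

Tool	Purpose	Related reference
Learning curve	Diagnosing overfitting/underfitting	Overfitting
Validation curve	Analyzing the effect of a hyperparameter	Tuning
Feature importance	Identifying the most influential features	Random Forest
SHAP	Explaining predictions, model transparency	All models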

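Q: What is the difference between SHAP and feature importance?

Tree-based feature importance shows only each feature's average contribution across the whole model. SHAP explains how each feature contributed to an individual prediction, and it distinguishes the direction of the effect (positive or negative).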
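To make the per-prediction, signed nature of SHAP values concrete, here is a brute-force sketch of exact Shapley values for a tiny hand-written function. This is not how `shap.TreeExplainer` computes its values (it uses a tree-specific algorithm). The toy model, the all-zero baseline, and the helper name `shapley_values` are assumptions made for this illustration.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over every ordering in which features are switched from baseline to x.

    Costs n! passes over the features, so it is only usable for tiny n.
    """
    n = len(x)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]              # reveal feature i
            updated = predict(current)
            contributions[i] += updated - previous
            previous = updated
    return [c / len(orderings) for c in contributions]

# Hypothetical toy model with an interaction term (assumption for this sketch)
def toy_model(features):
    a, b, c = features
    return 2.0 * a - 3.0 * b + a * c

x = [2.0, 1.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)

print("per-feature contributions:", phi)
print("prediction - baseline:", toy_model(x) - toy_model(baseline))
```

The second feature receives a negative contribution because it lowers this particular prediction, and the contributions add up to the difference between the prediction and the baseline output. A global importance score has no sign and is not tied to a single sample, which is the distinction described in the answer above.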

Q: What should I do if the two lines in the learning curve do not converge?

Adding data may bring the two curves together. If you cannot add data, reduce model complexity or strengthen regularization.
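The effect of reducing model complexity on the gap can be checked directly. The sketch below uses a synthetic noisy dataset and compares an unconstrained decision tree with one limited to `max_depth=3`. The dataset parameters, the depth limit, and the helper name `final_scores` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: flip_y=0.2 assigns random labels to about 20% of the samples
X_noisy, y_noisy = make_classification(
    n_samples=600, n_features=20, n_informative=5,
    flip_y=0.2, random_state=0,
)

def final_scores(estimator):
    """Return mean training and validation accuracy at the largest training size."""
    _, train_scores, val_scores = learning_curve(
        estimator, X_noisy, y_noisy, cv=5,
        train_sizes=np.linspace(0.2, 1.0, 5),
        scoring="accuracy", shuffle=True, random_state=0,
    )
    return train_scores.mean(axis=1)[-1], val_scores.mean(axis=1)[-1]

deep_train, deep_val = final_scores(DecisionTreeClassifier(random_state=0))
shallow_train, shallow_val = final_scores(
    DecisionTreeClassifier(max_depth=3, random_state=0)
)

deep_gap = deep_train - deep_val
shallow_gap = shallow_train - shallow_val
print(f"unconstrained tree: train={deep_train:.3f} val={deep_val:.3f} gap={deep_gap:.3f}")
print(f"max_depth=3 tree:   train={shallow_train:.3f} val={shallow_val:.3f} gap={shallow_gap:.3f}")
```

A smaller gap only shows that the model memorizes less. Whether the validation score itself improves depends on the data, so both numbers should be compared before settling on the simpler model.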

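Checklist

I can interpret the patterns in a learning curve
I can use a validation curve to analyze the effect of a hyperparameter
I can visualize tree-based feature importance
I can explain what SHAP values mean

Next Documents

Machine Learning Pipelines

Automate the full machine learning workflow with a pipeline.

Hands-On Project

Apply what you have learned in a capstone project.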
