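Tabular Data Classification Project

This project walks through a complete machine learning workflow for binary classification on tabular data. Using the Titanic survival prediction problem, it covers each step from EDA to final model selection.

Project Overview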

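Item	Details
Problem type	Binary classification (survived or not)
Dataset	Titanic (loaded via seaborn)
Key techniques	EDA, missing-value handling, feature engineering, model comparison
Algorithms	Logistic regression, random forest, gradient boosting, XGBoost
Difficulty	Beginner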

Hands-on Walkthrough

Loading and Exploring the Data

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import rc
rc('font', family='AppleGothic')  # Korean-capable font on macOS; only needed when plot labels contain Korean text

# Load the Titanic dataset
df = sns.load_dataset("titanic")
print(f"Data shape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nTarget distribution:\n{df['survived'].value_counts(normalize=True)}")

# Check the distributions of the numeric variables
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, ["age", "fare", "pclass"]):
    df[col].hist(ax=ax, bins=30, edgecolor="black")
    ax.set_title(f"{col} distribution")
plt.tight_layout()
plt.show()

# Survival rate by each categorical variable
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, ["sex", "pclass", "embarked"]):
    df.groupby(col)["survived"].mean().plot(kind="bar", ax=ax)
    ax.set_title(f"Survival rate by {col}")
    ax.set_ylabel("Survival rate")
plt.tight_layout()
plt.show()
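
The bar charts above look at one variable at a time. As a small complement, the sketch below crosses two variables in a single table. The `toy` frame is hypothetical data that only borrows the Titanic column names, so the numbers are illustrative and not results from the real dataset.

```python
import pandas as pd

# Hypothetical toy rows; only the column names match the Titanic frame
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "female", "male"],
    "pclass": [3, 1, 3, 1, 1, 3],
    "survived": [0, 1, 0, 1, 1, 0],
})

# The mean of a 0/1 target inside each (sex, pclass) cell is that cell's survival rate
rate_table = toy.pivot_table(
    index="sex", columns="pclass", values="survived", aggfunc="mean"
)
print(rate_table)
```

The same `pivot_table` call should work on `df` from the tutorial, since it uses the same column names.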

Data Preprocessing

from sklearn.model_selection import train_test_split

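# Select the columns to use (drop the rest)
features = ["pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]
target = "survived"

df_clean = df[features + [target]].copy()

# Handle missing values
# Assign the result back: fillna(inplace=True) on a selected column raises a
# warning in recent pandas versions and may not update df_clean.
# Note: the median and mode here are computed over all rows, before the split.
df_clean["age"] = df_clean["age"].fillna(df_clean["age"].median())
df_clean["embarked"] = df_clean["embarked"].fillna(df_clean["embarked"].mode()[0])

# Feature engineering: family size and whether the passenger travels alone
df_clean["family_size"] = df_clean["sibsp"] + df_clean["parch"] + 1
df_clean["is_alone"] = (df_clean["family_size"] == 1).astype(int)

# Bin fare into quartiles
df_clean["fare_bin"] = pd.qcut(
    df_clean["fare"], q=4, labels=["low", "mid_low", "mid_high", "high"]
)

# Bin age into groups
df_clean["age_group"] = pd.cut(
    df_clean["age"],
    bins=[0, 12, 18, 35, 60, 100],
    labels=["child", "teen", "young_adult", "middle_aged", "senior"]
)

print(f"Data shape after preprocessing: {df_clean.shape}")
print(f"Missing values: {df_clean.isnull().sum().sum()}")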
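
The preprocessing code above fills `age` with a median computed over every row, before the train/test split. On a dataset this small the effect is probably minor, but a stricter variant learns the fill value from the training rows only. The sketch below shows one way to do that with `SimpleImputer`. The `toy` frame and the `_tr`/`_te` variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy frame; column names mirror the tutorial's
toy = pd.DataFrame({
    "age": [22.0, np.nan, 38.0, 26.0, np.nan, 35.0, 54.0, 2.0, 27.0, 14.0],
    "survived": [0, 1, 1, 1, 0, 0, 0, 1, 1, 0],
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["age"]], toy["survived"],
    test_size=0.3, random_state=0, stratify=toy["survived"],
)

imputer = SimpleImputer(strategy="median")
X_tr_filled = imputer.fit_transform(X_tr)  # median is learned from training rows only
X_te_filled = imputer.transform(X_te)      # the same training median fills test rows

print("Median learned from train:", imputer.statistics_[0])
```

Placing the imputer inside the `Pipeline` achieves the same thing during cross-validation, because each fold refits it on that fold's training portion.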

Building the Pipeline

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Separate features and target
X = df_clean.drop(columns=[target])
y = df_clean[target]

# Train/test split (stratified on the target)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Split columns into numeric and categorical
num_cols = ["age", "fare", "sibsp", "parch", "family_size"]
cat_cols = ["sex", "embarked", "fare_bin", "age_group"]

# Combine the preprocessing steps with ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        # sparse_output requires scikit-learn >= 1.2
        ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), cat_cols),
    ],
    remainder="passthrough"  # pclass and is_alone pass through unchanged
)

Model Comparison

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Models to compare
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
    "XGBoost": XGBClassifier(n_estimators=100, random_state=42, eval_metric="logloss"),
}

# Compare the models with 5-fold cross-validation on the training set
results = {}
for name, model in models.items():
    pipe = Pipeline([
        ("preprocessor", preprocessor),
        ("model", model),
    ])

    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="accuracy")
    results[name] = {
        "mean": scores.mean(),
        "std": scores.std(),
    }
    print(f"{name}: {scores.mean():.4f} (+/- {scores.std():.4f})")

# Visualize the results
names = list(results.keys())
means = [r["mean"] for r in results.values()]
stds = [r["std"] for r in results.values()]

plt.figure(figsize=(10, 5))
plt.barh(names, means, xerr=stds, capsize=5)
plt.xlabel("Accuracy")
plt.title("Cross-validated accuracy by model")
plt.xlim(0.7, 0.9)
plt.tight_layout()
plt.show()
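
`cross_val_score` reports validation scores only. To also see how far the training score sits above the validation score, a sketch using `cross_validate` with `return_train_score=True` is shown below. It runs on synthetic data from `make_classification` (an assumption, so that the block is self-contained), not on the Titanic features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data; not the Titanic features
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Explicit stratified folds with a fixed seed so the comparison is repeatable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

res = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X_demo, y_demo,
    cv=cv, scoring="accuracy", return_train_score=True,
)

train_mean = res["train_score"].mean()
val_mean = res["test_score"].mean()
print(f"train accuracy:      {train_mean:.3f}")
print(f"validation accuracy: {val_mean:.3f}")
print(f"gap:                 {train_mean - val_mean:.3f}")
```

Passing the same `cv` object to every model keeps the folds identical across models, which makes the comparison fairer than letting each call draw its own splits.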

Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV

# Tune the model chosen from the comparison (random forest is used here)
best_pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("model", RandomForestClassifier(random_state=42)),
])

param_grid = {
    "model__n_estimators": [100, 200, 300],
    "model__max_depth": [5, 10, 15, None],
    "model__min_samples_split": [2, 5, 10],
}

grid_search = GridSearchCV(
    best_pipe, param_grid,
    cv=5, scoring="accuracy",
    n_jobs=-1, verbose=1,
)
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validated accuracy: {grid_search.best_score_:.4f}")
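
Grid search cost grows multiplicatively with each added parameter value, so it helps to count the fits before launching a search. The sketch below counts them for the grid used above with `ParameterGrid`. The variable names `n_candidates`, `n_folds`, and `n_fits` are only illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tutorial
param_grid = {
    "model__n_estimators": [100, 200, 300],
    "model__max_depth": [5, 10, 15, None],
    "model__min_samples_split": [2, 5, 10],
}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 4 * 3 combinations
n_folds = 5
n_fits = n_candidates * n_folds                # one fit per combination per fold

print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```

With the default `refit=True`, `GridSearchCV` adds one more fit of the best combination on the whole training set after the search.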

Final Evaluation

from sklearn.metrics import (
    classification_report, confusion_matrix,
    roc_auc_score, roc_curve
)

# Evaluate the tuned model on the test set
y_pred = grid_search.predict(X_test)
y_proba = grid_search.predict_proba(X_test)[:, 1]

# Classification report
print("Classification report:")
print(classification_report(y_test, y_pred, target_names=["Died", "Survived"]))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Died", "Survived"],
            yticklabels=["Died", "Survived"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion matrix")
plt.show()

# ROC curve
fpr, tpr, _ = roc_curve(y_test, y_proba)
auc = roc_auc_score(y_test, y_proba)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f"ROC (AUC = {auc:.3f})")
plt.plot([0, 1], [0, 1], "k--", label="Chance level")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve")
plt.legend()
plt.show()

# Inspect feature importances of the tuned random forest
best_model = grid_search.best_estimator_
feature_names = best_model.named_steps["preprocessor"].get_feature_names_out()
importances = best_model.named_steps["model"].feature_importances_

feat_imp = pd.Series(importances, index=feature_names).sort_values(ascending=True)
feat_imp.tail(15).plot(kind="barh", figsize=(10, 6))
plt.title("Feature importance (top 15)")
plt.xlabel("Importance")
plt.tight_layout()
plt.show()
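
The `feature_importances_` values plotted above are impurity-based and are computed from the training data. A common cross-check, which is not part of the original walkthrough, is permutation importance on held-out data. The sketch below uses synthetic data from `make_classification` as a stand-in, so the feature indices are placeholders and not Titanic columns.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption); not the Titanic features
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and measure the drop in accuracy
perm = permutation_importance(
    forest, X_te, y_te, n_repeats=10, random_state=0, scoring="accuracy"
)

# Print features from most to least important
for idx in perm.importances_mean.argsort()[::-1]:
    print(f"feature_{idx}: {perm.importances_mean[idx]:.3f} "
          f"+/- {perm.importances_std[idx]:.3f}")
```

In the tutorial's setting, the fitted pipeline `grid_search.best_estimator_` could be passed in place of `forest`, with `X_test` and `y_test` as the held-out data. The importances would then refer to the raw input columns and not to the one-hot encoded ones.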

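Q: What should you watch out for when classifying tabular data in practice?

1) Preventing data leakage: fit preprocessing steps (scaling, encoding) on the training data only, and apply transform alone to the test data. 2) Class imbalance: if the target classes are heavily imbalanced, use F1 or PR-AUC instead of accuracy. 3) Checking for overfitting: if the gap between the training score and the validation score is large, reduce model complexity or apply regularization.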
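
A small worked example of the class-imbalance point: with a hypothetical 95/5 label split, a model that always predicts the majority class scores high on accuracy while F1 for the minority class is zero. The labels below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels: 95 negatives, 5 positives
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)  # always predict the majority class

acc = accuracy_score(y_true, y_pred)
# zero_division=0 defines the undefined precision as 0 when no positives are predicted
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"accuracy: {acc:.2f}")
print(f"F1 (positive class): {f1:.2f}")
```

The Titanic target is only mildly imbalanced, so accuracy remains a reasonable headline metric in this project.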

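Q: Where are the documents related to the techniques used in this project?

For EDA, see Exploratory Data Analysis; for preprocessing, see Data Cleaning, Encoding, and Scaling; for model comparison, see Cross-Validation and Classification Metrics; for pipelines, see Scikit-learn Pipeline.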

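Checklist

I can use EDA to understand the characteristics and patterns of a dataset
I can build a preprocessing pipeline and prevent data leakage
I can compare several models fairly with cross-validation
I can tune the hyperparameters of the final model and evaluate it on the test set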

Next Documents

Numeric Prediction Project

Classification Algorithm Reference