Scikit-learn Pipeline - 배움 에이아이

scikit-learn's Pipeline bundles the preprocessing steps and model training into a single object, which helps prevent data leakage and keeps the code tidy.

Learning objectives

Connect preprocessing and a model with Pipeline.
Apply different preprocessing to numeric and categorical variables with ColumnTransformer.
Combine a pipeline with cross-validation and GridSearchCV.

Hands-on: building pipelines

Basic Pipeline

from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# X_train, X_test, y_train, y_test are assumed to come from an earlier train/test split.

# Option 1: Pipeline (explicit step names)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000))
])

# Option 2: make_pipeline (step names generated automatically)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# fit: internally runs scaler.fit_transform, then classifier.fit
pipe.fit(X_train, y_train)

# predict: internally runs scaler.transform, then classifier.predict
y_pred = pipe.predict(X_test)
score = pipe.score(X_test, y_test)
print(f"Pipeline accuracy: {score:.4f}")
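The snippet above assumes that X_train, X_test, y_train, and y_test already exist. The following is a minimal end-to-end sketch that fills in that gap. The bundled breast cancer dataset and the split settings are assumptions chosen for illustration, not part of the original walkthrough.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: a bundled toy dataset stands in for your own data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name.
step_names = list(pipe.named_steps)
print(step_names)

# The fitted scaler holds one mean per feature, learned from X_train only.
scaler = pipe.named_steps["standardscaler"]
print(scaler.mean_.shape)

score = pipe.score(X_test, y_test)
print(f"Pipeline accuracy: {score:.4f}")
```

Looking up a step through named_steps is a convenient way to inspect what each stage learned during fit.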

ColumnTransformer: per-column preprocessing

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Separate numeric and categorical columns
numeric_features = ["age", "income", "score"]
categorical_features = ["city", "gender"]

# Numeric preprocessing: impute missing values, then scale
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

# Categorical preprocessing: impute missing values, then one-hot encode
categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

# Combine both with ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

# Full pipeline: preprocessing + model
full_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000))
])

# X_train and X_test are assumed to be DataFrames containing the columns listed above.
full_pipeline.fit(X_train, y_train)
print(f"Full pipeline accuracy: {full_pipeline.score(X_test, y_test):.4f}")

Pipeline + GridSearchCV

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Parameters inside a pipeline are addressed as "stepname__parameter";
# nested steps chain the names, e.g. preprocessor -> num -> imputer -> strategy.
param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "classifier__n_estimators": [100, 200],
    "classifier__max_depth": [5, 10, None],
}

# Swap the model for a random forest (full_pipeline comes from the previous section)
full_pipeline.set_params(classifier=RandomForestClassifier(random_state=42))

grid = GridSearchCV(full_pipeline, param_grid, cv=5,
                    scoring="accuracy", n_jobs=-1)
grid.fit(X_train, y_train)

print(f"Best parameters: {grid.best_params_}")
print(f"Best score: {grid.best_score_:.4f}")
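The learning objectives also mention plain cross-validation, which works the same way: the whole pipeline is passed as the estimator. The sketch below assumes synthetic data from make_classification and a small, arbitrary grid over C; both are illustrative choices, not values from the original.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic data stands in for a real training set.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# In each fold the scaler is refit on that fold's training portion only.
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")

# get_params() lists every key that a param_grid is allowed to reference.
tunable = sorted(k for k in pipe.get_params() if "__" in k)
print(tunable[:5])

grid = GridSearchCV(pipe, {"classifier__C": [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```

Checking get_params() before writing a grid is a quick way to catch misspelled "stepname__parameter" keys.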

Pipeline for imbalanced data

from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

# The Pipeline from imbalanced-learn can include sampling steps
# (preprocessor is the ColumnTransformer defined earlier)
imb_pipeline = ImbPipeline([
    ("preprocessor", preprocessor),
    ("smote", SMOTE(random_state=42)),
    ("classifier", RandomForestClassifier(random_state=42))
])

# SMOTE is applied only during fit and is skipped during predict
imb_pipeline.fit(X_train, y_train)
print(f"Imbalanced-data pipeline accuracy: {imb_pipeline.score(X_test, y_test):.4f}")

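Q: Why should I use a pipeline?

1) Data leakage prevention: the fit/transform order is managed automatically.
2) Concise code: preprocessing and the model live in one object.
3) Reproducibility: the same pipeline can be applied to new data.
4) GridSearchCV integration: preprocessing parameters can be searched together with model parameters.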

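Q: Can I add a custom transformer to a pipeline?

Yes. Inherit from BaseEstimator and TransformerMixin and implement the fit and transform methods, and the class can be used as a pipeline step. For simple stateless transformations, FunctionTransformer is an easier option.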
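Both approaches from the answer above can be sketched as follows. QuantileClipper is a hypothetical class written for this illustration (it is not part of scikit-learn), and the exponential toy data is likewise an assumption.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


# Hypothetical transformer: clips each column to quantiles learned in fit.
# The mixin is listed before BaseEstimator, the order scikit-learn recommends.
class QuantileClipper(TransformerMixin, BaseEstimator):
    def __init__(self, lower=0.01, upper=0.99):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learned attributes end with an underscore by convention.
        self.low_ = np.quantile(X, self.lower, axis=0)
        self.high_ = np.quantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.clip(X, self.low_, self.high_)


# Assumption: non-negative toy data so that log1p is well defined.
rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=(200, 3))

pipe = make_pipeline(
    QuantileClipper(lower=0.05, upper=0.95),
    FunctionTransformer(np.log1p),  # stateless step, no class needed
    StandardScaler(),
)
Xt = pipe.fit_transform(X)
print(Xt.shape)

# New data is clipped with the bounds learned during fit.
clipper = pipe.named_steps["quantileclipper"]
clipped = clipper.transform(np.array([[1000.0, 0.0, 1.0]]))
print(clipped)
```

A class is worth the extra code when the step has to learn something in fit, as the clipping bounds do here. FunctionTransformer is enough when the step is a pure function of its input.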

Checklist

I can connect preprocessing and a model with Pipeline
I can set up per-column preprocessing with ColumnTransformer
I can combine a pipeline with GridSearchCV
I can build a pipeline for imbalanced data

Next documents

Experiment tracking (MLflow)

Record pipeline experiment results systematically.

Model saving and deployment

Save trained pipelines and reuse them.

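When to use it

Check the target metric and the characteristics of the data for the problem at hand before applying a pipeline. Establish a baseline quickly on a small experimental set, then move to a more complex model only if needed.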

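Practical checklist

Checked for possible data leakage first.
Fixed the criteria for the train/validation/test split.
Stated the key metric explicitly (e.g., F1, RMSE, AUC).
Recorded both the improvement over the baseline and the change in cost.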

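Common mistakes

Preprocessing before splitting the data, which causes data leakage.
Choosing a model from a single metric, which makes production performance unstable.
Tuning hyperparameters excessively, which overfits the validation set.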
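The first mistake can be made concrete by comparing a scaler fitted before the split with one fitted inside a pipeline after the split. This is a minimal sketch on synthetic data; the data-generating choices and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic data, invented for this demonstration.
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 4))
y = (X[:, 0] + rng.normal(size=200) > 5.0).astype(int)

# Mistake: fitting the scaler before the split lets it see the test rows.
leaky_scaler = StandardScaler().fit(X)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Safer: split first, then let the pipeline fit the scaler on X_train only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
safe_scaler = pipe.named_steps["standardscaler"]

# The leaky scaler used all 200 rows; the pipeline's scaler used 150.
print(leaky_scaler.n_samples_seen_, safe_scaler.n_samples_seen_)
print(leaky_scaler.mean_)
print(safe_scaler.mean_)
```

On this toy data the two sets of statistics differ only slightly, but the leaky version has still used information from rows that are later treated as unseen.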
