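Encoder-Decoder Models

An encoder-decoder model uses both the encoder and the decoder of the Transformer. The encoder builds a representation of the input, and the decoder generates a new sequence conditioned on that representation. This makes the architecture a natural fit for tasks whose input and output differ in form, such as translation, summarization, and question answering.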

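Core Idea

Encoder-only models (BERT) are strong at understanding but poorly suited to generation, while decoder-only models (GPT) are strong at generation but read their input in one direction only. An encoder-decoder combines bidirectional understanding with autoregressive generation.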

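Architecture	Encoder	Decoder	Strengths
BERT (encoder-only)	Bidirectional	None	Understanding, classification, extraction
GPT (decoder-only)	None	Autoregressive	Text generation
T5, BART (encoder-decoder)	Bidirectional	Autoregressive	Understanding + generation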
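
The difference between "bidirectional" and "autoregressive" in the table comes down to which positions each token may attend to. The following is a minimal sketch in plain Python, not taken from any library; the function names are hypothetical and padding is ignored. It builds the three boolean attention masks an encoder-decoder uses, where `True` means "may attend".

```python
def encoder_self_attention_mask(src_len):
    """Bidirectional: every source position sees every source position."""
    return [[True] * src_len for _ in range(src_len)]


def decoder_self_attention_mask(tgt_len):
    """Autoregressive (causal): position i sees only positions 0..i."""
    return [[j <= i for j in range(tgt_len)] for i in range(tgt_len)]


def cross_attention_mask(tgt_len, src_len):
    """Each target position sees the entire encoder output."""
    return [[True] * src_len for _ in range(tgt_len)]


if __name__ == "__main__":
    # Print the causal mask as rows of 1/0 to make the triangle visible.
    for row in decoder_self_attention_mask(4):
        print(" ".join("1" if allowed else "0" for allowed in row))
```

An encoder-only model uses just the first mask and a decoder-only model just the second; an encoder-decoder uses all three, which is how it reads the input bidirectionally while still generating left to right.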

T5: Text-to-Text Transfer Transformer

Core Idea: Everything Is Text

T5, released by Google in 2019, casts every NLP task as a text-to-text problem. Whether the task is classification, translation, or summarization, both the input and the output are text.
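
To make the text-to-text idea concrete, here is a rough sketch of how labeled examples could be turned into (input text, target text) pairs. The function name `to_text_pair` and the dictionary keys are hypothetical, chosen only for illustration; the prefixes and label words follow the conventions described in the T5 paper (class labels become words, and STS-B similarity scores are rounded to the nearest 0.2 and written as strings).

```python
def to_text_pair(task, example):
    """Convert one labeled example into a (source_text, target_text) pair."""
    if task == "sst2":
        source = f"sst2 sentence: {example['sentence']}"
        # The class label becomes a word the decoder must generate.
        target = "positive" if example["label"] == 1 else "negative"
    elif task == "stsb":
        source = (
            f"stsb sentence1: {example['sentence1']} "
            f"sentence2: {example['sentence2']}"
        )
        # Regression becomes text: round to the nearest 0.2, then format.
        target = f"{round(example['score'] * 5) / 5:.1f}"
    elif task == "translate_en_de":
        source = f"translate English to German: {example['en']}"
        target = example["de"]
    else:
        raise ValueError(f"unknown task: {task}")
    return source, target


if __name__ == "__main__":
    print(to_text_pair("sst2", {"sentence": "A wonderful film.", "label": 1}))
    print(to_text_pair("translate_en_de", {"en": "That is good.", "de": "Das ist gut."}))
```

Because every task reduces to string pairs like these, one model, one loss function, and one decoding procedure cover all of them.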

T5 Architecture Details

Setting	T5-Small	T5-Base	T5-Large	T5-3B	T5-11B
Encoder layers	6	12	24	24	24
Decoder layers	6	12	24	24	24
Hidden size	512	768	1,024	1,024	1,024
Attention heads	8	12	16	32	128
FFN size	2,048	3,072	4,096	16,384	65,536
Parameters	60M	220M	770M	3B	11B
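
The parameter counts in the table can be roughly reproduced from the other rows. The sketch below is a back-of-the-envelope estimate, not an exact count. It relies on values that do not appear in the table and are assumptions here: a per-head dimension (`d_kv`) of 64 for Small/Base/Large and 128 for 3B/11B, a vocabulary of 32,128 entries, embeddings shared between input and output, and no bias terms. Layer norms and relative position biases are ignored because they are comparatively tiny.

```python
def approx_t5_params(layers, d_model, heads, d_ff, d_kv, vocab=32128):
    """Rough parameter estimate for a T5 model with equal encoder/decoder depth."""
    inner = heads * d_kv                  # total width of the attention projections
    attention = 4 * d_model * inner       # Q, K, V and output projections
    ffn = 2 * d_model * d_ff              # two linear layers
    encoder = layers * (attention + ffn)  # self-attention + FFN
    decoder = layers * (2 * attention + ffn)  # self-attention + cross-attention + FFN
    embedding = vocab * d_model           # shared by input and output
    return embedding + encoder + decoder


# (layers, d_model, heads, d_ff, d_kv); d_kv values are assumptions, see above.
CONFIGS = {
    "T5-Small": (6, 512, 8, 2048, 64),
    "T5-Base": (12, 768, 12, 3072, 64),
    "T5-Large": (24, 1024, 16, 4096, 64),
    "T5-3B": (24, 1024, 32, 16384, 128),
    "T5-11B": (24, 1024, 128, 65536, 128),
}

if __name__ == "__main__":
    for name, config in CONFIGS.items():
        print(f"{name}: ~{approx_t5_params(*config) / 1e6:,.0f}M parameters")
```

Under these assumptions the estimates land within about 10% of the rounded figures in the table. They also show that the two largest models grow mainly through wider attention and feed-forward layers rather than through more layers or a larger hidden size.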

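T5 Pre-training: Span Corruption

T5 is pre-trained with span corruption, a variant of masked language modeling (MLM). Each contiguous span of tokens selected for corruption is replaced with a single sentinel token, and the decoder generates the original contents of those spans.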

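Input:  "Thank you <X> me to your party <Y> week"
Target: "<X> for inviting <Y> last <Z>"

Here <X>, <Y>, and <Z> are sentinel tokens. The original sentence is "Thank you for inviting me to your party last week"; the final sentinel <Z> marks the end of the target.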
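
The transformation in this example can be sketched in a few lines. This is a simplified illustration rather than T5's actual preprocessing code: it works on whitespace-separated words instead of subword tokens, takes the spans as explicit arguments instead of sampling them, and the function name `span_corrupt` is hypothetical.

```python
def span_corrupt(tokens, spans, sentinels=("<X>", "<Y>", "<Z>")):
    """Replace each span with one sentinel and build the matching target.

    `spans` holds sorted, non-overlapping (start, end) pairs, end exclusive.
    One extra sentinel is needed to close the target sequence.
    """
    if len(spans) + 1 > len(sentinels):
        raise ValueError("not enough sentinel tokens for the given spans")

    corrupted, target = [], []
    cursor = 0
    for sentinel, (start, end) in zip(sentinels, spans):
        corrupted.extend(tokens[cursor:start])  # keep text before the span
        corrupted.append(sentinel)              # one sentinel replaces the span
        target.append(sentinel)                 # target repeats the sentinel...
        target.extend(tokens[start:end])        # ...followed by the dropped text
        cursor = end
    corrupted.extend(tokens[cursor:])           # keep text after the last span
    target.append(sentinels[len(spans)])        # closing sentinel
    return " ".join(corrupted), " ".join(target)


if __name__ == "__main__":
    words = "Thank you for inviting me to your party last week".split()
    model_input, model_target = span_corrupt(words, [(2, 4), (8, 9)])
    print(model_input)
    print(model_target)
```

With the spans "for inviting" and "last" dropped, the function reproduces the input and target strings shown above.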

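Comparison	BERT MLM	T5 Span Corruption
Masking unit	Individual tokens	Contiguous token spans
Output	Tokens at the masked positions	Sentinels + original spans
Efficiency	Predicts only about 15% of tokens	Predicts whole spans; one sentinel per span keeps both input and target short
Architecture	Encoder only	Encoder-decoder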

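T5 Task Prefix Examples

# T5's text-to-text input format
from transformers import T5ForConditionalGeneration, T5Tokenizer

# t5-base is assumed here; any T5 checkpoint exposes the same interface
model = T5ForConditionalGeneration.from_pretrained("t5-base")
tokenizer = T5Tokenizer.from_pretrained("t5-base")

tasks = {
    "translation":    "translate English to German: That is good.",
    "summarization":  "summarize: State authorities dispatched teams Tuesday...",
    "classification": "sst2 sentence: This movie was absolutely wonderful.",
    "similarity":     "stsb sentence1: The cat sat on the mat. sentence2: A cat is on the mat.",
    "QA":             "question: What is gravity? context: Gravity is a force...",
}

# Every task goes through the same model and the same interface
for task, input_text in tasks.items():
    output = model.generate(tokenizer(input_text, return_tensors="pt").input_ids)
    print(f"{task}: {tokenizer.decode(output[0], skip_special_tokens=True)}")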

BART: Bidirectional and Auto-Regressive Transformers

Core Idea: Denoising Autoencoder

BART, released by Facebook AI (now Meta) in 2019, is a denoising autoencoder: it corrupts text and learns to reconstruct the original.

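BART's Noise Functions

BART corrupts its input in several different ways.

Noise type	Description	Example
Token Masking	Replace tokens with [MASK], as in BERT	`A B C` → `A [M] C`
Token Deletion	Delete tokens; the model must work out which positions are missing	`A B C` → `A C`
Text Infilling	Replace a contiguous span with a single [MASK]	`A B C D` → `A [M] D`
Sentence Permutation	Shuffle the sentence order at random	`S1. S2. S3.` → `S3. S1. S2.`
Document Rotation	Rotate the document so it starts at a randomly chosen token	`A B C D` → `C D A B`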

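In the BART paper's experiments, text infilling was the most consistently effective noise type. It resembles T5's span corruption in that a span of any length collapses to a single mask token, so the model must also infer how many tokens are missing. The differences are that BART reuses the same [MASK] symbol for every span and its decoder reconstructs the full original text, whereas T5 uses a distinct sentinel per span and generates only the dropped spans.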
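
The five noise types can be written as small list transformations. The sketch below is illustrative only and is not BART's training code: positions, spans, and orderings are passed in explicitly so the results are deterministic, whereas actual pre-training samples them at random. All function names are hypothetical.

```python
MASK = "[M]"


def token_masking(tokens, positions):
    """Replace the tokens at the given positions with a mask token."""
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


def token_deletion(tokens, positions):
    """Drop the tokens at the given positions without leaving a marker."""
    return [tok for i, tok in enumerate(tokens) if i not in positions]


def text_infilling(tokens, start, end):
    """Replace tokens[start:end] with a single mask, whatever the span length.

    With start == end this inserts a mask without removing anything.
    """
    return tokens[:start] + [MASK] + tokens[end:]


def sentence_permutation(sentences, order):
    """Reorder sentences according to a permutation of their indices."""
    return [sentences[i] for i in order]


def document_rotation(tokens, start):
    """Rotate the document so that it begins at tokens[start]."""
    return tokens[start:] + tokens[:start]


if __name__ == "__main__":
    print(token_masking("A B C".split(), {1}))
    print(token_deletion("A B C".split(), {1}))
    print(text_infilling("A B C D".split(), 1, 3))
    print(sentence_permutation(["S1.", "S2.", "S3."], [2, 0, 1]))
    print(document_rotation("A B C D".split(), 2))
```

Each call reproduces the corresponding example row of the noise table. Token deletion and text infilling change the sequence length, which is why only an encoder-decoder, whose output length is not tied to its input length, can train on them directly.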

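BART vs. T5

Item	BART	T5
Pre-training	Denoising autoencoder (multiple noise types)	Span corruption
Task format	Per-task fine-tuning	Unified text-to-text
Strengths	Summarization, generation	General-purpose tasks
Position encoding	Learned absolute positions	Relative position bias
Decoder attention	Cross-attends to the full encoder output	Cross-attends to the full encoder output
Noise strategy	Five noise types explored (final model: text infilling + sentence permutation)	Span corruption only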

mBART: Multilingual BART

mBART extends BART to the multilingual setting. It runs denoising pre-training jointly over 25 languages, and mBART-50 extends the coverage to 50.

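Item	mBART	mBART-50
Released	2020	2020
Languages	25	50
Training data	CC25 (multilingual Common Crawl)	CC25 plus monolingual data for 25 additional languages
Primary use	Multilingual translation, summarization	Broader language coverage
Korean	Included	Included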

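mBART specifies languages with language-code tokens: the tokenizer adds the source language code (for example ko_KR) to the input, and generation is forced to begin with the target language code.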

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt"
)
tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt"
)

# Korean → English translation
tokenizer.src_lang = "ko_KR"
text = "인공지능은 인간의 학습, 추론, 지각 능력을 컴퓨터로 구현하는 기술입니다."
inputs = tokenizer(text, return_tensors="pt")

generated = model.generate(
    **inputs,
    # Force the first generated token to be the target language code (en_XX)
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("en_XX"),
    max_new_tokens=128,
)

translation = tokenizer.decode(generated[0], skip_special_tokens=True)
print(f"Source: {text}")
print(f"Translation: {translation}")

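Encoder-Decoder vs. Decoder-Only: When to Use Which?

Task	Recommended architecture	Reason
Machine translation	Encoder-decoder	Input (source language) and output (target language) differ in form
Text summarization	Encoder-decoder	Understand the input (long document), then generate the output (summary)
Extractive QA	Encoder-decoder or encoder-only	Understand the passage, then extract or generate the answer
Open-ended dialogue	Decoder-only	Open-ended, long-form text generation
Text classification	Encoder-only	Only understanding is needed; no generation
Code generation	Decoder-only	Long code is generated autoregressively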

Implementation Examples

Summarization with T5

from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the T5-Base model
model = T5ForConditionalGeneration.from_pretrained("t5-base")
tokenizer = T5Tokenizer.from_pretrained("t5-base")

# Text to summarize (prefixed with "summarize:")
article = """
summarize: The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building,
and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side.
During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest
man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York
City was finished in 1930. It was the first structure to reach a height of 300 metres.
"""

inputs = tokenizer(article, return_tensors="pt", max_length=512, truncation=True)

# Generate the summary
summary_ids = model.generate(
    inputs.input_ids,
    max_new_tokens=100,
    num_beams=4,           # beam search with 4 beams
    length_penalty=2.0,    # values above 0 favor longer outputs in beam search
    early_stopping=True,
)

summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(f"Summary: {summary}")

Text Generation with BART (Summarization)

from transformers import BartForConditionalGeneration, BartTokenizer

# BART-Large CNN model (fine-tuned for summarization)
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")

# Summarize a news article
article = """
New York (CNN) -- More than 80 million Americans are under a winter storm warning as a major
storm system moves across the eastern United States. The storm is expected to bring heavy snow,
ice, and freezing rain to a wide area from the Midwest to the Northeast. Schools, government
offices, and businesses have closed in several states. Airlines have cancelled thousands of
flights. The National Weather Service warned of dangerous travel conditions and potential power
outages.
"""

inputs = tokenizer(article, return_tensors="pt", max_length=1024, truncation=True)

summary_ids = model.generate(
    inputs.input_ids,
    max_new_tokens=150,
    num_beams=4,
    length_penalty=2.0,
)

print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))

T5 Text-to-Text Multi-Task

from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("t5-base")
tokenizer = T5Tokenizer.from_pretrained("t5-base")

def text_to_text(input_text):
    """Run any supported task through T5's text-to-text interface."""
    inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(inputs.input_ids, max_new_tokens=100)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Different tasks, one interface: only the prefix changes
tasks = [
    "translate English to German: How are you?",
    "summarize: The quick brown fox jumps over the lazy dog. This is a famous pangram.",
    "sst2 sentence: This movie was really enjoyable and fun to watch.",
    "cola sentence: The cat sat on the mat.",
]

for task in tasks:
    result = text_to_text(task)
    print(f"Input: {task[:60]}...")
    print(f"Output: {result}\n")

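Comparison of Related Models

Model	Released	Pre-training	Main tasks	Parameters (Base)
T5	2019, Google	Span corruption	General-purpose (text-to-text)	220M
BART	2019, Meta	Denoising autoencoder	Summarization, generation	140M
mBART	2020, Meta	Multilingual denoising	Multilingual translation, summarization	610M
Flan-T5	2022, Google	Span corruption + instruction tuning	Instruction following	250M~11B
mT5	2021, Google	Multilingual span corruption	Multilingual general-purpose	300M~13B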

Reference Papers

Paper	Authors	Year	Key contribution
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5)	Raffel et al.	2019	Text-to-text framework, large-scale experiments
BART: Denoising Sequence-to-Sequence Pre-training	Lewis et al.	2019	Denoising autoencoder pre-training
Multilingual Denoising Pre-training for Neural Machine Translation (mBART)	Liu et al.	2020	Multilingual BART
Scaling Instruction-Finetuned Language Models (Flan-T5)	Chung et al.	2022	Instruction tuning + T5
