Attention Mechanism

Learning Objectives

Explain the information bottleneck problem in Seq2Seq models
Understand "selective focus", the core idea behind the attention mechanism
Distinguish Bahdanau attention (additive) from Luong attention (multiplicative)
Explain how alignment scores and attention weights are computed
Interpret visualizations of attention weights

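Why It Matters

Attention is a core mechanism of modern NLP. Transformer, BERT, GPT, and nearly every other recent model are built on attention. Without understanding how attention works, you will understand the architectures that follow only superficially. Attention was originally proposed for machine translation, to overcome a limitation of Seq2Seq models. By addressing the fundamental problem of compressing a long sentence into a single fixed-length vector, it went on to reshape how natural language processing is done.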

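Core Concepts

The Information Bottleneck in Seq2Seq

In a Seq2Seq (sequence-to-sequence) model, the encoder compresses the entire input sequence into a single fixed-length context vector $\mathbf{c}$, and the decoder generates the output by consulting only that vector. This structure has clear problems.

Information bottleneck: no matter how long the input is, all of its information must fit into the single vector $\mathbf{c}$
Loss of long-range dependencies: the longer the input sequence, the more the information from early tokens fades
Uniform reference: the decoder consults the same context vector at every step, so it cannot focus on the part of the input most relevant to the token it is currently generating

For example, when translating “The agreement on the European Economic Area was signed in August 1992”, the decoder should focus on the end of the input at the step where it generates “1992”, but a fixed vector $\mathbf{c}$ does not allow this kind of selective focus.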
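The bottleneck can be seen directly in tensor shapes. The following is a minimal sketch, assuming a single-layer GRU encoder and random vectors in place of real token embeddings (the sizes and the `encode` helper are illustrative, not part of the original text): the final hidden state used as $\mathbf{c}$ has the same size for a 5-token and a 50-token input, while the per-step hidden states grow with the input.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions, not from the text).
embed_size, hidden_size = 8, 16
encoder = nn.GRU(embed_size, hidden_size, batch_first=True)

def encode(src_len):
    """Encode one random sequence; return all hidden states and the final one."""
    x = torch.randn(1, src_len, embed_size)   # stand-in for embedded tokens
    outputs, last_hidden = encoder(x)         # outputs: (1, src_len, hidden)
    context = last_hidden[-1]                 # (1, hidden): the fixed vector c
    return outputs, context

short_outputs, short_context = encode(5)
long_outputs, long_context = encode(50)

# The context vector has the same shape regardless of input length ...
print(short_context.shape, long_context.shape)
# ... while the per-step hidden states scale with the input.
print(short_outputs.shape, long_outputs.shape)
```

Attention makes use of the per-step `outputs` that a plain Seq2Seq decoder never sees.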

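The Core Idea of Attention

The idea behind attention is: “when the decoder generates an output, let it look back at all of the encoder's hidden states, but give more weight to the parts relevant to the current step.” Instead of a fixed vector $\mathbf{c}$, a context vector $\mathbf{c}_t$ is computed dynamically at each decoder step $t$.

$$\mathbf{c}_t = \sum_{i=1}^{T_x} \alpha_{t,i} \cdot \mathbf{h}_i$$

where:

$\mathbf{h}_i$ : the encoder hidden state at step $i$
$\alpha_{t,i}$ : the attention weight that decoder step $t$ assigns to encoder step $i$
$T_x$ : the input sequence length

The attention weight $\alpha_{t,i}$ is obtained by applying softmax to the alignment score $e_{t,i}$.

$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T_x} \exp(e_{t,j})}$$

Attention variants differ in how the alignment score $e_{t,i}$ is computed. The two covered here are the Bahdanau and Luong formulations.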
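As a small worked example of the two formulas above, the sketch below applies softmax to three alignment scores and forms the context vector as the weighted sum. The hidden states and scores are hand-picked illustrative values, not the output of any model.

```python
import torch
import torch.nn.functional as F

# Three encoder hidden states h_1..h_3 (hidden size 2), chosen by hand.
encoder_states = torch.tensor([[1.0, 0.0],
                               [0.0, 1.0],
                               [1.0, 1.0]])

# Hypothetical alignment scores e_{t,1..3} for a single decoder step t.
scores = torch.tensor([2.0, 0.0, 1.0])

weights = F.softmax(scores, dim=-1)   # alpha_{t,i}: approx. [0.665, 0.090, 0.245]
context = weights @ encoder_states    # c_t = sum_i alpha_{t,i} * h_i

print(weights)   # non-negative and sums to 1
print(context)   # approx. [0.910, 0.335]
```

The highest score (position 1) receives the largest weight, so $\mathbf{c}_t$ is pulled toward $\mathbf{h}_1$ while still mixing in the other states.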

Bahdanau Attention (Additive)

Additive attention, proposed by Bahdanau et al. (2015), computes the alignment score with a small learnable neural network.

$$e_{t,i} = \mathbf{v}^T \tanh(\mathbf{W}_1 \mathbf{s}_{t-1} + \mathbf{W}_2 \mathbf{h}_i)$$

where:

$\mathbf{s}_{t-1}$ : the decoder hidden state at the previous step
$\mathbf{h}_i$ : the encoder hidden state at step $i$
$\mathbf{W}_1, \mathbf{W}_2$ : learnable weight matrices
$\mathbf{v}$ : a learnable weight vector

Its key characteristic is that it uses the decoder's previous state $\mathbf{s}_{t-1}$: attention is computed before the output for the current step is generated.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BahdanauAttention(nn.Module):
    """Bahdanau (additive) attention mechanism."""
    def __init__(self, hidden_size):
        super().__init__()
        self.W1 = nn.Linear(hidden_size, hidden_size, bias=False)  # for the decoder state
        self.W2 = nn.Linear(hidden_size, hidden_size, bias=False)  # for the encoder states
        self.v = nn.Linear(hidden_size, 1, bias=False)             # projects to a scalar score

    def forward(self, decoder_hidden, encoder_outputs):
        """
        decoder_hidden: (batch, hidden_size) - decoder state at the previous step
        encoder_outputs: (batch, src_len, hidden_size) - all encoder hidden states
        """
        # Add a sequence dimension so decoder_hidden broadcasts over src_len
        decoder_hidden = decoder_hidden.unsqueeze(1)  # (batch, 1, hidden)

        # Alignment scores: v^T * tanh(W1 * s + W2 * h)
        scores = self.v(
            torch.tanh(self.W1(decoder_hidden) + self.W2(encoder_outputs))
        )  # (batch, src_len, 1)

        # Attention weights
        attention_weights = F.softmax(scores.squeeze(-1), dim=-1)  # (batch, src_len)

        # Context vector: weighted sum of encoder states
        context = torch.bmm(
            attention_weights.unsqueeze(1), encoder_outputs
        )  # (batch, 1, hidden)

        return context.squeeze(1), attention_weights
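To make the ordering concrete (attend with $\mathbf{s}_{t-1}$ first, then update the state), here is a simplified sketch of a single Bahdanau-style decoder step. The class name `AdditiveDecoderStep`, the use of `nn.GRUCell`, the sizes, and the assumption that encoder and decoder share one hidden size are all illustrative choices; the output layer and other details of the original model are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveDecoderStep(nn.Module):
    """One decoder step: attend using s_{t-1}, then compute s_t (hypothetical sketch)."""
    def __init__(self, embed_size, hidden_size):
        super().__init__()
        self.W1 = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W2 = nn.Linear(hidden_size, hidden_size, bias=False)
        self.v = nn.Linear(hidden_size, 1, bias=False)
        # The cell consumes the previous token embedding concatenated with c_t.
        self.cell = nn.GRUCell(embed_size + hidden_size, hidden_size)

    def forward(self, prev_embed, prev_state, encoder_outputs):
        # 1) Additive alignment scores from the previous state s_{t-1}.
        scores = self.v(torch.tanh(
            self.W1(prev_state).unsqueeze(1) + self.W2(encoder_outputs)
        )).squeeze(-1)                                   # (batch, src_len)
        weights = F.softmax(scores, dim=-1)
        # 2) Context vector c_t as the weighted sum of encoder states.
        context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
        # 3) Only now is the new state s_t computed, conditioned on c_t.
        state = self.cell(torch.cat([prev_embed, context], dim=-1), prev_state)
        return state, weights

torch.manual_seed(0)
batch, src_len, embed_size, hidden_size = 2, 5, 4, 8
step = AdditiveDecoderStep(embed_size, hidden_size)
state, weights = step(
    torch.randn(batch, embed_size),            # embedding of the previous output token
    torch.zeros(batch, hidden_size),           # s_{t-1}
    torch.randn(batch, src_len, hidden_size),  # encoder hidden states
)
print(state.shape, weights.shape)
```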

Luong Attention (Multiplicative)

Multiplicative attention, proposed by Luong et al. (2015), computes the alignment score from a dot product, which is cheaper to evaluate. There are three variants.

| Variant | Alignment score $e_{t,i}$ | Notes |
|---|---|---|
| dot | $\mathbf{s}_t^T \mathbf{h}_i$ | Simplest; no additional parameters |
| general | $\mathbf{s}_t^T \mathbf{W} \mathbf{h}_i$ | One learnable transformation matrix |
| concat | $\mathbf{v}^T \tanh(\mathbf{W}[\mathbf{s}_t; \mathbf{h}_i])$ | Similar to Bahdanau |

The key difference in Luong attention is that it uses the decoder's current state $\mathbf{s}_t$.

class LuongAttention(nn.Module):
    """Luong (multiplicative) attention mechanism ("dot" and "general" variants)."""
    def __init__(self, hidden_size, method="dot"):
        super().__init__()
        # Fail early: without this check an unsupported method (e.g. "concat")
        # would leave `scores` undefined in forward().
        if method not in ("dot", "general"):
            raise ValueError(f"Unsupported method: {method!r}")
        self.method = method
        if method == "general":
            self.W = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, decoder_hidden, encoder_outputs):
        """
        decoder_hidden: (batch, hidden_size) - decoder state at the current step
        encoder_outputs: (batch, src_len, hidden_size) - all encoder hidden states
        """
        if self.method == "dot":
            # s_t^T * h_i
            scores = torch.bmm(
                encoder_outputs,
                decoder_hidden.unsqueeze(-1)
            ).squeeze(-1)  # (batch, src_len)
        else:  # "general"
            # s_t^T * W * h_i
            scores = torch.bmm(
                self.W(encoder_outputs),
                decoder_hidden.unsqueeze(-1)
            ).squeeze(-1)  # (batch, src_len)

        attention_weights = F.softmax(scores, dim=-1)  # (batch, src_len)
        context = torch.bmm(
            attention_weights.unsqueeze(1), encoder_outputs
        ).squeeze(1)  # (batch, hidden)

        return context, attention_weights
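The class above covers only the dot and general variants. For completeness, the following is a minimal sketch of the concat score $\mathbf{v}^T \tanh(\mathbf{W}[\mathbf{s}_t; \mathbf{h}_i])$ from the table. The module name `ConcatScore` and the choice of projecting to `hidden_size` inside the tanh are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConcatScore(nn.Module):
    """Concat-style alignment score: v^T tanh(W [s_t; h_i]) (hypothetical sketch)."""
    def __init__(self, hidden_size):
        super().__init__()
        self.W = nn.Linear(2 * hidden_size, hidden_size, bias=False)
        self.v = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, decoder_hidden, encoder_outputs):
        src_len = encoder_outputs.size(1)
        # Repeat s_t along the source axis so it can be paired with every h_i.
        s = decoder_hidden.unsqueeze(1).expand(-1, src_len, -1)
        pairs = torch.cat([s, encoder_outputs], dim=-1)         # (batch, src_len, 2*hidden)
        return self.v(torch.tanh(self.W(pairs))).squeeze(-1)    # (batch, src_len)

torch.manual_seed(0)
batch, src_len, hidden_size = 2, 5, 8
score_fn = ConcatScore(hidden_size)
decoder_hidden = torch.randn(batch, hidden_size)
encoder_outputs = torch.randn(batch, src_len, hidden_size)

scores = score_fn(decoder_hidden, encoder_outputs)
weights = F.softmax(scores, dim=-1)
context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
print(scores.shape, context.shape)
```

Once the scores are computed, the softmax and weighted-sum steps are the same as in the other variants.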

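Bahdanau vs Luong

| Property | Bahdanau (2015) | Luong (2015) |
|---|---|---|
| Alignment function | Additive | Multiplicative |
| Decoder state | Previous step $\mathbf{s}_{t-1}$ | Current step $\mathbf{s}_t$ |
| Encoder | Bidirectional RNN | Unidirectional or bidirectional |
| Computational cost | Higher (a small neural network per score) | Lower (dot product) |
| Parameters | $\mathbf{W}_1, \mathbf{W}_2, \mathbf{v}$ | dot: none / general: $\mathbf{W}$ |
| Expressiveness | Nonlinear function, can capture complex relationships | Linear in each state; fast but more limited |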
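The parameter row of the comparison can be checked by counting. This is a small sketch, assuming a hidden size of 256 and bias-free linear layers as in the classes above; `count_params` is a helper introduced here for illustration.

```python
import torch.nn as nn

hidden_size = 256  # illustrative size (assumption)

def count_params(*modules):
    """Total number of learnable parameters across the given modules."""
    return sum(p.numel() for m in modules for p in m.parameters())

additive = count_params(
    nn.Linear(hidden_size, hidden_size, bias=False),  # W1
    nn.Linear(hidden_size, hidden_size, bias=False),  # W2
    nn.Linear(hidden_size, 1, bias=False),            # v
)
general = count_params(nn.Linear(hidden_size, hidden_size, bias=False))  # W
dot = count_params()  # the dot variant has nothing to learn

print(additive, general, dot)  # 131328 65536 0
```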

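Visualizing Attention Weights

An important by-product of the attention mechanism is that the weight matrix can be visualized. In machine translation, plotting the attention weights as a heatmap shows the word alignment between the source and target languages.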

import matplotlib.pyplot as plt
import numpy as np

def plot_attention(source_tokens, target_tokens, attention_matrix):
    """Visualize attention weights as a heatmap."""
    fig, ax = plt.subplots(figsize=(8, 6))
    im = ax.imshow(attention_matrix, cmap="Blues", aspect="auto")

    # Columns are source tokens, rows are target tokens
    ax.set_xticks(range(len(source_tokens)))
    ax.set_xticklabels(source_tokens, rotation=45, ha="right")
    ax.set_yticks(range(len(target_tokens)))
    ax.set_yticklabels(target_tokens)

    ax.set_xlabel("Source (input)")
    ax.set_ylabel("Target (output)")
    ax.set_title("Attention Weights")
    plt.colorbar(im)
    plt.tight_layout()
    plt.show()

# Example: attention weights for English → Korean translation
# (rendering the Korean tick labels requires a font with Hangul glyphs)
source = ["The", "cat", "sat", "on", "the", "mat"]
target = ["고양이가", "매트", "위에", "앉았다"]
# attention_matrix: NumPy array of shape (len(target), len(source))

With a well-trained model, such a heatmap would be expected to show a high weight on “cat” when “고양이가” is generated, which suggests that the model has learned a sensible alignment.
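The `(len(target), len(source))` matrix expected by `plot_attention` is typically built by stacking the weight vector from each decoder step. The sketch below shows only that collection step, assuming dot-product scores and random tensors in place of real encoder and decoder states; because nothing is trained, the resulting pattern carries no linguistic meaning.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

source = ["The", "cat", "sat", "on", "the", "mat"]
target = ["고양이가", "매트", "위에", "앉았다"]
hidden_size = 8  # illustrative size (assumption)

# Random stand-ins for encoder states and the decoder state at each step.
encoder_outputs = torch.randn(1, len(source), hidden_size)
decoder_states = torch.randn(len(target), 1, hidden_size)

rows = []
for s_t in decoder_states:                                  # s_t: (1, hidden)
    scores = torch.bmm(encoder_outputs, s_t.unsqueeze(-1)).squeeze(-1)  # (1, src_len)
    rows.append(F.softmax(scores, dim=-1).squeeze(0))       # (src_len,)

# One row per target token, one column per source token.
attention_matrix = torch.stack(rows).numpy()
print(attention_matrix.shape)  # (4, 6)
```

The resulting array can be passed directly as `plot_attention(source, target, attention_matrix)`. Each row sums to 1, so a row reads as a distribution over source tokens for one target token.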

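Applications in AI/ML

Beyond machine translation, the attention mechanism plays a central role across many areas of AI and machine learning.

Natural language processing: text classification, sentiment analysis, question answering, document summarization
Computer vision: focusing on specific image regions in image captioning
Speech recognition: aligning audio frames with text tokens
Recommender systems: focusing on the important interactions in a user's behavior sequence
The birth of the Transformer: the success of attention led to the idea behind the Transformer architecture, that “attention alone is enough”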

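Is attention used only with RNNs?

No. Attention was first proposed for RNN-based Seq2Seq models, but it can be used without an RNN. The Transformer removes RNNs entirely and processes sequences with self-attention alone. In other words, attention is a general-purpose mechanism, not a technique tied to one architecture.

What is the difference between soft attention and hard attention?

Soft attention assigns a continuous weight (between 0 and 1) to every input position. It is differentiable, so it can be trained with backpropagation. Hard attention selects a single position (a weight of 0 or 1). The selection is not differentiable, so it has to be trained with methods such as reinforcement learning (REINFORCE). Both Bahdanau and Luong attention, as described above, are soft attention.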
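The contrast can be sketched in a few lines. This is an illustrative example under stated assumptions: the states and scores are random, and the constant `reward` is a placeholder where a real system would supply a task-dependent signal. It is not a full hard-attention training loop.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

torch.manual_seed(0)

encoder_outputs = torch.randn(5, 8)             # (src_len, hidden)
scores = torch.randn(5, requires_grad=True)     # alignment scores for one decoder step
weights = F.softmax(scores, dim=-1)

# Soft attention: a differentiable weighted average over all positions.
soft_context = weights @ encoder_outputs        # (hidden,)

# Hard attention: sample one position and use only that hidden state.
dist = Categorical(probs=weights)
index = dist.sample()                           # discrete choice; no gradient flows through it
hard_context = encoder_outputs[index]           # (hidden,)

# REINFORCE-style surrogate: the gradient reaches `scores` via log_prob,
# scaled by a reward. The constant reward here is a placeholder (assumption).
reward = 1.0
surrogate_loss = -dist.log_prob(index) * reward
surrogate_loss.backward()

print(soft_context.shape, hard_context.shape, scores.grad.shape)
```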

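What is the time complexity of attention?

Counted in alignment scores, the time complexity of attention is $O(T_x \cdot T_y)$, because a score must be computed for every pair of an input position (input length $T_x$) and an output position (output length $T_y$). Each individual score additionally costs time that grows with the hidden size. Bahdanau attention runs a small neural network per score, so its constant factor is larger, while Luong's dot variant is a plain dot product and is therefore relatively fast.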
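The pairwise cost is visible in the shape of the score tensor. The sketch below assumes dot-product scores and that all decoder states are already available as one stacked tensor, which is a simplification for illustration; in a recurrent decoder the states are produced one step at a time.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

batch, T_x, T_y, hidden = 2, 7, 4, 16  # illustrative sizes (assumptions)
encoder_outputs = torch.randn(batch, T_x, hidden)
decoder_states = torch.randn(batch, T_y, hidden)  # all s_t stacked

# One dot-product score for every (output step, input position) pair.
scores = torch.bmm(decoder_states, encoder_outputs.transpose(1, 2))  # (batch, T_y, T_x)
weights = F.softmax(scores, dim=-1)              # normalize over input positions
contexts = torch.bmm(weights, encoder_outputs)   # (batch, T_y, hidden)

scores_per_example = scores.shape[1] * scores.shape[2]
print(scores_per_example)  # T_y * T_x = 28
```

Doubling either sequence length doubles the number of scores; doubling both quadruples it.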

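Are attention weights always interpretable?

Attention weights give an approximate picture of where the model is “focusing”, but they do not necessarily amount to a causal explanation. According to Jain & Wallace (2019), attention weights do not always provide a faithful explanation of a prediction. They should be interpreted with care.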

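Checklist

I can explain the problem with the fixed context vector in Seq2Seq
I understand how attention performs “selective focus”
I can follow the computation: alignment score → softmax → attention weight → context vector
I can explain the formula and characteristics of Bahdanau attention
I can distinguish the three variants of Luong attention (dot, general, concat)
I clearly understand the difference between the two approaches (previous-step vs current-step decoder state)
I can interpret an attention-weight heatmap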

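Next Documents

Self-Attention

Self-Attention, which applies attention within a single sequence, and Multi-Head Attention

Transformer Architecture

An analysis of the full structure of the Transformer, which is built on attention without recurrence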
