Self-Attention

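Learning Objectives

Understand how Self-Attention differs from conventional Attention (“attention over itself”)
Explain the concepts and roles of Query, Key, and Value
Follow the Scaled Dot-Product Attention formula step by step
Explain why Multi-Head Attention is needed and how it works
Trace the dimension changes to understand the full computation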

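Why It Matters

Self-Attention is the core component of the Transformer architecture. Whereas conventional Attention captures relationships between an encoder and a decoder, Self-Attention captures relationships among the tokens of a single sequence. This mechanism lets the Transformer model long-range dependencies within a sequence without processing it step by step as an RNN does. To work out that “it” refers to “animal” in “The animal didn’t cross the street because it was too tired”, the model has to consider the relationships among all the words in the sentence at once. That is exactly the role Self-Attention plays.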

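Core Concepts

The Meaning of “Self”

In conventional Attention (Cross-Attention), the decoder state attends to the encoder states; that is, it looks at relationships between two different sequences. In Self-Attention, a single sequence attends to itself: each token in the input sequence computes its relationship with every token in the same sequence, including itself.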

Query, Key, Value

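Self-Attention can be understood through an information retrieval analogy.

Concept	Role	Analogy
Query (Q)	“What am I looking for?”	The question typed into a search box
Key (K)	“What can I offer?”	A document's title or tags
Value (V)	“The information actually passed on”	The document's body text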

Q, K, and V are generated from the input sequence $\mathbf{X} \in \mathbb{R}^{n \times d_{\text{model}}}$ ($n$: sequence length, $d_{\text{model}}$: model dimension).

\mathbf{Q} = \mathbf{X} \mathbf{W}^Q, \quad \mathbf{K} = \mathbf{X} \mathbf{W}^K, \quad \mathbf{V} = \mathbf{X} \mathbf{W}^V

Here $\mathbf{W}^Q, \mathbf{W}^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ and $\mathbf{W}^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ are learnable weight matrices.

In Self-Attention, Q, K, and V are all derived from the same input $\mathbf{X}$. This is the essence of “Self”. In Cross-Attention, by contrast, Q comes from the decoder while K and V come from the encoder.
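The projections above can be sketched in a few lines of PyTorch. This is a minimal illustration, not a full attention layer: the sizes are assumed values, and bias-free `nn.Linear` layers stand in for $\mathbf{W}^Q$, $\mathbf{W}^K$, $\mathbf{W}^V$ (note that `nn.Linear` stores its weight transposed, so `W_q.weight.T` corresponds to $\mathbf{W}^Q$).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions, not prescribed by the text)
n, d_model, d_k, d_v = 4, 512, 64, 64

X = torch.randn(n, d_model)  # one sequence of n token embeddings

# Bias-free linear layers play the role of W^Q, W^K, W^V
W_q = nn.Linear(d_model, d_k, bias=False)
W_k = nn.Linear(d_model, d_k, bias=False)
W_v = nn.Linear(d_model, d_v, bias=False)

# All three projections read the same input X: this is the "Self" part
Q, K, V = W_q(X), W_k(X), W_v(X)

print(Q.shape, K.shape, V.shape)
```

In Cross-Attention the only change to this sketch would be that `W_q` reads a different tensor than `W_k` and `W_v` do.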

Scaled Dot-Product Attention

The attention function used in the Transformer is:

\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}

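It breaks down into the following steps.

Step 1: Similarity, $\mathbf{Q}\mathbf{K}^T$

The dot products of the Queries and Keys form a similarity matrix. The result is an $n \times n$ matrix holding a relevance score for every pair of tokens.

\mathbf{Q}\mathbf{K}^T \in \mathbb{R}^{n \times n}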

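Step 2: Scaling, $/ \sqrt{d_k}$

As $d_k$ grows, the variance of the dot products grows, which pushes Softmax into a saturated region where its gradients become very small. Dividing by $\sqrt{d_k}$ keeps the gradients stable. For example, with $d_k = 64$ the scores are divided by $\sqrt{d_k} = 8$. This simple trick has a large effect on training stability. The paper explains that if the components of a query and a key are independent with mean 0 and variance 1, their dot product has mean 0 and variance $d_k$; dividing by $\sqrt{d_k}$ brings the variance back to 1.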
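The variance argument can be checked empirically. The sketch below assumes query and key components drawn from a standard normal distribution (the idealized setting of the argument, not necessarily what a trained model produces) and estimates the variance of the dot product before and after scaling.

```python
import torch

torch.manual_seed(0)
num_samples = 10_000  # assumed sample count for the estimate
variances = {}        # d_k -> (raw variance, scaled variance)

for d_k in (4, 64, 512):
    q = torch.randn(num_samples, d_k)
    k = torch.randn(num_samples, d_k)
    dots = (q * k).sum(dim=-1)                     # one dot product per sample
    raw_var = dots.var().item()                    # expected to be close to d_k
    scaled_var = (dots / d_k ** 0.5).var().item()  # expected to be close to 1
    variances[d_k] = (raw_var, scaled_var)
    print(f"d_k={d_k:4d}  var(q.k)={raw_var:8.2f}  var(q.k/sqrt(d_k))={scaled_var:.2f}")
```

The raw variance should track $d_k$, while the scaled variance should stay near 1 regardless of $d_k$.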

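Step 3: Softmax (normalization)

Softmax is applied along each row to produce a distribution of attention weights for each Query. Each row sums to 1.

Step 4: Weighted sum, $\times \mathbf{V}$

The attention weights are used to take a weighted sum of the Values, which produces the final output.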

Dimension Tracking

Let's trace the dimension changes with concrete numbers.

Step	Tensor	Shape	Example ( $n{=}4$ , $d_{\text{model}}{=}512$ , $d_k{=}d_v{=}64$ )
Input	$\mathbf{X}$	$(n, d_{\text{model}})$	$(4, 512)$
Q projection	$\mathbf{X}\mathbf{W}^Q$	$(n, d_k)$	$(4, 64)$
K projection	$\mathbf{X}\mathbf{W}^K$	$(n, d_k)$	$(4, 64)$
V projection	$\mathbf{X}\mathbf{W}^V$	$(n, d_v)$	$(4, 64)$
Similarity	$\mathbf{Q}\mathbf{K}^T$	$(n, n)$	$(4, 4)$
Scaling	$/\sqrt{d_k}$	$(n, n)$	$(4, 4)$
Weights	$\text{softmax}(\cdot)$	$(n, n)$	$(4, 4)$
Output	$\text{attn} \times \mathbf{V}$	$(n, d_v)$	$(4, 64)$
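The shapes in the table can be confirmed with a short script. This is a sketch under assumptions: the weights are random matrices rather than trained parameters, and $d_v$ is deliberately set to 32 (unlike the table's 64) so that it is visible which output dimension comes from V.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Assumed sizes; d_v differs from d_k on purpose
n, d_model, d_k, d_v = 4, 512, 64, 32

X = torch.randn(n, d_model)
W_Q = torch.randn(d_model, d_k) / d_model ** 0.5  # random stand-ins for learned weights
W_K = torch.randn(d_model, d_k) / d_model ** 0.5
W_V = torch.randn(d_model, d_v) / d_model ** 0.5

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = Q @ K.T                     # (n, n)
scaled = scores / d_k ** 0.5         # (n, n)
weights = F.softmax(scaled, dim=-1)  # (n, n), each row sums to 1
output = weights @ V                 # (n, d_v)

for name, t in [("X", X), ("Q", Q), ("K", K), ("V", V),
                ("scores", scores), ("weights", weights), ("output", output)]:
    print(f"{name:8s} {tuple(t.shape)}")
```

The sequence length $n$ survives every step, and the final width is $d_v$, not $d_k$.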

Numerical Example

Let's work through a small example by hand, with $d_k = 3$ and a sequence length of 2:

import torch
import torch.nn.functional as F

# Small example: sequence length 2, d_k = 3
Q = torch.tensor([[1.0, 0.0, 1.0],   # Query of token 1
                  [0.0, 1.0, 0.0]])  # Query of token 2

K = torch.tensor([[1.0, 1.0, 0.0],   # Key of token 1
                  [0.0, 0.0, 1.0]])  # Key of token 2

V = torch.tensor([[1.0, 2.0, 3.0],   # Value of token 1
                  [4.0, 5.0, 6.0]])  # Value of token 2

d_k = Q.size(-1)  # 3

# Step 1: Q @ K^T (similarity)
scores = Q @ K.T
print(f"Similarity matrix:\n{scores}")
# tensor([[1., 1.],
#         [1., 0.]])

# Step 2: scaling by sqrt(d_k) = sqrt(3) ≈ 1.732
scores_scaled = scores / (d_k ** 0.5)
print(f"After scaling:\n{scores_scaled}")
# approximately [[0.577, 0.577],
#                [0.577, 0.000]]

# Step 3: Softmax over each row
attention_weights = F.softmax(scores_scaled, dim=-1)
print(f"Attention weights:\n{attention_weights}")
# Token 1: [0.5, 0.5]   → attends equally to tokens 1 and 2
# Token 2: [0.64, 0.36] → attends more to token 1

# Step 4: weighted sum of the Values
output = attention_weights @ V
print(f"Final output:\n{output}")
# Token 1 output: 0.5 * [1,2,3] + 0.5 * [4,5,6] = [2.5, 3.5, 4.5]
# Token 2 output: 0.64 * [1,2,3] + 0.36 * [4,5,6] ≈ [2.08, 3.08, 4.08]

Multi-Head Attention

A single attention head captures relationships from only one perspective. Multi-Head Attention runs several attention heads in parallel so that relationships between tokens are captured from multiple perspectives at once.

\text{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\mathbf{W}^O

\text{head}_i = \text{Attention}(\mathbf{Q}\mathbf{W}_i^Q, \mathbf{K}\mathbf{W}_i^K, \mathbf{V}\mathbf{W}_i^V)

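where:

$h$ : number of heads ( $h = 8$ in the original paper)
$d_k = d_v = d_{\text{model}} / h$ (512 / 8 = 64)
$\mathbf{W}^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$ : output projection matrix

Each head has its own $\mathbf{W}_i^Q, \mathbf{W}_i^K, \mathbf{W}_i^V$, so each performs attention in a different subspace. As an illustration, different heads might come to focus on:

Head 1: syntactic relations (subject-verb)
Head 2: semantic similarity (synonyms, related words)
Head 3: local positional relations (adjacent tokens)
Head 4: long-range dependencies (pronoun-antecedent)

These roles are learned during training rather than assigned, so real heads rarely separate this cleanly.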

The total computational cost of Multi-Head Attention is similar to that of single-head attention with full dimensionality, because it uses $h$ heads but reduces the dimension of each head to $d_{\text{model}}/h$.
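The cost claim can be verified with simple arithmetic. The sketch below counts Q/K/V projection parameters and the multiply-adds of the score computation for both configurations; the sequence length `n = 128` is an arbitrary assumed value, and the output projection $\mathbf{W}^O$ is left out of the count.

```python
d_model, h = 512, 8
d_k = d_model // h  # 64 per head
n = 128             # assumed sequence length for the comparison

# Single head at full dimension: three (d_model x d_model) projections
single_head_params = 3 * d_model * d_model

# h heads, each with three (d_model x d_k) projections
multi_head_params = h * 3 * d_model * d_k

# Multiply-adds for Q @ K^T: n * n dot products of the given width
single_head_score_macs = n * n * d_model
multi_head_score_macs = h * (n * n * d_k)

print(single_head_params, multi_head_params)          # 786432 786432
print(single_head_score_macs, multi_head_score_macs)  # 8388608 8388608
```

Because $h \cdot d_k = d_{\text{model}}$, both counts come out identical; the heads split the same budget rather than multiplying it.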

PyTorch Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadAttention(nn.Module):
    """Multi-Head Attention mechanism"""
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        # Q, K, V projections (all heads handled by one matrix each)
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)  # output projection

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        """Scaled Dot-Product Attention"""
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        if mask is not None:
            # Positions where mask == 0 get -inf, so softmax assigns them weight 0
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attention_weights = F.softmax(scores, dim=-1)
        output = torch.matmul(attention_weights, V)
        return output, attention_weights

    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)

        # Linear projection, then split into heads: (batch, seq, d_model) → (batch, heads, seq, d_k)
        Q = self.W_q(Q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(K).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(V).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Attention for all heads in parallel
        output, attention_weights = self.scaled_dot_product_attention(Q, K, V, mask)

        # Merge heads: (batch, heads, seq, d_k) → (batch, seq, d_model)
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)

        # Final linear transformation
        return self.W_o(output), attention_weights
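For comparison, PyTorch ships a built-in `nn.MultiheadAttention` module. The sketch below is not the class above; it only shows, with assumed sizes, what the self-attention call pattern (one tensor passed as query, key, and value) looks like with the built-in. One difference worth noting: by default the built-in returns attention weights averaged over heads, whereas the class above returns per-head weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumed sizes for illustration
batch_size, seq_len, d_model, num_heads = 2, 5, 512, 8

# batch_first=True makes inputs and outputs (batch, seq, d_model)
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

x = torch.randn(batch_size, seq_len, d_model)

# Self-attention: the same tensor serves as query, key, and value
output, attn_weights = mha(x, x, x)

print(output.shape)        # (batch, seq, d_model)
print(attn_weights.shape)  # (batch, seq, seq), averaged over heads
```

The output keeps the input shape, which is what allows attention layers to be stacked.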

Masking

Self-Attention uses two kinds of masks.

Mask type	Purpose	Where applied
Padding Mask	Keeps padding tokens from taking part in attention	Both encoder and decoder
Look-Ahead Mask	Keeps the decoder from attending to future tokens	Decoder Self-Attention only

The look-ahead mask is implemented as a lower triangular matrix.

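import torch

def create_look_ahead_mask(size):
    """Build a look-ahead mask for the decoder (1 = may attend, 0 = blocked)."""
    mask = torch.tril(torch.ones(size, size))  # lower triangular matrix
    return mask  # positions with 0 are later filled with -inf via masked_fill

# Example: sequence length 4
mask = create_look_ahead_mask(4)
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])
# → each position can attend only to itself and earlier positions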
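To see what the masks do to the attention weights, the following sketch applies them to a random score matrix. The scores are random placeholders, and the padding layout (last token treated as padding) is an assumption chosen for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len = 4
scores = torch.randn(seq_len, seq_len)  # placeholder attention scores

# Look-ahead mask: 1 = may attend, 0 = blocked
look_ahead = torch.tril(torch.ones(seq_len, seq_len))
causal_weights = F.softmax(
    scores.masked_fill(look_ahead == 0, float("-inf")), dim=-1
)

# Padding mask (assumed layout): the last token is padding
is_real_token = torch.tensor([1.0, 1.0, 1.0, 0.0])
padding_mask = is_real_token.unsqueeze(0)  # (1, seq_len), broadcasts over query rows

# A position is visible only if both masks allow it
combined = look_ahead * padding_mask
combined_weights = F.softmax(
    scores.masked_fill(combined == 0, float("-inf")), dim=-1
)

print(causal_weights)    # zeros above the diagonal
print(combined_weights)  # additionally, the last column is all zeros
```

Filling with `-inf` before Softmax gives the blocked positions a weight of exactly 0 while the remaining weights in each row still sum to 1. A row whose positions are all blocked would produce NaN, so each query needs at least one visible key.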

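Applications in AI/ML

Self-Attention is the foundation of Transformer-family models and is used in many variants.

BERT: bidirectional Self-Attention that reads context from both sides at once
GPT: causal Self-Attention with a look-ahead mask, used for text generation
Vision Transformer (ViT): treats image patches as tokens and applies Self-Attention to them
Efficient Attention: research such as Linear Attention and Sparse Attention that reduces the $O(n^2)$ complexity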

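Why is the time complexity of Self-Attention O(n²)?

Because the $\mathbf{Q}\mathbf{K}^T$ operation multiplies $(n, d_k) \times (d_k, n)$ to produce an $(n, n)$ matrix. Similarity is computed for every pair of tokens in the sequence, so the complexity is $O(n^2 \cdot d_k)$. As a result, memory and compute costs grow quadratically for very long sequences (tens of thousands of tokens), and Efficient Attention methods that address this are an active research area.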
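A rough back-of-the-envelope calculation makes the quadratic growth concrete. The sketch assumes the full $(n, n)$ score matrix is materialized in float32 for a single head and a single sequence; real implementations may avoid storing it in full.

```python
BYTES_PER_FLOAT32 = 4
score_matrix_mib = {}  # sequence length -> size of one (n, n) score matrix in MiB

for n in (512, 4096, 32768):
    elements = n * n  # one score per token pair
    score_matrix_mib[n] = elements * BYTES_PER_FLOAT32 / 2**20
    print(f"n={n:6d}  elements={elements:13,d}  size={score_matrix_mib[n]:8.1f} MiB")

# n=512   ->    1.0 MiB
# n=4096  ->   64.0 MiB
# n=32768 -> 4096.0 MiB (4 GiB)
```

Making the sequence 8 times longer makes the matrix 64 times larger, and this is before multiplying by the number of heads, layers, and batch elements.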

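Where is Cross-Attention used?

Cross-Attention is used in the Transformer decoder. After the decoder's Self-Attention, the decoder state serves as the Query and the encoder output serves as the Key and Value, letting the decoder consult information from the input sequence. It plays a central role in encoder-decoder architectures for tasks such as machine translation and text summarization.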
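The Query/Key/Value split described here can be sketched as follows. All tensors and weights are random placeholders with assumed sizes; the point is only that the score matrix becomes rectangular, with one row per target position and one column per source position.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Assumed sizes: 6 source tokens, 3 target tokens
src_len, tgt_len, d_model, d_k = 6, 3, 32, 16

encoder_output = torch.randn(src_len, d_model)  # placeholder encoder states
decoder_state = torch.randn(tgt_len, d_model)   # placeholder decoder states

W_Q = torch.randn(d_model, d_k) / d_model ** 0.5  # random stand-ins for learned weights
W_K = torch.randn(d_model, d_k) / d_model ** 0.5
W_V = torch.randn(d_model, d_k) / d_model ** 0.5

Q = decoder_state @ W_Q    # Query from the decoder
K = encoder_output @ W_K   # Key from the encoder
V = encoder_output @ W_V   # Value from the encoder

weights = F.softmax(Q @ K.T / d_k ** 0.5, dim=-1)  # (tgt_len, src_len)
context = weights @ V                              # (tgt_len, d_k)

print(weights.shape, context.shape)
```

Each target position ends up with its own distribution over the source tokens, which is how the decoder reads from the input sequence.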

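Can't Q, K, and V share the same weight matrix?

With shared weights, the distinction between “what is being sought” (Q) and “what is offered” (K) disappears. Using different weight matrices lets the same input token take on different representations in its Query role and its Key role. This asymmetry is key to capturing rich relationships.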
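One concrete consequence of sharing the Query and Key weights can be shown numerically: the raw score matrix becomes symmetric, so token i scores token j exactly as token j scores token i (before Softmax). The sketch below uses random matrices with assumed sizes to compare the two cases.

```python
import torch

torch.manual_seed(0)
n, d_model, d_k = 4, 16, 8  # assumed sizes

X = torch.randn(n, d_model)
W_shared = torch.randn(d_model, d_k)  # one matrix used for both Q and K
W_Q = torch.randn(d_model, d_k)       # separate matrices
W_K = torch.randn(d_model, d_k)

shared_scores = (X @ W_shared) @ (X @ W_shared).T
separate_scores = (X @ W_Q) @ (X @ W_K).T

shared_is_symmetric = torch.allclose(shared_scores, shared_scores.T, atol=1e-3)
separate_is_symmetric = torch.allclose(separate_scores, separate_scores.T, atol=1e-3)

print(shared_is_symmetric)    # True
print(separate_is_symmetric)  # False
```

With separate matrices, the score from “it” to “animal” can differ from the score from “animal” to “it”, which is the asymmetry the answer refers to.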

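Checklist

I can explain what “Self” means in Self-Attention
I can explain the roles of Query, Key, and Value and how they are generated
I can follow the four-step computation of Scaled Dot-Product Attention
I understand why the scores are divided by $\sqrt{d_k}$ (to prevent Softmax saturation)
I can explain why Multi-Head Attention is needed (multiple perspectives)
I can trace the dimension changes: $(n, d_{\text{model}}) \to (n, d_k) \to (n, n) \to (n, d_v)$
I can distinguish the padding mask from the look-ahead mask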
