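Using LLM APIs

The first step in building an LLM application is configuring and calling the API correctly. This document walks step by step through the basics of the OpenAI, Anthropic Claude, and HuggingFace Inference APIs, then covers streaming responses and error handling.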

Prerequisites

Install the required packages before you begin.

pip install openai anthropic huggingface_hub python-dotenv

Store your API keys in a .env file and load them at startup.

# .env file
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
HF_TOKEN=hf_...

# Load API keys
from dotenv import load_dotenv
import os

load_dotenv()

openai_key = os.getenv("OPENAI_API_KEY")
anthropic_key = os.getenv("ANTHROPIC_API_KEY")
hf_token = os.getenv("HF_TOKEN")

Never hardcode API keys in source code. Use environment variables or a secrets manager, and keep the .env file out of version control.
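
One way to make a missing key obvious is to fail fast at startup rather than at the first API call. The sketch below is only an illustration: `require_env` and `DEMO_API_KEY` are hypothetical names, not part of any SDK.

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable's value, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Placeholder value set in-process for demonstration only (hypothetical name)
os.environ["DEMO_API_KEY"] = "demo-value"
print(len(require_env("DEMO_API_KEY")))  # prints 10
```

In a real script you would call `require_env("OPENAI_API_KEY")` right after `load_dotenv()`, so a misconfigured environment stops the program with a clear message.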

Hands-On

OpenAI Chat Completions API

OpenAI's Chat Completions API is a widely adopted interface for conversational LLM interaction.

from openai import OpenAI

# Initialize the client
client = OpenAI()  # reads OPENAI_API_KEY from the environment automatically

# Basic call
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "You are a friendly Korean-language AI assistant."
        },
        {
            "role": "user",
            "content": "Explain what natural language processing is in one sentence."
        }
    ],
    temperature=0.7,  # sampling diversity (0.0 to 2.0)
    max_tokens=200,   # maximum number of output tokens
)

# Extract the answer
answer = response.choices[0].message.content
print(answer)

# Check token usage
usage = response.usage
print(f"Input tokens: {usage.prompt_tokens}")
print(f"Output tokens: {usage.completion_tokens}")
print(f"Total tokens: {usage.total_tokens}")

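Key parameters:

Parameter	Description	Default	Recommended range
`model`	Model ID to use	-	`gpt-4o-mini`, `gpt-4o`
`temperature`	Controls sampling diversity	1.0	0.0 (near-deterministic) to 1.5 (creative)
`max_tokens`	Maximum number of output tokens	Varies by model	Set to fit the task
`top_p`	Nucleus (cumulative-probability) sampling	1.0	0.1 to 1.0
`frequency_penalty`	Penalizes repeated tokens	0.0	0.0 to 1.0
`presence_penalty`	Encourages new topics	0.0	0.0 to 1.0

Avoid tuning temperature and top_p at the same time. Fix one and adjust the other.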
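
To see why the two knobs interact, it can help to look at the textbook definitions. The sketch below uses made-up logits and the standard formulas for temperature scaling and nucleus filtering. It is an illustration only; how any provider implements sampling internally is not something this document states.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: list[float], top_p: float) -> list[int]:
    """Return indices of the smallest high-probability set whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (0.2, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], top_p_filter(probs, 0.9))
```

A lower temperature concentrates probability on the top token, which also shrinks the set that survives the top_p cutoff. Changing both at once makes it hard to tell which setting caused a change in output.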

Managing Multi-Turn Conversations

The Chat Completions API is stateless, so you have to manage the conversation history yourself and send it with every request.

from openai import OpenAI

client = OpenAI()

# Conversation history
conversation = [
    {"role": "system", "content": "You are an NLP expert. Answer concisely."}
]

def chat(user_message: str) -> str:
    """Append the user message, get a reply, and store it in the history."""
    conversation.append({"role": "user", "content": user_message})

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=conversation,
        temperature=0.7,
    )

    assistant_message = response.choices[0].message.content
    conversation.append({"role": "assistant", "content": assistant_message})
    return assistant_message

# Run the conversation
print(chat("What is the core mechanism of the Transformer?"))
print(chat("What makes it better than an RNN?"))  # relies on the previous turn
print(chat("Please give a concrete example."))

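As a conversation grows, tokens accumulate because the full history is resent on every call. Keep the context window limit in mind (for example, GPT-4o-mini supports 128K tokens) and, when needed, apply a strategy for pruning older messages.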
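
One simple pruning strategy is a sliding window that keeps the system message and only the most recent turns. The sketch below is one possible approach, not the only one: `trim_history` is a hypothetical helper, and it counts messages rather than tokens to stay dependency-free.

```python
def trim_history(messages: list[dict], max_turn_messages: int) -> list[dict]:
    """Keep all system messages plus the most recent non-system messages."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    recent = turns[-max_turn_messages:] if max_turn_messages > 0 else []
    # Drop leading assistant messages so the window starts with a user turn
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return system + recent


history = [
    {"role": "system", "content": "You are an NLP expert."},
    {"role": "user", "content": "Q1"},
    {"role": "assistant", "content": "A1"},
    {"role": "user", "content": "Q2"},
    {"role": "assistant", "content": "A2"},
    {"role": "user", "content": "Q3"},
]
print([m["content"] for m in trim_history(history, 3)])
```

A token-based budget or a summary of older turns would preserve more context. The message-count window is simply the easiest variant to reason about.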

Anthropic Claude API

Anthropic's Claude API differs from OpenAI's in its message structure and in how it handles the system prompt.

import anthropic

# Initialize the client
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment automatically

# Basic call
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="You are a friendly Korean-language AI assistant.",  # system is a separate parameter
    messages=[
        {
            "role": "user",
            "content": "Summarize the key differences between BERT and GPT in a table."
        }
    ],
)

# Extract the answer
answer = response.content[0].text
print(answer)

# Check token usage
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")

OpenAI vs. Anthropic API comparison:

Item	OpenAI	Anthropic
System prompt	`role: system` entry in the `messages` array	Separate `system` parameter
Response structure	`response.choices[0].message.content`	`response.content[0].text`
Token counts	`prompt_tokens` / `completion_tokens`	`input_tokens` / `output_tokens`
Model naming	`gpt-4o`, `gpt-4o-mini`	`claude-sonnet-4-20250514`, etc.
Max output	`max_tokens` (optional)	`max_tokens` (required)
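
If you keep one conversation history and want to send it to either provider, the system-prompt difference in the table can be bridged with a small adapter. The sketch below is a minimal illustration: `to_anthropic_request` is a hypothetical helper, and it assumes every message has plain string content.

```python
def to_anthropic_request(openai_messages: list[dict]) -> dict:
    """Split OpenAI-style messages into Anthropic-style `system` and `messages`."""
    system_parts = [m["content"] for m in openai_messages if m["role"] == "system"]
    chat = [
        {"role": m["role"], "content": m["content"]}
        for m in openai_messages
        if m["role"] in ("user", "assistant")
    ]
    request = {"messages": chat}
    if system_parts:
        # Anthropic takes the system prompt as a top-level parameter
        request["system"] = "\n\n".join(system_parts)
    return request


conversation = [
    {"role": "system", "content": "You are an NLP expert."},
    {"role": "user", "content": "What is tokenization?"},
]
print(to_anthropic_request(conversation))
```

The result can be unpacked into the call, for example `client.messages.create(model=..., max_tokens=1024, **to_anthropic_request(conversation))`, with `max_tokens` supplied explicitly because Anthropic requires it.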

HuggingFace Inference API

The HuggingFace Inference API lets you call open-source models without running your own server.

import os

from huggingface_hub import InferenceClient

# Initialize the client
client = InferenceClient(token=os.getenv("HF_TOKEN"))

# Text generation with an open-source model
response = client.text_generation(
    prompt="Natural language processing (NLP) is",
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_new_tokens=200,
    temperature=0.7,
)
print(response)

You can also call it in chat form.

# Chat Completion-style call
response = client.chat_completion(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Explain the core components of the Transformer."},
    ],
    max_tokens=500,
)
print(response.choices[0].message.content)

HuggingFace's free Inference API is rate limited. For production, deploying a dedicated instance with Inference Endpoints is more reliable.

Handling Streaming Responses

Streaming noticeably improves the user experience for long responses, because tokens can be displayed as soon as they are generated.

OpenAI streaming:

from openai import OpenAI

client = OpenAI()

# Streaming call (stream=True)
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Explain the Transformer architecture in detail."}
    ],
    stream=True,  # enable streaming
)

# Print tokens as they arrive
full_response = ""
for chunk in stream:
    # Some chunks may carry no choices or no text; skip those
    if chunk.choices and chunk.choices[0].delta.content is not None:
        token = chunk.choices[0].delta.content
        print(token, end="", flush=True)
        full_response += token

print()  # newline
print(f"\nFull response length: {len(full_response)} characters")

Anthropic streaming:

import anthropic

client = anthropic.Anthropic()

# Streaming call
with client.messages.stream(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain step by step how self-attention works."}
    ],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

    print()

    # Read the final message while the stream context is still open
    final_message = stream.get_final_message()

# Check token usage from the final message
print(f"Input tokens: {final_message.usage.input_tokens}")
print(f"Output tokens: {final_message.usage.output_tokens}")

Asynchronous (async) streaming:

import asyncio
from openai import AsyncOpenAI

async def stream_response():
    """Receive the response through async streaming."""
    client = AsyncOpenAI()

    stream = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": "List the main NLP tasks."}
        ],
        stream=True,
    )

    async for chunk in stream:
        # Skip chunks that carry no choices or no text
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)

# Run
asyncio.run(stream_response())

Error Handling and Retry Strategy

In production, API errors need to be handled systematically.

import time
from openai import OpenAI, RateLimitError, APITimeoutError, APIConnectionError

client = OpenAI(timeout=30.0)  # 30-second timeout

def call_with_retry(
    messages: list,
    model: str = "gpt-4o-mini",
    max_retries: int = 3,
    base_delay: float = 1.0,
) -> str:
    """Retry the API call with exponential backoff."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                temperature=0.7,
            )
            return response.choices[0].message.content

        except RateLimitError:
            # 429: rate limit exceeded
            delay = base_delay * (2 ** attempt)
            print(f"[Rate Limit] Retrying in {delay}s ({attempt + 1}/{max_retries})")
            time.sleep(delay)

        except APITimeoutError:
            # The request timed out on the client side.
            # APITimeoutError subclasses APIConnectionError, so it must be caught first.
            delay = base_delay * (2 ** attempt)
            print(f"[Timeout] Retrying in {delay}s ({attempt + 1}/{max_retries})")
            time.sleep(delay)

        except APIConnectionError:
            # Network connection error
            delay = base_delay * (2 ** attempt)
            print(f"[Connection Error] Retrying in {delay}s ({attempt + 1}/{max_retries})")
            time.sleep(delay)

        except Exception as e:
            # Other errors (400 Bad Request, 401 Unauthorized, etc.)
            print(f"[Error] Not retryable: {e}")
            raise

    raise RuntimeError(f"Exceeded the maximum number of retries ({max_retries}).")

# Usage example
result = call_with_retry(
    messages=[{"role": "user", "content": "What is NLP?"}]
)
print(result)

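Common HTTP error codes:

Code	Cause	What to do
`400`	Bad request (malformed messages, etc.)	Check the request parameters
`401`	Authentication failed (API key error)	Check the API key
`403`	Permission denied (no access to the model)	Check your account plan
`429`	Rate limit exceeded	Retry with exponential backoff
`500`	Internal server error	Retry; contact support if it persists
`503`	Service overloaded	Retry with exponential backoff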
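
The table can be turned into a small decision helper. The sketch below is an assumption-laden illustration: `is_retryable` and `backoff_delay` are hypothetical names, and the random jitter is an addition not described in this document, included because it is a common way to keep many clients from retrying in lockstep.

```python
import random

# Status codes the table above marks as worth retrying
RETRYABLE_STATUS = {429, 500, 503}


def is_retryable(status_code: int) -> bool:
    """Return True when the table suggests retrying this HTTP status."""
    return status_code in RETRYABLE_STATUS


def backoff_delay(
    attempt: int,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    rng=random.random,
) -> float:
    """Exponential backoff with jitter: a random delay up to min(max_delay, base * 2**attempt)."""
    ceiling = min(max_delay, base_delay * (2 ** attempt))
    return rng() * ceiling


for status in (400, 429, 503):
    print(status, is_retryable(status))
print([round(backoff_delay(a), 2) for a in range(4)])
```

Passing `rng` as a parameter keeps the function easy to test. In normal use the default `random.random` is enough.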

The tenacity library (installed separately with pip install tenacity) lets you express the retry logic more concisely.

from openai import OpenAI, RateLimitError
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

client = OpenAI()

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, max=10),
    retry=retry_if_exception_type(RateLimitError),
)
def call_api(messages):
    """Call the API, making up to 3 attempts on rate-limit errors."""
    return client.chat.completions.create(
        model="gpt-4o-mini", messages=messages
    )

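Troubleshooting

AuthenticationError: Incorrect API key

The API key is set incorrectly. Check the key format in your .env file.

OpenAI: starts with the sk- prefix
Anthropic: starts with the sk-ant- prefix
HuggingFace: starts with the hf_ prefix

To confirm that the environment variable is loaded, run print((os.getenv("OPENAI_API_KEY") or "")[:10]). An empty result means the variable is not set. Avoid printing the full key.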
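
The prefix checks listed above can be scripted so that no part of the key is printed. This is a rough sketch: `check_key` is a hypothetical helper, and the prefixes are the ones given in this section, which may not cover every key format a provider issues.

```python
import os

# Prefixes as listed in this section
EXPECTED_PREFIXES = {
    "OPENAI_API_KEY": "sk-",
    "ANTHROPIC_API_KEY": "sk-ant-",
    "HF_TOKEN": "hf_",
}


def check_key(name: str) -> str:
    """Describe whether a key is set and has the expected prefix, without revealing it."""
    value = os.getenv(name)
    if not value:
        return f"{name}: not set"
    prefix = EXPECTED_PREFIXES.get(name, "")
    if value.startswith(prefix):
        status = "ok"
    else:
        status = f"unexpected prefix (expected {prefix!r})"
    return f"{name}: {status}, length {len(value)}"


for key_name in EXPECTED_PREFIXES:
    print(check_key(key_name))
```

Run it after `load_dotenv()`. A "not set" line usually points to a missing .env file or a script started from a different working directory.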

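RateLimitError: You exceeded your current quota

You have exceeded your API usage quota. Check your usage and billing details in the OpenAI dashboard.

Free accounts have tight limits (for example, 3 requests per minute or 200 requests per day; check the current limits for your account)
Using gpt-4o-mini reduces cost and eases the pressure on limits
Apply an exponential backoff retry strategy for rate limits; retrying does not help once the quota itself is exhausted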

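APITimeoutError: Request timed out

The request did not complete within the configured time.

Lower max_tokens to limit the response length
Increase the timeout value: OpenAI(timeout=60.0)
Check your network connection

The streaming response cuts off midway

Network instability or server load may be the cause.

Wrap the streaming loop in try/except to handle errors
Consider a pattern that saves the response up to the point of interruption and then requests a continuation
Check whether a proxy or firewall is dropping long-lived connections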
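
The first two suggestions can be sketched without any SDK by treating the stream as a plain iterable of text pieces. Everything here is illustrative: `collect_stream` and `flaky_stream` are hypothetical names, and a real stream would come from one of the streaming calls shown earlier.

```python
from typing import Iterable, Tuple


def collect_stream(chunks: Iterable[str]) -> Tuple[str, bool]:
    """Accumulate streamed text; return (text_so_far, completed)."""
    parts: list[str] = []
    try:
        for piece in chunks:
            parts.append(piece)
    except Exception as exc:  # broad on purpose for this sketch
        print(f"[Stream interrupted] {exc!r}")
        return "".join(parts), False
    return "".join(parts), True


def flaky_stream():
    """Hypothetical stand-in for a text stream that drops mid-response."""
    yield "Self-attention "
    yield "computes "
    raise ConnectionError("connection reset")


text, completed = collect_stream(flaky_stream())
print(repr(text), completed)
```

When `completed` is False, the saved text can be sent back as an assistant message followed by a user message asking the model to continue from where it stopped. In production code you would catch the SDK's specific exception types rather than `Exception`.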
