Evaluating Real-World Prompt Performance with LLMs

feat. Prometheus 2 & OpenAI API

Harvey

AI Engineer

June 11, 2025 · about 31 min read

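Hello. I'm Harvey, an AI DevOps engineer at InfoGrab. Prompt optimization is drawing attention as a key factor in getting the most out of an LLM. Even for the same question, the quality of an LLM's response varies with the structure and wording of the prompt. This matters because it bears directly on the reliability of LLM responses, the user experience, and operating costs.

Prompt optimization should start with measuring quality quantitatively and improving from there. Otherwise, prompts end up being judged subjectively, by user experience or intuition. In that case you can miss the factors that actually affect prompt performance, and improving the prompt efficiently becomes harder.

With Prometheus 2 and the OpenAI API, you can score prompt quality quantitatively and faster, and derive practical, data-driven improvements more objectively. This article walks through how to evaluate prompt quality with these two tools, using hands-on examples.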

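Prompts and Evaluation Criteria

Before getting into how to evaluate prompt quality, let's first go over what prompts and prompt evaluation are, and which metrics are used.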

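Prompts

A prompt is the text given to an LLM as input, and it is the key data the model needs to generate a response. The quality of the prompt directly affects the quality of the response. In particular, a specific, well-structured, high-quality prompt draws out the LLM's potential effectively and helps produce the quality of response the user expects.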

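Prompt Evaluation

Prompt evaluation measures, against specific criteria, how well a prompt elicited the expected response from the LLM. It works by evaluating response quality to diagnose the performance of the input prompt.

To evaluate a prompt's performance properly, you first need to define clearly what an effective prompt is. In other words, you have to define concretely what counts as a good response. Then you evaluate the prompt's responses against those criteria.

That said, it is hard to apply the same evaluation criteria to every prompt, because prompts differ in context, in the task they perform, and in the characteristics of the expected output.

Still, there are general metrics that apply across many types of prompts. The following are typical.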

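Hallucination detection: checks whether the response contains false or fabricated information.
Accuracy: checks whether the response is factually correct and does not contradict the given context.
Efficiency: measures response generation speed and compute cost.
Toxicity and bias: evaluates whether the output contains inappropriate language, bias, or harmful content.
Coherence and fluency: evaluates the logical flow of the generated text and how natural its sentences are.
Relevance: evaluates how relevant the response is to the question and topic of the prompt.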

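How Evaluation Criteria Differ by Prompt Type

For prompts that ask for creative writing, fluency and originality matter. For prompts that ask for information, accuracy and supporting evidence are key. Evaluation criteria should therefore be defined to fit the type of prompt.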
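As a rough illustration of choosing criteria by prompt type, the sketch below keeps a small lookup table. The type names, the criteria labels, and the `criteria_for` helper are hypothetical names chosen for this example; they are not part of Prometheus 2 or the OpenAI API.

```python
# Hypothetical mapping from prompt type to the criteria that matter most for it.
CRITERIA_BY_PROMPT_TYPE = {
    "creative_writing": ["fluency", "originality"],
    "information_request": ["accuracy", "supporting_evidence"],
}

# Criteria assumed here to apply to every prompt type.
COMMON_CRITERIA = ["relevance", "toxicity_and_bias"]


def criteria_for(prompt_type: str) -> list[str]:
    """Return type-specific criteria first, then the common ones."""
    specific = CRITERIA_BY_PROMPT_TYPE.get(prompt_type, [])
    return specific + [c for c in COMMON_CRITERIA if c not in specific]


print(criteria_for("creative_writing"))
print(criteria_for("information_request"))
```

An unknown prompt type falls back to the common criteria only, which keeps the lookup safe to call before a type-specific rubric has been written.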

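Prompt Evaluation Methods

There are three main ways to evaluate prompts: human evaluation, evaluation based on quantitative metrics, and LLM-based evaluation (LLM-as-a-judge). Let's look at what each one means and its pros and cons.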

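Human Evaluation

In human evaluation, people read the LLM's response and rate its quality and how well the prompt fits. Evaluators weigh the response's appropriateness, accuracy, naturalness, and usefulness together. This method correlates strongly with real user satisfaction and is regarded as the most reliable. Its drawbacks are that evaluator subjectivity can creep in and that it takes a lot of time and money.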

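Quantitative Metric-Based Evaluation

This method evaluates with quantitative metrics based on text similarity or predictive accuracy, such as BLEU, ROUGE, BERTScore, and perplexity. Its advantage is that it is simple to compute and fast. However, it has difficulty reflecting the context of the prompt, and it cannot fully assess the quality of an LLM response or the fit of the prompt. A high metric score does not always mean a satisfied user.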
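To make that limitation concrete, here is a minimal sketch of a unigram-overlap F1 score. It is written in the spirit of ROUGE-1 but is not a real ROUGE implementation, and the example sentences are made up for illustration. A candidate that flips the meaning of the reference still scores much higher than a correct paraphrase.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over overlapping lowercase whitespace tokens (a simplified ROUGE-1-like score)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # per-token minimum of the two counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the server memory usage is above the threshold"
wrong_but_similar = "the server memory usage is below the threshold"
right_but_reworded = "memory consumption on this host exceeds its limit"

# The factually wrong sentence shares almost every token with the reference,
# so it outscores the correct paraphrase.
print(unigram_f1(wrong_but_similar, reference))
print(unigram_f1(right_but_reworded, reference))
```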

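LLM-Based Evaluation (LLM-as-a-judge)

In this method, a high-performing LLM evaluates another LLM's response quality and the prompt's fit according to predefined evaluation criteria and a prompt. It can be done in various ways, for example with a commercial API or an open-source model. It compensates for the limitations of human evaluation and metric-based evaluation.

As LLMs have improved, agreement between LLM-as-a-judge and human evaluation has been rising. But this method has limits too. If the evaluation criteria are inaccurate or unclear, the results can be skewed. Using a high-performing model can also raise costs.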

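Hands-On: LLM-Based Prompt Evaluation

In this article we practice using an LLM to evaluate another LLM's response quality and the prompt's fit. The exercise runs on Python 3.12.10 and uses Prometheus 2 and the OpenAI API as the evaluation tools. We run both under the same conditions and look at the validity and accuracy of their evaluations.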

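Prerequisites

Evaluating prompt performance with an LLM requires data in the following structure. You can adapt this data to your prompt type and evaluation criteria, or generate it with another LLM.

instruction: the prompt given to the LLM.
response: the LLM's response to that prompt.
reference_answer: an answer to refer to when evaluating the response. Optional.
rubric_data
- criteria: the criteria for evaluating the response.
- description: a description of each score.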
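One way to hold this structure in code is a pair of small dataclasses. This is only a sketch: the class names `RubricData` and `EvalRecord` and the `to_template_kwargs` helper are hypothetical. The flattened key names (`criteria`, `score1_description` through `score5_description`) follow the rubric dictionaries used later in this article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RubricData:
    criteria: str
    score_descriptions: dict[int, str]  # one description per score, 1 through 5

    def __post_init__(self) -> None:
        missing = [s for s in range(1, 6) if s not in self.score_descriptions]
        if missing:
            raise ValueError(f"missing descriptions for scores: {missing}")

    def to_template_kwargs(self) -> dict[str, str]:
        """Flatten into criteria / score{n}_description keys."""
        kwargs = {"criteria": self.criteria}
        for score in range(1, 6):
            kwargs[f"score{score}_description"] = self.score_descriptions[score]
        return kwargs


@dataclass
class EvalRecord:
    instruction: str                        # prompt given to the LLM
    response: str                           # the LLM's answer to that prompt
    rubric_data: RubricData                 # criteria plus per-score descriptions
    reference_answer: Optional[str] = None  # optional, as noted above
```

Validating the five score descriptions up front surfaces an incomplete rubric before any model is loaded or any API call is made.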

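Prometheus 2-Based Evaluation

Prometheus 2 is an open-source LLM specialized in evaluating language models. It supports both relative and absolute grading. In this exercise we use the prometheus-7b-v2.0 model to evaluate prompt performance with absolute grading.

Absolute Grading

Run the following commands to install the libraries.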
```
pip install vllm
pip install prometheus_eval
```
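Enter instruction, response, and rubric_data (criteria, description). This exercise uses the following (response omitted here).
1. instruction: “Provide a detailed operational summary of the recent memory usage for all production database servers. Include per-server memory usage, swap usage, related CPU and disk I/O metrics, and actionable recommendations if any usage exceeds operational thresholds.”
2. rubric_data
  - criteria: "Does the response include accurate CPU usage values for each server instance, appropriately summarize memory and disk I/O status, and provide actionable recommendations that the operations team can follow?"
  - description: defined per score (1 to 5) in the evaluation code.

Run the evaluation code.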

```python
from prometheus_eval.vllm import VLLM
from prometheus_eval import PrometheusEval
from prometheus_eval.prompts import ABSOLUTE_PROMPT, SCORE_RUBRIC_TEMPLATE

# Load the judge model and wrap it with the absolute-grading prompt template.
model = VLLM(model="prometheus-eval/prometheus-7b-v2.0")
judge = PrometheusEval(model=model, absolute_grade_template=ABSOLUTE_PROMPT)

# The prompt under evaluation. There is no trailing comma after the closing
# parenthesis; a comma there would turn the string into a one-element tuple.
instruction = (
    "Provide a detailed operational summary of the recent memory usage for all production database servers. "
    "Include per-server memory usage, swap usage, related CPU and disk I/O metrics, and actionable recommendations if any usage exceeds operational thresholds."
)

# The LLM response to be graded.
response = (
    "Staging Application Server Disk Usage Overview (as of 2025-06-04 15:40 KST):\n\n"
    "Server      | Disk Usage | 7-Day Trend\n"
    "------------|------------|----------------\n"
    "stg-app-01  | 77%        | +2% (gradual rise)\n"
    "stg-app-02  | 83%        | +5% (spike since yesterday)\n"
    "stg-app-03  | 69%        | -1% (stable)\n"
    "stg-app-04  | 85%        | +3% (steady increase)\n\n"
    "Alert: stg-app-02 and stg-app-04 are above the 80% disk usage threshold.\n\n"
    "Recommended actions:\n"
    "- For stg-app-02: Investigate recent log or data growth causing the spike. Consider temporary log archival.\n"
    "- For stg-app-04: Schedule disk cleanup or plan for volume expansion within the next 48 hours to prevent service impact.\n"
    "- All others: Continue routine monitoring.\n"
)

# Evaluation criteria plus a description of what each score (1 to 5) means.
rubric_data = {
    "criteria": (
        "Does the response include accurate CPU usage values for each server instance, "
        "appropriately summarize memory and disk I/O status, "
        "and provide actionable recommendations that the operations team can follow?"
    ),
    "score1_description": (
        "The response uses the wrong server name (e.g., refers to the dev server), "
        "provides inaccurate CPU usage numbers, and fails to include any additional information "
        "(memory or disk I/O)."
    ),
    "score2_description": (
        "The response mentions CPU usage only vaguely (e.g., “around 80–85%”), "
        "does not give per-server details, and the recommendation is either missing or too unclear "
        "(e.g., “Consider checking the CPU”)."
    ),
    "score3_description": (
        "The response provides CPU usage for one or two servers but does not calculate the overall average, "
        "or it includes only partial memory/disk I/O information (e.g., memory usage only)."
    ),
    "score4_description": (
        "The response lists accurate CPU usage for all prod instances and calculates the overall average, "
        "and it also includes memory and disk I/O details, but the recommendation (e.g., “Consider autoscaling”) "
        "is somewhat generic."
    ),
    "score5_description": (
        "The response lists accurate CPU usage for all prod instances and specifies the overall average, "
        "provides detailed memory and disk I/O status, and based on the condition “average CPU usage > 90%,” "
        "clearly presents actionable recommendations such as “scale out the Auto Scaling group” "
        "or “temporarily pause batch jobs.”"
    ),
}

# Fill the rubric template with the criteria and score descriptions.
score_rubric = SCORE_RUBRIC_TEMPLATE.format(**rubric_data)

# Grade a single (instruction, response) pair on the 1 to 5 scale.
feedback, score = judge.single_absolute_grade(
    instruction=instruction,
    response=response,
    rubric=score_rubric,
)

print("Feedback:", feedback)
print("Score:", score)
```

Prometheus 2 gives the LLM response this feedback: “The response presents accurate CPU usage figures for all production servers and calculates the overall average, in alignment with the rubric's requirements.” It scored the prompt (instruction) 5. Considering how well the evaluation criteria match the actual response, this can be read as a consistent and reliable result.

```
Feedback: The response presents accurate CPU usage figures for all production servers and calculates the overall average, in alignment with the rubric's requirements.
It also provides detailed memory and disk I/O metrics for each server instance, ensuring a comprehensive understanding of the servers' current status.
Furthermore, the actionable recommendations are well-defined and clearly structured based on specific operational thresholds.
This level of detail and precision in the recommendations is highly beneficial to the operations team, who can take immediate steps in response to the analysis presented.
Hence, the response is well-rounded, directly addressing each aspect of the rubric.
By accurately summarizing server usage, providing detailed status metrics, and offering clear, actionable recommendations, this response fulfills the criteria outlined in the rubric to the highest standard.
[RESULT] 5
Score: 5
```
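The raw judge text above ends with a `[RESULT] 5` marker. `single_absolute_grade` already returns the feedback and score separately, but if you only have logged raw output, a small parser along these lines could split it. This is a sketch: `parse_result` is a hypothetical helper, not part of prometheus-eval, and it assumes the marker format shown in the output above.

```python
import re
from typing import Optional

# Matches the "[RESULT] <score>" marker that closes the judge output shown above.
RESULT_PATTERN = re.compile(r"\[RESULT\]\s*([1-5])\b")


def parse_result(raw_output: str) -> tuple[str, Optional[int]]:
    """Split raw judge text into (feedback, score); score is None if no marker is found."""
    match = RESULT_PATTERN.search(raw_output)
    if match is None:
        return raw_output.strip(), None
    feedback = raw_output[: match.start()].strip()
    return feedback, int(match.group(1))


raw = "Hence, the response is well-rounded.\n[RESULT] 5"
print(parse_result(raw))
```

Returning `None` instead of raising lets a batch job record unparseable outputs and continue, rather than stopping at the first malformed one.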

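Hallucination Evaluation

Next, we evaluate whether hallucination occurred, based on the accuracy of the information in the LLM response. We grade with Prometheus 2 and then verify whether the result agrees with the actual official sources.

Enter instruction, response, and rubric_data (criteria, description). This exercise uses the following (response omitted here).
1. instruction: “Please provide the release date of Claude 4 (Opus 4 and Sonnet 4) and the per-token pricing for both input and output.”
2. rubric_data
  - criteria: "The extent to which the response’s release date and per-token pricing align with official sources, and whether any false or exaggerated information is included."
  - description: defined per score (1 to 5) in the evaluation code.

Run the evaluation code.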

```python
from prometheus_eval.vllm import VLLM
from prometheus_eval import PrometheusEval
from prometheus_eval.prompts import ABSOLUTE_PROMPT, SCORE_RUBRIC_TEMPLATE

# Load the judge model and wrap it with the absolute-grading prompt template.
model = VLLM(model="prometheus-eval/prometheus-7b-v2.0")
judge = PrometheusEval(model=model, absolute_grade_template=ABSOLUTE_PROMPT)

instruction = (
    "Please provide the release date of Claude 4 (Opus 4 and Sonnet 4) and the per-token pricing for both input and output."
)

# The LLM response to be graded; it deliberately contains incorrect facts.
response = (
    "Claude 4 (Opus 4 and Sonnet 4) was officially released on May 1, 2025. \n"
    "Pricing details are as follows:\n"
    "- Opus 4: $10 per input token, $60 per output token ($10/$60 per million tokens). (https://www.anthropic.com/news/claude-4) (https://www.anthropic.com/claude/opus)\n\n"
    "- Sonnet 4: $1.50 per input token, $12 per output token ($1.5/$12 per million tokens). (https://www.anthropic.com/news/claude-4) (https://www.anthropic.com/claude/sonnet)\n"
    "Additionally, a 50% discount promotion was available to all users for the first six months after launch. (https://www.anthropic.com/pricing)"
)

rubric_data = {
    "criteria": "The extent to which the response’s release date and per-token pricing align with official sources, and whether any false or exaggerated information is included.",
    "score1_description": "The response contains numerous incorrect dates or pricing details (release date, input/output rates, etc.), making fact-checking nearly impossible.",
    "score2_description": "Only one of the two pieces of information (release date or pricing) is correct, while the rest contains serious errors that undermine credibility.",
    "score3_description": "One piece of critical information (either release date or pricing) is accurate, but other details (e.g., pricing units, specific day/month) are slightly off and require further verification.",
    "score4_description": "Nearly all information is correct, but a very minor detail (e.g., omission of a prompt-caching discount) is missing or slightly incorrect, which can be resolved with minimal fact-checking.",
    "score5_description": "Every detail (release date, per-token pricing for input and output, etc.) perfectly matches official documentation with no false information.",
}

# Fill the rubric template with the criteria and score descriptions.
score_rubric = SCORE_RUBRIC_TEMPLATE.format(**rubric_data)

# Grade a single (instruction, response) pair on the 1 to 5 scale.
feedback, score = judge.single_absolute_grade(
    instruction=instruction,
    response=response,
    rubric=score_rubric,
)

print("Feedback:", feedback)
print("Score:", score)
```

Prometheus 2 gives the LLM response this feedback: “The response provided is accurate and well-sourced. It fully aligns with official sources and presents no incorrect information.” It scored the prompt 5.

```
Feedback: The response provided is accurate and well-sourced.
It gives the exact release date of Claude 4, as well as detailed pricing information for both Opus 4 and Sonnet 4.
Moreover, the response includes a valuable piece of information regarding a discount promotion for the first six months after the launch of Claude 4, which is crucial for users considering this product.
This added value aligns with official documentation and fact-checking processes, ensuring that the information is reliable and current.
The details presented are consistent with what one would expect to find in official documentation and do not contain any false or exaggerated information.
Therefore, the response demonstrates a high level of accuracy and reliability in both the date and pricing details for the release of Claude 4.
Hence, according to the score rubric, the response fully aligns with official sources and presents no incorrect information, thereby achieving a score of 5.
[RESULT] 5
Score: 5
```

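However, checking the facts shows that the release date, the pricing, and the discount policy in the LLM response are all wrong. Compared with the official information, the response differs as follows.

Release date: the response says May 1, 2025; the official date is May 22, 2025.
Opus 4 pricing: the response says $10 input / $60 output per million tokens; the official pricing is $15 / $75.
Sonnet 4 pricing: the response says $1.50 input / $12 output per million tokens; the official pricing is $3 / $15.
Discount: the response claims a 50% promotion for the first six months after launch; this claim is also incorrect.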
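When trusted values are already on hand, a field-by-field comparison like this does not need a judge model at all. The sketch below assumes the claimed values from the response in the code above and the official values reported in the gpt-4.1 run later in this article. The field names and the `find_mismatches` helper are hypothetical.

```python
# Values claimed in the LLM response (USD per million tokens, ISO dates).
claimed = {
    "release_date": "2025-05-01",
    "opus4_input_usd_per_mtok": 10.0,
    "opus4_output_usd_per_mtok": 60.0,
    "sonnet4_input_usd_per_mtok": 1.5,
    "sonnet4_output_usd_per_mtok": 12.0,
}

# Values reported as correct later in this article.
official = {
    "release_date": "2025-05-22",
    "opus4_input_usd_per_mtok": 15.0,
    "opus4_output_usd_per_mtok": 75.0,
    "sonnet4_input_usd_per_mtok": 3.0,
    "sonnet4_output_usd_per_mtok": 15.0,
}


def find_mismatches(claimed: dict, official: dict) -> dict:
    """Return {field: (claimed_value, official_value)} for every field that differs."""
    return {
        key: (claimed.get(key), expected)
        for key, expected in official.items()
        if claimed.get(key) != expected
    }


for field, (got, expected) in find_mismatches(claimed, official).items():
    print(f"{field}: response says {got}, official value is {expected}")
```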

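Overall Assessment

Prometheus 2, as used here, has no web search to collect up-to-date information, so it is weak at verifying facts and detecting hallucinations. Its multilingual support is also still lacking. However, it is inexpensive to run, and with batching it can quickly evaluate LLM response quality and prompt fit against a variety of criteria. It excels at evaluating context and sentence structure, so it is a good fit for evaluating document-writing and summarization prompts.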
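The batching mentioned above could be organized roughly as follows. The prometheus-eval README describes a batch method, `absolute_grade`, alongside `single_absolute_grade`, but its exact signature is not shown in this article, so this sketch takes the grading call as a parameter instead of calling the library directly. `grade_in_batches`, `GradeFn`, and the record keys are hypothetical names.

```python
from typing import Callable, Sequence

# A grading function takes (instructions, responses, rubric) and returns
# (feedbacks, scores). In practice it would wrap the judge model's batch call.
GradeFn = Callable[[Sequence[str], Sequence[str], str], tuple[list[str], list[int]]]


def grade_in_batches(
    records: Sequence[dict],
    rubric: str,
    grade_fn: GradeFn,
    batch_size: int = 8,
) -> tuple[list[str], list[int]]:
    """Grade records in fixed-size batches and concatenate the results in order."""
    feedbacks: list[str] = []
    scores: list[int] = []
    for start in range(0, len(records), batch_size):
        batch = records[start : start + batch_size]
        instructions = [r["instruction"] for r in batch]
        responses = [r["response"] for r in batch]
        batch_feedbacks, batch_scores = grade_fn(instructions, responses, rubric)
        feedbacks.extend(batch_feedbacks)
        scores.extend(batch_scores)
    return feedbacks, scores
```

Passing the grader in as a function also makes the loop easy to test with a stub before spending GPU time on the real model.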

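OpenAI API-Based Evaluation

This time we use the OpenAI API with the gpt-4.1 model to evaluate prompt performance with absolute grading.

Absolute Grading

Create an API key on the OpenAI Platform.
Run the following command to install the library.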
```
pip install openai
```
Enter instruction, response, and rubric_data (criteria, description). This exercise uses the same data as the Prometheus 2-based evaluation.

In the evaluation code, replace your_api_key with the API key you were issued, then run the code.

```python
from openai import OpenAI

client = OpenAI(api_key="your_api_key")

# The prompt under evaluation. There is no trailing comma after the closing
# parenthesis; a comma there would turn the string into a one-element tuple.
instruction = (
    "Provide a detailed operational summary of the recent memory usage for all production database servers. "
    "Include per-server memory usage, swap usage, related CPU and disk I/O metrics, and actionable recommendations if any usage exceeds operational thresholds."
)

# The LLM response to be graded.
response = ("""
**Production Database Server: Memory, CPU, and Disk I/O Summary (as of 2025-06-04 15:40 KST)**

| Server       | Mem Usage | Swap Usage | CPU Usage | Disk I/O (MB/s) | 7-Day Mem Trend | Status/Alert  |
|--------------|-----------|------------|-----------|-----------------|-----------------|---------------|
| prod-db-01   | 82%       | 0.4 GB     | 88%       | 35 read/18 write| +5%             | At threshold  |
| prod-db-02   | 91%       | 2.0 GB     | 92%       | 40 read/22 write| +7%             | **Overload**  |
| prod-db-03   | 68%       | 0.1 GB     | 75%       | 22 read/10 write| -2%             | Normal        |
| prod-db-04   | 87%       | 1.1 GB     | 94%       | 55 read/29 write| +4%             | **Overload**  |

**Overall averages:**
- Memory Usage: **82%**
- Swap Usage: **0.9 GB**
- CPU Usage: **87.3%**
- Disk I/O: **38 MB/s read**, **19.75 MB/s write**

**Operational Thresholds (for alerting):**
- Memory/CPU > 90%
- Swap > 1 GB
- Disk I/O > 50 MB/s for more than 2 hours

**Actionable Recommendations:**
1. **prod-db-02 and prod-db-04:**
   - Average CPU usage above 90%. Memory and swap are elevated.
   - **Action**:
     - **Scale out Auto Scaling group if possible, or provision additional replicas.**
     - **Temporarily pause heavy ETL/batch jobs until usage normalizes.**
     - Investigate top queries and recent workload spikes.
2. **prod-db-01:**
   - Approaching threshold on memory and CPU.
   - **Action**:
     - Monitor closely; prepare for scale/out or rerouting traffic if further increase is observed.
3. **prod-db-03:**
   - All metrics within normal range.
   - **Action**:
     - No action needed. Continue routine monitoring.
4. **General:**
   - Review slow queries, clean up unused indices or bloat, and ensure swap is not growing.
   - If high I/O persists, consider upgrading disk throughput or moving large maintenance tasks to off-peak times.
""")

# Evaluation criteria plus a description of what each score (1 to 5) means.
rubric_data = {
    "criteria": (
        "Does the response include accurate CPU usage values for each server instance, "
        "appropriately summarize memory and disk I/O status, "
        "and provide actionable recommendations that the operations team can follow?"
    ),
    "score1_description": (
        "The response uses the wrong server name (e.g., refers to the dev server), "
        "provides inaccurate CPU usage numbers, and fails to include any additional information "
        "(memory or disk I/O)."
    ),
    "score2_description": (
        "The response mentions CPU usage only vaguely (e.g., “around 80–85%”), "
        "does not give per-server details, and the recommendation is either missing or too unclear "
        "(e.g., “Consider checking the CPU”)."
    ),
    "score3_description": (
        "The response provides CPU usage for one or two servers but does not calculate the overall average, "
        "or it includes only partial memory/disk I/O information (e.g., memory usage only)."
    ),
    "score4_description": (
        "The response lists accurate CPU usage for all prod instances and calculates the overall average, "
        "and it also includes memory and disk I/O details, but the recommendation (e.g., “Consider autoscaling”) "
        "is somewhat generic."
    ),
    "score5_description": (
        "The response lists accurate CPU usage for all prod instances and specifies the overall average, "
        "provides detailed memory and disk I/O status, and based on the condition “average CPU usage > 90%,” "
        "clearly presents actionable recommendations such as “scale out the Auto Scaling group” "
        "or “temporarily pause batch jobs.”"
    ),
}


def openai_evaluate(prompt: str) -> str:
    # Ask gpt-4.1 to grade; the web search tool lets it check facts online.
    result = client.responses.create(
        model="gpt-4.1",
        tools=[{"type": "web_search"}],
        input=prompt,
    )
    # output_text collects the text of the message output, so the code does not
    # depend on where the message sits relative to any web search call items.
    return result.output_text


# Combine the instruction, response, and rubric into one grading prompt.
full_prompt = (
    f"[Instruction]\n{instruction}\n\n"
    f"[Response]\n{response}\n\n"
    f"[Rubric]\n{rubric_data}\n\n"
    "Provide me feedback and score only."
)

evaluation_output = openai_evaluate(full_prompt)

print("=== Evaluation Output ===")
print(evaluation_output)
```

GPT gives this feedback: “The response is well-structured, includes accurate per-server metrics for memory, swap, CPU, and disk I/O, and clearly highlights both overall averages and threshold conditions.” It scored the prompt (instruction) 5. Considering how well the evaluation criteria match the actual response, this can be read as a valid and accurate result.

```
=== Evaluation Output ===
**Feedback:**
Your response is well-structured, includes accurate per-server metrics for memory, swap, CPU, and disk I/O, and clearly highlights both overall averages and threshold conditions. The recommendations are specific, actionable, and tailored to the operational state of each server, especially for those exceeding defined alert levels (e.g., scaling out, pausing batch jobs, investigating queries).
The table also summarizes trends and status/alerts effectively.
The escalation steps and monitoring guidance are explicit, practical, and relevant for the operations team.
**Score:** 5/5
```

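Hallucination Evaluation

Next, we evaluate whether hallucination occurred, based on the accuracy of the information in the LLM response. We grade with gpt-4.1 and then verify whether the result agrees with the actual official sources.

Enter instruction, response, and rubric_data (criteria, description). This exercise uses the same data as the Prometheus 2-based evaluation.

Run the evaluation code.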

```python
from openai import OpenAI

client = OpenAI(api_key="your_api_key")

instruction = (
    "Please provide the release date of Claude 4 (Opus 4 and Sonnet 4) and the per-token pricing for both input and output."
)

# The LLM response to be graded; it deliberately contains incorrect facts.
response = (
    "Claude 4 (Opus 4 and Sonnet 4) was officially released on May 1, 2025. \n"
    "Pricing details are as follows:\n"
    "- Opus 4: $10 per input token, $60 per output token ($10/$60 per million tokens). (https://www.anthropic.com/news/claude-4) (https://www.anthropic.com/claude/opus)\n\n"
    "- Sonnet 4: $1.50 per input token, $12 per output token ($1.5/$12 per million tokens). (https://www.anthropic.com/news/claude-4) (https://www.anthropic.com/claude/sonnet)\n"
    "Additionally, a 50% discount promotion was available to all users for the first six months after launch. (https://www.anthropic.com/pricing)"
)

rubric_data = {
    "criteria": "The extent to which the response’s release date and per-token pricing align with official sources, and whether any false or exaggerated information is included.",
    "score1_description": "The response contains numerous incorrect dates or pricing details (release date, input/output rates, etc.), making fact-checking nearly impossible.",
    "score2_description": "Only one of the two pieces of information (release date or pricing) is correct, while the rest contains serious errors that undermine credibility.",
    "score3_description": "One piece of critical information (either release date or pricing) is accurate, but other details (e.g., pricing units, specific day/month) are slightly off and require further verification.",
    "score4_description": "Nearly all information is correct, but a very minor detail (e.g., omission of a prompt-caching discount) is missing or slightly incorrect, which can be resolved with minimal fact-checking.",
    "score5_description": "Every detail (release date, per-token pricing for input and output, etc.) perfectly matches official documentation with no false information.",
}


def openai_evaluate(prompt: str) -> str:
    # Ask gpt-4.1 to grade; the web search tool lets it check facts online.
    result = client.responses.create(
        model="gpt-4.1",
        tools=[{"type": "web_search"}],
        input=prompt,
    )
    # output_text collects the text of the message output, so the code does not
    # depend on where the message sits relative to any web search call items.
    return result.output_text


# Combine the instruction, response, and rubric into one grading prompt.
full_prompt = (
    f"[Instruction]\n{instruction}\n\n"
    f"[Response]\n{response}\n\n"
    f"[Rubric]\n{rubric_data}\n\n"
    "Provide me feedback and score only."
)

evaluation_output = openai_evaluate(full_prompt)

print("=== Evaluation Output ===")
print(evaluation_output)
```

GPT finds the incorrect information through web search and gives a low score. Its feedback says the response “contains numerous incorrect dates and pricing details, making fact-checking nearly impossible.” It scored the prompt 1.

```
=== Evaluation Output ===
The response contains incorrect per-token pricing details for both Opus 4 and Sonnet 4.
The correct pricing for Opus 4 is $15 per million input tokens and $75 per million output tokens, while for Sonnet 4, it is $3 per million input tokens and $15 per million output tokens.
Additionally, the release date of May 1, 2025, is not supported by the available sources.
Therefore, the response aligns with a score of 1, as it contains numerous incorrect dates and pricing details, making fact-checking nearly impossible.
```

Correct the wrong information in the LLM response, then evaluate again.

```python
response = (
    "Claude 4 (Opus 4 and Sonnet 4) was officially released on May 22, 2025. [oai_citation:0‡https://docs.anthropic.com/en/release-notes/system-prompts]\n"
    "Pricing details are as follows: [oai_citation:1‡https://www.anthropic.com/claude/opus] [oai_citation:2‡https://www.anthropic.com/claude/sonnet]\n"
    "- Opus 4: $15 per million input tokens, $75 per million output tokens. [oai_citation:1‡https://www.anthropic.com/claude/opus] \n"
    "- Sonnet 4: $3 per million input tokens, $15 per million output tokens. [oai_citation:2‡https://www.anthropic.com/claude/sonnet]\n"
)
```

GPT gives this feedback: “The response accurately states that Claude 4 (Opus 4 and Sonnet 4) was officially released on May 22, 2025. The per-token pricing details provided are also correct.” It scored the prompt 5.

```
=== Evaluation Output ===
The response accurately states that Claude 4 (Opus 4 and Sonnet 4) was officially released on May 22, 2025. ([reuters.com] (https://www.reuters.com/business/startup-anthropic-satus-its-new-ai-model-can-code-hours-time-2025-05-22/?utm_source=openai)) The per-token pricing details provided are also correct:
- Opus 4: $15 per million input tokens, $75 per million output tokens. ([itpro.com] (https://www.itpro.com/software/development/anthropic-claude-opus-4-software-development?utm_source=openai))
- Sonnet 4: $3 per million input tokens, $15 per million output tokens. ([itpro.com] (https://www.itpro.com/software/development/anthropic-claude-sonnet-4-software-development?utm_source=openai))
Therefore, the response perfectly matches official documentation with no false information.
**Score: 5**
```
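The three gpt-4.1 outputs in this section express the score in three different ways ("**Score:** 5/5", "a score of 1", "**Score: 5**"), which is awkward to aggregate. One possible remedy, sketched here as an assumption and not as what the article's code does, is to ask the judge for a JSON object and parse it. `build_judge_prompt` and `parse_verdict` are hypothetical helpers, and the model is not guaranteed to follow the format, so the parser validates what it gets.

```python
import json


def build_judge_prompt(instruction: str, response: str, rubric_data: dict) -> str:
    """Build a grading prompt that requests a machine-readable verdict."""
    rubric_json = json.dumps(rubric_data, ensure_ascii=False, indent=2)
    return (
        f"[Instruction]\n{instruction}\n\n"
        f"[Response]\n{response}\n\n"
        f"[Rubric]\n{rubric_json}\n\n"
        'Reply with a single JSON object of the form '
        '{"feedback": "<text>", "score": <integer 1-5>} and nothing else.'
    )


def parse_verdict(raw: str) -> tuple[str, int]:
    """Extract (feedback, score) from the judge's reply, tolerating surrounding text."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in judge output")
    verdict = json.loads(raw[start : end + 1])
    score = int(verdict["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return str(verdict["feedback"]), score


print(parse_verdict('{"feedback": "Pricing and date match.", "score": 5}'))
```

The prompt string returned by `build_judge_prompt` could be passed to `openai_evaluate` in place of `full_prompt`, with `parse_verdict` applied to the returned text.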

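Overall Assessment

gpt-4.1 is the stronger model of the two compared with prometheus-7b-v2.0, and with its multilingual support and integration with various tools it is strong at fact-checking based on web search. This is useful when expertise and up-to-date information matter. However, it uses many tokens, so costs can be high.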
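To get a feel for how token usage turns into cost, a back-of-the-envelope estimate can be scripted. The prices and token counts below are placeholders chosen for illustration, not actual gpt-4.1 pricing; substitute the current figures from the provider's pricing page. `estimate_cost_usd` is a hypothetical helper.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Cost in USD given token counts and prices per million tokens."""
    return (
        input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    ) / 1_000_000


# Placeholder assumptions: 1,000 evaluations, each with about 1,500 input
# tokens and 300 output tokens, at hypothetical prices of $2 / $8 per million.
evaluations = 1_000
total = estimate_cost_usd(
    input_tokens=1_500 * evaluations,
    output_tokens=300 * evaluations,
    input_price_per_mtok=2.0,
    output_price_per_mtok=8.0,
)
print(f"Estimated cost: ${total:.2f}")
```

This covers token charges only; any separate charge for tool calls such as web search would need to be added on top.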

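Closing Thoughts

Today's LLMs go beyond simply generating responses: they can write prompts themselves and improve them automatically with APE (Automatic Prompt Engineering) techniques. Models are advancing quickly, and prompts can now be optimized with far less effort.

As this article has shown, prompts can be evaluated in several ways, each with its own strengths and weaknesses. To get reliable results, multifaceted, repeated evaluation that takes the prompt's purpose and context into account is essential.

How about building an automated prompt evaluation pipeline on top of these strategies? It can help raise the quality of LLM responses, reduce operating costs, and improve work efficiency significantly.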
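A minimal skeleton for such a pipeline might look like the following. It is a sketch under stated assumptions: `generate` stands in for the LLM being prompted and `judge` for a grader such as Prometheus 2 or gpt-4.1, both passed in as plain functions. `evaluate_prompt_variants` and `best_variant` are hypothetical names, and each variant is scored several times and averaged to reflect the repeated evaluation recommended above.

```python
from typing import Callable


def evaluate_prompt_variants(
    variants: dict[str, str],
    generate: Callable[[str], str],
    judge: Callable[[str, str], int],
    repeats: int = 3,
) -> dict[str, float]:
    """Return the mean judge score per prompt variant over `repeats` runs."""
    results: dict[str, float] = {}
    for name, prompt in variants.items():
        # Regenerate and regrade each time, since both steps can vary between runs.
        scores = [judge(prompt, generate(prompt)) for _ in range(repeats)]
        results[name] = sum(scores) / len(scores)
    return results


def best_variant(mean_scores: dict[str, float]) -> str:
    """Name of the variant with the highest mean score."""
    return max(mean_scores, key=mean_scores.get)
```

In a real run, `generate` would call the model under test and `judge` would wrap one of the grading calls shown earlier; stubbing both makes it possible to test the loop itself without any model.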

References

Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena,” 2023, https://arxiv.org/html/2306.05685
prometheus-eval GitHub, https://github.com/prometheus-eval/prometheus-eval
Kim et al., “Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models,” 2024, https://arxiv.org/html/2405.01535
