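Local rule-based scorers
Fast, deterministic rule-based scorers, including:
- Task completion: checks response length and error status
- Tool-call accuracy: verifies that the actual tool calls match the expected ones
- Keyword coverage: checks how often the expected keywords appear
- Context retention: measures the entity reference rate across multi-turn conversations
- Risk disclosure: checks for investment risk warnings
- Prohibited-content detection: flags non-compliant statements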
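Three of the rules listed above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the `ExpectedRules` shape mirrors the `expected` block of the benchmark JSON shown later in this document, while the function names, the `RISK_MARKERS` list, and the case-insensitive matching are assumptions.

```typescript
// Hypothetical shapes for this sketch; the real BenchmarkCase type may differ.
interface ExpectedRules {
  keywords: string[];
  minKeywordCoverage: number;
  prohibitedPhrases: string[];
  requireRiskDisclosure: boolean;
}

interface RuleResult {
  name: string;
  passed: boolean;
  score: number;
}

// Assumed warning markers; a real deployment would maintain its own list.
const RISK_MARKERS: string[] = ['risk', 'not financial advice', '风险'];

function scoreKeywordCoverageSketch(text: string, rules: ExpectedRules): RuleResult {
  const lower = text.toLowerCase();
  const hits = rules.keywords.filter(k => lower.indexOf(k.toLowerCase()) !== -1);
  // With no expected keywords there is nothing to miss, so coverage counts as full.
  const score = rules.keywords.length === 0 ? 1 : hits.length / rules.keywords.length;
  return { name: 'keyword-coverage', passed: score >= rules.minKeywordCoverage, score };
}

function scoreRiskDisclosureSketch(text: string, rules: ExpectedRules): RuleResult {
  const lower = text.toLowerCase();
  const disclosed = RISK_MARKERS.some(m => lower.indexOf(m) !== -1);
  // Cases that do not require a disclosure always pass this rule.
  const passed = !rules.requireRiskDisclosure || disclosed;
  return { name: 'risk-disclosure', passed, score: passed ? 1 : 0 };
}

function scoreProhibitedSketch(text: string, rules: ExpectedRules): RuleResult {
  const found = rules.prohibitedPhrases.filter(p => text.indexOf(p) !== -1);
  const passed = found.length === 0;
  return { name: 'prohibited-words', passed, score: passed ? 1 : 0 };
}

const sampleRules: ExpectedRules = {
  keywords: ['P/E', 'valuation', 'average', 'Apple'],
  minKeywordCoverage: 0.5,
  prohibitedPhrases: ['guaranteed return'],
  requireRiskDisclosure: true,
};
const sampleOutput = "Apple's P/E is above its 5-year average. Investing carries risk.";

const coverageResult = scoreKeywordCoverageSketch(sampleOutput, sampleRules);
const riskResult = scoreRiskDisclosureSketch(sampleOutput, sampleRules);
const prohibitedResult = scoreProhibitedSketch(sampleOutput, sampleRules);
```

Because every rule is a pure function of the output text and the case's expectations, the same input always produces the same score, which is what makes these scorers cheap enough to run on every case.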
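Investment Agent ships with a complete, built-in agent evaluation system for assessing how AI agents perform in investment-analysis scenarios. The system is built on a modular architecture and supports multi-dimensional evaluation, cross-engine comparison, and real-time progress tracking, and it produces targeted improvement suggestions.
The evaluation system uses a layered architecture with four core layers: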
```text
┌──────────────────────────────────────────────────────────────┐
│  Application layer                                           │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐  │
│  │ Web UI         │  │ CLI            │  │ API Route      │  │
│  │ (React)        │  │ (Command)      │  │ (Next.js)      │  │
│  └────────────────┘  └────────────────┘  └────────────────┘  │
└──────────────────────────────────────────────────────────────┘
                               ↓
┌──────────────────────────────────────────────────────────────┐
│  Core layer                                                  │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐  │
│  │ Evaluator      │  │ Scorers        │  │ Runners        │  │
│  │ (coordinator)  │  │ (scorer set)   │  │ (exec engines) │  │
│  └────────────────┘  └────────────────┘  └────────────────┘  │
└──────────────────────────────────────────────────────────────┘
                               ↓
┌──────────────────────────────────────────────────────────────┐
│  Adapter layer                                               │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐  │
│  │ WebAPI Runner  │  │ Real Runner    │  │ Event Collector│  │
│  │ (HTTP/SSE)     │  │ (Hermes SDK)   │  │ (event capture)│  │
│  └────────────────┘  └────────────────┘  └────────────────┘  │
└──────────────────────────────────────────────────────────────┘
                               ↓
┌──────────────────────────────────────────────────────────────┐
│  Data layer                                                  │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐  │
│  │ Benchmark      │  │ Persistence    │  │ Reports        │  │
│  │ Cases (JSON)   │  │ (SQLite)       │  │ (JSON/MD)      │  │
│  └────────────────┘  └────────────────┘  └────────────────┘  │
└──────────────────────────────────────────────────────────────┘
```

The Evaluator is the core coordinating module of the evaluation system. It is responsible for:
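```typescript
export async function evaluateCases(
  cases: BenchmarkCase[],
  options: EvaluateCasesOptions,
): Promise<EvaluationReport> {
  // 1. Run the test cases concurrently
  const results = await mapWithConcurrency(cases, options.concurrency, evaluateOneCase);

  // 2. Compute the summary statistics
  const summary = summarizeResults(results);

  // Assemble the report (any other EvaluationReport fields are omitted here)
  const report = { results, summary, suggestions: [] } as EvaluationReport;

  // 3. Generate rule-based improvement suggestions
  report.suggestions = generateRuleBasedSuggestions(report);

  // 4. Optional: augment the suggestions with LLM-Judge
  if (options.llmJudge) {
    const llmSuggestions = await generateLlmSuggestions(report, options.llmJudge);
    report.suggestions = report.suggestions.concat(llmSuggestions);
  }

  return report;
}
```

Scorers fall into two broad categories: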
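Local rule-based scorers
Fast, deterministic rule-based scorers: task completion, tool-call accuracy, keyword coverage, context retention, risk disclosure, and prohibited-content detection.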
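LLM scorers (Mastra)
Deep scorers based on LLM semantic understanding. They require additional LLM API calls and include, for example, answer relevancy, faithfulness, hallucination detection, toxicity, and bias.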
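Runners execute the agent and collect the run record:
Use case (Web API Runner): calling an already deployed agent service over its HTTP API

```typescript
interface WebApiRunOptions {
  baseUrl: string;       // API base URL
  model: string;         // model name
  provider: string;      // model provider
  maxIterations: number; // maximum number of iterations
  timeoutMs: number;     // timeout in milliseconds
  authToken?: string;    // auth token
}
```

Key features:
Use case (Real Runner): calling the agent SDK directly (only Hermes is supported at present)

```typescript
interface RealRunOptions {
  model: string;     // model name
  provider: string;  // model provider
  timeoutMs: number; // timeout in milliseconds
  userId: number;    // user ID
}
```

Key features:
Use case: testing the evaluation pipeline itself, without actually invoking an agent
Returned data: high-quality simulated responses generated with GPT-4. Purpose: validating scorer logic and testing CI/CD pipelines.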
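A mock mode of this kind can be as small as a lookup table of canned responses. The sketch below is an assumption about how such a runner might look, not the project's code: `runMockCase`, `CANNED_RESPONSES`, the `'mock'` engine name, and the trimmed-down record type are all hypothetical, and the canned text is hard-coded here rather than pre-generated by a model.

```typescript
// Trimmed-down record for this sketch; the full run record has more fields.
interface MockRunRecord {
  id: string;
  engine: string;
  input: string;
  output: string;
  toolCalls: { name: string; isError: boolean }[];
  status: 'completed' | 'failed';
}

// Hypothetical canned responses keyed by benchmark case id.
const CANNED_RESPONSES: Record<string, { output: string; tools: string[] }> = {
  'asset-query-001': {
    output: "Apple's P/E is above its 5-year average. Investing carries risk.",
    tools: ['stock_get_price'],
  },
};

function runMockCase(caseId: string, input: string, engine: string = 'mock'): MockRunRecord {
  const canned = CANNED_RESPONSES[caseId];
  if (!canned) {
    // Unknown cases fail loudly so a missing fixture is noticed in CI.
    return { id: `mock-${caseId}`, engine, input, output: '', toolCalls: [], status: 'failed' };
  }
  return {
    id: `mock-${caseId}`,
    engine,
    input,
    output: canned.output,
    toolCalls: canned.tools.map(name => ({ name, isError: false })),
    status: 'completed',
  };
}

const mockKnown = runMockCase('asset-query-001', "What is Apple's current P/E ratio?");
const mockUnknown = runMockCase('no-such-case', 'anything');
```

Since the output never changes between runs, any change in the resulting scores points at the scorers or the pipeline rather than at the agent.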
The data flow of the evaluation system has four stages:
Case loading stage

```typescript
// Load test cases from JSON files
const cases = loadBenchmarkCases(['asset-query', 'portfolio-analysis']);
// Each case includes: input, expected keywords, expected tools, difficulty level, and so on
```

Execution stage

```typescript
// Choose the runner according to the configured transport
const record = options.transport === 'web-api'
  ? await runWebApiCase(testCase, engine, webApiOptions)
  : await runRealCase(testCase, engine, realOptions);

// Shape of the collected run record
interface EvaluationRunRecord {
  id: string;
  engine: EvaluationEngine;
  input: string | EvaluationMessage[];
  output: string;
  messages: EvaluationMessage[];
  toolCalls: EvaluationToolCall[];
  status: 'completed' | 'failed';
  cost: {
    inputTokens?: number;
    outputTokens?: number;
    costUsd?: number;
  };
}
```

Scoring stage

```typescript
// Run all local scorers (declared with `let` because Mastra results may be appended)
let scorers = runAllScorers(testCase, record);

// Optional: run the Mastra LLM scorers
if (options.mastraModel) {
  const mastraResults = await runMastraScorers(testCase, record, options.mastraModel);
  scorers = scorers.concat(mastraResults);
}

// Compute the per-dimension scores
const dimensionScores = calculateDimensionScores(scorers);
```

Aggregation and suggestion stage

```typescript
// Compute the overall score
const score = average(Object.values(dimensionScores));

// Generate improvement suggestions
const suggestions = generateRuleBasedSuggestions(report);
// Optional: LLM-augmented suggestions
if (options.llmJudge) {
  const llmSuggestions = await generateLlmSuggestions(report, options.llmJudge);
}
```

The evaluation system assesses agent performance along five core dimensions, each backed by several scorers:
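Mission dimension: evaluates whether the agent successfully completed the user's investment-analysis task.
Core scorer
task-completion
completed
Scoring criteria

```typescript
{
  dimension: 'mission',
  name: 'task-completion',
  passed: hasOutput && hasNoError,
  score: hasOutput && hasNoError ? 1.0 : 0.0
}
```

Action dimension: evaluates whether the agent correctly called the expected financial tools (for example stock lookup or news search).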
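Scoring mechanism
tool-call-accuracy
Formula

```typescript
// Expected tools that were actually called
const matched = expected.filter(tool => actual.includes(tool));
// Number of tool calls that ended in an error
const errored = toolCalls.filter(call => call.isError).length;
// Match ratio, halved if any tool call failed
const score = (matched.length / expected.length) * (errored > 0 ? 0.5 : 1.0);
```

Context dimension: evaluates whether the agent can refer back to previously mentioned entities in a multi-turn conversation.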
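Entity extraction
context-retention
Key entities from the earlier messages are identified automatically:
Scoring threshold

```typescript
// Extract entities from the prior messages
const entities = extractEntities(priorMessages);
// Compute the reference rate
const referenced = entities.filter(e => output.includes(e));
const score = referenced.length / entities.length;
// Pass threshold: 30%
const passed = score >= 0.3;
```

Execution dimension: a composite assessment of response quality, covering several sub-scorers: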
- Keyword coverage: keyword-coverage
- Data accuracy: data-accuracy
- Advice quality: advice-quality
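Ethics dimension: evaluates whether the output meets the compliance requirements for investment advice.
Risk disclosure
risk-disclosure
Checks whether the response includes an investment risk warning:
Prohibited-content detection
prohibited-words
Detects prohibited statements: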
The final score is the unweighted mean of the five dimension scores:

```typescript
function calculateDimensionScores(scorers: ScorerResult[]): Record<Dimension, number> {
  const dimensions: Dimension[] = ['mission', 'action', 'context', 'execution', 'ethics'];

  // Each dimension score is the mean of the scorers that belong to it
  return Object.fromEntries(
    dimensions.map(dimension => [
      dimension,
      average(scorers.filter(s => s.dimension === dimension).map(s => s.score)),
    ]),
  ) as Record<Dimension, number>;
}

// Final score = mean of all dimension scores
function calculateFinalScore(dimensionScores: Record<Dimension, number>): number {
  return average(Object.values(dimensionScores));
}

// A case passes when the total score >= threshold AND the mission score > 0
function isPassed(
  score: number,
  dimensionScores: Record<Dimension, number>,
  threshold: number,
): boolean {
  return score >= threshold && dimensionScores.mission > 0;
}
```

The system ships with five preset evaluation categories that cover the main investment-analysis scenarios:
Test scenarios
Key evaluation points
Typical case:

```json
{
  "id": "asset-query-001",
  "title": "AAPL fundamental valuation",
  "difficulty": "medium",
  "input": "What is Apple's current P/E ratio and how does it compare to its 5-year average?",
  "expected": {
    "keywords": ["P/E", "valuation", "average", "Apple"],
    "minKeywordCoverage": 0.5,
    "requireRiskDisclosure": true,
    "tools": ["stock_get_price"]
  }
}
```

Test scenarios
Key evaluation points
Test scenarios
Key evaluation points
Test scenarios
Key evaluation points
Typical multi-turn case:

```json
{
  "id": "multi-turn-001",
  "title": "Progressive portfolio deep-dive",
  "difficulty": "hard",
  "input": [
    { "role": "user", "content": "What's the current price of AAPL?" },
    { "role": "assistant", "content": "Apple (AAPL) is currently trading at $178.50." },
    { "role": "user", "content": "How does it compare to its 52-week high?" },
    { "role": "assistant", "content": "AAPL's 52-week high is $199.62. It's about 10.6% below that peak." },
    { "role": "user", "content": "What's driving the decline and should I buy now?" }
  ]
}
```

Test scenarios
Key evaluation points
Local scorers run directly inside the evaluation process and need no external API calls:

```typescript
export function runAllScorers(
  testCase: BenchmarkCase,
  record: EvaluationRunRecord,
): ScorerResult[] {
  return [
    scoreMission(record),                   // task completion
    scoreToolCalls(testCase, record),       // tool calls
    scoreContext(testCase, record),         // context retention
    scoreKeywordCoverage(testCase, record), // keyword coverage
    scoreRiskDisclosure(testCase, record),  // risk disclosure
    scoreProhibitedWords(testCase, record), // prohibited content
    scoreDataAccuracy(testCase, record),    // data accuracy
    scoreAdviceQuality(testCase, record),   // advice quality
  ];
}
```

Mastra scorers use an LLM to perform deep, semantic-level evaluation:

```typescript
async function runMastraScorer(
  name: string,
  record: EvaluationRunRecord,
  model: MastraModelConfig,
  options?: WrapperOptions,
): Promise<ScorerResult | null> {
  const registry = getGlobalRegistry({ model });
  if (!registry.has(name)) return null;

  // Invoke the Mastra scorer
  const result = await registry.run(name, record, options);

  // Degradation: if Mastra is unavailable, return null so the caller
  // can fall back to the local scorer
  if (result.score === 0 && result.reason.includes('not available')) {
    return null;
  }

  return result;
}
```

Supported Mastra scorers:
| Scorer | Dimension | Description | Fallback strategy |
|---|---|---|---|
| answer-relevancy | execution | Answer relevancy | Returns score: 0, passed: false |
| faithfulness | execution | Faithfulness to context | Returns score: 0, passed: false |
| hallucination | execution | Hallucination detection | Returns score: 0, passed: false |
| completeness | execution | Completeness | Falls back to keyword-coverage |
| content-similarity | execution | Content similarity | Falls back to keyword-coverage |
| keyword-coverage | execution | Keyword coverage | Local implementation |
| toxicity | ethics | Toxicity detection | Returns score: 0, passed: false |
| bias | ethics | Bias detection | Returns score: 0, passed: false |
| prompt-alignment | ethics | Prompt alignment | Returns score: 0, passed: false |
| context-relevance | context | Context relevance | Returns score: 0, passed: false |
| tone | execution | Tone analysis | Returns score: 0, passed: false |
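The fallback column of the table can be read as a small policy lookup. The following sketch shows one way a caller might apply it when a Mastra scorer returns `null`; `resolveWithFallback`, `FALLBACK_POLICY`, and the `ScorerResultSketch` type are hypothetical names introduced for illustration, and only a subset of the table is transcribed.

```typescript
interface ScorerResultSketch {
  dimension: string;
  name: string;
  passed: boolean;
  score: number;
  reason: string;
}

type FallbackPolicy = 'zero' | 'keyword-coverage';

// Degradation policy per scorer, transcribed from the table (subset).
const FALLBACK_POLICY: Record<string, FallbackPolicy> = {
  'answer-relevancy': 'zero',
  'faithfulness': 'zero',
  'completeness': 'keyword-coverage',
  'content-similarity': 'keyword-coverage',
};

function resolveWithFallback(
  name: string,
  dimension: string,
  mastraResult: ScorerResultSketch | null,
  localKeywordCoverage: ScorerResultSketch,
): ScorerResultSketch {
  // A real Mastra result always wins.
  if (mastraResult !== null) return mastraResult;
  if (FALLBACK_POLICY[name] === 'keyword-coverage') {
    // Reuse the local score but keep the Mastra scorer's name so reports stay comparable.
    return {
      ...localKeywordCoverage,
      name,
      reason: `fell back to keyword-coverage: ${localKeywordCoverage.reason}`,
    };
  }
  return { dimension, name, passed: false, score: 0, reason: 'Mastra scorer not available' };
}

const localCoverage: ScorerResultSketch = {
  dimension: 'execution',
  name: 'keyword-coverage',
  passed: true,
  score: 0.75,
  reason: '3 of 4 keywords found',
};
const degradedCompleteness = resolveWithFallback('completeness', 'execution', null, localCoverage);
const degradedFaithfulness = resolveWithFallback('faithfulness', 'execution', null, localCoverage);
```

One consequence worth noting: scorers with the zero fallback pull the dimension average down when the LLM API is unreachable, so a sudden drop across several of them may indicate a configuration problem rather than a worse agent.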
The Web API Runner communicates with the deployed agent service over HTTP SSE:

```typescript
interface StreamState {
  sawText: boolean;
  finalContent?: string;
  tokens: StreamTokens;
  traceMetrics?: TraceMetrics;
  traceCost?: TraceCost;
}

async function runWebApiCase(
  testCase: BenchmarkCase,
  engine: EvaluationEngine,
  options: WebApiRunOptions,
): Promise<EvaluationRunRecord> {
  // 1. Initialize the event collector
  const collector = new EvaluationEventCollector({
    agentId: 'investment_advisor',
    caseId: testCase.id,
    engine,
    input: testCase.input,
  });
  // Per-request stream state (the empty initial token counters are an assumption)
  const state: StreamState = { sawText: false, tokens: {} as StreamTokens };

  // 2. Build the request
  const endpoint = endpointForEngine(engine); // hermes → /api/chat/hermes
  const requestBody = buildRequestBody(testCase, engine, options);

  // 3. Send the SSE streaming request
  const response = await fetch(new URL(endpoint, options.baseUrl), {
    method: 'POST',
    headers: { 'Accept': 'text/event-stream', 'Content-Type': 'application/json' },
    body: JSON.stringify(requestBody),
  });
  if (!response.ok || !response.body) {
    // Illustrative error code; record the failure instead of reading a missing body
    collector.fail(`HTTP ${response.status}`, 'http_error');
    return collector.toRecord();
  }

  // 4. Parse the streamed events
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let pending = '';

  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;

    // Split the buffer into SSE event blocks; keep the incomplete tail for the next read
    const split = splitSseEvents(pending + decoder.decode(value, { stream: true }));
    pending = split.pending;

    // Handle each complete event
    for (const part of split.parts) {
      const event = parseSsePayload(part);
      if (event) applyStreamEvent(collector, event, state);
    }
  }

  // 5. Return the run record
  return collector.toRecord();
}
```

Supported event types:
| Event type | Description | Data extraction |
|---|---|---|
| text | Text delta | collector.addAssistantDelta(event.delta) |
| tool_use | Tool call | collector.addToolCall({ name, args }) |
| span_end | Span end | Records tool-call duration and status |
| trace_end | Trace end | Extracts metrics and cost data |
| result | Final result | Extracts the full response and token statistics |
| error | Error | collector.fail(message, code) |
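To make the event table concrete, here is a self-contained sketch of parsing SSE blocks and dispatching three of the event types. It is an illustration under stated assumptions: the JSON payload shape (a `type` field plus `delta`, `name`, `args`, `message`) is inferred from the table rather than taken from the service's wire format, and `parseSsePayloadSketch`, `applyStreamEventSketch`, and `CollectedRun` are hypothetical stand-ins for the real functions and collector.

```typescript
// Assumed payload shapes for three of the event types in the table.
type StreamEventSketch =
  | { type: 'text'; delta: string }
  | { type: 'tool_use'; name: string; args: Record<string, unknown> }
  | { type: 'error'; message: string; code?: string };

function parseSsePayloadSketch(block: string): StreamEventSketch | null {
  // Join all `data:` lines of one SSE event block, as the SSE format prescribes.
  const data = block
    .split(/\r?\n/)
    .filter(line => line.indexOf('data:') === 0)
    .map(line => line.slice(5).replace(/^ /, ''))
    .join('\n');
  if (data === '') return null; // comments and keep-alives carry no data
  try {
    return JSON.parse(data) as StreamEventSketch;
  } catch {
    return null; // skip malformed payloads rather than abort the run
  }
}

interface CollectedRun {
  output: string;
  toolCalls: string[];
  status: 'completed' | 'failed';
  errorMessage?: string;
}

function applyStreamEventSketch(run: CollectedRun, ev: StreamEventSketch): void {
  switch (ev.type) {
    case 'text':
      run.output += ev.delta;
      break;
    case 'tool_use':
      run.toolCalls.push(ev.name);
      break;
    case 'error':
      run.status = 'failed';
      run.errorMessage = ev.message;
      break;
  }
}

const rawStream =
  'data: {"type":"text","delta":"AAPL "}\n\n' +
  'data: {"type":"tool_use","name":"stock_get_price","args":{}}\n\n' +
  'data: {"type":"text","delta":"is at 178.50"}\n\n';

const collected: CollectedRun = { output: '', toolCalls: [], status: 'completed' };
const blocks = rawStream.split(/\r?\n\r?\n/);
const leftover = blocks.pop() ?? ''; // incomplete tail, empty here
for (const block of blocks) {
  const ev = parseSsePayloadSketch(block);
  if (ev) applyStreamEventSketch(collected, ev);
}
```

Text deltas are appended in arrival order, so the final `output` is the concatenation of every `text` event seen before the stream ended.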
The Real Runner calls the agent SDK directly and currently supports the Hermes engine:

```typescript
export async function runRealCase(
  testCase: BenchmarkCase,
  engine: EvaluationEngine,
  options: RealRunOptions,
): Promise<EvaluationRunRecord> {
  // 1. Dynamically import the Hermes Agent
  const hermes = await import('@investment-agent/hermes-agent');

  // 2. Initialize the event collector
  const collector = new EvaluationEventCollector({
    agentId: 'investment_advisor',
    caseId: testCase.id,
    engine,
    input: testCase.input,
  });

  // 3. Create the Hermes Agent instance
  const agent = new hermes.HermesAgent({
    model: hermes.getModel(options.provider, options.model),
    callbacks: {
      onTextDelta: (delta) => collector.addAssistantDelta(delta),
      onToolStart: (name, args) => collector.addToolCall({ name, args, isError: false }),
      onToolEnd: (result) => collector.addToolCall({
        name: result.toolName,
        durationMs: result.durationMs,
        isError: result.isError,
      }),
      onError: (error) => collector.fail(error.message, 'hermes_error'),
    },
    streaming: true,
    maxIterations: 15,
  });

  // Split the case input into prior history and the final user message
  const inputMessages = typeof testCase.input === 'string'
    ? [{ role: 'user', content: testCase.input }]
    : testCase.input;
  const previousMessages = inputMessages.slice(0, -1);
  const lastUserMessage = inputMessages[inputMessages.length - 1].content;

  // 4. Run the agent
  const result = await agent.run({
    context: { messages: previousMessages },
    message: lastUserMessage,
  });

  // 5. Return the run record
  return collector.toRecord();
}
```

The Event Collector is the core component for gathering the run record:
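```typescript
export class EvaluationEventCollector {
  private messages: EvaluationMessage[] = [];
  private toolCalls: EvaluationToolCall[] = [];
  private output: string = '';
  private status: 'completed' | 'failed' = 'completed';
  private error?: { message: string; code?: string };
  // Start timestamp captured when the collector is created
  private readonly startTime = new Date().toISOString();

  // Config fields match the constructor calls in the runners
  constructor(
    private readonly config: {
      agentId: string;
      caseId: string;
      engine: EvaluationEngine;
      input: string | EvaluationMessage[];
    },
  ) {}

  // Append a streamed text fragment to the final output
  addAssistantDelta(delta: string): void {
    this.output += delta;
  }

  // Record a tool call, filling in defaults for missing fields
  addToolCall(call: Partial<EvaluationToolCall>): void {
    this.toolCalls.push({
      name: call.name ?? 'unknown',
      args: call.args ?? {},
      isError: call.isError ?? false,
      durationMs: call.durationMs,
      error: call.error,
      result: call.result,
    });
  }

  // Mark the run as failed
  fail(message: string, code?: string): void {
    this.status = 'failed';
    this.error = { message, code };
  }

  // Produce the final run record
  toRecord(): EvaluationRunRecord {
    return {
      id: generateId(),
      engine: this.config.engine,
      caseId: this.config.caseId,
      agentId: this.config.agentId,
      input: this.config.input,
      output: this.output,
      messages: this.messages,
      toolCalls: this.toolCalls,
      status: this.status,
      error: this.error,
      startedAt: this.startTime,
      completedAt: new Date().toISOString(),
      cost: {},
      trace: { spans: [], metrics: [] },
    };
  }
}
```

The system analyzes patterns in the failed cases and generates structured improvement suggestions: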
```typescript
interface StructuredSuggestion {
  id: string;
  category: 'system-prompt' | 'tool-config' | 'architecture' | 'timeout';
  title: string;
  description: string;
  dimension: string;
  affectedCases: string[];
  effort: 'small' | 'medium' | 'large';
  priority: 'high' | 'medium' | 'low';
}

// The default threshold is an assumption; it matches the 0.7 used in the usage examples
function generateRuleBasedSuggestions(
  report: EvaluationReport,
  threshold = 0.7,
): StructuredSuggestion[] {
  const suggestions: StructuredSuggestion[] = [];

  // 1. Dimension score analysis
  for (const [dimension, score] of Object.entries(report.summary.byDimension)) {
    if (score < threshold) {
      suggestions.push(DIMENSION_SUGGESTIONS[dimension]);
    }
  }

  // 2. Pattern matching on failed scorers
  for (const failedCase of report.results.filter(r => !r.passed)) {
    for (const scorer of failedCase.scorers) {
      if (!scorer.passed) {
        const matchedPattern = SCORER_PATTERNS.find(p => p.match(scorer.reason));
        if (matchedPattern) {
          suggestions.push(matchedPattern.template);
        }
      }
    }
  }

  // 3. Sort by priority
  return suggestions.sort((a, b) =>
    PRIORITY_ORDER[a.priority] - PRIORITY_ORDER[b.priority]
  );
}
```

Mission dimension
Action dimension
Context dimension
Execution dimension
Ethics dimension
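LLM-Judge can optionally be enabled to generate more precise improvement suggestions:

```typescript
export async function generateLlmSuggestions(
  report: EvaluationReport,
  options: LlmJudgeOptions,
): Promise<EvaluationSuggestion[]> {
  const prompt = buildSuggestionPrompt(report);

  const response = await fetch(options.baseUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: options.model,
      messages: [
        { role: 'system', content: SUGGESTION_SYSTEM_PROMPT },
        { role: 'user', content: prompt },
      ],
    }),
  });

  return parseSuggestions(await response.json());
}
```

Open the evaluation page: go to the Settings → Evaluation page
Configure the evaluation parameters
Start the evaluation: click the Start Evaluation button and watch the progress in real time
View the results
Compare with history: on the History tab, review all evaluation runs and compare the effect of different configurations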
```bash
# Run the full evaluation (all categories)
pnpm eval

# Specify the engine and categories
pnpm eval --engine hermes \
  --categories asset-query,portfolio-analysis

# Set the concurrency and threshold
pnpm eval --concurrency 5 \
  --threshold 0.75

# Use Web API mode
pnpm eval --transport web-api \
  --base-url http://localhost:3000 \
  --model gpt-4o-mini

# Enable the Mastra scorers
pnpm eval --mastra-model openai/gpt-4o-mini

# Enable LLM-Judge suggestions
pnpm eval --llm-judge \
  --llm-judge-model gpt-4

# Limit the number of test cases
pnpm eval --limit 10

# Use a custom run ID
pnpm eval --run-id eval-$(date +%Y%m%d-%H%M%S)
```

```typescript
import {
  loadBenchmarkCases,
  evaluateCases,
  summarizeResults,
} from '@investment-agent/evaluation';

// Load the test cases
const cases = loadBenchmarkCases(['asset-query', 'portfolio-analysis']);

// Run the evaluation
const report = await evaluateCases(cases, {
  engine: 'hermes',
  categories: ['asset-query', 'portfolio-analysis'],
  threshold: 0.7,
  concurrency: 3,
  transport: 'web-api',
  runId: 'eval-custom-001',
  webApiRun: {
    baseUrl: 'http://localhost:3000',
    model: 'Kimi-K2.6',
    provider: 'ant',
    maxIterations: 15,
    timeoutMs: 180000,
  },
  // Optional: enable the Mastra scorers
  mastraModel: 'openai/gpt-4o-mini',
  // Optional: enable LLM-Judge suggestions
  llmJudge: {
    baseUrl: 'https://api.openai.com/v1',
    model: 'gpt-4',
    provider: 'openai',
  },
});

// Print the results
console.log(`Score: ${report.summary.score.toFixed(3)}`);
console.log(`Passed: ${report.summary.passed}/${report.summary.total}`);
console.log(`By Dimension:`, report.summary.byDimension);
console.log(`Suggestions:`, report.suggestions);
```

```typescript
import { FilePersistenceAdapter } from '@investment-agent/evaluation/persistence';

// Use file-based persistence
const persistence = new FilePersistenceAdapter('./eval-reports');

const report = await evaluateCases(cases, {
  // ...options
  persistenceAdapter: persistence,
});

// The report is saved automatically to ./eval-reports/eval-{runId}.json
```

The evaluation system stores run records in a SQLite database:
```typescript
// Evaluation run table
interface EvaluationRun {
  id: string;
  engine: string;
  categories: string; // JSON array
  status: 'running' | 'completed' | 'failed';
  score: number;
  totalCases: number;
  passedCases: number;
  failedCases: number;
  threshold: number;
  createdAt: string;
  completedAt?: string;
}

// Case result table
interface CaseResult {
  id: number;
  runId: string;
  caseId: string;
  category: string;
  passed: boolean;
  score: number;
  dimensionScores: string; // JSON object
  engine: string;
  runRecord: string;       // JSON object
  scorers: string;         // JSON array
}
```

Querying the history:

```text
// Query through the API
GET /api/evaluation?limit=20&engine=hermes

// Fetch the details of a single run
GET /api/evaluation/{runId}

// Real-time progress stream
GET /api/evaluation/{runId}/stream (EventSource)
```

After prompt changes
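After every change to the System Prompt or to a tool description, run an evaluation to verify the effect.
On model upgrades
Before switching or upgrading the model version, run an evaluation to establish a baseline, then compare again after the upgrade.
Pre-release verification
Run the full evaluation before an official release to make sure the score does not fall below the baseline.
Periodic regression
Run the full evaluation about once a week to monitor the trend in agent performance.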
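Comparing a run against a stored baseline, as recommended above, amounts to a per-dimension diff. The sketch below is one hedged way to do it: `findRegressions` and the `0.05` tolerance are assumptions for illustration, and the dimension names are the five used by the scoring code in this document.

```typescript
type DimensionName = 'mission' | 'action' | 'context' | 'execution' | 'ethics';
type DimensionScores = Record<DimensionName, number>;

interface Regression {
  dimension: DimensionName;
  baseline: number;
  current: number;
  drop: number;
}

// Returns the dimensions whose score fell by more than `tolerance`, worst first.
function findRegressions(
  baseline: DimensionScores,
  current: DimensionScores,
  tolerance: number = 0.05, // assumed noise margin, tune per project
): Regression[] {
  return (Object.keys(baseline) as DimensionName[])
    .map(d => ({
      dimension: d,
      baseline: baseline[d],
      current: current[d],
      drop: baseline[d] - current[d],
    }))
    .filter(r => r.drop > tolerance)
    .sort((a, b) => b.drop - a.drop);
}

const baselineScores: DimensionScores = {
  mission: 1.0, action: 0.9, context: 0.8, execution: 0.75, ethics: 1.0,
};
const currentScores: DimensionScores = {
  mission: 1.0, action: 0.88, context: 0.6, execution: 0.5, ethics: 1.0,
};
const regressions = findRegressions(baselineScores, currentScores);
```

In this example `execution` and `context` are flagged, while the small dip in `action` stays within the tolerance. A release gate could fail the build whenever the returned list is non-empty.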
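Establish a baseline: run the full evaluation once and record the score of each dimension as the baseline
Targeted verification: after changing a specific feature, run the evaluation for the related category, for example the asset-query category or the multi-turn category
Compare engines: use the same benchmark set to compare Hermes, Claude, and DeepAgents
Watch low-scoring dimensions: prioritize the dimensions that stay below the threshold
Extend the case set: add custom evaluation cases based on real business scenarios
Add JSON files under the packages/evaluation/src/benchmarks/datasets/ directory:

```json
[
  {
    "id": "custom-query-001",
    "title": "Custom stock analysis",
    "difficulty": "medium",
    "category": "asset-query",
    "input": "分析茅台的投资价值",
    "expected": {
      "keywords": ["茅台", "估值", "增长", "风险"],
      "minKeywordCoverage": 0.5,
      "prohibitedPhrases": ["保证收益", "必涨"],
      "requireRiskDisclosure": true,
      "tools": ["stock_get_price", "stock_search_news"]
    }
  }
]
```

Possible causes:
Troubleshooting steps: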
Solution: the Mastra scorers require an LLM API key and base URL to be configured:

```typescript
const report = await evaluateCases(cases, {
  // ...other options
  mastraModel: 'openai/gpt-4o-mini',
  llmJudge: {
    baseUrl: 'https://api.openai.com/v1',
    model: 'gpt-4',
    provider: 'openai',
  },
});
```

Check the environment variables:

```bash
export OPENAI_API_KEY=sk-xxx
# or
export ANTHROPIC_API_KEY=sk-xxx
```

Solution: increase the timeout and reduce the concurrency:

```bash
pnpm eval --timeout 300000 --concurrency 1
```

Or configure it through the API:

```typescript
await evaluateCases(cases, {
  // ...
  webApiRun: {
    timeoutMs: 300000, // 5 minutes
  },
});
```

Solution: clean up old evaluation records periodically:

```sql
-- Delete evaluation records older than 30 days
DELETE FROM evaluation_runs WHERE created_at < datetime('now', '-30 days');
DELETE FROM case_results WHERE run_id NOT IN (SELECT id FROM evaluation_runs);
```

Or keep only the most recent N records:

```sql
DELETE FROM evaluation_runs
WHERE id NOT IN (
  SELECT id FROM evaluation_runs
  ORDER BY created_at DESC
  LIMIT 20
);
```

The evaluation system uses concurrency control to keep resource usage reasonable:
```typescript
async function mapWithConcurrency<T, R>(
  items: T[],
  concurrency: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let nextIndex = 0;

  // Each worker pulls the next unclaimed index until the list is exhausted
  async function worker() {
    while (nextIndex < items.length) {
      const index = nextIndex++;
      results[index] = await fn(items[index]);
    }
  }

  // Start at most `concurrency` workers; results keep the input order
  await Promise.all(Array.from({ length: Math.min(concurrency, items.length) }, worker));
  return results;
}
```

Recommended settings:
The Web API Runner uses streaming to avoid timeouts:

```typescript
// SSE stream parsing
function splitSseEvents(buffer: string): { pending: string; parts: string[] } {
  const parts = buffer.split(/\r?\n\r?\n/);
  // The last fragment may be incomplete, so carry it over to the next read
  const pending = parts.pop() ?? '';
  return { pending, parts };
}

// Parse incrementally to keep memory usage low
for (const part of split.parts) {
  const event = parseSsePayload(part);
  if (event) applyStreamEvent(collector, event, state);
}
```

The evaluation system is deeply integrated with Investment Agent's observability framework:
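Trace correlation
Each evaluated case produces a complete trace record, including spans, metrics, and logs.
Metrics reporting
Evaluation metrics can be exported to a monitoring system:
Cost accounting
The following are tallied automatically for each evaluation run:
Historical comparison
SQLite persistence supports: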
```typescript
// A Hermes Agent run result carries the full observability data
interface HermesResult {
  completed: boolean;
  observability: {
    tokens: {
      input: number;
      output: number;
      total: number;
    };
    cost: number;
    duration: number;
    toolCalls: number;
    iterations: number;
  };
  trace: {
    traceId: string;
    spans: Span[];
  };
}
```

```typescript
export function scoreCustomMetric(
  testCase: BenchmarkCase,
  record: EvaluationRunRecord,
): ScorerResult {
  // Implement the custom scoring logic
  const score = calculateCustomScore(testCase, record);

  return {
    dimension: 'execution', // or a custom dimension
    name: 'custom-metric',
    passed: score >= 0.7,
    reason: `Custom metric score: ${score.toFixed(2)}`,
    score,
  };
}

// Register it in runAllScorers
export function runAllScorers(
  testCase: BenchmarkCase,
  record: EvaluationRunRecord,
): ScorerResult[] {
  return [
    // ...existing scorers
    scoreCustomMetric(testCase, record),
  ];
}
```

```typescript
export interface CustomRunOptions {
  // custom configuration
}

export async function runCustomCase(
  testCase: BenchmarkCase,
  options: CustomRunOptions,
): Promise<EvaluationRunRecord> {
  // Implement the custom execution logic
  const collector = new EvaluationEventCollector({
    agentId: 'custom_agent',
    caseId: testCase.id,
    engine: 'custom',
    input: testCase.input,
  });

  // Run the custom agent
  // ...

  return collector.toRecord();
}

// Add engine support in evaluator.ts
if (options.engine === 'custom') {
  record = await runCustomCase(testCase, options.customRun);
}
```
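The custom scorer in this section calls `calculateCustomScore` without defining it. As one hypothetical possibility, the sketch below scores a run by the share of tool calls that succeeded within a latency budget. The metric itself, the `2000` ms budget, and the simplified `ToolCallSketch` type are assumptions; only the `isError` and `durationMs` fields come from the collector code shown earlier.

```typescript
// Simplified tool-call shape; only the fields this metric needs.
interface ToolCallSketch {
  name: string;
  isError: boolean;
  durationMs?: number;
}

// Fraction of tool calls that succeeded and finished within `budgetMs`.
function calculateCustomScoreSketch(toolCalls: ToolCallSketch[], budgetMs: number = 2000): number {
  // A run without tool calls has nothing to penalize.
  if (toolCalls.length === 0) return 1;
  const healthy = toolCalls.filter(
    // Calls with no recorded duration are treated as within budget.
    call => !call.isError && (call.durationMs ?? 0) <= budgetMs,
  );
  return healthy.length / toolCalls.length;
}

const sampleToolCalls: ToolCallSketch[] = [
  { name: 'stock_get_price', isError: false, durationMs: 350 },
  { name: 'stock_search_news', isError: false, durationMs: 1800 },
  { name: 'stock_search_news', isError: true, durationMs: 120 },
  { name: 'stock_get_price', isError: false },
];
const customScore = calculateCustomScoreSketch(sampleToolCalls);
```

With the `passed: score >= 0.7` rule from the custom scorer, this sample run (three healthy calls out of four) would pass.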