Most AI finance tools do the same thing: take your portfolio, pass it to a large language model, and return a recommendation. One call. One perspective.
That bothered me. Because that's not how any serious investment decision actually gets made.
Real investment decisions come from debate: analysts challenging each other's assumptions, a risk manager pushing back on the growth thesis, a quant pointing out that the momentum signal just reversed. The final recommendation is the synthesis of disagreement, not the output of a single mind. So I built something different.
The Problem with Single-Agent Finance AI
When you ask a single LLM "should I rebalance my portfolio?", you're getting one model's best guess based on its training data and whatever context you provided. Even with retrieval-augmented generation (RAG) and live market data, the model has no mechanism to challenge its own reasoning.
This leads to what I call confident mediocrity: the model sounds authoritative but anchors on its first inference and doesn't revise.
The fix isn't a better prompt. It's a better architecture.
The Delphi Method
In the 1950s, researchers at RAND Corporation faced a problem: how do you aggregate expert forecasts on genuinely uncertain questions without the loudest voice dominating the outcome? Their solution was the Delphi Method: structured rounds of independent expert judgment, with cross-visibility between rounds and the freedom to revise. Named after the Oracle of Delphi, it was designed for exactly the kind of question where no single expert has the full picture.
I applied this structure directly to the agent architecture: each analyst forms an independent view in Round 1, sees the other analysts' views and can revise in Round 2, and a meta-agent CIO synthesizes the revised views into the final recommendation.
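The three Delphi-style phases can be sketched as plain functions. This is a minimal, hypothetical skeleton with stubbed string-returning analysts standing in for real Claude API calls; the persona names and helper names are illustrative, not Project Delphy's actual identifiers.

```javascript
// Phase 1: each analyst forms an independent view (no cross-visibility).
function round1(personas, marketData) {
  return personas.map((p) => ({
    persona: p,
    view: `${p}: independent take given ${marketData}`,
  }));
}

// Phase 2: each analyst sees peers' Round 1 views and may revise.
function round2(personas, round1Views) {
  const peerSummary = round1Views.map((v) => v.view).join(" | ");
  return personas.map((p) => ({
    persona: p,
    view: `${p}: revised after seeing [${peerSummary}]`,
  }));
}

// Phase 3: a meta-agent "CIO" synthesizes the revised views.
function synthesize(round2Views) {
  return `CIO synthesis of ${round2Views.length} revised views`;
}

const personas = ["macro", "quant", "risk"]; // stand-ins for the 10 personas
const r1 = round1(personas, "live market snapshot");
const r2 = round2(personas, r1);
const recommendation = synthesize(r2);
```

The key structural property is that `round1` never lets one analyst see another's output, while `round2` deliberately does; that is what makes it Delphi rather than simple ensembling.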
The 10 Analyst Personas
I wanted the analysts to represent genuinely different investment philosophies, not ten slightly different clones: personas modeled on firms as different as Bridgewater and Citadel, which will often disagree.
That diversity matters. The meta-agent CIO has to synthesize those disagreements into a coherent recommendation, and that synthesis is the most interesting output of the whole system.
Live Market Data (RAG)
Static analysis isn't useful for portfolio decisions. The system fetches live data before every Delphy run: current prices and daily changes for all holdings, S&P 500 / Nasdaq / VIX, treasury yields, top analyst ratings, earnings calendar, and market-moving news.
This uses Claude's web_search_20250305 tool, running 10 parallel search calls before Round 1 begins. Every analyst has the same live data going into their independent analysis.
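Building those ten requests up front makes the parallel dispatch trivial. Here's a sketch of the payloads: the topics are examples drawn from the data sources listed above, and the request shape follows the Anthropic Messages API's server-side web search tool, but treat the specifics (token limits, `max_uses`) as illustrative assumptions rather than the project's actual configuration.

```javascript
const MODEL = "claude-sonnet-4-20250514";

// Example search topics (the real run builds ~10, including per-holding queries).
const topics = [
  "current prices for portfolio holdings",
  "S&P 500, Nasdaq, and VIX levels",
  "treasury yields",
  "analyst ratings for holdings",
  "earnings calendar",
  "market-moving news",
];

// One web_search-enabled Messages API payload per topic.
const buildSearchRequest = (topic) => ({
  model: MODEL,
  max_tokens: 1024,
  tools: [{ type: "web_search_20250305", name: "web_search", max_uses: 3 }],
  messages: [{ role: "user", content: `Find current data: ${topic}` }],
});

const requests = topics.map(buildSearchRequest);

// Dispatch with the Anthropic SDK would look like:
// const results = await Promise.all(
//   requests.map((r) => client.messages.create(r))
// );
```

Because each search is independent of the others, there is no reason to serialize them; the dispatch step is a single `Promise.all`.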
The Tech Stack
```
# Stack
Frontend: React 19 + Vite (SPA)
Backend: Node.js + Express (REST API)
AI: Anthropic Claude Sonnet 4 (claude-sonnet-4-20250514)
RAG: Claude web_search_20250305 tool
Deploy: Cloudflare Pages
```
The backend is intentionally simple: Express with 6 REST endpoints. All the intelligence lives in the prompt engineering and the orchestration logic that sequences the 33 API calls per Delphy run. Total cost per run: roughly $0.10–0.30 depending on portfolio size and question length.
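A quick back-of-envelope check shows why 33 calls still lands in that price range. The per-million-token prices below are an assumption (Claude Sonnet list pricing at the time of writing), and the token counts per call are illustrative guesses, not measured values.

```javascript
const CALLS_PER_RUN = 33;
const INPUT_PRICE_PER_MTOK = 3;   // assumed: $3 per million input tokens
const OUTPUT_PRICE_PER_MTOK = 15; // assumed: $15 per million output tokens

// Rough total cost for one run given average token counts per call.
function estimateRunCost(inputTokensPerCall, outputTokensPerCall) {
  const inputCost =
    ((CALLS_PER_RUN * inputTokensPerCall) / 1e6) * INPUT_PRICE_PER_MTOK;
  const outputCost =
    ((CALLS_PER_RUN * outputTokensPerCall) / 1e6) * OUTPUT_PRICE_PER_MTOK;
  return inputCost + outputCost;
}

// ~1,000 input and ~250 output tokens per call gives roughly $0.22,
// inside the observed $0.10–0.30 range.
const estimate = estimateRunCost(1000, 250);
```

Larger portfolios push up the input side (bigger holdings context per prompt), which is why the range rather than a single number.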
What I Learned
Orchestration is the hard part.
Getting a single agent to give a good answer is relatively easy. Getting 10 agents to give diverse answers, then synthesizing those into something coherent, requires careful prompt engineering at every layer. The meta-agent system prompt (telling the CIO how to handle genuine disagreements versus noise) took the most iteration.
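To make the disagreement-versus-noise distinction concrete, here is a hedged sketch of what such a CIO system prompt might look like. The actual prompt in Project Delphy is not shown in this post; this only illustrates the instruction pattern described above.

```javascript
// Hypothetical meta-agent system prompt builder; wording is illustrative.
function buildCioSystemPrompt(analystCount) {
  return [
    `You are the Chief Investment Officer synthesizing ${analystCount} analyst reports.`,
    "Where analysts genuinely disagree, surface the disagreement and explain",
    "which assumptions drive each side, rather than averaging the views away.",
    "Where differences are noise (phrasing, emphasis), collapse them silently.",
    "End with one actionable recommendation and its key risks.",
  ].join("\n");
}

const cioPrompt = buildCioSystemPrompt(10);
```

The hard-won part is the middle two instructions: without them, a synthesis model tends to split the difference between conflicting theses instead of explaining why they conflict.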
Parallelism matters for latency.
Round 1 runs all 10 analysts in parallel. If those ran serially, the wait would be unbearable. The architecture choice of "parallel where independent, serial where dependent" is what makes it feel real-time.
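The latency difference is easy to demonstrate with stubs. Here `simulateAnalyst` is a hypothetical stand-in for one Claude API call, delayed with a timer; the parallel round finishes in roughly one call's latency, the serial one in roughly ten.

```javascript
// Stub for one analyst's API call: resolves after `ms` milliseconds.
const simulateAnalyst = (id, ms = 50) =>
  new Promise((resolve) => setTimeout(() => resolve(`analyst-${id}`), ms));

// Helper: run an async function and measure wall-clock time.
async function timed(fn) {
  const start = Date.now();
  const value = await fn();
  return { value, elapsed: Date.now() - start };
}

// Round 1: the ten analysts are independent, so run them in parallel.
const parallelRound = () =>
  Promise.all(Array.from({ length: 10 }, (_, i) => simulateAnalyst(i)));

// Naive serial version, shown only for the latency comparison.
async function serialRound() {
  const out = [];
  for (let i = 0; i < 10; i++) out.push(await simulateAnalyst(i));
  return out;
}

async function compare() {
  const parallel = await timed(parallelRound);
  const serial = await timed(serialRound);
  return { parallel, serial };
}
```

The synthesis step, by contrast, depends on all ten outputs, so it must stay serial: parallel where independent, serial where dependent.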
The Delphi structure works.
After adding Round 2 cross-validation, the recommendations noticeably improved. Agents that were overconfident in Round 1 revised their views, and the final synthesis was more nuanced. Project Delphy genuinely outperforms any single analyst's Round 1 output.
The framing matters as much as the build.
For a technical audience, the right frame is Stacked Generalization (10 base models feeding a meta-learner CIO) and Mixture of Experts (each analyst a specialized model). For a business audience, it is the Delphi Method. Both are accurate. The choice of which to lead with depends entirely on who is reading.
What's Next
- Demo mode with synthetic portfolio data so anyone can try it without uploading personal statements
- A Python version using LangGraph for more complex workflow orchestration
- Backtesting: run Project Delphy on historical portfolio snapshots and score recommendations against actual returns
Hosung Kim
MSBA student at USC Marshall School of Business, focused on AI systems and Data Science. Building open-source tools at the intersection of machine learning and investment analysis.
@HosungKim48