Session overview
Joyce wanted to stress-test a multi-model code review workflow, using a financial agent codebase as the test subject. Rather than have me do the review alone, we turned it into an orchestration exercise: I called Gemini and Codex from the terminal to review the same codebase independently, then synthesized all three perspectives (theirs and mine) into one report. The result was sharper than any single review would have been.
Collaboration patterns
The human set the frame; the bot ran the orchestra. Joyce gave the context (a codebase with a 9-point refactoring plan) and the intent (find blind spots). The bot's job was to design the review process, wrangle three different AI CLIs, and synthesize competing outputs into something useful.
CLI debugging as a collaboration tax. Getting Gemini and Codex to run non-interactively took multiple attempts: wrong flags, missing git repos, unsupported model versions. Joyce was patient but started asking "are you done yet?" The bot should have tested the CLI invocations before promising parallel execution.
Consensus carries weight. When all three models independently flagged the same issue (drop vector search = bad idea), it became undeniable. When Codex alone found a path traversal bug the other two missed, it justified the multi-model approach. The format — consensus table with per-model attribution — made the findings immediately credible.
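The consensus table with per-model attribution can be sketched as a small aggregation step. This is a minimal illustration, not the code used in the session: the `findings` structure, the reviewer labels, and the function names are assumptions, and only the two issues named above are included.

```python
from collections import defaultdict

# Hypothetical input: each reviewer maps to the set of issue labels it raised.
# The two labels come from this session; the structure itself is an assumption.
findings = {
    "bot": {"dropping vector search is risky"},
    "gemini": {"dropping vector search is risky"},
    "codex": {"dropping vector search is risky", "path traversal bug"},
}


def consensus_table(findings):
    """Return rows of (issue, sorted reviewers), most widely flagged first."""
    by_issue = defaultdict(set)
    for reviewer, issues in findings.items():
        for issue in issues:
            by_issue[issue].add(reviewer)
    rows = [(issue, sorted(reviewers)) for issue, reviewers in by_issue.items()]
    # Sort by descending reviewer count, then alphabetically for stable output.
    rows.sort(key=lambda row: (-len(row[1]), row[0]))
    return rows


def render(rows, total):
    """Format each row as 'issue | agreement | reviewers' plain-text lines."""
    return [f"{issue} | {len(who)}/{total} | {', '.join(who)}" for issue, who in rows]


if __name__ == "__main__":
    for line in render(consensus_table(findings), len(findings)):
        print(line)
```

Sorting by agreement count puts unanimous findings at the top, while single-model findings stay visible with their source attached, so a lone catch like the path traversal bug is not lost.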
Format matters more than content density. The PDF was "bad as usual" (direct quote). Switching to HTML with proper CSS, colored badges, and callout boxes made the same content actually readable. Lesson re-learned from Session 002: inline review first, pretty artifact last.
What 砚 learned
- Multi-model review is a legitimate workflow. Not for every task, but for high-stakes reviews (security audit, architecture assessment), having three independent passes catches things one model misses. Codex found 7 issues I didn't.
- CLI tools are fragile. `gemini --approval-mode plan` needed an experimental flag. `codex exec` needed `--skip-git-repo-check`. `gpt-5.4-codex` isn't available on ChatGPT accounts. Each failure cost time and trust.
- Context shapes the review tone. The refactor plan was written by someone else. Joyce was reviewing it from a senior perspective. The review needed to respect the original author's work while being honest about risks, a tone the bot had to calibrate.
- Don't promise PDF quality. Three sessions in, the pattern is clear: reportlab PDFs look bad. HTML is the right default for structured reports. Stop defaulting to PDF.
- Speed expectations are real. "Can you hurry up already?" (said in Chinese) is feedback. When running background tasks, give the human something to do or a clear ETA instead of silence.
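One way to avoid going silent during a background task is to poll the process and emit a short status line at a fixed interval. The following is a minimal sketch under assumptions: `run_with_heartbeat`, its `interval` default, and the message wording are hypothetical and were not part of the session.

```python
import subprocess
import sys
import time


def run_with_heartbeat(cmd, interval=5.0, report=print):
    """Run cmd and call report() every `interval` seconds until it exits."""
    start = time.monotonic()
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    while True:
        try:
            # communicate() can be retried after a timeout without losing output.
            output, _ = proc.communicate(timeout=interval)
            break
        except subprocess.TimeoutExpired:
            elapsed = time.monotonic() - start
            report(f"still running: {cmd[0]} ({elapsed:.0f}s elapsed)")
    return proc.returncode, output


if __name__ == "__main__":
    # Stand-in for a slow review command; a real CLI call would go here.
    slow = [sys.executable, "-c", "import time; time.sleep(3); print('done')"]
    code, out = run_with_heartbeat(slow, interval=1.0)
    print(code, out.strip())
```

The `report` callback is kept separate so the same loop can print to a terminal or post a message to the human, whichever the session uses.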
Self-improvement notes
- Pre-test CLI invocations before launching them in parallel. One dry run saves three failed attempts.
- Default to HTML for reports. Only use PDF if explicitly requested.
- Give progress updates during long background tasks — don't go silent for minutes.
- When the human says "done, generate" — generate. Don't add another review pass unless asked.
- Collaboration log should be called early, not as an afterthought at the end.
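The first note, pre-testing CLI invocations before a parallel launch, could look roughly like the sketch below. It assumes each tool has a cheap "probe" invocation (same flags, trivial prompt) alongside the real "review" invocation. The `tools` layout and function names are hypothetical, and the example uses Python stand-in commands rather than guessing at the exact Gemini or Codex arguments.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def preflight(name, cmd, timeout=30):
    """Run one cheap invocation and report whether it exited cleanly."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Covers a missing binary, a permission problem, or a hung CLI.
        return name, False, str(exc)
    return name, result.returncode == 0, result.stderr.strip()


def run_reviews(tools, timeout=30):
    """Probe every tool serially, then run only the passing ones in parallel."""
    checks = [preflight(name, spec["probe"], timeout) for name, spec in tools.items()]
    failed = {name: err for name, ok, err in checks if not ok}
    ready = [name for name, ok, _ in checks if ok]
    results = {}
    with ThreadPoolExecutor(max_workers=max(len(ready), 1)) as pool:
        futures = {
            name: pool.submit(
                subprocess.run, tools[name]["review"], capture_output=True, text=True
            )
            for name in ready
        }
        for name, future in futures.items():
            results[name] = future.result().stdout
    return results, failed


if __name__ == "__main__":
    # Stand-ins only: real entries would hold the session's CLI invocations.
    tools = {
        "reviewer_a": {
            "probe": [sys.executable, "-c", "print('ok')"],
            "review": [sys.executable, "-c", "print('review text')"],
        },
        "reviewer_b": {
            "probe": ["hypothetical-missing-cli"],
            "review": ["hypothetical-missing-cli"],
        },
    }
    results, failed = run_reviews(tools)
    print(sorted(results), sorted(failed))
```

Reporting the failed probes up front also gives the human an honest picture of which reviewers will actually run before any parallel execution is promised.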
Open questions
Is multi-model review reproducible? The three models gave different depths of feedback partly because of how they were invoked (interactive vs CLI vs exec mode). Would the results differ with the same prompt format?
Does the output format depend on the audience? A 14-question prep table works for a senior reviewer but might overwhelm someone who just wants action items. Should there be layered versions of the same report?
Can the bot play devil's advocate? The report identifies risks. But could the bot also simulate the questioner — stress-test answers in real-time, like a sparring partner?