AI Agents · Benchmarking · Gemini-3-Flash · TerminalBench · Node.js · AI Development · Performance Optimization · Debugging

Gemini-3-Flash: My TerminalBench AI Agent Win & 3 Fixes

Umair (buildzn.com) shares the architecture, prompt engineering, and fixes behind his top-scoring TerminalBench AI agent on Gemini-3-Flash. Real bugs,...


Umair · Flutter & AI Engineer

April 28, 2026 · 10 min read

Everyone talks about building AI agents that "just work," but nobody tells you how much low-level crap you debug to get there. I spent weeks wrestling with gemini-3-flash-preview on TerminalBench, hitting every wall from bad tool calls to silent API failures. Figured it out the hard way.

Why TerminalBench Matters as an AI Agent Benchmark

Look, benchmarks are usually fluff. But TerminalBench is different. It’s a real-world gauntlet for AI agents, pushing them through complex CLI tasks. We’re talking file operations, network requests, package management – actual dev work. For me, getting my Node.js AI agent to top scores meant validating the multi-agent architecture I've been refining for NexusOS. Plus, I needed to see how gemini-3-flash-preview actually performed under pressure, not just theoretical token counts.

This wasn't just about showing off. Building an agent capable of navigating intricate command-line environments helps you truly understand the model's reasoning, tool-use capabilities, and error handling. It's a brutal, honest benchmark.

The Agent Architecture: Lean & Mean Node.js

My setup for this challenge was pretty standard for my agent work: Node.js backend, @google/generative-ai SDK, and a custom toolset. I don't get why people over-engineer with massive frameworks for basic agents. Keep it simple.

Here’s the core structure:

  1. Orchestrator (agent.js): The brain. Manages the conversation, parses model responses, dispatches tool calls, and maintains state. This is where most of the challenges of building an AI agent show up.
  2. Tool Registry (tools.js): A collection of functions exposed to the Gemini model. Each tool maps to a specific shell command or utility.
  3. State Manager (state.js): Simple in-memory object for TerminalBench runs. For production (like FarahGPT or NexusOS), this would be Firebase or Redis.
  4. Prompt Templates (prompts.js): Critical for guiding gemini-3-flash agent behavior. System instructions, few-shot examples, and tool definitions live here.

For TerminalBench, the agent needed access to common shell commands like ls, cd, cat, echo, mkdir, and curl. I wrapped these in Node.js child process calls, returning stdout/stderr. Simple, but effective.

// tools.js
const { exec } = require('child_process');

async function executeCommand(command) {
  return new Promise((resolve) => {
    exec(command, (error, stdout, stderr) => {
      if (error) {
        // Non-zero exit code: surface stderr (or the error message) so the agent can debug
        resolve({ success: false, output: (stderr || error.message).trim() });
      } else if (stderr) {
        // Command succeeded but wrote warnings to stderr; pass them along with stdout
        resolve({ success: true, output: `${stdout}\n${stderr}`.trim() });
      } else {
        resolve({ success: true, output: stdout.trim() });
      }
    });
  });
}

const tools = [
  {
    name: "run_shell_command",
    description: "Executes a shell command on the system.",
    parameters: {
      type: "object",
      properties: {
        command: {
          type: "string",
          description: "The shell command to execute."
        }
      },
      required: ["command"]
    },
    func: executeCommand
  }
  // ... other tools like 'read_file', 'write_file'
];

module.exports = tools;

Prompt Engineering for Precise Tool Use

This is where the agent performance tuning really kicks in. Gemini-3-Flash is good, but it's not telepathic. You need to be explicit. My prompt had three key components:

  1. System Instruction: Define the agent's persona and objective.
  2. Tool Definitions: Passed directly via the Gemini API's tools parameter.
  3. Few-Shot Examples: Crucial for teaching the model how to use run_shell_command correctly, especially for multi-step tasks.

Here's the basic structure for the system instruction and a simplified example:

// prompts.js
const systemInstruction = `
You are an expert Linux sysadmin assistant. Your goal is to solve complex system tasks by executing shell commands.
Always think step-by-step.
Your output must be a tool call to 'run_shell_command' to interact with the environment.
If you need to analyze output, call 'run_shell_command' and wait for results.
Do not assume success or file contents. Always verify.
If a command fails, try to debug it using other commands (e.g., 'ls -l', 'cat error.log').
When you believe the task is complete, use the 'final_answer' tool.
`;

const fewShotExamples = [
  // Example 1: List directory contents
  {
    role: "user",
    parts: [{ text: "List the files in the current directory." }]
  },
  {
    role: "model",
    parts: [{
      functionCall: {
        name: "run_shell_command",
        args: { command: "ls -F" }
      }
    }]
  },
  {
    role: "tool",
    parts: [{
      functionResponse: {
        name: "run_shell_command",
        response: { success: true, output: "file1.txt\nsubdir/\nscript.sh" }
      }
    }]
  },
  {
    role: "model",
    parts: [{
      functionCall: {
        name: "final_answer",
        args: { answer: "The files are file1.txt, subdir/, and script.sh." }
      }
    }]
  },
  // Example 2: Create a directory and then list it
  {
    role: "user",
    parts: [{ text: "Create a directory called 'testdir' and list its contents." }]
  },
  {
    role: "model",
    parts: [{
      functionCall: {
        name: "run_shell_command",
        args: { command: "mkdir testdir" }
      }
    }]
  },
  {
    role: "tool",
    parts: [{
      functionResponse: {
        name: "run_shell_command",
        response: { success: true, output: "" }
      }
    }]
  },
  {
    role: "model",
    parts: [{
      functionCall: {
        name: "run_shell_command",
        args: { command: "ls -F testdir" }
      }
    }]
  },
  {
    role: "tool",
    parts: [{
      functionResponse: {
        name: "run_shell_command",
        response: { success: true, output: "" }
      }
    }]
  },
  {
    role: "model",
    parts: [{
      functionCall: {
        name: "final_answer",
        args: { answer: "Created 'testdir'. It is currently empty." }
      }
    }]
  },
  // ... more complex examples with error handling and debugging
];

module.exports = { systemInstruction, fewShotExamples };

The final_answer tool is just a special tool that signals the task is complete and provides the final output. This is crucial for TerminalBench's scoring mechanism. Without it, the agent would just keep generating commands.
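
For reference, here's roughly how those pieces get wired together in agent.js. Treat this as a minimal sketch against the @google/generative-ai SDK rather than my exact production code: option shapes can shift between SDK versions.

// agent.js (sketch): wiring prompts and tool declarations into the model.
// Assumes the @google/generative-ai SDK; exact option names may differ by version.
const { GoogleGenerativeAI } = require('@google/generative-ai');
const tools = require('./tools');
const { systemInstruction, fewShotExamples } = require('./prompts');

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);

const model = genAI.getGenerativeModel({
  model: 'gemini-3-flash-preview', // model name as used throughout this post
  systemInstruction,
  tools: [{
    // Only the declarations go to the API; the local `func` stays in our registry.
    functionDeclarations: tools.map(({ name, description, parameters }) => ({
      name,
      description,
      parameters
    }))
  }]
});

// Few-shot examples are prepended to the conversation history for every task.
const history = [...fewShotExamples];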

What I Got Wrong First: The Gemini API & Tool Call Hell

Okay, so getting a top score on TerminalBench wasn't a walk in the park. The initial attempts were filled with the usual challenges of building an AI agent. Here's the thing: gemini-3-flash-preview is fast, but it has quirks.

1. The Silent Tool Call Failure: My biggest headache came from the Gemini API client itself, specifically the @google/generative-ai library version 0.11.0. I'd send a request, expecting a functionCall, but sometimes I'd just get a text response or even nothing, even when the model should have used a tool.

Turns out, if the model hallucinates a tool name or arguments that don't precisely match your tools definition, the API sometimes doesn't throw a proper error telling you the tool call was invalid. It just defaults to generating text or an empty response. This is infuriating for agent performance tuning.

My console was clean, but the agent wasn't calling run_shell_command. I debugged by logging the raw API response object.

// Snippet of the raw API response when things went south
// This *should* have been a tool call, but came back as text
// or even an empty 'parts' array if the model was confused.
// The actual error was usually something I couldn't log directly from the SDK,
// but implied by the model's *lack* of tool call where expected.
/*
{
  "candidates": [
    {
      "content": {
        "parts": [
          {
            "text": "I can't find a tool to perform that action." // Or sometimes just an empty array
          }
        ],
        "role": "model"
      },
      "finishReason": "STOP"
    }
  ]
}
*/

The Fix: I had to explicitly include extremely detailed and specific examples in the few-shot section of the prompt. Not just "use ls," but "when asked to list files, always call run_shell_command with command: 'ls -F'." I also added robust input validation on my tool functions, so if gemini-3-flash agent sent malformed JSON args (e.g., command: 123 instead of a string), my wrapper would catch it and return a clear error back to the model. This taught the agent faster than just letting the API silently fail.
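
To make that concrete, the validation layer looked roughly like this. It's a simplified sketch (validateArgs and dispatchToolCall are just illustrative names, not SDK functions), but the important part is that bad arguments and hallucinated tool names come back as tool output the model can read, instead of dying silently.

// agent.js (sketch): validate tool calls before dispatching, and feed errors back to the model.
// `validateArgs` and `dispatchToolCall` are illustrative helpers, not SDK functions.
function validateArgs(toolName, args) {
  if (toolName === 'run_shell_command' && typeof args.command !== 'string') {
    return `Invalid arguments for ${toolName}: 'command' must be a string, got ${typeof args.command}.`;
  }
  return null; // null means the args look sane
}

async function dispatchToolCall(toolRegistry, functionCall) {
  const { name, args } = functionCall;
  const tool = toolRegistry.find((t) => t.name === name);

  // Hallucinated tool name: tell the model what actually exists instead of failing silently.
  if (!tool) {
    return {
      success: false,
      output: `Unknown tool '${name}'. Available tools: ${toolRegistry.map((t) => t.name).join(', ')}.`
    };
  }

  const validationError = validateArgs(name, args || {});
  if (validationError) {
    return { success: false, output: validationError };
  }

  // run_shell_command's func takes the raw command string (see tools.js above).
  return tool.func(args.command);
}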

2. State Management with Multiple Turns: TerminalBench often requires multiple commands to complete a task. My initial agent wasn't good at carrying context. It would forget what it just did or what the previous command's output meant.

The Fix: The fewShotExamples were key here again, demonstrating chained commands. But more importantly, I started treating each tool response as a critical part of the conversation history. Instead of just logging the output, I explicitly added a role: "tool" entry with the functionResponse to the history array for the model. This is standard, but easy to gloss over.
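
In the agent loop, that looks something like the sketch below (using the dispatchToolCall helper from the previous snippet). Again, runTask is an illustrative outline rather than the exact orchestrator, but the part that matters is the role: "tool" entry carrying the functionResponse back into history.

// agent.js (sketch): the multi-turn loop. Every tool result goes back into history
// as a role: "tool" entry so the next turn sees what the last command actually did.
async function runTask(model, toolRegistry, history, userTask, maxTurns = 20) {
  history.push({ role: 'user', parts: [{ text: userTask }] });

  for (let turn = 0; turn < maxTurns; turn++) {
    const result = await model.generateContent({ contents: history });
    const call = result.response.functionCalls()?.[0];
    if (!call) break; // model answered with plain text (or nothing); stop and inspect

    // Keep the model's own tool call in history...
    history.push({ role: 'model', parts: [{ functionCall: call }] });

    if (call.name === 'final_answer') return call.args.answer;

    // ...and the tool's result, so the next turn has full context.
    const toolResult = await dispatchToolCall(toolRegistry, call);
    history.push({
      role: 'tool',
      parts: [{ functionResponse: { name: call.name, response: toolResult } }]
    });
  }
  return null; // ran out of turns without a final_answer
}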

3. Over-Reliance on Pure Reasoning: I initially thought gemini-3-flash-preview would just "figure it out" from the system prompt. Wrong. It needs concrete examples of problem-solving. Asking it to debug a failed command without an example of how to debug (e.g., ls -l for permissions, cat error.log for details) led to vague or incorrect follow-up actions.

The Fix: Expanded the fewShotExamples to include scenarios where commands failed, and the agent then used another tool call to diagnose the issue. This taught the agent to be resilient on TerminalBench.
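
The extra examples I left out of prompts.js above follow the same shape. A simplified, made-up instance (the filenames and error text are illustrative, not from the actual benchmark) looks like this:

// prompts.js (illustrative): a few-shot turn where a command fails and the agent debugs it.
const errorHandlingExample = [
  {
    role: "model",
    parts: [{ functionCall: { name: "run_shell_command", args: { command: "cat config.yaml" } } }]
  },
  {
    role: "tool",
    parts: [{
      functionResponse: {
        name: "run_shell_command",
        response: { success: false, output: "cat: config.yaml: No such file or directory" }
      }
    }]
  },
  {
    // On failure, the next step is a diagnostic command, not a guess.
    role: "model",
    parts: [{ functionCall: { name: "run_shell_command", args: { command: "ls -la" } } }]
  }
];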

Optimization & Gotchas

To really nail the agent performance tuning, a few things made a difference:

  • Token Budget Discipline: gemini-3-flash is cheaper, but long histories still cost. I implemented a simple sliding window for conversation history, keeping the last N turns, with a hard cut-off (sketched after this list). For TerminalBench, the tasks are usually contained enough that full history works, but for complex, long-running agents, this is critical.
  • Response Schema Enforcement: For tools like final_answer, I made the answer argument a strict string. If the model started outputting JSON or other formats, my validation caught it. This ensures TerminalBench's scoring parser gets what it expects.
  • Retries and Backoff: For any external API calls made by the agent's tools (e.g., a curl tool hitting a flaky external service), implementing basic exponential backoff and retries dramatically improved stability. Not directly relevant to TerminalBench's shell commands, but crucial when building AI agents in general.
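
The sliding window itself is nothing fancy. Here's a minimal sketch of the idea; maxRecent is an illustrative knob rather than a tuned value, and in practice you'd cut on whole user/model/tool turns so a functionCall never gets separated from its functionResponse.

// agent.js (sketch): keep history bounded. Preserves the few-shot prefix and the
// most recent entries; `maxRecent` is an illustrative default, not a tuned value.
function trimHistory(history, fewShotLength, maxRecent = 30) {
  if (history.length <= fewShotLength + maxRecent) return history;
  return [
    ...history.slice(0, fewShotLength),           // keep the few-shot examples verbatim
    ...history.slice(history.length - maxRecent)  // keep only the most recent entries
  ];
}

// Usage (e.g., before each generateContent call):
// history = trimHistory(history, fewShotExamples.length);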

FAQs

How do you prevent AI agents from hallucinating tool calls?

You can't eliminate it entirely, but you can drastically reduce it. Provide clear, concise system instructions. More importantly, use strong few-shot examples that demonstrate correct tool usage, including edge cases. Finally, validate the arguments received by your tools; if they're malformed, return a clear error message back to the model in the tool response.

Is Gemini-3-Flash suitable for complex AI agents?

Yes, gemini-3-flash-preview is surprisingly capable for its speed and cost. Its tool-use capabilities are solid, especially with careful prompt engineering. However, for highly complex, multi-modal reasoning or extremely long contexts, larger models might still be necessary. For many TerminalBench-style tasks, it performs exceptionally well.

What's the best way to handle state in a multi-turn AI agent?

For simple benchmarks, in-memory state is fine. For production Node.js AI agent applications, use a persistent store like Firebase, MongoDB, or Redis. Store the full conversation history, including tool calls and their outputs, to give the agent a complete picture of past interactions.

Final Thoughts

Building an AI agent that consistently scores high on something like TerminalBench isn't about finding some magic prompt. It's about meticulous engineering: solid architecture, precise prompt engineering with detailed few-shot examples, and brutal debugging of integration issues. My top score with gemini-3-flash-preview wasn't because the model "just worked," but because I hammered out every single edge case and API quirk. Honestly, anyone who says agent development is just "prompt engineering" hasn't actually shipped anything complex.


Umair Bilal

Flutter & AI Engineer with 4+ years experience and 20+ production apps shipped. I build mobile apps, AI-powered systems, and full-stack SaaS. Founder of BuildZn and NexusOS (AI agent governance SaaS). Full-stack: Flutter, Node.js, Next.js, AI APIs, Firebase, MongoDB, Stripe, RevenueCat.
