gadget/docs/streaming-responses.md

Gadget Code Streaming Responses

The Architecture and Socket Protocol documents describe the Gadget Code system, and the socket protocol we use for transmitting events, respectively.

The User selects a Project and Drone from a list of available Projects and Drones. The User then starts a chat session with the Drone to work on the Project. When the user submits a prompt:

  1. gadget-code:frontend generates the submitPrompt event and sends it to gadget-code:backend.
  2. gadget-code:backend wraps the prompt as a Work Order, selects the gadget-drone process locked to the chat session (CodeSession), and forwards the Work Order to the drone for processing.
  3. The drone executes the Agentic Workflow Loop, calling the specified AI SDK's chat endpoint to process a prompt until done.

As long as the response contains tool calls, the loop continues inserting the tool call response output into the context and calling the AI SDK again for additional processing.

When there are no further tool calls received, the loop ends and communicates the final results of the turn back to gadget-code:backend, which then forwards the information back to the gadget-code:frontend (IDE) that sent the command.
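The loop described above can be sketched as follows. The names here (runWorkOrder, ChatFn, ToolFn) are illustrative, not the actual gadget-drone identifiers:

```typescript
// Sketch of the Agentic Workflow Loop. Names are illustrative, not the
// actual gadget-drone implementation.
type Message = { role: "user" | "assistant" | "tool"; content: string };
type ToolCall = { name: string; args: Record<string, unknown> };
type ChatResult = { content: string; toolCalls: ToolCall[] };

// Hypothetical adapters over the configured AI SDK's chat endpoint and
// the drone's tool registry.
type ChatFn = (context: Message[]) => Promise<ChatResult>;
type ToolFn = (call: ToolCall) => Promise<string>;

// Call the AI SDK's chat endpoint until it stops requesting tool calls,
// then return the final assistant content as the turn's result.
async function runWorkOrder(prompt: string, chat: ChatFn, runTool: ToolFn): Promise<string> {
  const context: Message[] = [{ role: "user", content: prompt }];
  for (;;) {
    const result = await chat(context);
    context.push({ role: "assistant", content: result.content });
    if (result.toolCalls.length === 0) return result.content; // turn is done
    for (const call of result.toolCalls) {
      // Insert each tool's output into the context and call the SDK again.
      context.push({ role: "tool", content: await runTool(call) });
    }
  }
}
```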

The system currently performs all of the work, receives the full response, and only sends a result back to the IDE at the conclusion of the work.

Streaming Responses

The OpenAI and Ollama SDKs both support streaming responses. We want to refactor our Agentic Workflow Loop to process streaming responses.

  • As thinking/reasoning tokens are emitted, we want gadget-drone to emit the thinking event to the backend on its DroneSession so the client IDE can update as the model is processing the work.
  • As response tokens are emitted from the streaming response, we want to emit our own response event so the IDE can update as the model is processing the work.
  • As tool calls are being requested by the agent, we want to emit our toolCall event so the IDE can update with information about the tool call in real-time as the model is processing the work.
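A minimal sketch of how a streamCallback could fan SDK chunks out into these three events. The chunk and event shapes here are assumptions, not the actual protocol types:

```typescript
// Event shapes are illustrative; the real events travel over the
// DroneSession socket described in the Socket Protocol document.
type StreamEvent =
  | { type: "thinking"; token: string }
  | { type: "response"; token: string }
  | { type: "toolCall"; name: string; args: Record<string, unknown> };

// Assumed shape of a chunk arriving from the streaming SDK.
type SdkChunk = {
  thinking?: string;
  content?: string;
  toolCall?: { name: string; args: Record<string, unknown> };
};

// Builds a streamCallback that forwards every SDK chunk to the backend
// as soon as it arrives.
function makeStreamCallback(emit: (event: StreamEvent) => void) {
  return (chunk: SdkChunk): void => {
    if (chunk.thinking) emit({ type: "thinking", token: chunk.thinking });
    if (chunk.content) emit({ type: "response", token: chunk.content });
    if (chunk.toolCall) emit({ type: "toolCall", ...chunk.toolCall });
  };
}
```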

Streaming: OpenAI

We will need to add and test streaming response processing. The generate and chat methods currently accept the streamCallback parameter, which is a callback that gets called with each token and tool call as it's emitted from the SDK.

We need to verify that the full path for these events is implemented end to end: gadget-drone generates the event, gadget-code:backend (web server) routes it to the correct CodeSession based on the work order's chat session information, and gadget-code:frontend (IDE) receives it.
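As a sketch, the per-chunk forwarding could look like the helper below. The delta shape is trimmed from the openai npm package's ChatCompletionChunk; the streamCallback signature in gadget-drone is an assumption:

```typescript
// Trimmed shape of one streaming delta from the openai npm package.
type ChunkDelta = {
  content?: string | null;
  tool_calls?: { index: number; function?: { name?: string; arguments?: string } }[];
};

// Assumed streamCallback payload shape for illustration.
type StreamCallback = (e: { token?: string; toolCallName?: string }) => void;

// Translate one streaming delta into streamCallback invocations.
function forwardDelta(delta: ChunkDelta, cb: StreamCallback): void {
  if (delta.content) cb({ token: delta.content });
  for (const tc of delta.tool_calls ?? []) {
    if (tc.function?.name) cb({ toolCallName: tc.function.name });
  }
}

// Usage against the real SDK (requires a client and API key):
//   const stream = await client.chat.completions.create({ model, messages, stream: true });
//   for await (const chunk of stream) forwardDelta(chunk.choices[0]?.delta ?? {}, streamCallback);
```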

Streaming: Ollama

We will need to add and test streaming response processing. The generate and chat methods currently accept the streamCallback parameter, which is a callback that gets called with each token and tool call as it's emitted from the SDK.

We need to verify that the full path for these events is implemented end to end: gadget-drone generates the event, gadget-code:backend (web server) routes it to the correct CodeSession based on the work order's chat session information, and gadget-code:frontend (IDE) receives it.
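A sketch of draining an Ollama chat stream. The part shape follows the ollama npm package's streaming chat response; the thinking field is only present for models that emit reasoning tokens, and the emit signature is an assumption:

```typescript
// Trimmed shape of one streaming part from the ollama npm package.
type OllamaPart = { message: { content?: string; thinking?: string } };

// Assumed emitter signature for illustration.
type Emit = (type: "thinking" | "response", token: string) => void;

// Drain an Ollama chat stream, forwarding each token as it arrives.
async function pumpOllamaStream(parts: AsyncIterable<OllamaPart>, emit: Emit): Promise<void> {
  for await (const part of parts) {
    if (part.message.thinking) emit("thinking", part.message.thinking);
    if (part.message.content) emit("response", part.message.content);
  }
}

// Usage against the real SDK:
//   const stream = await ollama.chat({ model, messages, stream: true });
//   await pumpOllamaStream(stream, emit);
```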

IDE Processing

When the User submits a prompt for processing as a work order, the IDE creates a ChatTurn component instance in the Chat View, and uses the Turn component to display these streaming response updates as they arrive.

The Turn component will display a spinner in its header area while processing the work order/prompt.

Within the Agent's portion of the Turn display, it should look like the agent is having its own conversation. Updates should occur in modes or phases that change based on the most recent content received. The agent's response, while processing a prompt, has three modes:

  • Idle: The agent's content display begins as Idle when the Turn is created. This allows it to 'sense' a transition to either thinking or responding.
  • Thinking: When transitioning to the thinking mode, a new thinking div is added to receive the token(s). We continue streaming thinking tokens into the thinking div for as long as the tokens are still thinking tokens. When a response-mode token is received, we transition TO response mode, insert a new response div, and proceed.
  • Responding: When transitioning to the responding mode, a new response div is added to receive the token(s). We continue streaming further response-mode tokens into this div for as long as the tokens are still response-mode tokens. If a thinking token is received, we transition TO the thinking mode, insert a new thinking div, and proceed.

If the content received is in the same mode as the current agent response mode (thinking/responding), then this is an append operation. As content arrives, we append the content to the current thinking/responding content and update its display. We do this as efficiently as possible, avoiding copies where we can, because these messages can arrive at a high rate.
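The mode-transition logic above can be sketched as a small state machine. The class and field names are illustrative, and blocks stand in for the divs the real component inserts:

```typescript
type Mode = "idle" | "thinking" | "responding";

// Tracks the agent display's current mode. Appends to the current block
// when the mode is unchanged; starts a new block (a new div in the real
// component) when the mode flips. Names are illustrative.
class AgentDisplay {
  private mode: Mode = "idle"; // Idle lets us 'sense' the first transition
  readonly blocks: { mode: Mode; content: string }[] = [];

  receive(mode: "thinking" | "responding", token: string): void {
    if (mode !== this.mode) {
      this.mode = mode;
      this.blocks.push({ mode, content: "" }); // transition: insert a new div
    }
    // Same mode: append to the current block rather than rebuilding it.
    this.blocks[this.blocks.length - 1].content += token;
  }
}
```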

IDE Tool Call Displays

When a toolCall message is received, break out of the current thinking/responding block to append a tool call div/display in the agent's response. We will use this to provide SUMMARY information about the tool call:

  • An ● indicator showing success status (green for success, red for error, yellow for warning, etc.)
  • The name of the tool called (e.g., search_google)

● search_google

We don't summarize successful or erroneous responses here in the tool call indicator in the chat message flow. We will be adding a separate view to display these messages in full detail later in the SESSION panel of the Chat Session view.
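A minimal sketch of the one-line summary element. The markup, class names, and status values are assumptions for illustration:

```typescript
// Assumed status values; the real set may differ.
type ToolStatus = "success" | "error" | "warning";

// One-line summary element for a tool call in the chat message flow.
// Full details go to the SESSION panel view later; markup and class
// names here are illustrative.
function toolCallSummaryHtml(name: string, status: ToolStatus): string {
  return `<div class="tool-call ${status}"><span class="dot">●</span> ${name}</div>`;
}
```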

Persistence

The primary goal of this application is to make the agentic engineering process as observable as possible. If there is data available from a work order's Agentic Workflow Loop and processing, we:

  1. Normalize that data to our own interfaces as defined by:
    1. @gadget/ai and @gadget/api;
    2. ChatTurn interface; and
    3. ChatTurn model.
  2. Store that normalized data into our own database (MongoDB) within the current ChatTurn record being managed by gadget-code:backend (web server).

The gadget-code:backend (web) server will be responsible for persisting this data to our database. As events arrive FROM gadget-drone indicating progress within a Work Order (thinking, response, toolCall), we will update the ChatTurn record with the new data. This is why stateful sessions exist for DroneSession and CodeSession.

While receiving thinking tokens, we should be aggregating that content into the session object in memory instead of calling the database with every micro-update. At mode changes, such as when changing from thinking => responding, or responding => thinking, we perform one update to write that block's aggregated content to the ChatTurn as its type of content (thinking, responding, tool).
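A sketch of that write-on-mode-change aggregation. TurnAggregator is an illustrative name, and persist stands in for the real ChatTurn update performed by gadget-code:backend:

```typescript
// Stand-in for the real database write to the current ChatTurn record.
type Persist = (block: { mode: string; content: string }) => void;

// Buffers tokens per block in memory on the session object; exactly one
// persist call happens per mode change (or at end of turn via flush).
class TurnAggregator {
  private mode: string | null = null;
  private buffer = "";
  constructor(private persist: Persist) {}

  receive(mode: "thinking" | "responding", token: string): void {
    if (this.mode !== null && mode !== this.mode) this.flush();
    this.mode = mode;
    this.buffer += token; // micro-updates stay in memory
  }

  // Write the aggregated block once, at a mode change or end of turn.
  flush(): void {
    if (this.mode !== null && this.buffer) {
      this.persist({ mode: this.mode, content: this.buffer });
    }
    this.buffer = "";
  }
}
```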

We need to adapt the persistence within ChatTurn to record this stateful/modal flow. When the User loads an existing ChatSession and its associated turns for display, they should receive the same display as they saw while working.

Changes to ChatTurn Model

We need to add fields to our ChatTurn interface and ChatTurn model that will allow us to store the stateful information for display in the UI.

Currently, there are simple string/text fields for thinking and response. This is insufficient for being able to reconstruct the actual flow that happened while processing the work order.

We will repurpose the response field to become an array of objects, each object representing a thinking/responding/toolCall block. The thinking and responding blocks store the aggregated content received while processing the work order, in order. The toolCall block stores the information about each tool call that happens, in order.

Each object will have a mode (thinking, responding, tool), a createdAt timestamp, and a content field which is the content of the event for display in the UI. The content field can be a string, such as when storing thinking and responding text content, or an object that records tool call information and metadata for analysis later by the User and by analytics tools.
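A sketch of that block shape as TypeScript interfaces. Only mode, createdAt, and content come from this document; the fields inside the tool record are assumptions:

```typescript
// Assumed structure for a tool call's content; field names beyond
// mode/createdAt/content are illustrative.
type ToolCallRecord = {
  name: string;
  args: Record<string, unknown>;
  status: "success" | "error" | "warning";
};

// One entry in the repurposed response array, in arrival order.
interface TurnBlock {
  mode: "thinking" | "responding" | "tool";
  createdAt: Date;
  // String for thinking/responding text; structured record for tool calls.
  content: string | ToolCallRecord;
}

interface ChatTurnResponse {
  response: TurnBlock[];
}
```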

Chat Turn Component Updates

The Chat Turn Component will require updates to enhance the Agent's portion of the turn display. The Agent's block will now resemble:

Thinking: [content streams in]

● search_google
● search_google

Thinking: [content streams in]

Responding: [content streams in]

● search_google

Responding: [content streams in]

We don't label the thinking and responding blocks with a header. Instead, we use style to indicate the type of block presented:

  • Thinking is muted
  • Responding is standard formatting
  • Tool calls are just a one-line element that summarizes the call

Markdown

All text content within a Chat Turn display is in Markdown format. The User uses Markdown, and the Agent responds with Markdown. We will use the marked package to render the Markdown text to HTML for display.

Caveat: The marked library sets its breaks option to false by default. This is not the behavior we want. When rendering Markdown content for display in the Chat Turn component for either the User or the Agent, breaks must be set to true to honor line breaks and display them appropriately.