gadget/docs/socket-protocol.md
2026-05-11 20:27:24 -04:00

# Gadget Code Socket Protocol Reference
This document is a cheat sheet for AI agents and developers working on the Gadget Code real-time messaging system.
## 1. Components & Connections
| Component | Role | Protocol | Auth Method |
|-----------|------|----------|-------------|
| `gadget-code:web` | Hub / Router / Server | Socket.IO Server | N/A |
| `gadget-code:ide` | Frontend Control Surface | Socket.IO Client | JWT Token |
| `gadget-drone` | Worker / AWL Runner | Socket.IO Client | Drone Registration ID |
---
## 2. Event Map Overview
Defined in `packages/api/src/messages/socket.ts`.
### IDE -> Web (Client to Server)
* `requestSessionLock`: Request an exclusive lock on a drone for a project session.
* `requestWorkspaceMode`: Request a mode change (Idle, User, Agent).
* `submitPrompt`: Submit a user prompt for agent processing.
### Drone -> Web (Client to Server)
* `thinking`: Stream reasoning/thought process text.
* `response`: Stream natural language response text.
* `toolCall`: Emit a specific tool execution event with result.
* `workOrderComplete`: Signal that a prompt processing turn is finished.
* `requestCrashRecovery`: Inbound from drone on restart if it finds a stalled work order.
* `requestTermination`: Acknowledgment from drone that termination request was received.
### Web -> Drone (Server to Client)
* `processWorkOrder`: Command to start processing a specific prompt/turn.
* `crashRecoveryResponse`: Command to `discard` or `retry` a stalled work order.
* `requestTermination`: Command to immediately terminate the drone process.
### Web -> IDE (Server to Client)
* `sessionUpdated`: Notify the IDE that a chat session property has changed (e.g. auto-generated name).
---
## 3. Core Sequences & Routing
### 3.1 Prompt Submission Flow
1. **IDE** emits `submitPrompt(content)`.
2. **Web (`CodeSession.ts`)**:
   * Creates a `ChatTurn` document (status: `processing`).
   * Increments the chat session's `stats.turnCount`.
   * Finds the target `DroneSession`.
   * Caches the updated session and signals the **IDE** to enter Processing state.
   * Emits `processWorkOrder` to the **Drone**.
   * On the first prompt (while the name is still the default), calls the AI API to auto-generate a session name.
   * Emits `sessionUpdated({ name })` to the **IDE** if the name changed.
3. **Drone (`gadget-drone.ts`)**:
   * Writes a local `.gadget/work-order.json` cache (for crash recovery).
   * Calls `AgentService.process()`.
   * Emits streaming events back to **Web**.
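
The web-side steps above can be sketched with in-memory stand-ins. This is an illustrative sketch only, not the code in `CodeSession.ts`: the `Emitter` shape, `DEFAULT_SESSION_NAME`, the turn ID format, and the synchronous `generateName` callback are all assumptions, and the real `processWorkOrder` message carries more arguments (see section 4). The Processing-state signal to the IDE is omitted because its event name is not given here.

```typescript
// Hypothetical in-memory stand-ins for the ChatTurn and chat session documents.
type TurnStatus = "processing" | "finished" | "error";

interface ChatTurn {
  id: string;
  prompt: string;
  status: TurnStatus;
}

interface ChatSession {
  id: string;
  name: string;
  stats: { turnCount: number };
}

// Minimal emitter shape so the sketch does not depend on socket.io.
interface Emitter {
  emit(event: string, ...args: unknown[]): void;
}

// Assumed default name; the real default is not stated in this document.
const DEFAULT_SESSION_NAME = "New Session";

function handleSubmitPrompt(
  session: ChatSession,
  prompt: string,
  ide: Emitter,
  drone: Emitter,
  generateName: (prompt: string) => string, // stand-in for the AI naming call
): ChatTurn {
  // Create the turn in the `processing` state.
  const turn: ChatTurn = {
    id: `turn-${session.stats.turnCount + 1}`,
    prompt,
    status: "processing",
  };
  // Count the turn on the chat session.
  session.stats.turnCount += 1;
  // Hand the work order to the drone (argument list simplified).
  drone.emit("processWorkOrder", session, turn);
  // Only while the name is still the default: rename and notify the IDE.
  if (session.name === DEFAULT_SESSION_NAME) {
    session.name = generateName(prompt);
    ide.emit("sessionUpdated", { name: session.name });
  }
  return turn;
}
```

Because the rename is guarded by the default-name check, a second prompt on the same session emits `processWorkOrder` but no further `sessionUpdated`.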
### 3.2 Result Streaming Flow
1. **Drone** emits `thinking(text)`, `response(text)`, or `toolCall(id, name, args, result)`.
2. **Web (`DroneSession.ts`)**:
   * Locates the associated `CodeSession` via `SocketService.getCodeSessionByChatSessionId()`.
   * Updates the `ChatTurn` document in MongoDB incrementally.
   * Forwards the event to the **IDE**.
3. **IDE**: Updates the UI in real-time.
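
A minimal sketch of the persist-then-forward step, assuming plain maps in place of MongoDB and the socket layer. `StreamRouter`, `TurnRecord`, and `IdeSink` are hypothetical names used only for illustration; tool calls are left out to keep it short.

```typescript
type StreamEvent = "thinking" | "response";

// Stand-in for the parts of a ChatTurn document that grow while streaming.
interface TurnRecord {
  thinking: string;
  response: string;
}

// Stand-in for the IDE socket.
interface IdeSink {
  emit(event: StreamEvent, text: string): void;
}

class StreamRouter {
  // chatSessionId -> IDE sink, playing the role of the chat session index.
  private readonly ideByChatSession = new Map<string, IdeSink>();
  // chatSessionId -> turn being accumulated.
  private readonly turns = new Map<string, TurnRecord>();

  attach(chatSessionId: string, ide: IdeSink): void {
    this.ideByChatSession.set(chatSessionId, ide);
    this.turns.set(chatSessionId, { thinking: "", response: "" });
  }

  // Called for each streamed chunk: update the turn, then forward to the IDE.
  // Returns false when no return path is known for the chat session.
  route(chatSessionId: string, event: StreamEvent, text: string): boolean {
    const turn = this.turns.get(chatSessionId);
    const ide = this.ideByChatSession.get(chatSessionId);
    if (!turn || !ide) return false;
    turn[event] += text;
    ide.emit(event, text);
    return true;
  }

  snapshot(chatSessionId: string): TurnRecord | undefined {
    return this.turns.get(chatSessionId);
  }
}
```

Appending before forwarding means the stored turn is never behind what the IDE has already displayed, which matters when the IDE reloads history from the database.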
### 3.3 Work Order Completion
1. **Drone** emits `workOrderComplete(turnId, success, message)`.
2. **Web (`DroneSession.ts`)**:
   * Sets `ChatTurn` status to `finished` or `error`.
   * Forwards event to **IDE**.
   * Clears `currentTurnId` from the drone session.
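
As a rough sketch, the status update and cleanup could look like the function below. `TurnDoc` and `DroneState` are hypothetical shapes, and clearing `currentTurnId` only when it matches the completed turn is an assumption, not something this document specifies.

```typescript
type FinalStatus = "finished" | "error";

interface TurnDoc {
  id: string;
  status: "processing" | FinalStatus;
  message?: string;
}

interface DroneState {
  currentTurnId: string | null;
}

function completeWorkOrder(
  turn: TurnDoc,
  drone: DroneState,
  success: boolean,
  message?: string,
): TurnDoc {
  // Map the success flag onto the two terminal statuses.
  const updated: TurnDoc = { ...turn, status: success ? "finished" : "error" };
  if (message !== undefined) updated.message = message;
  // Assumed guard: do not clear a newer turn that has already been assigned.
  if (drone.currentTurnId === turn.id) drone.currentTurnId = null;
  return updated;
}
```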
### 3.4 Drone Termination Flow
1. **User** clicks "Terminate" button in Drone Manager UI.
2. **IDE** calls `POST /api/v1/drone/registration/:id/terminate`.
3. **Web (`DroneService.ts`)**:
   * Checks whether the drone is already offline and returns an error if so.
   * Looks up `DroneSession` via `SocketService.getDroneSession()`.
   * If the drone is not connected, marks it as offline immediately and returns success.
   * Emits `requestTermination` to the drone socket with a callback.
   * Starts a 10-second timeout.
4. **Web (`DroneSession.ts`)**:
   * Receives `requestTermination` event.
   * Logs the termination request.
   * Forwards `requestTermination` to drone socket (passthrough).
5. **Drone (`gadget-drone.ts`)**:
   * Receives `requestTermination` from platform.
   * Calls callback with `success: true`.
   * Sends `SIGINT` to self, triggering graceful shutdown.
   * Updates status to `Offline` during shutdown.
6. **Web (`DroneService.ts`)**:
   * If the drone accepts termination, polls the DB every 500 ms until the status is `Offline`.
   * If the drone goes offline, resolves with success.
   * If the 10-second timeout expires first, forces the status to `Offline` and resolves with success.
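
The wait in step 6 can be reasoned about with a synchronous simulation in which elapsed time is passed to the status reader instead of using real timers. This is only a model of the 500 ms poll and 10-second timeout described above; `readStatusAt` is a hypothetical stand-in for the database read, and the real service is presumably asynchronous.

```typescript
type DroneStatus = "Online" | "Offline";

interface TerminationResult {
  success: true; // both outcomes in the flow above resolve with success
  forcedOffline: boolean;
  waitedMs: number;
}

const POLL_INTERVAL_MS = 500;
const TIMEOUT_MS = 10000;

// Simulates polling at 0 ms, 500 ms, 1000 ms, ... until the timeout.
function simulateTerminationWait(
  readStatusAt: (elapsedMs: number) => DroneStatus,
  pollMs: number = POLL_INTERVAL_MS,
  timeoutMs: number = TIMEOUT_MS,
): TerminationResult {
  for (let elapsed = 0; elapsed < timeoutMs; elapsed += pollMs) {
    if (readStatusAt(elapsed) === "Offline") {
      // The drone shut down on its own within the window.
      return { success: true, forcedOffline: false, waitedMs: elapsed };
    }
  }
  // Timeout: the caller would now force the status to Offline.
  return { success: true, forcedOffline: true, waitedMs: timeoutMs };
}
```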
---
## 4. Message Signatures (TS Reference)
### IDE -> Web
```typescript
type RequestSessionLockMessage = (
  registration: IDroneRegistration,
  project: IProject,
  chatSession: IChatSession,
  cb: (success: boolean, chatSessionId: string) => void
) => void;

type SubmitPromptMessage = (prompt: string) => void;
```
### Web -> Drone
```typescript
type ProcessWorkOrderMessage = (
  registration: IDroneRegistration,
  project: IProject,
  chatSession: IChatSession,
  turn: IChatTurn,
  cb: (success: boolean, message?: string) => void
) => void;

type RequestTerminationMessage = (
  cb: (success: boolean) => void
) => void;
```
### Web -> IDE
```typescript
type SessionUpdatedMessage = (
  updates: Partial<IChatSession>
) => void;
```
### Drone -> Web (Streaming)
```typescript
type ThinkingMessage = (content: string) => void;
type ResponseMessage = (content: string) => void;

type ToolCallMessage = (
  callId: string,
  name: string,
  params: string,   // JSON.stringify
  response: string  // JSON.stringify
) => void;

type WorkOrderCompleteMessage = (
  workOrderId: string,
  success: boolean,
  message?: string
) => void;

// Same shape as the Web -> Drone message; listed here for the drone's acknowledgment.
type RequestTerminationMessage = (
  cb: (success: boolean) => void
) => void;
```
---
## 5. Session Implementation Guide (Web Server)
The web server (`gadget-code:web`) implements two wrapper classes in `src/lib/`:
### `CodeSession.ts`
Manages the IDE socket.
* **Logic**: Maps User ID -> Socket ID.
* **Routing**: When an IDE sends a message, `CodeSession` finds the selected drone's `DroneSession` and forwards the command.
### `DroneSession.ts`
Manages the Drone socket.
* **Logic**: Maps Drone Registration ID -> Socket ID.
* **Routing**: When a drone streams, `DroneSession` looks up the `chatSessionId` in the `SocketService` index to find the return path to the IDE.
* **Session Lookup**: `SocketService` maintains a `droneRegistrationIndex` Map that maps `registration._id` -> `DroneSession` for efficient lookup by registration ID.
### Session Indexing Architecture
The `SocketService` maintains multiple indexes for efficient session lookup:
1. **`droneSessions`**: `Map<socket.id, DroneSession>` - Primary storage by socket ID
2. **`droneRegistrationIndex`**: `Map<registration._id, DroneSession>` - Lookup by drone registration
3. **`codeSessions`**: `Map<socket.id, CodeSession>` - Primary storage by socket ID
4. **`codeSessionUserIndex`**: `Map<user._id, CodeSession>` - Lookup by user ID
5. **`chatSessionIndex`**: `Map<chatSessionId, CodeSession>` - Reverse lookup from chat session to IDE
All indexes are kept in sync during connection and disconnection.
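
One way to keep the five maps consistent is to update them together in add and remove methods, as in the sketch below. The map names follow the list above, but `SessionIndexSketch`, the `*Like` interfaces, and the method names are hypothetical. The identity check on removal is an assumption, added so that a late disconnect from an old socket does not unlink a newer session for the same user or chat.

```typescript
interface DroneSessionLike {
  socketId: string;
  registrationId: string;
}

interface CodeSessionLike {
  socketId: string;
  userId: string;
  chatSessionId?: string;
}

class SessionIndexSketch {
  private readonly droneSessions = new Map<string, DroneSessionLike>();
  private readonly droneRegistrationIndex = new Map<string, DroneSessionLike>();
  private readonly codeSessions = new Map<string, CodeSessionLike>();
  private readonly codeSessionUserIndex = new Map<string, CodeSessionLike>();
  private readonly chatSessionIndex = new Map<string, CodeSessionLike>();

  addDroneSession(session: DroneSessionLike): void {
    this.droneSessions.set(session.socketId, session);
    this.droneRegistrationIndex.set(session.registrationId, session);
  }

  removeDroneSession(socketId: string): void {
    const session = this.droneSessions.get(socketId);
    if (!session) return;
    this.droneSessions.delete(socketId);
    if (this.droneRegistrationIndex.get(session.registrationId) === session) {
      this.droneRegistrationIndex.delete(session.registrationId);
    }
  }

  addCodeSession(session: CodeSessionLike): void {
    this.codeSessions.set(session.socketId, session);
    this.codeSessionUserIndex.set(session.userId, session);
  }

  linkChatSession(socketId: string, chatSessionId: string): void {
    const session = this.codeSessions.get(socketId);
    if (!session) throw new Error(`No code session for socket ${socketId}`);
    session.chatSessionId = chatSessionId;
    this.chatSessionIndex.set(chatSessionId, session);
  }

  removeCodeSession(socketId: string): void {
    const session = this.codeSessions.get(socketId);
    if (!session) return;
    this.codeSessions.delete(socketId);
    // Drop secondary entries only if they still point at this session.
    if (this.codeSessionUserIndex.get(session.userId) === session) {
      this.codeSessionUserIndex.delete(session.userId);
    }
    if (
      session.chatSessionId !== undefined &&
      this.chatSessionIndex.get(session.chatSessionId) === session
    ) {
      this.chatSessionIndex.delete(session.chatSessionId);
    }
  }

  getDroneByRegistration(registrationId: string): DroneSessionLike | undefined {
    return this.droneRegistrationIndex.get(registrationId);
  }

  getCodeByChatSession(chatSessionId: string): CodeSessionLike | undefined {
    return this.chatSessionIndex.get(chatSessionId);
  }
}
```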
---
## 6. Workspace Crash Recovery
1. **Drone** starts -> checks for `.gadget/work-order.json`.
2. If found, emits `requestCrashRecovery({ workspaceId, turnId, chatSessionId })`.
3. **Web (`DroneSession.ts`)**:
   * Checks DB for `ChatTurn` status.
   * If turn is already `finished`, responds with `{ action: "discard" }`.
   * If turn is `processing`, responds with `{ action: "retry" }` and schedules a new `processWorkOrder` after a delay.
4. **Drone**: Deletes local cache upon receiving any `crashRecoveryResponse`.
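
The server-side decision in step 3 reduces to a small pure function. The sketch below covers the two cases stated above; treating an `error` status or a missing turn as `discard` is an assumption, since this document does not say how those cases are handled. `decideRecovery` is a hypothetical name.

```typescript
type RecoveryAction = "discard" | "retry";
type RecoveredTurnStatus = "processing" | "finished" | "error";

// `undefined` models a turn that could not be found in the database.
function decideRecovery(
  status: RecoveredTurnStatus | undefined,
): { action: RecoveryAction } {
  // Only a turn still marked `processing` is retried; everything else
  // (finished, and by assumption error or missing) is discarded.
  return { action: status === "processing" ? "retry" : "discard" };
}
```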
---
## 7. Extending the Protocol
To add a new message:
1. Add the message type to `packages/api/src/messages/ide.ts`, `drone.ts`, or `web.ts`.
2. Register it in `ClientToServerEvents` or `ServerToClientEvents` in `packages/api/src/messages/socket.ts`.
3. Re-export from `packages/api/src/index.ts`.
4. Implement the sender (emit) in the Client (`ide` or `drone`) or Server (`CodeSession`/`DroneSession`).
5. Implement the handler in the corresponding class or frontend component.
6. Implement the forward-path routing if needed.
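
As an illustration of steps 1, 2, 4, and 5, the sketch below defines a made-up `pauseWorkOrder` message, registers it in an event-map interface, and wires a typed emit and handler. `PauseWorkOrderMessage`, `ServerToClientEventsSketch`, and `TypedChannel` are hypothetical; `TypedChannel` is a tiny local stand-in for a typed socket so the example runs without socket.io.

```typescript
// Step 1: the message type (would live in ide.ts, drone.ts, or web.ts).
type PauseWorkOrderMessage = (
  turnId: string,
  cb: (success: boolean) => void,
) => void;

// Step 2: register the event name in the event map.
interface ServerToClientEventsSketch {
  pauseWorkOrder: PauseWorkOrderMessage;
}

// Local stand-in for a typed socket: event names and argument lists are
// checked against the event map at compile time.
class TypedChannel<E extends { [K in keyof E]: (...args: any[]) => void }> {
  private readonly handlers = new Map<keyof E, (...args: any[]) => void>();

  // Step 5: the receiving side registers a handler.
  on<K extends keyof E>(event: K, handler: E[K]): void {
    this.handlers.set(event, handler);
  }

  // Step 4: the sending side emits with arguments matching the message type.
  emit<K extends keyof E>(event: K, ...args: Parameters<E[K]>): void {
    this.handlers.get(event)?.(...args);
  }
}
```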
---
## 8. Reconnection & Message Queuing
### 8.1 Problem Statement
When the browser refreshes during work order processing:
1. Old `CodeSession` disconnects, but `DroneSession` continues routing to it
2. Drone emits events but they go to a disconnected socket
3. New `CodeSession` connects but isn't linked to the active chat session
4. Messages are lost; IDE never receives streaming updates
### 8.2 Solution Architecture
**Three-phase approach:**
1. **Redis Message Queue** (`src/lib/message-queue.ts`)
   - Messages enqueued when routing fails (disconnected socket)
   - FIFO ordering with RPUSH/LPOP
   - 30-minute TTL (1800 seconds)
   - Max 1000 messages (drop oldest)
   - Aggregates adjacent thinking/response messages during drain
2. **Redis Tab Lock** (`src/lib/tab-lock.ts`)
   - Prevents concurrent tab access to same chat session
   - 1-minute timeout (requires heartbeat renewal)
   - Includes socket ID and user ID for validation
   - Auto-cleanup of stale locks
3. **Auto-Reconnection** (`CodeSession.checkAndReestablishActiveSession()`)
   - On connect, checks for active processing turn in DB
   - If found, attempts to acquire tab lock
   - On success, re-establishes chat session index
   - Drains queued messages from Redis
   - Aggregates and delivers messages to client
### 8.3 Message Queue Flow
```
Drone emits thinking() → DroneSession.onThinking()
  ↓
SocketService.getCodeSessionByChatSessionId() throws (disconnected)
  ↓
MessageQueue.enqueue(chatSessionId, { type: 'thinking', args: [...] })
  ↓
[30 minutes later] Queue expires automatically
  OR
[On reconnect] MessageQueue.drain() → aggregateMessages() → deliver
```
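
The queue semantics above (FIFO, drop-oldest cap, aggregation on drain) can be modeled in memory without Redis. This is a sketch under assumptions, not the code in `src/lib/message-queue.ts`: `InMemoryMessageQueue` and the `QueuedMessage` shape are hypothetical, arrays stand in for Redis lists, and the 30-minute TTL is not modeled.

```typescript
type QueuedMessage =
  | { type: "thinking" | "response"; args: [string] }
  | { type: "toolCall"; args: [string, string, string, string] };

const MAX_MESSAGES = 1000;

// Merge runs of adjacent thinking (or adjacent response) chunks into one message.
function aggregateMessages(messages: QueuedMessage[]): QueuedMessage[] {
  const out: QueuedMessage[] = [];
  for (const message of messages) {
    const last = out[out.length - 1];
    if (
      last !== undefined &&
      last.type !== "toolCall" &&
      message.type !== "toolCall" &&
      last.type === message.type
    ) {
      out[out.length - 1] = {
        type: message.type,
        args: [last.args[0] + message.args[0]],
      };
    } else {
      out.push(message);
    }
  }
  return out;
}

class InMemoryMessageQueue {
  private readonly queues = new Map<string, QueuedMessage[]>();
  private readonly maxMessages: number;

  constructor(maxMessages: number = MAX_MESSAGES) {
    this.maxMessages = maxMessages;
  }

  enqueue(chatSessionId: string, message: QueuedMessage): void {
    const queue = this.queues.get(chatSessionId) ?? [];
    queue.push(message); // append at the tail, like RPUSH
    while (queue.length > this.maxMessages) queue.shift(); // drop oldest
    this.queues.set(chatSessionId, queue);
  }

  // Removes and returns everything queued for the chat session, aggregated.
  drain(chatSessionId: string): QueuedMessage[] {
    const queue = this.queues.get(chatSessionId) ?? [];
    this.queues.delete(chatSessionId);
    return aggregateMessages(queue);
  }
}
```

Tool calls are never merged, so a `toolCall` between two `thinking` chunks keeps them as separate messages and preserves the original ordering.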
### 8.4 Tab Lock Flow
```
IDE connects → CodeSession.register()
  ↓
checkAndReestablishActiveSession()
  ↓
Find active chat session with processing turn
  ↓
TabLock.acquire(chatSessionId, userId, socketId)
  ↓
Success: Register chat session, drain queue, emit status
  OR
Failure: Emit 'tabLockDenied' → IDE navigates away
```
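
The lock behavior (one holder per chat session, 1-minute expiry, heartbeat renewal) can be sketched in memory with an injectable clock. `InMemoryTabLock` is hypothetical and only models the semantics; the implementation in `src/lib/tab-lock.ts` uses Redis, and its exact commands and method names beyond `acquire` are not given here.

```typescript
const LOCK_TTL_MS = 60000; // 1-minute timeout, renewed by heartbeat

interface LockEntry {
  userId: string;
  socketId: string;
  expiresAt: number;
}

class InMemoryTabLock {
  private readonly locks = new Map<string, LockEntry>();
  private readonly now: () => number;

  // The clock is injectable so expiry can be exercised without waiting.
  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  acquire(chatSessionId: string, userId: string, socketId: string): boolean {
    const existing = this.locks.get(chatSessionId);
    if (
      existing !== undefined &&
      existing.expiresAt > this.now() &&
      existing.socketId !== socketId
    ) {
      return false; // another tab holds a live lock
    }
    // Free, stale, or already ours: take (or refresh) the lock.
    this.locks.set(chatSessionId, {
      userId,
      socketId,
      expiresAt: this.now() + LOCK_TTL_MS,
    });
    return true;
  }

  // Extends the lock only for the socket that currently holds it.
  heartbeat(chatSessionId: string, socketId: string): boolean {
    const existing = this.locks.get(chatSessionId);
    if (
      existing === undefined ||
      existing.socketId !== socketId ||
      existing.expiresAt <= this.now()
    ) {
      return false;
    }
    existing.expiresAt = this.now() + LOCK_TTL_MS;
    return true;
  }

  release(chatSessionId: string, socketId: string): void {
    if (this.locks.get(chatSessionId)?.socketId === socketId) {
      this.locks.delete(chatSessionId);
    }
  }
}
```

A second tab that is refused here corresponds to the `tabLockDenied` path in the flow above; once the first tab stops sending heartbeats, the lock goes stale and the next `acquire` succeeds.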
### 8.5 Frontend Reconciliation
The frontend handles reconnection gracefully:
1. **Load history first** - Fetch chat session and turns from DB
2. **Connect socket** - Establish the Socket.IO connection
3. **Backend auto-reconnects** - If processing turn found, backend re-establishes
4. **Receive queued messages** - Aggregated messages delivered in order
5. **Handle duplicates** - Frontend merges with existing history
### 8.6 Single Tab Enforcement
Only one tab can control a chat session at a time:
- First tab acquires Redis lock
- Subsequent tabs receive `tabLockDenied` event
- UI shows "Chat session open in another browser tab"
- User must navigate away or close the duplicate tab
### 8.7 Status Indicators
The status bar shows connection state:
- **Connected** (green ●) - Socket connected, receiving messages
- **Connecting** (yellow ●) - Attempting to connect
- **Error** (red ●) - Connection failed
- **Disconnected** (gray ●) - No active connection
Status messages inform the user:
- "Connecting..." - Initial connection
- "Reconnecting to active session..." - Auto-reconnect in progress
- "Reconnected" - Successfully reconnected
- "Chat session is open in another browser tab" - Tab lock denied