Inspired by Pokémon Snap (1999). A VLM pilots a drone through a 3D world to locate and identify creatures.
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'background': '#ffffff', 'primaryColor': '#ffffff'}}}%%
flowchart LR
    subgraph Controller["**Controller** (Rust)"]
        C[Orchestration]
    end
    subgraph VLM["**VLM** (OpenRouter)"]
        V[Vision-Language Model]
    end
    subgraph Simulation["**Simulation** (Zig/raylib)"]
        S[Game State]
    end
    C -->|"screenshot + prompt"| V
    C <-->|"cmds + state<br>**UDP:9999**"| S
    style Controller fill:#8B5A2B,stroke:#5C3A1A,color:#fff
    style VLM fill:#87CEEB,stroke:#5BA3C6,color:#1a1a1a
    style Simulation fill:#4A7C23,stroke:#2D5A10,color:#fff
    style C fill:#B8864A,stroke:#8B5A2B,color:#fff
    style V fill:#B5E0F7,stroke:#87CEEB,color:#1a1a1a
    style S fill:#6BA33A,stroke:#4A7C23,color:#fff
```
The simulation generates procedural terrain and spawns creatures (cat, dog, pig, sheep) for the drone to discover. It handles drone physics and collision detection, and accepts 8 movement commands plus identify and screenshot. The Rust controller captures frames from the simulation, builds prompts enriched with position and state data, then parses the VLM's responses into executable command sequences. The objective is to locate and successfully identify 3 creatures; an identify command succeeds only when the drone is within 5 units of a target.
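The identify rule and the 3-creature objective can be illustrated with a small Rust sketch. This is not the project's actual code: the names `Vec3`, `Creature`, `try_identify`, and `Mission` are hypothetical, and it assumes straight-line (Euclidean) distance in 3D, an inclusive 5-unit boundary, that the nearest in-range creature is the one identified, and that the same creature does not count twice.

```rust
// Values taken from the description above.
const IDENTIFY_RANGE: f32 = 5.0; // identify succeeds within 5 units
const TARGET_COUNT: usize = 3; // creatures required to finish

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Species {
    Cat,
    Dog,
    Pig,
    Sheep,
}

#[derive(Debug, Clone, Copy)]
struct Vec3 {
    x: f32,
    y: f32,
    z: f32,
}

impl Vec3 {
    // Assumption: range is measured as Euclidean distance in 3D.
    fn distance(self, other: Vec3) -> f32 {
        let (dx, dy, dz) = (self.x - other.x, self.y - other.y, self.z - other.z);
        (dx * dx + dy * dy + dz * dz).sqrt()
    }
}

#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
struct Creature {
    id: u32, // hypothetical unique id, used to avoid double counting
    species: Species,
    pos: Vec3,
}

/// Returns the nearest creature within IDENTIFY_RANGE of the drone, if any.
/// Assumption: the boundary is inclusive (exactly 5 units still succeeds).
#[allow(dead_code)]
fn try_identify(drone: Vec3, creatures: &[Creature]) -> Option<Creature> {
    creatures
        .iter()
        .copied()
        .map(|c| (drone.distance(c.pos), c))
        .filter(|(d, _)| *d <= IDENTIFY_RANGE)
        .min_by(|a, b| a.0.total_cmp(&b.0))
        .map(|(_, c)| c)
}

/// Tracks progress toward the objective of identifying 3 creatures.
#[allow(dead_code)]
struct Mission {
    identified: Vec<u32>,
}

#[allow(dead_code)]
impl Mission {
    fn new() -> Self {
        Mission { identified: Vec::new() }
    }

    /// Records a successful identify; returns false if already counted.
    fn record(&mut self, c: Creature) -> bool {
        if self.identified.contains(&c.id) {
            return false;
        }
        self.identified.push(c.id);
        true
    }

    fn is_complete(&self) -> bool {
        self.identified.len() >= TARGET_COUNT
    }
}
```

In a controller loop under these assumptions, each identify command would call `try_identify` with the drone's current position, pass any hit to `Mission::record`, and stop once `is_complete` returns true. Whether the real simulation measures distance this way or counts repeat identifications is not stated here, so treat those choices as placeholders.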