My goal last time was to use the world model for the control task of centering an object in the view of the robot arm. Here’s the result.
There were two stages to getting this to work. The first was building a Claude-based controller, which used the world model to predict the outcome of each candidate move (pan left, pan right, or hold) and then had Claude review the predictions and decide which action was best. This was really slow, so the second stage was to use the Claude controller to generate data for training the faster controller that you see in the video. Along the way I ended up experimenting with letting AI control more of the robot training process, although that was somewhat less successful.
My original workflow consisted of two separate processes: data collection and model training. I decided to experiment with combining them and having Claude decide when to move back and forth between the two. At a high level the unified process, which I’ll call the autonomous learner, has five states: collect data, validate (test the model on new data), train, think, and pause for human intervention (manually move objects around in the scene). The autonomous learner is mainly for orchestration and uses code from the existing data collection and training repos for the collect data, train, and validate states. The think state is the new part of this process and is essentially a prompt to Claude with data from the other states (e.g. the learning curve, info on the data collected, configuration parameters, etc.). Claude reviews this data and then decides what the next state should be and what adjustments should be made, e.g. changing the architecture of the model, fine-tuning vs. training from scratch, or how much data to collect and over what range of motion.
The autonomous learner did make the entire process of collecting data and training easier, but the decision making by Claude didn’t lead to better models in the same way automating the training process did in the last post. Claude would often make a reasonable analysis, e.g. decide there was a lack of diversity in the data, but would then fail to direct the learner to pause for a human to rearrange the scene. The convenience came more from putting data collection and training into a loop, and the transitions between data collection and training ended up being more scripted.
The final model was a diffusion transformer with about 1 billion parameters. It had a good ability to predict counterfactual actions within the training data.
Some action conditioning was evident when running inference live, but the results were not nearly as good.
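To make the autonomous learner's five states more concrete, here is a minimal sketch of the loop as a state machine. It is not the actual code: the transition order, the Context fields, the placeholder loss values, and the scripted_think function (which stands in for the prompt to Claude) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class State(Enum):
    COLLECT = "collect"
    VALIDATE = "validate"
    TRAIN = "train"
    THINK = "think"
    PAUSE = "pause"


# Placeholder validation losses, one per training round (made-up numbers).
PLACEHOLDER_LOSSES = [0.6, 0.2, 0.2]


@dataclass
class Context:
    """Hypothetical summary of what the think state would see."""
    episodes: int = 0
    train_rounds: int = 0
    val_losses: List[float] = field(default_factory=list)
    history: List[State] = field(default_factory=list)


def scripted_think(ctx: Context) -> State:
    """Stand-in for the Claude call: picks the next state from simple rules."""
    if ctx.episodes < 20:
        return State.COLLECT  # not enough data yet
    if len(ctx.val_losses) >= 2 and ctx.val_losses[-1] >= ctx.val_losses[-2]:
        return State.PAUSE  # validation stalled: ask a human to rearrange the scene
    return State.TRAIN


def run(ctx: Context, think: Callable[[Context], State], max_steps: int = 20) -> Context:
    """Assumed flow: collect -> think, train -> validate -> think, pause stops the loop."""
    state = State.COLLECT
    for _ in range(max_steps):
        ctx.history.append(state)
        if state is State.COLLECT:
            ctx.episodes += 10  # a real learner would record robot episodes here
            state = State.THINK
        elif state is State.TRAIN:
            ctx.train_rounds += 1  # a real learner would launch a training job here
            state = State.VALIDATE
        elif state is State.VALIDATE:
            index = min(ctx.train_rounds, len(PLACEHOLDER_LOSSES)) - 1
            ctx.val_losses.append(PLACEHOLDER_LOSSES[index])
            state = State.THINK
        elif state is State.THINK:
            state = think(ctx)
        else:  # State.PAUSE: hand control back to the human
            break
    return ctx


if __name__ == "__main__":
    result = run(Context(), scripted_think)
    print([s.value for s in result.history])
```

Passing the think step in as a function keeps the orchestration the same whether the decision comes from scripted rules, as here, or from a model call.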
The initial plan was to use a vision language model (VLM) in conjunction with the world model to control the robot, similar to what I did in my earlier project. The idea was to use the world model to get predicted frames for different actions, then use the VLM to choose the prediction that was closest to the desired end state. I didn’t want to manually collect data to fine-tune the VLM like last time, so I tried a few different local models, including various versions of Gemma and Qwen. None of them performed that well, so I ended up trying calls to Claude at max effort via the Claude Code CLI (claude -p --model opus --effort max). This worked when I initially tried it, but was extremely slow (about 25-50 seconds per action). It’s worth noting that I recently tried it again for live inference and it did not work well, which may be due to changes in lighting or other out-of-distribution factors.
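The predict-then-choose idea can be sketched in a few lines. This is a toy version under stated assumptions: toy_world_model just shifts a tiny grayscale grid instead of running a diffusion model, centering_error replaces the VLM's judgment with a brightness centroid, and the convention that panning left moves the object right in the view is my own guess. None of these names come from the real project.

```python
from typing import Callable, List

ACTIONS = ("pan_left", "pan_right", "hold")
Frame = List[List[int]]  # toy grayscale image as rows of pixel intensities


def toy_world_model(frame: Frame, action: str) -> Frame:
    """Hypothetical world model: predicts the next frame by shifting one column.

    Assumed convention: panning left moves the object right in the view.
    Pixels shifted in from outside the frame are filled with 0.
    """
    shift = {"pan_left": 1, "pan_right": -1, "hold": 0}[action]
    width = len(frame[0])
    return [
        [row[x - shift] if 0 <= x - shift < width else 0 for x in range(width)]
        for row in frame
    ]


def centering_error(frame: Frame) -> float:
    """Stand-in for the VLM: horizontal distance of the brightness centroid from center."""
    total = sum(sum(row) for row in frame)
    if total == 0:
        return float("inf")  # object left the view entirely
    width = len(frame[0])
    centroid = sum(x * v for row in frame for x, v in enumerate(row)) / total
    return abs(centroid - (width - 1) / 2)


def choose_action(
    frame: Frame,
    predict: Callable[[Frame, str], Frame] = toy_world_model,
    score: Callable[[Frame], float] = centering_error,
) -> str:
    """Predict one frame per candidate action, then keep the best-scoring action."""
    predictions = {action: predict(frame, action) for action in ACTIONS}
    return min(ACTIONS, key=lambda action: score(predictions[action]))


if __name__ == "__main__":
    object_on_left = [[9, 0, 0, 0, 0]]
    print(choose_action(object_on_left))
```

In the real setup the scoring step is the slow part, since each decision requires a model call to compare the predicted frames.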
I had a vague notion that large models could be distilled into smaller, more efficient models, so after some consultation with Claude I decided to try training a fast model through behavior cloning. The idea was to take frames from the training data and feed them into the Claude robot controller. Under the hood this reuses the same predict-then-rank loop as the slow controller: the world model rolls each sampled frame forward under every candidate action and Claude picks the prediction that best centers the Kong. This created a new dataset of frames labeled with actions, which was used to train a simple image-to-action classifier: a ResNet18 plus a multilayer perceptron (~11M parameters). This ended up working well and cut the time to action from 25-50 seconds to about a second, now limited by the arm physically moving rather than by compute. It also worked much better live than the world model, which is interesting because it may imply that the world model doesn’t need to generalize that well to still be useful for teaching the faster model. It could also mean I should try using a pre-trained backbone for the world model.
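The behavior cloning step boils down to two parts: have the slow controller label frames, then fit a fast classifier to those labels. The sketch below shows that flow on toy data and is not the real pipeline. The frames are 1-D lists with a single bright pixel, slow_teacher is a hypothetical stand-in for the Claude controller, and a small multiclass perceptron is swapped in for the ResNet18 + MLP so the example runs with only the standard library.

```python
from typing import Callable, List, Sequence, Tuple

ACTIONS = ("pan_left", "pan_right", "hold")
Frame = List[float]  # toy 1-D "image": one bright pixel marks the object


def make_frame(position: int, width: int = 7) -> Frame:
    """Build a toy frame with the object at the given pixel position."""
    return [1.0 if x == position else 0.0 for x in range(width)]


def slow_teacher(frame: Frame) -> int:
    """Hypothetical stand-in for the slow controller; returns an action index.

    Assumed convention: pan left when the object sits left of center.
    """
    position = frame.index(max(frame))
    center = len(frame) // 2
    if position < center:
        return ACTIONS.index("pan_left")
    if position > center:
        return ACTIONS.index("pan_right")
    return ACTIONS.index("hold")


def label_frames(
    frames: Sequence[Frame], teacher: Callable[[Frame], int]
) -> List[Tuple[Frame, int]]:
    """Query the teacher once per frame to build (frame, action) training pairs."""
    return [(frame, teacher(frame)) for frame in frames]


class PerceptronStudent:
    """Multiclass perceptron, used here in place of the ResNet18 + MLP classifier."""

    def __init__(self, n_features: int, n_actions: int) -> None:
        # One weight row per action; the extra column is the bias term.
        self.weights = [[0.0] * (n_features + 1) for _ in range(n_actions)]

    def predict(self, frame: Frame) -> int:
        x = list(frame) + [1.0]
        scores = [sum(w * v for w, v in zip(row, x)) for row in self.weights]
        return scores.index(max(scores))

    def fit(self, dataset: Sequence[Tuple[Frame, int]], epochs: int = 50) -> None:
        for _ in range(epochs):
            mistakes = 0
            for frame, label in dataset:
                guess = self.predict(frame)
                if guess != label:
                    mistakes += 1
                    x = list(frame) + [1.0]
                    for i, v in enumerate(x):
                        self.weights[label][i] += v  # pull the teacher's action up
                        self.weights[guess][i] -= v  # push the wrong guess down
            if mistakes == 0:
                break  # the student now reproduces every teacher label


if __name__ == "__main__":
    frames = [make_frame(p) for p in range(7)]
    dataset = label_frames(frames, slow_teacher)
    student = PerceptronStudent(n_features=7, n_actions=len(ACTIONS))
    student.fit(dataset)
    print([ACTIONS[student.predict(f)] for f in frames])
```

The student never calls the teacher at inference time, which is where the speedup comes from: one cheap forward pass replaces a world model rollout plus a model call per candidate action.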
There are a lot of potential directions to go from here, but I think the main medium- to long-term goal is to perform some sort of pick-and-place task. In the shorter term, the goal is to train the arm to touch an object, i.e. to generalize the process to multiple joints.