
LeRobot + ACT + SO-101: How We Built an Imitation Learning Setup

xBerry’s LeRobot – a robotic arm, imitation learning and a bit of tape. That’s what our internal R&D project with LeRobot and the ACT model looks like.

TL;DR – what’s this project about?

We built a setup where a robotic arm learns to pick up objects by watching a human move. No trajectory programming. No hundreds of lines of control code. Instead – AI, depth cameras, and something called imitation learning.

 

This is the first article in a series about the project. We’ll walk you through what we’re using, why we chose it, and what came out of it.

 

Hardware: two arms, one job

The heart of our setup is a pair of SO-101 robotic manipulators – each with 6 degrees of freedom (6 DoF, counting the gripper), which in practice covers roughly what a human arm can do: rotate at the shoulder, elbow and wrist, and close the hand.

 

They operate in a Leader–Follower configuration:

 

  • Leader (black arm) – a human operator holds it and demonstrates the movement,
  • Follower (white arm) – mirrors every move in real time.

This setup lets us record expert demonstrations without writing a single line of inverse kinematics code. The operator simply… shows the robot what to do. Sounds simple, and that’s exactly the point.
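To make the Leader–Follower idea concrete, here is a minimal sketch of one recording loop. It is not the LeRobot API: `SimulatedArm`, `Episode`, `teleop_step` and `record_demo` are hypothetical names, and the simulated arm only stores joint positions so the example runs without hardware.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple

JointPositions = Tuple[float, ...]


class SimulatedArm:
    """Hypothetical stand-in for a real arm driver; it only stores joint positions."""

    def __init__(self, num_joints: int = 6) -> None:
        self.positions: JointPositions = (0.0,) * num_joints

    def read(self) -> JointPositions:
        return self.positions

    def write(self, target: JointPositions) -> None:
        self.positions = target


@dataclass
class Episode:
    # Each frame pairs the follower state with the action the leader commanded.
    frames: List[Tuple[JointPositions, JointPositions]] = field(default_factory=list)


def teleop_step(leader: SimulatedArm, follower: SimulatedArm, episode: Episode) -> None:
    observation = follower.read()  # follower state before the command
    action = leader.read()         # the operator's hand-guided pose is the action label
    follower.write(action)         # the follower mirrors the leader
    episode.frames.append((observation, action))


def record_demo(waypoints: Iterable[JointPositions]) -> Tuple[SimulatedArm, Episode]:
    leader, follower, episode = SimulatedArm(), SimulatedArm(), Episode()
    for pose in waypoints:
        leader.write(pose)  # simulates the operator moving the leader arm
        teleop_step(leader, follower, episode)
    return follower, episode
```

In a real setup the frames would also carry camera images, but the core of a demonstration stays the same: pairs of "what the robot saw" and "what the human did".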

 

Software: LeRobot by Hugging Face


For managing the entire research pipeline, we use LeRobot – an open-source framework from Hugging Face that does exactly what we needed: strips out all the tedious, error-prone integration code and lets you focus on what actually matters – the model architecture.

What does LeRobot handle for us?

  • Multi-threaded data recording from multiple cameras simultaneously,
  • Dataset compression and categorisation,
  • Training neural network policies,
  • Evaluating trained models.

The whole thing is configured through YAML files and a Makefile. Seriously – that’s all it takes to run an advanced robotics experiment on a standard desktop machine.

 

Why does this matter for business? LeRobot is democratising AI robotics. Projects that three years ago required dedicated labs and a team of senior engineers can now be run by a group of interns with a single GPU.

 

Method: Imitation Learning

Forget classical robot programming for a moment. Forget reinforcement learning too, where a robot bangs its head against the wall over and over until it finally learns to avoid it.

 

Imitation Learning (IL) is an approach where the robot simply watches a human and learns.

 

Historically, it started with basic Behavioral Cloning – “I see a camera frame, I pick a motor movement.” Today it’s a much richer field: agents can reason about 3D space, recognise irregular objects, and generalise behaviours to new situations.
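Stripped to its essentials, Behavioral Cloning is supervised learning on (observation, action) pairs. The toy sketch below fits a one-dimensional linear policy by least squares. The data and the function name `fit_linear_policy` are made up for illustration; a real policy maps camera frames to joint targets with a neural network.

```python
from typing import Callable, List, Tuple


def fit_linear_policy(demos: List[Tuple[float, float]]) -> Callable[[float], float]:
    """Fit action = w * observation + b by ordinary least squares."""
    n = len(demos)
    mean_obs = sum(obs for obs, _ in demos) / n
    mean_act = sum(act for _, act in demos) / n
    cov = sum((obs - mean_obs) * (act - mean_act) for obs, act in demos)
    var = sum((obs - mean_obs) ** 2 for obs, _ in demos)
    w = cov / var
    b = mean_act - w * mean_obs
    return lambda obs: w * obs + b


# Toy "expert demonstrations": the expert always answers 2 * observation + 1.
expert_demos = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
policy = fit_linear_policy(expert_demos)
```

Calling `policy(4.0)` on an observation the expert never showed returns a value close to 9.0, which is the sense in which the policy "imitates" rather than replays.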

 

In practice, this means we can teach a robot a task without ever mathematically describing that task. Which is kind of a big deal.

 

Algorithm: ACT, because regular models get stuck

We went with ACT (Action Chunking with Transformers) – no regrets.

 

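Policies that predict one action at a time suffer from compounding errors: each small mistake pushes the robot into states the demonstrations never covered, so the trajectory drifts further the longer the episode runs. ACT handles this elegantly: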

 

  1. Instead of planning one step at a time, it predicts entire “chunks” of future actions all at once.
  2. It combines the attention mechanism of the Transformer architecture with a conditional VAE (Variational Autoencoder), which is used during training to capture the variability in human demonstrations.
  3. Overlapping chunks are blended over time (temporal ensembling): every prediction made for a given timestep is combined into a weighted average, producing smooth, jitter-free physical motion.

The result? The arm moves like a human – not like a robot from the 90s.
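Chunking and temporal ensembling can be sketched in a few lines. This is an illustrative toy with scalar actions, not the LeRobot implementation: `predict_chunk` stands in for the trained model, and the weighting `exp(-m * i)` (with index 0 being the oldest prediction) and the default `m = 0.01` are assumptions based on the ACT paper's description.

```python
import math
from typing import Callable, Dict, List


def ensemble_action(predictions: List[float], m: float = 0.01) -> float:
    """Weighted average of all predictions made for one timestep.

    predictions[0] comes from the oldest chunk; weight_i = exp(-m * i).
    """
    weights = [math.exp(-m * i) for i in range(len(predictions))]
    return sum(w * p for w, p in zip(weights, predictions)) / sum(weights)


def run_with_ensembling(
    predict_chunk: Callable[[int, int], List[float]],
    num_steps: int,
    chunk_size: int,
    m: float = 0.01,
) -> List[float]:
    # pending[t] collects every prediction made so far for timestep t, oldest first.
    pending: Dict[int, List[float]] = {}
    executed: List[float] = []
    for t in range(num_steps):
        # The model predicts actions for timesteps t .. t + chunk_size - 1.
        chunk = predict_chunk(t, chunk_size)
        for offset, action in enumerate(chunk):
            pending.setdefault(t + offset, []).append(action)
        # Execute the blend of everything predicted for the current timestep.
        executed.append(ensemble_action(pending.pop(t), m))
    return executed


def toy_predictor(t: int, chunk_size: int) -> List[float]:
    # Hypothetical model: always predicts "action equals the timestep index".
    return [float(t + i) for i in range(chunk_size)]
```

Because a new chunk is predicted at every step, each executed action is a blend of up to `chunk_size` overlapping predictions, so a single noisy prediction has limited influence on the motion.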

    What does training actually look like?

    The visual backbone is ResNet-18 – proven, stable, and light enough to train on a desktop. The entire training loop is triggered by a single module:

     
    lerobot.scripts.lerobot_train
     

    Key parameters we configured:

     

    • chunk_size=16 – the length of the action chunk predicted by the model
    • batch_size=8 – input batch size
    • expandable_segments – a PyTorch CUDA allocator option (enabled via the PYTORCH_CUDA_ALLOC_CONF environment variable) that reduces memory fragmentation and helps the whole process fit on a standard workstation
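As a rough sketch, the parameters above could be assembled into a launch command like this. The module name comes from the text, and `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` is a standard PyTorch setting. The flag spellings follow LeRobot's dotted command-line style as we understand it and are assumptions to check against the installed version. The dataset name is a placeholder, and the helper functions are hypothetical.

```python
import os
import sys
from typing import Dict, List, Optional


def build_train_command(
    dataset_repo_id: str, chunk_size: int = 16, batch_size: int = 8
) -> List[str]:
    """Build the argument list for the training module (flag names are assumptions)."""
    return [
        sys.executable, "-m", "lerobot.scripts.lerobot_train",
        f"--dataset.repo_id={dataset_repo_id}",
        "--policy.type=act",
        f"--policy.chunk_size={chunk_size}",
        f"--batch_size={batch_size}",
    ]


def build_train_env(base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy the environment and enable PyTorch's expandable CUDA memory segments."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
    return env


command = build_train_command("your-org/your-dataset")  # placeholder dataset name
env = build_train_env({})
# To actually launch training (requires LeRobot and a GPU):
#   import subprocess; subprocess.run(command, env=env, check=True)
```

Keeping the command and the environment as plain data makes it easy to log exactly what was run for each experiment.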

    That’s it. Advanced manipulation policy training – no compute cluster, no cloud, on a desktop PC with an RTX 4000.

    RGB or RGB-D? We trained both variants

    This is one of the core questions of our experiment. We trained two model variants:

     

    • RGB – classic visual data from colour cameras,
    • RGB-D – data enriched with depth maps from Intel RealSense cameras.

     
    [Figure: map of the desk setup]
     

    The hypothesis: depth information should help the model better understand 3D space and grasp objects more precisely. Did it? We cover the results in the next article, but we’ll say this now: the answer isn’t obvious.
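One practical question behind the RGB-D variant is how a depth map becomes an image-like input. The sketch below shows one common approach, stated purely as an assumption and not necessarily the encoding used in this project: clip raw depth readings (taken here to be in millimetres) to a working range and scale them to 0–255. The function name and the `near_mm`/`far_mm` defaults are hypothetical.

```python
from typing import List


def depth_to_uint8(
    depth_mm: List[List[int]], near_mm: int = 200, far_mm: int = 1200
) -> List[List[int]]:
    """Clip depth to [near_mm, far_mm] and scale linearly to 0..255.

    A raw value of 0 is treated as "no reading" and mapped to 0.
    """
    span = far_mm - near_mm
    image: List[List[int]] = []
    for row in depth_mm:
        out_row: List[int] = []
        for d in row:
            if d == 0:  # sensor returned no valid depth for this pixel
                out_row.append(0)
                continue
            clipped = min(max(d, near_mm), far_mm)
            out_row.append(round(255 * (clipped - near_mm) / span))
        image.append(out_row)
    return image
```

With the defaults, a pixel at 600 mm maps to 102 and anything beyond 1200 mm saturates at 255. Note that invalid pixels and pixels at the near limit both map to 0 in this simple version.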

     

    Setup and camera configuration

    Experiments run on a workstation with an Intel Core i7-14700F + NVIDIA RTX 4000 GPU. Data is collected from 4 physical cameras, providing 6 logical data streams in total (one colour stream per RGB camera, colour plus depth per RGB-D camera):

     

    Camera   | Type                   | Position                     | Role
    Camera 1 | RGB (OpenCV)           | Overhead, above workspace    | Global view
    Camera 2 | RGB-D (RealSense D435) | Side-facing                  | Depth + colour
    Camera 3 | RGB-D (RealSense D455) | Starting zone                | Depth profiling
    Camera 4 | RGB (OpenCV)           | Mounted on manipulator wrist | Robot’s-eye view

     

    RGB-D cameras capture not just the image, but also depth maps – giving the model a sense of 3D space rather than just a flat picture.
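The "4 cameras, 6 streams" arithmetic can be written down as a small config sketch. The camera types and positions come from the table. The `Camera` class and the stream names such as `camera_2.depth` are hypothetical and do not reflect LeRobot's configuration schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Camera:
    name: str
    kind: str  # "RGB" or "RGB-D"
    position: str

    def streams(self) -> List[str]:
        # An RGB-D camera contributes a colour stream and a depth stream.
        if self.kind == "RGB-D":
            return [f"{self.name}.color", f"{self.name}.depth"]
        return [f"{self.name}.color"]


CAMERAS = [
    Camera("camera_1", "RGB", "overhead, above workspace"),
    Camera("camera_2", "RGB-D", "side-facing"),
    Camera("camera_3", "RGB-D", "starting zone"),
    Camera("camera_4", "RGB", "manipulator wrist"),
]

ALL_STREAMS = [stream for cam in CAMERAS for stream in cam.streams()]
```

Two RGB cameras give two streams and two RGB-D cameras give four, which adds up to the six logical streams mentioned above.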

     

    Test task: Pick, Lift and Place

    The benchmark we chose is a classic in robotic manipulation: “Pick, Lift and Place” – grab a ball, lift it, and drop it into a designated container.

     

    To make life harder for the model (and verify it’s actually generalising rather than memorising a path), we divided the workspace into 6 starting zones (indices 0–5). Before each test, the ball is placed in a random zone.
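A minimal sketch of this test protocol: draw a random starting zone for each trial, then tally success per zone so that a policy which only works from one position stands out. The function names are hypothetical, and any trial outcomes passed in are supplied by the person running the test, not generated here.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, Tuple

NUM_ZONES = 6  # starting zones are indexed 0..5


def sample_start_zone(rng: random.Random) -> int:
    """Pick the zone where the ball is placed before a trial."""
    return rng.randrange(NUM_ZONES)


def success_rate_per_zone(trials: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Aggregate (zone, succeeded) records into a success rate for each tested zone."""
    attempts: Dict[int, int] = defaultdict(int)
    successes: Dict[int, int] = defaultdict(int)
    for zone, succeeded in trials:
        attempts[zone] += 1
        successes[zone] += int(succeeded)
    return {zone: successes[zone] / attempts[zone] for zone in sorted(attempts)}
```

Reporting results per zone, and not only as one overall number, is what lets the comparison distinguish genuine generalisation from a memorised path.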

     

    It’s a deceptively simple task: variable object position, different grasp angles, precise placement required – all of which forces the model to genuinely understand the task, not just replay a recording.

     

    What’s next?


    This is just the beginning. In the upcoming articles in this series, we’ll cover:

     

    • What the data collection process looked like – and what went wrong (spoiler: quite a lot),
    • Results of the RGB vs RGB-D comparison – does depth actually help?
    • What our interns brought to the project and what they took away from it.

     

    If you’re interested in AI robotics, imitation learning, or just want to see what a setup like this looks like in person – reach out to us. We’d love to talk.

     

    FAQ

    What is LeRobot? LeRobot is an open-source framework by Hugging Face for robot imitation learning. It automates data collection, training, and evaluation of neural network policies.

     

    What is the ACT model in robotics? ACT (Action Chunking with Transformers) is an imitation learning algorithm that predicts entire sequences of actions at once, reducing the compounding errors typical of policies that predict a single step at a time.

     

    What is Imitation Learning? A method of training robots where an agent learns from human demonstrations, without the need to program movements or define reward functions.

     

    What is SO-101? SO-101 is a 6-DoF robotic manipulator designed for easy data collection for imitation learning in a Leader-Follower configuration.
