A Swedish startup was looking for a way of moving Virtual Reality to the next level. We applied convolutional neural networks to design and build a Mixed Reality system that supports advanced hand gesture recognition. The system is modular and flexible. It supports various camera types, such as stereo and depth-RGB, as well as a range of hardware.



Keeping up with technological progress is quite a challenge these days. Not so long ago, having a personal computer was just a dream for most of us, monitors were cathode ray tubes, and mice were purely mechanical, with a ball inside. Today we face a different reality, one where the picture is coming out of the frame.
A Swedish startup had a mission to take Augmented Reality to the next level. We applied convolutional neural networks to design and build a Mixed Reality system that supports advanced hand gesture recognition.


The core idea of the project was to embed augmented reality into the Linux window environment. In the MVP, the team built a system that could successfully track the environment, interpret the user's hand gestures, and support third-party hardware connected to the system.

Just imagine having all your windows floating in three dimensions around your room instead of sitting on your monitor’s surface, and doing with your hands what you previously did with a mouse and keyboard. The system is compatible with all your current applications. You can point your finger at the calendar’s maximize button, and as you quickly move your finger down over the button (the “click” gesture), the calendar fills the whole surface of the wall. Sounds like the future? We wouldn’t be so sure.
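
The “click” described above, a quick downward movement of the fingertip, could in principle be detected from a short history of tracked fingertip positions. The sketch below is only an illustration under stated assumptions, not the code used in the project: `FingertipSample`, `detect_click`, and every threshold value are hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FingertipSample:
    """One tracked fingertip position (hypothetical structure)."""
    t: float  # timestamp in seconds
    x: float  # horizontal position in metres
    y: float  # vertical position in metres, up is positive


def detect_click(
    samples: List[FingertipSample],
    min_drop: float = 0.03,      # assumed: finger must drop at least 3 cm
    max_duration: float = 0.25,  # assumed: ...within a quarter of a second
    max_sideways: float = 0.02,  # assumed: ...while drifting sideways under 2 cm
) -> bool:
    """Return True if any pair of samples forms a quick, mostly vertical drop."""
    for i, start in enumerate(samples):
        for end in samples[i + 1:]:
            if end.t - start.t > max_duration:
                break  # samples are time-ordered, later ones are even further away
            drop = start.y - end.y
            sideways = abs(end.x - start.x)
            if drop >= min_drop and sideways <= max_sideways:
                return True
    return False
```

A slow lowering of the hand, or a drop combined with a large sideways movement, does not match these thresholds, which is one simple way to keep a click distinct from a swipe.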


We were responsible for:

  • To ensure that the system can track the environment in real time, the team used multi-camera mapping and a generalized version of ORB-SLAM.
  • To make third-party hardware compatible with our solution, the team modified the Linux kernel and libinput. The system is compatible with all Wayland-capable Linux applications.
  • We combined stereoscopic and infrared-based depth vision. The stereoscopic camera extracted the object’s visual features, while the infrared-based depth camera provided high-resolution, high-precision spatial data.
  • To get accurate results, we needed custom equipment. The cameras had to capture the whole image at once without losing pixels and, in addition, required external timers to stay fully synchronized. The equipment was ordered from India.
  • The gesture recognition systems available at the time of development could only track the user’s hand from a pre-specified starting point. We trained a CNN to detect the user’s hand at any point on the screen. To interpret user hand gestures such as swipe up, down, left, and right, bloom, and click, the team used a Random Forest.
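
To illustrate how visual features from one camera and spatial data from a depth camera can be combined, here is a minimal Python sketch. It assumes a pinhole camera model and that the depth image is already registered pixel-for-pixel with the feature image. Neither assumption is confirmed by the project description. The intrinsics `fx`, `fy`, `cx`, `cy` and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int]                # (u, v) image coordinates
Point3D = Tuple[float, float, float]   # (x, y, z) in metres, camera frame


def back_project(pixel: Pixel, depth_m: float,
                 fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Turn a pixel plus its depth into a 3D point using the pinhole model."""
    u, v = pixel
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def fuse_features_with_depth(feature_pixels: List[Pixel],
                             depth_map: Dict[Pixel, float],
                             fx: float, fy: float,
                             cx: float, cy: float) -> List[Point3D]:
    """Attach a 3D position to every feature that has a valid depth reading.

    depth_map is a sparse pixel -> depth lookup (an assumption made to keep
    the example short; a real depth image would be a dense array).
    """
    points = []
    for pixel in feature_pixels:
        depth = depth_map.get(pixel)
        if depth is None or depth <= 0.0:
            continue  # no reading, or an invalid one: skip this feature
        points.append(back_project(pixel, depth, fx, fy, cx, cy))
    return points
```

With an assumed focal length of 500 pixels and a principal point at (320, 240), a feature at pixel (420, 240) seen at a depth of 2 m lands about 0.4 m to the right of the optical axis.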


We built and implemented an early-stage Simultaneous Localization and Mapping (SLAM) system that allows the user to enjoy the experience fully. Additionally, we used Artificial Intelligence to detect and interpret the user’s hand gestures, which allows for browsing through different scenarios.

The system design is highly generic, which means it supports a wide range of applications, both in terms of compatible hardware and of the content presented to the end user. In principle, any hardware can be manufactured by a third party according to their needs or the needs of a particular client. This allows SpaceOS to be quickly adapted to current AR setups in architecture, engineering, or marketing, and helps to prototype and deploy innovative AR applications rapidly.

Project tech stack:

We decided to use several technologies:

  • C++
    Used for system navigation and space mapping. It let us maintain maximum precision while working with multiple cameras, including cameras in motion.
  • SLAM
    Tracks the position of the end user in the room based on the image from a wide-angle stereoscopic camera attached to the headset. It simultaneously creates a sparse map of points used for navigation and for reconstructing the geometry of the surrounding environment (e.g. the shape of a window, the position of a wall, or the size of a table).
  • Wayland window manager
    Allows windows to be placed in three dimensions, reacts to gesture notifications, and tells the active app how to respond to them (e.g. when you want to grab a window and move it somewhere). Ensures full compatibility with all windowed applications on Linux.
  • Convolutional Neural Networks
    Trained on datasets of varying quality (10k low-quality and 700 high-quality samples) to detect the user’s hand, and on 500 sequences to interpret user hand gestures: swipe up, down, left, and right, bloom, and click.
  • OpenCV/Image processing
    Covers all the tools we use to process images in real time, as well as collecting and modifying datasets, gesture detection, and general operations on images.
  • PCL
    Used for point cloud and 3D geometry processing: surface reconstruction, 3D registration, model fitting, and segmentation.
  • OpenGL
    Used to interact with a graphics processing unit, to achieve hardware-accelerated rendering.
  • ORB-SLAM
    Ensures that the system can track the environment in real time.
  • Random forest
    Interprets user hand gestures such as swipe up, down, left, and right, bloom, and click.
  • Libinput
    Used to pass input events from the kernel to user space.
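
Placing ordinary 2D windows in three dimensions, as the window manager item above describes, implies that a pointing direction in the room has to be mapped back to a pixel inside a window. One common way to do this is a ray-plane intersection. The Python sketch below shows that idea only and is not taken from the project: `pointer_hit` and all of its parameters are hypothetical, and the window is assumed to be a flat rectangle described by its bottom-left corner and two unit vectors.

```python
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


def sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def pointer_hit(ray_origin: Vec3, ray_dir: Vec3,
                win_origin: Vec3, win_right: Vec3, win_up: Vec3,
                width_m: float, height_m: float,
                width_px: int, height_px: int) -> Optional[Tuple[int, int]]:
    """Return the window pixel hit by a pointing ray, or None if it misses.

    win_origin is the bottom-left corner; win_right and win_up are unit vectors.
    """
    normal = cross(win_right, win_up)
    denom = dot(ray_dir, normal)
    if abs(denom) < 1e-9:
        return None  # ray runs parallel to the window plane
    t = dot(sub(win_origin, ray_origin), normal) / denom
    if t <= 0.0:
        return None  # window plane is behind the pointing finger
    hit = (ray_origin[0] + t * ray_dir[0],
           ray_origin[1] + t * ray_dir[1],
           ray_origin[2] + t * ray_dir[2])
    local = sub(hit, win_origin)
    u = dot(local, win_right) / width_m   # 0..1, left to right
    v = dot(local, win_up) / height_m     # 0..1, bottom to top
    if not (0.0 <= u <= 1.0 and 0.0 <= v <= 1.0):
        return None  # hit the plane, but outside the window rectangle
    px = min(int(u * width_px), width_px - 1)
    py = min(int((1.0 - v) * height_px), height_px - 1)  # pixel rows grow downward
    return (px, py)
```

The resulting pixel coordinate is the kind of value that could then be delivered to an unmodified 2D application as a regular pointer position, which is one plausible reading of how existing applications stay compatible.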


I strongly recommend xBerry for their professionalism and reliability in software development. We have been cooperating for over a year. I believe that our cooperation has been a role model for transparency, high quality, and commitment. The project's goal was to solve a very difficult problem at the border of machine learning, augmented reality, and image processing. Our project required research, innovative technology, and excellent programming skills. I am stunned by how smoothly our project ran, despite its difficulty and time constraints. To sum up, I sincerely recommend xBerry as a perfect partner for software development and technology consulting.


CEO avatar
