Osama Tasneem


Topic
Vision–Language–Action Models for Robotic Manipulation in Industrial Environments
Currently working on:
Actively researching the use of local LLMs and VLMs in industrial use cases.
Industries are increasingly adopting collaborative robots (cobots), which now account for over 10% of industrial robot installations worldwide. However, these robots often lack the ability to understand natural language or perceive complex environments, limiting their adaptability, safety, and effectiveness in dynamic human-centered settings. There is a need for more intelligent, adaptive, and human-centric robotic systems that can interact naturally and safely with human operators.
This doctoral research project aims to equip robots with the ability to perceive their environment through vision, comprehend natural language commands, and execute appropriate actions autonomously by developing a Vision-Language-Action (VLA) model. We propose to build upon multimodal AI architectures such as OpenVLA and RT-2. The model will be fine-tuned and scaled for local deployment, and adapted for industrial use while ensuring robustness, safety, data privacy, and efficiency in human-robot collaboration.
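As a rough sketch of the intended local-deployment workflow, the example below loads an open VLA checkpoint on a local GPU and queries it with a camera image and a natural language instruction to obtain a robot action. It assumes the publicly released openvla/openvla-7b checkpoint and its documented Hugging Face interface; the camera frame path, instruction, and un-normalization key are illustrative placeholders rather than the final system.

# Minimal local-inference sketch (assumption: openvla/openvla-7b and its
# documented Hugging Face interface; file name and instruction are placeholders).
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

MODEL_ID = "openvla/openvla-7b"  # assumed open checkpoint

# Load processor and model for fully local (on-premises) inference on one GPU.
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda:0")

# A frame from the cobot's workspace camera (placeholder file name).
image = Image.open("workspace_camera_frame.png")
instruction = "pick up the torque wrench and place it in the red bin"
prompt = f"In: What action should the robot take to {instruction}?\nOut:"

# The model predicts an end-effector action (position delta, rotation delta,
# gripper), un-normalized with dataset statistics selected via unnorm_key.
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)

In the envisioned system, the predicted end-effector action would be mapped to the cobot's own control interface (for example via ROS), and the generic checkpoint would be replaced by one fine-tuned on data from the target industrial cell.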
Successful implementation of Vision–Language–Action Models for Robotic Manipulation in Industrial Environments would be a significant advance for the field of robotics. Integrating large language models would allow robots to understand human commands more intuitively, enhancing human-robot interaction. Adding vision capabilities to the model would give robots situational awareness, enabling them to make intelligent decisions and improving safety in industrial settings. Local deployment of such models reduces reliance on cloud-based LLM services, increasing data privacy and reducing overall cost. Furthermore, it expands the potential use cases, allowing robots to understand complex instructions and perform tasks with greater flexibility and adaptability. This could lead to higher levels of automation in industries such as manufacturing, mining, logistics, and healthcare. In manufacturing, for instance, robots could be deployed for assembly, quality inspection, safety checks, and material handling with minimal reprogramming effort, thanks to their ability to comprehend and respond to natural language instructions.
Current industrial robots rely on pre-programmed routines and limited sensory input, restricting their ability to handle dynamic, unstructured environments.
While recent multimodal AI systems such as RT-2 and OpenVLA show potential, they are not yet optimized for industrial deployment and depend on cloud infrastructure. This research differentiates itself by customizing and validating these architectures specifically for collaborative robotics, addressing privacy, safety, and usability challenges.

