Superb AI Focuses on Vision AI Agentization to Address Real Enterprise Challenges
“ZERO,” a zero-shot foundation model launched in June by Superb AI (CEO Kim Hyun-soo), is more than a conventional vision-language model (VLM). According to Chief Technology Officer Cha Moon-soo, ZERO represents an evolution toward a “vision AI agent.”
“Many VLMs are still limited to rule-based machine vision, which we define as step one,” Cha said. “Vision AI has progressed through deep learning—step two—where models learn normal and abnormal patterns. We are now entering step three, where AI understands context and predicts future situations as an agent.”
To reach this third stage and meaningfully replace human judgment, vision AI must integrate three core capabilities: perception, reasoning, and action. This enables AI systems to respond effectively even to scenarios that were not explicitly learned during training.
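In engineering terms, this integration is often structured as a perception-reasoning-action loop. The Python sketch below illustrates only the pattern; every name in it (VisionAgent, Observation) is invented for illustration and is not part of any Superb AI product.

```python
# Minimal sketch of a perception -> reasoning -> action loop for a
# vision AI agent. All names here are hypothetical illustrations of
# the pattern, not an actual Superb AI interface.
from dataclasses import dataclass

@dataclass
class Observation:
    labels: list          # objects detected in the frame
    anomaly_score: float  # 0.0 = normal, 1.0 = clearly abnormal

class VisionAgent:
    def perceive(self, frame) -> Observation:
        # Stand-in for a real detector or VLM; a deployed system
        # would run a model on the frame here.
        return Observation(labels=["bottle", "conveyor"], anomaly_score=0.9)

    def reason(self, obs: Observation) -> str:
        # Contextual judgment: escalate anything the model flags as
        # abnormal, even if that exact scenario was never trained on.
        return "halt_line" if obs.anomaly_score > 0.8 else "continue"

    def act(self, decision: str) -> None:
        print(f"executing: {decision}")

agent = VisionAgent()
agent.act(agent.reason(agent.perceive(frame=None)))  # executing: halt_line
```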
ZERO was designed with this goal in mind—to handle diverse situations using a single model. Superb AI views this as a transition from traditional vision AI toward physical AI, with the next step being the development of a vision-language-action (VLA) model that can be applied to robotics.
Robotics, Cha explained, inherently requires agent-level intelligence, as robots must operate in environments and handle objects that cannot be fully predefined. The challenge is even greater for humanoid robots, which rely on multiple sensors and therefore demand high-level reasoning to interpret the increased volume of input data.
Superb AI addresses this by training models on videos that capture real-world situations. By analyzing step-by-step actions within these videos, AI can learn how tasks are performed and evaluate whether a robot’s behavior aligns with human actions.
The company also emphasized the importance of real behavioral data. Unlike purely simulated data, real-world data captures subtle physical factors—such as friction, vibration, and sensor noise—that are critical for achieving physical robustness in robots.
To this end, Superb AI is conducting research that enables robots to learn human behavior from both first-person and third-person perspectives. “We are building datasets that reflect movements from the robot’s own point of view, and studying ways to estimate depth and distance using RGB data alone,” Cha said.
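Superb AI has not published the details of this research, but the underlying idea of recovering depth from a single RGB camera can be sketched with the open-source MiDaS model, used here purely as a stand-in for the technique:

```python
# Estimating per-pixel depth from one RGB image, with no stereo rig or
# LiDAR. Uses the open-source MiDaS model as an illustration of the
# general approach, not Superb AI's own implementation.
import cv2
import torch

midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

img = cv2.cvtColor(cv2.imread("frame.png"), cv2.COLOR_BGR2RGB)
with torch.no_grad():
    pred = midas(transform(img))          # relative (inverse) depth map
    depth = torch.nn.functional.interpolate(
        pred.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze()                           # resampled to the input size
print(depth.shape)                        # one depth value per pixel
```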
The company adopts a hybrid approach: securing high-quality real-world data in controlled environments and integrating it with simulated data. “Just as large language models made speech recognition and keyword extraction more accessible, VLA models will simplify the recognition and understanding of behavior,” Cha explained. “We believe this can be achieved through a single AI model rather than rule-based systems.”
While acknowledging limitations—such as the inevitable gap between simulation and real-world environments—Cha stressed the importance of deploying imperfect models early. “Applying models to real environments allows them to learn from real data and continuously improve. This is why zero-shot capability is essential.”
As an example of zero-shot reasoning, Cha cited Google’s robot foundation model Gemini Robotics-ER 1.5, which demonstrates strong visual grounding. Although the model is built on a text-centric autoregressive architecture and does not execute actions in real time, it performs high-level planning and reasoning before producing outputs, which reduces the need to explicitly train on every possible scenario.
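The grounding behavior Cha describes, locating objects named in free text without task-specific training, can be approximated with open models. The sketch below uses OWL-ViT from Hugging Face Transformers as a stand-in; it illustrates zero-shot grounding in general, not how Gemini Robotics-ER 1.5 itself is accessed.

```python
# Zero-shot visual grounding: find objects described in natural
# language in an image the model was never fine-tuned on.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("scene.jpg").convert("RGB")
queries = [["a bottle with a lid", "a bottle cap"]]  # free text, not fixed labels
inputs = processor(text=queries, images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Convert raw model outputs into pixel-space boxes with scores.
sizes = torch.tensor([image.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(
    outputs, threshold=0.2, target_sizes=sizes
)[0]
for box, score in zip(detections["boxes"], detections["scores"]):
    print([round(v) for v in box.tolist()], round(float(score), 3))
```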
In this paradigm, robots leverage broad knowledge acquired from large-scale data—such as understanding that a bottle has a lid and that opening it involves twisting—while real robot data grounds this abstract knowledge in physical action. As a result, robots can apply learned concepts to new objects without retraining for every variation.
World models are also expected to play a key role in VLA development. “World models serve as a robot’s imagination,” Cha said. “They predict environmental changes and action outcomes, enabling safer and more stable planning.” Generative world models, such as NVIDIA Cosmos, generate large-scale, physically realistic synthetic data that can reinforce robot learning when real-world data is limited.
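A rough sketch of that “imagination” role: roll candidate action sequences forward through a learned dynamics model and keep the plan with the lowest predicted risk. Both `dynamics` and `risk` below are invented placeholders for learned components, not a real Cosmos or Superb AI API.

```python
# World-model-style planning: imagine each plan's consequences before
# moving the real robot, then pick the safest one.
import numpy as np

def dynamics(state: np.ndarray, action: np.ndarray) -> np.ndarray:
    # Placeholder for a learned model predicting the next state.
    return state + 0.1 * action

def risk(state: np.ndarray) -> float:
    # Placeholder for a learned safety estimate over predicted states.
    return float(np.linalg.norm(state - np.array([1.0, 0.0])))

def imagine_and_pick(state, candidate_plans, horizon=5):
    best_plan, best_risk = None, float("inf")
    for plan in candidate_plans:          # each plan: list of actions
        s, worst = state.copy(), 0.0
        for action in plan[:horizon]:     # imagined rollout, robot untouched
            s = dynamics(s, action)
            worst = max(worst, risk(s))
        if worst < best_risk:             # keep the safest imagined plan
            best_plan, best_risk = plan, worst
    return best_plan

plans = [[np.array([1.0, 0.0])] * 5, [np.array([0.0, 1.0])] * 5]
print(imagine_and_pick(np.zeros(2), plans))  # picks the first plan
```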
Despite these long-term ambitions, Cha emphasized that Superb AI’s current priority is supporting practical vision AI adoption by enterprises. “Since the release of ZERO, industry response has changed significantly,” he said. “In some cases, customers send sample videos in the morning, and we deliver inference results the same afternoon.”
ZERO’s zero-shot capability minimizes the need for additional data collection and training, allowing vision AI to be deployed directly in real-world environments. In manufacturing, for example, new workers often must learn hundreds of pages of guidelines, and probation periods average three months. By contrast, configuring a vision AI-powered robotic arm can cut that learning period to just weeks.
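The practical difference is that the task is specified at deployment time rather than trained in. The class below is an invented illustration of that configuration-over-retraining workflow, not Superb AI’s actual ZERO interface.

```python
# Hypothetical sketch: a zero-shot inspector is configured with a
# natural-language task description instead of a labeled dataset.
# ZeroShotInspector is an invented name for illustration only.
class ZeroShotInspector:
    def __init__(self, instruction: str):
        # No data collection or fine-tuning step: the instruction
        # alone defines the inspection task.
        self.instruction = instruction

    def inspect(self, frame) -> dict:
        # A real system would run a zero-shot foundation model here.
        return {"task": self.instruction, "pass": True}

inspector = ZeroShotInspector(
    "Flag any bottle whose cap is missing or visibly tilted."
)
print(inspector.inspect(frame=None))
```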
Given ongoing labor shortages in manufacturing, Cha sees zero-shot models and VLMs as a practical solution. However, he cautioned that vision AI adoption is still in its early stages. “Even basic standard operating procedures often require fine-tuning to work properly,” he noted.
“We are focusing on visual grounding and VLA technologies to overcome these limitations,” Cha concluded. “Our goal is to solve real, operational problems for companies.”