Industry

Alibaba Bets $293 Million on ShengShu: The Chinese Startup Building 'World Models' That Could Unlock the Next AI Revolution

While American AI giants obsess over chatbots, Alibaba is betting big on technology that bridges digital and physical reality

2026-04-11 · By AgentBear Editorial · Source: CNBC / Bloomberg

While American AI giants are still obsessed with making chatbots write better essays, a three-year-old Chinese startup just convinced Alibaba to write a $293 million check for something entirely different — technology that could finally bridge the gap between the digital world and physical reality. This isn't just another AI video generator. It's a fundamental bet on what comes after the large language model era.

Beijing-based ShengShu Technology announced Friday that Alibaba Cloud led a 2 billion yuan ($293 million) Series B funding round, with participation from Baidu Ventures, TAL Education Group, and Luminous Ventures. The eye-popping raise comes barely two months after the startup closed a 600 million yuan round — a blistering fundraising pace that signals just how strategic "world models" have become in the global AI race.

What Are World Models (And Why Should You Care)?

Since ChatGPT exploded onto the scene in late 2022, the AI industry has been laser-focused on large language models — systems trained on text that can write, code, and converse with remarkable fluency. But a growing chorus of researchers and founders argues that this approach is hitting its limits.

Language models, for all their impressive capabilities, don't actually understand the physical world. They can describe a ball bouncing, but they can't predict where it will land. They can write about driving, but they can't navigate a car through traffic. They process words, not physics.

World models aim to change that.

Instead of training primarily on text scraped from the internet, world models learn from video, audio, touch sensors, and real-world physical interactions. They build an internal understanding of how the world actually works — gravity, momentum, object permanence, cause and effect. A world model doesn't just generate a video of a car driving; it understands that cars can't drive through walls, that brakes slow you down, that rain reduces traction.
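The distinction can be sketched in code. This toy example is purely illustrative — the class and function names are invented here, not ShengShu's API — but it shows the interface a world model exposes: given a current state and the passage of time, predict the next state. Here, a ball under gravity.

```python
from dataclasses import dataclass

@dataclass
class BallState:
    height: float    # metres above the ground
    velocity: float  # m/s, positive = upward

def predict_next(state: BallState, dt: float = 0.1) -> BallState:
    """A hand-written 'world model' for one narrow domain:
    a ball under gravity that loses energy on each bounce."""
    g = 9.81
    v = state.velocity - g * dt
    h = state.height + v * dt
    if h < 0:                # hit the ground: bounce, with damping
        h, v = 0.0, -v * 0.7
    return BallState(h, v)

# Roll the model forward step by step. Unlike a language model,
# it predicts *where* the ball will be, not just a description
# of bouncing in words.
s = BallState(height=2.0, velocity=0.0)
for _ in range(50):
    s = predict_next(s)
print(round(s.height, 2), round(s.velocity, 2))
```

A learned world model replaces the hand-written physics with a neural network trained on video and sensor data, but the contract is the same: state in, predicted state out.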

"ShengShu believes that a general world model, built on multimodal data such as vision, audio, and touch, more naturally captures how the physical world works than large language models," the company said in its funding announcement.

ShengShu founder Zhu Jun, a Tsinghua University professor who launched the company in early 2023, put it more succinctly: "We aim to connect perception and action, allowing AI systems to better model and predict real-world behavior consistently."

The Vidu Story: First Mover in China's AI Video Race

ShengShu wasn't trying to build a world model from day one. The company started with a more practical problem: AI video generation.

In April 2024, ShengShu became the first Chinese company to release a video generation model when it launched Vidu — beating domestic rivals like ByteDance and Kuaishou to market, and launching globally months before OpenAI's Sora became widely available.

The timing was perfect. Hollywood was on strike. Content creators were desperate for cheaper production tools. The viral potential of AI-generated video was obvious to anyone who'd seen the early experiments.

But ShengShu's ambitions always ran deeper than just making pretty videos. The team understood that video generation was essentially a compression problem — learning the underlying patterns of how the world moves and behaves, then reproducing them. That same technology, they realized, could power everything from autonomous vehicles to humanoid robots.

In January 2026, ShengShu released Vidu Q3 Pro, its most capable model yet. According to Artificial Analysis, an independent AI benchmarking platform, it ranks among the top 10 systems globally for generating video from text and images — competitive with offerings from OpenAI, Google, and Runway.

The company has also developed Motus, an action-oriented model built on the same video generation architecture, designed specifically for robotics and embodied AI applications.

Alibaba's Multi-Pronged World Model Strategy

The ShengShu investment isn't a one-off bet for Alibaba. It's part of a systematic strategy to dominate the next phase of AI development — the phase that happens after the chatbot wars end.

In the past few months alone, Alibaba has backed a string of world model startups, with ShengShu, Tripo AI, and PixVerse among them.

This is classic Alibaba: flood the zone, place multiple bets, and see what works. The company has the cloud infrastructure, the capital, and the regulatory blessing to build a complete AI stack — from chips to models to applications.

ShengShu fits perfectly into this ecosystem. The startup already has strategic partnerships with companies developing embodied AI — systems like humanoid robots that need to navigate and manipulate the physical world. As those robots roll out across industrial, commercial, and home settings, they'll need world models to make sense of their environments.

And Alibaba will be positioned to provide them.

The Kevin Kelly Connection

One sign that a technology trend is real? When Kevin Kelly starts writing about it.

Kelly, the co-founder of Wired magazine and one of tech's most respected long-term thinkers, published a widely shared essay last month arguing that human-level artificial intelligence requires three components:

  1. Reasoning — the ability to think logically and solve problems (what LLMs provide)
  2. Understanding of the physical world — an intuitive grasp of physics, causality, spatial relationships
  3. Continuous learning — the ability to adapt and improve from experience

According to Kelly, we've solved the first category with language models. The third category remains unsolved. But the second category — world models — is the critical missing piece that could unlock the next leap in AI capabilities.

"World models are critical for robotics because the technology needs more than LLMs to work," Kelly wrote. "While AI for the learning category hasn't been developed yet, LLM-powered chatbots have created the knowledge element, making world models a key area requiring a breakthrough."

When thinkers like Kelly and companies like Alibaba align on a technology direction, it's worth paying attention.

The Global AI Video Arms Race

ShengShu isn't operating in a vacuum. The AI video generation space has become one of the most competitive battlegrounds in technology, with billions of dollars pouring in from every corner of the globe.

OpenAI's Sora, when it finally launched after months of anticipation, generated so much demand that the company had to shut down its short-form video app just months later — unable to handle the infrastructure costs and complexity.

Google's VideoFX and Veo models have improved dramatically, leveraging the company's deep research capabilities and massive compute resources.

But perhaps the fiercest competition is domestic. China's short-video giants understand better than anyone how valuable AI-generated content could be: ByteDance has its Seedance models, and Kuaishou has Kling.

ShengShu's advantage is focus. While ByteDance and Kuaishou are distracted by their massive social media businesses, ShengShu is a pure play in world model development. The $293 million Alibaba injection gives them runway to compete with giants without the baggage of a legacy business to protect.

The Robotics Connection

Here's where world models get really interesting: they're the missing piece for practical robotics.

Current robots — even the impressive humanoid demos from Tesla, Boston Dynamics, and Chinese competitors like Unitree — are essentially playing back pre-programmed motions or reacting to specific sensor inputs. They don't understand their environment the way humans do.

A robot with a world model could look at a messy room and understand that the blanket on the floor is soft, that the glass on the table is fragile, that the drawer needs to be pulled before objects inside can be accessed. It could predict what will happen if it pushes something, or what approach angle would work best for picking up an unfamiliar object.
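One concrete way a robot could use such a model is simple model-predictive control: simulate each candidate action a few steps ahead with the world model, then pick the action whose predicted outcome scores best. The sketch below is a minimal illustration under invented assumptions — the one-dimensional "world" and the cost function are toys, not anything ShengShu has described.

```python
from typing import Callable, List

def plan(state: float,
         actions: List[float],
         model: Callable[[float, float], float],
         cost: Callable[[float], float],
         horizon: int = 5) -> float:
    """Pick the action whose simulated rollout ends closest to the goal."""
    best_action, best_cost = actions[0], float("inf")
    for a in actions:
        s = state
        for _ in range(horizon):   # roll the world model forward
            s = model(s, a)
        c = cost(s)
        if c < best_cost:
            best_action, best_cost = a, c
    return best_action

# Toy 1-D world: each step, the state drifts by the chosen action.
# The goal is to end up at position 10.
model = lambda s, a: s + a
cost = lambda s: abs(s - 10.0)

print(plan(0.0, actions=[-1.0, 0.0, 1.0, 2.0], model=model, cost=cost))
```

The key idea is that the robot never touches the real world while deciding: all the "what happens if I push this?" questions are answered by the model, which is exactly why the model's physical accuracy matters.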

This isn't science fiction. ShengShu already has partnerships with embodied AI companies. The funding will accelerate these efforts, bridging the gap between video generation (simulation) and robotic control (embodiment).

China has made robotics a national priority. The government has set ambitious targets for humanoid robot deployment in factories and homes. World models like those ShengShu is building could be the difference between robots that look impressive in demos and robots that actually work in the real world.

The Open Questions

For all the excitement, world models remain early-stage technology with significant challenges:

Compute costs — Training on video is orders of magnitude more expensive than training on text. A single hour of video contains more information than millions of words. Scaling world models to the size of today's language models will require breakthroughs in efficiency or massive capital injections.

Data quality — Not all video is created equal. YouTube is full of edited content, special effects, and physically impossible animations. World models need training data that accurately represents physical reality — and filtering for that at scale is non-trivial.

Evaluation — How do you measure whether a world model actually "understands" physics? Current benchmarks are limited. The field needs better ways to assess whether these systems have learned genuine causal relationships or are just pattern-matching.

Integration — Even perfect world models need to be integrated with reasoning systems and robotic hardware. The stack is complex, and the interfaces between components are still being figured out.
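The compute point above can be made concrete with a back-of-envelope token count. The numbers here — patch size, resolution, reading speed — are illustrative assumptions, not published figures, but the gap they reveal is robust to the exact choices.

```python
# One hour of video, tokenized naively into 16x16 patches per frame.
fps = 24
frames = 60 * 60 * fps                 # frames in one hour
patches_per_frame = (256 // 16) ** 2   # 256x256 video -> 256 patches/frame
video_tokens = frames * patches_per_frame

# One hour of reading at ~250 words/minute, ~1.3 tokens per word.
text_tokens = int(60 * 250 * 1.3)

print(f"video: {video_tokens:,} tokens, text: {text_tokens:,} tokens")
print(f"ratio: ~{video_tokens // text_tokens}x")
```

Even this crude count puts an hour of video three orders of magnitude above an hour of text, which is why training efficiency is the field's central engineering problem.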

🔥 Our Hot Take

This investment marks an inflection point in the AI industry — the moment when the smart money started moving decisively beyond language models.

For three years, the narrative has been about scale: more parameters, more data, more compute. Bigger models would inevitably lead to better capabilities. But the returns on that strategy are diminishing. GPT-4 was impressive; GPT-4.5 was incremental. The ceiling is becoming visible.

World models represent a different path — not scaling up the same approach, but building entirely new capabilities that current systems lack. It's a bet that the next breakthrough won't come from training on more Reddit posts, but from teaching AI systems how the world actually works.

Alibaba's strategy is clear: don't try to pick the winner. Fund multiple approaches, build the infrastructure, and let the market decide. ShengShu, Tripo AI, PixVerse — they're all exploring different aspects of world models. One of them might crack the code. Or they all might contribute pieces of the puzzle.

The $293 million is significant, but it's the strategic positioning that matters more. Alibaba is declaring that world models are the next battlefield, and they're going to be a major player.

For the rest of the industry, the message is unmistakable: the LLM era is entering its mature phase. The frontier is moving to models that can see, touch, and understand the physical world. Companies that don't adapt will find themselves building yesterday's technology.

ShengShu has a long way to go. But with Alibaba's backing, two funding rounds closed in as many months, and a clear technical vision, they've positioned themselves at the center of what comes next.

The race for world models is on. And China just made a major move.
