TL;DR
- Andrew Dai, former Google DeepMind researcher, launched Elorian — a startup targeting visual reasoning gaps in current AI systems.
- The company plans to build AI models for robotics, architecture, and automotive applications where visual understanding remains weak.
- Bloomberg first reported the launch, signaling serious industry attention to what many consider AI’s most stubborn bottleneck.
- Visual reasoning failures plague even top-tier models, limiting real-world automation despite breakthroughs in text and code generation.
Andrew Dai Bets Visual Reasoning Is AI’s Next Frontier
Andrew Dai stepped out from Google DeepMind’s shadow this week to launch Elorian, a startup aimed squarely at one of artificial intelligence’s most embarrassing blind spots: visual reasoning. Bloomberg broke the news, and the timing makes sense. While large language models can write poetry and debug code, they still stumble when asked to interpret spatial relationships or reason through visual scenes — the kind of stuff a toddler nails instinctively.
Elorian plans to tackle applications in robotics, architecture, and automotive. Those aren't random picks. They're domains where visual understanding isn't a nice-to-have; it's the entire job. A robot that can't judge depth will knock things over. An architectural AI that misreads floor plans will design unusable buildings. An autonomous vehicle that fumbles object relationships will crash.
Dai comes from DeepMind, which gives him credibility. But credibility doesn’t solve hard problems, and visual reasoning has chewed up plenty of well-funded efforts.
Why Current Models Choke on Visual Tasks
Here’s the thing: vision models have gotten shockingly good at labeling objects. They can tell you there’s a cat, a couch, and a coffee mug in a photo. What they can’t reliably do is answer whether the cat could jump from the couch to reach the mug. That’s reasoning — connecting spatial relationships, physics, and context into a coherent understanding.
The gap shows up everywhere. Ask GPT-4 with vision to count overlapping objects in a cluttered image, and it often guesses. Show Claude a diagram with arrows and ask it to trace a path, and it can hallucinate steps. These aren't edge cases. They're fundamental limitations that block AI from handling real-world visual complexity.
And it’s not like the industry hasn’t noticed. Every major lab has dumped resources into multimodal models. Progress has been real but incremental. The breakthroughs we’ve seen in text generation — where models went from clumsy to genuinely useful in a few years — haven’t materialized for visual reasoning. Not yet.
I think the problem is architectural. Language models benefit from massive datasets where the structure is inherently sequential and symbolic. Vision is messier — continuous, high-dimensional, and packed with implicit relationships that don’t map neatly to tokens. Throwing more compute at the problem helps, but it doesn’t fundamentally change the game.
Elorian’s bet is presumably that a different approach — maybe tighter integration between perception and reasoning, maybe novel training regimes, maybe architectures borrowed from neuroscience — can crack this. It’s a good bet to make. But it’s also the bet everyone else is making.
Robotics and Automotive Applications Demand Better Vision
Why focus on robotics, architecture, and automotive? Because those industries are bleeding money trying to work around AI’s visual limitations. Robotics companies still rely heavily on hard-coded rules and structured environments. Autonomous vehicle teams have burned billions training models that work great in Phoenix but faceplant in Boston because they can’t generalize visual reasoning across contexts.
Architecture might seem like the odd one out, but it’s not. Architects work in a visual language — floor plans, elevations, 3D renderings — that requires understanding spatial constraints, structural logic, and design intent simultaneously. An AI that could reason through those layers would be transformative. Right now, architectural AI tools are mostly glorified autocomplete.
The common thread is that all three domains need models that don't just see but understand what they're seeing well enough to act on it. That's a higher bar than classification or segmentation. It's the difference between recognizing a staircase and knowing whether a robot can climb it.
If Elorian can build models that genuinely reason through visual information — not just pattern-match but actually infer relationships and constraints — the applications extend far beyond these three verticals. Manufacturing, healthcare imaging, quality control, urban planning. The list goes on. Visual reasoning is one of those rare problems where solving it well unlocks dozens of adjacent markets.
DeepMind Alumni Keep Spinning Out Ambitious Startups
Dai’s departure fits a pattern. DeepMind has become a talent exporter, with researchers leaving to chase moonshots the parent company won’t or can’t prioritize. Some of that is natural — Google’s incentive structure doesn’t always align with the risk appetite needed for breakthrough research. Some of it is frustration with bureaucracy and the slow grind of corporate AI development.
The alumni network has launched companies attacking protein folding, drug discovery, robotics, and now visual reasoning. Not all of them will succeed. But the fact that so many senior researchers are willing to leave one of the world’s premier AI labs suggests they see opportunities that aren’t getting funded internally.
DeepMind's strength has always been foundational research, the kind of work that wins awards but doesn't always ship products. Startups like Elorian can move faster, take bigger risks, and focus on narrow problems without needing to justify how that work fits into a trillion-dollar company's strategy. That's an advantage, though it comes with the obvious downside of far less compute and far less margin for error.
The question is whether Elorian can attract the talent and capital needed to compete. Visual reasoning isn’t a problem you solve with a small team and a clever idea. It requires serious engineering, serious compute, and serious patience. Dai’s DeepMind pedigree helps, but the graveyard of AI startups founded by credentialed researchers is crowded.
What Elorian Needs to Prove — and Fast
The startup will need to show progress on benchmarks that actually matter. Not ImageNet accuracy or COCO scores, which measure recognition rather than reasoning. Elorian needs to demonstrate reasoning: solving visual puzzles that require multi-step inference, handling occlusion and ambiguity, and extending to new domains without catastrophic forgetting.
It’ll also need to pick its battles carefully. Trying to build a general-purpose visual reasoning model from scratch is a decade-long project with uncertain returns. Focusing on one vertical — say, robotics — and nailing the specific reasoning tasks that domain requires is more realistic. Once you have a beachhead, you can expand.
The competitive landscape is brutal. OpenAI, Anthropic, Google, and Meta are all working on multimodal models with massive resource advantages. Elorian’s edge has to be focus and speed — solving a specific visual reasoning problem better and faster than the giants can. That’s doable, but it requires flawless execution and a bit of luck.
Funding will be the other test. If Bloomberg is covering the launch, Elorian probably has backing already — but how much, and from whom, matters enormously. Visual AI is expensive to build. If the company runs out of runway before it ships something compelling, the technology dies on the vine no matter how good the ideas were.
Source: TechStartups
