Bilal Asmatullah is Co-Founder of Sciloop, building agentic AI systems to accelerate long-horizon scientific research.
When my cofounder and I were accepted into a competitive startup accelerator program in fall 2025, we applied with an ambitious idea: to build an “AI scientist” for machine learning research. What began as a narrowly scoped machine learning project quickly forced us to confront a much larger question: What would it actually take for AI to meaningfully accelerate science itself?
When ChatGPT was released toward the end of 2022, it felt like a decent tool for drafting essay ideas, but it struggled with complex math problems. Two years later, while I was taking an introductory probability and statistics course, I remember my professor asking the class to raise their hands if they found AI helpful for problem sets. In a class of more than 100 students, I saw barely three or four hands go up. I had been using it and was surprised by how few students found it useful.
Yet, in my mind, the holy grail of intelligence for these systems was the ability to solve International Science Olympiad problems. Having competed at the International Physics Olympiad myself, I had long believed these competitions represented a meaningful upper bound for machine intelligence. Winning a gold medal at the International Physics Olympiad felt like the North Star for creating artificial general intelligence (AGI), an AI smarter than humans, given that only around 40 high school students in the world achieve this each year after years of rigorous training.
Now that some AI models have performed on par with top students at the International Math Olympiad, one might expect AGI to have arrived. Yet the evidence that would confirm it, AI-driven breakthroughs in domains like bioscience, materials science and energy research, is simply not there.
Why Producing Scientific Breakthroughs With AI Is Difficult
One would imagine that an AI capable of solving some of the hardest Olympiad problems would naturally produce novel scientific breakthroughs, offering researchers genuine “aha” moments and directions they might never have considered. Yet, as of today, aside from a handful of carefully orchestrated examples, this has not been the case. Optimizing for raw performance on Olympiad benchmarks, however, masks a deeper and more fundamental problem.
Most scientific breakthroughs are not obvious, linear or cleanly derivable from existing data. They require high-conviction contrarian bets and decisions that look wrong until they are proven right. The correct path is rarely one inference away from the answer. It emerges through exploration, failure and revision.
Modern AI systems excel at one-shot reasoning: answering a question, fitting a model or analyzing a dataset. They are far less capable when the task spans weeks or months and involves uncertainty at every step. In other words, they tend to struggle with long-horizon scientific reasoning.
In biosciences, this limitation is particularly stark. The field is often described as “data-rich,” yet roughly 70% of biology experiments are not reproducible. The issue is not simply a lack of data, but the nature of how experiments are conducted and recorded. Much of the most valuable scientific information never makes it into published papers or structured databases. It lives in electronic lab notebooks, spreadsheets, PowerPoint decks and, most critically, in the heads of experimentalists.
A biologist running an experiment typically does not document every intuition, environmental observation or subtle tweak that influenced a decision. Just as a programmer does not comment every line of code with their internal reasoning, experimental scientists don’t always record why they abandoned one direction and pursued another.
As a result, AI systems trained only on final results see science as a clean optimization problem, when in reality, it is a messy sequence of judgment calls, failed attempts and partial insights.
How AI Can Continue Advancing
If we want AI systems to reason over long horizons, I believe we must train them not just on outcomes, but on trajectories: the full arc of hypothesis, experiment, failure, adjustment and abandonment.
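To make that concrete, here is a minimal sketch, in Python, of what a trajectory-style training record might look like. The schema and field names are my own illustrative assumptions, not any lab's actual data format.

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentStep:
    hypothesis: str  # what the researcher believed going into this step
    action: str      # the experiment or adjustment actually tried
    outcome: str     # the observed result, including failures
    rationale: str   # why the step was taken, revised or abandoned

@dataclass
class ResearchTrajectory:
    goal: str
    steps: list[ExperimentStep] = field(default_factory=list)

    def to_training_text(self) -> str:
        """Flatten the full arc, failed steps included, into one training example."""
        lines = [f"GOAL: {self.goal}"]
        for i, s in enumerate(self.steps, start=1):
            lines.append(
                f"STEP {i} | hypothesis: {s.hypothesis} | action: {s.action}"
                f" | outcome: {s.outcome} | rationale: {s.rationale}"
            )
        return "\n".join(lines)
```

The exact schema matters far less than the principle: failed attempts and the reasons for abandoning them become first-class training data rather than discarded noise.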
I’m seeing some major AI labs acknowledge these limitations and begin to move beyond training models on clean, Olympiad-style problems toward more rigorous benchmarks that better reflect open-ended scientific research. In my view, these newer benchmarks could capture the ambiguity, iteration and long-horizon reasoning that real scientific work demands.
This ambition is not limited to startups or AI labs. It appears to have become a strategic priority for the U.S. government as well, with the Department of Energy working to unify vast scientific datasets into shared repositories to strengthen America’s position in AI-driven scientific discovery.
What This Means For Businesses
Given the momentum behind improvements in AI’s reasoning capabilities and the narrowing focus on AI for science, I feel optimistic about 2026 being an important marker in the timeline of AI-guided scientific discovery.
For leaders deploying AI, the key implication is that models do not fail because intelligence is missing, but because training incentives are misaligned or a given domain is simply not yet the focus of frontier AI labs. Olympiad-level performance suggests that these systems can master many of the formal skills we demand of them. Leaders who interpret AI's domain-specific limitations today as permanent risk falling behind those who recognize them as transitional.
My recommendation is to thoughtfully position your business for the moment when frontier AI efforts finally align with your area of interest. Document how decisions over long-horizon tasks are made, and consider structuring workflows for human-AI collaboration. This can help ensure that when AI capabilities advance, your organization can adapt fast.