Does Size Matter? The Rise of Smarter AI Models

(Image: an elephant and an ant carrying a leaf)

The prevailing wisdom for LLMs has been that bigger means better. This assumption fuelled the growth of NVIDIA, Taiwan Semiconductor, and other AI hardware giants, as the demand for compute-intensive models skyrocketed.

But this paradigm is starting to crack.

Open-source models like DeepSeek-R1 have shown that smaller, cheaper-to-train models can be competitive with mainstream proprietary alternatives. The idea that only large models can be effective is fading, and we are entering an era where efficiency and adaptability matter more than sheer size.

In this post, I’ll talk about a novel approach called Recurrent Depth and discuss its implications for the future of AI. Could this be the shift that makes AI more portable, more passive, and ultimately smarter?

This post is based on the recent paper: Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach (arXiv link). Instead of simply making models larger, the authors propose making them think deeper—by reusing the same neural layers multiple times at test time.

This technique, called Recurrent Depth, allows a model to scale its reasoning with the complexity of a query. It’s not the same as chain-of-thought reasoning or the text-based reasoning tokens that many LLM users are already familiar with: the extra “thinking” happens in the model’s hidden (latent) state rather than as generated text. Instead of treating every task the same way, compute can be dialled up for complex problems while simpler queries stay lightweight. This aligns with my belief that AI’s future lies in passive, efficient models, small enough to run on personal devices but powerful enough to provide real-time insights.


Key Insights From the Paper

💡 What is Recurrent Depth?

  • Traditional LLMs have a fixed number of layers, so the amount of computation spent per token is the same no matter how complex the task is.
  • This paper introduces a recurrent block that can iterate multiple times at test time, refining the model’s understanding before generating an output.
  • Instead of increasing model size, this allows a smaller model to behave like a much larger one, without extra parameters (see the sketch below).
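
To make the core idea concrete, here is a minimal PyTorch-style sketch. It is my own toy illustration, not the paper’s architecture (the paper, as I understand it, uses a prelude/core/coda structure and re-injects the input embedding at every recurrent step, which I omit here). The point is simply that one shared block is applied repeatedly, and the number of iterations is chosen at inference time.

```python
import torch
import torch.nn as nn

class RecurrentDepthSketch(nn.Module):
    """Toy illustration of recurrent depth: one block, reused N times."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Linear(d_model, d_model)      # stand-in for the real "prelude"
        self.block = nn.TransformerEncoderLayer(      # the shared recurrent block
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.head = nn.Linear(d_model, d_model)       # stand-in for the "coda" / output head

    def forward(self, x: torch.Tensor, num_steps: int = 4) -> torch.Tensor:
        # Encode the input once, then refine the latent state in place.
        h = self.embed(x)
        for _ in range(num_steps):   # more steps = more "thinking", no extra parameters
            h = self.block(h)
        return self.head(h)

model = RecurrentDepthSketch()
tokens = torch.randn(1, 16, 256)            # (batch, sequence, hidden) dummy input

easy_answer = model(tokens, num_steps=2)    # shallow pass for a simple query
hard_answer = model(tokens, num_steps=32)   # same weights, much deeper "thinking"
```

The last two lines are the whole trick: the same parameters can be run shallow or deep, so compute scales with the query rather than with model size.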

💡 Why Is This Different From Other Efficiency Tricks?

  • It doesn’t rely on MoE (Mixture-of-Experts), where only a subset of expert sub-networks is activated for each token.
  • It’s also not just a bigger context window—instead of processing more tokens, it refines its internal latent representation.
  • It lets the model scale compute up or down dynamically, making it a smarter alternative to just increasing model size.

💡 Challenges They Faced

  • Training at a fixed depth didn’t generalize well to different reasoning depths at test time.
  • Applying recurrence to every query was inefficient—not all problems need deeper reasoning.
  • They had to develop strategies to train the model to handle varying levels of iteration gracefully (see the sketch below).
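
One way to handle that last point, and roughly what the paper does as I understand it, is to randomise the number of recurrent iterations during training so the model never overfits to a single depth. Here is a hedged sketch reusing the toy `model` from above; the depth choices, loss, and data are all placeholders (the paper also truncates backpropagation through the recurrence, which this sketch skips).

```python
import random
import torch

# `model` is the RecurrentDepthSketch from the earlier sketch.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = torch.nn.MSELoss()    # placeholder objective for the toy model

for step in range(1000):
    batch = torch.randn(8, 16, 256)    # dummy data
    target = torch.randn(8, 16, 256)

    # Sample a different recurrence depth per batch so the model learns
    # to behave sensibly at many depths, not just one.
    num_steps = random.choice([1, 2, 4, 8, 16, 32])
    pred = model(batch, num_steps=num_steps)

    loss = loss_fn(pred, target)
    optimizer.zero_grad()
    loss.backward()    # the paper truncates backprop through the recurrence;
    optimizer.step()   # this sketch simply backprops through all steps
```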

🔍 Analogy for Understanding

Think of this as drawing a picture. Instead of trying to paint every tiny detail perfectly from the start (bigger models), you begin with a rough sketch and gradually refine it—adding details, shading, and depth with each pass. Recurrent Depth works similarly, iterating over the same layers multiple times to refine understanding, rather than processing everything at once and hoping for perfection.


Open Questions

🔹 Could this be combined with MoE to create a more efficient architecture?
I don’t build models myself, but this feels like an obvious next step. The paper doesn’t really explore the combination, which surprised me, but pairing recurrence with sparse expert routing seems like the logical next move.

🔹 How do we train models to decide their own reasoning depth without human intervention?
This is interesting because, with this method, reasoning depth (the number of thinking iterations) has to be set manually at inference time. But surely we could train a small, low-complexity model, something like an embedding model, to assess input complexity and estimate the required reasoning depth. Almost like sentiment analysis, but for reasoning depth (see the sketch below).
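
Here is a hedged sketch of what I mean: a tiny classifier sits in front of the main model, scores the input, and maps that score to an iteration budget. Everything in it is hypothetical (the architecture, the thresholds, the depth buckets), and it reuses the toy `model` from the earlier sketch.

```python
import torch
import torch.nn as nn

class DepthEstimator(nn.Module):
    """Hypothetical 'complexity classifier' that picks a reasoning depth."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, x: torch.Tensor) -> int:
        # Mean-pool the input representation and squash to a 0..1 "complexity" score.
        score = torch.sigmoid(self.scorer(x.mean(dim=1))).item()
        # Map complexity to an iteration budget (thresholds are made up).
        if score < 0.3:
            return 2
        elif score < 0.7:
            return 8
        return 32

estimator = DepthEstimator()
tokens = torch.randn(1, 16, 256)
depth = estimator(tokens)                  # cheap pre-pass...
answer = model(tokens, num_steps=depth)    # ...decides how long the big model "thinks"
```

The estimator itself would of course need training signal (e.g. which depth was actually sufficient for past queries), but the pattern is the same: a cheap model spends a few microseconds so an expensive model doesn’t waste iterations.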

🔹 Will companies adopt this approach?
They have to, surely. Brute-force scaling is nonsensical at this point. There isn’t enough memory in the world to keep scaling models indefinitely. Some companies might attempt to push even larger models, but I don’t think that’s the future. Efficiency and smarter architectures will win in the long run.

🔹 How would this approach impact latency for real-time applications?
Latency is a critical factor. Some models, like Gemini 2.0 Flash, focus on speed and deliver near-instant responses, which is fantastic for conversation-style AI. But is there a way to balance this? I think so. Perhaps by combining multiple models: a fast-response model for immediate engagement while a more thoughtful model runs in the background.

Think of it like this: The fast model could say, “That’s a great question, let me think about it for a second,” while a deeper reasoning model crunches away in the background. This could create a much more natural and fluid user experience, especially for AI assistants and real-time interactions.
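
A rough sketch of that pattern is below. Both model calls are stand-ins (not real APIs), and the "deep" one just sleeps to simulate extra latent iterations; the point is only the orchestration: answer fast, then follow up when the deeper pass finishes.

```python
import asyncio

async def fast_model(query: str) -> str:
    # Stand-in for a low-latency model that replies immediately.
    return "That's a great question, let me think about it for a second..."

async def deep_model(query: str) -> str:
    # Stand-in for a recurrent-depth model doing many latent iterations.
    await asyncio.sleep(3)    # simulate the extra "thinking" time
    return "Here's a more considered answer."

async def respond(query: str) -> None:
    # Kick off the slow reasoning in the background...
    deep_task = asyncio.create_task(deep_model(query))
    # ...while the fast model keeps the conversation flowing.
    print(await fast_model(query))
    print(await deep_task)

asyncio.run(respond("Why did my training loss spike at step 40k?"))
```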


Why This Matters

This paper strengthens my belief that the future of AI isn’t just bigger models, but smarter, more adaptive ones. The industry is slowly realising that size isn’t everything—and that efficiency is just as crucial.

With the rise of smaller, more portable AI models, I envision a future where LLMs are not just query-based tools but passive, always-running assistants—integrated into our lives without requiring cloud-based inference.

I genuinely believe we need to be running LLMs locally, not just through centralised AI services. That’s why I’m investing in powerful hardware—not just to run models, but to actively research and develop use cases for local and passive LLMs.

As compute costs rise and sustainability concerns grow, techniques like Recurrent Depth will become essential. I don’t see how big AI companies can ignore this shift, and I expect this method to be incorporated into future mainstream models.


Thanks for reading! There’s a high noise-to-signal ratio on AI topics; I hope I haven’t added to it. Reach out if you’d like to chat about any of these topics: https://www.linkedin.com/in/jpainio/
