The Future of AI Training: Scaling Smarter, Not Bigger
Welcome to the first issue of AI Threads, where we take a closer look at how AI is built, trained, and scaled—and whether the way we’re doing it now is really the best approach.

AI is in the middle of a major infrastructure boom. In the race to build more powerful models, companies are pouring billions into computing power, pushing the limits of what massive data centres can handle. The thinking has been simple: scale is the answer. More GPUs, bigger clusters, faster training. But as these systems grow, so do the problems—higher costs, technical bottlenecks, and diminishing returns.
That’s where things get interesting. In this issue of AI Threads, we’ll be diving into the race for computing power, the limits of centralised AI training, and whether a more distributed approach could offer a better way forward.
The Cost of Winning: AI’s Billion-Dollar Bet
Microsoft’s recent announcement offers a glimpse into just how high the stakes have become. In a blog post earlier this year, the company revealed—almost in passing—that it plans to invest $80 billion in AI-enabled data centres in 2025 alone. If that number feels abstract, consider this: the James Webb Space Telescope cost a “mere” $10 billion. In other words, Microsoft is committing the equivalent of eight James Webbs—in a single year—just to train and deploy AI models.
“Artificial intelligence is the electricity of our age, and the next four years can build a foundation for America’s economic success for the next quarter century.”
This somewhat mind-boggling comparison underscores the extraordinary scale—and cost—of the new race in artificial intelligence. The winners in this race are those who can secure the largest, most powerful compute clusters. Once, the world’s richest moguls competed over yachts, jets, and private islands. Now, bragging rights belong to those who can build the mightiest GPU farm, with Elon Musk and Mark Zuckerberg showcasing plans for data centres housing hundreds of thousands of GPUs. More recently, even Trump has weighed in, teaming up with OpenAI, Oracle, and SoftBank to announce the $500 billion Stargate initiative.
Yet, this arms race for ever-increasing GPU scale faces a fundamental problem—one that goes beyond the moral and practical concerns of its massive energy consumption. Every time you add another chip, you multiply the complexity of keeping them synchronised. A significant portion of training time is lost to synchronisation—the necessary process of exchanging updates after each training step. Could the solution lie in distributing AI training across multiple smaller data centres—or even a diffuse network of ordinary smartphones—rather than relying on a single massive cluster? Let’s take a closer look.
A New Arms Race in AI
OpenAI’s GPT-4 was famously trained on approximately 25,000 GPUs less than two years ago. Now, Elon Musk claims he has 100,000 chips in one data centre and wants to buy 200,000 more. Mark Zuckerberg says he’s aiming for 350,000 GPUs at Meta. And, as we’ve just seen, Microsoft is already plotting to spend $80 billion in a single year to ramp up its AI operations. All this investment highlights how massive the computations behind modern Large Language Models (LLMs) have become.
The rationale for scaling up is clear: more GPUs can, in theory, complete training runs faster by sharing the workload. However, the key word is theory. In practice, parallelising the work requires constant communication between the chips to keep them synchronised with the latest updates. This communication overhead can balloon, becoming a bottleneck that brings diminishing returns.
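To get a feel for the problem, here is a deliberately crude back-of-envelope model. The numbers are made-up assumptions, not measurements; the point is simply that the more of each training step is spent synchronising, the less useful work those expensive GPUs actually do.

```python
# Toy model of GPU utilisation during synchronous training.
# All numbers are illustrative assumptions, not measurements.

def utilisation(compute_s: float, comm_s: float) -> float:
    """Fraction of each training step spent on useful compute."""
    return compute_s / (compute_s + comm_s)

compute_per_step = 1.0  # seconds of actual maths per step (assumed)

for comm_per_step in (0.05, 0.25, 1.0, 3.0):  # seconds spent exchanging updates (assumed)
    u = utilisation(compute_per_step, comm_per_step)
    print(f"comm {comm_per_step:>4.2f}s per step -> {u:5.1%} utilisation")
```

The same arithmetic explains the appeal of the techniques discussed below: either synchronise less often, or send less data each time you do.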
Parallel Processing: A Double-Edged Sword
Modern AI systems learn through backpropagation. You show the model vast amounts of data—perhaps text with certain words blanked out or partially hidden images—and it guesses the missing bits. If it guesses incorrectly, you tweak its many parameters so that next time, it edges closer to the right answer.
This process is repeated millions or even billions of times, which is why training large models is so computationally intensive. The speed boost from adding extra GPUs is significant at first, but once you reach tens of thousands of GPUs, more and more time is lost shuttling information around rather than on the learning itself.
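For readers who like to see the loop itself, here is a minimal PyTorch sketch of that guess-and-tweak cycle. The model, data, and hyperparameters are placeholders; real training runs differ in scale, not in shape.

```python
import torch
import torch.nn as nn

# A placeholder model and random "fill in the blank" style data.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 1000))
optimiser = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(1_000):                      # real runs repeat this billions of times
    inputs = torch.randn(32, 128)              # a batch of (masked) examples
    targets = torch.randint(0, 1000, (32,))    # the "missing bits" the model must guess

    predictions = model(inputs)                # the model makes its guess
    loss = loss_fn(predictions, targets)       # how wrong was it?

    optimiser.zero_grad()
    loss.backward()                            # backpropagation: assign blame to each parameter
    optimiser.step()                           # tweak parameters to edge closer next time
```

In a large multi-GPU run, an exchange of gradients between all the chips conceptually sits between the backward pass and the optimiser step, and that exchange is precisely the communication overhead described above.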
DiLoCo: A Smarter Way to Distribute Training
Arthur Douillard at Google DeepMind asked a crucial question: do you really need to synchronise all GPUs all the time? In a 2023 paper on “Distributed Low-Communication Training of Language Models” (DiLoCo), Douillard and his colleagues introduced the idea of grouping GPUs into “islands”—essentially data centres—where they communicate as usual within each island but synchronise less frequently between islands.
This simple tweak massively reduces the communication overhead. There is a cost in raw performance when measuring the model’s accuracy on the specific task it was trained on, but interestingly, there can be improvements in generalised performance—the ability to handle completely different tasks or questions. It’s a bit like having separate groups of students who study on their own, explore tangents, and then occasionally come together to share what they’ve learned. The end result can be more well-rounded expertise.

A visual representation of the DiLoCo training process, where multiple AI model replicas are trained independently across different compute clusters before periodic synchronisation.
Source: Google DeepMind, from Distributed Low-Communication Training of Language Models
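In code, the heart of DiLoCo is surprisingly small. The sketch below is a simplified, single-process simulation based on the paper’s description, not DeepMind’s actual training code: each island runs H ordinary local steps, and only the resulting parameter deltas (the “pseudo-gradients”) are averaged across islands and applied by an outer optimiser. The inner AdamW and outer Nesterov-momentum SGD echo the paper’s reported setup, but the data and hyperparameters here are placeholders.

```python
import copy
import torch
import torch.nn as nn

def make_model():
    return nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))

K, H = 4, 500                          # islands, local steps between synchronisations
global_model = make_model()
# Outer optimiser: SGD with Nesterov momentum applied to the averaged pseudo-gradients.
outer_opt = torch.optim.SGD(global_model.parameters(), lr=0.7, momentum=0.9, nesterov=True)

for outer_round in range(10):
    deltas = [torch.zeros_like(p) for p in global_model.parameters()]

    for island in range(K):            # in reality these run in parallel, one per data centre,
        local = copy.deepcopy(global_model)   # each on its own shard of the training data
        inner_opt = torch.optim.AdamW(local.parameters(), lr=1e-3)
        for _ in range(H):             # H ordinary steps with no cross-island traffic
            x = torch.randn(32, 64)
            y = torch.randint(0, 10, (32,))
            loss = nn.functional.cross_entropy(local(x), y)
            inner_opt.zero_grad()
            loss.backward()
            inner_opt.step()
        # Pseudo-gradient: how far this island moved the weights, averaged over islands.
        for d, g, l in zip(deltas, global_model.parameters(), local.parameters()):
            d += (g.detach() - l.detach()) / K

    # One outer step, treating the averaged pseudo-gradient as if it were a gradient.
    outer_opt.zero_grad()
    for p, d in zip(global_model.parameters(), deltas):
        p.grad = d
    outer_opt.step()
```

Because the islands only talk once every H steps, the expensive cross-site communication happens hundreds of times less often than in fully synchronous training.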
Prime Intellect’s Approach: OpenDiLoCo
This idea has already been put into practice. Vincent Weisser, founder of Prime Intellect, led the training of an LLM called “Intellect-1” in November 2024. Comparable in size to Meta’s Llama 2, Intellect-1 wasn’t trained on a single giant cluster but rather on 30 smaller ones, scattered across eight cities on three continents.
Their system, OpenDiLoCo, only synchronises across clusters once every 500 steps. It also “quantises” the changes, discarding the least important chunks of information, so that each cluster spends more time computing and less time communicating. Even with clusters so far apart, Weisser’s team kept the GPUs “actively working” 83% of the time. Restricting the data centres to North America increased that figure to 96%.
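Prime Intellect’s exact compression scheme isn’t detailed here, so treat the snippet below purely as an illustration of the general idea: squeeze each update into 8-bit integers (plus a single scale factor) before it crosses the network, and expand it again on arrival. The function names and scaling are assumptions for the sketch, not their implementation.

```python
import torch

def quantise_update(update: torch.Tensor):
    """Compress a (pseudo-)gradient to int8 plus one float scale factor."""
    scale = update.abs().max() / 127.0 + 1e-12            # avoid division by zero
    q = torch.clamp((update / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantise_update(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

update = torch.randn(1_000_000)                           # a pretend pseudo-gradient
q, scale = quantise_update(update)
restored = dequantise_update(q, scale)
print("max rounding error:", (update - restored).abs().max().item())
```

Sent as int8 plus one scale factor, the update needs roughly a quarter of the bandwidth of 32-bit floats, in exchange for a small rounding error.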
The best part? Smaller data centres are far more widely available and affordable. As a result, Weisser’s open-source lab can train a multi-billion-parameter model without having to rent (or build) a colossal, monolithic GPU farm.
The Privacy Perk: Federated and Split Learning
While cost and scale are major motivators for distributed training, there’s another benefit that might be even more significant: privacy. In fields like healthcare, patient records are both invaluable for AI research and extremely sensitive. Conventional training approaches, which combine all raw data in a central location, are often unworkable. That’s where techniques such as Federated Learning (FL) and Split Learning come into play.
In essence, Federated Learning trains a complete model locally on each participant’s machine or server, sending back only updated parameters—not the underlying raw data—to a central server for aggregation. For example, hospitals could collaborate to build a cancer detection system without sharing actual scans or medical histories.
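Here is a bare-bones sketch of that aggregation step, in the spirit of the classic Federated Averaging recipe; the hospitals, weights, and dataset sizes are invented for illustration.

```python
import torch

def federated_average(client_states, client_sizes):
    """Average client model parameters, weighted by how much data each client holds."""
    total = sum(client_sizes)
    return {
        name: sum(state[name] * (n / total) for state, n in zip(client_states, client_sizes))
        for name in client_states[0]
    }

# Each "hospital" trains locally and sends back only its updated parameters;
# tiny fake state dicts stand in for real model weights here.
hospital_a = {"layer.weight": torch.randn(4, 4), "layer.bias": torch.randn(4)}
hospital_b = {"layer.weight": torch.randn(4, 4), "layer.bias": torch.randn(4)}

global_state = federated_average([hospital_a, hospital_b], client_sizes=[1200, 800])
print({name: tensor.shape for name, tensor in global_state.items()})
```

Real deployments typically add secure aggregation, differential privacy, and a lot of engineering on top, but the core exchange is exactly this: parameters in, averaged parameters out, and the raw data never leaves the building.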
In Split Learning, the neural network itself is divided into segments, with some running on a central server and others on local machines, ensuring that sensitive data remains on-premises.
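And a minimal sketch of the split-learning idea: the first few layers run where the data lives, only the intermediate activations cross the network boundary, and the gradient for those activations travels back the same way. The cut point and layer sizes are arbitrary choices for the example.

```python
import torch
import torch.nn as nn

# The network is cut in two: the early layers stay where the data lives...
client_part = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
# ...and the rest runs on a central server that never sees the raw inputs.
server_part = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 2))

client_opt = torch.optim.SGD(client_part.parameters(), lr=0.01)
server_opt = torch.optim.SGD(server_part.parameters(), lr=0.01)

records = torch.randn(16, 32)            # sensitive local data (never transmitted)
labels = torch.randint(0, 2, (16,))

activations = client_part(records)       # only this intermediate tensor is sent onwards
# -- network boundary: in a real system, `activations` is what goes over the wire --
server_input = activations.detach().requires_grad_(True)
logits = server_part(server_input)
loss = nn.functional.cross_entropy(logits, labels)

server_opt.zero_grad()
client_opt.zero_grad()
loss.backward()                          # server computes its gradients...
activations.backward(server_input.grad)  # ...and returns the activation gradient to the client
server_opt.step()
client_opt.step()
```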
Together, these approaches help ensure that large-scale AI collaboration can proceed even when data is private or highly regulated. The concept of Federated Behavioural Planes, for instance, offers a way to analyse how different participants behave in a federated learning scenario, detecting anomalies or malicious contributions along the way.
Human Sensing and FL: Protecting Privacy
As an example of how these techniques can improve privacy, let’s look at the emerging field of Human Sensing, where technology monitors human activities or physiological states. The sheer amount and intimacy of this data raise serious ethical and legal concerns. An interesting 2025 survey explored how Federated Learning might allow researchers to advance this field without storing personal data in a central repository—an approach that could be applicable to a wide range of privacy requirements.

A comparison of centralised learning, where all raw data is sent to a central server for AI model training, versus federated learning, where models are trained locally on devices without sharing raw data.
Source: Research from A Survey on Federated Learning in Human Sensing
Beyond GPUs: Could Your Smartphone Join the Party?
If distributing training across multiple data centres sounds radical, an even bolder vision involves tapping into billions of ordinary consumer devices. A single top-of-the-line Nvidia GPU equates to a few hundred premium smartphones in raw computational power—but if we collectively harness the world’s phones and laptops, we could have far more processing power than any single data centre.
Of course, that dream comes with many challenges: network variability, data storage limitations, and the need to keep track of myriad devices going offline. Yet, the same logic that allows DiLoCo’s island-based approach to generalise well might also apply to a smartphone-based system. Each device contributes slightly different data or perspectives, helping the model become more robust as a result.
What Lies Ahead
For the moment, sticking with a monolithic data-centre approach and high-powered initiatives like Stargate remains the best-known, most effective, and most practical strategy—provided there is access to near-unlimited funding, of course! For this reason, the big players, including OpenAI, Microsoft, Google, and Meta, who have already poured massive resources into single-site GPU clusters, continue to invest billions into data centres and silicon. But the advantages of distributed training, with its lower cost barriers, better resource availability, and strong privacy protections, are becoming clearer by the day.
DiLoCo, Prime Intellect’s OpenDiLoCo implementation, the various Federated and Split Learning solutions emerging, and growing research in this field all demonstrate that AI can be scaled out in ways that are both efficient and more accessible. And with smaller data centres becoming widely available, we might hope that the field shifts towards a future where intensive AI training isn’t reserved only for trillion-dollar corporations.
“OpenDiLoCo achieves 90-95% compute utilisation in real-world decentralised training across two continents and three countries.”
Conclusion: Democratising the Future of AI
While the GPU arms race continues, there are early signs that AI training may be evolving into something more nuanced. As ‘AI optimists’, we are prepared to hope that AI’s future won’t belong solely to those who can afford the biggest single cluster but to those who can coordinate clusters intelligently, wherever they happen to be. We might not see the dismantling of mega data centres in the short term, but the underlying trends—distribution, collaboration, and privacy awareness—show that there are promising ways to rewrite the rules of this incredibly expensive game.

Global data centre distribution as shown on Data Center Map, highlighting the infrastructure supporting AI and cloud computing.
Source: Data Center Map.
It’s hard to argue with $80 billion invested in a single year. But that same figure also raises questions about the sustainability, equity, and ethics of concentrating so much computing power in so few hands. By spreading out AI training across smaller clusters—or even thousands of personal devices—we can keep pushing the boundaries of what’s possible while also reducing costs, preserving data privacy, and ensuring that AI remains an asset for all, not just the privileged few.
At the same time, advances in distributed training hold promise for lowering the astronomical costs of AI research. By harnessing multiple smaller clusters—rather than a single, colossally expensive data centre—organisations can allocate budgets more efficiently, enabling smaller nations (e.g., the UK!) to compete more effectively, while also empowering academic and open-source projects that lack the capital of global tech giants. Moreover, distributing computation in this way could be kinder to the environment, as smaller, geographically dispersed clusters may reduce energy usage and heat density compared to massive facilities.
In short, distributing AI workloads isn’t just about technical innovation—it offers a path towards a more cost-effective, democratic, and sustainable future for AI as a whole.
What do you think? Is the future of AI in ever-larger clusters, or are we on the cusp of a more decentralised era? We’d love to hear your take—reply with your thoughts, share your perspective, or send this to someone who should be thinking about the future of AI.