India’s AI challenge is shifting from building models to sustaining them—making inference efficiency the new driver of scalable deployment

Much of the recent discourse on Artificial Intelligence (AI) has focused on the rising cost of training compute, which for state-of-the-art large language models (LLMs) is growing at roughly 2.4x per year. Inference refers to what happens when a trained AI model is put into real-world use. Every time a chatbot is asked a question, asked to translate a sentence, or prompted to generate an image, the model is performing inference—taking user input and producing an output based on what it has already learned. While training is a one-time, upfront cost, inference happens continuously each time the system is used, making it a recurring, long-term expense. For example, the more than US$ 100 million spent on training GPT-4 translates into approximately US$ 250 million annually in inference costs.
Therefore, India’s AI bottleneck is no longer language model development, but the economics of inference. With improved access to GPUs and the push to design sovereign chips, the success of AI deployment in India increasingly depends on developing context-aware, adaptable, and scalable AI systems. However, design does not automatically enable the manufacturing (fabrication) of semiconductors, which is a highly sophisticated, multi-step process that requires specialised Extreme Ultraviolet (EUV) lithography systems. These systems, primarily made by the Dutch company ASML, use light with a wavelength of just 13.5 nm to pattern the nanometre-scale features of the advanced, AI-enabling chips that any sovereign chip effort would require.

Today, India’s AI trajectory is increasingly shaped not by the ability to build models, but by the ability to deploy and operate them under country-specific constraints. Large, general-purpose systems are expensive to operate, require continuous cloud access, and are optimised for environments with abundant compute and stable infrastructure. Further, the volume of real-time queries will continue to grow, turning inference into a persistent and expanding cost centre for enterprises and governments.

This is compounded by the shift toward multimodal systems, model collaboration, and specialised models, which require running not just LLMs but also vision, speech, and recommendation models in parallel, often under tight latency constraints. In the Indian context, this growth runs up against structural limits, making naive scaling economically and operationally unviable. As a result, inference is no longer just about optimising LLMs but also about orchestrating diverse, task-specific models efficiently across environments. Accordingly, new optimisation techniques and system-level innovations will become increasingly necessary.
Material Limitations, Performance Constraints and Physical Limits

As hardware increasingly becomes a limiting factor in AI deployment, even delaying infrastructure development, the locus of innovation shifts from model development to model execution. Geopolitical constraints on silicon, or “Siliconpolitik”, with China producing the vast majority of metallurgical-grade silicon (over 80 percent of global silicon metal supply as of 2026), pose a stark vulnerability for all countries in the AI race. The dominance of Taiwan’s TSMC in sub-7 nm manufacturing, and its recent introduction of the N2 (2 nm) process for chips aimed at mobile applications, laptops, and artificial intelligence, further incentivises companies to maintain their entrenched dependence on it.

Further, bottlenecks in High Bandwidth Memory (HBM), driven by supply constraints in crucial DRAM components and the physical limits of memory gains, pose new, concrete challenges for frontier LLMs, potentially requiring new hardware. HBM enhances memory capacity in frontier LLMs by stacking memory (DRAM) components in a 3D manner to increase density and reduce the need for multiple, interconnected accelerators (GPUs). As models move towards trillions of parameters, conventional HBM is proving insufficient. This is due to bandwidth pressure: inference is dominated by the movement of model weights from memory (DRAM) to compute units (GPU cores). As models grow larger, they require the continuous streaming of vast amounts of data, but HBM bandwidth, while high, does not scale proportionally with model size. This results in underutilised compute, with processors spending cycles waiting for memory. Accordingly, for modern 7B, 70B, or 1T+ parameter models, the bottleneck is not compute speed, but memory bandwidth—how quickly the GPU can access model weights.
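To make this concrete, a back-of-envelope calculation shows why decoding is bandwidth-bound: each generated token requires streaming roughly the entire set of model weights from HBM to the compute units. The Python sketch below uses assumed, illustrative hardware figures (around 3,300 GB/s of HBM bandwidth, roughly the class of a current high-end accelerator) rather than measured numbers, and ignores batching and KV-cache traffic.

```python
# Back-of-envelope estimate of memory-bandwidth-bound decoding speed.
# Illustrative sketch only: the hardware and model figures below are
# assumptions, and real deployments add batching, KV-cache traffic,
# and compute/communication overlap.

def bandwidth_bound_tokens_per_sec(params_billion: float,
                                   bytes_per_param: float,
                                   hbm_bandwidth_gb_s: float) -> float:
    """Single-stream decoding ceiling: each token requires streaming
    roughly all model weights from HBM once, so throughput is capped
    at (memory bandwidth) / (model size in bytes)."""
    weight_gb = params_billion * bytes_per_param   # model weights in GB
    return hbm_bandwidth_gb_s / weight_gb

# Assumed figures: a 70B-parameter model on an accelerator with
# ~3,300 GB/s of HBM bandwidth.
for label, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    ceiling = bandwidth_bound_tokens_per_sec(70, bytes_per_param, 3300)
    print(f"{label}: ~{ceiling:.0f} tokens/s per device (single stream)")
```

At 16-bit precision, a 70-billion-parameter model occupies about 140 GB of weights, capping single-stream decoding at a few tens of tokens per second regardless of how fast the compute units are; halving the bytes per weight roughly doubles that ceiling, which is why memory, not arithmetic, sets the pace.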

These constraints have significant cost implications, as DRAM prices have surged dramatically in late 2025 and into 2026, with some reports indicating increases of 130 percent. Thus, research is increasingly focused on techniques to increase the memory bandwidth available for processing and decoding tokens (text generation) without losing model accuracy. Accordingly, new inference techniques to overcome these hardware, material, and performance limitations are becoming essential. For example, the 3D-DRAM Simulator (ATLAS) is a framework that models how data moves in and out of memory, how fast it happens, and where bottlenecks occur—enabling engineers to design hardware components more efficiently. Additionally, ActiveFlow is a framework that lets AI models run on devices with limited memory by intelligently moving model weights between fast DRAM and slower flash storage. It predicts which parts of the model will be needed next, reducing DRAM usage by up to 40 percent without compromising performance.
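The core idea behind such DRAM–flash offloading can be illustrated with a short, generic sketch. This is not ActiveFlow’s actual API (the class and function names below are invented for illustration); it simply shows the predict-and-prefetch pattern in which only a small window of layers is kept in fast memory while upcoming layers are staged from slower storage.

```python
# Generic sketch of predict-and-prefetch weight offloading between
# fast DRAM and slow flash. Hypothetical names; not a real framework API.

from collections import OrderedDict

class LayerCache:
    """Keep only a small budget of transformer layers resident in DRAM,
    evicting the least-recently-used layer back to (simulated) flash."""

    def __init__(self, load_from_flash, dram_budget_layers: int = 4):
        self.load_from_flash = load_from_flash   # callable: layer_id -> weights
        self.budget = dram_budget_layers
        self.in_dram = OrderedDict()             # layer_id -> weights

    def get(self, layer_id):
        if layer_id not in self.in_dram:
            self.in_dram[layer_id] = self.load_from_flash(layer_id)  # slow path
        self.in_dram.move_to_end(layer_id)
        while len(self.in_dram) > self.budget:
            self.in_dram.popitem(last=False)     # evict the coldest layer
        return self.in_dram[layer_id]

    def prefetch(self, upcoming_layer_ids):
        # Predictively stage layers the scheduler expects to need next,
        # so slow flash reads can overlap with compute on earlier layers.
        for lid in upcoming_layer_ids:
            self.get(lid)

# Toy usage: pretend flash reads are dictionary lookups.
flash = {i: f"weights-for-layer-{i}" for i in range(32)}
cache = LayerCache(load_from_flash=flash.__getitem__, dram_budget_layers=4)
cache.prefetch([0, 1, 2, 3])   # stage the first layers ahead of time
_ = cache.get(0)               # fast path: already resident in DRAM
```

The DRAM saving comes from the small residency budget; how much performance is preserved depends on how accurately the prefetch predictions hide flash latency behind ongoing computation.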
Additionally, beyond silicon-based chips, researchers are increasingly focusing on new materials such as Gallium Nitride (GaN) and Silicon Carbide (SiC), superatomic semiconductors, and bismuth-based microchips. This is because traditional silicon scaling is hitting physical and efficiency limits, especially for power, heat, and high-frequency performance. For example, Dennard scaling (the expectation that power density would stay constant as transistors shrank) broke down in the mid-2000s, meaning performance improvements no longer come with proportional energy-efficiency gains. The Moore’s Law expectation of performance doubling every 18–24 months has also weakened; however, there is no clear evidence that silicon-based transistors have reached their limits for inference.

Inference Constraints and Trade-offs

Novel inference approaches are proving crucial for improving model performance as models become more complex and physical constraints accumulate. Thus, even if GPU access improves and computation becomes cheaper, these gains will not translate into efficient deployment unless other constraints are addressed. Further, cloud-based inference depends on hyperscalers and ever-expanding data centre infrastructure, and carries environmental consequences. Training a single large AI model can emit hundreds of tonnes of CO₂, comparable to the lifetime emissions of multiple cars. Additionally, inference dominates lifecycle energy use—in some cases, accounting for about 90 percent of total energy consumption once systems are deployed at scale.

In India’s context, simply improving inference efficiency does not necessarily reduce structural dependence. Many of these optimisation approaches remain tightly coupled to specific hardware stacks, cloud infrastructure, and model architectures controlled by a small set of external actors. For example, quantisation and kernel-level optimisations are often designed around NVIDIA’s CUDA and TensorRT stack, making them difficult to port efficiently to alternative hardware such as AMD GPUs or emerging RISC-V accelerators. At the cloud layer, managed inference services such as AWS SageMaker, Azure ML, and Google Vertex AI bundle optimisation with proprietary infrastructure, creating lock-in through APIs, tooling, and pricing models.
Today, inference efficiency is increasingly shaped by hardware–software co-design, where chip architecture, memory systems, and energy availability determine which optimisation techniques are viable. Research shows that low-precision inference can significantly reduce compute and energy consumption without major accuracy loss when aligned with hardware capabilities. However, the trade-off is between efficiency and fidelity.

Low-precision inference (e.g., INT8, FP8) reduces memory usage, bandwidth requirements, and energy consumption, improving speed and reducing cost—especially on hardware designed for it. However, this comes at the risk of numerical error: reducing precision can lead to small losses in model accuracy, instability in edge cases, or degraded performance on complex tasks. Thus, maintaining both efficiency and accuracy simultaneously requires new paradigms—but these paradigms should not depend on foreign hardware or frameworks. Crucially, chips themselves must be designed for local realities, rather than relying on reverse engineering or the bottom-up, application-driven approach that currently dominates.
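A minimal sketch of symmetric INT8 weight quantisation illustrates this efficiency–fidelity trade-off: memory per weight drops fourfold relative to FP32, but rounding introduces a small, measurable error. Production stacks use calibrated, per-channel or per-group schemes; the NumPy example below is deliberately simplified.

```python
# Minimal sketch of symmetric INT8 weight quantisation, illustrating the
# efficiency/fidelity trade-off. Toy example using NumPy only; real
# inference stacks apply calibrated, per-channel or per-group schemes.

import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float weights to int8 using a single symmetric scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(4096,)).astype(np.float32)  # toy weight vector

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print("memory: 4 bytes/weight (FP32) -> 1 byte/weight (INT8)")
print("mean absolute quantisation error:", float(np.abs(w - w_hat).mean()))
```

In practice, per-channel scales, calibration data, and quantisation-aware fine-tuning are used to keep this rounding error from compounding across layers, which is where the instability on edge cases and complex tasks noted above tends to arise.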

The Way Forward

As India develops domestic semiconductor capabilities, these constraints present an opportunity to co-design chips and inference systems tailored to local conditions. The Design-Linked Incentive scheme under the India Semiconductor Mission has enabled firms such as Mindgrove Technologies and InCore Semiconductors to develop indigenous processor and accelerator intellectual property (IP), often using global electronic design automation (EDA) tools. This hybrid approach lets India focus on application-specific, low-power accelerators—including NPUs optimised for quantised inference and RISC-V processors tuned for edge AI—while integrating into global supply chains for fabrication and tooling.
The way forward lies less in prescriptive shifts and more in rebalancing attention. Greater emphasis on inference within public AI initiatives could help surface new optimisation techniques and system designs suited to local constraints. In addition, a sustained focus on efficient model design and edge-compatible deployment environments can expand the range of viable applications. While India faces constraints in scaling frontier cloud-based models, this does not mean abandoning frontier or multimodal AI; it means adapting those approaches to local realities and dynamic constraints. Accordingly, inference innovation will become increasingly important as new hardware, software, and system-level approaches are developed.


This commentary originally appeared in Observer Research Foundation.


Author

Ishita Deshmukh

Ishita Deshmukh is a Research Intern with Observer Research Foundation, Mumbai.