Token costs are climbing. Developers can help fix that

At DevSparks 2026 in Bengaluru, NVIDIA's Jigar Halani made the case that every model choice and agent a developer ships has a direct cost implication, and that understanding token economics is now a core developer skill.

Monday, June 15, 2026

AI runs on tokens, and tokens cost money. Tokens are the fundamental unit of every AI interaction, the way models measure and charge for the text they read and generate. For most enterprises, that cost is increasing faster than the value it generates.

At DevSparks 2026 in Bengaluru, a summit focused on advancing India's developer ecosystem with next-generation technologies, Jigar Halani, Senior Director of Enterprise Solutions Architecture and Engineering at NVIDIA South Asia, argued that the fix starts with developers, not finance teams.

Model selection is a cost decision, not just a technical one

Every choice about which model to use, whether to go cloud or open source, and how much to consume flows directly into the AI bills organizations are trying to justify.

"Token economics has become a fundamental question in every boardroom meeting taking place in the world today," Halani said. "These inputs are coming from developers like you."

The choice of model size ("30B, 100B, half a trillion, trillion model size") determines throughput, latency, and cost. Larger is not automatically better. A simple text-based workflow can run on a much smaller model. The savings come from matching model size to the task rather than defaulting to the most capable model across the board.

Context length compounds this further: the more context a developer passes into a model per request, the more tokens are consumed per interaction. For high-frequency workflows, that adds up quickly.
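To make the context-length point concrete, here is a minimal back-of-the-envelope sketch. The function name, request volumes, and per-million-token prices are all hypothetical assumptions chosen for illustration; none of them come from Halani's talk or from any real price list.

```python
# Illustrative only: every number below is a made-up assumption, not a real price.

def daily_token_cost(context_tokens, output_tokens, requests_per_day,
                     input_price_per_million, output_price_per_million):
    """Return one workflow's daily spend, in the same currency as the prices."""
    input_cost = context_tokens * requests_per_day * input_price_per_million / 1_000_000
    output_cost = output_tokens * requests_per_day * output_price_per_million / 1_000_000
    return input_cost + output_cost

# Same workflow, same answers, 100,000 requests a day.
# Only the amount of context passed per request differs.
lean = daily_token_cost(2_000, 500, 100_000, 1.0, 4.0)
padded = daily_token_cost(20_000, 500, 100_000, 1.0, 4.0)

print(f"lean context:   {lean:,.0f} per day")
print(f"padded context: {padded:,.0f} per day")
```

Under these assumed numbers, a tenfold increase in context per request raises the daily bill from 400 to 2,200 currency units, even though the output is unchanged.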

Halani drew a parallel to how developers once had to optimize for physical server memory in Java environments, writing code to fit within hard constraints because the server had no more to give. Token consumption requires the same discipline: knowing what is running underneath, what it costs, and how to squeeze more out of the same hardware.

On build versus buy, the performance gap between open source and proprietary models has narrowed significantly, making the decision more pressing than it was a year ago.

Halani compared the decision to the shift from ride-hailing to owning a car. When usage is occasional, APIs are cheaper and more convenient. But once an organization has multiple teams running agents, coding assistants, and internal tools simultaneously, the per-token cost of API access stops making sense.

At that point, owning and running your own model infrastructure becomes the more rational choice, the same way a family with daily commutes and school runs stops justifying ride-hailing fares and buys a car.
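The ride-hailing analogy is essentially a break-even calculation. The sketch below is one simplified way to frame it, assuming a flat API price per million tokens, a fixed monthly cost for owned infrastructure, and a smaller marginal cost per million tokens once that infrastructure exists. The function and all figures are hypothetical.

```python
# Illustrative only: prices and costs are made-up assumptions.

def breakeven_tokens_per_month(api_price_per_million, fixed_monthly_cost,
                               self_hosted_price_per_million):
    """Monthly token volume above which self-hosting is cheaper than the API.

    Returns None when self-hosting never becomes cheaper, that is, when its
    marginal cost per token is not below the API price.
    """
    saving_per_million = api_price_per_million - self_hosted_price_per_million
    if saving_per_million <= 0:
        return None
    return fixed_monthly_cost / saving_per_million * 1_000_000

# Assumed: API at 2.0 per million tokens, owned infrastructure at 30,000 per
# month plus 0.5 per million tokens in power and operations.
breakeven = breakeven_tokens_per_month(2.0, 30_000, 0.5)
print(f"break-even: {breakeven:,.0f} tokens per month")
```

With these assumptions the crossover sits at 20 billion tokens a month. Below it, the "ride-hailing" option wins; above it, the "car" does. A real comparison would also account for staffing, utilisation, and hardware depreciation, which this sketch leaves out.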

How fast token demand is actually growing

Until 2024, a typical small-to-midsize non-tech company was consuming tokens in the millions. "Today, when I talk to the same customer, they're talking billions," Halani said. For technology companies, annual consumption is hitting trillions.

The jump from conversational to agentic AI is a large part of why. A conversational workload generates around "150 million tokens per day". An agentic workload built on prompts of similar apparent simplicity produces far more.

Halani used the example of asking an agent to book the cheapest early morning flight from Bengaluru to Delhi. The instruction triggers a cascade: querying multiple travel platforms, applying discount logic, retrieving payment details, completing the booking.

"So much happened at the back, which was completely automated by that one prompt," he said. "That kind of agentic workload runs to tens of billions of tokens per day."
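A rough tally shows why one agentic prompt costs more than one chat turn. The step names below follow the flight-booking example; the number of model calls per step and the tokens per call are hypothetical placeholders, not measurements.

```python
# Illustrative only: call counts and token counts are made-up assumptions.

# (step, model calls, tokens per call)
agent_steps = [
    ("query travel platforms", 4, 3_000),
    ("apply discount logic", 1, 2_000),
    ("retrieve payment details", 1, 1_000),
    ("complete the booking", 1, 1_500),
]

def total_tokens(steps):
    """Sum tokens across every model call the agent makes for one prompt."""
    return sum(calls * tokens_per_call for _, calls, tokens_per_call in steps)

chat_turn_tokens = 1_500  # assumed size of a single conversational exchange
agent_tokens = total_tokens(agent_steps)
multiplier = agent_tokens / chat_turn_tokens

print(f"agentic prompt: {agent_tokens:,} tokens ({multiplier:.0f}x one chat turn)")
```

Under these assumptions a single instruction consumes 16,500 tokens, eleven times the assumed chat turn. The exact multiple will vary widely by agent design.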

Token demand is not constant. For organizations running their own model infrastructure, planning for concurrent users and seasonal consumption spikes is as important as the model choices themselves.
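For teams that do run their own infrastructure, planning for peaks can start with a simple sizing estimate like the one sketched here. It assumes each model replica sustains a known token throughput and that demand scales linearly with concurrent users. Both are simplifications, and all names and figures are hypothetical.

```python
# Illustrative only: user counts and throughput figures are made-up assumptions.

def replicas_needed(peak_concurrent_users, tokens_per_second_per_user,
                    replica_tokens_per_second, headroom_percent=20):
    """Model replicas required to serve peak demand with spare headroom.

    Uses integer arithmetic and ceiling division so a fractional replica
    always rounds up to a whole one.
    """
    required = peak_concurrent_users * tokens_per_second_per_user * (100 + headroom_percent)
    capacity = replica_tokens_per_second * 100
    return -(-required // capacity)  # ceiling division

# Assumed: 20 tokens/s per active user, 2,500 tokens/s per replica.
normal_peak = replicas_needed(500, 20, 2_500)
seasonal_peak = replicas_needed(1_500, 20, 2_500)

print(f"normal peak:   {normal_peak} replicas")
print(f"seasonal peak: {seasonal_peak} replicas")
```

In this example a threefold seasonal spike in concurrent users takes the requirement from 5 replicas to 15, which is capacity that has to be owned, rented, or queued for.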

The hardware optimizations most developers aren't using

The same logic applies to hardware. Choosing the right model means little if the hardware running it is being underused. Throughput per watt, how efficiently a system converts energy into useful compute, is a metric most developers aren't tracking yet. It should be.
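Tracking throughput per watt takes only two measurements: sustained tokens per second and power draw. The sketch below shows the arithmetic with two hypothetical systems. The figures are invented for illustration and do not describe any real hardware.

```python
# Illustrative only: throughput and power figures are made-up assumptions.

def tokens_per_joule(tokens_per_second, power_watts):
    """Throughput per watt: tokens/s divided by watts, i.e. tokens per joule."""
    return tokens_per_second / power_watts

def kwh_per_million_tokens(tokens_per_second, power_watts):
    """Energy needed to generate one million tokens, in kilowatt-hours."""
    joules_per_token = power_watts / tokens_per_second
    return joules_per_token * 1_000_000 / 3_600_000  # 1 kWh = 3.6 million joules

system_a = tokens_per_joule(10_000, 5_000)   # hypothetical system A
system_b = tokens_per_joule(30_000, 10_000)  # hypothetical system B

print(f"system A: {system_a:.1f} tokens per joule")
print(f"system B: {system_b:.1f} tokens per joule")
```

System B draws twice the power in this example but is still the more efficient of the two, because it delivers three times the throughput.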

Halani asked the room how many were running inference on FP32, FP16, FP8, and FP4, with hands thinning at each step and none at FP4.

"You could do FP4 and still get the same accuracy level that you're looking for," he said. The shift delivers "more than 10x performance" from the same hardware with no software changes required. Developers building on open source models outside India are already working at this level routinely.

On the networking side, NVIDIA's DOCA library for DPU-based KV cache access can improve KV cache performance "by at least 2x" through simple configuration changes. The Vera Rubin architecture, built across seven chips working as a single unit, delivers up to 50x better performance per watt over the Hopper generation, at a cost increase that is, as Halani noted, nowhere near proportional.
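One reason the precision formats Halani polled the room on matter so much is memory. Each step down halves the bytes needed per model weight. The sketch below covers weights only and ignores KV cache, activations, and runtime overhead. It says nothing about accuracy or speed, which depend on the model and the hardware. The 30B parameter count is just an example size.

```python
# Weight-only memory footprint by numeric format. Bit widths are standard;
# the model size is an arbitrary example.

BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "FP8": 8, "FP4": 4}

def weight_memory_gb(num_parameters, precision):
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    total_bytes = num_parameters * BITS_PER_WEIGHT[precision] / 8
    return total_bytes / 1e9

params = 30_000_000_000  # a 30B-parameter model
footprints = {p: weight_memory_gb(params, p) for p in BITS_PER_WEIGHT}

for precision, gigabytes in footprints.items():
    print(f"{precision}: {gigabytes:.0f} GB")
```

For a 30B model, the weights shrink from 120 GB at FP32 to 15 GB at FP4. That frees memory for larger batches or longer contexts on the same hardware.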

Turning tokens into business value

For organizations deciding how to deploy tokens commercially, Halani outlined four approaches: selling token access directly through API pricing, building AI-native products, layering AI into existing offerings, or transforming internal operations to reduce cost and improve execution.

Most Indian companies are currently in the third category, consuming API tokens, adding domain-specific or language value on top, and deploying or reselling the result. The opportunity, in his view, is to move up that stack.

The decisions that determine whether an organization's AI investment scales efficiently or bleeds cost are being made at the developer level, whether developers realise it or not.

"It is you who need to make sure that you work with your leaders, your CEOs, your CTOs, make them understand which tool, which model, which tech, what stack, which software, when, how, and how much," Halani said.

Edited by Teja Lele
