E2E Networks

How Indian AI startups are learning to scale from demo to deployment without breaking the bank

At a recent Bengaluru mixer hosted by E2E Networks, NVIDIA, and YourStory, founders and ecosystem builders unpacked the real challenges of taking AI from prototype to production scale.

Thursday, February 12, 2026, 6 min read

The gap between an AI demo and a production system isn't just technical. It's a complete mindset shift, and, in India, it comes with its own set of constraints around cost, infrastructure, and scale.

That was the running theme at a mixer in Bengaluru organized by E2E Networks, NVIDIA, and YourStory, where AI founders, investors, and technology leaders gathered to talk about what actually breaks when you try to serve millions of users instead of impressing a room full of investors.

Shivani Muthanna from YourStory moderated the evening, which featured keynotes and a panel discussion that cut through the usual AI hype to focus on execution.

The cost equation that startups can't ignore

Vishnu Subramanian, Head of Product and Marketing at E2E Networks, started with the kind of math that makes early-stage founders pay attention. With $100, you get around 9 to 10 hours on a hyperscaler. On E2E, you get approximately 330 hours.
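The comparison comes down to simple arithmetic. Below is a minimal sketch that converts the quoted figures into implied hourly rates. The function name is illustrative, and the inputs are the numbers quoted on stage, not published price lists; the talk did not say which instance type they refer to.

```python
def implied_hourly_rate(budget_usd: float, hours: float) -> float:
    """Return the hourly price implied by how long a fixed budget lasts."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return budget_usd / hours


BUDGET_USD = 100.0

# Figures as quoted at the event: 9 to 10 hours versus roughly 330 hours.
hyperscaler_low = implied_hourly_rate(BUDGET_USD, 10)
hyperscaler_high = implied_hourly_rate(BUDGET_USD, 9)
e2e_rate = implied_hourly_rate(BUDGET_USD, 330)

print(f"Hyperscaler: ${hyperscaler_low:.2f} to ${hyperscaler_high:.2f} per hour")
print(f"E2E:         ${e2e_rate:.2f} per hour")
```

The same helper can be pointed at any provider quote to compare like for like, provided the instance types match.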

"We are making the lives of startups a lot easier when you try to go live and try to take it to population scale," Subramanian said, explaining how E2E focuses on optimizing everything, from GPU instance spin-up times to model deployment.

He walked through the stages most AI startups go through. Exploration, where you're just spinning up instances to test models. Training, where you realize GPT-level models are too expensive for your use case, and you need something smaller. Deployment, where you figure out how to serve customers without your costs spiraling. And inference, which is where the real engineering work begins if you want to scale.

NVIDIA's push for efficiency and precision

Megh Makwana, Solution Architect and Engineering Manager for Applied AI at NVIDIA, challenged the room on how they measure GPU performance. Most people, he pointed out, look at GPU utilization or memory usage. Those are the wrong metrics.

"Both of those two metrics are pseudo metrics to quantify whether you are running our application," Makwana said. "The real important metric is flops. If you are consuming 90 plus percent of your GPU power for your workload, then and only then you're actually utilizing the underlying flops."
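One common way to put a number on this is FLOPS utilization: the floating-point throughput a workload actually achieves, divided by the hardware's peak. The sketch below is not the method Makwana described (he pointed to power draw as the signal); it relies on the widely used approximation of about 2 floating-point operations per parameter per token for a dense transformer forward pass. The model size, throughput, and peak figure are hypothetical placeholders, not measurements from the talk.

```python
def achieved_flops(params: float, tokens_per_second: float) -> float:
    """Approximate forward-pass FLOPS: about 2 operations per parameter per token."""
    return 2.0 * params * tokens_per_second


def flops_utilization(params: float, tokens_per_second: float, peak_flops: float) -> float:
    """Fraction of the hardware's peak FLOPS that the workload actually uses."""
    if peak_flops <= 0:
        raise ValueError("peak_flops must be positive")
    return achieved_flops(params, tokens_per_second) / peak_flops


# Hypothetical inputs: a 7-billion-parameter model serving 5,000 tokens per
# second on an accelerator with an assumed peak of 4e14 FLOPS.
utilization = flops_utilization(params=7e9, tokens_per_second=5_000, peak_flops=4e14)
print(f"FLOPS utilization: {utilization:.1%}")
```

A GPU can report near-full "utilization" while this ratio stays low, which is the gap the quote is pointing at.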

He called out another common mistake: deploying models in BF16 or FP16 precision just because that's the default in Hugging Face repositories. Lower-precision formats offer three advantages: a smaller memory footprint, higher FLOPS for matrix multiplication, and more effective use of memory bandwidth. The performance difference is massive. At FP32, he said, performance figures sit in roughly the 80 range. At NVFP4, they reach four digits.
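The memory-footprint advantage is easy to estimate. Here is a rough sketch that counts only weight storage. It ignores the KV cache, activations, and the small per-block scaling overhead that 4-bit formats such as NVFP4 carry, and the 7-billion-parameter model is a hypothetical example.

```python
# Nominal storage per parameter, in bytes, for each numeric format.
BYTES_PER_PARAM = {
    "FP32": 4.0,
    "BF16": 2.0,
    "FP16": 2.0,
    "FP8": 1.0,
    "FP4": 0.5,
}


def weight_memory_gb(params: float, precision: str) -> float:
    """Approximate memory needed for model weights alone, in gigabytes (1e9 bytes)."""
    return params * BYTES_PER_PARAM[precision] / 1e9


PARAMS = 7e9  # hypothetical 7-billion-parameter model

footprints = {name: weight_memory_gb(PARAMS, name) for name in BYTES_PER_PARAM}
for name, gigabytes in footprints.items():
    print(f"{name:>4}: {gigabytes:5.1f} GB")
```

Every halving of precision halves the bytes that have to be stored and moved, which is why the same GPU can hold a larger model or serve more concurrent requests.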

"One of the key things we have at NVIDIA is an open-models, open-data, open-software, open-recipe initiative," Makwana explained. "Rather than just giving you a pre-trained checkpoint, we also want to provide you with the tools and the knowledge and the frameworks to go and do these things on your own."

For voice AI specifically, where latency is everything, he emphasized the need for efficient orchestration and low-level kernel optimization. "For voice to voice, every second matters. You want to make sure the voice-to-voice pipeline can finish a conversation in a sub-millisecond regime."

What production actually looks like

The panel brought together Bharath Shankar, Co-founder and Chief of Products and Engineering at Gnani.ai, Ashwin Raguraman, Co-founder and Partner at Bharat Innovation Fund, along with Makwana and Subramanian.

Shankar's company handles 3.5 crore (35 million) conversations daily, with about 30,000 running concurrently at any given moment. Getting there wasn't about picking the best model. It was about system engineering across the entire stack.
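The daily and concurrent figures are linked by Little's law: average concurrency equals arrival rate multiplied by average duration. The sketch below works backward to the average conversation length the two numbers imply. It assumes traffic is spread evenly over 24 hours, which real call traffic is not, so treat the result as a rough consistency check rather than a reported statistic.

```python
SECONDS_PER_DAY = 24 * 60 * 60
CRORE = 10_000_000


def arrival_rate_per_second(conversations_per_day: float) -> float:
    """Average number of conversations starting each second, assuming even traffic."""
    return conversations_per_day / SECONDS_PER_DAY


def implied_average_duration_s(concurrent: float, conversations_per_day: float) -> float:
    """Little's law rearranged: duration = concurrency / arrival rate."""
    return concurrent / arrival_rate_per_second(conversations_per_day)


daily_conversations = 3.5 * CRORE  # 35 million, as quoted
rate = arrival_rate_per_second(daily_conversations)
duration = implied_average_duration_s(30_000, daily_conversations)

print(f"Arrival rate: about {rate:.0f} conversations per second")
print(f"Implied average duration: about {duration:.0f} seconds")
```

Under the even-traffic assumption the numbers imply conversations lasting a little over a minute on average; peak-hour concurrency would be higher than this average.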

"If all demos were production, then every startup would be profitable today," Shankar said. "Building a demo today is easy. You have frameworks, you have models. But production is a different uphill task."

He walked through what breaks at scale. API clients that can't handle 2,000 requests per second. Databases that weren't designed for that kind of load. Caching systems that become de facto data stores because you're caching everything. "Until you hit that scale, you will not even imagine that the throttling can happen at the client end," he noted.
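One common way to keep a client inside a known request budget is a token bucket, which allows short bursts but caps the sustained rate. The sketch below is a generic illustration, not Gnani.ai's implementation; the class names and the 2,000-per-second budget (borrowed from the figure above) are assumptions. The clock is injectable so the behavior can be checked deterministically.

```python
import time


class TokenBucket:
    """Client-side limiter: permits bursts up to `capacity`, refills at `rate_per_second`."""

    def __init__(self, rate_per_second: float, capacity: float, clock=time.monotonic):
        if rate_per_second <= 0 or capacity <= 0:
            raise ValueError("rate_per_second and capacity must be positive")
        self.rate = rate_per_second
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False when the caller should back off."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill in proportion to elapsed time, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


class FakeClock:
    """Manually advanced clock, used here only to make the demo repeatable."""

    def __init__(self):
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


clock = FakeClock()
bucket = TokenBucket(rate_per_second=2000, capacity=2000, clock=clock)

burst_allowed = sum(bucket.try_acquire() for _ in range(2500))
clock.now += 0.1  # 100 milliseconds pass
refill_allowed = sum(bucket.try_acquire() for _ in range(2500))

print(f"Allowed in initial burst: {burst_allowed}")
print(f"Allowed after 100 ms:     {refill_allowed}")
```

In production the default `time.monotonic` clock would be used, and rejected calls would typically be queued or retried with backoff rather than dropped.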

On cloud provider selection, Shankar was pragmatic. Gnani.ai started with hyperscalers, getting grants from Google Cloud through a cold email. But as the company scaled, the decision came down to five pillars: availability, reliability, scalability, observability, and cost. Hyperscalers are 3x to 4x more expensive than providers like E2E, and for a startup, that matters.

For voice AI specifically, Shankar explained the complexity. "Production-grade voice AI involves multiple layers like speech-to-text, NLP, and text-to-speech. At every layer, there are challenges." On an H100, Gnani.ai can handle more than 64 streams. If you're only getting three or four streams on hardware that expensive, "it is not production grade, according to me."
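Stream density translates directly into fleet size. The back-of-the-envelope sketch below combines the 30,000 concurrent conversations and the streams-per-H100 figures quoted at the event. It assumes streams per GPU is the only constraint and adds no headroom for redundancy or traffic spikes, so real deployments would need more.

```python
import math


def gpus_needed(concurrent_streams: int, streams_per_gpu: int) -> int:
    """Minimum GPU count needed to hold the given number of concurrent streams."""
    if streams_per_gpu <= 0:
        raise ValueError("streams_per_gpu must be positive")
    return math.ceil(concurrent_streams / streams_per_gpu)


CONCURRENT = 30_000  # concurrent conversations, as quoted

optimized_fleet = gpus_needed(CONCURRENT, 64)  # 64 streams per GPU
unoptimized_fleet = gpus_needed(CONCURRENT, 4)  # 4 streams per GPU

print(f"At 64 streams per GPU: {optimized_fleet} GPUs")
print(f"At  4 streams per GPU: {unoptimized_fleet} GPUs")
```

Because Gnani.ai reports more than 64 streams per H100, the first figure is an upper bound on its minimum fleet under these assumptions; the 16x gap between the two cases is the point.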

What investors actually look for

Raguraman brought the investor perspective. At the early stage, his fund isn't looking for massive revenues or profitability. It’s looking for gross margin, which is directly tied to infrastructure spend.

"We've seen startups at 65% margin, we've seen others at 80-85%," he said. "That tells a story by itself, just in terms of how well either the product has been architected or what you're using from an infrastructure perspective."
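The link between infrastructure spend and gross margin can be made concrete. The minimal sketch below treats infrastructure as the only cost of revenue, which is a simplification; the revenue and cost figures are hypothetical and chosen only to show how a 3x difference in infrastructure pricing moves the margin.

```python
def gross_margin(revenue: float, cost_of_revenue: float) -> float:
    """Gross margin as a fraction of revenue."""
    if revenue <= 0:
        raise ValueError("revenue must be positive")
    return (revenue - cost_of_revenue) / revenue


REVENUE = 100.0    # hypothetical monthly revenue, arbitrary units
INFRA_COST = 15.0  # hypothetical infrastructure bill on a cheaper provider

margin_cheaper = gross_margin(REVENUE, INFRA_COST)
margin_3x = gross_margin(REVENUE, 3 * INFRA_COST)  # same workload at 3x the price

print(f"Margin at baseline cost: {margin_cheaper:.0%}")
print(f"Margin at 3x the cost:   {margin_3x:.0%}")
```

With these placeholder numbers, the same product drops from an 85% margin to 55% purely because of what its compute costs.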

Raguraman sees voice AI as the input modality of the future. "It will really democratize access to applications and technology for people, irrespective of their ability to understand technology."

The advice worth remembering

Makwana's technical advice was clear. Track the right metrics, not volatile GPU utilization. Use an optimized inference stack such as vLLM or TensorRT-LLM rather than plain PyTorch in eager mode. And invest in low-precision inference, because that's the next viable way of cutting costs.

He also emphasized something he's seen in China but not enough in India. "I would highly recommend that people invest in understanding how to write efficient kernels. Those folks are actually writing custom kernels for their models, and they're trying to get to that 105-110% improvement. At a very large scale, that 5-10% makes a huge difference."

Subramanian's advice was more strategic. "Build for the internet, not just for India," he said. And think about who your end consumer will be in a few years. "Will it be human beings, or will it be computers? Make sure that the product you're building is easily usable by an AI agent."

Shankar's advice cut to the core of long-term moats. "You should also think about data. How do you go back and keep cleaning the data that you're curating from all your conversations? Because if you don't do that at a later point, that is going to be your moat."

The evening ended with networking, but the message was clear. The companies that will win in AI aren't the ones with the best demos. They're the ones who can solve the boring, hard problems of production at scale.
