How Zoho Labs pivoted to inference engineering

At DevSparks 2026 in Bengaluru, Ramprakash Ramamoorthy, Director of AI Research at Zoho Corp, explained how open-weight models forced a rethink of what an in-house AI lab is actually for, and why inference engineering became Zoho Labs' answer.

Sunday, June 14, 2026, 4 min read

Open-weight models, AI models whose parameters are made publicly available so anyone can download and run them for free, changed the economics of AI development almost overnight. For in-house AI teams that had spent years building their own models, that shift raised a direct question: what are we actually here to do now?

At DevSparks 2026 in Bengaluru, a nationwide movement by YourStory focused on empowering India's developer ecosystem with next-generation technologies, Ramprakash Ramamoorthy, Director of AI Research at Zoho Corp, traced how Zoho Labs navigated that question, and why inference engineering became its answer.

Getting started and pivoting

Zoho Labs was set up to solve engineering problems that kept repeating across Zoho's portfolio of over 100 products. The problem was simple: without a central unit, different teams kept arriving at the same dead ends independently, unaware that someone else had already been there. The lab's job was to catch those problems early, solve them once, and share the fix across teams.

The lab's AI work started in 2011 and expanded steadily into machine learning, computer vision, document processing, and language tools. By 2023, open-weight models had overtaken much of what the team had spent years building.

“The translation thing we built out, 15 language pairs from 2018 to 2023. Five years. And the models that came out in 2023 supported 90 language pairs and they were free and open source,” Ramamoorthy said.

The team responded by pursuing three directions at once: Zoho AI Bridge, which let customers connect to third-party providers or use open-weight models hosted on Zoho's own servers; a smaller in-house model for everyday tasks like email and document summaries; and inference engineering, which became the lab's primary focus.

Extracting more from what already exists

Before settling on inference, the team explored alternatives to the transformer architecture, including RWKV, Mamba, and Zamba, each promising better performance at lower cost. But the transformer ecosystem kept improving faster than any alternative could catch up.

The lab shifted to what Ramamoorthy called the 101% project: squeezing maximum efficiency out of transformers already in production. Zoho's AI systems handled around six billion API calls a month on a constrained GPU budget, making this a practical necessity.
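To put that volume in perspective, a quick back-of-the-envelope conversion helps. The sketch below assumes a 30-day month and perfectly even traffic, neither of which was stated in the talk, so the result is only a rough average rather than a measured load figure.

```python
# "Around six billion API calls a month" is the figure quoted in the talk.
CALLS_PER_MONTH = 6_000_000_000

# Assumption for illustration: a 30-day month with evenly spread traffic.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60


def average_calls_per_second(calls_per_month, seconds=SECONDS_PER_MONTH):
    """Convert a monthly call count into an average per-second rate."""
    return calls_per_month / seconds


rate = average_calls_per_second(CALLS_PER_MONTH)
print(f"{rate:,.0f} calls per second on average")
```

Real traffic is bursty, so peak load would sit well above this average, which is part of why per-request efficiency matters on a fixed GPU budget.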

Ramamoorthy walked through the core techniques. Quantization compressed the numbers a model used internally, storing them at lower precision so the model became faster and cheaper to run. The smarter version compressed only the less critical parts while leaving the important ones intact, gaining speed without losing much accuracy. "Find out which weights are relevant. Don't quantize them. That way you don't lose much accuracy but you gain speed," he said.
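The idea of leaving the relevant weights alone can be sketched in a few lines. This is a toy illustration, not Zoho's method: it assumes "relevant" means largest in magnitude, uses simple symmetric 8-bit rounding, and all function names and the `keep_fraction` parameter are hypothetical.

```python
def quantize_int8(values):
    """Symmetric 8-bit quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / 127.0
    return [round(v / scale) for v in values], scale


def dequantize(ints, scale):
    """Recover approximate float values from quantized integers."""
    return [q * scale for q in ints]


def selective_quantize(weights, keep_fraction=0.1):
    """Keep the largest-magnitude weights untouched and quantize the rest.

    Assumption: magnitude stands in for 'relevance'. Production schemes
    typically use other signals to decide which weights to protect.
    """
    n_keep = max(1, int(len(weights) * keep_fraction))
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep_idx = set(by_magnitude[:n_keep])
    rest = [w for i, w in enumerate(weights) if i not in keep_idx]
    ints, scale = quantize_int8(rest) if rest else ([], 1.0)
    restored = iter(dequantize(ints, scale))
    return [w if i in keep_idx else next(restored) for i, w in enumerate(weights)]


def mean_abs_error(original, approx):
    return sum(abs(x - y) for x, y in zip(original, approx)) / len(original)


# Two large outliers dominate the scale if everything is quantized together.
weights = [0.01, -0.02, 0.03, 5.0, -0.015, 0.025, -4.0, 0.005, 0.02, -0.01]
naive = dequantize(*quantize_int8(weights))
selective = selective_quantize(weights, keep_fraction=0.2)

print("naive error:    ", mean_abs_error(weights, naive))
print("selective error:", mean_abs_error(weights, selective))
```

In this toy case the two outliers force a coarse scale on every other weight when everything is quantized together. Protecting them lets the remaining weights use a much finer scale, so the reconstruction error drops while most values are still stored in 8 bits.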

KV cache management worked like a short-term memory system: keep what the model reached for often, clear out what it rarely used. Continuous batching grouped incoming requests together instead of handling them one at a time.
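Both ideas can be shown with a small simulation. The sketch below is a simplified illustration with hypothetical names: the cache uses a least-recently-used eviction rule as one possible reading of "clear out what it rarely used", and the scheduler counts abstract decoding steps rather than running a real model.

```python
from collections import OrderedDict, deque


class LRUKVCache:
    """Toy cache keyed by prompt prefix; evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, prefix):
        if prefix not in self.entries:
            return None
        self.entries.move_to_end(prefix)  # mark as recently used
        return self.entries[prefix]

    def put(self, prefix, kv_state):
        self.entries[prefix] = kv_state
        self.entries.move_to_end(prefix)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # drop the coldest entry


def continuous_batching(requests, max_batch):
    """Simulate a scheduler that refills free batch slots after every step.

    requests: list of (name, tokens_to_generate) pairs.
    Returns a dict mapping each request name to the step at which it finished.
    """
    waiting = deque(requests)
    running = {}   # name -> tokens still to generate
    finished = {}
    step = 0
    while waiting or running:
        # Admit waiting requests as soon as a slot is free.
        while waiting and len(running) < max_batch:
            name, remaining = waiting.popleft()
            running[name] = remaining
        step += 1  # one decoding step produces one token per running request
        for name in list(running):
            running[name] -= 1
            if running[name] <= 0:
                finished[name] = step
                del running[name]
    return finished


cache = LRUKVCache(capacity=2)
cache.put("system-prompt-a", "kv-a")
cache.put("system-prompt-b", "kv-b")
cache.get("system-prompt-a")           # touch A so that B becomes the coldest
cache.put("system-prompt-c", "kv-c")   # evicts B

finish_steps = continuous_batching([("a", 1), ("b", 4), ("c", 2)], max_batch=2)
print(finish_steps)
```

In the scheduling example, request `c` takes over the slot freed by `a` after the first step and finishes at step 3. A scheduler that waited for the whole first batch to drain would not start `c` until step 5.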

Speculative decoding used a small model to draft a response, with a larger model checking it, delivering the quality of a bigger model without the full cost. "Even my engineers do it, they write the code using Sonnet and then use Opus to debug it," he said.
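The draft-then-verify loop can be sketched with two stand-in "models" that are just Python functions. This is a simplified greedy variant under stated assumptions: the function names are hypothetical, tokens are plain integers, and the large model's single batched forward pass is modelled by counting one verification round per draft.

```python
def greedy_decode(target_next, prompt, max_new_tokens):
    """Baseline: the large model generates every token itself."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Draft up to k tokens with the small model, then verify with the large one.

    Returns (tokens, verification_rounds). In a real system each round is a
    single forward pass of the large model over all drafted positions.
    """
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    rounds = 0
    while len(tokens) < goal:
        # 1. The small model proposes a short continuation.
        draft = []
        for _ in range(min(k, goal - len(tokens))):
            draft.append(draft_next(tokens + draft))
        # 2. The large model checks the proposal position by position.
        rounds += 1
        accepted = []
        for tok in draft:
            expected = target_next(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first wrong token
                break                      # discard the rest of the draft
        tokens.extend(accepted)
    return tokens, rounds


def target_next(tokens):
    """Toy 'large model': always counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    """Toy 'small model': agrees with the target except right after a 4."""
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


baseline = greedy_decode(target_next, [0], max_new_tokens=8)
output, rounds = speculative_decode(target_next, draft_next, [0], max_new_tokens=8, k=4)
print(output, "verified in", rounds, "rounds")
```

The output matches what the large model would have produced alone, but it takes three verification rounds instead of eight sequential steps. The saving depends on how often the draft agrees with the large model, so a poor draft model would erase most of the benefit.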

The case for inference

Ramamoorthy was direct about why this made sense for a bootstrapped company. "The lab's job is to train models, but I think that train has passed, because it's all general-purpose models out there. But then you keep running these models. So there is a deep rabbit hole you can go down at an inference level."

For resource-constrained teams, the session closed with a straightforward point: the opportunity in AI was no longer just about which models a team could build, but about how efficiently they could run the ones that already existed.

Edited by Teja Lele
