BharatGen’s Ganesh Ramakrishnan spotlights a sovereign AI ecosystem built from the first byte

At TechSparks 2025, BharatGen’s Ganesh Ramakrishnan spoke about the need to build AI in India from the ground up, ensuring linguistic inclusion, data sovereignty, and practical deployment across public sectors.

Wednesday, November 12, 2025

The ongoing global expansion of artificial intelligence (AI) raises questions about where knowledge is created, who controls it, and whose priorities are reflected in the technology. These questions are increasingly being addressed in India through a shift toward building sovereign AI capabilities that are developed domestically, governed by Indian institutions, and aligned with local contexts.

Ganesh Ramakrishnan, Principal Investigator at BharatGen, a government-backed AI initiative, spoke on behalf of the consortium centred at IIT Bombay and described this shift as both strategic and cultural.

He explained that the goal is to move India from primarily adopting external technologies to building its own systems end to end.

The initiative involves multiple partners working together in a coordinated framework. The ambition is rooted in the belief that AI designed for India must account for the country’s linguistic diversity, demographic patterns, and institutional needs. Sovereignty, in this context, is not isolation but collective capability building.

“A very unique model in the world, I must say, run as a sovereign AI ecosystem. And the goal here is really to have collaborations across leading educational institutes, engineers, startups, industries, and government,” Ramakrishnan said at TechSparks 2025.

What makes this mission distinct is the emphasis on inclusion by design. The aim is not to localise systems after they are built, but to ensure that cultural preservation, dialectal diversity and accessibility are built into the foundations. This is particularly relevant to India, where large numbers of speakers sit in the linguistic long tail, beyond the widely represented languages.

“The idea here is also to build sovereignty that has digital inclusion, which means what we refer to often as a heavy tail, the long tail of many languages, dialects, people, their needs, cater to them by design, not in retrospect. And the other aspect here is obviously cultural preservation, ensuring that solutions can get localised,” the IIT Bombay professor explained.

This orientation frames AI not only as a technical project but as a national capacity. Ramakrishnan highlighted a broader shift from India as a consumer to India as a producer of intellectual property, recognising that meaningful digital independence requires domestic ownership across the technology stack.

Foundational model development

A central element of this sovereignty effort is the development of foundational AI models from scratch rather than adapting or fine-tuning pre-existing international models. This involves defining tokenisation, architecture, pre-training regimes and downstream fine-tuning strategies within India, using Indian data and research expertise.

The consortium has built language models spanning billions of parameters and continues to scale up. These models include text-based systems, speech recognition and synthesis models, and a vision-language model for document understanding. The creation of a document model trained on Indian data is significant in a country with diverse writing systems, administrative forms, and identity documents.

“Our models are built from scratch, from the first byte. So, we have an understanding of what it means to build large language models from scratch,” Ramakrishnan remarked.

The approach acknowledges that different applications require different model sizes. Rather than relying on a single large system, the team develops a range of models to meet varied trade-offs in latency, accuracy and fine-tuning efficiency. The work also includes exploring mixture-of-experts architectures, which can support multilingual capability by learning shared linguistic structures where relevant and separating others where necessary.

“There is no one-size-fits-all. There is always a trade-off. Larger model would mean low latency, but better coverage. Smaller model would be easy to fine-tune. So, we always present options,” he noted. This flexibility is particularly important when serving speakers of different languages and users working across different economic and infrastructural settings.
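For readers unfamiliar with the mixture-of-experts idea mentioned above, the routing step can be sketched in a few lines. This is a generic, minimal illustration of top-k expert routing, not BharatGen's architecture: the function names, the scalar "experts" and the gate scores are all hypothetical, and in a real model the gate scores would be produced by a learned layer.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(gate_scores, k=2):
    """Pick the k highest-scoring experts and renormalise their weights to sum to 1."""
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, gate_scores, k=2):
    """Combine only the selected experts' outputs, weighted by the router.

    `experts` is a list of callables (hypothetical stand-ins for expert
    sub-networks); unselected experts are never evaluated.
    """
    return sum(w * experts[i](x) for i, w in route_top_k(gate_scores, k))
```

Because only k experts run for each input, capacity can grow without every input paying the full compute cost. In a multilingual setting, some experts could end up specialising by language family while others capture shared structure, though that outcome depends on training and is not guaranteed by the routing itself.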

Data pipelines and localised benchmarks

Foundational models depend heavily on the quality and representativeness of data used to train them. For a multilingual and culturally diverse environment like India, this requires intentional data curation rather than scraping sources indiscriminately.

The consortium has invested in large-scale speech and text data collection aligned to regions and dialects, including more than 13,000 hours of speech data gathered through multiple vendors.

To ensure integrity, a structured pipeline was developed to detect duplication, validate metadata, monitor audio characteristics and ensure linguistic correctness. This mattered not only for model performance but also for public accountability, given that much of the effort is publicly funded.

“How do you ensure that the taxpayer’s money is being used wisely and not for duplicate effort? So, all kinds of features here, ranging from getting timestamps, preliminary check, replacement of failed cases by vendors, audio frequency check, ensuring there is a canonical character set, normalisation of scripts,” the professor elaborated.
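Several of the checks listed in that quote can be illustrated with a short sketch. It is not the consortium's pipeline: the record fields (`id`, `audio`, `sample_rate`, `timestamp`, `transcript`), the 16 kHz expected rate, and the Devanagari-only character set are assumptions chosen for illustration, and the "audio frequency check" is interpreted here simply as a sample-rate check.

```python
import hashlib
import unicodedata

# Assumed canonical character set: the Devanagari Unicode block plus basic punctuation.
DEVANAGARI = ((0x0900, 0x097F),)

def normalise_transcript(text):
    # NFC gives equivalent code point sequences one canonical form;
    # split/join collapses runs of whitespace.
    return " ".join(unicodedata.normalize("NFC", text).split())

def is_canonical(text, allowed_ranges=DEVANAGARI, extra=" .,?!-"):
    # Every character must be in an allowed range or in the punctuation set.
    return all(
        ch in extra or any(lo <= ord(ch) <= hi for lo, hi in allowed_ranges)
        for ch in text
    )

def validate_batch(records, expected_sample_rate=16000):
    """Split vendor records into accepted ones and (id, reasons) rejections."""
    seen = set()  # fingerprints of audio already accepted
    accepted, rejected = [], []
    for rec in records:
        reasons = []
        fingerprint = hashlib.sha256(rec["audio"]).hexdigest()
        if fingerprint in seen:
            reasons.append("duplicate_audio")
        if rec.get("sample_rate") != expected_sample_rate:
            reasons.append("bad_sample_rate")
        if not rec.get("timestamp"):
            reasons.append("missing_timestamp")
        text = normalise_transcript(rec.get("transcript", ""))
        if not text or not is_canonical(text):
            reasons.append("non_canonical_text")
        if reasons:
            rejected.append((rec["id"], reasons))
        else:
            seen.add(fingerprint)
            accepted.append({**rec, "transcript": text})
    return accepted, rejected
```

Rejected records carry explicit reasons, so a vendor could be asked to replace only the failed cases. An exact hash catches byte-identical resubmissions only; detecting re-encoded or near-duplicate audio would need acoustic fingerprinting, which this sketch does not attempt.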

Alongside data collection, the team has created benchmarks specific to Indian settings, recognising that international benchmarks do not always capture Indian linguistic patterns or evaluative needs. This includes benchmarks for text models, speech systems and document understanding tasks.

“We also created our own benchmark called Patram-Bench. We have also been benchmarking an automatic speech recognition system, catering to a large variety of dialects,” he noted.

The creation of local benchmarks reflects a broader point that sovereignty is not only about building models but also about defining the standards against which they are assessed.

Domain-specific applications

The sovereign AI approach is being tested through domain-level implementations designed for practical use. These include agricultural advisory systems accessible through WhatsApp, legal text and case retrieval support tools, finance models, and systems for water and sanitation planning and monitoring.

The consortium also works with defence institutions where network isolation and data localisation are essential.

The agricultural application illustrates how multilingual speech and text systems can help individuals who may not engage with digital services through written interfaces. The system combines speech-to-text and text-to-speech models with domain-specific knowledge to provide guidance tailored to local conditions.

“What you find here is integration of not just the co-pilot for agriculture, very specifically integrated with the Indian Council for Agricultural Research data. But also in that process, personalised text-to-speech, speech-to-text, and the document vision models, all of them integrated in one place. All built in a sovereign manner,” Ramakrishnan explained.

Similarly, legal and administrative applications aim to improve access to public services by easing the interpretation of complex documents or routing of grievances. Work with defence forces underscores the need for air-gapped systems and further motivates domestic model training and deployment pipelines.

“We worked very closely with the Indian Armed Forces… A lot of that is based on fine-tuning our base models,” he noted.

This makes sovereign AI a form of public infrastructure designed to advance economic opportunity, governance effectiveness and national security.

“There’s a mission, there’s an urgency, there’s an opportunity and cultural sensitivity at its core,” Ramakrishnan remarked.

Edited by Megha Reddy
