English can no longer be the ‘default’: building Indic language content online


With the number of local language internet users growing at 47 percent year on year, localisation will be a key part of India’s digital revolution. After all, how does it make sense to exclude most Indians online in this quest?

The size of India’s internet user base has grown rapidly in the past few years, driven primarily by mobile users. In 2016, 409 million Indians were connected to the internet, according to a Google-KPMG report. This number is expected to touch a staggering 735 million by 2021. That’s right. In half a decade, a user base the size of the US population will join the Indian internet.

The sheer size of India’s internet user base points to a digital revolution in the making. Hundreds of millions of Indians connected over the internet, shopping, messaging, banking online… You get the picture.

This is probably not news to most people. We’ve been hearing about increased smartphone usage, cheaper data, and whatnot. One thing that is often glossed over, however, is that the future of the Indian internet will primarily be in Indian languages.

In India, digital media and the internet were designed by Anglophone Indians, for an Anglophone user base. Over the years this trend continued, since access to the internet was generally restricted to this Anglophone elite.

This, coupled with a general human tendency to see things only from one’s own immediate perspective, led to the general perception that there was no demand for non-English content, and that Indian languages would never truly become digital.

The stats beg to differ, though.

India’s Indic language user majority

According to the Google-KPMG report quoted above, Indian language internet users have already overtaken English language users, with a share of 234 million out of 409 million users in 2016. Out of the projected 735 million users in 2021, 536 million are expected to be Indian language users, i.e. over 70 percent.
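The shares behind these figures are easy to check. The short sketch below only reworks the numbers quoted from the report; the variable and function names are illustrative and not part of any published dataset.

```python
# Figures quoted from the Google-KPMG report, in millions of users.
users_2016 = {"total": 409, "indian_language": 234}
users_2021 = {"total": 735, "indian_language": 536}


def share(counts):
    """Return Indian language users as a percentage of all users."""
    return 100 * counts["indian_language"] / counts["total"]


share_2016 = share(users_2016)
share_2021 = share(users_2021)

# Users expected to come online between 2016 and 2021, in millions.
new_users = users_2021["total"] - users_2016["total"]

print(f"2016 share: {share_2016:.1f}%")
print(f"2021 share: {share_2021:.1f}%")
print(f"New users, 2016 to 2021: {new_users} million")
```

That works out to roughly 57 percent of users in 2016 and 73 percent in 2021, with about 326 million new users in between.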

Given that, it’s easy to see how a focus on localisation will be a key part of India’s digital revolution. After all, how does it make sense to exclude most Indians online in this quest? The objective of the internet is to connect people across the world, not just English-speaking people.

There has always been demand for localisation, but there have been few solutions to address the demand. To explain the need for web content in Indian languages, what could be better than highly compelling stats?

Indic language user growth

The number of local language internet users is growing at 47 percent year on year. At this rate, by 2020, 75 percent of new internet users will be from rural areas, while only 16 percent of this growth will come from urban areas. Indian language users are expected to grow at a CAGR of 18 percent compared to three percent for English language users. Nine out of 10 Indians coming online will prefer their own language over English.
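For readers who want to see where growth rates like these come from, here is a minimal sketch of the standard compound annual growth rate (CAGR) calculation, applied to the 2016 and 2021 user figures quoted earlier in this article. Treating every user who is not an Indian language user as an English language user is an assumption made here for illustration, not something the report states in these words.

```python
def cagr(start, end, years):
    """Compound annual growth rate between two values, as a fraction."""
    return (end / start) ** (1 / years) - 1


# Millions of users in 2016 and 2021, from the figures quoted in the article.
indian_2016, indian_2021 = 234, 536

# Assumption: all remaining users are counted as English language users.
english_2016 = 409 - 234
english_2021 = 735 - 536

indian_rate = cagr(indian_2016, indian_2021, 5)
english_rate = cagr(english_2016, english_2021, 5)

print(f"Indian language users: {indian_rate:.1%} a year")
print(f"English language users: {english_rate:.1%} a year")
```

Under that assumption, the result is close to the 18 percent and three percent quoted above.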

Clearly, building the Indian internet will have huge implications for companies trying to reach more and more Indians. Since most Indians will access the internet on their phones, the first barriers to usage, accessibility issues, need to be decisively dealt with.

This is why the government has mandated Indic language support on mobile devices from February 1, 2018, including display support in all 22 official languages, with typing in at least two Indian languages. This is a wake-up call for other players to get involved before they miss out.

Indic content online

The Indian internet’s Indic language content ecosystem is still nascent. The pitifully low percentage of Indic language content online is mainly the result of two things: how difficult it is to create and discover Indic content, and user behaviour. Most Indic language users are highly engaged, but stick to verticals like messaging, entertainment, news, and social media.

Of course, things will change — there is a high demand for banking and government services in Indic languages — but we need to do our part and level the technological barriers that exist.

Let’s take a look at the three building blocks that need to be in place for an Indic language content ecosystem.

Content creation

Less than 0.1 percent of internet content is in Indian languages, a number that will grow once more Indic language users start creating content. Allowing users to type and view Indic language text is a fundamental part of building the Indian internet. Since content is primarily user-created, mobile keypads are how most users interact with it.

Keypads need to support typing in Indic languages, with multiple input modes catering to different typing preferences. At present, transliteration (typing in Latin characters) and native character input are the two major input modes. Keypads need to be designed to conform to the guidelines spelled out in the government’s language mandate, to ensure text consistency. Indic scripts are fundamentally different from the Latin script, and Indic content needs to follow the implicit rules of Indic scripts.

Content conversion

Content in English doesn’t necessarily need to remain in English. An app or a site built in English can be localised in multiple Indian languages, the way global companies localise their content in markets where English is not an official language: China (Chinese), Japan (Japanese), France (French), Thailand (Thai), Israel (Hebrew), Austria (German), the list goes on.

Content localisation relies on both translation and transliteration. This is best explained by the following example:

The translation of the word ‘Play’ into Hindi could throw up either bajao (in the context of music) or khel (in the context of sports or a game). Similarly, the software has to understand that a brand name like ‘John Players’ cannot be translated as John Khiladi in Hindi, but has to be transliterated, as a proper noun. The one-size-fits-all approach of most translation algorithms does not capture these nuances of context in Indian languages.

It goes without saying that a translation request without contextual information is highly unlikely to give the desired result. This problem can be solved by creating extensive domain-related lexical sets. In the above case, ‘Play’ could belong to either the sports, music or drama domain.

Proper nouns (personal names, surnames, place names, and brand names) plus addresses should only be transliterated.

Both transliteration and translation should ideally come together to provide a seamless content conversion experience.
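The ‘Play’ and ‘John Players’ cases can be sketched in a few lines of code. This is a minimal illustration, not a real localisation engine: the lexicon, the proper noun list, and the function name are all hypothetical, the Hindi words are romanised for readability, and the entry for the drama domain (natak) is an assumption added here rather than taken from the example above.

```python
# Hypothetical domain lexical sets: the same English word maps to a
# different Hindi word (romanised here) depending on the domain.
DOMAIN_LEXICON = {
    "music": {"play": "bajao"},
    "sports": {"play": "khel"},
    "drama": {"play": "natak"},  # assumption: not given in the article
}

# Hypothetical list of proper nouns that must never be translated.
PROPER_NOUNS = {"john players"}


def plan_conversion(text, domain):
    """Decide how a string should be converted.

    Returns (action, result), where action is "transliterate",
    "translate" or "untranslated".
    """
    key = text.strip().lower()
    if key in PROPER_NOUNS:
        # Proper nouns keep their sound, so the original text is handed
        # to a transliteration step (not implemented in this sketch).
        return ("transliterate", text)
    lexicon = DOMAIN_LEXICON.get(domain, {})
    if key in lexicon:
        return ("translate", lexicon[key])
    # No domain entry: flag the string instead of guessing a translation.
    return ("untranslated", text)


print(plan_conversion("Play", "music"))
print(plan_conversion("Play", "sports"))
print(plan_conversion("John Players", "sports"))
```

The point of the design is that the domain travels with the request, and the proper noun check runs before any translation is attempted.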

Content discovery

Once you’ve solved the problems of content creation and conversion, the content still needs to be indexed so that users can discover it online. Input can be in Latin, native, or mixed script typing, like Hinglish. For example, typing ‘grey joote’ should give you valid results for ‘grey shoes’. The software should be able to understand the intent behind the search.
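One simple way to picture this is to normalise every query term to the vocabulary the index uses before matching. The sketch below is a toy under stated assumptions: the synonym table and the three-item catalogue are invented for illustration, and a real search engine would rely on far richer language models than a lookup table.

```python
# Hypothetical lookup from romanised or native-script Hindi words to
# the English terms used in the index.
QUERY_SYNONYMS = {
    "joote": "shoes",
    "जूते": "shoes",
}

# Hypothetical toy catalogue: product name -> set of index terms.
CATALOGUE = {
    "Grey running shoes": {"grey", "running", "shoes"},
    "Black formal shoes": {"black", "formal", "shoes"},
    "Grey cotton shirt": {"grey", "cotton", "shirt"},
}


def normalise(query):
    """Lowercase the query and map known Hindi words to index terms."""
    tokens = query.lower().split()
    return {QUERY_SYNONYMS.get(token, token) for token in tokens}


def search(query):
    """Return the products whose index terms contain every query term."""
    terms = normalise(query)
    return sorted(
        name for name, index_terms in CATALOGUE.items()
        if terms <= index_terms
    )


print(search("grey joote"))
print(search("grey shoes"))
```

With this table, a Hinglish query, a mixed script query, and the plain English query all reach the same product.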

People use search to find what they’re looking for. The content they need is already out there. Finding it actually involves a lot of steps. Whenever you use search, there are a bunch of algorithms at work ensuring that you find what the search engine deems the most relevant result. In order for these algorithms to work, content has to be indexed and made compliant with search engine requirements.

Indic language content should also be indexable and searchable, so that Indic language users can discover content relevant to their queries just as effortlessly as an English language user.

Questioning the default

Getting these three aspects of the Indian language internet’s content system right is the only way of affording millions of local language users the same experience as English language users. Solving this with special attention to mobile platforms requires a full-stack technology approach, as well as government co-operation with companies, and regulatory guidelines outlining what rules Indic language content should follow to ensure interoperability.

Using the internet in an Indian language should ultimately offer a seamless, unbroken experience, not a patchy one that is conspicuously subpar in comparison to an English user’s experience.

Maybe then, English content will cease to be viewed as the “default” — a default that excludes most Indians.

(Disclaimer: The views and opinions expressed in this article are those of the author and do not necessarily reflect the views of YourStory.)