“In 2016, when Alexa entered our life, I was a non-believer in the idea of voice. My thought was, who wants to speak to a speaker? It’s a dumb idea.”
This is how Kumar Rangarajan, the Co-founder of Slang Labs and former CEO of Little Eye Labs that was acquired by Facebook, kicked off his talk at Future of Work 2019 in Bangalore on Saturday.
Eventually, curiosity got the better of him and he bought the device. When his kids fell in love with it, he began to recognise the potential of voice. “They took to it intuitively without any training. They knew which cues to watch for, when to speak and how to interact with it. Over time, they began to expect voice interfaces to be everywhere,” says Kumar, describing this as the main trigger for setting up Slang Labs, a platform that enables voice augmented experiences.
He narrated an instance where his kids wanted to purchase a song and Alexa made the experience quick and natural. “My kids insisted so I said ‘Alexa buy the song’. I had no idea Alexa could do that. When Alexa bought the song immediately, it blew my mind. I had spoken in the most natural way and it got done. This was the shortest thought to action. I realised that it’s not just a fun toy, but it’s a commercial entity that can go transactional.”
Another key observation was how the voice interface of Alexa is not intimidating even to elders who are not very comfortable with technology. “It was cool, natural, had transactional value and appealed to a wide audience.”
Evolution of input interfaces to voice
Talking about how voice interfaces are transforming the world, Kumar mentioned that today more than 30 percent of Google searches are voice-based. This is because voice interfaces break barriers to entry, are not limiting like traditional software with button-and-menu UIs, and are non-intimidating, enabling us to get things done faster.
Having said that, he added that while voice works great as an input mechanism, “it sucks as an output mechanism”. Illustrating this, he gave the example of using Alexa for a simple task like ordering a pizza. “It will start giving you options about the kind of pizza you want, and by the time she is done, you would have forgotten the first option. So, while it is cool, it also has certain limitations.”
Tracing the history of input interfaces over time, Kumar took the audience through how these have transformed our lives. We moved from manually punching holes into a card and feeding it into a computer, which would read the pattern of holes to understand the instruction, to the invention of the keyboard, which lets us enter information using letters. The next innovation was the mouse, a transformational interface since it opened up the entire screen and was not limited to a single-directional input. Despite this, it still felt unnatural. When the next innovation came in the form of the touchscreen, the interaction felt more natural. “The smartphone era that we are living in is completely transformed because of the touch interface,” he said.
The next phase was that of voice assistants, where, in addition to touching the screen, you could also speak to the system. Chatbots followed, which enabled users to speak in a smarter way and use more natural language to say what they wanted.
Now, the next level is being able to integrate voice into other forms of experiences that we are used to, say visual experiences. And that’s precisely what Slang Labs is doing.
Using natural language understanding to smooth the edges
Sharing more details, he briefly traced the history of voice recognition back to 1922, when the first voice recognition system came into being in the form of a toy dog called Radio Rex that would respond to its name being called.
Then came systems that recognised numbers, then connected speech involving entire sentences and not just individual words, and then statistical models built on linguistic patterns. In 2010 came deep learning, which is changing the world today as computers can recognise patterns on their own.
Now, the next level is to enable these systems to recognise more languages with less training data. “English is vast. But when there are smaller learning models, say in the case of regional vernacular languages, it becomes a limitation. Also, speech recognition is just one part. Once the user speaks, the words have to carry semantic value, so that the system can understand the meaning and extract the knowledge from the statement using natural language understanding. This is a significant part of the innovation.
“With NLU, systems should start understanding extremely natural conversations and give firm responses automatically; they should be able to self-train, learn, and speak back. The next level of innovation is converting text to speech and smoothing the edges by bringing in the elements of emotion and correct intonation.”
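The pipeline Kumar describes, speech-to-text first, then an NLU step that maps the transcript to something the app can act on, can be sketched in miniature. The example below is purely illustrative and is not Slang Labs’ platform or API: the intent names and patterns are invented, and real NLU systems use statistical or deep-learning models rather than hand-written rules.

```python
# Toy sketch of the NLU step: map a transcribed utterance to an intent.
# Intent names and patterns are hypothetical, for illustration only.
import re

INTENT_PATTERNS = {
    "buy_song":         re.compile(r"\b(buy|purchase)\b.*\bsong\b"),
    "order_pizza":      re.compile(r"\border\b.*\bpizza\b"),
    "activate_roaming": re.compile(r"\b(activate|enable)\b.*\broaming\b"),
}

def understand(utterance: str) -> dict:
    """Return the first matching intent for a transcript, else 'unknown'."""
    text = utterance.lower()
    for intent, pattern in INTENT_PATTERNS.items():
        if pattern.search(text):
            return {"intent": intent, "utterance": utterance}
    return {"intent": "unknown", "utterance": utterance}

print(understand("Alexa, buy the song"))                    # intent: buy_song
print(understand("please activate international roaming"))  # intent: activate_roaming
```

A real system replaces the regex table with a trained model that also extracts slots (which song, which country), but the contract is the same: natural speech in, a structured, actionable intent out.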
Enhancing the app experience through voice
While all this tech is predominantly being used in assistants, is there a way to use all this power directly in an app and change the way people interface with it? That’s what Slang Labs is doing: enabling next-gen apps to use the advantages that voice-based assistants offer. “Instead of building a new Alexa skill, can I not bring it into an existing experience and make that experience significantly better? That was our reimagination, and that’s why we built Slang Labs, a platform to quickly build voice augmented experiences.”
Kumar then shared a short demo of their platform where the user of the Airtel app was able to activate international roaming on his phone in a matter of seconds. The demo showcased the automatic prompts by the Slang Labs platform and its multilingual capabilities.
A big shout out to Future of Work 2019 sponsors – Deployment partner Harness.io, Super partner GO-JEK, our Women-in-Tech partner ThoughtWorks, Voice Tech partner Slang Labs, Technology partner Techl33t, AI/ML partner Agara Labs, API Partner Postman and Blockchain partner Koinex.