Who said developers can’t work on Machine Learning (ML)? Busting this and other myths was Rishi Singhal, a customer engineer at Google and an expert on Google Cloud Platform (GCP), during a Masterclass at Techsparks 2018. Rishi has extensive experience addressing the challenges startups and enterprises face with data processing and generating insights.
To set the context for how new-age cloud technologies help developers in their ML journey, Rishi began with a brief history of the evolution of large-scale data processing. He gave an overview of key aspects of data processing – streaming, windowing and pipelines – some of the challenges that coders face today, and the technologies that offer solutions. The Masterclass offered a quick rundown on how to converge batch and streaming in a single pipeline, how to do ML on top of a database, and how to rethink Hadoop and Spark deployments with the cloud.
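As a rough illustration of the windowing idea mentioned above (this is a conceptual sketch in plain Python, not any cloud API; the events and the 60-second window size are made up for the example), timestamped events can be grouped into fixed, "tumbling" windows like this:

```python
from collections import defaultdict

def tumbling_windows(events, window_size):
    """Group (timestamp_seconds, value) pairs into fixed-size windows.

    Returns {window_start: [values]} - the simplest windowing strategy
    used by streaming systems to turn an endless stream into finite,
    aggregatable chunks.
    """
    windows = defaultdict(list)
    for ts, value in events:
        window_start = (ts // window_size) * window_size  # align to window boundary
        windows[window_start].append(value)
    return dict(windows)

# Hypothetical page-view events, bucketed into 60-second windows.
events = [(5, "a"), (42, "b"), (61, "c"), (130, "d")]
print(tumbling_windows(events, 60))
# {0: ['a', 'b'], 60: ['c'], 120: ['d']}
```

Real systems add complications this sketch ignores, such as late-arriving data and sliding or session windows, but the core idea of bucketing by timestamp is the same.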
Rishi also explained how the Lambda architecture, which was extensively used until a few years ago, resulted in complex project management because of the need to maintain two different pipelines, two different codebases, and two different sets of people with the expertise to define and refine the infrastructure.
He also explained the evolution of key technology solutions and how it has aligned with evolving business needs. “Once Apache Hadoop became synonymous with batch processing, people began to think about how businesses could derive value from real-time data. Businesses wanted information that could help them answer: what should I bring in to improve my business? This led to a boom in technological advancements. We saw Apache Spark, Apache Flume, Apache Kafka, Apache Beam, and Google Dataflow coming in, along with a host of similar technologies. Today, it is cloud-based processing technologies like Google Dataflow that are redefining the programming experience for developers, while also aptly meeting business needs.”
While terming technologies such as Hadoop and Spark “revolutionary and foundational”, he shared that production deployments remain challenging. “Especially in use cases where the number of people on a particular project increases, there are breakdowns on these key technology platforms. To increase capacity, you need more servers, which, in an organisational setup, means going through approvals; it often takes months before a cluster is ready. Then there are additional issues related to maintenance, cost, etc., which add to the woes. Even though this has become passé with the adoption of newer technologies, it’s still a day-to-day reality in many organisations.”
With organisations of all sizes adopting the cloud, the question today is, “How is the cloud helpful for existing technologies like Hadoop and Spark?” said Rishi. “The biggest differentiator comes in the form of the cloud’s ability to segregate data and processing. If your data resides in a bucket (in other words, cloud storage), you can start and end the processing anytime you want. Also, computing is what costs money. If it is on the cloud, you can start the cluster only when you want. And cloud clusters start within a short time, as fast as 90 seconds. Additionally, it comes with the advantage of no maintenance and no setting up of infrastructure, which means you are up and running almost immediately.”
The Masterclass showcased how Dataflow, Google’s fully-managed service for transforming and enriching data, helps solve some of the biggest data processing challenges while also integrating seamlessly with data warehousing. Rishi explained how the Apache Beam-based SDK lets developers build custom extensions and even choose alternative execution engines, such as Apache Spark. Through a demo, the Cloud Engineer also explained the different elements of Cloud Dataflow – graph optimisation, processing, dynamic work rebalancing, automatic scaling, auto-healing, etc. – and how Dataflow works in different use-case scenarios. He also shared a common pipeline pattern on Google Cloud that solves 95 to 99 percent of data analytics challenges.
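The “one pipeline for both batch and streaming” idea at the heart of the Beam model can be sketched in miniature (this is plain Python mimicking the concept, not the Beam SDK; the record format and sources are invented for the example): the same chain of transforms is defined once and applied unchanged to a bounded source or an unbounded one.

```python
def pipeline(source):
    """One transform chain - parse, validate, reshape - over any iterable.

    Because it only consumes an iterator, the same definition works for
    a bounded (batch) source or an unbounded (streaming) source, which
    is the convergence the Beam model provides at much larger scale.
    """
    parsed = (line.split(",") for line in source)
    valid = (rec for rec in parsed if len(rec) == 2)   # drop malformed records
    return ((user, int(amount)) for user, amount in valid)

# Batch: a bounded list of records (hypothetical data).
batch_source = ["alice,10", "bob,20", "bad_record"]
print(list(pipeline(batch_source)))        # [('alice', 10), ('bob', 20)]

# "Streaming": a generator standing in for an unbounded source
# such as a Pub/Sub subscription.
def stream_source():
    yield "carol,5"
    yield "dave,7"

print(list(pipeline(stream_source())))     # [('carol', 5), ('dave', 7)]
```

In Beam proper, the runner (Dataflow, Spark, etc.) executes the pipeline, and windowing and triggers handle the fact that a streaming source never ends.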
The session also covered features of BigQuery – Google's serverless, highly scalable enterprise data warehouse – designed to enable data analysts to find meaningful insights using familiar SQL, without the need for a database administrator. Reiterating the interesting features of Google Cloud, Rishi said, “BigQuery supports analysis of GIS data, which means you can now incorporate geospatial information into your analytics workflows, even as your datasets grow into the petabytes.” (A Geographic Information System (GIS) enables the management, analysis and presentation of geographical details from a variety of sources.)
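A hedged sketch of what such a geospatial query could look like (the project, table, and column names are hypothetical; `ST_GEOGPOINT` and `ST_DWITHIN` are BigQuery GIS functions, and actually running this requires a GCP project and the BigQuery client):

```python
# Hypothetical BigQuery GIS query: find stores within 5 km of a point.
# Table and columns are invented for illustration.
query = """
SELECT name
FROM `my_project.my_dataset.stores`
WHERE ST_DWITHIN(
  ST_GEOGPOINT(longitude, latitude),   -- store location (lng, lat order)
  ST_GEOGPOINT(77.5946, 12.9716),      -- reference point: Bengaluru centre
  5000                                 -- distance threshold in metres
)
"""
print(query.strip())
```

Note that BigQuery's `ST_GEOGPOINT` takes longitude first, then latitude.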
Rishi also shared some interesting numbers on BigQuery, which runs blazing-fast SQL queries on gigabytes to petabytes of data:
- Largest query by rows processed: 10.5 trillion rows
- Largest query by data processed: 2.1 petabytes
- Largest storage customer: 62 petabytes of data
- Ingestion rate: 4.5 billion rows per second
“ML is not easy unless you have a strong base in mathematics or statistics. Also, not everyone is required to do that,” said Rishi, explaining why and how Google has been working towards the democratisation of ML.
“First, let’s see the evolution of data sets. In the 90s, it was just plain data sets. All you predominantly did was collect the data, do some SQL programming, and get results. In the 2000s, analytics and BI came into the picture. Then came managed analytics, where businesses didn’t have to worry about managing infrastructure, just analytics. Today, businesses want ML to be democratised. They want regressions and classifications to be done by people who are proficient in SQL, not just data scientists or analysts. And that’s what Google has done – making ML as relevant for developers as for data scientists. If you are a developer working on languages like Java or Python but don’t have expertise in ML, Google has a number of cloud ML APIs. All you need to do is go to cloud.google.com. You will have access to Translation APIs, Text-to-Speech APIs, Speech-to-Text APIs, and Vision/Video APIs, among a host of others. But these are generic APIs; if you want to do something for a specific use case, you can think of AutoML, wherein you input data and AutoML does the churning and gives you a model to use. For a SQL analyst, there’s BigQuery ML (BQML). For the data scientists – those who hold a PhD in statistics or mathematics and use deep neural networks to do machine learning – there’s Cloud Machine Learning Engine.”
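To make the BQML point concrete, here is a hedged sketch of how a SQL analyst trains and queries a model entirely in SQL (the dataset, table, and column names are hypothetical; the `CREATE MODEL` and `ML.PREDICT` statements are BigQuery ML syntax, and running them requires a GCP project):

```python
# Hypothetical BigQuery ML workflow, expressed as the two SQL statements
# an analyst would run. No Python ML code, no infrastructure to manage.
train_sql = """
CREATE OR REPLACE MODEL `my_dataset.churn_model`
OPTIONS(model_type = 'logistic_reg') AS
SELECT tenure_months, monthly_spend, churned AS label
FROM `my_dataset.customers`
"""

predict_sql = """
SELECT predicted_label
FROM ML.PREDICT(MODEL `my_dataset.churn_model`,
                (SELECT tenure_months, monthly_spend
                 FROM `my_dataset.new_customers`))
"""

print(train_sql.strip())
print(predict_sql.strip())
```

The training query's `label` column is what BQML learns to predict; everything else in the `SELECT` becomes a feature.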