More than 75 percent of organisations are processing Big Data and putting the resulting insights to business use. But data scientists might be in short supply.
Big Data is big. Huge, in fact. It is the hottest segment in information technology and, according to IDC, Big Data revenues are estimated to cross $187 billion in 2019. It is said that the amount of data in the world doubles every two years, and that by 2020 the digital universe will reach 44 zettabytes, or 44 trillion gigabytes of data.
The biggest revenue opportunities are in the manufacturing and banking sectors. But almost all major organisations are now heavily invested in Big Data, and it is being touted as the “definitive source of competitive advantage” across industries.
Here are five interesting trends from the worldwide Big Data market, as observed by Qubole, a big data-as-a-service firm, in its 2018 Big Data Activation Report.
Big Data processing is widespread
About 76 percent of companies “actively leverage at least three big data open source engines” and put those findings into “active use”. The most popular engines are Apache Hadoop/Hive, Apache Spark, and Presto, and these are used for data preparation, machine learning, and reporting and analysis workloads. “Data activation strategies are becoming more nuanced in matching the best tool for the individual job,” says Qubole Co-founder and CEO, Ashish Thusoo.
Huge volumes of commands being run
Over 58 million commands were processed by users in the three main engines (Apache Hadoop/Hive, Apache Spark, and Presto) in 2017. In one year, total usage across the three engines has grown by 162 percent. Presto and Apache Spark are the fastest growing engines. Presto, in particular, has surged, “experiencing a 420 percent growth in compute hours and 365 percent expansion in total number of commands run.” Customers in aggregate are running 24X more commands per hour in Presto than in Apache Spark, and 6X more than in Apache Hadoop/Hive.
New tools gaining adoption
In addition to the top three engines, nearly 30 percent of organisations have used new tools like Apache Airflow for “orchestrating sophisticated data preparation pipelines and operationalising machine learning using Python code”. Airflow allows monitoring of jobs, handling of failures, and so on. Other tools like XGBoost (a predictive machine learning tool), Pandas (a Python-based data science tool used for statistical analysis) and MLlib (Apache Spark’s ML library) are also gaining acceptance.
Increased productivity and automation in focus
While usage and implementation grow, data-driven organisations are focused on optimising the number of users running commands in each engine, so that costs fall and the process is nearly automated. For small-scale implementations, the ratio is 16 users per engine; for medium implementations, it is 48 to 1; and for large-scale implementations, it rises to 188 to 1.
“With self-serve data access, analytics and data teams are able to spend more time on higher-value tasks, such as uncovering previously hidden insights, identifying new revenue streams, improving the user experience, or modernising their processes, with minimal intervention from the DataOps or DevOps teams.”
Data scientists in short supply
In the US, hiring of data scientists has grown over 650 percent since 2012. Yet only about 35,000 people in the US have data science skills, while over 190,000 data-related jobs remain unfilled in the US alone, and hundreds of organisations are on a hiring spree. There is a “huge skill gap” in Big Data, and it is one of the reasons why organisations look to automate their engines as much as possible.