The geek from Egypt who’s taking on the world of big data – meet Cloudera's Amr Awadallah
Amr Awadallah would be a professor today had he not stumbled upon the world of startups. The focus of this week's Techie Tuesdays, Amr co-founded Cloudera, the data management and analytics platform built on Apache Hadoop and other open source technologies. The company went public earlier this year and is now valued at almost $2.5 billion.
During a recent conversation with Amr Awadallah, Founder and CTO of Cloudera, I not only learnt of his fascinating journey, but also about an aspect of the Egyptian life that I until then thought was peculiar to India. Listening to Amr talk about his early years, I realised that Indian parents aren't the only ones to decide or force a career path on their children. Egyptian parents too do that. I also learnt that Yahoo! has a product division called Yahoo Shopping, and that Mendel Rosenblum, one of the co-founders of VMware, is a professor of Computer Science at Stanford University.
Amr would’ve remained a computer enthusiast irrespective of his career path. If not a techie and CTO, he would have become a professor, a career path his father had chosen for him.
Things were on track until the startup bug bit Amr. He started up and exited while still a student. After getting his doctoral degree, he went back to the startup world again, this time with more confidence and knowledge. Amr founded Cloudera with three other people in 2008. Earlier this year, the company went for an IPO and is currently valued at nearly $2.5 billion. Cloudera is one of the largest platforms in the world for data management and analytics.
This week’s Techie Tuesdays delves into Amr’s journey, from Cairo to California.
The Egyptian geek kid
Amr was born in Cairo, Egypt in 1970. Soon after his birth, his father moved to Southampton in the UK to do his PhD in Accounting. Amr and his mother joined him a little later and they spent five years there. After finishing his PhD, his father returned home and joined Cairo University as a professor. Amr recalls his father pushing him to become a university professor from an early age. It was quickly becoming Amr’s goal too.
Amr did very well in school and graduated at the top of his class in Cairo University's Faculty of Engineering. A teaching job was seen as a logical progression for students excelling at the undergraduate level. Amr enrolled for a Master's in computer engineering.
He studied animation and worked on creating characters that walked like humans—as opposed to animators drawing them frame by frame. He explains,
I analysed walking sequences from videos and then created mathematical formulas (solved equations) that can replicate the walking sequence. Based on height and weight of an individual, the system would automatically generate a realistic walking sequence.
Amr loved video games ever since he was a child and even saw a future in animation. He wanted to make the process of creating characters in the games easier, as well as endow them with movements that were as natural and real as possible.
After his postgraduation, he applied to many universities in the US and was accepted at Stanford besides several others.
Stanford and startup
The first few months at Stanford were tough for Amr. He had moved to the US with his wife and their little baby and did not know many people there. Also, Amr, who was used to topping his class in Cairo, found himself merely average among his classmates at Stanford.
Initially, Amr planned to do his PhD in computer networking and he even published a couple of papers on the subject. While at Stanford, he got exposed to entrepreneurship very early. He recalls,
I participated in entrepreneurship classes and lectures just out of curiosity. I saw all the speakers make it seem much more attractive and achievable to create a company. Being in the Bay Area, you get to see the companies we've been hearing about, like Oracle, Intel, HP from close, and learn of their journey.
All this made starting up relatively easier in the eyes of students. In 1999, Amr joined one of his friends, Thai Tran, who was building a shopping comparison service for books called BookSmart (later renamed VivaSmart). Since it was the height of the internet bubble back then, Amr could comfortably raise about $500,000 for his venture.
VivaSmart was a comparison shopping engine that could crawl all over the web and aggregate all kinds of information about a product (specifications, images, etc) along with prices from different websites. Amr adds, “We started with books because students are price sensitive about books.”
VivaSmart stored the aggregated information in an Oracle database and used Perl scripts and programs to build the agents that crawled the web and ferreted out that information. The company had an on-premise data centre.
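The aggregation step described above can be sketched roughly as follows. This is only an illustration in Python, not VivaSmart's actual code (which, per the article, used Perl and Oracle); the `Offer` type, the seller names, the ISBNs and the prices are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    """One price for one book as found on one seller's site (hypothetical shape)."""
    isbn: str
    seller: str
    price: float


def cheapest_offers(offers):
    """Group offers by ISBN and return the lowest-priced offer for each book."""
    by_isbn = defaultdict(list)
    for offer in offers:
        by_isbn[offer.isbn].append(offer)
    return {isbn: min(group, key=lambda o: o.price) for isbn, group in by_isbn.items()}


# Made-up sample data standing in for what a crawler might have collected.
crawled = [
    Offer("9780000000001", "seller-a", 42.50),
    Offer("9780000000001", "seller-b", 39.99),
    Offer("9780000000002", "seller-a", 18.00),
]
best = cheapest_offers(crawled)
```

Here `best` maps each ISBN to its cheapest offer, which is the core of what a price-sensitive student would want from a comparison engine.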
The bubble burst
In 2000, it became obvious that the internet bubble was going to burst, and a lot of companies started shutting down. Since VivaSmart had a good product but few customers, Amr and his co-founder decided to sell the company. They approached some of their clients and one of them made a good offer. But on the day the deal was to be inked, it was called off. Amr says, “We were devastated. But we went back and looked at other companies we had approached.”
Eventually, Yahoo! agreed to acquire VivaSmart in June 2000 for $8.9 million, and Amr and his co-founder joined Yahoo!
Back to college
While focusing on VivaSmart, Amr stopped working on his PhD thesis and, as a result, by 2001, his two years of leave of absence came to an end. He was keen to go back and finish his PhD and told Yahoo! the same. But the company, wanting him to continue to work for them, agreed to pay Amr’s tuition fee and gave him enough time to work on his PhD. Professor Mendel Rosenblum, who was one of the founders of VMware, agreed to guide Amr. Amr’s thesis was vMatrix, a backward-compatible solution for improving the interactivity, scalability, and reliability of Internet applications. He adds,
While at Yahoo!, I came up with this idea of using virtual machines to move servers and to be very close to the demands. If there's very heavy demand for a certain application, then you can move the server in real time to the region where the demand is high.
In his first four years at Yahoo!, between 2000 and 2004, Amr focused on the comparison shopping service, which was formed after VivaSmart’s acquisition. VivaSmart became the backend for Yahoo Shopping and Amr helped in integrating the technology and adding new features.
In 2004, Amr shifted to a new role: running business intelligence and data analytics to measure the performance of different Yahoo! products and optimise it. He considers this experience fundamental to the idea that later became Cloudera.
The genesis of Cloudera
At Yahoo!, Amr had built a business intelligence platform using Oracle technology for data warehousing and data systems where all the data was aggregated. He used IBM DataStage for data collection, and MicroStrategy for business reporting and dashboards. While doing all this, he faced the following problems in his data pipeline:
- Speed at which he could process the data: Amr’s team was collecting data from all over the world and every midnight they had to close these files, clean them and aggregate the data in a way that fits their data model and schema. This operation took around eight hours to finish, which was too long.
- Cost of keeping data online: The Oracle system was very expensive for keeping data online and alive for long periods of time. So, every now and then, the team ended up moving data to backups. But data sent to backup was no longer available for analysis.
- Business users, data scientists and data analysts wanted to change the types of questions frequently. Traditional data warehousing in BI is built on fixed schemas or fixed data structures, which are very difficult to change. A change can take a lot of time, even weeks. So, Amr was looking for something more flexible and agile in terms of making changes.
- A lot of problems didn't fit nicely with SQL. Things like detecting fraud, figuring out the social graph (for people interacting on Yahoo! Messenger) and image processing (for Yahoo! image-based products) were much harder problems than SQL could handle. The team needed to go beyond SQL in terms of its data processing capabilities.
At that time, a group at Yahoo! was working on Hadoop for Yahoo! Search. Amr recalls,
The legacy infrastructure of Yahoo! Search wasn't scaling very well. They needed a new engine that can crawl the web and create an index for Yahoo! Search. They chose a new open source, Hadoop.
Amr met the team working on this and described some of his BI problems and data science challenges. He tried Hadoop and it solved all four problems. He adds, “Eight hours of data processing was shortened to five minutes. Cost also came down ten times on a per gigabyte basis. It was much more flexible to change the schemas and I could use many other ways to analyse data other than SQL.”
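To make the flexibility point concrete, here is a minimal single-machine sketch of the map, shuffle and reduce pattern that Hadoop's batch processing is built around. It is plain Python, not the Hadoop API, and the log line layout, function names and sample data are all assumptions made up for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Parse one raw log line at read time and emit (key, value) pairs.

    The 'timestamp country page' layout is a made-up format for illustration.
    """
    parts = line.split()
    if len(parts) != 3:
        return []  # skip malformed lines instead of failing the whole job
    _timestamp, country, _page = parts
    return [(country, 1)]


def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine every value seen for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run the three steps in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical raw log lines, including one that does not match the layout.
raw_logs = [
    "2007-01-01T00:00:01 EG /shopping",
    "2007-01-01T00:00:02 US /search",
    "2007-01-01T00:00:03 EG /mail",
    "corrupted line",
]
views_by_country = run_job(raw_logs)
```

Because the raw lines are interpreted only when they are read, asking a different question (say, views per page instead of per country) means changing `map_phase` rather than reloading the data into a new fixed schema. On a real cluster the map and reduce steps run in parallel across many machines, which this sketch does not attempt to show.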
Amr finished his PhD in 2007 and left Yahoo! in 2008. By then, he had figured out that Hadoop would become big. He joined Accel Partners as entrepreneur-in-residence (EIR) and got a chance to work on idea stages of startups that were being pitched to the firm. Soon, he decided to work on Hadoop startups as he understood the problem and the solution really well.
He met his first co-founder, Jeff Hammerbacher, who was also an EIR at Accel at the time. Jeff had started the data science team at Facebook. In fact, he is credited, along with DJ Patil, with coining the term 'data scientist'. Jeff too had experienced the power of Hadoop for scale, agility and cost efficiency. Just a few weeks down the line, Amr and Jeff heard about two people working on the same idea and decided to meet them on the advice of the investor to whom they were pitching. The duo met Mike Olson, who went on to become Cloudera's CEO for the first five years, and Christophe Bisciglia. The four, bringing complementary skills to the table, formally created Cloudera in October 2008.
Cloudera in the last nine years
Amr believes that every startup goes through three phases:
- Discovery phase, where you discover the problem you’re solving and come up with the solution, that is your product. Since the Cloudera team had already solved the problem at Yahoo! and Facebook, this phase was a short one for them.
- Product-market fit phase, when your product in its current form fits what the market expects, and does so in a repeatable way without a lot of customisation for every customer. This phase went on for almost three years for Cloudera. Amr says,
“Initially, we were going to offer our product only as a cloud service. When we spoke to some bigger firms in telecom, finance and insurance sectors, we realised that they wanted to run it on-premise. So, in 2009, we changed our product from being a cloud-based one to an on-premise product.”
- Scaling phase: Once you've proved that your product has a good fit in the market, it’s all about how you scale it and how your team handles demand worldwide.
Cloudera has recently launched Altus, the platform-as-a-service version of their product. Cloudera counts Airtel as one of their largest customers globally. The company also has a couple of other customers in the Indian government.
Tech stack at Cloudera
Cloudera started with Hadoop focused on batch processing and storage, using MapReduce and HDFS (Hadoop Distributed File System). Currently, the company has almost 30 different projects. Amr says,
We believe that if data is leveraged correctly, then many of the problems that cannot be solved today can be solved tomorrow using that data.
He adds, “To achieve that result, we need to have a platform that is flexible in terms of consuming any type of data but can also extract the value of the data in different ways.”
Some of the offerings of Cloudera include:
- Impala is an SQL engine that allows you to do SQL analytics on top of data.
- Apache Solr is a search engine that allows you to do semantic and text search on top of your data.
- Spark is a data processing engine that allows you to do machine learning, advanced analytics and data science on top of your data.
Amr notes that the challenges for a startup, though many, vary according to the stage it is in. During the product-market fit phase, Cloudera’s main challenge was to figure out the features and functions of open source projects that they should include in the Hadoop distribution to meet the needs and expectations of the widest spectrum of customers possible.
When Cloudera started signing on a lot of customers, they had a challenge in terms of providing support to these customers. They now have bots in customer support that handle almost 20 percent of the workload of the support organisation. Cloudera finds it much more scalable than having humans handle the workload, and it plans to continue investing in bots.
The sixth wave of automation
Amr and the Cloudera team believe that by collecting enough data of different types on how humans make different kinds of decisions (not just structured data, but semi-structured data from sensors and mobile devices, as well as completely unstructured data from social media, email and documents), one can learn far more about the subject. This has been a part of Cloudera’s DNA from its early days.
Amr believes that human civilisation is now in the sixth wave of automation:
- The first wave happened almost 100,000 years ago when human beings figured out the automation of knowledge transfer by talking. We were able to pass information to each other and to our next generations faster than any other species.
- The second wave of automation took place almost 10,000 years ago. It was the automation of growing/making food using cattle and rivers at a much higher rate. This had three implications: hunters and gatherers lost their jobs; we were able to grow more food, so we could have more children; and we now had more time to think about higher-level abstract concepts.
- Automation of discovery, which took place 3,000-4,000 years ago, was the third wave of automation. That's us coming up with the principles of maths and science (chemistry, biology, etc) that allowed us to discover new concepts and techniques quicker than ever before.
- The Industrial Revolution, which started in the 18th century, was the fourth wave of automation. It was the use of machines to automate production for which we had been using humans. Countries that figured this out and adopted the Industrial Revolution significantly increased their economic output. But it also put many people out of their jobs.
- The fifth wave was the automation of process using information technology, i.e. using computers to automate the process of opening bank accounts, generating newspapers, closing the books at the end of the month, and communicating using WhatsApp or mail. The term 'computer' was used in the 17th century to refer to people who performed mathematical calculations (see https://en.wikipedia.org/wiki/Human_computer).
Right now, we’re in the sixth wave: the automation of decisions. It's about creating algorithms that can learn how a human makes a certain decision and then repeat that decision much quicker. Amr adds,
We now want to learn how humans make decisions about diagnosing a disease, reviewing legal documents, figuring out whether a given transaction is fraud or not, and then automating it using software to make it happen much quicker and in a more repeatable, scalable way.
He claims that Cloudera's stack is superior to those of Amazon and others. He says,
We have been doing this longer than the other companies and not just doing it for ourselves, but for big financial companies like Mastercard and JP Morgan. We know how these enterprises work and what works for them. From our interaction with the organisations, we have firsthand knowledge of how to take machine learning and productionise that and make it run in a mission-critical, stable environment.
In addition, Amr claims that Cloudera has a bot/AI that supports its customers' clusters, analyses their data and lets them know what changes need to be made to keep each cluster performing with high security and capacity.
Of team, decisions and values
As for Cloudera's employees, Amr looks for productive, smart people with a good work ethic who fit the company’s culture. He says,
You've to keep in mind that the interview process is imperfect. Even if you're the best interviewer in the world, every now and then you'll make a mistake and you'll hire somebody who wasn't the right fit. When you're a smaller organisation, you've to be quick in making the correction with wrong hires.
Cloudera follows consensus-driven decision-making. Amr believes that debates, even if they cause the occasional delay, lead to smarter decisions.
Many of Amr's personal values shine through at Cloudera as well:
- Be the change: Be proactive, have initiative, self-drive and self-motivation. He adds, “When things aren't going the way you would expect, don't complain about it. Don't be overly sad about it. Try to analyse the situation to the best of your capability and decide what can be done going forward. Being negative and lamenting takes away energy and leaves nothing.”
- Be open: It's about being transparent, trustworthy and displaying high integrity and ethics.
- Be hungry to learn and grow more.
- Fly in formation: It's about teamwork.
Past and future
One of the prominent regrets in Amr’s life is missing out on an opportunity to prevent one of Cloudera’s key competitors from being born. He says, “Hortonworks came from Yahoo! and I knew about that team. In some ways we could have intervened to acquire the team or at least hire some of the people from the team and prevent that spin-off from taking place.” Hortonworks is valued at more than $800 million now.
According to the International Data Corporation, automation of decisions will be a $200-billion market by 2020. Amr believes that over the next 10-15 years, Cloudera has enough opportunity to become a $30-100 billion company. With Cloudera, Amr wants to create the Oracle or the IBM of the future. He wants to enable this wave of automation of decisions that will change finance, insurance, governance, smart cities and how we all live.