The world is one big data problem. – Andrew McAfee, Principal Research Scientist, MIT
You can have data without information, but you cannot have information without data. Any business will be inundated with data from various sources. As a multi-product company that has seen hyper growth over the past few years, we’ve had an abundance of data; generating over 50 TB of data per month across all our products.
With each of our software products touching a different aspect of the customer journey, we knew that one aspect was uncompromisable – data privacy and security. The biggest challenge that we have been trying to solve is interpreting this data and generating insights out of it, and ensuring that all the data we work with was always secure. So while designing the system we ensured that we kept the highest standards around security, governance, and compliance.
Our teams have grown exponentially, and today, between the data science team, machine learning team, customer success teams, product teams and business operations teams, we have a lot of people who interpret, analyze and build models based on the available data.
Previously, only the IT department had access to data and business units had to go through the IT department to access the data. Other business teams of the organization had to go through various time-consuming manual actions to generate any type of insight to make informative decisions. On one hand, there are teams that have a constant need for data access and on the other, there is a constant challenge to make the data available to the teams while keeping in mind the security and privacy norms.
Earlier, if teams needed data which was critical, they would have to get security clearance and then the DevOps team would run queries on our database. Only then would they be able to get the data they needed. We wanted to devise proactive solutions that help tackle this challenge. We wanted to democratize data and make our organization to function on data-driven solutions.
The primary step to achieve our goals was to build a platform that provides transparency and enables the users to trust the data they were working with. It was clear that the need of the hour was to build a data lake. This urged us to kick-start the Big Data initiative at Freshworks.
We needed to create a large body of data at rest, or a Data Lake, as known across the industry. Like any other man-made lake, a Data Lake needs steady input streams and a treatment plant to process the data and make it useful. Apache Hadoop had all the tools we needed to ingest, process and enrich the incoming data.
The construction of an artificial lake – Baikal
Every team has its own unique goals and priorities. The data generated by them is stored in data silos. The repository of fixed data is under the control of a single organizational unit and cannot be easily accessed by the other teams. To derive business insights and take data-driven decisions in our day-to-day work, we needed to ensure that all teams had access to this information. In order to consolidate the data silos built for special-purpose applications, we built a scalable data repository, ‘Baikal’ using Cloudera as a platform on AWS.
Cloudera is a provider and supporter of Apache Hadoop for the enterprise. It provides a unified platform to store, process and analyze data which is largely used for Big Data analysis. It also aids in finding new ways to derive values from existing data.
We put together a small team of specialized engineers to help accelerate and reach our goal of building a data lake. We named our artificial lake after the largest freshwater lake, Baikal. We designed it to capture data from all our products (be it internal or external) and across all possible formats (structured and unstructured) and store it in a way that was queryable. We started building the platform using analytics and processing frameworks/libraries like Apache HBase, Apache Hive, Apache Pig, Impala, Apache Spark and other ecosystem components. Our goal was to manipulate raw data and derive real insights rather than make it a mere query and data extraction tool. We wanted our data lake to be a genuine enterprise data hub.
Arriving at the right platform
When we started, we went with the obvious choice of using Amazon EMR. However, Amazon EMR had grown out of what was originally “transient” type workloads. Amazon EMR did not work well when we needed to keep the cluster around for a prolonged time. Being tied to the AWS stack meant that we did not have the choice of accessing rich Apache components as in Cloudera or Hortonworks.
We looked into third-party data lake providers as well. S3-based data lakes are easy to set up. Data is kept in S3 and clusters are spun up on EC2 to query the data. Even spot instances can be spun up or down based on usage. After detailed deliberations, we decided to build the data lake inhouse. That would allow us to build a complete solution that fit our needs, adheres to our strict security standards, allow fine grained access control, and would have right performance/cost characteristics. Moreover, it would help us develop an in-house expertise that comes in handy in making use of a large scale system like this.
Cloudera gave us a platform where we could start working on right away. From an engineering standpoint, it gave us the same code that comes with the Open Source Stack but with bug fixes, patches, and interoperability with Kerberos and other stacks in the same family. Cloudera gave us predictable and consistent access to platform improvements along with good community support.
We had laid out the fundamentals of the building process and set our end goals. The entire process required us to gather data that was being ingested from various sources, process it and store it in some easily accessible form.
Freshworks’ data is spread across RDS, S3 and custom data sources. In Baikal, we made use of an existing stack to ingest data from these sources. Apart from that, we wrote a lot of custom data pipelines (using AWS and Cloudera Stack) to move data at regular intervals. In a few cases, we wrote custom connectors to ingest data and take it to our data storage in Hadoop Distributed File Storage (HDFS) and Apache HBase. We wanted to efficiently create elaborate data processing workloads that are fault tolerant, repeatable and exhibited high availability (HA).
Sqoop, a command-line interface application, is the most commonly used component for ingesting data from RDBMS to a Hadoop Cluster. A staging area or a landing zone is created as an intermediate storage area. It is used for data processing during the extract, transform and load (ETL) process. The final relevant data is stored in a cluster which is consumed by Hive or Pig components.
We cater to different teams and hence, different ways of processing the data. While some teams need quick immediate reports, some others find it more time-optimal to register a job and have the report delivered to their emails. We use tools like Oozie and Luigi to help with creating custom workflows.
Some other teams tapped into Apache Spark for batch-processing capabilities. Spark – which has APIs in Scala, Java and Python – runs in our Hadoop clusters itself. It uses YARN for using internal memory and physical disks for processing the data, saving our developers time and resources.
While we opened up Spark for batch processing, we knew that it provided the constructs for streaming analytics as well. At Freshworks, we had a team working on Central Kafka Platform and we were looking to leverage the same downstream. Developers were able to build a real-time streaming application using Apache Spark and many such pipelines were being used by our machine learning team to build bots.
When we started with the cluster sizing experiment, we were clear that we would not go ahead with transient clusters. We needed a combination of a different family of machines to address different workloads. The diagram below depicts how we started classifying our machines based on various workloads and types.
We chose an EC2 instance with ephemeral storage as it provided the highest throughput to HDFS. Historically, it has been a clear choice for most Hadoop users.
For ease of use, we stored our data in the form of Hive tables. The data pulled from RDBMS is stored in the same way as in the source. We did comprehensive tests in order to choose the right compression format (Snappy, gzip and such), and find a balance between higher compression ratio vs. performance, keeping in mind CPU usage as well.
We also ran numerous tests and changed the storage formats several times to suit our needs. Eventually, we found that columnar stores like Apache Parquet offered very good flexibility between both fast data ingestion and fast random data lookup.
We started with a very simple pipeline: a mere Hadoop stack with Sqoop, Hive, and Oozie to ingest data from RDS. We faced common issues like handling multiline columns, accented characters and incremental updates.
Taking into consideration the kind of apps that were running on our platform, we needed our data to be synchronized with the data source. We started by revamping the existing stack and used a known use case to popularize the platform. The stack sat behind Tableau (which can connect to Hive) as the visualization platform to make easy-to-read charts, infographics and reports, demonstrating the ease of integration with BI tools. In order to bring in more stakeholders to the table and generate new ideas for data usage, we started with a Customer-Churn use case to help our Customer Success team. We noticed that certain reports which were being run periodically were resource-intensive and thought that this could be a good start that would help us advertise the system and its capabilities throughout the organization. We automated the flow using Hive and Pig scripts and helped the customer success team to have the reports on the first of every month.
The stack started to automate a number of reports and helped in making the entire organization understand the importance of such a project. Later, we were able to release the data-platform to our largest consumer, the machine-learning team and support them in doing analysis and giving out insights.
This is the first of two parts on why we built our data lake. We wanted to share our story, and start a discussion with engineers across the world who’d be interested in the topic.
Check out the second part, where we dive into the engineering nuances of the data lake.