When we built Woopra’s database, we wanted to optimize it for our use case, which in its simplest form, is tracking unique visitor events. The applications of this use case are myriad and require some serious engineering chops if you want to achieve accuracy and speed while maintaining a datastore that's efficient, fault-tolerant and agile.
This article describes the critical design decisions at Woopra that led us to build a custom database using the concept of schemaless database design. Let’s dive in!
Column Store Database - Yay or Nay?
If you've come across analytics infrastructure or have built one yourself, you might know that the analytics use case is one where a column store generally works better than the traditional, row-based database.
When a query runs on a relational database, the system needs to access information from all the tables that are being joined and then return the required result.
For example, if you want to calculate the total number of unique visitors on your website that purchased a high selling item over a given period, you would need to access the user_ID, first name, last name, item_ID, purchase_price, transaction_ID and transaction_date.
In a relational database, the system would need to read the user table, the item table and the transaction table. Unless you need to return all or most of the columns from the three tables, reading all of the data to calculate the count of unique visitors is cumbersome and as a result, very costly for an analytics use case.
In contrast, in a column store, each attribute has its own file, so it’s much faster when you want to access data at scale, which is typically the case for an analytics platform.
Column stores are designed to handle analytics queries better as they typically have to read column data on a deeper level and perform aggregate level analysis on a small subset of columns.
In many column store databases, the data is completely denormalized too which helps to quickly query and retrieve only the relevant fields.
In our example, we only need the files that contain the user_ID, item_ID and transaction_date data to calculate the number of unique visitors that purchased that item in a particular date range.
This is a lot more cost-effective as the rest of the fields are ignored and the query performs much faster. As a result, column stores significantly reduce the disk IO for reading large amounts of data.
Our customers track hundreds of events that have an average of 60-70 properties each. The ability to minimize IO was critical to building a high performing database.
In a query-heavy environment such as Woopra’s, it was important to identify how we logged data and consequently, how we accessed it to allow even the most complex queries to run with unparalleled smoothness.
There are dozens of column store database management systems (PaaS and open-source) in the market such as Amazon Redshift, Google BigQuery, Druid and MariaDB. These are fully-managed services that offer the scale and cost efficiency that analytics providers need.
With so many solutions already available, you might wonder why we decided to go through the trouble of designing and building a custom database for Woopra?
For a use case where tracking user events is mission critical, we realized that we needed a database that not only supported millions of data points but also offered a highly optimized data ingestion and querying mechanism that would enable us to build a sophisticated, real-time infrastructure.
Following are some of the key decision points that led us to take the plunge and build Woopra’s proprietary database:
1) Real-time Actionability
Here at Woopra, we're obsessed with ‘real-time actionability’! Column store databases typically need 15 minutes or more to first commit the data into the system before running a query.
In Woopra, we wanted to be able to respond to an HTTP tracking request with analytics-based automated triggers, taking into account the very event that's being tracked. So, we needed to be able to access the user tracking data while it's being processed to make it truly real-time. Further, we required a database that's able to read the complete historical tracking data that's already committed, as well as the pre-committed data in the tracking request itself, to trigger automated actions in 100-200 milliseconds!
To accomplish this enormous feat, we needed a custom database with the flexibility to automate tracking events and trigger actions while the event is occurring.
Not even our closest competitors have the ability to power this level of automation that drives real-time actionability, allowing our customers to achieve real-time personalization, at scale.
Column store systems such as Amazon Redshift come with one major disadvantage: the need to have pre-defined schemas. The unique visitor tracking use case that Woopra caters to would work only if it were schemaless.
Since our customers bring their data into Woopra from multiple sources and each customer has an entirely different set of data points they want to track, not having pre-set data types in the repository helps Woopra define “columns” on the go, as and when the data flows in.
Additionally, we wanted to avoid any massive outages in case of schema changes or migration activities that would save us hours of processing time, thus keeping the data storage model completely agile.
Being schemaless increases the flexibility of storing all sorts of event data with different characteristics, without having to pre-define its structure in the database. It allows us to put incredible power in the hands of our users by giving them the ability to build customizable schemas so they can define how the data is tracked and processed in Woopra.
3) Unique ID Constraint
While the commercially available column stores function well, they're more general purpose and have other limitations that eventually helped us narrow down our requirements.
Databases such as Amazon Redshift do not enforce unique IDs. This means that the primary key and foreign key constraints are only informational. This can cause massive data quality issues!
We realized that any primary key or foreign key violation would be detrimental to analyses. So, we designed a unique ID mapping system that is incredibly intricate, allowing us to store unique IDs for visitors in a way that is paramount to our functionality.
4) Database Maintenance
Commercially available systems typically require you to perform periodic maintenance tasks to keep the data consistent. This maintenance requires resource-intensive operations (for example, the VACUUM command in Amazon Redshift or Cassandra’s compaction) that can last for hours, which means that the data may not be available at all times.
Woopra’s mission was to build a next-generation database that would overcome the above limitations and allow us to scale our business and cater to a wide range of business needs. We wanted to have the flexibility to make our data available to our users at all times, while creating an architecture that's designed to include maintenance in our query set up proactively.
Building a custom database ensured that we render extremely powerful results while maintaining data consistency.
Woopra’s proprietary database is built to deliver a superior customer experience without compromising on accuracy and agility of the system. A schemaless database allows us to process a range of queries by retrieving only specific data from the system.
Because of the robustness of our ID mapping system, we avoid aliasing issues and the system successfully maintains the user’s unique identities throughout multiple digital interactions and touchpoints.
Building our own proprietary database has allowed us to process queries that can read new and existing data to deliver real-time insights that enable our customers to tactically achieve a massive competitive advantage!
Stay tuned for more! In the next blog in this database series, we’ll discuss how our system architecture enables our customers to cross the chasm between after-the-fact analysis and real-time personalization!