Towards Real-Time Analytics Using a Data Lake and Event-Driven Architecture – Part 1

Naresh Reddy
Published in Licious Technology · 9 min read · Sep 6, 2023

Objective:

In the age of data-driven decision-making, organizations are striving to build robust data architectures that can handle diverse data sources, process data efficiently, and enable real-time insights. At Licious, we have taken a significant step towards modernizing our data platform to better align with our goals. In this article, we'll explore the evolution from a traditional architecture to a more versatile and dynamic one centered around a data lake and event-driven ingestion pipelines, with both real-time and near-real-time data availability that enables more effective business decisions.

Previous Architecture:

Our existing architecture combined Amazon Redshift with Change Data Capture (CDC) and batch ETLs orchestrated via Airflow, as shown below:

Previous Data Platform Architecture

Challenges in Previous Architecture:

We often encountered the following issues with the previous architecture.

  1. Replica Challenges: If a replica experienced a restart, all the CDC connectors had to be restarted manually, introducing additional maintenance overhead.
  2. Adapting to Schema Changes: Whenever there was a modification in the schema, it directly impacted the ingestion process, creating disruptions and delays.
  3. Storage and Compute Complexity: Given that both storage and compute resources are combined in Redshift, managing efficient scaling of both components became intricate and demanded careful optimization.
  4. Cost Implications: The issue of scaling naturally led to increased costs, as scaling in terms of storage and computing translated to higher operational expenses.
  5. Scaling Hurdles: As data volume increased, queries tended to become progressively slower, reaching a point where the architecture struggled to maintain performance.
  6. Real-time Limitations: The architecture’s performance fell short of achieving true real-time data processing, creating a gap between expected and actual real-time insights.
  7. Replica Lag Impact: Instances where a replica encountered lag had a ripple effect, impacting other pipelines and overall data availability.
  8. Challenges with Unstructured Data: Having initially started with CDC, the architecture faced numerous challenges when dealing with unstructured data. This led to a pivot toward directly scheduled ETL processes, as CDC proved insufficiently efficient and mature for handling such data complexities.

While the Redshift and CDC-based architecture offers numerous benefits, these challenges underscore the importance of careful planning, continuous monitoring, and a proactive approach to addressing potential pitfalls.

Data Lake Architecture:

To address the shortcomings of the Redshift and CDC-based architecture, we undertook a comprehensive transformation. Embracing an event-driven paradigm, we achieved real-time data processing, mitigating latency and inconsistencies. Our strategy incorporated a Lakehouse setup, strategically partitioned for optimal resource utilization. Additionally, we introduced a dedicated real-time store, ensuring data consistency and facilitating schema evolution. This revamped architecture, combining Lakehouse principles, forms the bedrock of our agile and dependable data integration and warehousing solution.

As we transitioned towards a more effective solution, let’s now delve into the detailed High-Level Design (HLD) diagram of our new architecture. This diagram provides a visual representation of the key components and their interactions, illustrating how our innovative approach overcomes the challenges we previously discussed.

Data Platform Architecture

Below, we outline the core components of our newly designed data platform architecture, detailing the strategic tech choices we’ve made to integrate real-time event-driven principles, a robust data lake framework, and streamlined data processing.

Our architecture begins with data generated by our applications and backend services, which is pushed to the event bus as events.

Event Bus with registry:

At the heart of our architecture lies the Event Bus, a robust solution built on top of Kafka and tightly integrated with the Confluent Schema Registry. This central communication hub serves as the backbone of our system, orchestrating the flow of events from various applications. Leveraging Kafka’s proven scalability and fault tolerance, the Event Bus guarantees reliable event transmission, even in high-throughput scenarios.

The Confluent Schema Registry, a core component of this architecture, ensures that events maintain a standardized structure across the board. By providing a centralized repository for event schemas, it promotes uniformity in data interpretation and processing. This means that regardless of the application source, events are consistently understood and efficiently processed as they move through the system. Confluent Schema Registry’s compatibility with Kafka further streamlines the integration, facilitating a seamless data pipeline that underpins the entire architecture.
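To make this concrete, below is a minimal sketch (not our production code) of how a service might publish a PAGE_VIEWED event through the Event Bus, with the Confluent Schema Registry enforcing the Avro schema. The topic name, schema fields, and broker/registry URLs are illustrative assumptions.

```python
from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer

# Illustrative Avro schema for a PAGE_VIEWED event (field names are assumptions)
schema_str = """
{
  "type": "record",
  "name": "PageViewed",
  "fields": [
    {"name": "user", "type": "string"},
    {"name": "platform", "type": "string"},
    {"name": "product_id", "type": "string"},
    {"name": "warehouse_id", "type": "string"},
    {"name": "timestamp", "type": "string"}
  ]
}
"""

registry = SchemaRegistryClient({"url": "http://localhost:8081"})  # assumed registry URL
producer = SerializingProducer({
    "bootstrap.servers": "localhost:9092",                         # assumed brokers
    "value.serializer": AvroSerializer(registry, schema_str),      # registry-backed serialization
})

# Publish one event; the serializer validates it against the registered schema
producer.produce(
    topic="page_viewed",                                           # hypothetical topic name
    value={"user": "Alex", "platform": "Web", "product_id": "P1",
           "warehouse_id": "W1", "timestamp": "2023-08-29 10:15:00"},
)
producer.flush()
```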

Realtime Processing:

For real-time event processing and consumption, we've established an agile data ingestion pipeline that feeds directly into ClickHouse. After evaluating alternatives such as Druid and Pinot, we chose ClickHouse for its stellar performance and suitability to our needs. ClickHouse's columnar storage structure ensures rapid query responses, efficient data compression, and seamless integration with visualization tools, making it an optimal fit for our dynamic environment.

We've implemented a robust pipeline that ensures data generated at its source is available in ClickHouse, our real-time store, within seconds, typically under five seconds. This ultra-low latency is achieved through streamlined event ingestion powered by Kafka, a real-time event streaming platform. By integrating Kafka with the Confluent Schema Registry, we ensure uniform data interpretation across applications.
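As a hedged illustration of this path (topic, table, and column names are assumptions, and JSON-encoded events are assumed for brevity), a lightweight consumer could read from Kafka and insert micro-batches into ClickHouse as shown below; Avro events would instead be decoded with the registry's deserializer, and a ClickHouse Kafka table engine is another common way to wire this up.

```python
import json
from confluent_kafka import Consumer
import clickhouse_connect

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # assumed broker address
    "group.id": "clickhouse-ingest",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["page_viewed"])          # hypothetical topic name

ch = clickhouse_connect.get_client(host="localhost")  # assumed ClickHouse host

batch = []
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())          # assuming JSON payloads for brevity
    batch.append([event["user"], event["product_id"], event["timestamp"]])
    if len(batch) >= 1000:                   # flush small batches to keep latency low
        ch.insert("events_page_viewed", batch,
                  column_names=["user", "product_id", "timestamp"])
        batch.clear()
```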

Furthermore, ClickHouse’s data compression capabilities come to the fore in scenarios like tracking website conversions. As user interactions flow in real-time, ClickHouse optimizes storage and retrieval, enabling us to swiftly identify where conversion rates drop, helping us take immediate corrective actions to enhance user experiences and maximize conversions.

In essence, ClickHouse's prowess in real-time event processing, demonstrated through use cases spanning order insights, supply chain operations (including warehouse, mid-mile, and last-mile), traffic information, and conversion analysis, positions it as a cornerstone of our data architecture. Its columnar storage, coupled with its adeptness in optimizing data access and storage, propels our ability to derive timely, actionable insights from dynamic data sets, ultimately driving our success in a fast-paced business landscape.

Continuous ETL:

Executing continuous ETL operations has become a streamlined process through the utilization of Apache Airflow. Employing a microbatching strategy, we ensure that data transitions occur seamlessly, maintaining the integrity and consistency of our datasets.

For processing these microbatches, we leverage the power of Apache Spark on Amazon EMR, a scalable and efficient solution. To enable efficient incremental data processing and support real-time updates, we employ Hudi (Hadoop Upserts, Deletes, and Incrementals) as a key component in our pipeline.

The culmination of this process leads to the data finding its home within our robust Datalake, firmly established on Amazon S3. This storage solution ensures the reliability, scalability, and accessibility required to support our evolving data platform architecture, setting the stage for subsequent stages of data processing and analysis.
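The sketch below (paths, table names, and key fields are assumptions, and it presumes the Hudi bundle is available on the EMR cluster) shows what a single Airflow-triggered microbatch could look like: Spark reads the newly landed raw events and upserts them into a Bronze-tier Hudi table on S3.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("continuous-etl-microbatch")
         .getOrCreate())

# Read the latest microbatch of raw events (staging location is an assumption)
events = spark.read.json("s3://example-bucket/staging/page_viewed/2023-08-29/10/")

hudi_options = {
    "hoodie.table.name": "bronze_page_viewed",
    "hoodie.datasource.write.recordkey.field": "event_id",     # assumed unique event key
    "hoodie.datasource.write.precombine.field": "timestamp",   # latest record wins on upsert
    "hoodie.datasource.write.partitionpath.field": "date",
    "hoodie.datasource.write.operation": "upsert",
}

# Upsert the microbatch into the Bronze tier of the datalake on S3
(events.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-bucket/bronze/page_viewed/"))
```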

Complex Event Processing(CEP):

Employing Flink for Complex Event Processing, we enable real-time anomaly detection, pattern matching, and fraud detection. These functionalities will be explored in detail in an upcoming dedicated blog.

Batch ETL:

Our Batch ETL processes serve as a pivotal bridge between various data sources, including Google Sheets, and the outcomes of our continuous ETL operations. They efficiently gather data from these sources, transforming and aggregating it to enrich our gold tables with valuable insights, including aggregations and other transformations. These batch processes don’t just handle Google Sheets data; they also process data generated from continuous ETL operations. As a strategic step, the results of these Batch ETL operations are systematically pushed back to our resilient Datalake on Amazon S3.
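As a rough illustration (the DAG id, schedule, and helper functions are assumptions, using Airflow 2.x conventions), a daily batch DAG might pull the Google Sheet, combine it with the continuous-ETL outputs, and write the aggregated gold tables back to S3:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_google_sheet(**_):
    ...  # read the sheet via the Google Sheets API and stage it on S3 (not shown)

def build_gold_tables(**_):
    ...  # join staged sheet data with continuous-ETL outputs and write gold aggregates to S3

with DAG(
    dag_id="batch_etl_gold",            # hypothetical DAG name
    schedule="0 2 * * *",               # assumed daily schedule
    start_date=datetime(2023, 9, 1),
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_google_sheet",
                             python_callable=extract_google_sheet)
    build = PythonOperator(task_id="build_gold_tables",
                           python_callable=build_gold_tables)
    extract >> build
```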

Data Lake:

In the ever-evolving landscape of data management, organizations are constantly seeking efficient ways to harness the power of their data for informed decision-making. Data lakes have emerged as a game-changer, offering a scalable and flexible solution for storing, processing, and analyzing vast volumes of diverse data. Within the architecture of a data lake, data is categorized into distinct tiers, each serving a unique purpose in the data lifecycle. Here, we'll get into the intricacies of the Bronze, Silver, and Gold data tiers, exploring their roles and significance in the data management process.

Different Tiers of Data: (Bronze → Silver → Gold)

Bronze Tier: Raw Data Ingestion

At the foundation of the data lake lies the Bronze tier, where raw data is ingested in its purest form. This tier serves as the initial repository for all incoming data from various sources, such as Licious apps, backend services, external partners, and more. The primary objective of the Bronze tier is to ensure that no data is lost.

Key Features:

  1. Data Capture: The Bronze tier acts as a landing zone for all data, regardless of its structure or format. This includes structured, semi-structured, and unstructured data, ensuring that no valuable information is left behind.
  2. Data Cataloging: Metadata is attached to the ingested data, providing context and information about the source, timestamp, and initial quality of the data. This metadata aids in tracking the data lineage and understanding its origins.
  3. Minimal Processing: Data in the Bronze tier undergoes minimal processing. It is neither altered nor transformed, maintaining its fidelity for future use cases.

Imagine a user named Alex visits Licious and views two different product pages: Lamb Curry Cut 500gms on the website and Chicken Curry Cut 450gms on the mobile app. Each of these events, denoting the product viewed, is pushed to the event bus as raw data without any modifications.

Here's how the raw data for these events might look:

  1. Event 1 — Page Viewed (Web):

     Event Type: PAGE_VIEWED
     User: Alex
     Platform: Web
     Page Name: ProductPage
     ProductId: P1
     Warehouse Id: W1
     User Agent Info: Chrome 101.1, licious.in
     Timestamp: 2023-08-29 10:15:00

  2. Event 2 — Page Viewed (Mobile):

     Event Type: PAGE_VIEWED
     User: Alex
     Platform: Android
     Page Name: ProductPage
     ProductId: P2
     Warehouse Id: W1
     User Agent Info: Android 11
     Timestamp: 2023-08-29 10:20:00

Silver Tier: Consolidation with Cleansing

As data progresses through the lifecycle, it moves into the Silver tier, where data refinement begins. In the Silver tier, raw data from the Bronze tier is consolidated, cleansed, and transformed into a more structured format. This process is essential to enhance data quality and enable meaningful analysis.

Key Features:

  1. Data Cleansing: In this stage, data is cleansed of inconsistencies, errors, and duplicates. Data quality rules are applied to ensure that the data is accurate and reliable.
  2. Schema Design: Data is structured into well-defined schemas, making it more accessible and understandable for downstream applications. This schema design facilitates querying and analysis.
  3. Metadata Enhancement: Additional metadata is added to the data, providing insights into its context, transformations, and any changes made during the cleansing process.

Let's look at how the raw data from the Bronze layer can be refined in the Silver layer:

  1. Refined Data — Silver Layer (Event 1):

     User: Alex
     Action: Viewed
     Product Id: P1
     Product: Lamb Curry Cut 500gms
     Product Category: Mutton
     Warehouse Id: W1
     Warehouse Name: Indira Nagar
     Timestamp: 2023-08-29 10:15:00
     Hour: 10
     Date: 2023-08-29
     Year: 2023

  2. Refined Data — Silver Layer (Event 2):

     User: Alex
     Action: Viewed
     Product Id: P2
     Product: Chicken Curry Cut 450gms
     Product Category: Chicken
     Warehouse Id: W1
     Warehouse Name: Indira Nagar
     Timestamp: 2023-08-29 10:20:00
     Hour: 10
     Date: 2023-08-29
     Year: 2023
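In Spark terms, the Bronze-to-Silver refinement above could be sketched roughly as follows (table paths, dimension tables, and column names are assumptions):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bronze-to-silver").getOrCreate()

bronze = spark.read.format("hudi").load("s3://example-bucket/bronze/page_viewed/")
products = spark.read.format("hudi").load("s3://example-bucket/dims/products/")
warehouses = spark.read.format("hudi").load("s3://example-bucket/dims/warehouses/")

silver = (bronze
    .dropDuplicates(["event_id"])              # basic cleansing: drop duplicate events
    .join(products, "product_id", "left")      # enrich with product name and category
    .join(warehouses, "warehouse_id", "left")  # enrich with warehouse name
    .withColumn("hour", F.hour("timestamp"))   # derive date parts for partitioning/analysis
    .withColumn("date", F.to_date("timestamp"))
    .withColumn("year", F.year("timestamp")))

(silver.write.format("hudi")
    .option("hoodie.table.name", "silver_page_viewed")
    .option("hoodie.datasource.write.recordkey.field", "event_id")
    .option("hoodie.datasource.write.precombine.field", "timestamp")
    .mode("append")
    .save("s3://example-bucket/silver/page_viewed/"))
```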

Gold Tier: Aggregates with Different Dimensions

At the pinnacle of the data lake architecture lies the Gold tier, where data is transformed into valuable insights through aggregation and analysis. In this tier, data is refined even further, aggregated across various dimensions, and prepared for reporting, visualization, and advanced analytics.

Key Features:

  1. Data Aggregation: Data is aggregated to create higher-level summaries and key performance indicators (KPIs). Aggregation provides a consolidated view of trends and patterns within the data.
  2. Dimensional Analysis: Data is analyzed across different dimensions, such as time, geography, product, and customer segments. This enables deeper insights and more targeted decision-making.

Let's explore how the refined data from the Silver layer can be aggregated and analyzed in the Gold layer:

Aggregated Data — Gold Layer:

Total Page Views: 2
Total Users: 1
Total Views by Product Category:
  Mutton: 1 view
  Chicken: 1 view

Dimensional Analysis — Gold Layer:

  1. Product: Lamb Curry Cut 500gms
     Total Views: 1
     User Count: 1
     Date Viewed: 2023-08-29

  2. Product: Chicken Curry Cut 450gms
     Total Views: 1
     User Count: 1
     Date Viewed: 2023-08-29
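A Gold-tier aggregation like the one above could be expressed in Spark along these lines (paths and column names are assumptions; the aggregates are written here as plain Parquet for simplicity):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("silver-to-gold").getOrCreate()
silver = spark.read.format("hudi").load("s3://example-bucket/silver/page_viewed/")

# Daily views and distinct users per product, mirroring the dimensional analysis above
gold = (silver
    .groupBy("date", "product_id", "product", "product_category")
    .agg(F.count("*").alias("total_views"),
         F.countDistinct("user").alias("user_count")))

(gold.write.mode("overwrite")
    .partitionBy("date")
    .parquet("s3://example-bucket/gold/product_views_daily/"))
```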

API Layer:

The API layer enables users to retrieve and interact with the stored data. It provides a unified interface for querying data from both the real-time store and datalake, abstracting the underlying complexity.

APIs allow our Licious applications or services to programmatically access the processed data for further analysis and integration. These APIs provide a structured way for other systems to utilize the insights generated by our data platform architecture.
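As a hedged sketch of what such an endpoint might look like (framework choice, table, and query are illustrative; historical queries would be served from the datalake instead), a real-time metric could be exposed from ClickHouse like this:

```python
from fastapi import FastAPI
import clickhouse_connect

app = FastAPI()
ch = clickhouse_connect.get_client(host="localhost")   # assumed ClickHouse host

@app.get("/metrics/product-views-today")
def product_views_today():
    # Hypothetical table and columns; returns today's page views per product
    result = ch.query(
        "SELECT product_id, count() AS views "
        "FROM events_page_viewed "
        "WHERE toDate(timestamp) = today() "
        "GROUP BY product_id"
    )
    return [dict(zip(result.column_names, row)) for row in result.result_rows]
```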

Conclusion:

In summary, the article underscores the significance of adopting a transformative data platform to stay relevant in the data-driven landscape. The combination of a robust data lake and an event-driven architecture emerges as a crucial strategy for unlocking real-time insights from diverse data sources. This approach not only enhances data accessibility and scalability but also enables agile, data-driven decision-making.

However, this transformation journey is not without challenges. It demands careful navigation of technical complexities while prioritizing data security, governance, and expertise. Overcoming these hurdles requires substantial investment in advanced solutions, effective data management practices, and continuous skill development. In embracing this paradigm, organizations position themselves to thrive in an era where data insights and responsiveness are pivotal drivers of success.
