We have implemented microservices and event sourcing at Simpl. All our internal and external facing services emit structured events, and the information we get from third-party applications is also gathered as events. All these events reach our data pipeline, where we process and structure them for analytics and machine learning. This post is about the design considerations behind building a scalable data pipeline.

Let’s take an example: a typical e-commerce application. A user visits the website, adds products to the cart, checks out, pays, and so on. Every activity of the user is useful for analytics. Let’s call these events UserVisitedWebsite, ItemAddedToCart, CheckOutCart, PaymentInitiated, PaymentSucceeded, PaymentFailed, etc.

We might wanna know things like,

  • How many customers visited us today/this week/month?
  • How many orders are made today/this week/month? Are we doing more or less?
  • Which product sells the most (today/this week/month)?

Let’s take the simple case of tracking UserVisitedWebsite events and the analytics around them. We would store the flat list of events in our data lake and, to enable quick queries, create summary tables.
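As a sketch of that layout (the table and column names here are assumptions for illustration, not our actual schema), the flat event list and its summary table might look like:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Flat list of raw events, as landed in the data lake.
conn.execute("""
    CREATE TABLE events (
        event_id   TEXT PRIMARY KEY,
        event_type TEXT NOT NULL,
        event_date TEXT NOT NULL,   -- 'YYYY-MM-DD'
        user_id    TEXT NOT NULL
    )
""")

# Summary table: one row per day, pre-aggregated for quick queries.
conn.execute("""
    CREATE TABLE date_visits (
        date  TEXT PRIMARY KEY,
        value INTEGER NOT NULL
    )
""")
```

A query like `SELECT value FROM date_visits WHERE date = ?` then answers “how many visits that day?” without scanning the raw events.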

There are two ways we can update the summary table.

  • Stream: Listen to an event stream, typically implemented with AMQP, Kafka, Kinesis, etc., and process every event. Increment the count by one whenever the pipeline encounters a UserVisitedWebsite event.
for event in stream:
    if event.event_type == 'UserVisitedWebsite':
        update("date_visits.value", date_visits.value + 1)
  • Batch: Use the data stored in the data lake by running batch-processing code at periodic intervals, typically nightly, and upsert a record in the summary tables.
todays_visits = events.where(event_type: 'UserVisitedWebsite').and(event_date: today).unique
upsert_row("date_visits", today, todays_visits.count)
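To make the two paths concrete, here is a minimal runnable rendering of both over an in-memory event list (the field names and event shapes are assumed from the pseudocode above, not taken from our actual pipeline):

```python
from collections import Counter

# Sample events as they might arrive; shape assumed from the pseudocode above.
events = [
    {"event_type": "UserVisitedWebsite", "event_date": "2024-01-01", "user_id": "u1"},
    {"event_type": "ItemAddedToCart",    "event_date": "2024-01-01", "user_id": "u1"},
    {"event_type": "UserVisitedWebsite", "event_date": "2024-01-01", "user_id": "u2"},
]

# Stream path: increment a counter as each event is seen.
date_visits_stream = Counter()
for event in events:
    if event["event_type"] == "UserVisitedWebsite":
        date_visits_stream[event["event_date"]] += 1

# Batch path: recompute the day's unique visitors from the full
# event list and upsert the result (overwrite whatever was there).
def batch_upsert(events, summary, day):
    visitors = {e["user_id"] for e in events
                if e["event_type"] == "UserVisitedWebsite"
                and e["event_date"] == day}
    summary[day] = len(visitors)

date_visits_batch = {}
batch_upsert(events, date_visits_batch, "2024-01-01")
```

Note the batch version deduplicates by user (the `.unique` in the pseudocode), while the stream version counts raw events; on clean data they agree, but that gap is exactly where the accuracy problems below creep in.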

As you might notice, stream processing gives an updated count in realtime, is easier to implement, and is computationally cheaper.

However,

  • Accuracy: There’s a possibility of duplicate events. Remember, guaranteed delivery cannot be achieved without retries, and retries mean duplicates. With duplicate events, the count would be wrong.
  • Idempotency: Where there’s code, there are bugs, and so it is with data pipelines. It’s not uncommon to re-run pipeline code for the same date after fixing bugs. It’s not hard to notice that the realtime update code would not be reusable here, since it’s not idempotent.
  • Extensibility: Not every summary table can be designed in the first go. Based on usage, we would create new summary tables for frequently run queries. In this scenario, we would have to replay historical events through the realtime update code to backfill them.
  • Parallelism: It’s harder to get stream processing code to run in parallel than batch processing code.

The batch upsert code would not face any of the above issues, at the cost of a delay in getting summary table values.
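The idempotency point can be seen directly. In this toy sketch (not our pipeline code), replaying the same events through each path leaves the batch summary unchanged but doubles the streamed counter:

```python
events = [
    {"event_type": "UserVisitedWebsite", "event_date": "2024-01-01", "user_id": "u1"},
    {"event_type": "UserVisitedWebsite", "event_date": "2024-01-01", "user_id": "u2"},
]

def stream_update(summary, event):
    # Not idempotent: every delivery increments, so duplicates
    # and re-runs inflate the count.
    if event["event_type"] == "UserVisitedWebsite":
        summary[event["event_date"]] = summary.get(event["event_date"], 0) + 1

def batch_upsert(summary, events, day):
    # Idempotent: recomputes from the source data and overwrites,
    # so re-running it any number of times is safe.
    count = sum(1 for e in events
                if e["event_type"] == "UserVisitedWebsite"
                and e["event_date"] == day)
    summary[day] = count

stream_summary, batch_summary = {}, {}
for _ in range(2):                        # simulate a re-run / redelivery
    for e in events:
        stream_update(stream_summary, e)
    batch_upsert(batch_summary, events, "2024-01-01")

print(stream_summary["2024-01-01"])       # 4 -- double-counted
print(batch_summary["2024-01-01"])        # 2 -- still correct
```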

So while designing data pipelines,

  • Build for batch! It’s easier to get started. Reduce the delay by running the batch pipeline code at shorter intervals, if need be.
  • Stream processing is okay for dashboard stats & graphs, where trends are consumed rather than absolute accuracy.
  • Use data partitioning by date, functionality, or location to make batch processing faster.
  • If a value is required in realtime, build for it. But ensure it’s backed by batch update code for accuracy and idempotency. This is a proven pattern, called the Lambda architecture.
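A minimal sketch of that last pattern (the class and method names here are my own, not a framework API): the batch layer periodically overwrites the authoritative count, the speed layer accumulates increments since the last batch run, and the serving layer sums the two.

```python
class LambdaCounter:
    """Toy Lambda-architecture counter for one metric (e.g. daily visits)."""

    def __init__(self):
        self.batch_value = 0      # authoritative, recomputed from the data lake
        self.realtime_delta = 0   # events seen since the last batch run

    def on_event(self):
        # Speed layer: cheap and realtime, but possibly slightly off
        # (duplicates, re-runs, etc.).
        self.realtime_delta += 1

    def on_batch_run(self, recomputed_value):
        # Batch layer: the idempotent recompute replaces the speed
        # layer's running estimate.
        self.batch_value = recomputed_value
        self.realtime_delta = 0

    def current(self):
        # Serving layer: accurate-as-of-last-batch plus a realtime estimate.
        return self.batch_value + self.realtime_delta

counter = LambdaCounter()
counter.on_event()
counter.on_event()
counter.on_batch_run(recomputed_value=2)  # nightly batch confirms 2 visits
counter.on_event()
print(counter.current())                  # 3
```

Any streaming error accumulated during the day is bounded: the next batch run wipes the delta and restores an accurate baseline.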
