Modern systems are in a constant state of generating “events”: a piece of data generated at a point in time, representing an action.
- “Liking” a webpage article
- Having a field auto-complete on a form
- Page reporting a metric on user activity
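As a minimal sketch, such an event is just a small, timestamped record; the field names here are illustrative, not any standard schema:

```python
from dataclasses import dataclass
import time

@dataclass
class Event:
    """A piece of data generated at a point in time, representing an action."""
    event_type: str   # e.g. "article_liked"
    timestamp: float  # epoch seconds at which the action happened
    payload: dict     # action-specific data

# An "article liked" event as a web app might emit it
like = Event("article_liked", time.time(), {"article_id": "a42", "user_id": "u7"})
print(like.event_type)
```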
Of course, this isn’t limited to web apps: backend applications and systems generate events too, as do mobile and connected devices.
Engineering teams have deployed technologies such as Apache Kafka to “connect” this deluge of event data. The principle is that one event needs many actions. Perhaps not a new concept on the surface, but Kafka extends it fundamentally, as you’ll read. The value includes:
- Providing a platform to develop new data-intensive applications
- Accelerating the time to market of new services
- Empowering analytics and data science teams to tap into new and real time streams of data
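The “one event needs many actions” principle can be sketched in a few lines of plain Python. This is only an in-process toy – Kafka provides the same shape durably, at scale, and across separate systems:

```python
from collections import defaultdict

class EventBus:
    """Toy in-process sketch of the 'one event, many actions' principle."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        # One published event fans out to every subscribed action
        for handler in self.subscribers[topic]:
            handler(event)

bus = EventBus()
actions = []
bus.subscribe("likes", lambda e: actions.append(("store", e)))
bus.subscribe("likes", lambda e: actions.append(("notify", e)))
bus.subscribe("likes", lambda e: actions.append(("analytics", e)))
bus.publish("likes", {"article_id": "a42"})
print(len(actions))  # 3 independent actions from one event
```

Each subscriber is added without the publisher knowing or caring – the same decoupling Kafka gives you between producers and consumers.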
If you’re not familiar with how Kafka is used across an enterprise, here are some examples, described in a very abstract fashion.
Secure and reliable transportation of data
The most classic use case: connecting data generated by applications to data stores in a reliable, scalable and secure fashion.
While data is in transit, it may be valuable to “wrangle” it on the fly before delivery.
An example may be to take the lat/long coordinates of where the “like” originated and enrich them with the country name for that location.
Kafka provides a framework to do this on the data stream “in motion”, replacing traditional ETL jobs that often run as batch processes.
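As a sketch, such an enrichment step is just a pure function applied to each event as it flows past. The grid-cell lookup table below is a made-up stand-in for a real geocoding service or reference dataset:

```python
# Hypothetical country lookup keyed by a coarse (lat, long) grid cell;
# a real pipeline would consult a geocoding service or reference table.
COUNTRY_BY_CELL = {(51, 0): "United Kingdom", (48, 2): "France"}

def enrich(event):
    """Enrich a 'like' event with the country name for its coordinates."""
    cell = (int(event["lat"]), int(event["long"]))
    return {**event, "country": COUNTRY_BY_CELL.get(cell, "unknown")}

print(enrich({"article_id": "a42", "lat": 51.5, "long": 0.1}))
```

The original event is left untouched and a new, enriched copy travels downstream – a common pattern in stream processing.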
Analytics on data-in-motion
The value of data can fall off a cliff if not acted on within a few minutes of being generated.
For organisations, the time between the data being generated and the time it’s available in a data warehouse is traditionally several minutes at the very best, and often hours or days.
Analytics driven off real-time data streams allows teams to make decisions faster.
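The heart of such streaming analytics is aggregation over time windows. This toy function only shows the windowing idea – real stream processors such as Kafka Streams compute this incrementally, per key, with fault tolerance:

```python
from collections import Counter

def count_by_window(events, window_seconds=60):
    """Count events per fixed time window, keyed by the window's start time."""
    counts = Counter()
    for e in events:
        window_start = int(e["timestamp"] // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)

# Five events across three one-minute windows
events = [{"timestamp": t} for t in (5, 30, 61, 119, 125)]
print(count_by_window(events))  # {0: 2, 60: 2, 120: 1}
```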
Developing data science and new services
Product teams are now empowered to develop new services by subscribing to a data stream in a manner that is completely decoupled from the applications generating the data. Producers of the data need never be concerned about future consumers of it.
This facilitates the decomposition of applications into microservices and accelerates application delivery.
For example, having just “liked” an article, a separate service would match this event against a pre-calculated machine learning model to instantly propose other articles you may like. This is no good if done five minutes later, as you may have already left the site.
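A sketch of such a decoupled service – the similarity table here is a hypothetical pre-computed lookup, standing in for a real machine learning model:

```python
# Hypothetical pre-computed model: article -> articles readers also liked.
SIMILAR_ARTICLES = {"a42": ["a17", "a99"]}

def on_like(event):
    """A recommendation service reacting to a 'like' event the moment it arrives.
    The producer of the event knows nothing about this consumer."""
    return SIMILAR_ARTICLES.get(event["article_id"], [])

print(on_like({"article_id": "a42", "user_id": "u7"}))  # ['a17', 'a99']
```

Crucially, adding this service requires no change to the application that emits the “like” events.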
Event sourcing and CQRS
Event sourcing (and subsequently an architectural practice known as CQRS) allows engineers to fundamentally rethink how applications store and access data in databases.
This is a very broad subject. In summary: traditionally, applications tightly couple both read and write actions, and the data model for those actions, to a single database.
Solutions such as Kafka can be used as an event store – an immutable trace of all events. Applications may benefit from having different data models and different database instances depending on the functions they need to provide.
The state of each respective database is generated dynamically by analysing the historical events stored in the event store. The read and write actions can be scaled independently.
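A minimal sketch of deriving state by replaying an event log – the event shapes are invented for illustration:

```python
def apply(state, event):
    """Apply one immutable event to the current state."""
    count = state.get(event["article_id"], 0)
    if event["type"] == "liked":
        state[event["article_id"]] = count + 1
    elif event["type"] == "unliked":
        state[event["article_id"]] = count - 1
    return state

def rebuild(events):
    """Derive a read model dynamically by replaying the event store."""
    state = {}
    for e in events:
        state = apply(state, e)
    return state

log = [{"type": "liked", "article_id": "a42"},
       {"type": "liked", "article_id": "a42"},
       {"type": "unliked", "article_id": "a42"}]
print(rebuild(log))  # {'a42': 1}
```

Because the log itself is the source of truth, a differently-shaped read model can be produced at any time simply by replaying the same events with a different `apply` function.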
Using Kafka to build stateful applications
Following on from event sourcing and CQRS, one by-product of this practice is that, when using the Apache Kafka Streams framework, each instance of a distributed application stores in memory the state of a subset of a mini database locally, before it updates an external DB.
A distributed application can then query this local in-memory state rather than an external database. This brings compute, data and state together in order to build low-latency, data-intensive stateful applications.
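A toy sketch of the idea – real Kafka Streams state stores are fault-tolerant and backed by changelog topics; this only shows keyed events being routed to an instance that answers queries from its own in-memory store:

```python
class StreamInstance:
    """One instance of a distributed app: consumes its share of events and
    keeps a queryable local state store instead of hitting an external DB."""
    def __init__(self):
        self.store = {}

    def process(self, event):
        key = event["article_id"]
        self.store[key] = self.store.get(key, 0) + 1

    def query(self, key):
        return self.store.get(key, 0)

instances = [StreamInstance(), StreamInstance()]

def partition(key):
    """Deterministic key -> instance routing, like Kafka partitioning by key."""
    return instances[sum(key.encode()) % len(instances)]

for e in [{"article_id": "a42"}, {"article_id": "a42"}, {"article_id": "b7"}]:
    partition(e["article_id"]).process(e)

# Queries are served in memory by the instance that owns the key
print(partition("a42").query("a42"))  # 2
```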
In follow-up blogs, I’ll go into more detail about some of these use cases, with the value they bring and examples of organisations adopting them.
Follow The Data Difference for notifications of other blogs we publish. Follow @TheDataDiff