I’ve been working with Big Data for many years and have seen firsthand how real-time data volumes are growing and how use cases are becoming more and more sophisticated.
In my time at Splunk I saw customers go from thinking 100GB of data per day was a lot to handling tens, even hundreds, of terabytes per day. The more data they consumed, the more use cases they developed and the more value they derived. Recently, through events I’ve attended, meetings with big data leaders and my own research, I’m starting to see a change in thinking…
First of all, consider the amount of data now being generated by some of the largest data generators: a global ad network (perhaps 180TB per day), thousands of driverless cars (around 4TB per day per car), automated car production plants, or mobile phone networks (possibly exabytes, or even zettabytes?!).
Organisations are already generating more data than they can store and manage economically.
Secondly, consider an old discussion point: the value of data decreases over time.
So… if there’s too much data to store, and real-time action demands processing data as soon as an event has occurred, where could things be going?
Forget: transport the data, store it, and run batch processes. Think: process data on the fly, as soon as the event has occurred.
Publish all data on the wire onto a data highway, and allow teams across an organisation to tap into it, processing it close to the source when and where necessary. If the data is not consumed, it is simply dropped.
In this new way of working, we see the concept of “data campaigns”, where data is collected for a particular time and a particular project. For example, a car manufacturer might analyse data from vehicles driving over 60km/h in wet conditions. Such a campaign may only run for a few days or weeks before disconnecting from the data highway.
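To make the campaign idea concrete, here is a minimal sketch in Python of how a team might tap a shared stream, keep only events matching the campaign criteria, and drop everything else. The event fields (`ts`, `speed_kmh`, `road`) and the in-memory list standing in for the data highway are hypothetical, purely for illustration; a real deployment would sit on a streaming platform.

```python
def campaign_filter(events, start, end):
    """Yield only events matching the campaign criteria within its time window.

    Events outside the window or criteria are dropped, never stored —
    mirroring the 'consume it or lose it' model of the data highway.
    """
    for event in events:
        if not (start <= event["ts"] <= end):
            continue  # outside the campaign window: dropped
        if event["speed_kmh"] > 60 and event["road"] == "wet":
            yield event  # hand off for processing close to the source

# A small in-memory stream standing in for the data highway tap.
stream = [
    {"ts": 100, "speed_kmh": 72, "road": "wet"},
    {"ts": 110, "speed_kmh": 55, "road": "wet"},  # too slow: dropped
    {"ts": 120, "speed_kmh": 90, "road": "dry"},  # dry road: dropped
    {"ts": 999, "speed_kmh": 80, "road": "wet"},  # after the campaign ends
]
matched = list(campaign_filter(stream, start=0, end=500))
print(len(matched))  # → 1
```

Because the filter is a generator, nothing is buffered: the campaign simply stops iterating (disconnects) when it ends, and unconsumed events were never retained.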
Of course, such a dynamic environment and architecture brings its own challenges, notably data operations (or DataOps), perhaps a subject for another blog?