I’ve been thinking about data a lot recently. Nothing unusual there - it’s my job after all, and given the way data is growing at the moment, the term ‘big data’ is a regular on the daily buzzword bingo card. What’s really been making me think, however, is not so much the quantities of data, but where it all is and how we can access it.
The volumes of data stored on online media sites such as YouTube and Vimeo, the blogs, content and status updates across social networks, and the events and transactional traffic generated by advertisers and devices are together creating an online pool that dwarfs the volumes stored inside corporate data centres.
Such resources present a rich seam that is increasingly available to be mined, as examples such as Amazon providing access to “open datasets” via EC2 illustrate. Add to this the vast pools of governmental data - across transport and other public services - being opened up through initiatives led by the likes of Sir Tim Berners-Lee and Professor Nigel Shadbolt, and the opportunity becomes overwhelming.
As the shutters rise on increasing pools of data, an emerging challenge is either getting to the data where it is, or getting the data to where it can be processed and analysed. While Moore’s Law has done a great job of increasing processing power while reducing its cost (which is why we have so much data in the first place), networking bandwidth has not increased at the same rate.
While in the future we might get some kind of quantum networking, in the short term at least such bottlenecks are here to stay. Data transfer is expensive, meaning that our data mountains tend to stay where they are.
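To put some rough numbers on that bottleneck, here is a back-of-envelope sketch in Python. The dataset size, link speed and efficiency figures are illustrative assumptions of mine, not measurements.

```python
# Rough back-of-envelope sketch: how long it takes to move a large dataset
# over a typical wide-area link. The figures below are illustrative
# assumptions, not measurements.

def transfer_days(dataset_terabytes: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Days needed to move a dataset of the given size over a link of the
    given nominal bandwidth, assuming only a fraction of that bandwidth is
    actually usable end to end."""
    bits = dataset_terabytes * 1e12 * 8          # dataset size in bits
    usable_bps = link_gbps * 1e9 * efficiency    # effective throughput in bits/second
    return bits / usable_bps / 86_400            # 86,400 seconds in a day

# Example: a 500 TB archive over a nominal 1 Gbit/s connection
print(f"{transfer_days(500, 1):.1f} days")       # roughly 66 days
```

Two months to move a single archive is not a corner case at these scales, which is why shipping the data to the processing is so often a non-starter.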
What can be done? Of course, there is more to data than simply raw sets of zeroes and ones. Data can be aggregated, pre-processed, meta-tagged, connected and referenced. Consider the simple example of mapping Britain’s bus services - while bus data might need to be accessed in real time, nobody is suggesting the same for the map.
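As a purely illustrative sketch of that distinction, the snippet below reduces a made-up stream of raw bus position events to a tiny per-route summary; only the summary needs to travel, while the map it is eventually drawn onto stays put. The field names and figures are hypothetical.

```python
# Minimal sketch of the 'pre-process near the data' idea: reduce a raw
# stream of bus position events to a small per-route summary, and ship
# only that summary for analysis alongside a static map held elsewhere.
# Field names and values here are hypothetical, for illustration only.

from collections import defaultdict

def summarise_positions(events):
    """Aggregate raw position events (route, lat, lon, timestamp) into
    a count of observations per route - a tiny fraction of the raw data."""
    counts = defaultdict(int)
    for event in events:
        counts[event["route"]] += 1
    return dict(counts)

raw_events = [
    {"route": "73", "lat": 51.55, "lon": -0.07, "timestamp": 1334572800},
    {"route": "73", "lat": 51.56, "lon": -0.08, "timestamp": 1334572860},
    {"route": "38", "lat": 51.52, "lon": -0.06, "timestamp": 1334572830},
]

# Only this summary needs to travel; the map it is drawn onto never moves.
print(summarise_positions(raw_events))   # {'73': 2, '38': 1}
```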
Many have commented that the role of the data scientist will become increasingly important in years to come. I believe we will see another role emerge - that of the ‘data orchestrator’, someone who works out how to connect these disparate data sources without shifting the base data around.
The role is architectural - while OpenStack, CloudStack and so on will be the tools of the trade, the physical nature of today’s enterprise and wide area networks, data centres and hosting facilities also plays a part. For example, if the task is to link open transport data with social network data, where better to start than in a colocation centre that hosts both?
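To make the orchestrator’s job a little more concrete, here is a toy sketch of a catalogue that records where each dataset lives, so analysis can be pushed to the data rather than the data being copied out. Every name, location and endpoint in it is invented for illustration.

```python
# Toy sketch of a dataset catalogue: record where each dataset lives and
# how to reach it, so work can be routed to the data rather than the data
# being copied to the work. All names and endpoints are invented.

from dataclasses import dataclass

@dataclass
class DatasetEntry:
    name: str          # logical name analysts refer to
    location: str      # data centre or hosting facility holding the data
    endpoint: str      # where a query or job can be submitted

class Catalogue:
    def __init__(self):
        self._entries = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def locate(self, name: str) -> DatasetEntry:
        """Return the entry describing where a dataset lives, so the
        orchestrator can push processing to that location."""
        return self._entries[name]

catalogue = Catalogue()
catalogue.register(DatasetEntry("bus-positions", "colo-london-1", "https://example.org/bus/query"))
catalogue.register(DatasetEntry("social-checkins", "colo-london-1", "https://example.org/social/query"))

# Both datasets sit in the same facility, so a joint analysis can run there
# without either base dataset being moved.
print(catalogue.locate("bus-positions").location)
```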
Much work remains to be done. I (and many others) have been discussing the concept of federated meta-repositories, for example, which enable distributed data sources to be viewed as a whole rather than having to start from scratch every time. One thing is for sure, however: for the time being, in large part, our data sources are likely to stay where they are. It is up to all of us to create the tools and develop the skills to respond accordingly.