I’ve been thinking about data a lot recently. Nothing unusual there - it’s my job after all, and given the way data is growing at the moment, the term ‘big data’ is a regular on the daily buzzword bingo card. What’s really been making me think, however, is not so much the quantities of data, but where it all is and how we can access it.
The volumes of data stored on online media sites such as YouTube and Vimeo, the blogs, content and status updates across social networks, and the events and transactional traffic generated by advertisers and devices are together creating an online pool that dwarfs the volumes stored inside corporate data centres.
Such resources present a rich seam that is increasingly available to be mined, as examples such as Amazon providing access to “open datasets” via EC2 illustrate. Add to this the vast pools of governmental data - across transport and other public services - being opened up through initiatives led by the likes of Sir Tim Berners-Lee and Professor Nigel Shadbolt, and the opportunity becomes overwhelming.
As the shutters rise on increasing pools of data, an emerging challenge is either getting to the data where it is, or getting the data to where it can be processed and analysed. While Moore’s Law has done a great job of increasing processing power while reducing its cost (which is why we have so much data in the first place), networking bandwidth has not increased at the same rate.
While in the future we might get some kind of quantum networking, in the short term at least such bottlenecks are here to stay. Data transfer is expensive, meaning that our data mountains tend to stay where they are.
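To put some rough numbers on that bottleneck, here is a back-of-envelope sketch in Python. The dataset size, link speed and efficiency figures are illustrative assumptions of mine, not measurements.

```python
# Rough back-of-envelope sketch: how long it takes to move a large dataset
# over a typical wide-area link. The figures below are illustrative
# assumptions, not measurements.

def transfer_days(dataset_terabytes: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Days needed to move a dataset of the given size over a link of the
    given nominal bandwidth, assuming only a fraction of that bandwidth is
    actually usable end to end."""
    bits = dataset_terabytes * 1e12 * 8          # dataset size in bits
    usable_bps = link_gbps * 1e9 * efficiency    # effective throughput in bits/second
    return bits / usable_bps / 86_400            # 86,400 seconds in a day

# Example: a 500 TB archive over a nominal 1 Gbit/s connection
print(f"{transfer_days(500, 1):.1f} days")       # roughly 66 days
```

Two months to move a single archive is not a corner case at these scales, which is why shipping the data to the processing is so often a non-starter.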
What can be done? Of course, there is more to data than simply raw sets of zeroes and ones. Data can be aggregated, pre-processed, meta-tagged, connected and referenced. Consider the simple example of mapping Britain’s bus services - while bus data might need to be accessed in real time, nobody is suggesting the same for the map.
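As a purely illustrative sketch of that distinction, the snippet below reduces a made-up stream of raw bus position events to a tiny per-route summary; only the summary needs to travel, while the map it is eventually drawn onto stays put. The field names and figures are hypothetical.

```python
# Minimal sketch of the 'pre-process near the data' idea: reduce a raw
# stream of bus position events to a small per-route summary, and ship
# only that summary for analysis alongside a static map held elsewhere.
# Field names and values here are hypothetical, for illustration only.

from collections import defaultdict

def summarise_positions(events):
    """Aggregate raw position events (route, lat, lon, timestamp) into
    a count of observations per route - a tiny fraction of the raw data."""
    counts = defaultdict(int)
    for event in events:
        counts[event["route"]] += 1
    return dict(counts)

raw_events = [
    {"route": "73", "lat": 51.55, "lon": -0.07, "timestamp": 1334572800},
    {"route": "73", "lat": 51.56, "lon": -0.08, "timestamp": 1334572860},
    {"route": "38", "lat": 51.52, "lon": -0.06, "timestamp": 1334572830},
]

# Only this summary needs to travel; the map it is drawn onto never moves.
print(summarise_positions(raw_events))   # {'73': 2, '38': 1}
```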
Many have commented that the role of the data scientist will become increasingly important in years to come. I believe we will see another role emerge - that of the ‘data orchestrator’, someone who works out how to connect these disparate data sources without shifting the base data around.
The role is architectural - while OpenStack, CloudStack and so on will be the tools of the trade, the physical nature of today’s enterprise and wide area networks, data centres and hosting facilities also plays a part. For example, if the task is to link open transport data with social network data, where better to start than in a colocation centre that hosts both?
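To make the orchestrator’s job a little more concrete, here is a toy sketch of a catalogue that records where each dataset lives, so analysis can be pushed to the data rather than the data being copied out. Every name, location and endpoint in it is invented for illustration.

```python
# Toy sketch of a dataset catalogue: record where each dataset lives and
# how to reach it, so work can be routed to the data rather than the data
# being copied to the work. All names and endpoints are invented.

from dataclasses import dataclass

@dataclass
class DatasetEntry:
    name: str          # logical name analysts refer to
    location: str      # data centre or hosting facility holding the data
    endpoint: str      # where a query or job can be submitted

class Catalogue:
    def __init__(self):
        self._entries = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def locate(self, name: str) -> DatasetEntry:
        """Return the entry describing where a dataset lives, so the
        orchestrator can push processing to that location."""
        return self._entries[name]

catalogue = Catalogue()
catalogue.register(DatasetEntry("bus-positions", "colo-london-1", "https://example.org/bus/query"))
catalogue.register(DatasetEntry("social-checkins", "colo-london-1", "https://example.org/social/query"))

# Both datasets sit in the same facility, so a joint analysis can run there
# without either base dataset being moved.
print(catalogue.locate("bus-positions").location)
```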
Much work remains to be done. I (and many others) have been discussing the concept of federated meta-repositories, for example, which enable distributed data sources to be viewed as a whole rather than having to start from scratch every time. One thing is for sure, however: for the time being, in large part, our data sources are likely to stay where they are. It is up to all of us to create the tools and develop the skills to respond accordingly.