With the price of commodity hardware continuing to fall, it still surprises me when I talk to organizations that offload their historical data to tape, where it sits offline and inaccessible. I can only guess that this is driven by a mix of factors: either no enforced compliance regulation requires the data to be online and available for query, or it is a legacy process that has not evolved because budget dollars are directed at the production analytic database systems that meet ongoing business needs. Most large enterprises today are feeling the pain of too much data. On one side, business users demand access to broader and more detailed data for better decision-making; on the other, regulatory requirements dictate that data be kept online for specific timeframes. The IT team caught in the middle must satisfy the business user while balancing infrastructure costs so that the organization stays in business and remains competitive.
Some industry sectors have no option but to keep their data assets online for multiple years. The most obvious is banking and financial services, where data must be WORM (write once, read many) compliant under the Dodd-Frank reform bill, passed a few years ago to bring transparency and protection to trades and transactions. The cost of doing business has now increased to the point where IT executives are forced to drive at least a third of the cost out of infrastructure in order to maintain the margins they have become accustomed to over the years.
I recently read an article on this exact challenge of keeping financial Big Data online and available. In it, the NYSE executive responsible for Big Data is interviewed and states: “… when you’re talking about billions of transactions per day, building systems that can take this unfriendly data and turn it into regulation-friendly, analysis-ready information is a key, ongoing struggle.” He goes on to say: “There’s not one system out there that could actually store that data and have it online. Besides, it wouldn’t be practical. It’s old, old data, it’s just used for regulatory needs and then maybe trending over time details.”
Yes, it certainly is a challenge to ingest billions of transactions daily and make them available for ongoing query. And once the data reaches a certain age, it is accessed less frequently and probably by a smaller group of users. However, the point about practicality is more likely a reference to traditional database approaches. Petabyte scale is extremely challenging with a row-oriented or even a columnar database, and some organizations managing data of this magnitude resort to flat-file farms rather than offline tape so that they can meet the compliance regulations. Neither “low-tech” approach solves the problem.
This is where Hadoop comes in: it is well suited to managing large data estates at scale and at low cost. For fast query against hundreds of terabytes to petabytes of historical data, standard SQL is the best method. It fits the infrequent query access that a compliance officer might require against financial data consisting predominantly of trades and transactions. To have SQL access to data on HDFS, you need a database that is architected to handle big data easily, that performs well on HDFS, and that meets the enterprise requirements of security and data availability.
For an enterprise looking to get its feet wet with Hadoop, using it as the dedicated back-end archive is an excellent way to go. If SQL is not a key requirement, you can still query the data using Hive or MapReduce, as long as batch response times are acceptable. In many cases the historical data archive is an environment that, once set up, should simply run and ingest records from the source database systems at the frequency agreed with the business. There shouldn’t be any specific requirement for heavy data transformations or for a high number of concurrent, rapid-response queries. As Hadoop gains in maturity and reliability, organizations that have rolled it out as an archive will also have gained skilled staff and experience in managing the cluster. My prediction is that this back-end archive use case will be the most popular across large enterprises in the coming year. Offline tape is simply not acceptable for banking and financial services, and Hadoop delivers on the promise of continuous online access at low cost.