Cloud computing helps organizations store, manage, share, and analyze their Big Data in an affordable and easy-to-use way. Today’s cloud Infrastructure-as-a-Service (IaaS) providers, such as Microsoft, GoGrid, Amazon, Google, Rackspace, and Slicehost, supported by on-demand analytics solution vendors, make Big Data analytics very affordable.
Most corporate enterprises don’t fully leverage their data. Data is usually locked in multiple databases and processing systems throughout the enterprise, yet an aggregate view of all that data is sometimes needed to answer tough questions from customers or analysts. Most importantly, by analyzing their Big Data for trends, statistics, and other actionable information, companies can uncover the insights they need to decide on their next move and grow their business.

Take, for example, Google, whose success can be attributed primarily to its ability to analyze large amounts of data. In fact, Google developed a software framework called MapReduce to support processing of large distributed data sets on clusters of computers. MapReduce has the advantage of being able to process both structured and unstructured data. A paper published by Google engineers, “MapReduce: Simplified Data Processing on Large Clusters,” clearly describes how MapReduce works. As a result of this paper, many open source implementations of MapReduce have emerged since 2004. Among the open source projects that leverage MapReduce is Hadoop, an infrastructure that supports the construction of reliable, scalable, distributed systems.
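The model the paper describes can be sketched in ordinary Python: a map function emits key/value pairs, the framework groups the pairs by key, and a reduce function folds each group into a result. Here is a single-process toy version of word count, the paper’s canonical example (the function names are illustrative, not Hadoop’s API):

```python
from collections import defaultdict

def map_phase(document):
    # Emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Group intermediate values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Fold the grouped values for one key into a single result.
    return key, sum(values)

docs = ["big data big clusters", "data moves to the cloud"]
pairs = [p for doc in docs for p in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["big"], counts["data"])  # → 2 2
```

In the real framework, the map and reduce calls run in parallel across many machines and the shuffle moves data over the network; the division of labor, however, is exactly this one.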
So how do Hadoop, MapReduce, and cloud computing come together to solve the Big Data problem?
Hadoop consists of two key services: reliable data storage using the Hadoop Distributed File System (HDFS) and high-performance parallel data processing using MapReduce. Think of MapReduce as the engine that brings speed and agility to the Hadoop platform. With MapReduce, developers can create programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers. Hadoop allows enterprises to easily explore this complex data using custom analyses tailored to their information and questions.
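MapReduce jobs for Hadoop are typically written in Java, but Hadoop Streaming lets any program that reads standard input and writes standard output act as the mapper or reducer. Below is a sketch of a Streaming-style word count in Python, with the framework’s sort-and-shuffle step simulated by a local `sorted()` call (on a real cluster, Hadoop performs that step between the two stages):

```python
from itertools import groupby

def mapper(lines):
    # Map stage: emit "word<TAB>1" per word, the Hadoop Streaming convention.
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(lines):
    # Reduce stage: input arrives sorted by key; sum the counts per word.
    parsed = (line.split("\t") for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

sample = ["big data big clusters", "data in the cloud"]
intermediate = sorted(mapper(sample))  # stands in for Hadoop's shuffle/sort
for line in reducer(intermediate):
    print(line)
```

On a real cluster, the same two functions would live in small scripts handed to the streaming jar (along the lines of `hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py`), with HDFS supplying the input splits.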
Hadoop runs on a collection of commodity, shared-nothing servers. Hadoop is self-healing: you can add or remove servers in a Hadoop cluster at will, and the system detects and compensates for hardware or system problems on any server. It can deliver data, and run large-scale, high-performance processing jobs, in spite of system changes or failures.
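The compensation idea can be illustrated with a small, purely hypothetical sketch (this is not Hadoop’s actual scheduler, and names like `FAILED_NODES` are invented for the illustration): a task attempt that dies on one node is simply retried on the next healthy one.

```python
FAILED_NODES = {"node-1"}  # hypothetical: nodes that have dropped out of the cluster

def run_task(task, node):
    # Simulated task attempt: raises if the chosen node has failed.
    if node in FAILED_NODES:
        raise RuntimeError(f"lost contact with {node}")
    return f"{task} completed on {node}"

def schedule(task, nodes):
    # Retry the task on successive healthy nodes, the way Hadoop
    # reassigns work when a worker stops responding.
    for node in nodes:
        try:
            return run_task(task, node)
        except RuntimeError:
            continue  # skip the failed node and try the next one
    raise RuntimeError("no healthy nodes available")

print(schedule("map-0007", ["node-1", "node-2", "node-3"]))
# → map-0007 completed on node-2
```

Because every input block is replicated across several nodes in HDFS, a retried task can usually read its data locally on the new node rather than waiting for the failed one to recover.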
Many tools are built with Hadoop as their foundation, including open source support tools such as Thrift and Clojure and dozens of commercial solutions such as Appistry, Cloudera, Goto Metrics, Karmasphere, and Talend. Also, the three major database vendors, IBM, Microsoft, and Oracle, all support Hadoop interaction in different ways.
However, the Big Data problem is not just about the size of the data; it is also about performance and how fast the data can be processed.
Businesses also have another path to Big Data analytics: the cloud. Cloud services for Big Data are popping up, offering platforms and tools to perform analytics quickly and efficiently.
Take, for example, a cloud computing platform like Amazon EC2: you can rent virtual Linux servers and then install open source Hadoop on them to establish your own cloud computing framework.
Amazon EC2 plays the role of IaaS, providing users with virtualized hosts. IaaS is the leasing of computing infrastructure as a service, under specific quality-of-service constraints, capable of running particular operating systems and software. Platform-as-a-Service (PaaS) focuses instead on the software framework or services that expose cloud computing APIs on top of that infrastructure. Hadoop acts as the PaaS layer, built on the virtualized hosts to form the cloud computing platform. Hadoop is not restricted to vendor-hosted VMs, however; you can also deploy it on an ordinary Linux OS running on physical machines.
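Whichever way the hosts are provisioned, pointing a Hadoop 1.x single-node deployment at its file system takes only a small configuration fragment. A sketch of `core-site.xml`, where the host and port are placeholder values:

```xml
<!-- core-site.xml: tells Hadoop clients and daemons where HDFS lives.
     hdfs://localhost:9000 is an example value for a single-node setup. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

The same fragment works whether the host is an EC2 instance or a physical Linux box, which is precisely why Hadoop moves so easily between on-premises clusters and the cloud.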
In conclusion, as costs fall and companies think of new ways to correlate and analyze data, Big Data analytics will become more common. Small businesses will especially benefit, given their new-found ability to manage and analyze Big Data at low cost. Recall that Google and Facebook were both once small companies that leveraged their data to grow significantly. No wonder many of the foundations of Big Data came from the methods these businesses developed.