Hadoop Best Practice: Facebook

Hadoop is a free, Java-based framework for distributed applications and intensive data management. It enables applications to work across thousands of nodes and petabytes of data.

With Hadoop, companies discover and put into practice new techniques for analyzing and retrieving data, techniques previously impossible to implement for reasons of performance, cost, and technology. As a result, Hadoop is gaining popularity as an option for storing, processing, and analyzing large volumes of raw, semi-structured, or unstructured data from the most disparate sources.

Facebook's more than one billion users generate about 2.5 billion updates and roughly 300 million images and links, which means the social network processes more than 500 terabytes of content daily.

Facebook uses several open source solutions in its data warehouse – Apache Hadoop, Apache Hive, Apache HBase, Apache Thrift, and the in-house Facebook Scribe tool – to manage this vast amount of data.

At Facebook, Hadoop is used in three different ways: as a data warehouse for web analytics, as distributed storage for databases, and as a backup target for its MySQL database servers.
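The web-analytics role rests on Hadoop's MapReduce model: a map phase that emits key/value pairs from raw input, and a reduce phase that aggregates them per key. As a rough illustration only – the log lines and field names here are invented, and this is plain Python rather than Facebook's actual Hadoop pipeline – a minimal map/reduce word count looks like this:

```python
from collections import defaultdict

# Minimal sketch of the MapReduce model Hadoop implements.
# The hypothetical log lines below stand in for web-analytics input.

def map_phase(lines):
    """Emit (word, 1) pairs, like a Hadoop Mapper."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Sum the counts per key, like a Hadoop Reducer."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

logs = ["user liked photo", "user shared link", "user liked post"]
counts = reduce_phase(map_phase(logs))
print(counts)  # per-word totals across all log lines
```

In a real Hadoop job the map and reduce functions run in parallel across the cluster, with the framework handling the shuffle of intermediate pairs between them.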

Facebook probably maintains the largest Hadoop cluster in the world, ingesting over 105 terabytes every 30 minutes, including data related to the 2.7 billion Likes and 2.5 billion content items shared per day by more than a billion users. In addition, Hadoop serves thousands of simultaneous requests and supports data mining, social analysis, and resource management.

The main strength of Hadoop lies in its scalability. Hadoop allows Facebook to exploit its existing hardware. The framework supports processing of all types of data – structured, semi-structured, or unstructured – and its scalability lets Facebook developers extend it with more specialized functionality for a wide range of applications.

Hadoop does not replace existing systems. Instead, Hadoop increases their power by taking on additional processing of large data volumes so that existing systems can focus on what they do best. Facebook’s Prism project plays an important role in harnessing the potential of Hadoop in a hybrid environment, exploiting the unique benefits of each technology and maximizing the performance of the environment as a whole.

One of Hadoop's major limitations is that, for a cluster to work, its servers traditionally have to sit close to one another. Facebook is taking Hadoop to the next level with its Prism project, which runs a Hadoop cluster across multiple data centers around the world, automatically replicating and moving data to where it is needed on a vast network of computers.
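Running one logical cluster across data centers implies deciding where each block of data lives and which replica a given job should read. The toy sketch below is purely illustrative – the data-center names and latency figures are invented, and this is not Prism's actual placement logic – but it conveys the idea of routing reads to the nearest replica:

```python
# Toy illustration of cross-data-center replica selection.
# All data-center names and latencies are hypothetical.
LATENCY_MS = {
    ("us-east", "us-east"): 1,
    ("us-east", "us-west"): 60,
    ("us-east", "eu"): 90,
    ("us-west", "us-east"): 60,
    ("us-west", "us-west"): 1,
    ("us-west", "eu"): 140,
}

def nearest_replica(client_dc, replica_dcs):
    """Return the replica data center with the lowest latency to the client."""
    return min(replica_dcs, key=lambda dc: LATENCY_MS[(client_dc, dc)])

# A client in us-east reading a block replicated to us-west and eu:
print(nearest_replica("us-east", ["us-west", "eu"]))  # → us-west
```

A production system would also weigh replica freshness, link bandwidth, and load, but latency-aware routing captures the core benefit Prism is after: data is served from wherever it is closest to the computation.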
