Google, for its part, has decided that MapReduce, the big data analysis tool the company created several years ago, is no longer an appropriate technology for processing large volumes of data in real time. At the Google I/O conference, company representatives introduced Cloud Dataflow, a new service that helps companies analyze large volumes of data in real time in the cloud.
Google Cloud Dataflow processes cloud data in batch mode or in real time, which allows you to integrate complex or large-scale data streams into real-time analysis applications. While the service complements the analytical and data processing tools of the Google Cloud Platform, which include BigQuery in particular, it is also a response to a comparable service from AWS called Kinesis. Amazon Kinesis aims to break down the barriers to processing large volumes of complex data in real time in the cloud and to feed the resulting streams into applications.
Cloud Dataflow is a successor to MapReduce and is based on Google's internal technologies such as Flume and MillWheel. The new project, which Google hosts on its own servers, can be considered the natural evolution of MapReduce. Google said in a blog post that Cloud Dataflow makes it easy to get actionable insights from your data while lowering operational costs, without the hassle of deploying, maintaining, or scaling infrastructure. You can use Cloud Dataflow for use cases such as ETL, batch data processing, and streaming analytics, and it will automatically optimize, deploy, and manage the code and resources required.
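To make the ETL use case mentioned above more concrete, here is a minimal sketch in plain Python. It does not use the Cloud Dataflow SDK; the stage names (`extract`, `transform`, `load`, `run_pipeline`) and the "user,amount" record format are assumptions chosen purely for illustration. It only shows the general idea of chaining processing stages so that the same code can consume a stored data set or an incoming iterator of records.

```python
def extract(lines):
    """Parse hypothetical 'user,amount' text records, skipping malformed ones."""
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) != 2:
            continue
        user, amount = parts
        try:
            yield user, float(amount)
        except ValueError:
            continue


def transform(records):
    """Normalize user names to lowercase and drop non-positive amounts."""
    for user, amount in records:
        if amount > 0:
            yield user.strip().lower(), amount


def load(records):
    """Sum amounts per user into a dict, standing in for a warehouse table."""
    totals = {}
    for user, amount in records:
        totals[user] = totals.get(user, 0.0) + amount
    return totals


def run_pipeline(lines):
    # Stages are lazy generators, so records flow through one at a time.
    return load(transform(extract(lines)))


batch = ["Alice,10.5", "bob,4", "broken line", "ALICE,2.5", "bob,-1"]
print(run_pipeline(batch))        # bounded input: a list
print(run_pipeline(iter(batch)))  # same stages fed from an iterator
```

Note that the final aggregation in this sketch only returns once its input ends. A real streaming system has to group an unbounded stream into windows before aggregating, which is one of the tasks a managed service is meant to handle for you.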
Dataflow aims to take the demanding job of deploying large-scale data analysis solutions off companies' hands so that they can focus on their own business. Google intends to continue in this direction, and over the next year the company plans to offer additional tools and services to simplify the development and monitoring of these operations.
The first Cloud Dataflow SDK is for Java, and the service also provides a dashboard that lets you monitor a pipeline directly from the developer console. At present, it can be used to analyze in real time what is being said about a product on social networks, or to monitor transaction logs for anomalous activity that could indicate a security incident. The service can also be used as an alternative to on-premises extract, transform, and load (ETL) systems that prepare data for processing by BI systems.
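As an illustration of the transaction log use case, the following plain Python sketch groups failed transactions into fixed time windows and flags accounts that exceed a threshold. It is not Cloud Dataflow code: the function `flag_anomalies`, the event tuple layout, the 60-second window, and the threshold of 3 are all hypothetical values picked for the example.

```python
from collections import Counter

WINDOW_SECONDS = 60  # assumed window size
THRESHOLD = 3        # assumed: more than 3 failures in a window is suspicious


def flag_anomalies(events, window_seconds=WINDOW_SECONDS, threshold=THRESHOLD):
    """Return sorted (window_start, account) pairs with more than `threshold` failures.

    `events` is an iterable of (timestamp_seconds, account, status) tuples.
    """
    failures = Counter()
    for timestamp, account, status in events:
        if status != "FAILED":
            continue
        # Assign the event to the fixed window that contains its timestamp.
        window_start = timestamp - (timestamp % window_seconds)
        failures[(window_start, account)] += 1
    return sorted(key for key, count in failures.items() if count > threshold)


sample_log = [
    (0, "alice", "OK"),
    (5, "bob", "FAILED"),
    (10, "bob", "FAILED"),
    (20, "bob", "FAILED"),
    (30, "bob", "FAILED"),
    (65, "bob", "FAILED"),
    (70, "alice", "FAILED"),
]

print(flag_anomalies(sample_log))  # [(0, 'bob')]
```

In the sample log, the account "bob" has four failures in the window starting at second 0, so only that window is flagged. A production pipeline would also have to deal with late or out-of-order events, which this sketch ignores.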
Others have built similar tools. Twitter, for example, has created an open source tool called Summingbird. But Cloud Dataflow is a bit different, since Google offers it only as a cloud service that anyone can access over the Internet. Google is sharing its infrastructure with the world at large through cloud services such as Google Compute Engine and Google App Engine, which allow companies and independent developers to build and run large software applications on Google's internal infrastructure.