IBM is committing substantial resources to the Apache Software Foundation's Spark project to prepare the platform for machine learning tasks such as pattern recognition and object classification. The company plans to offer Spark as a service on Bluemix and has dedicated 3,500 researchers and developers to its maintenance and further development.
The Spark framework was developed in 2009 at the AMPLab of the University of California, Berkeley, was open-sourced a year later, and subsequently became an Apache project. The framework, which runs on a server cluster, can process data up to 100 times faster than Hadoop MapReduce for certain in-memory workloads. As data and analytics become embedded in business and society, from applications to the Internet of Things (IoT), Spark provides essential advances in large-scale data processing.
First, it significantly improves the performance of data-dependent applications. Second, it radically simplifies the development of intelligent applications that are driven by data. To accelerate innovation in the Spark ecosystem, IBM has decided to build Spark into its own predictive analytics and machine learning platforms.
IBM Watson Health Cloud will use Spark to serve healthcare providers and researchers as they gain access to new population health data. At the same time, IBM will release its SystemML machine learning technology as open source. IBM is also collaborating with Databricks to advance Spark's capabilities.
IBM will commit more than 3,500 researchers and developers to Spark-related projects in more than a dozen laboratories worldwide. Big Blue plans to open a Spark Technology Center in San Francisco for the data science and developer community. IBM will also train more than one million data scientists and data engineers on Spark through partnerships with DataCamp, AMPLab, Galvanize, MetiStream, and Big Data University.
A typical large corporation has hundreds or thousands of data sets that reside in different databases across its computer systems. A data scientist can design an algorithm to plumb the depths of any one database, but developing that algorithm takes about 90 working days of data science work. Today, running it against another system means roughly another quarter of work to adapt the algorithm. Spark cuts that time in half. A Spark-based system can access and analyze any database without additional development or delay.
Spark's other virtue is ease of use: developers can concentrate on designing the solution rather than building an engine from scratch. Spark advances large-scale data processing because it improves the performance of data-dependent applications, radically simplifies the development of intelligent solutions, and provides a platform capable of unifying all kinds of information in real-world workloads.
Many experts consider Spark the successor to Hadoop, but its adoption remains slow. Spark works very well for machine learning tasks that normally require large clusters of computers. The latest version of the platform, released recently, extends the range of machine learning algorithms it can run.