Scaling Machine Learning
There is currently still a vacuum for easy & scalable solutions in the machine learning space.
At the moment everybody is talking about Hadoop as the de-facto standard for Big Data. Unfortunately Hadoop is not a real-time system. Map-reduce can be used for batch machine learning like training a Logistic Regression/Support Vector Machine/Neural Network, Batch Gradient Descent, etc. However when it comes to real-time predictions it is not the platform of choice. Additionally Java is loosing every day its status of preferred language. New machine learning algorithms are more likely to be developed in R, Scala, Python, Go etc. There is of course Mahout which is scalable but the word “easy” is not a synonym.
If you want to create your own algorithms but do not want to go low-level Java Map-Reduce, then there are some alternatives like Pig [for the SQL-minded], Cascading [Java but easy and allows test driven development!], Scalding [Scala on top of Cascading, made by Twitter. Could be combined with libraries like Scalala for easy vector and matrix similar to Matlab], etc.
What other options are there?
Storm could be an option for time series, predictions based on a pre-trained model, online learning algorithms, etc. However what is missing is an extension like Trident, but for distributed machine learning, that avoids having to reinvent the wheel. A sort of Mahout for Storm.
Spark is another option. But Mesos is still very early days and also here a Mahout for Spark would be a good addition. In comparison with Storm, Spark would be ideal for training complex machine learning algorithms that need to iterate millions of times over the same data set.
Graphlab can be an option for those who are looking for social network analytics or other graph-based machine learning.
If you wanted to work with R then you could use packages like Snow or Parallel. But this would mean you need to reinvent a lot of distributed management of processing nodes. Both packages just incorporate the basic functions to launch some external processing nodes but are lacking professional management of a large cluster. You could also look at RHadoop, as long as you are fine with non-real-time on top of Hadoop. For alternatives for RHadoop you could look at Rhipe. Segue is R + Amazon Elastic Map Reduce, etc.
Update: an interesting extension for R (i.e. pbd) has just been released that promises R execution on over 10.000 cores. Read more about is here.
What is missing?
Simplicity, easy to use & reusable. What is needed is a solution that is cross-platform (R, Scala, Java, Python, Matlab, etc.). With a visual interface like RapidMiner or Knime, that allows 80% of the work to be drag-and-drop. With a re-useable library of the most used algorithms for prediction, clustering, classification, outlier detection, dimension reduction, normalization, etc. Ideally with a marketplace for sharing data and algorithms. With an easy interface to manage your data and create reports, think similar to Datameer. Ideally integrated with tools for data cleaning (e.g. Google’s Refine) and ETL (e.g. Pentaho, Talend, Jasper Reports, etc.). But most of all with a powerful distributed engine that allows both batch processing [Hadoop] and real-time [e.g. Storm]. And finally with a one click install.
If my requirements are missing some important aspects, let me know. If you want to construct such a system, please contact me…