Finally a solution for real-time distributed machine learning: Jubatus. Jubatus differs from Mahout and other distributed machine learning solutions in that its focus is real-time instead of batch processing. It offers algorithms for online classification, regression, recommendation, graph operations (queries, centrality, shortest path), etc. Zookeeper is used to keep the distributed Jubaclassifiers synchronized. Multiple clients connect to the Jubakeeper (based on Zookeeper). Jubatus has a plugin framework to convert unstructured data into feature vectors on the fly. Performance seems to scale linearly up to 16 nodes. Jubatus is another solution that Big Data Architects should evaluate…
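To make the idea of online classification on top of on-the-fly feature extraction a bit more concrete, here is a minimal single-process sketch in plain Python. It does not use the Jubatus API: `to_features`, `OnlinePerceptron` and the toy message stream are hypothetical names and data made up for illustration, and a simple perceptron stands in for whatever algorithms Jubatus actually ships.

```python
import re
from collections import defaultdict


def to_features(text):
    """Convert raw text into a sparse bag-of-words feature vector.

    Hypothetical stand-in for a feature-conversion plugin.
    """
    features = defaultdict(float)
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        features[token] += 1.0
    return dict(features)


class OnlinePerceptron:
    """Binary classifier (labels +1 / -1) updated one example at a time."""

    def __init__(self):
        self.weights = defaultdict(float)

    def score(self, features):
        return sum(self.weights.get(name, 0.0) * value
                   for name, value in features.items())

    def predict(self, features):
        return 1 if self.score(features) > 0 else -1

    def train(self, features, label):
        # Perceptron rule: adjust the weights only when the prediction is wrong.
        if self.predict(features) != label:
            for name, value in features.items():
                self.weights[name] += label * value


# Toy stream of (message, label) pairs; +1 = spam, -1 = not spam (made-up data).
stream = [
    ("cheap pills buy now", 1),
    ("meeting at noon tomorrow", -1),
    ("buy cheap watches now", 1),
    ("lunch meeting moved to noon", -1),
]

model = OnlinePerceptron()
for text, label in stream:
    # Each example is seen once and then discarded: no batch pass over stored data.
    model.train(to_features(text), label)
```

The point of the sketch is that the model is usable after every single update, which is what separates an online learner from a batch job. In a distributed setup the weight vectors of several such learners would additionally have to be kept in sync, which is the part the sketch leaves out.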
I initially complained about the complexity of installing Mesos when I was playing around with Spark and Shark. However, when I saw the Twitter Mesos and Framework presentation, I understood why Mesos can be disruptive to how you architect applications in the highly distributed manner typical of cloud computing.
You can see the presentation here.
The key is that Twitter combined Mesos with Zookeeper, Linux Control Groups and Google’s Protocol Buffers as well as Spark, Storm and Hadoop. This gives them a way to easily program services that can be scaled to hundreds of Mesos nodes, upgraded automatically and restarted in case of failure. Resource usage can also be controlled via the control groups. Zookeeper manages the configuration. Protocol Buffers ensure efficient communication between nodes. Services can use Spark and Storm for real-time operations and Hadoop for batch. Developers do not have to worry about scaling the services, deploying them to different nodes, etc. This is handled by the Twitter Framework and the Mesos master.
There is only one thing to add: “TWITTER PLEASE OPEN SOURCE YOUR TWITTER FRAMEWORK” or in Twitter language: “#mesos please #opensource #twitterfw now @telruptive”…
In October 2010 Yahoo made another internal system open source: S4. S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.
After looking at the current code and documentation it is clear that this is an alpha project. Yahoo seems to have stripped out everything useful and is rebuilding the product almost from scratch. However, this does not change the fact that S4 could in the near future become as important as Hadoop.
Hadoop is becoming the de facto standard for extremely large batch data processing. The word batch is important because low-latency systems do not get much benefit from the Hadoop framework. In the telecom domain, low-latency processing is exactly what matters. You don't want voice or video to arrive late or out of sync.
S4 promises to focus on real-time high-volume data streams. It is unfortunate that the current code is not better documented and that Yahoo decided not to open source some examples around machine learning, etc.
The S4 framework should excel at making rapid computational decisions in event-driven systems. This makes it a possible candidate for a long list of telecom domains: network routing decisions, real-time billing, policy control, voice recognition, natural language processing, advertisement, etc.
Of course the S4 design is not new in the industry. Erlang and Scala have an Actor framework that can be seen as a more basic version of S4. Even some Java implementations exist.
The power of mixing in Zookeeper and a pluggable architecture can set S4 apart from previous frameworks. However, more developers and more documentation will be needed, and more importantly a reusable library of processing elements. Having such a reusable library would allow new applications to be built by configuring processing elements instead of writing code.
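What "built by configuring processing elements" could look like is easiest to show with a small sketch. The Python below is not S4 code and does not reflect S4's actual API or configuration format: `CounterPE`, `ThresholdPE`, `REGISTRY`, `build_pipeline` and the `CONFIG` layout are all hypothetical names invented for this illustration, and everything runs in a single process.

```python
from collections import defaultdict


class CounterPE:
    """Hypothetical processing element: counts events per key."""

    def __init__(self, emit):
        self.emit = emit
        self.counts = defaultdict(int)

    def process(self, event):
        key = event["key"]
        self.counts[key] += 1
        self.emit({"key": key, "count": self.counts[key]})


class ThresholdPE:
    """Hypothetical processing element: forwards events at or above a threshold."""

    def __init__(self, emit, threshold):
        self.emit = emit
        self.threshold = threshold

    def process(self, event):
        if event["count"] >= self.threshold:
            self.emit(event)


# The "reusable library": element types that can be referenced by name.
REGISTRY = {"counter": CounterPE, "threshold": ThresholdPE}


def build_pipeline(config, sink):
    """Wire elements in the configured order and return the entry point."""
    downstream = sink
    # Build back to front so each element knows where to emit its output.
    for spec in reversed(config):
        params = {k: v for k, v in spec.items() if k != "type"}
        element = REGISTRY[spec["type"]](emit=downstream, **params)
        downstream = element.process
    return downstream


# The application itself is just configuration (assumed layout).
CONFIG = [
    {"type": "counter"},
    {"type": "threshold", "threshold": 3},
]

alerts = []
entry = build_pipeline(CONFIG, alerts.append)
for user in ["a", "b", "a", "a", "b", "a"]:
    entry({"key": user})
```

After the six events, `alerts` holds the two events for key `"a"` with counts 3 and 4. Changing the threshold or inserting another element only touches `CONFIG`, not the element code, which is the kind of reuse a shared library of processing elements would make possible.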
Although S4 is still in its infancy, the potential to be a core component in a future telco 2.0 architecture is there…