Posts Tagged ‘zookeeper’

Jubatus – distributed scalable online machine learning framework

December 16, 2012

Finally, a solution for real-time distributed machine learning: Jubatus. Jubatus differs from Mahout and other distributed machine learning solutions in that its focus is real-time rather than batch processing. Its algorithms cover online classification, regression, recommendation and graph operations (queries, centrality, shortest path), among others. Zookeeper is used to keep the distributed Jubaclassifiers synchronized. Multiple clients connect to the Jubakeeper (based on Zookeeper). Jubatus has a plugin framework to convert unstructured data on the fly into feature vectors. Performance seems to scale linearly up to 16 nodes. Jubatus is another solution that Big Data Architects should evaluate…
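To make "online" and "convert unstructured data into feature vectors" a bit more concrete, here is a minimal Python sketch. It is not the Jubatus API: `to_features` and `OnlinePerceptron` are hypothetical names, and a plain perceptron stands in for whatever algorithms Jubatus actually ships. The point is only that the model is updated one example at a time, with no stored batch.

```python
from collections import defaultdict


def to_features(record):
    """Convert a raw record (dict of strings and numbers) into a sparse feature vector."""
    features = {}
    for key, value in record.items():
        if isinstance(value, (int, float)):
            # Numeric values are used as-is.
            features[key] = float(value)
        else:
            # Text values become bag-of-words counts, one feature per token.
            for token in str(value).lower().split():
                name = f"{key}={token}"
                features[name] = features.get(name, 0.0) + 1.0
    return features


class OnlinePerceptron:
    """Binary perceptron (labels +1 / -1) that learns from one example at a time."""

    def __init__(self):
        self.weights = defaultdict(float)

    def score(self, features):
        return sum(self.weights[name] * value for name, value in features.items())

    def predict(self, features):
        return 1 if self.score(features) >= 0 else -1

    def train(self, features, label):
        # Update only when the current model gets the example wrong.
        if self.predict(features) != label:
            for name, value in features.items():
                self.weights[name] += label * value


# Hypothetical stream of (record, label) pairs arriving one by one.
model = OnlinePerceptron()
stream = [
    ({"text": "cheap pills buy now"}, 1),
    ({"text": "meeting agenda for monday"}, -1),
]
for record, label in stream:
    model.train(to_features(record), label)
```

In a distributed setup every node would hold such a model, and something like Zookeeper would coordinate the nodes; that part is left out of the sketch.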

Mesos: Your next highly distributed Cloud architecture framework

August 21, 2012

I initially complained about the complexity of installing Mesos when I was playing around with Spark and Shark. However, when I saw the Twitter Mesos and Framework presentation, I understood why Mesos can be disruptive to how you architect applications in the highly distributed manner typical of Cloud Computing.

You can see the presentation here.

The key is that Twitter combined Mesos with Zookeeper, Linux Control Groups and Google’s Protocol Buffers as well as Spark, Storm and Hadoop. This gives them a way to easily program services that can be scaled to hundreds of Mesos nodes, upgraded automatically and restarted in case of failure. Resource usage can be controlled via the control groups. Zookeeper manages the configuration. Protocol Buffers ensure efficient communication between nodes. Services can use Spark and Storm for real-time operations and Hadoop for batch. Developers do not have to worry about scaling the services, deploying them to different nodes, etc. This is handled by the Twitter Framework and the Mesos master.

There is only one thing to add: “TWITTER PLEASE OPEN SOURCE YOUR TWITTER FRAMEWORK” or in Twitter language: “#mesos please #opensource #twitterfw now @telruptive”…

Open Source Solution Index from the Big Dotcoms

January 26, 2012

The big names in the dotcom world are busy open sourcing some of their secret sauce. It is very important to become familiar with these often strangely named projects because they are responsible for several competitive advantages. Since the list is growing, please suggest new solutions in the comments section so they can be added.

Google

Facebook

  • Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store
  • Hive is a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • FlashCache is a general purpose writeback block cache for Linux. It was developed as a loadable Linux kernel module, using the Device Mapper and sits below the filesystem.
  • HipHop for PHP transforms PHP source code into highly optimized C++. HipHop offers large performance gains and was developed over the past two years.
  • Open Compute Project is an open hardware project that aims to accelerate data center and server innovation while increasing computing efficiency through collaboration on relevant best practices and technical specifications.
  • Scribe is a scalable service for aggregating log data streamed in real time from a large number of servers.
  • Thrift provides a framework for scalable cross-language services development in C++, Java, Python, PHP, and Ruby.
  • Tornado is a relatively simple, non-blocking web server framework written in Python. It is designed to handle thousands of simultaneous connections, making it ideal for real-time Web services.
  • codemod assists with large-scale codebase refactors that can be partially automated but still require human oversight and occasional intervention.
  • Facebook Animation is a JavaScript library for creating customizable animations using DOM and CSS manipulation.
  • Online Schema Change for MySQL lets you alter large database tables without taking your cluster offline.
  • Phabricator is a collection of web applications which make it easier to write, review, and share source code. It is currently available as an early release and is used by hundreds of Facebook engineers every day.
  • PHPEmbed makes embedding PHP simple: it is a more accessible and simplified API built on top of the PHP SAPI.
  • phpsh provides an interactive shell for PHP that features readline history, tab completion, and quick access to documentation. It is ironically written mostly in Python.
  • Three20 is an Objective-C library for iPhone developers which provides many of the UI elements and data helpers behind Facebook’s iPhone application.
  • XHP is a PHP extension which augments the syntax of the language such that XML document fragments become valid expressions.
  • XHProf is a function-level hierarchical profiler for PHP with a simple HTML-based navigational interface.

Twitter

Twitter has open sourced some complete projects (e.g. FlockDB) but mostly contributes extensions to existing projects. For a full list see here.

Yahoo

  • Apache Traffic Server is a fast, scalable and extensible HTTP/1.1 compliant caching proxy server.
  • Hadoop, THE NoSQL solution at the moment, was developed in large part at Yahoo. Yahoo also actively contributes to several related projects like Avro and Pig.
  • YUI is a free, open source JavaScript and CSS framework for building richly interactive web applications.

LinkedIn

  • Azkaban is a simple batch scheduler for constructing and running Hadoop jobs or other offline processes
  • Bobo is a Faceted Search implementation written purely in Java, an extension of Apache Lucene
  • Cleo is a flexible, partial, out-of-order and real-time typeahead search.
  • Datafu is a Hadoop library for large-scale data processing.
  • Decomposer is for massive matrix decompositions
  • Glu is a deployment automation platform
  • A set of useful gradle plugins
  • Indexing engine for IndexTank and API, BackOffice, Storefront, and Nebulizer for IndexTank  
  • Kafka is a distributed publish/subscribe messaging system
  • Kamikaze is a utility package for performing operations on compressed arrays of sorted integers
  • Krati is a simple persistent data store with very low latency and high throughput
  • Base utilities shared by all linkedin open source projects
  • A set of utility classes and wrappers around ZooKeeper
  • Norbert is a library that provides easy cluster management and workload distribution
  • Sensei is a distributed, elastic, realtime, searchable database
  • Voldemort is a distributed key-value storage system
  • Zoie is a real-time search and indexing system built on Apache Lucene

The power of S4, Yahoo’s distributed stream computing platform, in telco?

November 14, 2010

In October 2010 Yahoo made another internal system open source: S4. S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.

After looking at the current code and documentation it is clear that this is an alpha project. Yahoo seems to have stripped out everything useful and is rebuilding the product almost from scratch. However, this does not change the fact that S4 could become as important as Hadoop in the near future.

Hadoop is becoming the default standard for extremely large batch data processing. The word batch is important because low-latency systems do not get much benefit from the Hadoop framework. In the telecom domain, low latency is exactly the type of processing that is key. You don’t want voice or video to arrive late or out of sync.

S4 promises to focus on real-time high-volume data streams. It is unfortunate that the current code is not better documented and that Yahoo decided not to open source some examples around machine learning, etc.

The S4 framework should excel at taking rapid computational decisions for event-driven systems. This makes it a possible candidate for a long list of telecom domains: network routing decisions, real-time billing, policy control, voice recognition, natural language processing, advertisement, etc.

Of course the S4 design is not new in the industry. Erlang and Scala offer actor frameworks that can be seen as a more basic version of S4. Some Java implementations exist as well.

The power of mixing in Zookeeper and a pluggable architecture can set S4 apart from previous frameworks. However, it will need more developers, more documentation and, most importantly, a re-usable library of processing elements. Having such a re-usable library would allow new applications to be built via configuration of processing elements instead of writing code.
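As a rough illustration of "configuration instead of code", here is a small Python sketch. It is not S4's API (S4 itself is Java); `CountPE`, `ThresholdPE`, `PE_LIBRARY`, `Stage`, `build_pipeline` and `run` are all hypothetical names. The only idea borrowed from S4 is that each processing element instance handles the events for one key value.

```python
from collections import defaultdict


class CountPE:
    """Counts the events seen for one key value and attaches the running count."""

    def __init__(self):
        self.count = 0

    def process(self, event):
        self.count += 1
        return {**event, "count": self.count}


class ThresholdPE:
    """Passes an event on only once its 'count' field reaches the limit."""

    def __init__(self, limit):
        self.limit = limit

    def process(self, event):
        return event if event["count"] >= self.limit else None


# The re-usable library: configuration refers to processing elements by name.
PE_LIBRARY = {"count": CountPE, "threshold": ThresholdPE}


class Stage:
    """One pipeline step that keeps a separate PE instance per key value."""

    def __init__(self, spec):
        self.key_field = spec["key"]
        factory = PE_LIBRARY[spec["pe"]]
        params = spec.get("params", {})
        self.instances = defaultdict(lambda: factory(**params))

    def process(self, event):
        return self.instances[event[self.key_field]].process(event)


def build_pipeline(config):
    return [Stage(spec) for spec in config]


def run(pipeline, events):
    output = []
    for event in events:
        for stage in pipeline:
            event = stage.process(event)
            if event is None:  # a PE dropped the event
                break
        else:
            output.append(event)
    return output


# The "application" is just this configuration (hypothetical telco example:
# flag subscribers once they have produced three or more events).
CONFIG = [
    {"pe": "count", "key": "subscriber"},
    {"pe": "threshold", "key": "subscriber", "params": {"limit": 3}},
]
events = [{"subscriber": s} for s in ["a", "b", "a", "a", "b", "a"]]
alerts = run(build_pipeline(CONFIG), events)
```

With a richer library, a new use case would mean writing a new `CONFIG`, not new classes. Distribution across nodes and fault tolerance, which are the hard parts S4 is after, are deliberately left out here.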

Although S4 is still in its infancy, the potential to be a core component in a future telco 2.0 architecture is there…

 
