Archive

Posts Tagged ‘hbase’

Real-Time Hadoop queries will be a reality in 2013

December 5, 2012 7 comments

Real-Time Hadoop queries will be a reality in 2013 thanks to two new projects from Cloudera: Impala and Trevni.

Impala is the open source version of Dremel, Google’s proprietary big data query solution. A first beta is available and the production version is foreseen for Q1 2013.

Impala allows you to run real-time queries on top of Hadoop’s HDFS, Hbase and Hive. No migrations necessary.

However the real revolution will only get better when Doug Cutting [the creator of Lucene, Hadoop, etc.]‘s Trevni is integrated into Impala. Trevni is a new columnar data storage format that promises superior performance for reading large columnar stored data sets.

Impala+Trevni is promising real-time big data queries with multiple joins that are on par in performance but have more functionality than Google’s Dremel…

Hadoop for Real-Time: Spark, Shark, Spark Streaming, Bagel, etc. will be 2012’s new buzzwords

August 15, 2012 5 comments

The website defines Spark as a MapReduce-like cluster computing framework designed to support low-latency iterative jobs. However it would be easier to say that Spark is Hadoop for real-time.

Spark allows you to run MapReduce jobs together with your data on distributed machines. Unlike Hadoop Spark can distributed your data in slices and store it in memory hence your processing and data are co-located in memory. This gives an enormous performance boost. Spark is more than MapReduce however. It offers a new distributed framework on which different distributed computing paradigms can be modelled. Examples are: Hadoop’s Hive => Shark (40x faster than Hive), Google’s Pregel / Apache’s Giraph => Bagel, etc. An upcoming Spark Streaming is supposed to bring real-time streaming to the framework.

The excellent part

Spark is written in Scala and has a very straight forward syntax to run applications from the command line or via compiled code. The possibilities to run iterative operations over large datasets or very compute intensive operations in parallel, make it ideal for big data analytics and distributed machine learning.

The points for improvement

In order to use Spark, you need to install Mesos. Mesos is a framework for distributed computing that was also developed by Berkeley. So in a sense they are eating their own dog food. Unfortunately Mesos is not written in scala so installing Spark becomes a mix of make’s, ant’s, .sh, XML, properties, .conf, etc. It would not be bad if Mesos would have consistent documentation but due to incubation into Apache the installation process is currently undergoing changes and is not straightforward.

Spark allows to connect to Hadoop, Hbase, etc. However running Hadoop on top of Mesos is “experimental” to say the least. The integration with Hadoop should be lighter. At the end only access to HDFS, SequenceFiles, etc. is required. This should not mean that a complete Hadoop should be installed and Spark should be recompiled for each specific Hadoop version.

If Spark wants to become as successful as Hadoop, then they should learn from Hadoop’s mistakes. Complex installation is a big problem because Spark needs to be installed on many machines. The Spark team should take a look at Ruby’s Rubygems, Node.js’s npm, etc. and make the installation simple, ideally via Scala’s package manager, although it is less popular.

If possible the team should drop Mesos as a prerequisite and make it optional. One of Spark’s competitors is Storm & Trident, you can install a Storm cluster in minutes and have a one click command to run Storm on an EC2 cluster.

It would be nice if there would be an integration SDK that allows extensions to be plugged-in. Integrations with Cassandra, Redis, Memcache, etc. could be developed by others. Also looking at a distribution in which Cassandra’s Brisk is used to mimic Hive and HDFS (a.k.a. CassandraFS) and have it all pre-bundled with Shark, could be an option. Spark’s in-memory execution and read speed, combined with Cassandra’s write speed, should make for a pretty quick and scalable solution. Ideally without the need to fight with namenodes, datanodes, jobtrackers, etc. and other Hadoop hard-to-configure inventions…

The conclusion is that distributed computing and programming is already hard enough by itself. Programmers should be focusing on their algorithms and not need a professional admin to get them started.

All-in-all Spark, Shark, Streaming Spark, Bagel, etc. have a lot of potential, it is just a little bit rough around the edges…

Update: I am reviewing my opinion about Mesos. See the Mesos post.

Open Source Big Data Reporting & ETL show promises

March 16, 2012 1 comment

With Hadoop/Hbase/Hive, Cassandra, etc. you can store and manipulate peta-bytes of data. But what if you want to get nice looking reports or compare data held in a NoSQL solution with data held elsewhere? There have been two market leaders in the Open Source business intelligence space that are putting all their firepower onto Big Data now.

Pentaho Big Data seems to be a bit further ahead. They offer a graphical ETL tool, a report designer and a business intelligence server. These are existing tools but support for Hadoop HDFS, Map-Reduce, Hbase, Hive, Pig, Cassandra, etc. have been added.

Jaspersoft’s Open Source Big Data strategy is a little bit behind because connectors are not included yet into the main product and several are still in beta quality and with missing documentation.

Both companies will accelerate the adoption of big data since the main problem with Big Data is easy reporting. Unstructured data is harder to format into a very structured report than structured data. Any solutions that will make this possible and additionally are Open Source are very welcome in times of cost cutting…

Open Source Solution Index from the Big Dotcoms

January 26, 2012 Leave a comment

The big names in dotcom world are busy open sourcing some of their secret sause. It is very important to become familiar with these often strangely named projects because they are responsible for several competitive advantages. Since the list is growing please suggest new solutions in the comments section so they can be added.

Google

Facebook

  • Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store
  • Hive a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • FlashCache is a general purpose writeback block cache for Linux. It was developed as a loadable Linux kernel module, using the Device Mapper and sits below the filesystem.
  • HipHop for PHP transforms PHP source code into highly optimized C++. HipHop offers large performance gains and was developed over the past two years.
  • Open Compute Project an open hardware project aims to accelerate data center and server innovation while increasing computing efficiency through collaboration on relevant best practices and technical specifications.
  • Scribe is a scalable service for aggregating log data streamed in real time from a large number of servers.
  • Thrift provides a framework for scalable cross-language services development in C++, Java, Python, PHP, and Ruby.
  • Tornado is a relatively simple, non-blocking web server framework written in Python. It is designed to handle thousands of simultaneous connections, making it ideal for real-time Web services.
  • codemod assists with large-scale codebase refactors that can be partially automated but still require human oversight and occasional intervention.
  • Facebook Animation is a JavaScript library for creating customizable animations using DOM and CSS manipulation.
  • Online Schema Change for MySQL lets you alter large database tables without taking your cluster offline.
  • Phabricator is a collection of web applications which make it easier to write, review, and share source code. It is currently available as an early release and is used by hundreds of Facebook engineers every day.
  • PHPEmbed makes embedding PHP truly simple for all of our developers (and indeed the world) we developed this PHPEmbed library which is just a more accessible and simplified API built on top of the PHP SAPI.
  • phpsh provides an interactive shell for PHP that features readline history, tab completion, and quick access to documentation. It is ironically written mostly in Python.
  • Three20 is an Objective-C library for iPhone developers which provides many UI elements and data helpers behind our iPhone application.
  • XHP is a PHP extension which augments the syntax of the language such that XML document fragments become valid expressions.
  • XHProf is a function-level hierarchical profiler for PHP with a simple HTML-based navigational interface.

Twitter

Twitter open sourced some complete projects (e.g. FlockDB) but especially adds extensions to existing projects. For a full list see here.

Yahoo

  • Apache Traffic Server is fast, scalable and extensible HTTP/1.1 compliant caching proxy server.
  • Hadoop THE nosql solution at the moment was started by Yahoo. Yahoo actively contributes also to several extensions like Avro and Pig.
  • YUI is a free, open source JavaScript and CSS framework for building richly interactive web applications.

LinkedIn

  • Azkaban is simple batch scheduler for constructing and running Hadoop jobs or other offline processes
  • Bobo is a Faceted Search implementation written purely in Java, an extension of Apache Lucene
  • Cleo is a flexible, partial, out-of-order and real-time typeahead search.
  • Datafu is Hadoop library for large-scale data processing.
  • Decomposer is for massive matrix decompositions
  • Glu is a deployment automation platform
  • A set of useful gradle plugins
  • Indexing engine for IndexTank and API, BackOffice, Storefront, and Nebulizer for IndexTank  
  • Kafka is a distributed publish/subscribe messaging system
  • Kamikaze is a utility package for performing operations on compressed arrays of sorted integers
  • Krati is a simple persistent data store with very low latency and high throughput
  • Base utilities shared by all linkedin open source projects
  • A set of utility classes and wrappers around ZooKeeper
  • Norbert is a library that provides easy cluster management and workload distribution
  • Sensei is a distributed, elastic, realtime, searchable database
  • Voldemort is a distributed key-value storage system
  • Zoie is a real-time search and indexing system built on Apache Lucene
Follow

Get every new post delivered to your Inbox.

Join 314 other followers

%d bloggers like this: