Impala is the open source version of Dremel, Google’s proprietary big data query solution. A first beta is available and the production version is foreseen for Q1 2013.
However the real revolution will only get better when Doug Cutting [the creator of Lucene, Hadoop, etc.]‘s Trevni is integrated into Impala. Trevni is a new columnar data storage format that promises superior performance for reading large columnar stored data sets.
Impala+Trevni is promising real-time big data queries with multiple joins that are on par in performance but have more functionality than Google’s Dremel…
I initially complaint about the complexity of installing Mesos when I was playing around with Spark and Shark. However
when I saw the Twitter Mesos and Framework presentation, I understood why Mesos can be disruptive to how you architect applications in a highly distributed manner typical for Cloud Computing.
You can see the presentation here.
The key is that Twitter combined Mesos with Zookeeper, Linux Control Groups and Google’s Protocol Buffers as well as Spark, Storm and Hadoop. This provides them with a way to easily program services that can be scaled to hundreds of mesos nodes, automatically upgraded and restarted in case of failure. Also resource usage can be controlled via the control groups. Zookeeper manages the configuration. Protocol buffers assure efficient communication between nodes. Services can use Spark and Storm for real-time operations and Hadoop for batch. Developers do not have to worry about scaling the services, deploying them to different nodes, etc. This is handled by the Twitter Framework and Mesos master.
There is only one thing to add: “TWITTER PLEASE OPEN SOURCE YOUR TWITTER FRAMEWORK” or in Twitter language: “#mesos please #opensource #twitterfw now @telruptive “…
The website defines Spark as a MapReduce-like cluster computing framework designed to support low-latency iterative jobs. However it would be easier to say that Spark is Hadoop for real-time.
Spark allows you to run MapReduce jobs together with your data on distributed machines. Unlike Hadoop Spark can distributed your data in slices and store it in memory hence your processing and data are co-located in memory. This gives an enormous performance boost. Spark is more than MapReduce however. It offers a new distributed framework on which different distributed computing paradigms can be modelled. Examples are: Hadoop’s Hive => Shark (40x faster than Hive), Google’s Pregel / Apache’s Giraph => Bagel, etc. An upcoming Spark Streaming is supposed to bring real-time streaming to the framework.
The excellent part
Spark is written in Scala and has a very straight forward syntax to run applications from the command line or via compiled code. The possibilities to run iterative operations over large datasets or very compute intensive operations in parallel, make it ideal for big data analytics and distributed machine learning.
The points for improvement
In order to use Spark, you need to install Mesos. Mesos is a framework for distributed computing that was also developed by Berkeley. So in a sense they are eating their own dog food. Unfortunately Mesos is not written in scala so installing Spark becomes a mix of make’s, ant’s, .sh, XML, properties, .conf, etc. It would not be bad if Mesos would have consistent documentation but due to incubation into Apache the installation process is currently undergoing changes and is not straightforward.
Spark allows to connect to Hadoop, Hbase, etc. However running Hadoop on top of Mesos is “experimental” to say the least. The integration with Hadoop should be lighter. At the end only access to HDFS, SequenceFiles, etc. is required. This should not mean that a complete Hadoop should be installed and Spark should be recompiled for each specific Hadoop version.
If Spark wants to become as successful as Hadoop, then they should learn from Hadoop’s mistakes. Complex installation is a big problem because Spark needs to be installed on many machines. The Spark team should take a look at Ruby’s Rubygems, Node.js’s npm, etc. and make the installation simple, ideally via Scala’s package manager, although it is less popular.
If possible the team should drop Mesos as a prerequisite and make it optional. One of Spark’s competitors is Storm & Trident, you can install a Storm cluster in minutes and have a one click command to run Storm on an EC2 cluster.
It would be nice if there would be an integration SDK that allows extensions to be plugged-in. Integrations with Cassandra, Redis, Memcache, etc. could be developed by others. Also looking at a distribution in which Cassandra’s Brisk is used to mimic Hive and HDFS (a.k.a. CassandraFS) and have it all pre-bundled with Shark, could be an option. Spark’s in-memory execution and read speed, combined with Cassandra’s write speed, should make for a pretty quick and scalable solution. Ideally without the need to fight with namenodes, datanodes, jobtrackers, etc. and other Hadoop hard-to-configure inventions…
The conclusion is that distributed computing and programming is already hard enough by itself. Programmers should be focusing on their algorithms and not need a professional admin to get them started.
All-in-all Spark, Shark, Streaming Spark, Bagel, etc. have a lot of potential, it is just a little bit rough around the edges…
Update: I am reviewing my opinion about Mesos. See the Mesos post.
In a previous post I mentioned Storm already. Trident is an extension of Storm that makes it an easy-to-use distributed real-time analytics framework for Big Data. Both Trident and Storm were developed by Twitter.
One of Twitter’s major problems is to keep statistics of Tweets and Tweeted URLs that get retweeted by millions of followers. Imagine a famous person who tweets a URL to millions of followers. Lots of followers will retweet the URL. So how do you calculate how many Tweeters have seen the URL? This is important for features like “Top retweeted URLs”.
The answer was Storm but with the addition of Trident, it has become a lot easier to manage. Trident is doing to Storm what Pig and Cascading are doing to Hadoop: simplification. Instead of having to create a lot of Spouts and Bolts and take care of how messages are distributed, Trident comes with a lot of the work already done.
In a few lines of code, you set-up a Distributed RPC server, send it URLs, have it collect the tweeters and followers and count them. Fail-over and resiliance as well as massive distribution throughput are build into the platform. You can see it in this example code:
TridentState urlToTweeters =
TridentState tweetersToFollowers =
.stateQuery(urlToTweeters, new Fields("args"), new MapGet(), new Fields("tweeters"))
.each(new Fields("tweeters"), new ExpandList(), new Fields("tweeter"))
.stateQuery(tweetersToFollowers, new Fields("tweeter"), new MapGet(), new Fields("followers"))
.each(new Fields("followers"), new ExpandList(), new Fields("follower"))
.aggregate(new One(), new Fields("one"))
.aggregate(new Count(), new Fields("reach"));
The possibilities of Trident + Storm, combined with fast scalable datastores, like for instance Cassandra, are enormous. Everything from real-time counters, filtering, complex event processing, machine learning, etc.
The Storm concept of Spout [data generation] and Bolt [data processing] can be easily understood by most programmers. Storm is an asynchronous highly distributed framework but with a simple distributed RPC server it can easily be used in synchronous code.
The only drawback I have seen is that DRPC is focused only on Strings (and other primitive types that can be contained in a String). Adding more complex objects (via Kryo, Avro, Protocol Buffers, etc.), or at least bytes, would be useful for companies that do not only focus on Tweets.
Cloudify, from the scalability experts GigaSpaces, is still its early stages. Unlike Google App Engine, Azure, Heroku, etc. this PaaS is more focused on the application life cycle and not on being a “transparent” application server and database. The main focus is automating application and services deployment, monitoring, autoscaling, etc. The closest competitor would be Scalr.
Unlike Scalr, Cloudify’s focus is on Cloud-neutrality. Cloudify is not focusing on using specific Amazon services for scalability but instead to make a neutral Cloud platform. The advantage is that every possible Cloud being it private or public can be used and scenarios like hybrid clouds with Cloud bursting from private to public cloud are possible. The deep understanding of large-scale architectures in a company like GigaSpaces is a guarantee that Cloudify will scale in the future.
Cloudify is still missing some important functionality like security, multi-tenancy, integrations with lower-level automation frameworks (e.g. Chef and Puppet), complex upgrade management [e.g. rolling upgrades, MySQL schema upgrades, A/B testing of new features, etc.], etc. However the roadmap is pointing towards most of these items.
Software architects should understand the possibilities Cloudify, Scalr, etc. bring. By having a reusable automation framework companies are able to spend more development and operations time on bringing new business features and less on reinventing the wheel.
Twitter is having a Real-Time Analytics solution that could easily become as important as Hadoop. They talked about open sourcing it but so far have not done so.
This post is an open invitation to Twitter open source Rainbird and accelerate Real-Time Analytics adoption in the world. Hadoop has changed thousands if not millions of companies. Rainbird could do a similar thing.
In order to gather people around this subject, I am proposing that you include #TWOSRB in your tweets. #TWOSRB stands for Twitter please Open Source RainBird:
With Hadoop/Hbase/Hive, Cassandra, etc. you can store and manipulate peta-bytes of data. But what if you want to get nice looking reports or compare data held in a NoSQL solution with data held elsewhere? There have been two market leaders in the Open Source business intelligence space that are putting all their firepower onto Big Data now.
Pentaho Big Data seems to be a bit further ahead. They offer a graphical ETL tool, a report designer and a business intelligence server. These are existing tools but support for Hadoop HDFS, Map-Reduce, Hbase, Hive, Pig, Cassandra, etc. have been added.
Jaspersoft’s Open Source Big Data strategy is a little bit behind because connectors are not included yet into the main product and several are still in beta quality and with missing documentation.
Both companies will accelerate the adoption of big data since the main problem with Big Data is easy reporting. Unstructured data is harder to format into a very structured report than structured data. Any solutions that will make this possible and additionally are Open Source are very welcome in times of cost cutting…
Fujitsu just presented SaaSification on Cebit. Existing applications can be easily brought to the Cloud and sold via App Stores and SaaS marketplaces. IBM is also working on SaaSification and even adds multi-tenancy.
What is next?
Everybody wants to have a full App Store or SaaS Marketplace, so SaaSification is the next step after launching your store. However converting a client/server application to the Cloud is only step 1. Step 2 is creating new services that are specifically built for the Cloud.
What does Built-for-the-Cloud means?
Cloud-Ready applications should also accept the new reality of APIs. Both for exposure as well as consumption. This means that applications need to be redesigned according to application slices.
So if SaaSification wants to be successful then it needs to add quick enablers for multi-tenancy, big data, integration with external APIs as well as API exposure, etc. This integration concept can be called iPaaS or integration platform-as-a-Service. iPaaS should not only focus on exposing or integrating APIs but on providing complex services by integration multiple SaaS solutions together.
Other enablers should be added as well. Basically 80% of a SaaS solution consists out of the same elements or tries to solve the same problems. These could all be provided via a SaaSification PaaS:
- Blog – to describe the newest ideas.
- Forum – for people to get answers from the community.
- IT PaaS – where you run the actual business logic and UI. Data storage is assumed to be provided by the Big Data elements.
- Portal and Mobile Portal – allows to quickly define the “static” content for the web and mobile site.
- Deployment management – ideally continuous deployment or integration tools that allow fast feature by feature deployment.
- A/B testing – allow new features to be deployed to subsets of users and check which version of a feature has the highest impact on the bottom-line. A/B testing was made popular by Amazon.
- Automated testing – lots of testing can be automated but especially end-to-end and performance testing are the harder tests that should be focused on.
- Configuration management – manage the version control of the code.
- Metering and billing – be able to meter the resource usage by users, companies or any other element you want to meter and be able to bill users both for subscriptions as well as for usage, ideally with advanced set-up with overage, etc.
- Marketplace listing and provisioning – automate the listing of products on the marketplace as well as the provisioning of new services.
- Single sign-on & identity management - allow companies to use their own user credentials (e.g. SAML), authorization for third-parties (e.g. oAuth), etc.
- Reporting and data warehousing – this can be part of the big data stack but especially being able to create ad-hoc reports for instance for A/B testing . Of course regular business reporting needs to be included as well.
- ERP – accounting, resource management, etc.
- CRM – sales and lead management
- Operations & Maintenance – automation of back-ups, monitoring both for the performance and fault management but as well business monitoring.
- Support – helpdesk, ticketing system, SLA management, etc.
- Social integration – tools to add social aspects like Facebook apps, Twitter feeds, etc.
The idea is not that a SaaSification PaaS offers all these solutions by custom development. Instead the SaaSification PaaS should allow startups to assemble an ideal architecture by combining different solutions from different providers. For example you would be able to select the support solution you prefer, e.g. desk.com, zendesk.com, etc. and this solution would be completely integrated into the overall stack, e.g. CRM integration with help desk and fault management together with sign sign-on.
SaaSification 2.0 should focus on making sure that 2-5 people can start a new dotcom solution and focus on creating a killer service and not on building up yet another stack of solutions for configuration management, support, billing, etc. If a SaaSification PaaS can shorten the time to launch with months and reduce the needs to operate the solution with several people then startups will see the value. Instead of SaaSification PaaS a good term could be Incubation PaaS, to incubate SaaS solutions. Once the business model and solution is proven, there will be money to move to a custom-build stack but during incubation and crossing-the-chasm enterpreneurs should be able to focus on delivering value to their customers and not on re-inventing the startup wheel.
Hadoop has run into architectural limitations and the community has started working on the Next Generation Hadoop [NGN Hadoop]. NGN Hadoop has some new management features of which multi-tenant application management is the major one. However the key change is that MapReduce no longer is entangled inside the rest of Hadoop. This will allow Hadoop to be used for MPI, Machine Learning, Master-Worker, Iterative Processing, Graph Processing, etc. New tools to better manage Hadoop are also being incubated, e.g. Ambari and HCatalog.
Why is this important for telecom?
Having one platform that allows massive data storage, peta-byte data analytics, complex parallel computations, large-scale machine learning, big data map reduce processing, etc. all in one multi-tenant set-up means that telecom operators could see massive reductions in their architecture costs together with faster go-to-market, better data intelligence, etc.
Telecom applications, that are redesigned around this new paradigm, can all use one shared back-office architecture. Having data centralized into one large Hadoop cluster instead of tens or hundreds of application-specific databases, will enable unseen data analytics possibilities and bring much-needed efficiencies.
What is needed is that several large operators define this approach as their standard architecture hence telecom solution providers will start incorporating it into their solutions. Commercial support can be easily acquired from companies like Hortonworks, Cloudera, etc.
Having one shared data architecture and multi-tenant application virtualization in the form of a Telco PaaS would allow third-parties to launch new services quickly and cheaply, think days in stead of years…