Posts Tagged ‘hadoop’

Solving the pressing need for Linux talent…

September 17, 2013

The Linux Foundation recently shared an infographic that links through to the associated report. The short message is: if you are a Linux expert you are in high demand, because companies cannot find enough experts amid the Cloud and Big Data boom.

Unless cloning machines are discovered later this year, quickly expanding the number of Linux experts is unlikely to happen. This means total cost of ownership for enterprises is likely to rise. This is ironic since Linux is all about open source and providing some of the most amazing solutions for free.

The obvious alternative is to focus on Microsoft products. They are relatively cheap in total cost of ownership since licenses are “payable” and average Windows skills are easier to find.

However, Microsoft is losing the server war, especially in the web application space. So this is not a winning strategy if you are going to do Cloud.

How to solve the pressing need for Linux talent?

The only possible strategy is to lower the number of experts needed per company. Larger companies will always need some, but they should be focused on the “interesting high-value tasks”. This concept of interesting and high-value is key. With the number of cloud servers exploding, we cannot expect the number of experts to explode as well.

Open source products like Puppet and Chef have helped to alleviate the pain for the more “skilled” companies: one DevOps engineer was able to manage more than ten times as many machines as before. Unfortunately these server provisioning tools are not for the faint of heart. They require experts that know both administration and coding.

It is time for the next generation of tools. Ubuntu, the #1 Cloud operating system, is leading the way with Juju. If Linux wants to continue to be successful, then the common problems, the boring problems, the repetitive problems, etc. should be solved. Solved by Linux gurus in such a way that we, the less IT-gifted, can get instant solutions to these common problems.

We need a Linux democracy in which the lesser skilled, unfortunately the majority, can instantly reuse best-in-class blueprint solutions. Juju is a new class of tool that gives you instant solutions to all those common problems: scaling a web application, monitoring your infrastructure, sharding MongoDB, replicating a database, installing a Hadoop cluster, setting up continuous integration, etc. The individual software components have been “charmed”: a Charm allows the software to be instantly deployed, integrated and scaled. However, the real revolution is just starting. Juju will have bundles pretty soon. Technically speaking, a bundle is a collection of pre-configured and integrated Charms. In layman's terms, a bundle is an instant solution to a common problem. You deploy a bundle instantly [one command or drag-and-drop] and you get a blueprint solution. Since Juju is open source, the community can create as many instant solutions as there are common problems.

So if you want to scale your IT solutions without stretching your budget or cloning your employees, and without the lock-in of proprietary and expensive commercial software, then you should try Juju. Play with the GUI or install Juju today.

A Big Data-Base that is fast but inaccurate: BlinkDB

April 6, 2013

The idea might sound strange at first: why would you want a database that delivers inaccurate answers? The point is that BlinkDB trades accuracy for speed. When you query data you can specify either how quickly you want the answer, e.g. within 2 seconds, or how accurate you want the answer to be, e.g. 1% error with 95% confidence.

So if you have very large amounts of data (tens to hundreds of terabytes, or even petabytes) and you want quick, good-enough answers, then BlinkDB is for you. An early adopter is Facebook. Would you rather have Justin Bieber's follower count exactly right in minutes, or 99% right with your page loading almost instantly? If you prefer fast, reasonably accurate answers over slow, exact ones, BlinkDB is worth checking out.
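
To give an idea of what this looks like, here is a minimal sketch of both query styles over a Hive/Shark-style JDBC connection. The driver class, host, table and the exact wording of the bound clauses are assumptions based on the BlinkDB paper, so treat it as an illustration rather than verified working code:

import java.sql.DriverManager

// Hypothetical Hive/Shark-compatible JDBC endpoint exposed by a BlinkDB cluster.
Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive://blinkdb-master:10000/default")
val stmt = conn.createStatement()

// Time-bounded query: the best answer BlinkDB can produce within 2 seconds.
val fast = stmt.executeQuery(
  "SELECT count(*) FROM page_views WHERE country = 'US' WITHIN 2 SECONDS")

// Error-bounded query: at most 1% error with 95% confidence, however long it takes.
val accurate = stmt.executeQuery(
  "SELECT count(*) FROM page_views WHERE country = 'US' ERROR 0.01 CONFIDENCE 95.0")

conn.close()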

What can you use BlinkDB for?

  • The obvious use case is real-time reporting, where decisions need to be made in the blink of an eye, e.g. by day traders, and a 5-10% error is acceptable, e.g. what is the average change of all commodity prices in the last 2 seconds?
  • Real-time bookings or price comparisons in which users want to know the best possible offer but accept a small error margin; e.g. a mobile bar-code scanner that delivers product price comparisons in 1 second instead of 10 will dominate the App Store.
  • Any counter on a large website: visitors, friends, tweets, total search results, etc.
  • Any Power Law or Long Tail data in which there are some extremely popular cases, e.g. Justin Bieber followers, or a very large set of infrequent cases, e.g. the number of blogs that have under 1000 visitors per month.
  • Machine Learning solutions and recommendation engines that are using Collaborative Filtering and other types of algorithms that compare an item or user with large groups of other items and users.
  • and many other use cases…

The Big Data Revolution is likely to hit Gartner’s “trough of disillusionment” in 2013.

October 26, 2012

Big Data is the hype right now. Everything that comes close to Hadoop or NoSQL turns into gold! Unfortunately we are getting close to Gartner's “Peak of Inflated Expectations”. Hadoop does an excellent job at storing many terabytes of data and running relatively complex Map-Reduce operations. Unfortunately this is just the tip of the Big Data requirements iceberg. Doing intelligent Big Data analytics requires more than counting who visited a web site. Map-Reduce is able to do complex machine learning, but it is not really made for it. The Mahout project has to jump through too many hoops to convert matrix-based analytics algorithms into Map-Reduce enabled versions; Map-Reduce just is not an easy way of doing matrix-based operations, and unfortunately most machine learning algorithms rely on matrices. Also, real-time and batch often go together in real life: you need to pre-calculate recommendations or train a neural network in batch, but you want recommendations, predictions and classifications to be delivered in real-time. Unfortunately Hadoop is only good at one of the two.

So when the majority of investors and business analysts realize that Hadoop has limitations, what will happen?

Answer: Nothing unexpected. Hadoop will continue to be used for what it does best. A new hype will arrive as soon as somebody solves the real-time distributed analytics problem…

Scaling Machine Learning

October 17, 2012

There is currently still a vacuum for easy & scalable solutions in the machine learning space.

At the moment everybody is talking about Hadoop as the de-facto standard for Big Data. Unfortunately Hadoop is not a real-time system. Map-Reduce can be used for batch machine learning, like training a Logistic Regression, Support Vector Machine or Neural Network, Batch Gradient Descent, etc. However, when it comes to real-time predictions it is not the platform of choice. Additionally, Java is losing its status as the preferred language by the day: new machine learning algorithms are more likely to be developed in R, Scala, Python, Go, etc. There is of course Mahout, which is scalable, but “easy” is not a word you would associate with it.

If you want to create your own algorithms but do not want to go low-level with Java Map-Reduce, there are alternatives like Pig [for the SQL-minded], Cascading [Java, but easy and allows test-driven development!], Scalding [Scala on top of Cascading, made by Twitter; can be combined with libraries like Scalala for easy vector and matrix operations, similar to Matlab], etc.
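
To give an idea of how compact these abstractions are, this is the canonical Scalding word-count job, in the spirit of the example in the Scalding README (the input and output locations come from command-line arguments):

import com.twitter.scalding._

// A full Map-Reduce word count in a handful of lines of Scala.
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))                                              // read a text file line by line
    .flatMap('line -> 'word) { line: String => line.split("\\s+") }    // split every line into words
    .groupBy('word) { _.size }                                         // count occurrences per word
    .write(Tsv(args("output")))                                        // write (word, count) pairs as TSV
}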

What other options are there?
Storm could be an option for time series, predictions based on a pre-trained model, online learning algorithms, etc. However, what is missing is an extension like Trident, but for distributed machine learning, so that you avoid having to reinvent the wheel. A sort of Mahout for Storm.

Spark is another option, but Mesos, which Spark runs on top of, is still in its very early days, and here too a Mahout for Spark would be a good addition. In comparison with Storm, Spark would be ideal for training complex machine learning algorithms that need to iterate millions of times over the same data set.
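
To make the iteration point concrete, here is a minimal sketch of batch gradient descent for logistic regression on a cached Spark RDD, modelled on the example in the Spark documentation. It assumes an existing SparkContext sc (e.g. the Spark shell); the input path, the parsing and the number of features are made up:

// Labels are expected to be +1/-1; the first column is the label, the rest are features.
case class Point(x: Array[Double], y: Double)

val numFeatures = 10
val points = sc.textFile("hdfs://namenode:9000/data/points")
  .map { line =>
    val cols = line.split(' ').map(_.toDouble)
    Point(cols.tail, cols.head)
  }
  .cache()                             // keep the data set in memory across all iterations

var w = Array.fill(numFeatures)(scala.util.Random.nextDouble())
for (i <- 1 to 100) {
  // Each pass re-reads the cached RDD instead of re-loading it from disk.
  val gradient = points.map { p =>
    val margin = (0 until numFeatures).map(j => w(j) * p.x(j)).sum
    val scale  = (1.0 / (1.0 + math.exp(-p.y * margin)) - 1.0) * p.y
    p.x.map(_ * scale)
  }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
  w = w.zip(gradient).map { case (u, g) => u - g }
}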

GraphLab can be an option for those who are looking for social network analytics or other graph-based machine learning.

If you want to work with R, you could use packages like snow or parallel. But this means you need to reinvent a lot of the distributed management of processing nodes: both packages only provide basic functions to launch some external processing nodes and lack professional management of a large cluster. You could also look at RHadoop, as long as you are fine with non-real-time processing on top of Hadoop. As alternatives to RHadoop you could look at RHIPE, or Segue for R on Amazon Elastic MapReduce, etc.
Update: an interesting extension for R (pbdR) has just been released that promises R execution on over 10,000 cores. Read more about it here.
What is missing?

Simplicity, ease of use and reusability. What is needed is a solution that is cross-platform (R, Scala, Java, Python, Matlab, etc.). With a visual interface like RapidMiner or Knime that allows 80% of the work to be drag-and-drop. With a reusable library of the most used algorithms for prediction, clustering, classification, outlier detection, dimension reduction, normalization, etc. Ideally with a marketplace for sharing data and algorithms. With an easy interface to manage your data and create reports, similar to Datameer. Ideally integrated with tools for data cleaning (e.g. Google Refine) and ETL (e.g. Pentaho, Talend, Jaspersoft, etc.). But most of all with a powerful distributed engine that allows both batch processing [Hadoop] and real-time [e.g. Storm]. And finally with a one-click install.

If my requirements are missing some important aspects, let me know. If you want to construct such a system, please contact me…

Mesos: Your next highly distributed Cloud architecture framework

August 21, 2012

I initially complained about the complexity of installing Mesos when I was playing around with Spark and Shark. However, when I saw the Twitter Mesos and Framework presentation, I understood why Mesos can be disruptive to how you architect applications in the highly distributed manner typical for Cloud Computing.

You can see the presentation here.

The key is that Twitter combined Mesos with Zookeeper, Linux Control Groups and Google's Protocol Buffers, as well as Spark, Storm and Hadoop. This gives them a way to easily program services that can be scaled to hundreds of Mesos nodes, automatically upgraded and restarted in case of failure. Resource usage can be controlled via the control groups, Zookeeper manages the configuration, and Protocol Buffers ensure efficient communication between nodes. Services can use Spark and Storm for real-time operations and Hadoop for batch. Developers do not have to worry about scaling the services, deploying them to different nodes, etc.; this is handled by the Twitter Framework and the Mesos master.

There is only one thing to add: “TWITTER PLEASE OPEN SOURCE YOUR TWITTER FRAMEWORK” or in Twitter language: “#mesos please #opensource #twitterfw now @telruptive “…

Hadoop for Real-Time: Spark, Shark, Spark Streaming, Bagel, etc. will be 2012's new buzzwords

August 15, 2012

The Spark website defines Spark as a MapReduce-like cluster computing framework designed to support low-latency iterative jobs. However, it would be easier to say that Spark is Hadoop for real-time.

Spark allows you to run MapReduce jobs together with your data on distributed machines. Unlike Hadoop, Spark can distribute your data in slices and store it in memory, so your processing and data are co-located in memory. This gives an enormous performance boost. Spark is more than MapReduce, however: it offers a new distributed framework on which different distributed computing paradigms can be modelled. Examples are Hadoop's Hive => Shark (40x faster than Hive), Google's Pregel / Apache's Giraph => Bagel, etc. The upcoming Spark Streaming is supposed to bring real-time streaming to the framework.
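
A minimal sketch of what that in-memory co-location looks like in practice, using the Scala API of the 0.x-era releases (the master URL and HDFS paths are made up): the data set is loaded once, cached in cluster memory, and then queried repeatedly without another pass over HDFS.

import spark.SparkContext
import spark.SparkContext._

// Connect to a Spark/Mesos master (URL is illustrative).
val sc = new SparkContext("mesos://master:5050", "LogScan")

// Load the data once and keep the slices in cluster memory.
val logs = sc.textFile("hdfs://namenode:9000/logs/access.log").cache()

// Both of these run against the in-memory copy: no second pass over HDFS.
val errors   = logs.filter(_.contains("ERROR")).count()
val warnings = logs.filter(_.contains("WARN")).count()
println("errors=" + errors + " warnings=" + warnings)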

The excellent part

Spark is written in Scala and has a very straightforward syntax for running applications from the command line or via compiled code. The possibility to run iterative operations over large datasets, or very compute-intensive operations in parallel, makes it ideal for big data analytics and distributed machine learning.

The points for improvement

In order to use Spark, you need to install Mesos. Mesos is a framework for distributed computing that was also developed at Berkeley, so in a sense they are eating their own dog food. Unfortunately Mesos is not written in Scala, so installing Spark becomes a mix of make, ant, shell scripts, XML, properties and .conf files, etc. It would not be so bad if Mesos had consistent documentation, but due to its incubation into Apache the installation process is currently undergoing changes and is not straightforward.

Spark allows you to connect to Hadoop, HBase, etc. However, running Hadoop on top of Mesos is “experimental” to say the least. The integration with Hadoop should be lighter: in the end only access to HDFS, SequenceFiles, etc. is required. That should not mean that a complete Hadoop has to be installed and Spark recompiled for each specific Hadoop version.

If Spark wants to become as successful as Hadoop, then it should learn from Hadoop's mistakes. Complex installation is a big problem because Spark needs to be installed on many machines. The Spark team should take a look at Ruby's RubyGems, Node.js's npm, etc. and make the installation simple, ideally via Scala's package manager, although it is less popular.

If possible the team should drop Mesos as a prerequisite and make it optional. One of Spark's competitors is Storm with Trident: you can install a Storm cluster in minutes and run Storm on an EC2 cluster with a single command.

It would be nice if there were an integration SDK that allowed extensions to be plugged in. Integrations with Cassandra, Redis, Memcached, etc. could then be developed by others. A distribution in which Cassandra's Brisk is used to mimic Hive and HDFS (a.k.a. CassandraFS), pre-bundled with Shark, could also be an option. Spark's in-memory execution and read speed, combined with Cassandra's write speed, should make for a pretty quick and scalable solution. Ideally without the need to fight with namenodes, datanodes, jobtrackers and other hard-to-configure Hadoop inventions…

The conclusion is that distributed computing and programming are already hard enough by themselves. Programmers should be able to focus on their algorithms and not need a professional admin to get them started.

All in all, Spark, Shark, Spark Streaming, Bagel, etc. have a lot of potential; they are just a little bit rough around the edges…

Update: I am reviewing my opinion about Mesos. See the Mesos post.

Trident Storm, Real-Time Analytics for Big Data

August 13, 2012

In a previous post I already mentioned Storm. Trident is an extension of Storm that makes it an easy-to-use distributed real-time analytics framework for Big Data. Both Trident and Storm were open-sourced by Twitter (Storm originated at BackType, which Twitter acquired).

One of Twitter's major problems is keeping statistics on Tweets and tweeted URLs that get retweeted by millions of followers. Imagine a famous person who tweets a URL to millions of followers. Lots of those followers will retweet the URL. So how do you calculate how many Twitter users have seen the URL? This is important for features like “top retweeted URLs”.

The answer was Storm but with the addition of Trident, it has become a lot easier to manage. Trident is doing to Storm what Pig and Cascading are doing to Hadoop: simplification. Instead of having to create a lot of Spouts and Bolts and take care of how messages are distributed, Trident comes with a lot of the work already done.

In a few lines of code you set up a Distributed RPC server, send it URLs, and have it collect the tweeters and their followers and count them. Fail-over and resilience, as well as massive distributed throughput, are built into the platform. You can see it in this example code:
TridentState urlToTweeters =
    topology.newStaticState(getUrlToTweetersState());        // URL -> list of tweeters
TridentState tweetersToFollowers =
    topology.newStaticState(getTweeterToFollowersState());   // tweeter -> list of followers

topology.newDRPCStream("reach")
    // look up everybody who tweeted the URL passed in as the DRPC argument
    .stateQuery(urlToTweeters, new Fields("args"), new MapGet(), new Fields("tweeters"))
    .each(new Fields("tweeters"), new ExpandList(), new Fields("tweeter"))
    .shuffle()
    // for each tweeter, look up his or her followers
    .stateQuery(tweetersToFollowers, new Fields("tweeter"), new MapGet(), new Fields("followers"))
    .parallelismHint(200)
    .each(new Fields("followers"), new ExpandList(), new Fields("follower"))
    // count each follower only once
    .groupBy(new Fields("follower"))
    .aggregate(new One(), new Fields("one"))
    .parallelismHint(20)
    // the reach is the number of distinct followers
    .aggregate(new Count(), new Fields("reach"));

The possibilities of Trident + Storm, combined with fast scalable datastores like Cassandra, are enormous: everything from real-time counters and filtering to complex event processing and machine learning.
The Storm concept of Spout [data generation] and Bolt [data processing] can be easily understood by most programmers. Storm is an asynchronous, highly distributed framework, but with a simple distributed RPC server it can easily be used from synchronous code.
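
For instance, calling the "reach" function defined above from plain synchronous code is just a blocking call to the DRPC server. A minimal sketch using Storm's DRPC client (the host name is made up; 3772 is the default DRPC port):

import backtype.storm.utils.DRPCClient

// Connect to the DRPC server in front of the Storm cluster and call "reach"
// with a URL; the call blocks until the topology returns the count.
val client = new DRPCClient("drpc1.example.com", 3772)
val reach = client.execute("reach", "http://example.com/some-article")
println("Estimated reach: " + reach)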

The only drawback I have seen is that DRPC is focused on Strings (and other primitive types that can be encoded in a String). Adding support for more complex objects (via Kryo, Avro, Protocol Buffers, etc.), or at least raw bytes, would be useful for companies that do not focus only on Tweets.

Big Data Apps and Big Data PaaS

March 21, 2012

Enterprises no longer have a lack of data. Data can be obtained from everywhere. The hard part is converting data into valuable information that can trigger positive actions. The problem is that you currently need four experts to get this process up and running:

1) Data ETL expert – is able to extract, transform and load data into a central system.

2) Data Mining expert – is able to suggest great statistical algorithms and able to interpret the results.

3) Big Data programmer – is an expert in Hadoop, Map-Reduce, Pig,  Hive, HBase, etc.

4) Business expert – is able to guide the other experts in extracting the right information and taking the right actions based on the results.

A Big Data PaaS should focus on making sure that the first three are needed as little as possible. Ideally they are not needed at all.

How could a business expert be enabled in Big Data?

The answer is Big Data Apps and Big Data PaaS. What if a Big Data PaaS were available, ideally both open source and hosted, that comes with a community marketplace for Big Data ETL connectors and Big Data Apps? You would have Big Data ETL connectors to all major databases, Excel, Access, web server logs, Twitter, Facebook, LinkedIn, etc. For a fee, different data sources could be accessed in order to enhance the quality of your data. Companies should be able to easily buy access to each other's data on a pay-as-you-use basis.

The next step is Big Data Apps. Business experts often have very simple questions: “Which age group is buying my product?”, “Which products are also bought by my customers?”, etc. Small reusable Big Data Apps could be built by experts and reused by business experts.

A Big Data App example

A medium-sized company is selling household appliances. This company has a database with all its customers and another database with all its product sales. What if a Big Data App could find out which products tend to be sold together, and whether there are any specific customer features (age, gender, customer since, hobbies, income, number of children, etc.) or other features (e.g. time of the year) that are significant? Customer data in the company's database could be enhanced with publicly available information (from Facebook, Twitter, LinkedIn, etc.). Perhaps the Big Data App could find out that parents (number of children > 0) whose children like football (Facebook) are 90% more likely to buy waffle makers, pancake makers, oil fryers, etc. three times a year. Local football clubs might organize events three times a year to gain extra funding. Sponsorship, direct mailing, special offers, etc. could all help to attract more parents of football-loving kids to the shop.

Each Big Data App would focus on solving one specific problem: “finding products that are sold together”, “clustering customers based on social aspects”, etc. As long as a simple wizard can guide a non-technical expert in selecting the right data sources and understanding the results, it can be packaged up as a Big Data App. A marketplace could exist for the best Big Data Apps. External Big Data PaaS platforms could also allow data from different enterprises to be brought together and generate extra revenue, as long as individual persons cannot be identified.

Open Source Big Data Reporting & ETL show promise

March 16, 2012

With Hadoop/HBase/Hive, Cassandra, etc. you can store and manipulate petabytes of data. But what if you want nice-looking reports, or want to compare data held in a NoSQL solution with data held elsewhere? Two market leaders in the Open Source business intelligence space are now putting all their firepower into Big Data.

Pentaho Big Data seems to be a bit further ahead. They offer a graphical ETL tool, a report designer and a business intelligence server. These are existing tools, but support for Hadoop HDFS, Map-Reduce, HBase, Hive, Pig, Cassandra, etc. has been added.

Jaspersoft's Open Source Big Data strategy is a little bit behind because the connectors are not yet included in the main product, and several are still beta quality with missing documentation.

Both companies will accelerate the adoption of Big Data, since one of the main problems with Big Data is the lack of easy reporting: unstructured data is much harder to format into a well-structured report than structured data. Any solution that makes this possible, and is Open Source on top of that, is very welcome in times of cost cutting…

NextGen Hadoop, beyond MapReduce

Hadoop has run into architectural limitations and the community has started working on the Next Generation Hadoop [NGN Hadoop]. NGN Hadoop has some new management features, of which multi-tenant application management is the major one. However, the key change is that MapReduce is no longer entangled with the rest of Hadoop. This will allow Hadoop to be used for MPI, Machine Learning, Master-Worker, Iterative Processing, Graph Processing, etc. New tools to better manage Hadoop are also being incubated, e.g. Ambari and HCatalog.

Why is this important for telecom?
Having one platform that allows massive data storage, petabyte-scale data analytics, complex parallel computations, large-scale machine learning, Big Data Map-Reduce processing, etc., all in one multi-tenant set-up, means that telecom operators could see massive reductions in their architecture costs, together with faster go-to-market, better data intelligence, etc.

Telecom applications that are redesigned around this new paradigm can all use one shared back-office architecture. Having data centralized in one large Hadoop cluster, instead of tens or hundreds of application-specific databases, will enable data analytics possibilities not seen before and bring much-needed efficiencies.

Is this shared-architecture paradigm new? Not at all. Google has been using it since at least 2004, when the MapReduce paper was published, followed by BigTable.

What is needed is for several large operators to define this approach as their standard architecture, so that telecom solution providers start incorporating it into their solutions. Commercial support can easily be acquired from companies like Hortonworks, Cloudera, etc.

Having one shared data architecture and multi-tenant application virtualization in the form of a Telco PaaS would allow third parties to launch new services quickly and cheaply: think days instead of years…
