Posts Tagged ‘big data’

Solving the pressing need for Linux talent…

September 17, 2013

The Linux Foundation recently shared an infographic together with an associated report. The short message is: if you are a Linux expert you are in high demand, because companies cannot find enough experts due to the Cloud and Big Data boom.

Unless cloning machines are discovered later this year, quickly expanding the number of Linux experts is unlikely to happen. This means total cost of ownership for enterprises is likely to rise. This is ironic since Linux is all about open source and providing some of the most amazing solutions for free.

The obvious alternative is to focus on Microsoft products. They are relatively cheap in total cost of ownership since licenses are “affordable” and average Windows skills are easier to find.

However, Microsoft is losing the server war, especially in the web application space. So this is not a winning strategy if you are going to do Cloud.

How to solve the pressing need for Linux talent?

The only possible strategy is to lower the number of experts needed per company. Larger companies will always need some, but they should be focused on the “interesting high-value tasks”. This concept of interesting and high-value is key. With the number of cloud servers exploding, we cannot expect the number of experts to explode as well.

Open source products like Puppet and Chef have helped to alleviate the pain for the more “skilled” companies. One DevOps engineer is able to manage more than ten times as many machines as before. Unfortunately these server provisioning tools are not for the faint of heart. They require experts who know both administration and coding.

It is time for the next generation of tools. Ubuntu, the number one Cloud operating system, is leading the way with Juju. If Linux wants to continue to be successful, then the common problems, the boring problems and the repetitive problems should be solved. Solved by Linux gurus in such a way that we, the less IT gifted, can get instant solutions for these common problems.

We need a Linux democracy in which the lesser skilled, who are unfortunately the majority, can instantly reuse best-in-class blueprint solutions. Juju is a new class of tools that gives you instant solutions for all those common problems: scaling a web application, monitoring your infrastructure, sharding MongoDB, replicating a database, installing a Hadoop cluster, setting up continuous integration, etc. The individual software components have been “charmed”. A Charm allows the software to be instantly deployed, integrated and scaled. However, the real revolution is just starting. Juju will have bundles pretty soon. Technically speaking, a bundle is a collection of pre-configured and integrated Charms. In layman’s terms, a bundle is an instant solution for a common problem. You instantly deploy a bundle [one command or drag-and-drop] and you get a blueprint solution. Since Juju is open source, the community can create as many instant solutions as there are common problems.

So if you want to scale your IT solutions without stretching your budget or cloning your employees, and without the lock-in of proprietary and expensive commercial software, then you should try Juju today. Play with the GUI or install Juju today.

MapD – Massively Parallel GPU-based database

An MIT student recently created a new type of massively parallel database, one that runs on graphics processors instead of CPUs. MapD, as it has been called, makes use of the immense computational power available in off-the-shelf graphics cards that can be found in any laptop or PC. MapD is especially suitable for real-time querying, data analysis, machine learning and data visualization. MapD is probably only one of many databases that will try new hardware configurations to cater for specific application use cases.

Alternative approaches could focus on large sets of cheap mobile processors, Parallella processors, Raspberry Pis, etc., all stitched together. The idea would be to create massive processing clouds based on cheap specialized hardware that could beat traditional CPU Clouds in both price and performance, at least for some specific use cases…

5 Strategies for Making Money with the Cloud

January 22, 2013 1 comment

Everybody is hearing about Cloud Computing on television now. Operators will store your contacts in the Cloud. Hosting companies will host your website in the Cloud. Others will store your photos in the Cloud.

However, how do you make money with the Cloud?

The first thing is to forget about infrastructure and virtualization. If you think that in 2013 the world needs more IaaS providers, then you haven’t seen what is currently on offer (Amazon, Microsoft, Google, Rackspace, Joyent, Verizon/Terremark, IBM, HP, etc.).

So what are the alternative strategies?

1) Rocket Internet SaaS Cloning

Your best hope is SaaS and PaaS. The best markets are non-English speaking markets. We have seen an explosion of SaaS in the USA but most have not made it to the rest of the world yet. Only some bigger SaaS solutions (Webex, GoToMeeting, Office 365, etc.) and PaaS platforms (Salesforce, Workday, etc.) are available outside of the US and the UK. However, most SaaS and PaaS solutions are currently still English-only. So the quickest way to make some money is to just copy, translate and paste some successful English-only SaaS product. If you do not know how to copy dotcoms, take a look at how the Rocket Internet team is doing it. Of course you should always be open to those annoying problems everybody has that could use a new innovative solution, and create your own SaaS for them.

2) SaaSification

During the gold rush, be the restaurant, hotel or tool shop. While everybody is looking for the SaaS gold, offer solutions that will save gold diggers time and money. SaaSification allows others to focus on building their SaaS business, not on reinventing for the millionth time a web page, web store, email server, search, CRM, monthly subscription billing, reporting, BI, etc. Instead of a “Use Shopify to create your online store”, it should be “Use <YOUR PRODUCT> to create a SaaS Business”.

3) Mobile & Cloud

Everybody has, or is at least thinking about buying, a smartphone. However, there are very few really good mobile services that fully exploit the Cloud. Yes, I can get a shopping list app, but most are just glorified to-do lists. None recommends where to go and buy based on current promotions and comparison with other buyers. None helps me find products inside a large supermarket. None learns from my shopping habits and suggests items for the list. None allows me to take a number at the seafood queue. These are just examples for one mobile + cloud app. Think about any other field and you are sure to find great ideas.

4) Specialized IaaS

I mentioned it before, IaaS is already overcrowded but there is one exception: specialized IaaS. You can focus on specialized hardware, e.g. virtualized GPU, DSP, mobile ARM processors. On network virtualization like SDN and Openflow. Mobile and tablet virtualization. Embedded device virtualization. Machine Learning IaaS. Car Software virtualization.

5) Disruptive Innovations + Cloud

Selling disruptive innovations and offering them as Cloud services. Examples could be 3D printing services, wireless sensor networks / M2M, Big Data, Wearable Tech, Open Source Hardware, etc. The Cloud will lower your costs and give you a global elastically scalable solution.

Big Data 2013 Predictions

January 1, 2013 5 comments

If you just invested a lot of money in a Big Data solution from any of the traditional BI vendors (Teradata, IBM, Oracle, SAS, EMC, HP, etc.) then you are likely to see a sub-optimal ROI in 2013.

Several innovations will come in 2013 that will change the value of Big Data exponentially. Other technology innovations are just waiting for smart start-ups to put them into good use.

Real-Time Hadoop

The first major innovation will be solutions modelled on Google’s Dremel, like Impala, Drill, etc., coming of age. They will allow real-time queries on Big Data and be open source. So you will get, for free, an offering that is superior to what is currently available.

Cloud-Based Big Data Solutions

The absolute market leader is Amazon with EMR. Elastic Map Reduce is not so much about being able to run a Map Reduce operation in the Cloud but about paying for what you use and not more. The traditional BI vendors are still getting their heads around usage-based licensing for the Cloud. Expect a lot of smart startups to come up with really innovative Big Data and Cloud solutions.

Big Data Appliances

You can buy some really expensive Big Data Appliances, but here too disruptive players are likely to change the market. GPUs are relatively cheap. Stack them into servers and use something like Virtual OpenCL to make your own GPU virtualization cluster solution. These types of home-made GPU clusters are already being used for security-related Big Data work.

Also expect more hardware vendors to pack mobile ARM processors into server boxes. Dell, HP, etc. are already doing it. Imagine the potential for Distributed Map Reduce.

Finally, Parallella will put a 16-core supercomputer into everybody’s hands for $99. Their 2013 supercomputer challenge is definitely something to keep your eyes on. Their roadmap talks about 64 and 1000 core versions. If Adapteva can keep their promises and flood the market with Parallellas, then expect Parallella Clusters to be the Big Data Appliance of 2013.

Distributed Machine Learning

Mahout is a cool project but Map Reduce might not be the best possible architecture to run iterative distributed backpropagation or any other machine learning algorithms. Jubatus looks promising. Also algorithm innovations like HogWild could really change the dynamics for efficient distributed machine learning. This space is definitely ready for more ground-breaking innovations in 2013.

Easier Big Data Tools

This is still a big blank spot in the Open Source field. Having Open Source and easy-to-use drag-and-drop tools for Big Data Analytics would really accelerate adoption. We already have some good commercial examples (Radoop = RapidMiner + Mahout, Tableau, Datameer, etc.) but we are missing good Open Source tools.

I am currently looking for new challenges, so if you are active in the Big Data space and are looking for a knowledgeable senior executive, be sure to contact me at maarten at telruptive dot com.

How can I generate new revenues from my data is the wrong question…

December 27, 2012 1 comment

With Big Data in the news all day, you would think that having a lot of high quality data is a guarantee for new revenues. However asking yourself how to generate new revenues from existing data is the wrong question. It is a sub-optimal question because it is like having a hammer and assuming everything else is a nail.

A better question to ask is: “What data insight problems do potential customers have that I could solve?”

Scaling Machine Learning

October 17, 2012 1 comment

There is currently still a vacuum for easy & scalable solutions in the machine learning space.

At the moment everybody is talking about Hadoop as the de facto standard for Big Data. Unfortunately Hadoop is not a real-time system. Map-Reduce can be used for batch machine learning like training a Logistic Regression, Support Vector Machine or Neural Network, Batch Gradient Descent, etc. However, when it comes to real-time predictions it is not the platform of choice. Additionally, Java is losing its status as the preferred language every day. New machine learning algorithms are more likely to be developed in R, Scala, Python, Go, etc. There is of course Mahout, which is scalable, but “easy” is not the word that describes it.
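To make the batch point concrete, here is a minimal sketch in plain Java (no Hadoop API involved) of why Batch Gradient Descent fits the Map-Reduce shape: each partition computes a partial gradient over its own rows (the “map”), the partial gradients are added together (the “reduce”), and only then can the weights be updated once. The class and method names are made up for this illustration, and the model is assumed to be a simple logistic regression. The sketch also shows the drawback: every single weight update costs one full pass over the data, which is a whole batch job on Hadoop.

```java
import java.util.Arrays;

// Hypothetical illustration; names are not taken from Hadoop or Mahout.
final class BatchGradientSketch {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // "Map" step: one partition computes the gradient sum over its own rows only.
    static double[] partialGradient(double[][] xs, int[] ys, double[] w) {
        double[] g = new double[w.length];
        for (int i = 0; i < xs.length; i++) {
            double z = 0.0;
            for (int j = 0; j < w.length; j++) {
                z += w[j] * xs[i][j];
            }
            double error = sigmoid(z) - ys[i];
            for (int j = 0; j < w.length; j++) {
                g[j] += error * xs[i][j];
            }
        }
        return g;
    }

    // "Reduce" step: partial gradients are combined by element-wise addition.
    static double[] add(double[] a, double[] b) {
        double[] out = new double[a.length];
        for (int j = 0; j < a.length; j++) {
            out[j] = a[j] + b[j];
        }
        return out;
    }

    // Driver: one full pass over all partitions yields exactly one weight update.
    static double[] step(double[][][] xParts, int[][] yParts, double[] w, double learningRate) {
        double[] total = new double[w.length];
        int rows = 0;
        for (int p = 0; p < xParts.length; p++) {
            total = add(total, partialGradient(xParts[p], yParts[p], w));
            rows += xParts[p].length;
        }
        double[] next = new double[w.length];
        for (int j = 0; j < w.length; j++) {
            next[j] = w[j] - learningRate * total[j] / rows;
        }
        return next;
    }

    public static void main(String[] args) {
        // Two partitions with one row each; labels are 1 and 0.
        double[][][] xParts = { { {1.0, 2.0} }, { {1.0, -2.0} } };
        int[][] yParts = { {1}, {0} };
        double[] w = step(xParts, yParts, new double[] {0.0, 0.0}, 1.0);
        System.out.println(Arrays.toString(w));
    }
}
```

An algorithm that needs thousands of such steps pays the job start-up and disk cost thousands of times, which is the reason in-memory or streaming engines look attractive for iterative training.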

If you want to create your own algorithms but do not want to go low-level Java Map-Reduce, then there are some alternatives like Pig [for the SQL-minded], Cascading [Java but easy and allows test driven development!], Scalding [Scala on top of Cascading, made by Twitter. Could be combined with libraries like Scalala for easy vector and matrix operations similar to Matlab], etc.

What other options are there?
Storm could be an option for time series, predictions based on a pre-trained model, online learning algorithms, etc. However what is missing is an extension like Trident, but for distributed machine learning, that avoids having to reinvent the wheel. A sort of Mahout for Storm.

Spark is another option. But Mesos is still very early days and also here a Mahout for Spark would be a good addition. In comparison with Storm, Spark would be ideal for training complex machine learning algorithms that need to iterate millions of times over the same data set.

GraphLab can be an option for those who are looking for social network analytics or other graph-based machine learning.

If you want to work with R then you could use packages like Snow or Parallel. But this would mean you need to reinvent a lot of the distributed management of processing nodes. Both packages just incorporate the basic functions to launch some external processing nodes but lack professional management of a large cluster. You could also look at RHadoop, as long as you are fine with non-real-time on top of Hadoop. As an alternative to RHadoop you could look at Rhipe. Segue is R + Amazon Elastic Map Reduce, etc.

Update: an interesting extension for R (i.e. pbd) has just been released that promises R execution on over 10,000 cores.

What is missing?

Simplicity, easy to use & reusable. What is needed is a solution that is cross-platform (R, Scala, Java, Python, Matlab, etc.). With a visual interface like RapidMiner or Knime, that allows 80% of the work to be drag-and-drop. With a re-useable library of the most used algorithms for prediction, clustering, classification, outlier detection, dimension reduction, normalization, etc. Ideally with a marketplace for sharing data and algorithms. With an easy interface to manage your data and create reports, think similar to Datameer. Ideally integrated with tools for data cleaning (e.g. Google’s Refine) and ETL (e.g. Pentaho, Talend, Jasper Reports, etc.). But most of all with a powerful distributed engine that allows both batch processing [Hadoop] and real-time [e.g. Storm]. And finally with a one click install.

If my requirements are missing some important aspects, let me know. If you want to construct such a system, please contact me…

Trident Storm, Real-Time Analytics for Big Data

August 13, 2012 4 comments

In a previous post I mentioned Storm already. Trident is an extension of Storm that makes it an easy-to-use distributed real-time analytics framework for Big Data. Both Trident and Storm were developed by Twitter.

One of Twitter’s major problems is to keep statistics of Tweets and Tweeted URLs that get retweeted by millions of followers. Imagine a famous person who tweets a URL to millions of followers. Lots of followers will retweet the URL. So how do you calculate how many Tweeters have seen the URL? This is important for features like “Top retweeted URLs”.

The answer was Storm but with the addition of Trident, it has become a lot easier to manage. Trident is doing to Storm what Pig and Cascading are doing to Hadoop: simplification. Instead of having to create a lot of Spouts and Bolts and take care of how messages are distributed, Trident comes with a lot of the work already done.

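In a few lines of code, you set up a Distributed RPC server, send it URLs, have it collect the tweeters and followers, and count them. Fail-over and resilience, as well as massively distributed throughput, are built into the platform. You can see it in this example code, which follows the reach example from the Trident tutorial:

// Static state that maps a URL to the people who tweeted it.
// getUrlToTweetersState() is a placeholder helper returning a StateFactory for your own store.
TridentState urlToTweeters =
    topology.newStaticState(getUrlToTweetersState());
// Static state that maps a tweeter to his or her followers (same kind of placeholder helper).
TridentState tweetersToFollowers =
    topology.newStaticState(getTweeterToFollowersState());

// The DRPC request arrives as a single field called "args", which holds the URL.
topology.newDRPCStream("reach")
    .stateQuery(urlToTweeters, new Fields("args"), new MapGet(), new Fields("tweeters"))
    .each(new Fields("tweeters"), new ExpandList(), new Fields("tweeter"))
    .stateQuery(tweetersToFollowers, new Fields("tweeter"), new MapGet(), new Fields("followers"))
    .each(new Fields("followers"), new ExpandList(), new Fields("follower"))
    // Group by follower so that each follower is counted only once.
    .groupBy(new Fields("follower"))
    .aggregate(new One(), new Fields("one"))
    .aggregate(new Count(), new Fields("reach"));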
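
For readers who do not know Trident, the computation itself is simple. The sketch below is plain single-machine Java and not Storm code: it looks up the tweeters of a URL, looks up the followers of each tweeter, removes duplicates and counts what is left. The class name, the in-memory maps and the sample names are invented for this illustration. Trident’s contribution is doing the same thing partitioned over many machines.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical single-machine equivalent of the reach query; not part of Storm.
final class ReachSketch {

    // Reach = number of distinct followers of everyone who tweeted the URL.
    static int reach(String url,
                     Map<String, List<String>> urlToTweeters,
                     Map<String, List<String>> tweetersToFollowers) {
        Set<String> distinctFollowers = new HashSet<>();
        List<String> tweeters =
            urlToTweeters.getOrDefault(url, Collections.<String>emptyList());
        for (String tweeter : tweeters) {
            // A follower who follows several tweeters ends up in the set only once.
            distinctFollowers.addAll(
                tweetersToFollowers.getOrDefault(tweeter, Collections.<String>emptyList()));
        }
        return distinctFollowers.size();
    }

    public static void main(String[] args) {
        Map<String, List<String>> urlToTweeters = new HashMap<>();
        urlToTweeters.put("http://example.com", Arrays.asList("alice", "bob"));

        Map<String, List<String>> tweetersToFollowers = new HashMap<>();
        tweetersToFollowers.put("alice", Arrays.asList("carol", "dave"));
        tweetersToFollowers.put("bob", Arrays.asList("dave", "erin"));

        // dave follows both alice and bob but is counted once.
        System.out.println(reach("http://example.com", urlToTweeters, tweetersToFollowers));
    }
}
```

The set plays the role of the groupBy on follower in the Trident version, and its size plays the role of the final count.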

The possibilities of Trident + Storm, combined with fast scalable datastores, like for instance Cassandra, are enormous. Everything from real-time counters, filtering, complex event processing, machine learning, etc.
The Storm concept of Spout [data generation] and Bolt [data processing] can be easily understood by most programmers. Storm is an asynchronous highly distributed framework but with a simple distributed RPC server it can easily be used in synchronous code.

The only drawback I have seen is that DRPC is focused only on Strings (and other primitive types that can be contained in a String). Adding more complex objects (via Kryo, Avro, Protocol Buffers, etc.), or at least bytes, would be useful for companies that do not only focus on Tweets.
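
Until that exists, one possible workaround is to serialize the object to bytes with whatever library you prefer and wrap the bytes in a Base64 string, so that it fits through a String-only interface. The sketch below uses only java.util.Base64 from the JDK; the class name is invented and nothing in it is part of the Storm API. It is an assumption that the extra encoding overhead is acceptable for your payload sizes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper; not part of Storm. Carries arbitrary bytes inside a String argument.
final class DrpcStringCodec {

    // Caller side: turn serialized bytes (e.g. from Kryo or Avro) into a String argument.
    static String encode(byte[] payload) {
        return Base64.getEncoder().encodeToString(payload);
    }

    // Topology side: recover the original bytes from the String argument.
    static byte[] decode(String argument) {
        return Base64.getDecoder().decode(argument);
    }

    public static void main(String[] args) {
        byte[] original = "any serialized object".getBytes(StandardCharsets.UTF_8);
        String argument = encode(original);
        byte[] restored = decode(argument);
        System.out.println(argument);
        System.out.println(Arrays.equals(original, restored));
    }
}
```

Base64 grows the payload by roughly a third, so native support for bytes would still be the better solution.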

Data Analytics as a Service

April 18, 2012 2 comments

Every company is using Microsoft Office and especially Excel to do some sort of data analytics. However data volumes have grown exponentially and have outgrown Spreadsheets. You need experts in the business domain, in data analytics, in data migration/extraction/transformation/loading, in server management, etc. to get data analytics done on Big Data scale. This makes it expensive and only usable for the happy few.

Why? There must be easier ways to do it.

I think there are. If you are unfamiliar with data analytics but eager to learn, you should take a look at a product called RapidMiner. It is close to amazing how a non-expert is able to use Neural Networks, Decision Trees, Support Vector Machines, Genetic Algorithms, etc. and get meaningful results in minutes. The amazing part is also that RapidMiner is open source, hence for usage by one analyst it is free. The company behind RapidMiner also offers server software to run data analytics remotely. It is here where big data opportunities meet easy data analytics. What if RapidMiner data analytics could be run on hundreds of servers in parallel and you pay by usage, just as you pay for any Cloud compute and storage instances?

RapidMiner as a Service

RapidMiner as a Service, RMaaS, would allow millions of business people to be able to analyse Big Data “without Big Investments”. This type of Data Analytics as a Service would provide any SME with the same data analytics tools as large corporations. Data could come from Amazon S3, Amazon’s DynamoDB, Hosted Hadoops, any webservices, any social network, etc.

Visual as a Service

RapidMiner as a Service is only one of the many domain specific tools that could be offered as a visual drag-and-drop Cloud service. VAS as a Service is another example in which complex telecom assets can be easily combined in a drag-and-drop manner. There are many more. These services will be the real revolution of Cloud Computing since they combine IaaS/PaaS/SaaS into a new generation of solutions that bring large savings for new users and potential large revenues for their providers…

10 ways telecom can make money in the future a.k.a. telecom revenue 2.0

LTE roll-outs are taking place in America and Europe. Over-the-top-players are likely to start offering large-scale and free HD mobile VoIP over the next 6-18 months. Steeply declining ARPU will be the result. The telecom industry needs new revenue: telecom revenue 2.0. How can they do it?

1. Become a Telecom Venture Capitalist

Buying the number 2 or 3 player in a new market or creating a copy-cat solution has not worked. Think about Terra/Lycos/Vivendi portals, Keteque, etc. So the better option is to make sure innovative startups get partly funded by telecom operators. This assures that operators will be able to launch innovative solutions in the future. Just being a VC will not be enough. Investment in quickly launching the new startup services and incorporating them into the existing product catalog is also necessary.

2. SaaSification & Monetization

SaaS monetization is not reselling SaaS and keeping a 30-50% revenue share. SaaS monetization means offering others the development/hosting tools, sales channels, support facilities, etc. to quickly launch new SaaS solutions that are targeted at new niche or long tail segments. SaaSification means that existing license-based on-site applications can be quickly converted into subscription-based SaaS offerings. The operator is a SaaS enabler and brings together SaaS creators with SaaS customers.

3. Enterprise Mobilization, BPaaS and BYOD

There are millions of small, medium and large enterprises whose employees bring smartphones and tablets to work [a.k.a. BYOD - bring-your-own-device]. Managing these devices (security, provisioning, etc.) as well as mobilizing applications and internal processes [a.k.a. BPaaS - business processes as a service] will be a big opportunity. Corporate mobile app and mobile SaaS stores will be an important starting point. Solutions to quickly mobilize existing solutions, ideally without programming, should come next.

4. M2M Monetization Solutions

At the moment M2M does not have major industry standards yet. Operators are ideally positioned to bring standards to quickly connect millions of devices and sensors to value added services. Most of these solutions will not be SIM-based so a pure-SIM strategy is likely to fail. Operators should think about enabling others to take advantage of the M2M revolution instead of building services themselves. Be the restaurant, tool shop and clothing store and not the gold digger during a gold rush.

5. Big Data and Data Intelligence as a Service

Operators are used to managing petabytes of data. However, converting this data into information and knowledge is the next step towards monetizing data. At the moment big data solutions focus on storing, manipulating and reporting large volumes of data. However, the Big Data revolution is only just starting. We need big data apps, big data app stores, “big datafication” tools, etc.

6. All-you-can-eat HD Video-on-Demand

Global content distribution can be done better with the help of operators than without. Exporting Netflix-like business models to Europe, Asia, Africa, Latin-America, etc. is urgently necessary if Hollywood wants to avoid the next generation believing “content = free”. All-you-can-eat movies, series and music for €15/month is what should be aimed for.

7. NFC, micro-subscriptions, nano-payments, anonymous digital cash, etc.

Payment solutions are hot. Look at Paypal, Square, Dwolla, etc. Operators could play it nice and ask Visa, Mastercard, etc. how they can assist. However, going a more disruptive route and helping Square and Dwolla serve a global marketplace is probably more lucrative. Besides NFC solutions, micro-subscriptions (e.g. €0.05/month) or nano-payments (e.g. €0.001/transaction) should also be looked at.

Don’t forget that people will still want to buy things in a digital world which they do not want others to know about or from people or companies they do not trust. Anonymous digital cash solutions are needed when physical cash is no longer available. Unless of course you expect people to buy books about getting a divorce with the family’s credit card…

8. Build your own VAS for consumers and enterprises – iVAS.

Conference calls, PBX, etc. were the most advanced communication solutions offered by operators until recently. However, creating visual drag-and-drop environments in which non-technical users can combine telecom and web assets to create new value-added-services can result in a new generation of VAS: iVAS. A VAS in which personal problems are solved by the people who suffer them. Especially in emerging countries where wide-spread smartphones and LTE are still some years off, iVAS can still have a good 3-5 years ahead. Examples would be personalized numbering schemes for my family & friends, distorting voices when I call somebody, etc. Let consumers and small enterprises be the creators by offering them visual do-it-yourself tools. Combine solutions like Invox, OpenVBX, Google’s App Inventor, etc.

9. Software-defined networking solutions & Network as a Service

Networks are changing from hardware to software. This means network virtualization, outsourcing of network solutions (e.g. virtualized firewalls), etc. Operators are in a good position to offer a new generation of complex network solutions that can be very easily managed via a browser. Enterprises could substitute expensive on-site hardware for cheap monthly subscriptions of virtualized network solutions.

10. Long-Tail Solutions

Operators could be offering a large catalog of long-tail solutions that are targeted at specific industries or problem domains. Thousands of companies are building multi-device solutions. Mobile &  SmartTV virtualization and automated testing solutions would be of interest to them. Low-latency solutions could be of interest to the financial sector, e.g. automated trading. Call center and customer support services on-demand and via a subscription model. Many possible services in the collective intelligence, crowd-sourcing, gamification, computer vision, natural language processing, etc. domains.

Basically operators should create new departments that are financially and structurally independent from the main business and that look at new disruptive technologies/business ideas and how either directly or via partners new revenue can be generated with them.

What not to do?

Waste any more time. Do not focus on small or late-to-market solutions, e.g. reselling Microsoft 365, RCS like Joyn, etc. Focus on industry-changers, disruptive innovations, etc.

Yes, LTE roll-out is important but without any solutions for telecom revenue 2.0, LTE will just kill ARPU. So action is required now. Action needs to be quick [forget about RFQs], agile [forget about standards - the iPhone / AppStore is a proprietary solution], well subsidized [no supplier will invest big R&D budgets to get a 15% revenue share] and independent [of red tape and corporate control so risk taking is rewarded, unless of course you predicted 5 years ago that Facebook and Angry Birds would be changing industries]…

Big Data Apps and Big Data PaaS

March 21, 2012 5 comments

Enterprises no longer have a lack of data. Data can be obtained from everywhere. The hard part is to convert data into valuable information that can trigger positive actions. The problem is that you currently need four experts to get this process up and running:

1) Data ETL expert – is able to extract, transform and load data into a central system.

2) Data Mining expert – is able to suggest great statistical algorithms and able to interpret the results.

3) Big Data programmer – is an expert in Hadoop, Map-Reduce, Pig,  Hive, HBase, etc.

4) A business expert – that is able to guide all the experts into extracting the right information and taking the right actions based on the results.

A Big Data PaaS should focus on making sure that the first three are needed as little as possible. Ideally they are not needed at all.

How could a business expert be enabled in Big Data?

The answer is Big Data Apps and Big Data PaaS. What if a Big Data PaaS is available, ideally open source as well as hosted, that comes with a community marketplace for Big Data ETL connectors and Big Data Apps? You would have Big Data ETL connectors to all major databases, Excel, Access, Web server logs, Twitter, Facebook, Linkedin, etc. For a fee different data sources could be accessed in order to enhance the quality of data. Companies should be able to easily buy access to data of others on a Pay-as-you-use basis.

The next steps are Big Data Apps. Business experts often have very simple questions: “Which age group is buying my product?”, “Which products are also bought by my customers?”, etc. Small re-useable Big Data Apps could be built by experts and reused by business experts.

A Big Data App example

A medium sized company is selling household appliances. This company has a database with all the customers. Another database with all the product sales. What if a Big Data App could find which products tend to be sold together and if there are any specific customer features (age, gender, customer since, hobbies, income, number of children, etc.) and other features (e.g. time of the year) that are significant? Customer data in the company’s database could be enhanced with publicly available information (from Facebook, Twitter, Linkedin, etc.). Perhaps the Big Data App could find out that parents (number of children >0), whose children like football (Facebook), are 90% more likely to buy waffle makers, pancake makers, oil fryers, etc. three times a year. Local football clubs might organize events three times a year to gain extra funding. Sponsorship, direct mailing, special offers, etc. could all help to attract more parents, of football-loving-kids, to the shop.
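
The core of such a “products sold together” app can be very small. The sketch below is a plain Java illustration, not an existing product: it counts how often each pair of products appears in the same order. The class name, the order data and the “a + b” key format are all invented for the example; a real Big Data App would run the same counting distributed over the sales database and add significance tests on the customer features.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration of co-purchase counting; names and data are made up.
final class CoPurchaseSketch {

    // Counts, for every pair of products, the number of orders that contain both.
    static Map<String, Integer> pairCounts(List<List<String>> orders) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> order : orders) {
            // Deduplicate and sort so that each pair gets one canonical key per order.
            List<String> items = new ArrayList<>(new TreeSet<>(order));
            for (int i = 0; i < items.size(); i++) {
                for (int j = i + 1; j < items.size(); j++) {
                    counts.merge(items.get(i) + " + " + items.get(j), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> orders = Arrays.asList(
            Arrays.asList("waffle maker", "pancake maker"),
            Arrays.asList("pancake maker", "waffle maker", "oil fryer"),
            Arrays.asList("oil fryer"));
        for (Map.Entry<String, Integer> entry : pairCounts(orders).entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

The business expert would only see the wizard and the ranked list of pairs; the counting logic stays hidden inside the app.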

The Big Data Apps would focus on solving a specific problem each: “Finding products that are sold together”, “Clustering customers based on social aspects”, etc. As long as a simple wizard can guide a non-technical expert in selecting the right data sources and understanding the results, it could be packaged up as a Big Data App. A marketplace could exist for the best Big Data Apps. External Big Data PaaS platforms could also allow data from different enterprises to be brought together and generate extra revenue, as long as individual persons cannot be identified.

