An MIT student recently created a new type of massively distributed database, one that runs on graphics processors instead of CPUs. Mapd, as it has been called, makes use of the immense computational power available in off-the-shelf graphics cards that can be found in any laptop or PC. Mapd is especially suitable for real-time querying, data analysis, machine learning and data visualization. Mapd is probably only one of many databases that will try new hardware configurations to cater for specific application use cases.
Alternative approaches could focus on large sets of cheap mobile processors, Parallella processors, Raspberry PIs, etc. all stitched together. The idea would be to create massive processing clouds based on cheap specialized hardware that could beat traditional CPU Clouds both in price and performance at least for some specific use cases…
Everybody is hearing about Cloud Computing on television now. Operators will store your contacts in the Cloud. Hosting companies will host your website in the Cloud. Others will store your photos in the Cloud.
However, how do you make money with the Cloud?
The first thing is to forget about infrastructure and virtualization. If you think that in 2013 the world needs more IaaS providers, then you haven’t seen what is currently on offer (Amazon, Microsoft, Google, Rackspace, Joyent, Verizon/Terremark, IBM, HP, etc.).
So what are the alternative strategies?
1) Rocket Internet SaaS Cloning
Your best hope is SaaS and PaaS. The best markets are non-English speaking markets. We have seen an explosion of SaaS in the USA but most products have not made it to the rest of the world yet. Only some bigger SaaS solutions (Webex, GoToMeeting, Office 365, etc.) and PaaS platforms (Salesforce, Workday, etc.) are available outside of the US and the UK. However, most SaaS and PaaS solutions are currently still English-only. So the quickest way to make some money is to copy, translate and paste a successful English-only SaaS product. If you do not know how to copy dotcoms, take a look at how the Rocket Internet team is doing it. Of course you should always stay open to those annoying problems everybody has that could use a new innovative solution, and create your own SaaS around them.
During the gold rush, be the restaurant, hotel or tool shop. While everybody is looking for the SaaS gold, offer solutions that will save gold diggers time and money. SaaSification allows others to focus on building their SaaS business, not on reinventing for the millionth time a web page, web store, email server, search, CRM, monthly subscription billing, reporting, BI, etc. Instead of a “Use Shopify to create your online store”, it should be “Use <YOUR PRODUCT> to create a SaaS Business”.
3) Mobile & Cloud
Everybody has, or is at least thinking about buying, a smartphone. However, there are very few really good mobile services that fully exploit the Cloud. Yes, I can get a shopping list app, but most are just glorified to-do lists. None recommends where to go and buy based on current promotions and comparisons with other buyers. None helps me find products inside a large supermarket. None learns from my shopping habits and suggests items for the list. None allows me to take a number for the seafood queue. These are just examples for one mobile + cloud app. Think about any other field and you are sure to find great ideas.
4) Specialized IaaS
I mentioned it before: IaaS is already overcrowded, but there is one exception, specialized IaaS. You can focus on specialized hardware, e.g. virtualized GPUs, DSPs or mobile ARM processors. Other options are network virtualization like SDN and OpenFlow, mobile and tablet virtualization, embedded device virtualization, Machine Learning IaaS and car software virtualization.
5) Disruptive Innovations + Cloud
Selling disruptive innovations and offering them as Cloud services. Examples could be 3D printing services, wireless sensor networks / M2M, Big Data, Wearable Tech, Open Source Hardware, etc. The Cloud will lower your costs and give you a global elastically scalable solution.
If you just invested a lot of money in a Big Data solution from any of the traditional BI vendors (Teradata, IBM, Oracle, SAS, EMC, HP, etc.) then you are likely to see a sub-optimal ROI in 2013.
Several innovations will come in 2013 that will change the value of Big Data exponentially. Other technology innovations are just waiting for smart start-ups to put them into good use.
The first major innovation will be open source solutions modelled on Google’s Dremel, like Impala, Drill, etc., coming of age. They will allow real-time queries on Big Data and be open source. So you will get, for free, an offering that is superior to what is currently available.
Cloud-Based Big Data Solutions
The absolute market leader is Amazon with EMR. Elastic Map Reduce is not so much about being able to run a Map Reduce operation in the Cloud but about paying for what you use and not more. The traditional BI vendors are still getting their heads around usage-based licensing for the Cloud. Expect a lot of smart startups to come up with really innovative Big Data and Cloud solutions.
Big Data Appliances
You can buy some really expensive Big Data Appliances, but here too disruptive players are likely to change the market. GPUs are relatively cheap. Stack them into servers and use something like Virtual OpenCL to make your own GPU virtualization cluster solution. These types of home-made GPU clusters are already being used for security-related Big Data work.
Finally Parallella will put a 16-core supercomputer into everybody’s hands for $99. Their 2013 supercomputer challenge is definitely something to keep your eyes on. Their roadmap talks about 64 and 1000 core versions. If Adapteva can keep their promises and flood the market with Parallellas, then expect Parallella clusters to be the Big Data Appliance of 2013.
Distributed Machine Learning
Mahout is a cool project but Map Reduce might not be the best possible architecture to run iterative distributed backpropagation or any other machine learning algorithms. Jubatus looks promising. Also algorithm innovations like HogWild could really change the dynamics for efficient distributed machine learning. This space is definitely ready for more ground-breaking innovations in 2013.
Easier Big Data Tools
This is still a big blank spot in the Open Source field. Having Open Source and easy to use drag-and-drop tools for Big Data Analytics would really accelerate adoption. We already have some good commercial examples (Radoop = RapidMiner + Mahout, Tableau, Datameer, etc.) but we are missing good Open Source tools.
I am currently looking for new challenges, so if you are active in the Big Data space and are looking for a knowledgeable senior executive be sure to contact me at maarten at telruptive dot com.
With Big Data in the news all day, you would think that having a lot of high quality data is a guarantee for new revenues. However asking yourself how to generate new revenues from existing data is the wrong question. It is a sub-optimal question because it is like having a hammer and assuming everything else is a nail.
A better question to ask is: “What data insight problems do potential customers have that I could solve?”
There is currently still a vacuum for easy & scalable solutions in the machine learning space.
At the moment everybody is talking about Hadoop as the de-facto standard for Big Data. Unfortunately Hadoop is not a real-time system. Map-Reduce can be used for batch machine learning like training a Logistic Regression/Support Vector Machine/Neural Network, Batch Gradient Descent, etc. However, when it comes to real-time predictions it is not the platform of choice. Additionally, Java is losing its status as the preferred language day by day. New machine learning algorithms are more likely to be developed in R, Scala, Python, Go, etc. There is of course Mahout, which is scalable, but “easy” is not a word you would use to describe it.
If you want to create your own algorithms but do not want to go low-level Java Map-Reduce, then there are some alternatives like Pig [for the SQL-minded], Cascading [Java but easy and allows test driven development!], Scalding [Scala on top of Cascading, made by Twitter. Could be combined with libraries like Scalala for easy vector and matrix operations similar to Matlab], etc.
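To make the batch pattern concrete, here is a minimal sketch in plain Java, assuming a toy model y = w * x and data that is already split into partitions. It does not use the Hadoop API; the class and method names are hypothetical. Each partition computes a partial gradient (the "map" step), the partial gradients are summed (the "reduce" step), and every iteration is one full pass over all the data, which is why this style suits batch training rather than real-time prediction.

```java
import java.util.List;

// Hypothetical single-JVM illustration of map-reduce style batch gradient descent.
// It is not Hadoop code; it only mirrors the map and reduce steps.
final class BatchGradientDescentSketch {

    // "Map" step: one partition computes its partial gradient for y = w * x.
    // Each row of the partition is {x, y}.
    static double partialGradient(double[][] partition, double w) {
        double sum = 0.0;
        for (double[] xy : partition) {
            sum += (w * xy[0] - xy[1]) * xy[0];
        }
        return sum;
    }

    // Each iteration is one full pass (one batch job) over every partition.
    static double train(List<double[][]> partitions, int iterations, double learningRate) {
        long n = partitions.stream().mapToLong(p -> p.length).sum();
        double w = 0.0;
        for (int i = 0; i < iterations; i++) {
            final double current = w;
            // "Reduce" step: sum the partial gradients from all partitions.
            double gradient = partitions.stream()
                    .mapToDouble(p -> partialGradient(p, current))
                    .sum();
            w -= learningRate * gradient / n;
        }
        return w;
    }

    public static void main(String[] args) {
        // Toy data generated from y = 2x, split into two partitions.
        List<double[][]> partitions = List.of(
                new double[][] {{1, 2}, {2, 4}},
                new double[][] {{3, 6}});
        System.out.println(train(partitions, 100, 0.1)); // converges towards 2.0
    }
}
```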
What other options are there?
Storm could be an option for time series, predictions based on a pre-trained model, online learning algorithms, etc. However what is missing is an extension like Trident, but for distributed machine learning, that avoids having to reinvent the wheel. A sort of Mahout for Storm.
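As an illustration of what such an online learning building block could look like, here is a minimal sketch of logistic regression trained one example at a time. It is plain Java and does not touch the Storm API; the class name and its methods are hypothetical. In a streaming setup, update() would be called for every labelled tuple that arrives and predict() for every tuple that needs a score.

```java
// Hypothetical sketch of an online learner; not part of Storm or Trident.
final class OnlineLogisticRegression {
    private final double[] weights;
    private final double learningRate;
    private double bias;

    OnlineLogisticRegression(int featureCount, double learningRate) {
        this.weights = new double[featureCount];
        this.learningRate = learningRate;
    }

    // Returns the predicted probability of the positive class.
    double predict(double[] x) {
        double z = bias;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // One stochastic gradient step on a single example (label is 0 or 1).
    void update(double[] x, int label) {
        double error = predict(x) - label;
        for (int i = 0; i < weights.length; i++) {
            weights[i] -= learningRate * error * x[i];
        }
        bias -= learningRate * error;
    }
}
```

The model never needs the full data set in memory: it only keeps the weights, which is what makes this style a natural fit for a stream of events.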
Spark is another option. But Mesos is still very early days and also here a Mahout for Spark would be a good addition. In comparison with Storm, Spark would be ideal for training complex machine learning algorithms that need to iterate millions of times over the same data set.
Graphlab can be an option for those who are looking for social network analytics or other graph-based machine learning.
If you want to work with R then you could use packages like Snow or Parallel. But this would mean you need to reinvent a lot of the distributed management of processing nodes. Both packages just incorporate the basic functions to launch some external processing nodes but lack professional management of a large cluster. You could also look at RHadoop, as long as you are fine with non-real-time on top of Hadoop. As an alternative to RHadoop you could look at Rhipe. Segue is R + Amazon Elastic Map Reduce, etc.
Update: an interesting extension for R (i.e. pbd) has just been released that promises R execution on over 10,000 cores.
What is missing?
Simplicity, ease of use and reusability. What is needed is a solution that is cross-platform (R, Scala, Java, Python, Matlab, etc.). With a visual interface like RapidMiner or Knime, that allows 80% of the work to be drag-and-drop. With a reusable library of the most used algorithms for prediction, clustering, classification, outlier detection, dimension reduction, normalization, etc. Ideally with a marketplace for sharing data and algorithms. With an easy interface to manage your data and create reports, think similar to Datameer. Ideally integrated with tools for data cleaning (e.g. Google’s Refine) and ETL (e.g. Pentaho, Talend, Jasper Reports, etc.). But most of all with a powerful distributed engine that allows both batch processing [Hadoop] and real-time [e.g. Storm]. And finally with a one-click install.
If my requirements are missing some important aspects, let me know. If you want to construct such a system, please contact me…
In a previous post I mentioned Storm already. Trident is an extension of Storm that makes it an easy-to-use distributed real-time analytics framework for Big Data. Both Trident and Storm were developed by Twitter.
One of Twitter’s major problems is to keep statistics of Tweets and Tweeted URLs that get retweeted by millions of followers. Imagine a famous person who tweets a URL to millions of followers. Lots of followers will retweet the URL. So how do you calculate how many Tweeters have seen the URL? This is important for features like “Top retweeted URLs”.
The answer was Storm but with the addition of Trident, it has become a lot easier to manage. Trident is doing to Storm what Pig and Cascading are doing to Hadoop: simplification. Instead of having to create a lot of Spouts and Bolts and take care of how messages are distributed, Trident comes with a lot of the work already done.
In a few lines of code, you set up a Distributed RPC server, send it URLs, have it collect the tweeters and followers and count them. Fail-over and resilience as well as massively distributed throughput are built into the platform. You can see it in this example code:
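// Static state lookups (helper methods as in the Trident tutorial's reach example).
TridentState urlToTweeters =
    topology.newStaticState(getUrlToTweetersState());
TridentState tweetersToFollowers =
    topology.newStaticState(getTweeterToFollowersState());

// DRPC stream "reach": the requested URL arrives in the "args" field.
topology.newDRPCStream("reach")
    .stateQuery(urlToTweeters, new Fields("args"), new MapGet(), new Fields("tweeters"))
    .each(new Fields("tweeters"), new ExpandList(), new Fields("tweeter"))
    .stateQuery(tweetersToFollowers, new Fields("tweeter"), new MapGet(), new Fields("followers"))
    .each(new Fields("followers"), new ExpandList(), new Fields("follower"))
    .groupBy(new Fields("follower"))   // count every follower only once
    .aggregate(new One(), new Fields("one"))
    .aggregate(new Count(), new Fields("reach"));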
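To make the data flow of the example easier to follow, here is a minimal single-machine sketch of the same calculation in plain Java. It is not Trident code and everything in it (class name, method name, sample data) is hypothetical; it only mirrors the steps: look up the tweeters of a URL, look up their followers, drop duplicates and count.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, single-JVM illustration of the "reach" calculation.
// It is not Trident code; it only mirrors the steps of the topology.
final class ReachSketch {

    // Counts the distinct followers of everyone who tweeted the given URL.
    static int reach(String url,
                     Map<String, List<String>> urlToTweeters,
                     Map<String, List<String>> tweetersToFollowers) {
        Set<String> uniqueFollowers = new HashSet<>();
        // Step 1: look up the tweeters of the URL (first state query).
        for (String tweeter : urlToTweeters.getOrDefault(url, List.of())) {
            // Step 2: look up each tweeter's followers (second state query).
            // Step 3: the set drops duplicates, so each follower counts once.
            uniqueFollowers.addAll(tweetersToFollowers.getOrDefault(tweeter, List.of()));
        }
        // Step 4: the number of distinct followers is the reach.
        return uniqueFollowers.size();
    }

    public static void main(String[] args) {
        Map<String, List<String>> urlToTweeters =
                Map.of("http://example.com/a", List.of("alice", "bob"));
        Map<String, List<String>> tweetersToFollowers = Map.of(
                "alice", List.of("carol", "dave", "erin"),
                "bob", List.of("dave", "frank"));
        // dave follows both alice and bob but is counted once: prints 4
        System.out.println(reach("http://example.com/a", urlToTweeters, tweetersToFollowers));
    }
}
```

The difference with the real thing is scale: with millions of followers per tweeter the lookups and the deduplication no longer fit on one machine, and that is the part the distributed framework takes care of.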
The possibilities of Trident + Storm, combined with fast scalable datastores, like for instance Cassandra, are enormous. Everything from real-time counters, filtering, complex event processing, machine learning, etc.
The Storm concept of Spout [data generation] and Bolt [data processing] can be easily understood by most programmers. Storm is an asynchronous highly distributed framework but with a simple distributed RPC server it can easily be used in synchronous code.
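For readers who have not seen the concept before, here is a toy sketch of the Spout and Bolt idea in plain Java. The interfaces below are hypothetical stand-ins that I made up for illustration; they are not Storm's real classes, and there is no distribution, acknowledgement or fail-over here. The only point is the shape: a source emits tuples and a processing step consumes them.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical single-threaded illustration; these are NOT Storm's interfaces.
final class SpoutBoltSketch {

    // Stand-in for a spout: a source that emits tuples, null when exhausted.
    interface Spout<T> {
        T nextTuple();
    }

    // Stand-in for a bolt: a step that processes each tuple it receives.
    interface Bolt<T> {
        void execute(T tuple);
    }

    // A bolt that counts how often each tuple value was seen.
    static final class CountingBolt implements Bolt<String> {
        final Map<String, Integer> counts = new HashMap<>();

        @Override
        public void execute(String tuple) {
            counts.merge(tuple, 1, Integer::sum);
        }
    }

    // Pulls tuples from the spout and hands them to the bolt until the spout is empty.
    static <T> void run(Spout<T> spout, Bolt<T> bolt) {
        for (T tuple = spout.nextTuple(); tuple != null; tuple = spout.nextTuple()) {
            bolt.execute(tuple);
        }
    }

    // Wires a list-backed spout to a counting bolt and returns the counts.
    static Map<String, Integer> countUrls(List<String> urls) {
        Iterator<String> it = urls.iterator();
        CountingBolt bolt = new CountingBolt();
        run(() -> it.hasNext() ? it.next() : null, bolt);
        return bolt.counts;
    }
}
```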
The only drawback I have seen is that DRPC is focused only on Strings (and other primitive types that can be contained in a String). Adding more complex objects (via Kryo, Avro, Protocol Buffers, etc.), or at least bytes, would be useful for companies that do not only focus on Tweets.
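One possible workaround, sketched below under assumptions, is to wrap the bytes of a complex object in a Base64 String before sending it and to unwrap it on the other side. The sketch uses java.util.Base64 and Java's built-in serialization purely as a stand-in for Kryo, Avro or Protocol Buffers. The class name is hypothetical and no Storm API is called, so treat it as an idea rather than a tested DRPC integration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Base64;

// Hypothetical helper: carries an arbitrary serializable object inside a String.
final class StringPayloadCodec {

    // Serializes the object to bytes and wraps the bytes in a Base64 string.
    static String encode(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    // Reverses encode(): Base64 string back to bytes, bytes back to an object.
    // Only decode strings from a trusted source; Java deserialization of
    // untrusted input is unsafe.
    static Object decode(String text) {
        byte[] raw = Base64.getDecoder().decode(text);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Could not decode payload", e);
        }
    }
}
```

The price is roughly a third more data on the wire because of the Base64 encoding, which is why native support for bytes would still be the better solution.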
Every company is using Microsoft Office and especially Excel to do some sort of data analytics. However data volumes have grown exponentially and have outgrown Spreadsheets. You need experts in the business domain, in data analytics, in data migration/extraction/transformation/loading, in server management, etc. to get data analytics done on Big Data scale. This makes it expensive and only usable for the happy few.
Why? There must be easier ways to do it.
I think there are. If you are unfamiliar with data analytics but eager to learn, you should take a look at a product called RapidMiner. It is close to amazing how a non-expert is able to use Neural Networks, Decision Trees, Support Vector Machines, Genetic Algorithms, etc. and get meaningful results in minutes. The amazing part is also that RapidMiner is open source, so it is free for use by a single analyst.
Rapid-i.com, the company behind RapidMiner, also offers server software to run data analytics remotely. It is here where big data opportunities meet easy data analytics. What if RapidMiner data analytics could be run on hundreds of servers in parallel and you paid by usage, just as you pay for any Cloud compute and storage instances?
RapidMiner as a Service
RapidMiner as a Service, RMaaS, would allow millions of business people to be able to analyse Big Data “without Big Investments”. This type of Data Analytics as a Service would provide any SME with the same data analytics tools as large corporations. Data could come from Amazon S3, Amazon’s DynamoDB, Hosted Hadoops, any webservices, any social network, etc.
Visual as a Service
RapidMiner as a Service is only one of the many domain specific tools that could be offered as a visual drag-and-drop Cloud service. VAS as a Service is another example in which complex telecom assets can be easily combined in a drag-and-drop manner. There are many more. These services will be the real revolution of Cloud Computing since they combine IaaS/PaaS/SaaS into a new generation of solutions that bring large savings for new users and potential large revenues for their providers…
LTE roll-outs are taking place in America and Europe. Over-the-top-players are likely to start offering large-scale and free HD mobile VoIP over the next 6-18 months. Steeply declining ARPU will be the result. The telecom industry needs new revenue: telecom revenue 2.0. How can they do it?
1. Become a Telecom Venture Capitalist
Buying the number 2 or 3 player in a new market or creating a copy-cat solution has not worked. Think about Terra/Lycos/Vivendi portals, Keteque, etc. So the better option is to make sure innovative startups get partly funded by telecom operators. This assures that operators will be able to launch innovative solutions in the future. Just being a VC will not be enough. Investment in quickly launching the new startup services and incorporating them into the existing product catalog is also necessary.
2. SaaSification & Monetization
SaaS monetization is not reselling SaaS and keeping a 30-50% revenue share. SaaS monetization means offering others the development/hosting tools, sales channels, support facilities, etc. to quickly launch new SaaS solutions that are targeted at new niche or long tail segments. SaaSification means that existing license-based on-site applications can be quickly converted into subscription-based SaaS offerings. The operator is a SaaS enabler and brings together SaaS creators with SaaS customers.
3. Enterprise Mobilization, BPaaS and BYOD
There are millions of small, medium and large enterprises that have employees who bring smartphones and tablets to work [a.k.a. BYOD - bring-your-own-device]. Managing these devices (security, provisioning, etc.) as well as mobilizing applications and internal processes [a.k.a. BPaaS - business processes as a service] will be a big opportunity. Corporate mobile app and mobile SaaS stores will be an important starting point. Solutions to quickly mobilize existing solutions, ideally without programming, should come next.
4. M2M Monetization Solutions
At the moment M2M does not have any major industry standards yet. Operators are ideally positioned to bring standards to quickly connect millions of devices and sensors to value added services. Most of these solutions will not be SIM-based so a pure-SIM strategy is likely to fail. Operators should think about enabling others to take advantage of the M2M revolution instead of building services themselves. Be the restaurant, tool shop and clothing store and not the gold digger during a gold rush.
5. Big Data and Data Intelligence as a Service
Operators are used to managing petabytes of data. However, converting this data into information and knowledge is the next step towards monetizing data. At the moment big data solutions focus on storing, manipulating and reporting on large volumes of data. However, the Big Data revolution is only just starting. We need big data apps, big data app stores, “big datafication” tools, etc.
6. All-you-can-eat HD Video-on-Demand
Global content distribution can be done better with the help of operators than without. Exporting Netflix-like business models to Europe, Asia, Africa, Latin-America, etc. is urgently necessary if Hollywood wants to avoid the next generation believing “content = free”. All-you-can-eat movies, series and music for €15/month is what should be aimed for.
7. NFC, micro-subscriptions, nano-payments, anonymous digital cash, etc.
Payment solutions are hot. Look at Paypal, Square, Dwolla, etc. Operators could play it nice and ask Visa, Mastercard, etc. how they can assist. However, going a more disruptive route and helping Square and Dwolla serve a global marketplace is probably more lucrative. Besides NFC solutions, micro-subscriptions (e.g. €0.05/month) and nano-payments (e.g. €0.001/transaction) should also be looked at.
Don’t forget that people will still want to buy things in a digital world which they do not want others to know about or from people or companies they do not trust. Anonymous digital cash solutions are needed when physical cash is no longer available. Unless of course you expect people to buy books about getting a divorce with the family’s credit card…
8. Build your own VAS for consumers and enterprises – iVAS.
Conference calls, PBX, etc. were the most advanced communication solutions offered by operators until recently. However, creating visual drag-and-drop environments in which non-technical users can combine telecom and web assets to create new value-added-services can result in a new generation of VAS: iVAS. This is VAS in which personal problems are solved by the people who suffer them. Especially in emerging countries, where wide-spread smartphones and LTE are still some years off, iVAS can still have a good 3-5 years ahead. Examples would be personalized numbering schemas for my family & friends, distorting voices when I call somebody, etc. Let consumers and small enterprises be the creators by offering them visual do-it-yourself tools. Combine solutions like Invox, OpenVBX, Google’s App Inventor, etc.
9. Software-defined networking solutions & Network as a Service
Networks are changing from hardware to software. This means network virtualization, outsourcing of network solutions (e.g. virtualized firewalls), etc. Operators are in a good position to offer a new generation of complex network solutions that can be very easily managed via a browser. Enterprises could substitute expensive on-site hardware for cheap monthly subscriptions of virtualized network solutions.
10. Long-Tail Solutions
Operators could be offering a large catalog of long-tail solutions that are targeted at specific industries or problem domains. Thousands of companies are building multi-device solutions. Mobile & SmartTV virtualization and automated testing solutions would be of interest to them. Low-latency solutions could be of interest to the financial sector, e.g. automated trading. Call center and customer support services on-demand and via a subscription model. Many possible services in the collective intelligence, crowd-sourcing, gamification, computer vision, natural language processing, etc. domains.
Basically operators should create new departments that are financially and structurally independent from the main business and that look at new disruptive technologies/business ideas and how either directly or via partners new revenue can be generated with them.
What not to do?
Waste any more time. Do not focus on small or late-to-market solutions, e.g. reselling Microsoft 365, RCS like Joyn, etc. Focus on industry-changers, disruptive innovations, etc.
Yes, LTE roll-out is important but without any solutions for telecom revenue 2.0, LTE will just kill ARPU. So action is required now. Action needs to be quick [forget about RFQs], agile [forget about standards - the iPhone / AppStore is a proprietary solution], well subsidized [no supplier will invest big R&D budgets to get a 15% revenue share] and independent [of red tape and corporate control so risk taking is rewarded, unless of course you predicted 5 years ago that Facebook and Angry Birds would be changing industries]…
With Hadoop/Hbase/Hive, Cassandra, etc. you can store and manipulate peta-bytes of data. But what if you want to get nice looking reports or compare data held in a NoSQL solution with data held elsewhere? There have been two market leaders in the Open Source business intelligence space that are putting all their firepower onto Big Data now.
Pentaho Big Data seems to be a bit further ahead. They offer a graphical ETL tool, a report designer and a business intelligence server. These are existing tools, but support for Hadoop HDFS, Map-Reduce, Hbase, Hive, Pig, Cassandra, etc. has been added.
Jaspersoft’s Open Source Big Data strategy is a little bit behind because the connectors are not yet included in the main product, and several are still of beta quality and lack documentation.
Both companies will accelerate the adoption of big data since the main problem with Big Data is easy reporting. Unstructured data is harder to format into a very structured report than structured data. Any solutions that will make this possible and additionally are Open Source are very welcome in times of cost cutting…