Every 6 months at Canonical, the company behind Ubuntu, I work on something technical to test our tools first hand and to show others new ideas. This time around I created an Instant Big Data solution, more concretely “Instant Storm”.
Storm is now part of the Apache Foundation but previously Storm was build by Nathan Marz during his time at Twitter. Storm is a stream processing engine for real-time and distributed computation. You can use Storm to aggregate real-time flows of events, to do machine learning, for analytics, for distributed ETL, etc.
Storm is build out of several services and requires Zookeeper. It is a complex solution and non-trivial to deploy, integrate and scale. The first technical project I did at Canonical was to create a Storm Juju charm. Although I was able to automate the deployment of Storm, there were still problems because users still had to read about how to actually use Storm.
Instant Storm is the first effort to resolve this problem. I created a StormDeployer charm that can read a yaml file in which a developer can specify multiple topologies. For each you specify the name of the topology, the jar file, the location in Github, how to package the jar file, etc. Afterwards by uploading the yaml file to Github or any public web server and giving it the extension .storm anybody in the world is able to reuse the topologies instantly in two steps:
Alternatively you can use the Juju command line:
juju set stormdeployer "deploy=http://somedomain/somefile.storm"
There are several examples already available on Github but here is one that for sure works:
This is the easiest way to deploy Storm on any public cloud, private cloud or if Ubuntu’s Metal-as-a-Service / MaaS is used on any bare metal server (X86, ARM64, Power 8). See here for Juju installation instructions.
This is a first version with some limitations. One of the really nice things to add would be to use Juju to make integrations between a topology and other charms dynamic. You can for instance create a spout or bolt that connects to the Kafka or Cassandra charms. Juju can automatically tell the topology the connection information and make updates to the running topologies should anything change. This would make it a lot more robust to run long running Storm topologies.
I am happy to donate my work to the Apache Foundation and guide anybody who wants to take ownership…
In 1999 you could easily spend $1M on having a company build a static web site. A few years later any student could make you a web site. HTML became a commodity. The same commodity effect needs to happen to Big Data.
The past: build your own petabyte solution
A few years back only the happy few extremely technically gifted companies were able to create solutions to store TBs and even PBs of data. Google started to write papers. Yahoo and Facebook started to release open source solutions. Shortly after Big Data became a buzz word and anybody that was somebody in the IT consultancy space was talking about Hadoop.
Now: open source solutions and lots of handholding
In 2014 it is possible to download Hadoop, Spark, Storm, etc. You can even find prepackaged solutions from Hortonworks, Cloudera, MapR, Pivotal, IBM, etc. But still Big Data projects are hard. You need very bright people or spend quite a lot to get anywhere. Many projects run over budget and under deliver.
Future: instant Big Data solutions
We are ready for the next step and convert Big Data in a commodity. Several startups are launching Big Data solutions as a service. Unfortunately for many SaaS providers, having a Big Data SaaS solution is not enough. Big Data means lots of data. Data that can hold sensitive information. Data that can grow with GBs a day. This is the reason why if any SaaS Big Data solution ought to be successful, it also needs an on-premise alternative.
We are also missing a portable Big Data logic container. The industry is raving about Docker. Several startups are working on making Docker containers the way to share your map-reduce logic. I predict that many more Big Data logic can be containerised and made portable. Any data scientist should be able to reuse Deep Belief or Random Forest algorithms by just reusing a container.
The other part of the puzzle that is still missing is data visualisation and manipulation tools. There are many Big Data key-value stores and map-reduce engines. However the data visualisation and reporting space is still wide open. The Apache Foundation does not [yet] provide a drag-and-drop tool to setup dashboards, generate reports, schedule notifications, run workflows, automate data imports, etc.
Industry specific reusable assets is another part that is missing. Nobody wants to go and reinvent eCommerce recommendation algorithms every time a new Big Data platform becomes available.
However all of this is coming at enormous speeds. As soon as all the pieces of the puzzle are coming together then cloud orchestration solutions like Juju, ServiceMesh, Brooklyn, etc. will allow enterprises to start consuming Big Data solutions as a commodity. Instant Big Data solutions are 6-36 months away depending on your requirements.
Have you ever counted the number of Linux devices at home or work that haven’t been updated since they came out of the factory? Your cable/fibre/ADSL modem, your WiFi point, television sets, NAS storage, routers/bridges, media centres, etc. Typically this class of devices hosts a proprietary hardware platform, an embedded proprietary Linux and a proprietary application. If you are lucky you are able to log into a web GUI often using the admin/admin credentials and upload a new firmware blob. This firmware blob is frequently hard to locate on hardware supplier’s websites. No wonder the NSA and others love to look into potential firmware bugs. They are the ideal source of undetected wiretapping.
The next IT revolution: micro-servers
The next IT revolution is about to happen however. Those proprietary hardware platforms will soon give room for commodity multi-core processors from ARM, Intel, etc. General purpose operating systems will replace legacy proprietary and embedded predecessors. Proprietary and static single purpose apps will be replaced by marketplaces and multiple apps running on one device. Security updates will be sent regularly. Devices and apps will be easy to manage remotely. The next revolution will be around managing millions of micro-servers and the apps on top of them. These micro-servers will behave like a mix of phone apps, Docker containers, and cloud servers. Managing them will be like managing a “local cloud” sometimes also called fog computing.
Micro-servers and IoT?
Are micro-servers some form of Internet of Things. Yes they can be but not all the time. If you have a smarthub that controls your home or office then it is pure IoT. However if you have a router, firewall, fibre modem, micro-antenna station, etc. then the micro-server will just be an improved version of its predecessor.
Why should you care about micro-servers?
If you are a mobile app developer then the micro-servers revolution will be your next battlefield. Local clouds need “Angry Bird”-like successes.
If you are a telecom or network developer then the next-generation of micro-servers will give you unseen potentials to combine traffic shaping with parental control with QoS with security with …
If you are a VC then micro-server solution providers is the type of startups you want to invest in.
If you are a hardware vendor then this is the type of devices or SoCs you want to build.
If you are a Big Data expert then imagine the new data tsunami these devices will generate.
If you are a machine learning expert then you might want to look at algorithms and models that are easy to execute on constraint devices once they have been trained on potentially thousands of cloud servers and petabytes of data.
If you are a Devop then your next challenge will be managing and operating millions of constraint servers.
If you are a cloud innovator then you are likely to want to look into SaaS and PaaS management solutions for micro-servers.
If you are a service provider then this is the type of solutions you want to have the capabilities to manage at scale and easily integrate with.
If you are a security expert then you should start to think about micro-firewalls, anti-micro-viruses, etc.
If you are a business manager then you should think about how new “mega micro-revenue” streams can be obtained or how disruptive “micro- innovations” can give you a competitive advantage.
If you are an analyst or consultant then you can start predicting the next IT revolution and the billions the market will be worth in 2020.
The next steps…
It is still early days but expect some major announcements around micro-servers in the next months…
At TADHack some months ago it was clear that SMS and phone calls are out and WebRTC is the new hot technology for developers. Via your browser you can talk to your salesman, doctor and coach. Your browser can be mobile. This means that video calls will be universal as soon as 4G is everywhere. Bad news for operators that will see data on their networks balloon without new revenues. Good news for users that will have a whole new world of communication opening up with voice, video, screen sharing, web apps, etc. all seamlessly integrated.
How can business be generated with WebRTC?
Per minute call billing is out. Unless of course you are talking to a highly paid consultant that charges you by the second or minute. One time payment like mobile apps are only viable if you can embed WebRTC technology in a mobile app, not if you need to support an ongoing business. This means that we need a new subscription model for WebRTC. We need a micro subscription model. Especially for services that will be used on a long term basis, e.g. conference facilities, next generation voice mails, etc. As always operators will be hesitant to cannibalise a juicy per minute business for a low margin 1-99 cents per months subscription service. So are there others that could bill micro-subscriptions? The obvious choice would be cloud providers. They can already do hourly micro billing on monthly cycles hence adding some recurring element would be straightforward. So my prediction is that WebRTC will see operator’s problems accelerate whereby cloud will no longer deliver you only IT solutions but also your communication services.
Every day a new orchestration solution is being presented to the world. This post is not about which one is better but about what will happen if you embrace these new technologies.
The traditional scale-up architecture
Before understanding the new solutions, let’s understand what is broken with the current solutions. Enterprise IT vendors have traditionally made software that was sold based on the number of processors. If you were a small company you would have 5 servers, if you were big you would have 50-1000 servers. With the cloud anybody can boot up 50 servers in minutes, so reality has changed. Small companies can manage easily 10000 servers, e.g. think of successful social or mobile startups.
Also software was written optimised for performance per CPU. Many traditional software comes with a long list of exact specifications that need to be followed in order for you to get enterprise support.
Big bloated frameworks are used to manage the thousands of features that are found in traditional enterprise solutions.
The container micro services future
Enterprise software is often hard to use, integrate, scale, etc. This is all the consequence of creating a big monolithic system that contains solutions for as many use cases possible.
In come cloud, containers, micro-services, orchestration, etc. and all rules change.
The best micro services architecture is one where important use cases are reflected in one service, e.g. the shopping cart service deals with your list of purchases however it relies on the session storage service and the identity service to be able to work.
Each service is ran in a micro services container and services can be integrated and scaled in minutes or even seconds.
What benefits do micro services and orchestration bring?
In a monolithic world change means long regression tests and risks. In a micro services world, change means innovation and fast time to market. You can easily upgrade a single service. You can make it scale elastically. You can implement alternative implementations of a service and see which one beats the current implementation. You can do rolling upgrades and rolling rollbacks.
So if enterprise solutions would be available as many reusable services that can all be instantly integrated, upgraded, scaled, etc. then time to market becomes incredibly fast. You have an idea. You implement five alternative versions. You test them. You combine the best three in a new alternative or you use two implementations based on a specific customer segment. All this is impossible with monolithic solutions.
This sounds like we reinvented SOA
Not quite. SOA focused on reusable services but it never embraced containers, orchestration and cloud. By having a container like Docker or a service in the form of a Juju Charm, people can exchange best practice’s instantly. They can be deployed, integrated, scaled, upgraded, etc. SOA only focused on the way services where discovered and consumed. Micro services focus additionally on global reuse, scaling, integration, upgrading, etc.
We are not quite there yet. Standards are still being defined. Not in the traditional standardisation bodies but via market adoption. However expect in the next 12 months to see micro services being orchestrated at large scale via open source solutions. As soon as the IT world has the solution then industry specific solutions will emerge. You will see communication solutions, retail solutions, logistics solutions, etc. Traditional vendors will not be able to keep pace with the innovation speed of a micro services orchestrated industry specific solution. Expect the SAPs, Oracles, etc. of this world to be in chock when all of a sudden nimble HR, recruiting, logistics, inventory, supplier relationship management solutions, etc. emerge that are offered as SaaS and on-premise often open source. Super easy to use, integrate, manage, extend, etc. It will be like LEGO starting a war against custom made toys. You already know who will be able to be more nimble and flexible…
Cisco came up with the term of Fog Computing and The Wall Street Journal has endorsed it, so I guess Fog Computing will become the next hype.
What is Fog Computing?
Internet of Things will embed connectivity into billions of devices. Common thinking says your IoT device is connected to the cloud and shares data for Big Data analytics. However if your Fitbit starts sending your heartbeat every 5 seconds, your thermometer tells the cloud every minute that it is still 23.4 degrees, your car tells the manufacturer its hourly statistics, farmers measure thousands of acres, hospitals measure remote patients health continuously, etc. then your telecom operator will go bankrupt because their network is not designed for this IoT Data Tsunami.
Fog Computing is about taking decisions as close to the data as possible. Hadoop and other Big Data solutions have started the trend to bring processing close to where the data is and not the other way around. Now Fog Computing is about doing the same on a global scale. You want decisions to be taken as close to where the data is generated and stop it from reaching global networks. Only valuable data should be travelling on global networks. Your Fitbit could sent average heartbeat reports every hour or day and only sent alerts when your heartbeat passed a threshold for some amount of time.
How to implement Fog Computing?
Fog Computing is best done via machine learning models that get trained on a fraction of the data on the Cloud. After a model is considered adequate then the model gets pushed to the devices. Having a Decision Tree or some Fuzzy Logic or even a Deep Belief Network run locally on a device to take a decision is lots cheaper than setting up an infrastructure in the Cloud that needs to deal with raw data from millions of devices. So there are economical advantages to use Fog Computing. What is needed are easy to use solutions to train models and send them to highly optimised and low resource intensive execution engines that can be easily embedded in devices, mobile phones and smart hubs/gateways.
Fog Computing is also useful for Non-IoT
Also network elements should become a lot more intelligent. When was the last time you were on a large event with many people around you. Can you imagine any event in the last 24 months where WiFi was working brilliantly? Most of the time WiFi works in the morning when people are still getting in but soon after it stops working. Fog Computing can be the answer here. You only need to analyse data patterns and take decisions on what takes up lots of data. Chances are that all the mobiles, tablets and laptops that are connected to the event WiFi have Dropbox or some other large file sharing enabled. You take some pictures of things on the event and since you are on WiFi the network gets saturated by a photo sharing service that is not really critical for the event. Fog Computing would detect this type of bandwidth abuse and would limit it or even block it. At the moment this has to be done manually but computers would do a lot better job at it. So Software Defined Networking should be all over Fog Computing.
Telecom Operators and Equipment Manufacturers Should Embrace Fog Computing
Telecom operators should heavily invest in Fog Computing by making Open Source standards that can be easily embedded in any device and managed from any cloud. When I say standards, I don’t mean ETSI. I mean organise a global Fog Computing competition with a $10 million award for the best open source Fog Computing solution. Make a foundation around it with a very open license, e.g. Apache License. Invite and if necessary oblige all telecom and general network suppliers to embed it.
The alternatives are…
Not solving this problem will provoke heavy investment in global networks that carry 90% junk data and an IoT Data Tsunami. Solving this problem via network traffic shaping is a dangerous play in which privacy and net neutrality will come up earlier than later. You can not block Dropbox, YouTube or Netflix traffic globally. It is a lot easier if everybody blocks what is not needed or at least minimises such traffic themselves. Most people have no idea how to do it. Creating easy to use open source tools would be a first good step…