Why can Facebook, Google, Salesforce and Twitter roll out new features every day while regular telecom operators manage it only every six months? Although they are dotcoms, they have thousands of employees and plenty of legacy systems as well. Yet they are able to roll out a new feature every day, if not every hour or minute, and large new systems every few months, weeks or even days.
How do they do it and how can the telecom industry learn from it?
On High Scalability you will find a lot of information about the architectures of large dotcoms. If you look across the different articles, you will see that each of the larger dotcoms has an architecture that is shared among its different products and services, e.g. scaling messages at Facebook.
This is the secret sauce of the dotcoms. They have built and continuously improved a highly distributed architecture that can handle millions of users and petabytes of information. On top of this "shared architecture" sit the services. New employees can quickly create new services because they do not have to worry about scaling data, monitoring the service, deploying and upgrading versions, backing up data, versioning code, etc.
On the other hand, operators have no standardized shared architecture. Instead there is a puzzle of different solutions that often use totally different technologies, hardware, etc. Maintenance and upgrades are a nightmare.
Trying to launch any new service requires a massive amount of planning, lots of different skills, expensive investments in third-party licenses and hardware, etc.
How can you do it differently?
Building a private cloud with virtual servers and storage will not resolve operators' problems. Simply virtualizing the puzzle of solutions does not do away with complex integrations.
Operators need to make a bolder move. They need to separate the new from the old. Legacy systems should be kept and isolated, while a new architecture is built that works in parallel with them. This new architecture should focus on launching new services and partner services at dotcom speed. Everything should be handled as an independent service, and each service should get its own API: a storage service, a billing service, a monitoring service, a provisioning service, an identity service, a data warehouse service, a deployment service, a mobile shop service, an inventory service, a support service, etc.
All APIs should use a common technology. APIs for third parties could use REST; APIs for internal high-load usage could use Thrift or Protocol Buffers. Each API should come in two versions: an easy one and a low-level one. The easy API offers the most commonly used, basic functionality, e.g. sendSMS(from, to, message). The low-level API offers the complete feature set, e.g. sendBinarySMS, sendSMSWithDeliveryConfirmation, etc. This lets most services use the easy API while still having access to the advanced functionality when needed.
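The two-tier idea can be sketched as a thin wrapper over a full-featured service. All class and method names below are hypothetical illustrations, not an existing operator API:

```python
# Sketch of the easy vs. low-level API split (all names hypothetical).
# SmsService is the low-level API with the full feature set; EasySms is a
# thin wrapper that exposes only the one call most services need.

class SmsService:
    """Low-level API: complete feature set."""

    def send_binary_sms(self, sender, recipient, payload, udh=None):
        # Full control over encoding, headers, delivery reports, etc.
        return {"sender": sender, "recipient": recipient, "bytes": len(payload)}

    def send_sms_with_delivery_confirmation(self, sender, recipient, text, callback_url):
        # Request a delivery report posted back to callback_url.
        return {"sender": sender, "recipient": recipient, "confirm": callback_url}


class EasySms:
    """Easy API: sendSMS(from, to, message) and nothing else."""

    def __init__(self, low_level):
        self._svc = low_level

    def send_sms(self, sender, recipient, text):
        # Delegate to the low-level API with sensible defaults.
        return self._svc.send_binary_sms(sender, recipient, text.encode("utf-8"))
```

A service that later needs delivery confirmations can drop down to the wrapped low-level object without switching platforms.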
Load balancing when using the services is key. The load balancer is the secret behind many rolling upgrades in the dotcom world. An application that uses a certain service will use client-based load balancing. By allowing the load balancer to receive events, it is possible to dynamically add or remove instances of an API, gradually move requests to a new version of the API, etc.
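A minimal sketch of that event-driven, client-side load balancer follows (the event names and instance addresses are made up). The point is that each client keeps its own weighted instance list, so traffic can be shifted gradually to a new API version without a central proxy:

```python
# Toy client-side load balancer driven by events (all names hypothetical).
# "add"/"remove" change the instance set; "reweight" gradually shifts
# traffic, e.g. from v1 of an API to v2 during a rolling upgrade.

import random

class ClientLoadBalancer:
    def __init__(self):
        self._weights = {}  # instance address -> relative traffic weight

    def on_event(self, event, instance, weight=1):
        if event == "add":
            self._weights[instance] = weight
        elif event == "remove":
            self._weights.pop(instance, None)
        elif event == "reweight":
            self._weights[instance] = weight

    def pick(self):
        # Weighted random choice among the currently known instances.
        instances = list(self._weights)
        weights = [self._weights[i] for i in instances]
        return random.choices(instances, weights=weights, k=1)[0]
```

A rolling upgrade is then just a sequence of events: add v2 with weight 0, raise its weight step by step, remove v1.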
New service developers can now focus on building the business logic of the new service rather than on data migrations, scaling, monitoring, backups, etc. The service can have completely new ways of billing and charging, a complex deployment workflow, advanced monitoring requirements, large data storage requirements, etc. However it is not the billing or charging system that has to be extended, nor a centralized EAI, nor the monitoring system. Instead the service itself decides what is best for it via the easy or low-level APIs. By moving the peculiarities of every service into the service, and not into generic OSS and BSS systems, these support systems can be drastically simplified.
Operators should focus on launching many more niche services and on opening up their infrastructure to a long tail of service suppliers. Instead of general services like a PBX for SMEs, operators should think about hotel reservation services, doctor scheduling services, etc. The value of the operator should lie in offering a reliable back-office architecture, assuring service quality and managing the support ecosystem. The long tail of service suppliers should be put to work launching competing niche offerings, and customers should decide which ones survive.
SS7 networks, or "intelligent networks", have been the core reason why network-based services cannot be rolled out quickly. Specialized skills are needed to launch a new SS7 service.
Currently operators are investing in service delivery platforms, or SDPs, to move the network intelligence out of SS7. These SDPs will hold modern copies of the SS7 services.
But do we need intelligent networks at all? Why can't we have dumb networks?
The Internet is a dumb TCP/IP network. Intelligence sits not in the network but in the applications that run on top of it. Why are telecom networks different? Why do routers have to know whether the application is voice, SMS or data? Why does the network have to know about conferencing, numbering plans, etc.?
One example: MSISDN
Why do you want to hard-code an end-user identifier throughout your network and billing systems? Why can't we have a mechanism like a unique IP address with several DNS names pointing to it? I don't want to memorize a long list of digits to identify a friend. I would like to control my own numbering plan: my direct family starts with 1xx, my friends with 2xx. Alternatively I could use their email addresses. I should be able to call a company by its DNS name. Ideally I could use a Facebook or Twitter id as well. All of this would be possible if an internal identifier were mapped to one or more end-user identifiers, instead of one unique identifier being used everywhere.
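The mapping idea boils down to a small lookup service. The sketch below is a hypothetical illustration (the class name, ids and aliases are invented): many user-facing aliases resolve to one internal subscriber id, so the MSISDN becomes just another alias rather than the key hard-coded everywhere.

```python
# Hypothetical identity service: many aliases, one internal id.
# The network and billing systems only ever see the internal id;
# short codes, emails, social ids and the MSISDN are all just aliases.

class IdentityService:
    def __init__(self):
        self._alias_to_id = {}

    def register(self, internal_id, *aliases):
        # Attach any number of user-facing aliases to one subscriber.
        for alias in aliases:
            self._alias_to_id[alias] = internal_id

    def resolve(self, alias):
        # Return the internal id for an alias, or None if unknown.
        return self._alias_to_id.get(alias)
```

Changing your personal numbering plan is then a matter of re-registering aliases, with no change to the network or billing side.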
Example two: CDRs
Why should every network element know that a CDR has to be generated for every call, while for data, charging is based not on seconds or minutes but on data volume? Can't the metering be done outside the network? Why are we generating millions of CDRs when end-users are on a flat rate or are calling a free number? With software-as-a-service, metering can differ per application (pay per GB of storage, per MB of network traffic, per user per month, per company per year, etc.).
The proposal: define the metering mechanism for each call, SMS or application ad hoc, and use specialized meters outside of the network to meter the service. Time-based meters let any type of data pass through to the network but bill by the nanosecond, millisecond, second, minute, hour, day, week, month, year, etc. You simply configure that this voice application needs second-based billing, that adult-entertainment application needs minute-based billing and that compute server needs hourly billing. Flat-fee calls would not have to be metered and as such don't need a meter at all. Meters could be gateways that watch data passing through. They could also be event-based, delegating complex metering to the application itself, which then signals when an event has to be billed, e.g. an application download, a new user registration, etc.
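Configured per service, such metering could look like the sketch below. The service names, units and rates are invented for illustration; the key property is that flat-fee services simply have no meter entry, so no CDR-like record is ever produced for them.

```python
# Hypothetical per-service metering config, applied outside the network.
# Each service declares its own billing unit and rate; services absent
# from the config are flat-fee or free and generate no metering records.

METER_CONFIG = {
    "voice":   {"unit": "second", "rate": 0.002},  # second-based billing
    "adult":   {"unit": "minute", "rate": 0.50},   # minute-based billing
    "compute": {"unit": "hour",   "rate": 1.20},   # hourly billing
}

def meter(service, quantity):
    """Return the charge for `quantity` billing units, or 0 if unmetered."""
    cfg = METER_CONFIG.get(service)
    if cfg is None:
        return 0.0  # flat fee or free number: nothing to record
    return round(quantity * cfg["rate"], 4)
```

Adding a new billing model is then a config change on the meter, not an upgrade of every network element.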
Simplifying the network by taking out the complexity of managing, launching, metering and monitoring services would substantially reduce the cost of network equipment. Perhaps to such an extent that the cost becomes too small to be worth metering, at which point metering itself can be eliminated. Pure bit-pipe operators could probably get by with an Excel sheet or Access database as their billing system.
Most telecom projects involve installing an Oracle RAC cluster, a SAN, application server clusters, etc. The time it takes just to procure and install the basic hardware and software runs to months. And we are not even talking about the costs…
If you want to launch new ideas every month, you have to use cloud computing, whether public, private or hybrid. But even then, too much time is spent on installation and configuration of software.
Infrastructure automation is about making a team productive in quickly launching new services or updating existing ones. It starts with a standardized development environment, automated build tools (e.g. Maven or Ant for Java), continuous test automation (JUnit, but also Hudson or CruiseControl), etc. The next step is to also automate the deployment of server software and cloud infrastructure (e.g. with Puppet and mCollective from Puppet Labs).
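The core idea behind tools like Puppet can be illustrated in a few lines: describe the desired state of a server declaratively, then converge on it idempotently, so the same run works on a fresh machine and on an already-configured one. The sketch below is a toy model of that principle, not Puppet's actual language or API:

```python
# Toy model of declarative, idempotent provisioning (names hypothetical).
# The desired state is data; apply_state computes only the actions still
# needed, so re-running it on a converged server does nothing.

DESIRED_STATE = {
    "packages": {"java", "tomcat"},
    "services_running": {"tomcat"},
}

def apply_state(current, desired):
    """Return the actions needed to converge `current` on `desired`."""
    actions = []
    for pkg in sorted(desired["packages"] - current["packages"]):
        actions.append(f"install {pkg}")
    for svc in sorted(desired["services_running"] - current["services_running"]):
        actions.append(f"start {svc}")
    return actions
```

Because the run is idempotent, the same description can provision ten test servers and a hundred production servers without per-machine scripting.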
This type of automation is often 90% identical across projects, so a standardized framework would dramatically shorten the time to get software development up and running, as well as to deploy it into test and production environments.
This would, however, only be the start of the journey. Dotcoms launch new features on a weekly or even daily basis. They monitor in detail what users do and often launch multiple alternative versions of a new feature. Gradual deployment of small features makes performance problems visible straight away, avoids extensive regression tests and enables fast rollback.
Let’s see how this could be applied in telecom.
The key to success is copying Google. Google has standardized architecture components that are reused among different teams (e.g. BigTable). By building up a shared infrastructure, along with the tools to quickly deploy new services onto it or update existing features, time to market can be reduced dramatically. Infrastructure is a secret competitive weapon that too few companies use.