Archive

Posts Tagged ‘map reduce’

Big Data Apps and Big Data PaaS

March 21, 2012 2 comments

Enterprises no longer have a lack of data. Data can be obtained from everywhere. The hard part is to convert data into valuable information that can trigger positive actions. The problem is that you need currently four experts to get this process up and running:

1) Data ETL expert – is able to extract, transform and load data into a central system.

2) Data Mining expert – is able to suggest great statistical algorithms and able to interpret the results.

3) Big Data programmer – is an expert in Hadoop, Map-Reduce, Pig,  Hive, HBase, etc.

4) A business expert – that is able to guide all the experts into extracting the right information and taking the right actions based on the results.

A Big Data PaaS should focus on making sure that the first three are needed as little as possible. Ideally they are not needed at all.

How could a business expert be enabled in Big Data?

The answer is Big Data Apps and Big Data PaaS. What if a Big Data PaaS is available, ideally open source as well as hosted, that comes with a community marketplace for Big Data ETL connectors and Big Data Apps? You would have Big Data ETL connectors to all major databases, Excel, Access, Web server logs, Twitter, Facebook, Linkedin, etc. For a fee different data sources could be accessed in order to enhance the quality of data. Companies should be able to easily buy access to data of others on a Pay-as-you-use basis.

The next steps are Big Data Apps. Business experts often have very simple questions: “Which age group is buying my product?”, “Which products are also bought by my customers?”, etc. Small re-useable Big Data Apps could be built by experts and reused by business experts.

A Big Data App example

A medium sized company is selling household appliances. This company has a database with all the customers. Another database with all the product sales. What if a Big Data App could find which products tend to be sold together and if there are any specific customer features (age, gender, customer since, hobbies, income, number of children, etc.) and other features (e.g. time of the year) that are significant? Customer data in the company’s database could be enhanced with publicly available information (from Facebook, Twitter, Linkedin, etc.). Perhaps the Big Data App could find out that parents (number of children >0), whose children like football (Facebook), are 90% more likely to buy waffle makers, pancake makers, oil fryers, etc. three times a year. Local football clubs might organize events three times a year to gain extra funding. Sponsorship, direct mailing, special offers, etc. could all help to attract more parents, of football-loving-kids, to the shop.

The Big Data Apps would focus on solving a specific problem each: “Finding products that are sold together”, “Clustering customers based on social aspects”, etc. As long as a simple wizard can guide a non-technical expert in selecting the right data sources and understanding the results, it could be packaged up as a Big Data App. A marketplace could exist for the best Big Data Apps. External Big Data PaaS platforms could also allow data from different enterprises to be brought together and generate extra revenue as long as individual persons can not be identified.

NextGen Hadoop, beyond MapReduce

Hadoop has run into architectural limitations and the community has started working on the Next Generation Hadoop [NGN Hadoop]. NGN Hadoop has some new management features of which multi-tenant application management is the major one. However the key change is that MapReduce no longer is entangled inside the rest of Hadoop. This will allow Hadoop to be used for MPI, Machine Learning, Master-Worker, Iterative Processing, Graph Processing, etc. New tools to better manage Hadoop are also being incubated, e.g. Ambari and HCatalog.

Why is this important for telecom?
Having one platform that allows massive data storage, peta-byte data analytics, complex parallel computations, large-scale machine learning, big data map reduce processing, etc. all in one multi-tenant set-up means that telecom operators could see massive reductions in their architecture costs together with faster go-to-market, better data intelligence, etc.

Telecom applications, that are redesigned around this new paradigm, can all use one shared back-office architecture. Having data centralized into one large Hadoop cluster instead of tens or hundreds of application-specific databases, will enable unseen data analytics possibilities and bring much-needed efficiencies.

Is this shared-architecture paradigm new? Not at all. Google has been using it since 2004 at least when they published Map Reduce and BigTable.

What is needed is that several large operators define this approach as their standard architecture hence telecom solution providers will start incorporating it into their solutions. Commercial support can be easily acquired from companies like Hortonworks, Cloudera, etc.

Having one shared data architecture and multi-tenant application virtualization in the form of a Telco PaaS would allow third-parties to launch new services quickly and cheaply, think days in stead of years…

Does Google listen in when you use Google Voice?

October 15, 2010 Leave a comment

The dotcoms of this world generate massive amounts of web server logs. Users upload files. Make comments. Vote on items. All this data is unstructured data. Google has just pushed the bar higher with their Google instant that is able to change almost instantly* their search results based on real-time data changes. Percolator is the real-time indexing that makes it all possible. Really impressive to see indexes change almost real-time for a company that moves daily 20 peta bytes of data [= almost 30.000.000 CDs].

So if Google is able to index all our web data, what about our voice and video data? Google Voice brought voice transcription to the general public. Voice transcription is based on machine learning. Every time a voicemail is incorrectly transcribed, users are able to “teach” Google how to do it right. This will make Google’s voice transcription quickly the best trained in the world. From there it is only a small step to also connect it to all Google Voice calls. Next step is to index what you say. Real-time indexing is key to interpret this vast amount of new content! So not only when you send a Gmail message talking about a trip you are planning to Paris, will you get advertisement from travel agents, but also when you talk to your friends about the trip.

Can Google go any further? Yes of course. Don’t forget Google Talk and Android. Two more sources to get voice data from. Google Talk can be used now. Android probably only if you use Google Talk or Google Voice on your mobile given the excessive data charges you would get if you would send “normal circuit” calls to Google.

Where are today’s limits of image recognition? Especially in videos? Taking a video with an Android phone and uploading it to Youtube could also mean that GPS data of where the video was taken could be included. Google Goggles ‘ technology could then find out what it is you were doing.  Probably not possible today but let´s wait some months…

What this means is that Google will very likely be able to subsidize more customer services if you are willing to trade-in some extra privacy. “Free” calls and even video calls for Android will definitely set it apart from iPhones, Nokias, etc. Google would need to subsidize wholesale interconnects to other providers, which are not that bad compared to end-user prices. But could marketing dollars also subsidize a monthly data plan? If yes then Google could become an MVNO and offer free phone and data services. This would be a killer feature to subscribe to Google Voice on the mobile. Killer in both senses, but not the positive sense for operators…

Where does this leave operators? Again a piece of their voice and SMS pie will disappear. But also a large piece of monthly subscription fees. Today there is very few operators can do to defend themselves if they don’t change their own rules. Every operator that thinks their assets are sacred, RFQ’s will bring innovation and scalability is about writing a large check to Oracle, will likely suffer.  Those that are willing to experiment with disruptive innovations and are open to discuss the previously unthinkable, still have a window of opportunity…

Note:

* There are delays between the time a page is updated and the new results being visible due to crawler and indexing delays so real-time indexing does not mean real-time search results.

Ten Times Faster Time-to-Market for Telco Innovations

September 27, 2010 Leave a comment

Google has changed very little to its basic architecture building blocks over the years. Everything runs on top of the Google File System and Bigtable. Except for Google Instant which is reversing Map-Reduce usage, new services have been reusing their existing architecture.

Similar observations can be made for the rest of the main players. So why is it that Telecom operators have not invested in one architecture to launch multiple services? No idea.

One architecture for VAS

The concept is simple. Create one common architecture. This architecture should have multiple components:

  • A high-available real-time data store – stores all application and user data
  • A right-time data analytics service – allows collective intelligence and data mining
  • An asset exposure layer – applications can re-use network assets and get isolated from internal complexities
  • Presentation layer – facilitate mobile GUI and Web 2.0 development
  • Application Engine – allows applications to run and focus on business logic instead of scaling and integration
  • Continuous Deployment – instead of monthly big-bang deployments, incremental daily or weekly releases are possible, even hourly like some dotcoms.
  • Unified Administration – one place to know what is happening both technically and business-wise with the applications.
  • Long-Tail Business Link – all business and accounting transactions for customers, partners, providers, etc. are centralized.
  • etc.

Having such an architecture in place would allow telco innovations to be brought to market at least ten times faster. Application and service designers would have to focus on business logic and not on the rest. Administrators would have one platform to manage and not a puzzle of systems. Integrations would have to be done ones to a common integration layer.

Building such an architecture should be done in the dotcom style and not a telco RFQ. Only by doing iterative projects which bring together the components can you build an architecture that is really used and not a side project that starts to have its own life.

It even makes sense to open source the architecture. Telco’s business is not about building architectures hence having a common platform that was started by one would benefit the whole industry. It even would give a competitive advantage to the one that started the architecture for knowing it better than any competitor. Of course for this to happen, a telco has to recognize that their future major competitors are not the neighboring telco but a global dotcom…

If you are not doing Cloud Computing now, then you are late!

September 14, 2010 Leave a comment

Any operator that has not started a project on Cloud Computing is late. The typical data center at an operator is filled with servers that are under utilized e.g. application servers and database servers are running at 30% of memory, disk and CPU. Just by doing step one of getting to Cloud Computing: virtualization, operators are able to save substantially in the cost of hardware, electricity, maintenance, etc. Virtualization means decoupling software from hardware. This allows to run multiple operating systems on one server.

However this would only be focusing on the tip of the iceberg. Cloud Computing is so much more…

Private Clouds

Automatic Scaling

Let´s first focus on the internal systems of an operator. After solutions have been virtualized, then you are able to scale them to more or less servers. The first step is to automate this process. If you have an application server cluster, do you need 8 nodes all the time? You probably only need them the week before Christmas or during some other peak period. So the ideal is to be able to measure the load and to automate the deployment of more or less cluster nodes based on load. The same can be done with the database. During the night you have 2 nodes. In the morning 3. During the day 4. During peak moments 8. In the evening 3 again. You could save massive amounts of money if application servers and databases can be scaled in this way. You ideally also are able to pay licenses based on what you really use and not on your maximum number of nodes during a yearly peak.

Redesigning Applications and Data

Both Amazon and Google found out that if they redesign their applications then they can get even more gains than pure virtualization. Amazon´s S3 service is a clear example. However internally they started with services like Dynamo on which S3 is build. The first step is to build general data stores. Multiple applications should be using a common data store instead of needing a separate database cluster each.

Unlike popular believe in the IT world, the dotcoms are not filling their data centers with Oracle RAC clusters. The dotcoms are designing special purpose data stores. The data volumes any market-leading dotcom has to deal with are so massive that a SQL database can not keep up. SQL databases are very good at running efficient queries on structural data or making sure transactions are consistent. However they fail when data is unstructured, write operations are massive or data volumes grow with terabytes every data.

Relational Data

So for all low-volume applications that need transactional data and read more than they write, you could still use a unified Oracle RAC cluster to serve multiple applications. An alternative approach are the data stores that have been build by Amazon (Relational Database Service or SimpleDB) or Google´s App Engine (Datastore with JDO).

What other alternatives are there?

Read Mostly Data

Data that needs to be read a lot and is not updated frequently can get an enormous performance and scalability boost by using an in-memory data store. The dotcom standard is memcached. Facebook (800 servers and 28TB) and Twitter are addicted to memcached.

Documents, Images & Videos

Binary and media files are best stored outside of a database. In small numbers they are often stored on a file system. However they occupy a lot of disk as well as network bandwidth when moved around. The ideal is a document store with a content-delivery network or CDN as a front-end. Amazon´s S3 and CloudFront are examples. Storing them in a compressed format, e.g. LZO can save valuable space. Also transcoding into different formats, e.g. thumbnails or preview can help save network bandwidth.

Unstructed Realtime Data

Data that is unstructured and needs to be stored and accessed in real-time in high volumes are best stored in special purpose data stores. You can write a book about the latest NoSQL solutions. Write an email to maarten at telruptive dot com if you are interested.

Analytics Data

Twitter has described most extensively how they use all the unstructured data they get from their logs and other sources. They use technology from Facebook to stream it into a high-available file-system from Yahoo. There they run massive parallel map-reduce operations to get to know a lot more about what users are doing and who is influencing who, etc.

Social Graph

The social graph is about who knows who and what kind of relationship you have. This data is best stored in graph data stores.

Collective Intelligence

Again a chapter by itself but dotcoms are also heavy users of collective intelligence which often means dedicate systems.

Accessing Data

Instead of stove pipes with data, the dotcoms are making data accessible to all their applications. Either via search interfaces, web technology to access data (e.g. REST and JSON) or efficient binary interfaces (Thrift and Protocol Buffers).

Messaging and Notification

Amazon is having a simple queue service and a simple notification service to make sure applications communicate in a uniform matter.

Applications

If applications have access to all the above services then the architecture of an application is simplified enormously. Most of the famous dotcoms don´t use middleware. They prefer the SOA principle. However unlike the IT SOA solutions, a dotcom would take an application and make it into a chain of reusable services. Let´s take an IVR application as an example. There would be a service to do voice recognition. Another one for voice transcription. Another one for text-to-speech. A transcoding service to transcode between different media formats (e.g. high-quality voice and low-phone-quality voice). And so on. Each service has independent load-balancing and can be scaled separately. Services can be re-used between applications. An application is very short because it just need to define which services need to work together and how.

Application Deployment

The dotcoms deploy new features on a daily and even hourly basis. This means that all application deployment is fully automated. When a new feature is deployed it does not necessarily overwrite an existing feature. It is possible that a new functionality has been solved in 5 different approaches. Dotcoms would split the total user base and let small parts of users try out the different approaches. Depending on the user´s feedback they would take the preferred approach and slowly scale up from 1% to 100%. If they detect that the feature has a performance problem or a bug then they would be able to roll-back or decrease the load, fix it and deploy gradually again.

The Network, OSS and BSS

There is a substantial effort needed to redesign a network to be cloud-aware. Some components need latencies lower than 10 milli-seconds (e.g. antennas), hence most of this logic will have to be processed locally. However all systems that can live with 100 milli-seconds latencies benefit from a cloud make-over.

Especially in the area of OSS and BSS there is room for optimizing applications and making them cloud-aware. Global services like a network inventory service, a user profile service, a device profile service, etc. would mean simpler applications and less data duplication.

Opening the Cloud

So the network and IT infrastructure is being redesigned to allow for faster innovation and lower costs. However Cloud Computing can also be used to increment revenues.

Being a Cloud Infrastructure Provider

Many IT consultancies and software/hardware vendors will tell an operator that they could be a Cloud infrastructure provider. On slides this really looks nice. However unless an operator is not using the cloud computing principles for their own systems as described in the first part, they are lacking substantial knowledge about how to manage such an infrastructure. Without this knowledge it would be hard to have a very optimized solution and as such be price competitive with the existing players.

Being a Cloud Platform Provider

Although closer to the operator´s core competencies, being a cloud platform provider would still be for those operators that are Cloud experts. A Cloud platform provider would allow others to use the infrastructure services to create applications on top. The complexity lies in the fact that malicious users try to break the platform which could have a very negative effect on the infrastructure if not handled correctly.

Being a Cloud Service Provider

This is the default option most operators should explore first before moving into the other areas. Being a service provider also has a roadmap:

Reselling SaaS

The easiest step is to be the storefront and to resell IT applications from others, e.g. cloud backup storage, security solutions, etc.

Offering Telco SaaS

The next step would be to offer specific telecom applications. Applications that are build for the operator or even better applications that can be build by others based on the operator´s assets. An example would be a PBX in the Cloud.

Open Market for SaaS

Building all telecom applications yourself is hard. Attracting others to do it for you is easier. However just putting a “Net App Store” and an SDK on the web will not get you to dominate the market. Only an open market with a large eco-system of companies and developers can generate large quantities of “Net Apps”. If you are thinking about building an open market, why don´t we talk first. Send an email to maarten at telruptive dot com.

Scaling to 500 million users

September 10, 2010 Leave a comment

In the telecom domain a scalable real-time architecture means paying a lot of money in hardware and licenses. You buy the Oracle RAC solution, build a Weblogic cluster, set-up a storage area network, etc.

In the dotcom world things look differently. Facebook, Google, Twitter, Yahoo, Amazon, etc. have more active users then any telecom system. However they have build their architecture on top of open source solutions and average servers. Some even build their own software and sometimes open-sourced it.

Some of this software has very exotic names: Hadoop, Bigtable, Cassandra, Pig, Elephant-Bird, Dremel, Pregel, Dynamo, etc. Additionally design decisions are taken that would surprise every IT teacher: “do not normalize”, “do not expect immediate consistency”, “no transaction support”, “store in memory instead of on disk”, etc.

However if you can support 500 million users, 100 million daily hits, 130TB of logs, 20 billion tweet messages, 1 million servers, etc. then something you should be doing right.

The telecom software industry seems to have been isolated from the Internet during the last five years. With the shift to IP it is expected that more IT companies will be able to provide telecom solutions. Is this the solution? Not sure! Also IT companies are still playing catch-up in the cloud computing domain. Few IT solutions providers are demonstrating, they now think Map-Reduce instead of Middleware.

Google Voice is coming and most operators seem to be still more worried about churning subscribers. Google Latitude and Maps demonstrated that with new technology and innovation you can destroy the telecom monopoly  on location-based services overnight…

If you are a telecom operator and you are worried, perhaps it is time we talk.

Follow

Get every new post delivered to your Inbox.

Join 189 other followers