Presto – Facebook's Exabyte-Scale Query Engine

Presto is Facebook's answer to Cloudera's Impala, Hortonworks' Stinger and Google's Dremel. Presto is an ANSI SQL compatible real-time data warehouse query engine, so existing data tools should work with it, unlike Hive, which needed special integration. Presto works in memory and runs simple queries in a few hundred milliseconds and complex queries in a few minutes, which makes it ideal for interactive data warehousing. Unfortunately Presto will not be open sourced until later this year [probably fall], so the Big Data community will have to be patient.

Open source real-time massive-scale data warehousing is likely to disrupt existing players like Teradata, Oracle, etc., who until recently were able to charge $100K per terabyte…

MapD – Massively Parallel GPU-based database

An MIT student recently created a new type of massively parallel database, one that runs on graphics processors (GPUs) instead of CPUs. MapD, as it has been called, makes use of the immense computational power available in off-the-shelf graphics cards that can be found in any laptop or PC. MapD is especially suitable for real-time querying, data analysis, machine learning and data visualization. MapD is probably only one of many databases that will try new hardware configurations to cater for specific application use cases.

Alternative approaches could focus on large sets of cheap mobile processors, Parallella processors, Raspberry Pis, etc., all stitched together. The idea would be to create massive processing clouds based on cheap specialized hardware that could beat traditional CPU clouds in both price and performance, at least for some specific use cases…

A Big Data-Base that is fast but inaccurate: BlinkDB

April 6, 2013

The idea might sound strange at first. Why would you want a database that delivers inaccurate data? The answer is that BlinkDB trades accuracy for speed. When you query data you can specify either when you want the answer, e.g. within 2 seconds, or how accurate you want the answer to be, e.g. 1% error with 95% confidence.

So if you have very large amounts of data (tens to hundreds of terabytes or even petabytes) and you want quick, good-enough answers, then BlinkDB is for you. An early adopter is Facebook. Would you rather have Justin Bieber's follower count exactly right in minutes, or 99% right with a page that loads almost instantly? If you prefer fast, reasonably accurate answers over slow, correct ones, BlinkDB is worth checking out.
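The accuracy-for-speed trade-off can be illustrated with plain random sampling. The sketch below is not BlinkDB's implementation; it is a minimal Python illustration, assuming a simple random sample and a normal approximation for the 95% confidence interval. The function name `approximate_mean`, its parameters and the page-load data set are all hypothetical.

```python
import math
import random
import statistics

Z_95 = 1.96  # normal quantile for a two-sided 95% confidence interval


def approximate_mean(values, max_rel_error=0.01, start_size=1000, seed=0):
    """Estimate the mean of `values` from a random sample.

    The sample size is doubled until the 95% confidence half-width is
    within `max_rel_error` of the estimate, or until all rows are used.
    Returns (estimate, half_width, sample_size).
    """
    if len(values) < 2:
        raise ValueError("need at least two values")
    rng = random.Random(seed)
    size = max(2, min(start_size, len(values)))
    while True:
        sample = rng.sample(values, size)
        estimate = statistics.fmean(sample)
        # Standard error of the mean; the normal approximation is an assumption.
        half_width = Z_95 * statistics.stdev(sample) / math.sqrt(size)
        if half_width <= max_rel_error * abs(estimate) or size == len(values):
            return estimate, half_width, size
        size = min(size * 2, len(values))


if __name__ == "__main__":
    data_rng = random.Random(42)
    # Hypothetical data set: 200,000 page-load times in milliseconds.
    data = [data_rng.gauss(200, 50) for _ in range(200_000)]
    estimate, half_width, used = approximate_mean(data)
    print(f"mean ~ {estimate:.1f} +/- {half_width:.1f} ms using {used} of {len(data)} rows")
```

The point of the sketch is that the answer comes with an error bar and usually touches only a small fraction of the rows. A time-bounded variant would stop growing the sample when a deadline expires instead of when the error target is met.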

What can you use BlinkDB for?

  • The obvious use case would be real-time reporting. If you need to make decisions in the blink of an eye, e.g. as a day trader, and 5-10% error is acceptable, you could ask for the average change of all commodity prices in the last 2 seconds.
  • Real-time bookings or price comparison in which users want to know the best possible offer but accept some small error margin, e.g. mobile bar-code scanners that deliver product price comparisons in 1 second instead of 10 will dominate the App Store.
  • Any counter on a large website: visitors, friends, tweets, total search results, etc.
  • Any Power Law or Long Tail data in which there are some extremely popular cases, e.g. Justin Bieber followers, or a very large set of infrequent cases, e.g. the number of blogs that have under 1000 visitors per month.
  • Machine Learning solutions and recommendation engines that are using Collaborative Filtering and other types of algorithms that compare an item or user with large groups of other items and users.
  • and many other use cases…

Build your own 4G LTE picocell, GPS receiver, Bluetooth, ZigBee, etc.

Software defined radio is like software defined networking but for radio: you can build whatever you want by updating the software. Recently a new project got funded on Kickstarter that allows radio amateurs to build anything they want related to radio. BladeRF is an open source USB 3.0 software defined radio for $400.


So the usual suspects will be existing 4G, GPS, ZigBee, Wi-Fi, etc. standards, but what if some innovators start thinking outside of the box? White spaces would be one option. But what if 5G or 6G is no longer defined in standards bodies but by a community of open source amateurs working together? That is probably a step too far, but M2M (machine to machine) / IoT (Internet of Things) could still use more efficient standards. Federated ad-hoc networks that circumvent local censorship or work around outages could also become options. Let’s just hope Chinese suppliers can bring down the price of the BladeRF…

Amazon AWS awkward features to fix and enterprise features to add

AWS is used by more and more enterprises today but Amazon should work on several awkward “features” that make daily usage by enterprises difficult.

AWS console consistency
The console is not very consistent and could be made a lot easier for users. Why do elastic load balancers not have tags? Why do VPCs, subnets, route tables, etc. not have names, so that you need to work with their IDs? Why are network ACLs stateless and security groups stateful? Why are the security group administration pages in VPC and EC2 different? Why can I not see the name of a security group when I use it in an inbound or outbound rule? Why can I give a temporary role to an API but not give a user or group a temporary role similar to sudo or delegated administration? Why do RDS tags not filter out CloudFormation tags when editing, while EC2 tags do?

IAM and the console
End users who are limited to a small subset of services and resources are in for a surprise. They will see the same options as an administrator, but after clicking they will get a no-permission error. It would be so much easier if services, buttons, menus, etc. that you don’t have permission for were not visible.

Java AWS API and Eclipse plugin
Probably the worst Java API of the last 10 years. You have to go to restricted instances to see your on-demand instances. You have list after list after list to go through to get somewhere. Sometimes you call getTags, sometimes you use a request and a response. You have to use the RDS ARN to get to the tags, but you only get the ID from the RDS instance. Etc. etc. etc. Amazon should run a 100K competition on who can create a better API. Whoever gets more than 1 million users for their API wins.

Installing the Eclipse plugin
If you don’t use Eclipse JEE, you will need to fight with several plugins, because nobody told you that the plugin is only compatible with JEE. If you do not have the Android SDK installed, you cannot accept the Eclipse license.

CloudFormation
It seems like few people are using it, because there are no support posts when you Google for it. Then again, you can understand why people do not use it. There are several limitations in the parameters page. Try creating a secure password for your RDS master user and you can only use letters and numbers. Only have three valid values for a parameter? Why not put them in a drop-down? Wait, there is no drop-down. You go to the end of the wizard before it complains about a problem on the first page. Start a stack name with a number and it will complain at the end as well. Inside CloudFormation scripts you will find several inconsistencies as well, e.g. no tags for security groups, you cannot use underscores in names, try using the instance ID in the tag for the name and you get a circular error, etc.
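Until the wizard validates earlier, one workaround is to check your inputs yourself before you start. The following Python sketch encodes only the rules complained about above (a stack name may not start with a number, names may not contain underscores, the RDS master password may only contain letters and numbers, a parameter has a short list of valid values). The patterns, the function name `validate_inputs` and the example values are assumptions for illustration, not a copy of AWS's own validation logic.

```python
import re

# Assumed rules, taken from the complaints in this post rather than from
# any official AWS validation logic.
STACK_NAME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9-]*$")  # starts with a letter, no underscores
PASSWORD_RE = re.compile(r"^[A-Za-z0-9]+$")             # letters and numbers only


def validate_inputs(stack_name, master_password, instance_class, allowed_classes):
    """Return a list of problems found; an empty list means the inputs look fine."""
    errors = []
    if not STACK_NAME_RE.match(stack_name):
        errors.append("stack name must start with a letter and contain no underscores")
    if not PASSWORD_RE.match(master_password):
        errors.append("master password may only contain letters and numbers")
    if instance_class not in allowed_classes:
        errors.append(f"instance class must be one of {sorted(allowed_classes)}")
    return errors


if __name__ == "__main__":
    # Hypothetical inputs that trip all three checks.
    problems = validate_inputs(
        stack_name="1st_stack",
        master_password="s3cret!",
        instance_class="db.huge",
        allowed_classes={"db.small", "db.medium", "db.large"},
    )
    for problem in problems:
        print("ERROR:", problem)
```

Running such a check before opening the wizard reports every problem at once instead of one at the very end.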

Missing enterprise functionality
Try encrypting your EBS volumes: good luck. You have finally managed to set up a VPN in your VPC and your IT department is ready to start opening it up to multiple departments. Wait, how are we going to charge them? Linked accounts are not an option because we are not going to set up a VPN for each department. Adding tags to each instance to include them in your usage report? Good luck with automating tags, given the referential errors, etc. in CloudFormation, or with rebuilding a custom portal based on the API. What about limiting department X to instances A, B and C? That is inconsistently implemented, if it is available at all for the service you want to use. Migrating instances between VPC subnets? Stop, create AMI, start new instance. Forgot to add a security group to an instance? Stop, create AMI, new instance. Why?
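The tag-based chargeback idea can be sketched in a few lines of Python. This is a minimal illustration over an in-memory list of usage records. The record layout, the `Department` tag key and the `cost_per_department` function are hypothetical and do not correspond to any AWS report format or API.

```python
from collections import defaultdict


def cost_per_department(records, tag_key="Department"):
    """Sum usage cost per department tag; untagged usage is reported separately."""
    totals = defaultdict(float)
    for record in records:
        department = record.get("tags", {}).get(tag_key, "UNTAGGED")
        totals[department] += record["hours"] * record["hourly_rate"]
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical usage records for three instances.
    usage = [
        {"instance": "A", "hours": 10, "hourly_rate": 0.5, "tags": {"Department": "Sales"}},
        {"instance": "B", "hours": 4, "hourly_rate": 0.25, "tags": {"Department": "HR"}},
        {"instance": "C", "hours": 8, "hourly_rate": 0.5, "tags": {}},
    ]
    for department, cost in sorted(cost_per_department(usage).items()):
        print(f"{department}: ${cost:.2f}")
```

The `UNTAGGED` bucket matters in practice: anything that was launched without a tag ends up there, which is exactly the cost nobody can be charged for.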

Conclusion
Is AWS a bad service or product? Not at all. Is it ready for global enterprise deployment? It will be in the next 24 months. Should I wait till then? If you are not using the Cloud today, then you are already a year late. Elastic scaling, instant provisioning, pay per use, etc. beat any awkward “features”. But some API design competitions, customer usability studies and a community roadmap driven by votes would go a long way…

5 hardware trends to watch…

      1. Open Compute

        Open Compute is focusing on creating a new type of server: an open source server based on open source storage, motherboards, racks, data center designs, etc. Instead of keeping designs proprietary, Open Compute makes them open source. Expect prices for these “commoditized” servers to be substantially lower, enabling web-scale data centers of a size not seen before. The big driver behind the initiative is Facebook.

      2. Printing everything

        Imprint Energy is a start-up that is putting research from the University of California into practice. Printed batteries are bendable and can be very thin, which makes a new series of applications possible that were previously unimaginable. 3D printing will probably become mainstream in 2013-2014 via manufacturing-as-a-service, with consumers buying their first printers in 2014-2015. Bio printing could also open the door to innovation.

      3. Wearable Tech or Fashion Electronics

        Google Glass, smart clothes, Nike’s FuelBand, etc. are all examples of wearable tech. Expect printable batteries to make the tech really flat (clothes) or really small (glasses), which means that we haven’t seen anything yet. Also expect the explosion of sensor data to include a lot of “human performance data”.

      4. Miniature Arduino

        RFDuino is a good example of how Arduinos are shrinking. Open source intelligent miniature hardware will revolutionize many industries, e.g. garden and pool computers, bike computers, etc.

      5. FPGAs and other open source hardware

        Mojo is a good example of how not only micro-controllers but also FPGAs and other hardware controllers can be made open source. Given the parallel processing and multimedia processing capabilities of FPGAs, expect revolutionary products in this domain.

How Intel’s Hadoop distribution wants to be different

February 27, 2013

Intel announced its Big Data strategy this week, with its own Hadoop distribution as the crown jewel. Many people will be surprised that a chip maker wants to be your Hadoop supplier as well. McAfee is Intel’s most visible enterprise software offering, and it was an acquisition, not an offering based on organic growth.
Intel’s Hadoop distribution, on the other hand, started as a Chinese project some years ago and turned into a product.
So how is Intel going to compete with Cloudera, Hortonworks, MapR, IBM, EMC/Greenplum, etc.?
The Intel Hadoop Distribution has real-time queries, just like Cloudera’s Impala. But instead of being a separate product, they will be embedded in Hive. Intel also looked at Cloudera Manager for inspiration on how to make Hadoop management easy. This part will, however, only be available for enterprise customers.
One of the main selling points will be performance. Intel’s Hadoop will be fully optimized for Intel’s processors and SSDs. Another selling point is security. Intel is launching Project Rhino, which will include more fine-grained security and faster encryption. Furthermore, Intel’s Hadoop is based on YARN, the latest Hadoop branch, which comes with extra features like support for frameworks other than MapReduce and advanced resource management.
Finally, unlike Cloudera, MapR and Hortonworks, Intel is a blue chip company with a global footprint and big-name partnerships like Cisco, SAP, Teradata, Wipro, SAS, Dell, Red Hat, etc.
Will it be enough to stop people from running Hadoop on large volumes of low-cost ARM chips? Only time will tell…
