Massively Parallel Processing (MPP) is not a new concept. I mean, how long have mainframes been around? But the computing power available today for thousands of dollars instead of millions is unprecedented. Previously, massively parallel and clustered systems were attainable only by large corporations with big technology budgets or by very adventurous types with a lab environment. Teradata and data warehousing appliances gave us the capability to query very large data sets with fast response times. The big database vendors all offer some form of clustering these days, but it comes at a relatively high cost. The game is definitely changing. In this post, I discuss some of the latest trends and technology options in massively parallel processing.

The recent Hadoop / Big Data wave brought us clustered computing on commodity hardware at a relatively low price point. The Hadoop ecosystem continues to evolve from a one-trick pony (MapReduce) into a flexible platform for scalable data management. Today, this clustered computing power can be exposed through SQL queries and streaming service calls, making it even more accessible to the masses.
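For readers who have not seen the MapReduce pattern mentioned above, here is a minimal sketch in plain Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample log lines are made up for illustration, and a thread pool stands in for the cluster nodes that a real framework would use.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(mapped_chunks):
    """Group all emitted values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for chunk in mapped_chunks:
        for key, value in chunk:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines, workers=4):
    # Assumption: threads stand in for cluster nodes. A real cluster
    # would spread the map tasks across many machines.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, lines))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical log lines, for illustration only.
    sample = ["error disk full", "warning disk slow", "error network down"]
    print(word_count(sample))
```

The point of the pattern is that each map call and each reduce call is independent of the others, which is what lets a cluster run them on many machines at once.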

Another trend is Cloud Computing. Cloud is not just about external hosting. It is about providing easier access to technology resources and allowing us to focus more on solving business challenges. What if you could provision an essentially unlimited storage environment at a very low cost and allocate clusters of inexpensive machines to process data without buying a single server? With that capability you could capture every log produced by all of your applications, hardware components, and operational devices. You could then combine those logs with the sales data and other data typically found in data warehouses, and easily supplement them with external data such as weather, economics, and demographics. Then you could decide the value of that information and how it all fits together over time.

Perhaps you could correlate external events and trends to operational patterns in your applications and calculate how all that relates to financial performance. Or perhaps you could perform statistical analysis to determine how strongly various data sets correlate.
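As one way to picture "how strongly various data sets correlate", here is a small sketch of the Pearson correlation coefficient in plain Python. The function name `pearson` and the temperature and sales figures are hypothetical, chosen only to illustrate the calculation. On real volumes you would run this kind of statistic inside the cluster or database instead of in a single process.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (covariance numerator) and the
    # matching sums of squared deviations for each series.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical daily figures, for illustration only.
    temperature_f = [61, 65, 70, 74, 80, 85]
    sales_units = [120, 130, 128, 150, 170, 181]
    print(f"correlation: {pearson(temperature_f, sales_units):.3f}")
```

A result near 1 or -1 suggests a strong linear relationship and a result near 0 suggests little or none. Correlation alone does not show that one series causes the other.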

Massively Parallel Processing systems like Hadoop give you the ability to deal with extremely large volumes of detailed data and turn that data into something usable. But Hadoop still isn’t quite ready to consistently deliver sub-second response for queries on large volumes of data. For that, we need a different flavor of MPP.

Amazon Redshift debuted a few years ago and provides analytic query capability starting around $1,000 per terabyte per year. Its combination of columnar database technology, cloud infrastructure, and clustered computing allows sub-second query response on large volumes of data, as well as the ability to start small and scale as needed.

So now you can structure and aggregate the detailed data that I described earlier and create a consistent set of structures to report on how that data looks on an ongoing basis – today, yesterday, this month, this year, etc. You can get analytics on your best customers or how trends converge or diverge over time.
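To make the idea of rolling detailed data up into consistent reporting structures concrete, here is a small sketch in plain Python. The `rollup` function and the sample records are hypothetical. In a warehouse such as Redshift, the same idea would normally be expressed as a SQL aggregation over a date column.

```python
from collections import defaultdict
from datetime import date


def rollup(records, period):
    """Sum amounts per period. `records` is an iterable of (date, amount)."""
    key_funcs = {
        "day": lambda d: d.isoformat(),
        "month": lambda d: f"{d.year:04d}-{d.month:02d}",
        "year": lambda d: f"{d.year:04d}",
    }
    if period not in key_funcs:
        raise ValueError(f"unknown period: {period!r}")
    make_key = key_funcs[period]
    totals = defaultdict(float)
    for day, amount in records:
        totals[make_key(day)] += amount
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical detail rows, for illustration only.
    detail = [
        (date(2024, 1, 30), 10.0),
        (date(2024, 1, 31), 5.0),
        (date(2024, 2, 1), 2.5),
    ]
    for period in ("day", "month", "year"):
        print(period, rollup(detail, period))
```

Keeping the detail rows and deriving each period from them means the daily, monthly, and yearly views always agree with one another.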

So many massively parallel processing options, so little time! There has never been a better time to tackle your data management challenges. Are you ready for MPP?!

Authors:

Brandon Davis
