Author: Annasaheb Kashinath Sunthe | Technical Manager at Aditya Birla Minacs
Editor’s Note: You can also read the first two of this series here: Big Data 101 and 102
In my previous post, I discussed why Big Data is proving to be a strategic asset in the achievement of business results. Here, I present a quick survey of technologies that help harness Big Data.
BIG DATA: A TECHNOLOGY PERSPECTIVE
Tackling the Big Data challenge to take advantage of the business value it can provide is going to be a multi-stage process. Most of the new data sources gathered under the banner of Big Data are fundamentally different from the type of data stored in traditional databases. Taking on this multi-dimensional challenge therefore requires new breeds of technology and infrastructure that can:
- Store and manage data differently
- Distill all these types of Big Data into a form that allows further analysis
- Search, understand and analyze complex data sets.
My first stop is the storage and analysis of Big Data coming from an array of sources. Until now it has been too expensive to store and analyze these massive volumes. But it is now clear that there is also a staggering opportunity cost associated with not tapping into this treasure of information, since its potential value to business results is nearly limitless.
DATA STORAGE AND ANALYSIS: MANAGING THE AVALANCHE!
Let’s now look into the relevant technology solutions needed for storage and real-time analysis. Most of the tools I discuss are open source, but a growing variety of proprietary tools is also emerging.
Organizations have attempted to deal with this problem from many different viewpoints. The approach currently leading the pack for massive data analysis is the open source project Apache Hadoop. It is a computing environment built on a distributed, clustered file system designed specifically for very large-scale data operations. It scans through large data sets to produce results using a highly scalable, distributed batch processing system. It is not about speed-of-thought response times, real-time warehousing, or blazing transaction speeds; it is about discovery and about making once near-impossible scalability and analysis requirements possible.
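To make the batch processing model more concrete, here is a minimal single-machine sketch of the map, shuffle and reduce steps that Hadoop distributes across a cluster. It is plain Python and does not use Hadoop's API; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample records are assumptions made up for illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle and reduce in sequence on a list of text records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical input records, standing in for lines of a very large file.
    sample = ["big data is big", "data is an asset"]
    print(word_count(sample))
```

In a real cluster the map and reduce steps would run in parallel on many machines, each working on its own slice of the data, which is where the scalability comes from.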
Other NoSQL storage options available are:
- DynamoDB: a distributed key-value store designed to deal with data stored over a large number of servers. It is a storage “service-in-a-box” driven by SLAs that provides high scalability, availability and performance.
- Berkeley DB, MemcacheDB: databases that store arbitrary key/value pairs, either on disk or in memory, to provide high performance.
- HBase (which runs on Hadoop), Google’s Bigtable, and Apache Cassandra: column-oriented databases that use distributed multi-dimensional maps. These are designed to easily scale across hundreds or thousands of machines.
- MongoDB, CouchDB: document-oriented databases with schema-less JSON-style object data storage.
- Neo4j, HyperGraphDB: databases designed to store graph-oriented Big Data structures in graphs rather than in tables.
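The difference between these storage models is easier to see side by side. The sketch below mimics each one with ordinary Python structures; it does not use any of the products named above, and every key, field and value in it is made-up illustration data.

```python
# All keys and values below are made-up illustration data.

# Key-value store: an opaque value looked up by a single key.
key_value = {"user:42": '{"name": "Asha", "city": "Pune"}'}

# Column-oriented store: row key -> column family -> column -> value.
column_family = {
    "user:42": {
        "profile": {"name": "Asha", "city": "Pune"},
        "activity": {"last_login": "2012-05-01"},
    }
}

# Document store: a schema-less, JSON-style object with nested fields.
document = {"_id": 42, "name": "Asha", "orders": [{"sku": "A1", "qty": 2}]}

# Graph store: nodes connected by labelled edges instead of table rows.
edges = [
    ("user:42", "FOLLOWS", "user:7"),
    ("user:42", "FOLLOWS", "user:9"),
    ("user:7", "FOLLOWS", "user:42"),
]


def neighbors(edge_list, node, relation):
    """Return the nodes reachable from `node` over edges labelled `relation`."""
    return sorted(dst for src, rel, dst in edge_list if src == node and rel == relation)


if __name__ == "__main__":
    print(column_family["user:42"]["profile"]["city"])
    print(document["orders"][0]["sku"])
    print(neighbors(edges, "user:42", "FOLLOWS"))
```

The practical point is that each model makes a different kind of question cheap: a key-value store answers "give me this key", a column-oriented store reads a few columns across many rows, a document store fetches a whole nested object, and a graph store follows relationships.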
BIG DATA SEARCH: SPEED, ACCURACY, SCALABILITY
How does one search through these massive volumes of data? Traditionally, scalability is achieved by master-slave replication and data sharding, which can be a huge challenge at large scales.
MapReduce-based technologies such as HBase and Hadoop achieve scaling and address this challenge by transparent partitioning, distribution and replication. HSearch is an open source, distributed, multi-format, structured and unstructured content search engine built on the HBase platform.
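The idea of partitioning and replication can be sketched in a few lines. The example below hashes a key to pick a home node and then copies it to the next nodes in the list. This is a simplified illustration only, not the placement logic that HBase, Hadoop or HSearch actually use; the function names, the node names and the default replication factor of 3 are assumptions.

```python
import hashlib


def partition_for(key, num_partitions):
    """Map a key to a partition number using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


def replicas_for(key, nodes, replication_factor=3):
    """Pick the home node for a key plus the following nodes as replicas."""
    start = partition_for(key, len(nodes))
    count = min(replication_factor, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(count)]


if __name__ == "__main__":
    # Hypothetical cluster of five machines.
    cluster = ["node-a", "node-b", "node-c", "node-d", "node-e"]
    for key in ("doc-1", "doc-2", "doc-3"):
        print(key, "->", replicas_for(key, cluster))
```

Because the placement is computed from the key itself, a client does not need to know where data lives, which is what makes the partitioning "transparent". Keeping several copies means a search can still be answered when one machine fails.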
Also, tools like Splunk provide a flexible and simple way to sift through mounds of Big Data stored in file systems. Once the data is understood, the extracted fields can be stored in relational, graph or XML databases, depending on how the data will be used. Once it’s clear what needs to be looked into, high-speed sifting of massive volumes of data can be accomplished by using highly parallel programs created with a technology like DataRush.
DISCOVER AND ANALYZE: FROM DATA TO BUSINESS INTELLIGENCE
Finally, we look at analyzing and discovering information from Big Data. To cope with the influx of Big Data, organizations have tried to extend the life of their relational databases (RDBMS) and their business intelligence and data warehousing systems to support the “overflow”.
However, this approach is at its limit. The effort required to extract, transform and load (ETL) unstructured data, manage and integrate multiple data sources, predefine queries, and build Big Data applications is time-intensive and unsustainable.
Innovations are emerging to help business intelligence professionals overcome the challenges of analyzing the combination of structured, semi-structured and unstructured data. Just as data storage systems are being enhanced to support a variety of data, several front-end solutions for analyzing, visualizing and discovering information from this data are also quickly appearing. Some open source tools in this space are Jaspersoft BI and Revolution Analytics’ advanced analytics tools. A variety of operational intelligence vendors also offer solutions that can be used to discover deeper information from Big Data.
For engineers and technical managers like me, Big Data is at once a challenge and an opportunity to deploy technology that solves a real world problem and leads to better business results.
Big Data technology is still nascent and the tools space is still only emerging. So what is your take on the future of Big Data and the technologies that can help harness it?