- September 24, 2016
- Posted by: Nagi Bombhore
- Category: Thought Leadership
The year 2003 was remarkable in the history of Big Data. In October of that year, Google published its paper on the Google File System, followed in December 2004 by a paper on MapReduce, a programming model for managing large-scale data processing across huge clusters of commodity servers. Building on those ideas, Doug Cutting and Mike Cafarella created Apache Hadoop in 2006, an open source platform that supports the storage and processing of large data sets in a distributed computing environment. Doug's son had named his toy elephant "Hadoop", and the same name was adopted for the open source project because it was easy to pronounce and easy to Google.
As the Apache Software Foundation describes it:
> "The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures."
Hadoop has two core components: the Hadoop Distributed File System (HDFS) and MapReduce. MapReduce is the parallel programming component: it distributes work across the cluster so that processing takes place locally on every machine, and the results from the individual machines are then combined into the final output. HDFS is the distributed file system that stores the data as blocks across the nodes of the cluster.
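The MapReduce model is easiest to see in a small sketch. The following is a hypothetical word-count example in plain Python that simulates the map, shuffle, and reduce phases locally; a real Hadoop job would run these phases across many cluster nodes.

```python
from collections import defaultdict

def map_phase(document):
    """Emit (word, 1) pairs, as a Hadoop mapper would for its input split."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Group values by key, mimicking Hadoop's shuffle-and-sort step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the counts for each word, as a reducer would."""
    return {word: sum(counts) for word, counts in grouped.items()}

# Each "split" would live on a different cluster node in real Hadoop.
splits = ["big data is big", "data drives decisions"]
pairs = [pair for split in splits for pair in map_phase(split)]
print(reduce_phase(shuffle(pairs)))
# {'big': 2, 'data': 2, 'is': 1, 'drives': 1, 'decisions': 1}
```

The key point is that each mapper only ever sees its own split, so the map phase parallelizes naturally; only the shuffle requires moving data between machines.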
- The data is split into 128 MB (or 64 MB) blocks, which are distributed to various cluster nodes for processing
- HDFS manages these blocks as a layer on top of each node's local file system
- Blocks are replicated across nodes as insurance against future failures
- The framework checks that the processing code has executed successfully
- The processed data is sent to the desired system
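The splitting and replication steps above can be sketched as follows. This is an illustrative model, not HDFS code: the function names and round-robin placement are assumptions for the sake of the example, though the 128 MB block size and replication factor of three match Hadoop's defaults.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the Hadoop 2.x default block size
REPLICATION = 3                 # default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) chunks, as the NameNode logically records them."""
    return [(off, min(block_size, file_size - off))
            for off in range(0, file_size, block_size)]

def place_replicas(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes (round-robin sketch)."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

# A 300 MB file splits into two full blocks plus a 44 MB remainder.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))  # 3
print(place_replicas(blocks, ["node1", "node2", "node3", "node4"]))
```

Note that the last block is smaller than the block size; HDFS does not pad files out to a block boundary.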
With the above process, the required data manipulation becomes easier without needing a single heavy machine, because the data is split and processed in parallel across one connected system.
The following five benefits have been realized and proven valuable to companies like Google and Facebook, as well as traditional enterprises managing large volumes of data.
- Scalable: Hadoop is a highly scalable platform because it can store and process large data sets in parallel across many small, inexpensive servers
- Cost effective: Hadoop is designed to affordably store all of a company's data for later use. Instead of costing thousands per terabyte, it offers computing and storage capabilities at commodity prices
- Flexible: Hadoop can be employed to generate relevant reports from diverse data sources such as social media, email conversations, or website visitor tracking, eventually helping the business grow on various fronts. It is also used for log processing, recommendation systems, marketing campaign analysis, and fraud detection
- Fast: The processing tools run on the same servers as the data, which makes processing faster. For large volumes of unstructured data, Hadoop can process terabytes in just a few minutes and petabytes in just a few hours
- Resilient to failure: When data is written to one node, it is also replicated to other nodes in the cluster, so even when a node fails, a copy remains available for use
Hadoop helps businesses by splitting their large data sets, storing them, and providing a framework for data manipulation, all in a very cost-effective manner. By replicating each block across multiple nodes (three by default), it also reduces the probability of losing data to almost zero, because there are always additional nodes storing the same data.
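The resilience argument can be illustrated with a small sketch. This is not HDFS code; the replica map and node names are hypothetical, but the arithmetic holds: with three replicas per block, losing any single node still leaves at least two live copies.

```python
# Hypothetical replica map: each block is stored on three nodes.
replicas = {
    "block-1": {"node1", "node2", "node3"},
    "block-2": {"node2", "node3", "node4"},
}

def live_replicas(replica_map, failed_nodes):
    """Return the replicas that survive after the given nodes fail."""
    return {block: nodes - failed_nodes for block, nodes in replica_map.items()}

# Simulate the failure of node2: every block still has two live replicas.
after_failure = live_replicas(replicas, {"node2"})
assert all(len(nodes) >= 2 for nodes in after_failure.values())
print(after_failure)
```

In a real cluster, the NameNode notices the lost replicas via missed heartbeats and schedules re-replication to restore the target count.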
Despite all this advancement, Hadoop still lacks some important features:
- Multiple copies of huge data: Hadoop generally makes three copies of the data on three different nodes, which can eventually lead to nine copies as six more copies are made locally to maintain performance
- SQL support: Hadoop itself has significant gaps in SQL support; even operations as basic as GROUP BY must be hand-coded or delegated to higher-level tools such as Hive
- Inefficient execution: There is no query optimizer, so clusters often end up larger than they need to be
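To make the SQL-support point concrete: an aggregation that is one line of SQL (`SELECT dept, SUM(salary) FROM emp GROUP BY dept`) must be hand-written as map and reduce steps in raw Hadoop. A hypothetical Python sketch of that translation, with illustrative data:

```python
from collections import defaultdict

# Sample rows standing in for an "emp" table: (dept, salary).
rows = [("eng", 100), ("eng", 120), ("sales", 90)]

def mapper(row):
    dept, salary = row
    yield (dept, salary)      # key = the GROUP BY column

def reducer(key, values):
    return (key, sum(values))  # the aggregate = SUM(salary)

# The framework's shuffle step, written out by hand here.
grouped = defaultdict(list)
for row in rows:
    for key, value in mapper(row):
        grouped[key].append(value)

result = dict(reducer(k, vs) for k, vs in grouped.items())
print(result)  # {'eng': 220, 'sales': 90}
```

Tools like Hive exist precisely to generate this kind of map/reduce plumbing from a SQL statement automatically.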
Hadoop is gaining wide acceptance due to its architecture and its approach to solving problems. For some companies, Hadoop is a one-stop solution for all of their data troubles. Its way of replicating data to reduce loss is regarded as a tremendous innovation, but for some workloads this duplication of data acts as a curse. Which do you think it is?
Do you employ Hadoop in your organization? Or are you a Hadoop developer yourself? Let us know your thoughts on the platform below.