What is Hadoop?
Have you ever wondered how Google runs queries over its massive datasets, or how Facebook fetches your feed so quickly from an enormous amount of data?
Why is big data important?
Because over the next few months and years, you (tech folks) will mostly be having conversations about big data. Data is growing rapidly: 90% of the world's data has been generated in just the last two years. Most of this data comes from smartphones and social networks such as Facebook and Twitter.
First, let's talk about managing these huge volumes of data, collectively called big data.
In this article we will cover:
- Big Data
- Map Reduce
- Hadoop Ecosystem
How was Hadoop invented?
In the early 2000s, companies like Google were running into vast quantities of data, simply too large to push through the bottleneck of a single database for processing.
To address this, Google's engineers invented an algorithm that divides a large computation into smaller chunks, maps those chunks onto individual machines so the inputs are processed in parallel, and then combines the partial outputs into a single result. They called it MapReduce, and Google built its own internal implementation of it.
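The idea can be sketched in a few lines of plain Python. This is a toy single-process word count, not Hadoop's API: each "mapper" emits (word, 1) pairs for its chunk of input, the pairs are shuffled (grouped) by key, and each "reducer" sums the counts for one word. In a real cluster, these phases run on many machines at once.

```python
from collections import defaultdict

def map_phase(chunk):
    # Map: emit a (key, value) pair for every word in this input chunk.
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all values for one key into a single result.
    return key, sum(values)

chunks = ["big data big", "data hadoop big"]   # two input splits
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
result = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'big': 3, 'data': 2, 'hadoop': 1}
```

Because each map call only sees its own chunk and each reduce call only sees one key's values, the framework is free to spread both phases across hundreds of machines.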
Google published the MapReduce and GFS papers in 2004. Based on these papers, Hadoop was implemented and licensed under Apache. In other words, Hadoop is an open-source implementation of the MapReduce algorithm. Apache Hadoop is a simple, powerful, efficient, and scalable shared batch-processing platform. In simple terms, Hadoop lets you run data processing in parallel across hundreds of compute nodes to produce a single outcome.
And it didn't stop there. In 2005 Google developed a query language called Sawzall. Three years later, in 2008, Pig and Hive were implemented to support batch queries in the Hadoop environment.
In 2006 Google published the BigTable paper, on the basis of which HBase was built. HBase can store data in massive tables (billions of rows, millions of columns) with fast random access, letting you store and retrieve data as key/value pairs in real time within the Hadoop ecosystem.
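HBase's data model can be sketched with nested dictionaries. This is a toy illustration of the row-key/column-family layout, not the real HBase client API, and the table name, row keys, and values are made up:

```python
# Each table maps a row key -> {column family -> {qualifier -> value}}.
# Looking up a row key is fast random access, not a full-table scan.
table = {}

def put(row, family, qualifier, value):
    # Write one cell, creating the row and family on first use.
    table.setdefault(row, {}).setdefault(family, {})[qualifier] = value

def get(row, family, qualifier):
    # Random read by row key; returns None for missing cells.
    return table.get(row, {}).get(family, {}).get(qualifier)

put("user#1001", "info", "name", "Alice")
put("user#1001", "info", "city", "Oslo")
print(get("user#1001", "info", "name"))  # Alice
```

Rows are sparse: a row only stores the columns it actually has values for, which is how tables with millions of potential columns stay practical.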
Last year, in 2012, Google released a paper on its Spanner system. With Spanner, online queries such as joins and transactions are supported on big-data infrastructure.
Hopefully this pattern will continue, and the great Apache contributors will come up with an implementation of Spanner's ideas to support complex real-time queries and transactions in the Hadoop ecosystem.