"Cloud" originally referred to large internet services, such as Google and Yahoo!, that run on tens of thousands of machines. More recently, cloud computing refers to services offered by these companies that let external customers rent computing cycles on their clusters.
Hadoop is an open-source cloud computing environment that implements Google's MapReduce framework in Java and can handle data sets ranging up to petabytes (PB) in size. MapReduce makes it easy to process and generate large data sets on the cloud. Using MapReduce, you divide the work into smaller chunks that can be processed concurrently, then combine the partial results to obtain the final result. MapReduce lets you exploit the massive parallelism provided by the cloud while presenting a simple interface to a very complex, distributed computing infrastructure. By modeling a problem as a MapReduce problem, we can take advantage of the cloud computing environment provided by Hadoop.
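The map/shuffle/reduce flow described above can be sketched in plain Python. This is only an illustrative, single-machine model of what the framework does (the function names here are my own, not Hadoop's API), using the classic word-count example:

```python
from collections import defaultdict

def map_phase(records, map_fn):
    # Apply the user-supplied map function to every input record,
    # emitting a flat list of (key, value) pairs.
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def shuffle_phase(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    # Apply the user-supplied reduce function to each key's group of values.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Word count: map splits lines into (word, 1) pairs, reduce sums the counts.
def word_map(line):
    return [(word, 1) for word in line.split()]

def word_reduce(word, counts):
    return sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
result = reduce_phase(shuffle_phase(map_phase(lines, word_map)), word_reduce)
print(result["the"])  # 3
```

On a real cluster the map and reduce calls run in parallel on many machines, and the shuffle moves data over the network; the programmer only writes the two small functions.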
MapReduce is used at Yahoo! for the Web map, spam detection, network analysis, and click-through optimization; at Facebook for data mining, ad optimization, and spam detection; and at Google for index construction, article clustering for news, and statistical machine translation.
In HDFS, data is organized into files and directories. Files are divided into uniform-sized blocks (128 MB by default), and blocks are replicated (3 replicas by default) and distributed across machines to handle hardware failure; replication also improves performance and fault tolerance through rack-aware placement. HDFS exposes block placement so that computation can be migrated to the data, and it uses checksums to detect corruption.
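To make the block and replica arithmetic concrete, here is a small Python sketch of how a file is cut into fixed-size blocks and how replicas might be assigned. The round-robin placement is a deliberate simplification; real HDFS placement is rack-aware, as noted above:

```python
def split_into_blocks(file_size_bytes, block_size=128 * 1024 * 1024):
    # Return (offset, length) for each HDFS-style block of a file.
    blocks = []
    offset = 0
    while offset < file_size_bytes:
        length = min(block_size, file_size_bytes - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def place_replicas(block_id, datanodes, replication=3):
    # Toy round-robin placement across datanodes; real HDFS uses a
    # rack-aware policy instead of simple rotation.
    n = len(datanodes)
    return [datanodes[(block_id + i) % n] for i in range(replication)]

blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
# -> 3 blocks: two full 128 MB blocks and one 44 MB tail
replicas = place_replicas(0, ["node1", "node2", "node3", "node4"])
```

Because the namenode knows exactly which datanodes hold each block, the scheduler can run a map task on a machine that already has a local copy of its input.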
I used MapReduce for my project, "Find the best restaurant in the USA – from review comments". The tasks were quite simple.
First, I used a Python crawler to extract the data from http://www.restaurantica.com and put it into a text document; it was a huge set of data.
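The extraction step can be sketched with Python's standard-library HTML parser. The markup here is hypothetical (I am assuming reviews sit in `<p class="review">` elements purely for illustration; the real site's structure would differ), and the fetching itself is omitted:

```python
from html.parser import HTMLParser

class ReviewExtractor(HTMLParser):
    """Collect the text of elements assumed to be marked class="review"."""
    def __init__(self):
        super().__init__()
        self.reviews = []
        self._in_review = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples.
        if tag == "p" and ("class", "review") in attrs:
            self._in_review = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_review = False

    def handle_data(self, data):
        if self._in_review and data.strip():
            self.reviews.append(data.strip())

# A tiny stand-in for one crawled page.
page = '<html><p class="review">Great pasta!</p><p class="review">Slow service.</p></html>'
parser = ReviewExtractor()
parser.feed(page)
print(parser.reviews)  # ['Great pasta!', 'Slow service.']
```

In the real crawler each page's extracted reviews would be appended to the text document that the later MapReduce job consumes.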
Second, I used a PoS (part-of-speech) tagging tool to extract the key words; this was a crucial step, since we were trying to find the best restaurant.
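As a rough illustration of why this step matters, here is a toy stand-in for the PoS step: instead of a real tagger, it uses a tiny hand-made lexicon of opinion adjectives (both the lexicon and the function are my own illustration, not the tool the project used):

```python
# Tiny illustrative lexicon; a real PoS tagger would identify adjectives
# grammatically rather than from a fixed word list.
OPINION_ADJECTIVES = {"great", "excellent", "terrible", "slow", "delicious", "bad"}

def extract_keywords(review):
    # Normalize each word (strip punctuation, lowercase) and keep only
    # the words that carry an opinion about the restaurant.
    words = [w.strip(".,!?").lower() for w in review.split()]
    return [w for w in words if w in OPINION_ADJECTIVES]

print(extract_keywords("The pasta was delicious but the service was slow!"))
# -> ['delicious', 'slow']
```

Filtering reviews down to opinion-bearing words like these is what lets the later job score restaurants instead of counting every word equally.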
Finally, I ran the MapReduce job on the data. We used Pig on top of Hadoop to make things easier for us, and a multi-node cluster to make the computation faster.
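The core of that final job is essentially a group-and-aggregate, which Pig compiles down to MapReduce. Its logic can be sketched in plain Python; the records below are hypothetical (restaurant, review score) pairs of the kind the keyword step might yield, not actual project data:

```python
from collections import defaultdict

# Hypothetical scored reviews; in the real job these came from the crawled data.
records = [
    ("Luigi's", 5), ("Luigi's", 4),
    ("Burger Barn", 2), ("Burger Barn", 3),
]

def best_restaurant(records):
    # Group scores by restaurant and pick the highest average --
    # the same GROUP BY / AVG / ORDER shape a Pig script would express.
    totals = defaultdict(lambda: [0, 0])  # restaurant -> [sum, count]
    for name, score in records:
        totals[name][0] += score
        totals[name][1] += 1
    return max(totals, key=lambda name: totals[name][0] / totals[name][1])

print(best_restaurant(records))  # Luigi's
```

Expressing this in Pig Latin rather than hand-written map and reduce classes is what made the job short to write, while Hadoop still parallelized it across the multi-node cluster.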