
Installing Sun Java in Ubuntu 14.04


Ubuntu comes with OpenJDK (both the JRE and JDK), but application development often demands the Sun (Oracle) JDK.

Steps to install Sun Java in Ubuntu 14.04

1. Initial commands to execute

sudo apt-get install python-software-properties
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
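
If the add-apt-repository command is not found after the first step, installing software-properties-common (which provides it on Ubuntu 14.04) should fix it; this is an alternative to the first command above, not a required extra step:

sudo apt-get install software-properties-common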

2. For Oracle JDK 6

sudo apt-get install oracle-java6-installer

3. For Oracle JDK 7

sudo apt-get install oracle-java7-installer

4. For Oracle JDK 8

sudo apt-get install oracle-java8-installer
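
Optionally, the license prompt shown by these installers can be accepted ahead of time, which is handy for unattended installs. A sketch for the JDK 8 installer, assuming the debconf selection key used by the webupd8team packages (swap the package name for the JDK 6 or 7 installer if needed):

echo oracle-java8-installer shared/accepted-oracle-license-v1-1 select true | sudo debconf-set-selections
sudo apt-get install -y oracle-java8-installer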

5. Setting JAVA_HOME
Copy the path where the preferred JDK is installed and edit the /etc/environment file.

sudo nano /etc/environment
JAVA_HOME="YOUR_PATH"

6. Reload the file

source /etc/environment
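
As a quick sanity check, confirm the variable and the JVM in a new shell. The path shown is only an example; the Oracle installers usually place the JDK under /usr/lib/jvm, so use whatever path you set above:

echo $JAVA_HOME    # e.g. /usr/lib/jvm/java-8-oracle
java -version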

Meaningful Stories with Data


“If history were taught in the form of stories, it would never be forgotten.” The same applies to data: a simple analogy for why stories should be told with data.

In her “Persuasion and the Power of Story” video, Stanford University Professor of Marketing Jennifer L. Aaker explains that stories are meaningful when they are memorable, impactful and personal. Through the use of interesting visuals and examples, she details the way people respond to messaging when it’s delivered either with statistics or through story. Although she says engagement is quite different from messaging, she does not suggest one over the other. Instead, Aaker surmises that the future of storytelling incorporates both, stating, “When data and stories are used together, they resonate with audiences on both an intellectual and emotional level.”

 


Install hadoop on OpenSuse 12.1


Pseudo-distributed mode is effectively a one-node Hadoop cluster. It is the best way to get started with Hadoop, as it makes it easy to modify the configuration to fully distributed once you have a handle on the basics.

Step 1: Update OpenSuse packages from the software manager.

Step 2: Install the Sun JDK (refer to the post on installing the Sun JDK in OpenSuse 12.1).

Create a user “hadoop” on your SUSE machine and log in as that user to carry out the activities below.

Step 3: Set up passwordless SSH. Activate sshd and enable it at boot from a root shell.

>sudo bash
#rcsshd  start
#chkconfig  sshd  on
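
On OpenSuse releases that manage services with systemd, an equivalent sketch is shown below; the classic rc commands above also work on 12.1.

#systemctl start sshd.service
#systemctl enable sshd.service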

Now create an SSH key so you can connect over SSH without a password.
>ssh-keygen -t dsa -N '' -q -f ~/.ssh/id_dsa
>ssh-add ~/.ssh/id_dsa
Identity added: /root/.ssh/id_dsa (/root/.ssh/id_dsa)

Test connecting over SSH without a password, using the key.
>ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is 05:22:61:78:05:04:7e:d1:81:67:f2:d5:8a:42:bb:9f.
Are you sure you want to continue connecting (yes/no)? Type yes
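
If ssh-add reports that no authentication agent is running, a common alternative (a sketch, assuming the key generated above) is to append the public key to authorized_keys directly:

>cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
>chmod 600 ~/.ssh/authorized_keys
>ssh localhost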

Step 4: Hadoop installation
Download the hadoop-0.21.0.tar.gz file from http://www.apache.org/dyn/closer.cgi/hadoop/core/

Create a directory /home/hadoop/hadoop-install
/home/hadoop> mkdir hadoop-install

Extract the hadoop-0.21.0 tar file into this new directory.
/home/hadoop>tar -zxvf /home/hadoop/Downloads/hadoop-0.21.0.tar.gz -C /home/hadoop/hadoop-install

Edit the following files in the /home/hadoop/hadoop-install/hadoop-0.21.0/conf directory.

conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/hadoop-install/hadoop-datastore/</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
</property>
</configuration>

conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>100</value>
</property>
</configuration>

conf/hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

conf/masters
localhost

conf/slaves
localhost

conf/hadoop-env.sh
Uncomment the line that sets JAVA_HOME and point it at the Sun JDK, as shown below.
export JAVA_HOME=/usr/java/default

Setting the environment variables for the JDK and Hadoop
Open the ~/.bashrc file and append the two lines below at the end of the file.

>vi ~/.bashrc

export JAVA_HOME=/usr/java/default
export HADOOP_COMMON_HOME=/home/hadoop/hadoop-install/hadoop-0.21.0

To make the .bashrc changes take effect immediately, the following command must be run.
$source ~/.bashrc

Starting Hadoop processes

Format the namenode using the following command
bin/hdfs namenode -format

Start the DFS daemons:
hadoop@localhost:~/hadoop-install/hadoop-0.21.0>bin/start-dfs.sh

Start the MapReduce daemons:
hadoop@localhost:~/hadoop-install/hadoop-0.21.0>bin/start-mapred.sh

Check for running processes.
hadoop@localhost:~/hadoop-install/hadoop-0.21.0>jps
SecondaryNameNode
NameNode
DataNode
TaskTracker
JobTracker
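
To confirm the cluster works end to end, you can run one of the bundled example jobs. This is a rough sketch; the examples jar name and output file name vary between Hadoop releases, so adjust them to what ships with your download:

hadoop@localhost:~/hadoop-install/hadoop-0.21.0>bin/hadoop fs -mkdir input
hadoop@localhost:~/hadoop-install/hadoop-0.21.0>bin/hadoop fs -put conf/*.xml input
hadoop@localhost:~/hadoop-install/hadoop-0.21.0>bin/hadoop jar hadoop-mapred-examples-*.jar wordcount input output
hadoop@localhost:~/hadoop-install/hadoop-0.21.0>bin/hadoop fs -cat output/part-r-00000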


Installing Sun JDK in OpenSuse 12.1


Most applications require the Sun JDK as a prerequisite. OpenSuse 12.1 and later do not include the Sun Java package in the default repositories due to licensing issues.

Follow the below steps to install Sun JDK in OpenSuse 12.1.

Check the current version from a terminal window.

>java -version

By default, OpenJDK will be installed. Find the installed OpenJDK version so that you can uninstall it.

# rpm -qa | grep jdk

Remove it from the system, substituting the OpenJDK version reported by the command above.

# rpm -e java-1_6_0-openjdk-1.6.0.0_b24.1.11.5-16.1.x86_64

Verify that the default Java package is uninstalled.

which java

Download the latest JDK RPM package from the Oracle site (jdk-7u25-linux-x64.rpm):

http://www.oracle.com/technetwork/java/javase/downloads/index.html

Change to the Downloads directory and install the JDK.

localhost:/home/hadoop/Downloads # rpm -ivh  jdk-7u25-linux-x64.rpm

All the essential Java commands should now work, but there is one final thing to do: setting the JAVA_HOME directory and adding it to PATH.

SUSE stores its profile scripts in the /etc/profile.d directory, so switch to root to write under /etc/profile.d.

localhost:/etc/profile.d # su

Create a jdk.sh file under /etc/profile.d by writing the output of the echo command below to it.

# echo 'export JAVA_HOME=/usr/java/jdk1.7.0_25'>/etc/profile.d/jdk.sh

Append the PATH setting with a second echo command.

# echo 'export PATH=$JAVA_HOME/bin:$PATH'>>/etc/profile.d/jdk.sh

Source jdk.sh:

# source /etc/profile.d/jdk.sh

Finally, log out and log back in to see the effect with your own user.
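
After logging back in, a quick check (the paths assume the jdk-7u25 RPM installed earlier; yours may differ):

# echo $JAVA_HOME      # expected: /usr/java/jdk1.7.0_25
# which java           # expected: /usr/java/jdk1.7.0_25/bin/java
# java -version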

 


‘Big data’ and ‘Tweet’ enter the Oxford Dictionary..!!


The Oxford English Dictionary becomes part of the social media technology revolution.


The Oxford English Dictionary has a rule that “a new word needs to be current for ten years before consideration for inclusion”.

Chief Editor John Simpson announced in a blog post that the OED is breaking this rule to keep up with the tech-savvy age, adding the words ‘big data’ and ‘tweet’ to the dictionary.

From the quarterly update of the Oxford English Dictionary:

The word “tweet,” appearing both as a noun and a verb, was added to the dictionary.


The word ‘big data’ was also added to the dictionary.


The OED got on board with other tech lingo as well: “crowdsourcing,” “e-reader,” “mouseover,” “stream,” “redirect,” “flash mob,” “3D printer” and “live-blogging” also made their way into the century-old dictionary.


What is the point with Hadoop…???


Whenever I have a chitchat or a formal talk with a BI or analytics person, the most widely asked question is

‘What is the point of Hadoop?’


It is a more fundamental question than ‘what analytic workloads is Hadoop used for’ and really gets to the heart of uncovering why businesses are deploying or considering deploying Apache Hadoop. There are three core roles:

  • Big data storage: Hadoop as a system for storing large, unstructured data sets
  • Big data integration: Hadoop as a data ingestion/ETL layer
  • Big data analytics: Hadoop as a platform for new exploratory analytic applications

While much of the attention around Apache Hadoop use cases focuses on the innovative analytic applications it has enabled and on high-profile adoption at Web properties, initial adoption at traditional enterprises and by later adopters is more likely to be triggered by the first two roles. Indeed, there are good examples of these three roles representing an adoption continuum.

We also see the multiple roles playing out at a vendor level, with regards to strategies for Hadoop-related products. Oracle’s Big Data Appliance, for example, is focused very specifically on Apache Hadoop as a pre-processing layer for data to be analyzed in Oracle Database.

While Oracle focuses on Hadoop’s ETL role, it is no surprise that the other major incumbent vendors showing interest in Hadoop can be grouped into three main areas:

  • Storage vendors
  • Existing database/integration vendors
  • Business intelligence/analytic vendors

This is just a small example I picked to show how the major data players are slowly adopting this new technology and harnessing its capabilities to retain their position on the list of major players.


CONFIGURING PERL ON WAMP


My 7th semester lab exams were scheduled for 22 November 2011, and to work out programs at the hostel I had been struggling to configure Perl to run my “Web Programming Lab” programs. It took a lot of time, but I finally got it working, so I thought I would share the configuration steps, which might be useful for those who are stuck just like I was a few days ago.
Perl is a scripting language developed by Larry Wall in 1987, and it has enjoyed a huge user response for its simplicity in text processing since the day of its release.
Here I chose WAMP, a package of independently created programs that includes Apache (web server), MySQL (open-source database) and PHP as principal components.
OK, let's start with the step-by-step configuration instructions.
STEP 1: Download and install WAMP version 2.
STEP 2: Similarly, download and install ActivePerl 5.10.0 build 1005 from the ActiveState website.
STEP 3: Right-click the WampServer icon at the corner of the Windows taskbar and select Put Offline, or else stop all the services. Once all WAMP services are stopped, right-click the WampServer icon again, select Apache, then open the httpd.conf file.
STEP 4: Now we need to make some changes in this httpd.conf file; let's do them one by one.
a) Scroll down, look for the line “Options Indexes FollowSymLinks” and replace it with “Options Indexes FollowSymLinks Includes ExecCGI”.


b) Scroll down, look for the line “#AddHandler cgi-script .cgi” and replace it with “AddHandler cgi-script .cgi
AddHandler cgi-script .pl”.


c) Now look for the line “DirectoryIndex index.php index.php3 index.html index.htm” and add index.cgi and index.pl to this line.


STEP 5: The server is now configured and ready to run Perl and CGI scripts. Next we need to add an additional repository and install from it. For that:
1. Open a command prompt, then type
“ppm repo add uwinnipeg”


2. After the “uwinnipeg” repository is added successfully, install DBD-mysql by typing this command:
“ppm install DBD-mysql”
Hmmm, now we're done with the configuration. Try writing some simple Perl scripts and save them in C:\wamp\bin\apache\Apache2.2.11\cgi-bin\
To run the scripts, open the browser and type the URL http://localhost/cgi-bin/ followed by your program name.

NOTE: Please make sure that no other process is already running on port 80.


25th CSI Student Convention


CSI, the Computer Society of India, conducted its 25th student convention at R.V. College of Engineering on the 13th and 14th of October 2011. I got an opportunity to be part of the convention and present our paper entitled “Map/Reduce Algorithm Performance Analysis in Computing Frequency of Tweets” along with my co-author Nagashree.

The convention was a great time for all the students who came from across the state to learn about the latest trends in the field of Information Technology. It was also a wonderful platform for innovative young minds to share their ideas and innovations. Students from different parts of Karnataka took part in the convention and presented their papers.

Hadoop and Map/Reduce being my area of interest, we decided to present a paper on “Map/Reduce Algorithm Performance Analysis” so that more students would get to know about this emerging technology. We were given just 10 minutes to present our paper: only 10 minutes to impress the judges and communicate our ideas to the fellow students present at the convention. It was a wonderful experience to present a paper in front of the eminent professionals who judged the event, though we were quite nervous, as it was our first ever paper presentation. The day became even more memorable when we learned that we had won 3rd place for our presentation.

Here is the abstract of our presentation:

  Abstract of Paper presentation

Title:Map/Reduce Algorithm Performance Analysis in Computing Frequency of Tweets

Background

This paper proposes a method to extract tweets from Twitter and analyses the efficiency of the Map/Reduce algorithm on the Hadoop framework, with the aim of achieving maximum performance.

New research in cloud computing has shown that implementing MapReduce not only influences performance, it also leads to more reliable storage management.

For about a decade it was considered that distributed computing is more complex to handle than expanding the memory of a single-node cluster, since inter-process communication (IPC) had to be used to communicate between the nodes, which was tedious to implement: the communication code would often run longer than the computation itself. Apache Hadoop now offers a more scalable and reliable platform for distributed computing. Through this paper we have analysed how the Map/Reduce algorithm run on Hadoop influences performance significantly while handling a huge data set stored on different nodes of a multi-node cluster.

Aim of the study

Cloud computing is the future and it focuses more and more on distributed computing. In order to evaluate the features offered by Hadoop for cloud computing, a huge unstructured data set is required. The present study investigated these questions.

The main focus of the study was to analyse the performance of the Map/Reduce algorithm in computing the frequency of tweets.

Method

A Python script of about 6 to 10 lines was used to extract people's tweets, taking input from the Twitter search API. Tweets were extracted continuously for about one week, resulting in a data set of roughly 50 MB.

The study was carried out in two parts. The first part was extracting tweets as mentioned above, and the second was implementing a customized Map/Reduce algorithm to compute the frequency of tweets on a particular keyword (say “Anna Hazare”).

 Result

It was found that this approach offers a more reliable way to analyse huge data sets compared to classic methods.

Here are the slides of our presentation.

Finally, after the presentation I got to know that Hadoop is the platform used for the India UID (Aadhaar card) project, and I felt proud to have knowledge of it.


txtWeb: browse the internet through SMS


 

“There are roughly 700 million mobile subscribers in India. But, out of those 700 million, more than 600 million Indians  do NOT have access to a computer or mobile data.”

txtWeb is a global platform where anyone with a mobile phone can access the internet just by SMSing keywords (like web addresses in browsers) to ONE national number, and receive content back (up to 900 characters per SMS). A keyword represents an application that the user can make use of to get content from the internet. These applications are created by an open community of publishers and developers. Applications include Wikipedia content, local market prices, government programs, financial literacy tips and so on.

txtWeb is an SMS-based browser through which one can browse the internet at no charge (provided you have a free SMS plan on your mobile), yet it is much more accessible than web browsers on computers, since anyone with a simple feature phone can use it. Deploying existing content via a txtSite takes only 5 minutes. Creating and deploying an SMS-based app on txtWeb usually takes about 5 hours.

Using txtWeb:

Just type the keyword and send the SMS to txtWeb's Indian national number, 9243342000.

Ex: “@cat ignite”; this SMS would search for the meaning of the word ignite.

Working of txtWeb:

  1. The user sends a request to the txtWeb number, e.g. @dictionary alibi to 9243342000.
  2. The request is forwarded from the phone carrier to the platform as an SMS.
  3. The platform accepts the keyword and maps it to the external URL for the application (or to the text provided, if it is a txtSite). The AppUrl/text should be provided by the developer when building the app. If it is a txtSite, the content is retrieved from the platform's database. If it is a txtApp, steps 4 and 5 below are followed.
  4. An HTTP call is made to the URL of the application.
  5. The content of the app is sent back to the platform over HTTP.
  6. The platform accepts the content, which is converted into an SMS.
  7. The SMS is transferred to the phone carrier.
  8. The SMS reaches the end user.

txtSites are static text pages used to publish information, analogous to a static web page on the internet. A publisher can provide content, and it can be published as a txtSite for consumption over SMS.

Steps to build your first txtSite-

  1. Click on “Create a txtSite” on your home page.
  2. Enter a keyword, which will be the handle for your application (say the keyword is Hello).
  3. Give your txtSite an appropriate description. This description helps users discover your application: the platform's search takes the description into account when matching apps against the search term entered by the user (you could enter “This is my first text application”).
  4. Enter the relevant text to be sent to the end user when they access your application, e.g. “Hello World!! I am live”.
  5. Click “Publish” to get your app up and running on the platform.

txtApps are dynamic pages used to provide information to an end user on the basis of the request made via SMS. They are analogous to dynamically populated web pages on the internet. Unlike a txtSite, one needs to develop a web application to render dynamic information to the end user through a txtApp.

There are 3 parameters that the platform sends to an application, viz.:

txtweb-mobile: the mobile number of the end user, in hashed form

txtweb-message: the message sent by the end user

txtweb-location: the location as set by the end user

One needs to access this information via API calls; the relevant information is passed as XML.

Example Code to build a Hello World txtapp

private String TestMessage() {
    // Build a minimal HTML response for the platform to convert into an SMS.
    // Note: the platform may also expect a txtWeb-specific <meta> app-key tag
    // in the <head>; check the txtWeb developer docs for the exact markup.
    String resp = "";
    resp = "<html><body>Hello World<br/><br/></body></html>";
    return resp;
}

This is an HTML response that would display Hello World in the browser once the servlet is invoked. The String resp is passed to a method sendResponse, which is given below:

private void sendResponse(HttpServletResponse response, String resp) {
    try {
        // resp contains htmlized version of Hello World
        PrintWriter out = response.getWriter();
        out.println(resp);
    } catch (IOException e) {
    }
}

So, wherever you are, if you want to Google something, just send an SMS “goog <search parameter>” to the number 9243342000. Have fun using txtWeb.


HADOOP & CLOUD COMPUTING


Cloud refers to large internet services like Google and Yahoo! that run on tens of thousands of machines. More recently, though, cloud computing also refers to services from these companies that let external customers rent computing cycles on their clusters.

Hadoop is an open-source cloud computing environment that implements the Google MapReduce framework in Java and can be used to handle huge data sets, ranging up to petabytes (PB). MapReduce makes it very easy to process and generate large data sets on the cloud. Using MapReduce, you can divide the work to be performed into smaller chunks, where multiple chunks can be processed concurrently, and then combine the results to obtain the final result. MapReduce enables one to exploit the massive parallelism provided by the cloud and provides a simple interface to a very complex and distributed computing infrastructure. By modeling a problem as a MapReduce problem, we can take advantage of the cloud computing environment provided by Hadoop.
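
As an illustration of the divide-and-combine idea, here is a rough sketch of a word-frequency job using Hadoop Streaming with plain Unix tools as the mapper and reducer; the streaming jar path and the HDFS input/output paths are assumptions that depend on your installation:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
  -input /user/hadoop/tweets \
  -output /user/hadoop/tweet-counts \
  -mapper "tr -s ' ' '\n'" \
  -reducer "uniq -c"

The mapper splits each input line into one word per line; the framework sorts and groups identical words, so the reducer can simply count consecutive duplicates.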

MapReduce is used at Yahoo! for the Web map, spam detection, network analysis and click-through optimization; at Facebook for data mining, ad optimization and spam detection; and at Google for index construction, article clustering for news and statistical machine translation.


Fig: Hadoop core

In HDFS, data is organized into files and directories. Files are divided into uniform-sized blocks (default 128 MB), and blocks are replicated (default 3 replicas) and distributed to handle hardware failure, with replication also serving performance and fault tolerance (rack-aware placement). HDFS exposes block placement so that computation can be migrated to the data, and it uses checksums to detect corruption.

I used MapReduce for my project, “Find the best restaurant in the USA – from review comments”. Tasks were quite simple.

First, a Python crawler was used to extract the data from http://www.restaurantica.com into a text document; it was a huge data set.

Second, a PoS (part-of-speech) tool was used to extract the keywords; this was a crucial step, since we were supposed to find the best restaurant.

Finally, the MapReduce job was run on the data. We used Pig on top of Hadoop to make things easier for us, and a multi-node cluster was used to make the computation faster.
