Last time I posted a blog about OpenStack and DevStack. While everybody else was celebrating New Year, I was working on setting up a Hadoop cluster and processing unstructured data.
Purpose
The purpose of this document is to help you get a single-node Hadoop installation up and running very quickly so that you can get a flavor of the Hadoop Distributed File System (see HDFS Architecture) and the Map/Reduce framework; that is, perform simple operations on HDFS and run example jobs.
Our prerequisites for the setup:-
1) An Ubuntu Server (64-bit) or Debian (64-bit) machine {I am using Ubuntu in this tutorial}
2) ssh (OpenSSH) and rsync must be installed (run as root):
$ sudo apt-get install ssh
$ sudo apt-get install rsync
3) Java(TM) 1.6.x, preferably from Sun, must be installed.
Installing Java
$ sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
$ sudo apt-get update
root@ruhil:~# sudo apt-get install sun-java6-jdk sun-java6-plugin
Reading package lists... Done
Building dependency tree
Reading state information... Done
sun-java6-plugin is already the newest version.
sun-java6-jdk is already the newest version.
0 upgraded, 0 newly installed, 0 to remove and 19 not upgraded.
root@ruhil:~# java -version
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
root@ruhil:~#
Add a Dedicated User
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
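To double-check that the dedicated account was created as expected, you can inspect it (an optional sanity check; the uid/gid numbers on your machine may differ):
$ id hduser
The output should show hduser as a member of the hadoop group.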
Configure SSH and check that it works for localhost
(Just press Enter when asked for the file in which to save the key; the -P "" option already sets an empty passphrase, so you do not need to type one.)
root@ruhil:~# su - hduser
hduser@ruhil:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ruhil
The key's randomart image is:
[...snipp...]
hduser@ruhil:~$
After generating the key, enable SSH access to your local machine with it and test the connection:
hduser@ruhil:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
hduser@ruhil:~$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is b8:be:26:41:44:7d:9b:82:02:fd:13:61:3c:ac:d4:0a.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux ruhil 3.0.0-14-server #23-Ubuntu SMP Mon Nov 21 20:49:05 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
[...snipp...]
hduser@ruhil:~$
Disable IPv6 (Ubuntu 11.10)
Hadoop on IPv6-enabled Ubuntu boxes can end up binding to IPv6 addresses, so for this single-node setup we disable IPv6. Open /etc/sysctl.conf in the editor of your choice and add the following lines to the end of the file:
#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
You have to reboot your machine in order to make the changes take effect.
You can check whether IPv6 is enabled on your machine with the following command:
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
A return value of 0 means IPv6 is enabled, a value of 1 means disabled (that’s what we want).
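If you would rather not reboot immediately, the same settings can usually be applied in place (this is just a convenience sketch; the reboot above remains the safe option):
$ sudo sysctl -p
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
After reloading, the cat command should print 1.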
Installation of Hadoop (perform these steps as root, e.g. root@ruhil)
{Note: the 0.20 release is the stable one; newer releases, notably 0.23, are not stable yet, so I would go with 0.20 :)}
root@ruhil:~# mkdir -p /usr/local
root@ruhil:~# cd /usr/local
root@ruhil:~# wget http://apache.mirrorcatalogs.com/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz -O hadoop-0.20.2.tar.gz
root@ruhil:~# sudo tar xzf hadoop-0.20.2.tar.gz
root@ruhil:~# mv hadoop-0.20.2 hadoop
root@ruhil:~$ sudo chown -R hduser:hadoop hadoop
Create or update .bashrc for Hadoop by pasting the lines below (Note: you need to paste this for both root and hduser; if you like, you can add it for all users. My .bashrc is at http://paste.ubuntu.com/791667/):-
# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-6-sun
# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
hadoop fs -cat $1 | lzop -dc | head -1000 | less
}
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
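After saving .bashrc, reload it in the current shell and make sure the hadoop command is found on the PATH (a quick sanity check; the exact version string depends on your download):
hduser@ruhil:~$ source ~/.bashrc
hduser@ruhil:~$ hadoop version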
Configuration (Note: all the configuration changes below should be made as hduser):-
[Figure: HDFS Architecture, an overview of the most important HDFS components (source: http://hadoop.apache.org/core/docs/current/hdfs_design.html)]
Our goal in this tutorial is a single-node setup of Hadoop. More information of what we do in this section is available on the Hadoop Wiki.
hadoop-env.sh
The only required environment variable we have to configure for Hadoop in this tutorial is JAVA_HOME. Open conf/hadoop-env.sh in the editor of your choice (if you used the installation path in this tutorial, the full path is /usr/local/hadoop/conf/hadoop-env.sh) and set the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory.
Change
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
to
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-6-sun
Now we create the directory that Hadoop will use for its local temporary storage (the hadoop.tmp.dir configured below) and set the required ownership and permissions:
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
$ sudo chmod 755 /app/hadoop/tmp
{Adjust the chmod permissions to suit your own setup.}
Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.
Note: each snippet below goes into its own configuration file, so repeat this step for every file listed.
In file conf/core-site.xml (all of these config files live under /usr/local/hadoop/conf, so cd /usr/local/hadoop first):
<!-- In: conf/core-site.xml -->
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
In file conf/mapred-site.xml:
<!-- In: conf/mapred-site.xml -->
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
In file conf/hdfs-site.xml:
<!-- In: conf/hdfs-site.xml -->
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
Formatting the HDFS filesystem via the NameNode
The first step in starting up your Hadoop installation is formatting the Hadoop filesystem. Do not format a running cluster, as this erases all data in HDFS. Run the following command as hduser:
hduser@ruhil:~$ /usr/local/hadoop/bin/hadoop namenode -format
The output will look like this:
hduser@ruhil:/usr/local/hadoop$ bin/hadoop namenode -format
1/01/12 1:30:41 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ruhil/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.20.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'ruhil' on Sun Jan 1 01:30:41 UTC 2012
************************************************************/
1/01/12 1:30:41 INFO namenode.FSNamesystem: fsOwner=hduser,hadoop
1/01/12 1:30:41 INFO namenode.FSNamesystem: supergroup=supergroup
1/01/12 1:30:41 INFO namenode.FSNamesystem: isPermissionEnabled=true
1/01/12 1:30:41 INFO common.Storage: Image file of size 96 saved in 0 seconds.
1/01/12 1:30:41 INFO common.Storage: Storage directory .../hadoop-hduser/dfs/name has been successfully formatted.
1/01/12 1:30:41 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ruhil/127.0.1.1
************************************************************/
hduser@ruhil:/usr/local/hadoop$
Starting your single-node cluster
hduser@ruhil:~$ /usr/local/hadoop/bin/start-all.sh
hduser@ruhil:/usr/local/hadoop$ jps
If everything started correctly, jps should list all the Hadoop daemons, as in the listing below.
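A typical process list for a healthy single-node cluster looks something like this (the process IDs are only examples and will differ on your machine):
2287 TaskTracker
2149 JobTracker
1938 DataNode
2085 SecondaryNameNode
2349 Jps
1788 NameNode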
Stop Hadoop using the command below:-
hduser@ruhil:~$ /usr/local/hadoop/bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
hduser@ruhil:~$
Now run a MapReduce job:-
As a quick test, you can run one of the example jobs that ship with Hadoop; a sketch using the bundled WordCount example is shown below.
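A minimal sketch, assuming you have copied a few plain-text files into a local directory /tmp/gutenberg (any text files will do) and that you are in /usr/local/hadoop, where the 0.20.2 tarball places hadoop-0.20.2-examples.jar:
hduser@ruhil:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg
hduser@ruhil:/usr/local/hadoop$ bin/hadoop jar hadoop-0.20.2-examples.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
hduser@ruhil:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg-output
The last command lists the job's output directory in HDFS; cat the part file(s) inside it to see the word counts.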
Now, finally, have a look in your browser. Yeah :-)
Hadoop Web Interfaces
Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:
- http://localhost:50030/ – web UI for MapReduce job tracker(s)
- http://localhost:50060/ – web UI for task tracker(s)
- http://localhost:50070/ – web UI for HDFS name node(s)
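If you are working on a headless server, a quick way to confirm that the NameNode web UI is up (assuming curl is installed) is:
hduser@ruhil:~$ curl -sI http://localhost:50070/
An HTTP 200 response means the interface is reachable; otherwise open the URLs above in any browser.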
I hope you enjoyed it :) Now process your data with ease and super speed :)
Looking for any kind of help? Find me on IRC in #cloudgeek.
Feel free to mail me vikasruhil06@gmail.com
FOLLOW US ON TWITTER
THANKS FOR VISITING






