
Hadoop Development Environment Setup
This is a step-by-step guide to building Hadoop, setting up an Eclipse-based development environment, and deploying Hadoop on a multi-node cluster. We used Hadoop version 2.4.0 while writing this guide.

Although Hadoop is Java-based, it depends on Google protobuf, which is written in C++ and hence has a dependency on native code. So we decided to do our development on the same OS as on hpcl (the Marquette MSCS cluster), i.e. CentOS 6.x, and we took the approach of doing development in a VM. If you are not using a VM, skip the CentOS-on-VirtualBox installation steps below.

Installing CentOS on Oracle VirtualBox

 * Install VirtualBox on the host OS.
 * Download the CentOS 6.x ISO.
 * Create a VM with 3 GB RAM and 16 GB hard disk space.
 * Start the VM and provide the path to the downloaded OS ISO.
 * Install the VirtualBox Guest Additions once the OS is up.
 * Enable Devices -> Shared Clipboard -> Bidirectional on the VirtualBox menu.
 * Check the Devices -> Network -> Network Settings -> Cable Connected option on the VirtualBox menu.
 * In the guest, go to System -> Preferences -> Network Connections -> eth0 -> Edit, then select the "Connect automatically" option.
 * Click the File menu on the VirtualBox main window and navigate to Preferences -> Display. Select "Hint" as the maximum guest screen size and set the width and height per your display resolution.

Downloading Hadoop sources
One can download Hadoop sources from the following URL:
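At the time of writing, the 2.4.0 source tarball was available from the Apache archive (any Apache mirror works equally well):

wget https://archive.apache.org/dist/hadoop/common/hadoop-2.4.0/hadoop-2.4.0-src.tar.gz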

Once the tar ball is downloaded, you can inflate it using the following command:
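tar -xzf hadoop-2.4.0-src.tar.gz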

Additional Build dependencies
Hadoop depends on the Google protobuf libraries, and in order to build them we will need some C/C++ build tools. Use the following command to install the required build tools.
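On CentOS 6.x, something along these lines covers the usual suspects (the exact package list is our assumption; add packages if a configure step complains):

sudo yum install gcc gcc-c++ make cmake zlib-devel openssl-devel autoconf automake libtool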

For CentOS, install Maven following the link below:
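If the link goes stale, a generic fallback is to unpack a Maven binary tarball and put it on your PATH (the version below is only an example):

wget https://archive.apache.org/dist/maven/maven-3/3.2.1/binaries/apache-maven-3.2.1-bin.tar.gz
tar -xzf apache-maven-3.2.1-bin.tar.gz
export PATH=$PWD/apache-maven-3.2.1/bin:$PATH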

Export the JAVA_HOME variable, as it is used by Eclipse during the build process.
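For example (the JDK path below is an assumption; point it at whatever JDK you installed):

export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk.x86_64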

Building Google ProtoBuf
Download the protobuf sources from the link below.
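Hadoop 2.4.0 builds against protobuf 2.5.0; the sources were originally hosted on Google Code and are now mirrored on GitHub:

wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz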

In order to build protobuf, issue the following commands.
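It is the usual autotools sequence:

tar -xzf protobuf-2.5.0.tar.gz
cd protobuf-2.5.0
./configure
make
sudo make install
sudo ldconfig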

In order to build the Java bindings, give the following commands right after the earlier ones.
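Run these from the protobuf source root; they install protobuf-java into your local Maven repository:

cd java
mvn install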

At this point we have all the Hadoop dependencies and are ready to build Hadoop.

Building and Packaging Hadoop
Give the following command to build Hadoop.
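This is the standard invocation from Hadoop's BUILDING.txt; the native profile needs the protobuf we just built:

cd hadoop-2.4.0-src
mvn package -Pdist,native -DskipTests -Dtar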

The Hadoop tarball should be waiting for you in the hadoop-dist/target directory.

Installing Eclipse
To download Eclipse, use the following command.
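The URL below is a placeholder; pick a mirror and a release appropriate for your platform (Kepler was current for Hadoop 2.4.0-era work):

wget <eclipse-mirror-url>/eclipse-java-kepler-SR2-linux-gtk-x86_64.tar.gz
tar -xzf eclipse-java-kepler-SR2-linux-gtk-x86_64.tar.gz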

This will create a directory called eclipse.

Add the eclipse directory at the beginning of your PATH. The reason is that CentOS comes with its own Eclipse version and we do not want to launch that by default.
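For example (adjust the path to wherever you extracted Eclipse):

export PATH=/path/to/eclipse:$PATH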

Once it is in your PATH, you can launch Eclipse simply by typing "eclipse".

Setting up Hadoop in Eclipse
At this point, give the following commands under the hadoop source directory.
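These commands follow the Hadoop wiki's Eclipse setup; the install step ensures that cross-module SNAPSHOT artifacts resolve:

mvn install -DskipTests
mvn eclipse:eclipse -DskipTests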

Note: This may take a while the first time, as the whole build is performed.

After the above, do the following to get the projects ready in Eclipse.

First, set the M2_REPO classpath variable from Window -> Preferences -> Java -> Build Path -> Classpath Variables. M2_REPO should point to the Maven repository, which is normally in ~/.m2/repository.

For Common

 * File -> Import...
 * Choose "Existing Projects into Workspace"
 * Select the hadoop-common-project directory as the root directory
 * Select the hadoop-annotations, hadoop-auth, hadoop-auth-examples, hadoop-nfs and hadoop-common projects
 * Click "Finish"
 * File -> Import...
 * Choose "Existing Projects into Workspace"
 * Select the hadoop-assemblies directory as the root directory
 * Select the hadoop-assemblies project
 * Click "Finish"
 * To get the projects to build cleanly, navigate to hadoop-common/target/generated-test-sources/java, right-click it, and select Build Path -> Use as Source Folder

For HDFS

 * File -> Import...
 * Choose "Existing Projects into Workspace"
 * Select the hadoop-hdfs-project directory as the root directory
 * Select the hadoop-hdfs project
 * Click "Finish"
 * This should build cleanly after import

For MapReduce

 * File -> Import...
 * Choose "Existing Projects into Workspace"
 * Select the hadoop-mapreduce-project directory as the root directory
 * Select the hadoop-mapreduce-project project
 * Click "Finish"
 * This will not build cleanly until you import the YARN project in the next step

For YARN

 * File -> Import...
 * Choose "Existing Projects into Workspace"
 * Select the hadoop-yarn-project directory as the root directory
 * Select the hadoop-yarn-project project
 * Click "Finish"
 * All the projects should build cleanly at this point

To run tests from Eclipse you might additionally have to do the following:

 * Under the project Properties, select Java Build Path, and the Libraries tab
 * Click "Add External Class Folder" and select the build directory of the current project

Hadoop Deployment
We followed the guidelines on  to deploy the Hadoop package we created above on the hpcl cluster. Though the guidelines on that page are for a single-node cluster, they can easily be extended to a multi-node cluster.

Additionally, add the following property to conf/yarn-site.xml:

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>

Formatting data in the Hadoop cluster
The steps are as follows (a command-level sketch appears after the list):

 * Shut down the YARN service.
 * Shut down the HDFS service.
 * Go to each node and remove everything under the Hadoop storage directory.
 * Go to each node and run the "hdfs namenode -format" command.
 * Start the HDFS service.
 * Start the YARN service.
 * Copy the new data to the Hadoop storage directory.
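With the scripts bundled in the Hadoop distribution this amounts to something like the following (the storage path is an assumption; use whatever your hdfs-site.xml configures):

sbin/stop-yarn.sh
sbin/stop-dfs.sh
# on every node, as in the steps above:
rm -rf /path/to/hadoop-storage/*
hdfs namenode -format
# back on the master:
sbin/start-dfs.sh
sbin/start-yarn.sh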

Running Teragen and Terasort programs
We referred to  when running the Terasort and Teragen examples.
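For reference, a typical invocation looks like the following (the row count and HDFS paths are placeholders; the examples jar ships with the Hadoop distribution we built):

bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.0.jar teragen 10000000 /HDFS/terasort-input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.0.jar terasort /HDFS/terasort-input /HDFS/terasort-output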

We set the dfs block size as

where n is the data size in gigabytes.

We set the dfs block size via the dfs.block.size property in hdfs-site.xml. We set the io.file.buffer.size property to 131072 bytes in core-site.xml.

In order to run the Terasort example, we had to set up YARN. See the details on YARN setup below.

YARN setup
We had to set the mapreduce.framework.name property to "yarn" in mapred-site.xml.
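A minimal mapred-site.xml along these lines does it (written here via a shell heredoc; adjust the conf path to your install):

cat > etc/hadoop/mapred-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF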

Set hadoop.tmp.dir in core-site.xml to a node-local directory under /tmp; otherwise one can run into locking issues, as all the nodes try to access the same shared location at the same time.

Power measurement setup
The nodes n01-n08 are AMD-processor nodes. Four of them (n03, n04, n05, and n08) are connected to power measurement meters. The following are the steps to measure the power of these four nodes:

 * Set the power log dir: /apps/power-bench/mclient -H 10.1.1.27 -d /dir_youwant
 * Set the power log file and start logging: /apps/power-bench/mclient -H 10.1.1.27 -l logfile_youwant
 * Run your programs here: ./yourprogram
 * End power logging: /apps/power-bench/mclient -H 10.1.1.27 -e log

Once you end logging, you should be able to see the power data in /dir_youwant/logfile_youwant. The following is sample content:

There is one power sample per second for each node.

Sandy is a dual 8-core Intel processor machine. You should be able to ssh to sandy once you are on hpcl. Power measurement for node sandy:

 * Node-level measurement: the power consumption of node sandy can be measured with the same commands as for the AMD nodes; you only need to change the option to -H 10.1.1.28.
 * CPU and memory power measurement and management: see below.

In addition to node power, the CPU and memory power can be measured for sandy. The command is rapl, located at /usr/local/bin/rapl. Use "rapl -h" to see the available options.

Sample usage:

Sample log file:

Here P_PKG stands for package power, P_PP0 stands for the power of the 8 cores from core 0 to core 7, and P_DRAM stands for the power of the DRAMs bound to cores 0-7. The rest of the columns can be interpreted similarly for the second processor chip.

CPU Freq Setup
The CPUfreq interface allows you to change the speed of each core. The interface is at /sys/devices/system/cpu/cpu<N>/cpufreq. Note there is one directory per core. There are several files under it, and the file names are self-explanatory.

For example, you can inspect the files for core 0 as follows:
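ls /sys/devices/system/cpu/cpu0/cpufreq
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies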

Pay attention to the content of the file scaling_available_frequencies. It is the list of CPU frequencies in kHz that the processor supports.

Another file is scaling_setspeed; frequency changes go through this file. For example, you can use the following command to change the CPU speed to 2.5 GHz (2500000 kHz).
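Note that scaling_setspeed only takes effect when the "userspace" governor is active, and the value written must appear in scaling_available_frequencies:

echo userspace | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo 2500000 | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed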

Useful Commands for hpcl cluster
SWIM Jobs:

java GenerateReplayScript /home/sanjeev/SWIM-master/workloadSuite/Experiments.tsv 600 600 67108864 250 /home/sanjeev/SWIM-master/workloadSuite/scriptsTest2 /HDFS/workGenInput /HDFS/workGenOutputTest 67108864 /home/sanjeev/SWIM-master/workloadSuite/swimOutput2 /home/sanjeev/hadoop/bin/hadoop /home/sanjeev/SWIM-master/workloadSuite/WorkGen.jar /home/sanjeev/SWIM-master/workloadSuite/workGenKeyValue_conf.xsl

~/hadoop/bin/hadoop jar HDFSWrite.jar org.apache.hadoop.examples.HDFSWrite -conf ./randomwriter_conf.xsl /HDFS/workGenInput

Screen command to launch a job:

screen -d -m ./jobClient.py -c /home/sanjeev/HadoopJobKit/hadoopJobKit/cluster_job_config.json -s /home/sanjeev/SWIM-master/workloadSuite/scriptsTest2/

Processing the data:

for i in `ls ~/JobRunner_Output`; do ./dataProcessor.py -d /home/sanjeev/JobRunner_Output/$i -m -r; done

Set/Get cpu frequency:

sudo cpupower -c 0-19 frequency-set -f 2501000

cpupower -c 0-19 frequency-info -f

Awk command to merge the per-job CSV files while keeping only the first header row:

awk 'FNR==1 && NR!=1{next;}{print}' job*.csv > combined.csv

Cleanup commands:

hdfs dfsadmin -safemode leave

hdfs dfs -rmr /HDFS/workGenOutputTest*