User:Bloodysnowrocker/Hadoop


 * Starting Hadoop requires starting dfs, the namenode, the datanode, and the job tracker. By default in single-node mode, data is put under /tmp/hadoop-${user}/dfs (for the namenode) and /tmp/hadoop-${user}/dfs/mapred (for the datanode). The VERSION files of the two should be consistent.
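To check, compare the namespaceID in the two VERSION files. The sketch below assumes the standard dfs/name and dfs/data layout under hadoop.tmp.dir; adjust the paths if your setup differs:

cat /tmp/hadoop-${USER}/dfs/name/current/VERSION   # namenode VERSION
cat /tmp/hadoop-${USER}/dfs/data/current/VERSION   # datanode VERSION; namespaceID must match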


 * Hadoop needs its own file system, HDFS. Create one before starting the daemons by formatting the namenode, as sketched below.
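A minimal sketch of the sequence, assuming the Hadoop 0.20.x scripts used elsewhere on this page:

cd $HADOOP_HOME
bin/hadoop namenode -format   # create the HDFS file system
bin/start-all.sh              # then start the daemons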

Hadoop Single Node Setup Troubleshooting
 * Ubuntu Hadoop Working Directory: /home/bloodysnow/Documents/code/hadoop-0.20.203.0/bin
 * Use the comprehensive Hadoop guide to start/stop Hadoop in pseudo-distributed (single-node) mode.
 * A few notes on using this guide
 * 1) It is OK to change the fs default port (conf/core-site.xml) and the mapred job tracker port (conf/mapred-site.xml) to new port numbers.
 * 2) Creating customized namenode and dfs directories can cause problems such as an inconsistent namespaceID between the namenode and datanode, in which case the mapred client cannot start successfully.
 * 3) After starting, use jps (the Java Virtual Machine Process Status Tool) to check that the namenode, datanode, and jobtracker have all started; see the example after this list.
 * 4) If there are any errors, check the logs under $HADOOP_HOME/logs.
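On a healthy single-node setup, jps output looks something like this (the process ids are illustrative):

$ jps
12001 NameNode
12102 DataNode
12203 SecondaryNameNode
12304 JobTracker
12405 TaskTracker
12500 Jps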
 * Upgraded Ubuntu to 12.04, but now Hadoop cannot be started. Error message:

hadoop-0.20.203.0/bin/../bin/hadoop: line 345: /usr/lib/jvm/java-6-openjdk/bin/java: No such file or directory

This is because the upgrade wiped out the old java-6-openjdk installation. Reinstall Java 7 and update the places where JAVA_HOME is hard coded.
 * Checking the Hadoop dfs directories produces the following error:

./hadoop dfs -ls
12/06/10 10:20:45 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 0 time(s).
12/06/10 10:20:46 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 1 time(s).
12/06/10 10:20:47 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 2 time(s).

In fact Hadoop is listening on :9001. Change $HADOOP_HOME/conf/core-site.xml into the following:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/bloodysnow/Documents/data/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9001</value>
  </property>
</configuration>

The fs.default.name change fixes the port number that requests are sent to. In addition, it moves the hadoop dfs directory out of /tmp. Then reformat the namenode; if it is successful there are messages like:

12/06/10 11:02:46 INFO common.Storage: Storage directory /home/bloodysnow/Documents/data/hadoop/tmp/dfs/name has been successfully formatted.
 * The namenode cannot start: running jps shows no namenode, and the log shows an inconsistent file system state. Go to the dfs directory (by default /tmp/hadoop-${user}/dfs), check whether the VERSION file is there, delete everything in the directory, and reformat the namenode, as sketched below.
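Assuming hadoop.tmp.dir still points at the /tmp default (note this wipes all HDFS data):

bin/stop-all.sh                   # stop any half-started daemons
rm -rf /tmp/hadoop-${USER}/dfs    # delete the inconsistent state (and all HDFS data)
bin/hadoop namenode -format       # reformat the namenode
bin/start-all.sh
jps                               # NameNode should now be listed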

Maven Building Mahout Troubleshooting
 * Have installed a newer version of the Java OpenJDK; the contents of the old JDK directory have been wiped out.
 * Have updated JAVA_HOME to point to the newer directory. However, the Maven build produces the following error:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.3.2:compile (default-compile) on project mahout-math: Compilation failure
[ERROR] Unable to locate the Javac Compiler in:
[ERROR]   /usr/lib/jvm/java-7-openjdk-i386/jre/../lib/tools.jar
[ERROR] Please ensure you are using JDK 1.4 or above and
[ERROR] not a JRE (the com.sun.tools.javac.Main class is required).

This is because javac still points at the old JDK:

$ which javac
/usr/bin/javac
$ ls -l /usr/bin/javac
lrwxrwxrwx 1 root root 23 Jun 10 08:54 /usr/bin/javac -> /etc/alternatives/javac
$ ls -l /etc/alternatives/javac
lrwxrwxrwx 1 root root 42 Jun 10 08:54 /etc/alternatives/javac -> /usr/lib/jvm/java-6-openjdk-i386/bin/javac

The javac alternative needs to be updated to point at the right version of the JDK. Running

sudo apt-get install openjdk-7-jre

prints

update-alternatives: warning: forcing reinstallation of alternative /usr/lib/jvm/java-7-openjdk-i386/bin/javac because link group javac is broken.

which re-links the javac alternative.
 * Upon upgrading the Ubuntu system, the Java Runtime Environment is updated automatically. However, to get development tools like javac and jps, you still need to install JDK 7.
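A hedged sketch of the fix, assuming Ubuntu's alternatives system and the i386 JDK path from the error above:

sudo apt-get install openjdk-7-jdk        # the JRE alone does not ship javac
sudo update-alternatives --config javac   # select /usr/lib/jvm/java-7-openjdk-i386/bin/javac
javac -version                            # verify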

Mahout modules can be built and installed separately. In any directory with a pom.xml, the following compiles and builds the jar without running unit tests (see the module example below):

mvn compile
mvn clean install -DskipTests=true
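For example, to rebuild just the mahout-math module (the math directory name is an assumption based on the usual source layout):

cd $MAHOUT_HOME/math                  # assumed location of the mahout-math module
mvn clean install -DskipTests=true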
 * Mahout does not build on a fresh checkout.

Edit pom.xml to fix this.
 * Patch Mahout

 * Sync with the repo for the latest Mahout version:

cd $MAHOUT_HOME
svn revert -R .   # discard any local changes
svn update

Local vs Hadoop Job
Mahout can be configured to run either locally (in a single JVM) or as a Hadoop job. The environment variable MAHOUT_LOCAL, once set (no matter what value it has been set to), makes Mahout run locally. However, running locally produced the following error:

MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
MAHOUT_LOCAL is set, running locally
Error occurred during initialization of VM
Could not reserve enough space for object heap
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
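The heap error means the JVM's default maximum heap is larger than the machine can provide. A hedged workaround, assuming bin/mahout honors the MAHOUT_HEAPSIZE variable (in MB) as the 0.x scripts do:

export MAHOUT_LOCAL=true      # any value enables local mode
export MAHOUT_HEAPSIZE=512    # cap the heap at 512 MB
bin/mahout seqdumper -i <input> -o <output>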

HDFS
 * When running as a Hadoop job, Mahout assumes it reads and writes files on HDFS, so make sure HADOOP_CONF_DIR is set to $HADOOP_HOME/conf; a sketch follows after this list.
 * Format conversion and other utilities: Mahout provides a number of utilities to convert text files to sequence files on HDFS.
 * seqdumper can dump most sequence files in HDFS to a local directory:

bin/mahout seqdumper -i imdb-kmeans/clusteredPoints -o $IMDB_DUMP/docs_dump

Here the input is an HDFS file and the output is a local directory.


 * More to come.
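A hedged sketch of the environment for running Mahout as a Hadoop job, reusing the working directory named earlier on this page:

export HADOOP_HOME=/home/bloodysnow/Documents/code/hadoop-0.20.203.0
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
unset MAHOUT_LOCAL            # must be unset, or Mahout runs locally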

Clustering
 * Command line for running k-means clustering on a directory of free-text documents:

# input and output should all be on HDFS
bin/mahout seqdirectory \
  --input imdb_input \
  --output test

# dump the seqdirectory output
bin/mahout seqdumper \
  -i test \
  -o $IMDB_DUMP/seqdir_dump

# output of the dump
Input Path: hdfs://localhost:54311/user/bloodysnow/test/chunk-0
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.Text
Key: /IMDBNewsTraining_small_title_body_0: Value: Mike Wallace Scores Biggest "Get" of Year  He may be 89 years old and officially retired, but Mike Wallace may have scored the biggest "get" of the year -- an interview with Iranian President Mahmoud Ahma ....

# convert the sequence files to sparse tf-idf vectors
bin/mahout seq2sparse \
  -i test \
  -o imdb_index \
  -wt tfidf \
  -chunk 5 \
  --minSupport 2 \
  --minDF 5 \
  --maxDFPercent 90 \
  --norm 2 \
  --namedVector

# dump the dictionary
bin/mahout seqdumper \
  -i imdb_index/dictionary.file-0 \
  -o $IMDB_DUMP/dictionary_dump

# key is the word, value is the id
Key: ahmadinejad: Value: 75
Key: aicn: Value: 76
Key: ain't: Value: 77
Key: aint: Value: 78
Key: air: Value: 79
Key: alamo: Value: 80
Key: album: Value: 81
Key: alex: Value: 82
Key: ali: Value: 83
Key: alien: Value: 84
Key: all: Value: 85
Key: allowed: Value: 86
Key: almost: Value: 87
Key: alone: Value: 88

# dump the frequency file
bin/mahout seqdumper \
  -i imdb_index/frequency.file-0 \
  -o $IMDB_DUMP/frequency_dump

# frequency: key is the word id, value is its frequency
Key: 949: Value: 2
Key: 950: Value: 24
Key: 951: Value: 8
Key: 952: Value: 3
Key: 953: Value: 2
Key: 954: Value: 2
Key: 955: Value: 9
Key: 956: Value: 2
Key: 957: Value: 1
Key: 958: Value: 4
Key: 959: Value: 10

# dump the word counts
bin/mahout seqdumper \
  -i imdb_index/wordcount/part-r-00000 \
  -o $IMDB_DUMP/wordcount_dump

# wordcount: key is the token, value is its count
Key: 2002: Value: 2
Key: 2008: Value: 7
Key: 2009: Value: 10
Key: 2010: Value: 2

# dump the tf-idf vectors; each value is a named vector of term id -> tf-idf weight pairs
bin/mahout seqdumper \
  -i imdb_index/tfidf-vectors/part-r-00000 \
  -o $IMDB_DUMP/tfidf

Key: /IMDBNewsTraining_small_title_body_0: Value: /IMDBNewsTraining_small_title_body_0:{194:0.13152226215516968,819:0.13152226215516968,1639:0.13152226215516968,1500:0.13

# use the dictionary to look up the term ids above

# cluster with k-means (k = 3, at most 10 iterations, cosine distance)
bin/mahout kmeans \
  -i imdb_index/tfidf-vectors/ \
  -c imdb_cluster/ \
  -o imdb-kmeans \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -x 10 \
  -k 3 \
  --clustering

# cluster dump: representative keywords for each cluster
bin/mahout clusterdump \
  -i imdb-kmeans/clusters-*-final \
  -o $IMDB_DUMP/kmeans3_clusterdump \
  -d imdb_index/dictionary.file-0 \
  -dt sequencefile \
  -b 20 -n 20 -sp 0

# the results seem not to filter enough stop words
bin/mahout seqdumper \
  -i imdb-kmeans/clusteredPoints \
  -o $IMDB_DUMP/docs_dump


 * Tuning parameters
 * Visualizing / validating clusters