Wednesday 4 January 2012

Hadoop: Example Program:
This example uses Hadoop Streaming to interact with Hadoop. Hadoop Streaming is an API that lets you write your mapper and reducer in any language, using a simple text-based record format for the input and output <key, value> pairs.
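To give a feel for that format, here is a minimal word-count mapper and reducer sketched in Python. These are hypothetical illustrations, not the multifetch.py/reducer.py used below: each record is just a line on stdin/stdout, with the key and the value separated by a tab.

#!/usr/bin/env python
# mapper.py -- minimal Hadoop Streaming mapper (hypothetical sketch).
# Reads text lines from stdin and emits tab-separated <word, 1> pairs.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        sys.stdout.write('%s\t%s\n' % (word, 1))

#!/usr/bin/env python
# reducer.py -- minimal Hadoop Streaming reducer (hypothetical sketch).
# Streaming sorts the mapper output by key, so equal keys arrive together.
import sys

current_word, count = None, 0
for line in sys.stdin:
    if not line.strip():
        continue
    word, value = line.rstrip('\n').split('\t', 1)
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            sys.stdout.write('%s\t%d\n' % (current_word, count))
        current_word, count = word, int(value)
if current_word is not None:
    sys.stdout.write('%s\t%d\n' % (current_word, count))

You can test such a pair locally, without Hadoop, with an ordinary pipe (input.txt is any text file):
$ cat input.txt | python mapper.py | sort | python reducer.py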

Hadoop example program.

The above link is a tutorial that shows how to write a MapReduce program in Python and then run it through Hadoop.

When you reach the part where you need to put your files into HDFS, don't follow the instructions in the tutorial. Instead, do the following:

$ bin/hadoop fs -mkdir urls
$ bin/hadoop fs -put url1 urls/
$ bin/hadoop fs -put url2 urls/
$ bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
      -mapper $HOME/proj/hadoop/multifetch.py \
      -reducer $HOME/proj/hadoop/reducer.py \
      -input urls/* \
      -output titles
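When the job finishes, the results end up in the titles directory in HDFS. Something like the following should print them (part-* is the usual naming convention for Hadoop output files):

$ bin/hadoop fs -cat titles/part-*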

Hadoop: How to set up a single node: 
This section covers the simple operations needed to use Hadoop MapReduce and the Hadoop Distributed File System (HDFS) on a single node.

Prerequisites
GNU/Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes.
Win32 is supported as a development platform. Distributed operation has not been well tested on Win32, so it is not supported as a production platform.

ssh and rsync also need to be installed. On Ubuntu/Debian:
$ sudo apt-get install ssh
$ sudo apt-get install rsync

Installation
Download a recent stable release from one of the Apache Download Mirrors.
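Then unpack it and move into the directory. For example, assuming the 0.20.2 release used earlier in this post (adjust the file name to whatever you downloaded):

$ tar xzf hadoop-0.20.2.tar.gz
$ cd hadoop-0.20.2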

Edit the file conf/hadoop-env.sh and point JAVA_HOME to the root of your Java installation. To find out where your Java lives, run: $ echo $JAVA_HOME
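The line in conf/hadoop-env.sh would then look something like this (the path below is just an illustration; use whatever your machine reports):

# in conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk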

Pseudo-Distributed Operation
Hadoop can also be run on a single node in pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process. Follow this link.
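Roughly, once you have edited the conf files as described in that link, you format a new distributed filesystem and start the Hadoop daemons (commands from the 0.20 release line):

$ bin/hadoop namenode -format
$ bin/start-all.sh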

In one of the steps you will need to ssh to the localhost:
$ ssh localhost

If you get this error:
ssh: connect to host localhost port 22: Connection refused


Then try:
$ sudo apt-get install openssh-server
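
If ssh localhost then asks for a password, you can also set up passphraseless ssh, as the single-node guide suggests:

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys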