Hadoop for Windows 7 64-bit
I recently got a new job, yay for me! Part of this new job involves working with Hadoop and MapReduce. I had previously read the tutorials and understood how it works, but never got into the actual code. As a historically Microsoft developer, stepping into Unix and Java was a scary thought and seemed like a challenge. Despite my fear of Java, I could see the benefits of starting to step away from Microsoft and into the new open-source world of Big Data. However, baby steps are required, and stepping away from Windows was one step too far for me right now: I have so much other development to do and mostly use Visual Studio.
SSH service setup
I started by following this tutorial (https://gist.github.com/tariqmislam/2159173), but found what seemed to be a better SSHD guide (http://evalumation.com/blog/86-cygwin-windows7-sshd) and then a potentially even better one (http://venuktan.wordpress.com/2012/11/14/setting-up-hadoop-0-20-2-single-node-cluster-on-ubuntu/).
Cygwin
First things first: you need an introduction to Cygwin. The site explains what it is and is not, but the best explanation I have seen is "Cygwin provides native integration of Windows-based applications, data, and other system resources with applications, software tools, and data of the Unix-like environment" on Wikipedia.
Problems with installing Cygwin
The install is fairly easy, but I did come across some problems, one of which is picking the right mirror. You pick the mirror in the installer itself. It is almost impossible to know which one is right for you, and I found that some of them resulted in a bad install, so you have to go back and pick another. It is a simple resolution, but it can be annoying to figure out.
Problems with Setting up SSH
1) Running the SSH install (Privilege separation issue)
No matter how many guides I followed, I always came up against an issue with privilege separation. If you cannot get your service to start then you likely have the same issue; you can test this by running ssh.exe in the Cygwin tool to see the full error. I found two solutions to this: the first felt like a hack, the second changed my life (maybe a bit dramatic, but it helped).
- Add config: Add "sshd:x:71:65:SSH daemon:/var/lib/sshd:/bin/false" to the "etc/passwd" file in the Cygwin directory. You need to replace the numbers in that line with real user IDs; to obtain these numbers just type "id" in the Cygwin prompt. See here for the details that led me to this.
- Install param: Rather than following all the install steps, I found you can just run "ssh-host-config -y". It was by far the easier option.
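As a rough sketch of the first option, the passwd line can be built from the numbers that "id" reports rather than typed by hand. The helper name sshd_passwd_line is my own invention, not part of Cygwin, and the append step is left commented out so you can review the line first.

```shell
# Build the passwd entry described above from a user ID and a group ID.
# sshd_passwd_line is a hypothetical helper name, not a Cygwin command.
sshd_passwd_line() {
    uid="$1"
    gid="$2"
    printf 'sshd:x:%s:%s:SSH daemon:/var/lib/sshd:/bin/false\n' "$uid" "$gid"
}

# Print the line for the current user so it can be checked by eye.
sshd_passwd_line "$(id -u)" "$(id -g)"

# Once it looks right, append it to the passwd file (uncomment to apply):
# sshd_passwd_line "$(id -u)" "$(id -g)" >> /etc/passwd
```

Here "id -u" and "id -g" print just the numeric user and group IDs, which are the same numbers the plain "id" command shows.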
A tip: if it is going wrong, uninstall and start again. If you already have the service installed, use the command line below to remove it.
- C:\> sc delete [service name]
When accessing the SSH service I kept getting an error, as seen here. I found that if you run the command below after updating the ssh/config as described in the link, the error goes away.
- $ ssh localhost -oAddressFamily=inet
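If you would rather not pass the option every time, the same setting can live in the SSH client config. This is only a sketch under my own assumptions: that the per-user config file is ~/.ssh/config, and that forcing IPv4 for localhost is all you need. AddressFamily is a standard ssh_config keyword; add_localhost_ipv4 is a hypothetical helper name.

```shell
# Append a "Host localhost" block that forces IPv4 to the given config file.
# add_localhost_ipv4 is a hypothetical helper, not part of OpenSSH.
add_localhost_ipv4() {
    config_file="$1"
    mkdir -p "$(dirname "$config_file")"
    # Skip the append if a "Host localhost" block already exists,
    # so running this twice does not create duplicates.
    if ! grep -q '^Host localhost$' "$config_file" 2>/dev/null; then
        printf 'Host localhost\n    AddressFamily inet\n' >> "$config_file"
    fi
}

# Assumed location of the per-user client config (uncomment to apply):
# add_localhost_ipv4 "$HOME/.ssh/config"
```

After that, a plain "ssh localhost" should behave the same as the command with -oAddressFamily=inet.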
Hopefully you now have a service up and running.
Hadoop setup
The tutorials I followed all work against an old version of Hadoop, which you can find here. If the link is broken then you need to find version 0.20.2 to run this. I found that newer versions have little or no support for Windows; if you find details of a newer version working on Windows, let me know.
For me it was not clear where to place Hadoop. I tried several places, but in the end you need to put it in the "Home/{UserName}" folder. Before starting the install you need to run "cd hadoop"; this wasn't obvious to me and I kept struggling with being in the wrong directory.
Perhaps this is fixed in newer versions, but when you format you will be asked to confirm; make sure you use an uppercase "Y". See here for details of the bug.
You will need to configure some settings outside of Cygwin, one of which is JAVA_HOME. If you are not familiar with environment variables then look up how to do this; see here for details.
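Once JAVA_HOME is set, a quick check from the Cygwin prompt shows what value the shell actually sees and whether it contains a space (as in "Program Files"), which matters for the Hadoop scripts. This is only a sketch; path_has_space is a hypothetical helper name of my own.

```shell
# Return success (0) when the given path contains at least one space.
# path_has_space is a hypothetical helper, not part of Hadoop or Cygwin.
path_has_space() {
    case "$1" in
        *" "*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Report on whatever JAVA_HOME the current shell sees (empty if unset).
if path_has_space "${JAVA_HOME:-}"; then
    echo "JAVA_HOME contains a space: $JAVA_HOME"
else
    echo "JAVA_HOME has no spaces (or is not set): ${JAVA_HOME:-}"
fi
```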
You may see "bin/hadoop: line 320 : C:\Program: Command not found" in the console. The root cause is the space between "Program" and "Files" in the path. You can easily resolve this with the fix detailed in this link (http://stackoverflow.com/questions/12378199/hadoop-configuration-on-windows-through-Cygwin).
It appears that either Hadoop or Cygwin has moved on since the tutorial I followed. As a result you will need to amend your core-site.xml to include "hdfs://localhost:9100".
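For reference, here is a minimal sketch of what that file might look like, written out from the Cygwin prompt. As far as I know, Hadoop 0.20.x sets the default filesystem through the fs.default.name property, and the file lives in the conf folder of the Hadoop directory; treat both as assumptions to check against your own install. write_core_site is a hypothetical helper, and it overwrites the target file, so back up any existing settings first.

```shell
# Write a minimal core-site.xml pointing HDFS at localhost:9100.
# write_core_site is a hypothetical helper; it OVERWRITES the target file.
write_core_site() {
    cat > "$1" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9100</value>
  </property>
</configuration>
EOF
}

# From inside the hadoop directory (uncomment to apply):
# write_core_site conf/core-site.xml
```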
Looking at the HDFS
Technically, if you have followed the tutorials and worked through the issues above, you should have Hadoop up and running. You can validate this by looking at the following URLs.
You can also make sure that the Hadoop HDFS system is in place by following the details in this link (http://stackoverflow.com/questions/8209616/how-to-read-a-file-from-hdfs-through-browser)
You can also run the MapReduce examples, which are in "{drive}:\cygwin64\home\{username}\hadoop\src\examples\org\apache\hadoop\examples". You run these with the following command in Cygwin, or similar.
- $ bin/hadoop jar hadoop/examples.jar wordcount /user/t1/tharris/input output2
Update
Although this might seem like a great way for a Windows developer to get started, the reality is that you can achieve all of this in minutes if you use AWS or Azure. Basically, save your time on setup and invest it in features.