Hadoop Cluster Setup (CentOS / CDH3)

ref: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
ref: https://docs.cloudera.com/display/DOC/CDH3+Installation

Just a note to self; it has been a while since I last touched Hadoop and I have gotten rusty.

ENV

JDK 1.6
wget 'http://cds.sun.com/is-bin/INTERSHOP.enfinity/WFS/CDS-CDS_Developer-Site/en_US/-/USD/VerifyItem-Start/jdk-6u24-linux-x64-rpm.bin?BundledLineItemUUID=eE6J_hCxGTcAAAEuiVIWty2U&OrderID=TPKJ_hCxXJYAAAEudFIWty2U&ProductID=oSKJ_hCwOlYAAAEtBcoADqmS&FileName=/jdk-6u24-linux-x64-rpm.bin'
sudo rpm -Uvh ./jdk-6u24-linux-amd64.rpm
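The download is a self-extracting .bin rather than a plain RPM. A minimal sketch for unpacking it before the rpm step above, assuming the default download filename (the RPM it extracts may be named slightly differently from the jdk-6u24-linux-amd64.rpm used here):
chmod +x jdk-6u24-linux-x64-rpm.bin
sudo ./jdk-6u24-linux-x64-rpm.bin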

Copy the Java installer to each slave and install Java there (mind the version):
scp jdk-6u24-linux-amd64.rpm platformA:/home/dev

Configure host aliases on both machines: platformA (slave) and platformB (master):
sudo vi /etc/hosts
10.129.8.58     platformA
10.129.8.74     platformB
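On both machines it is worth verifying that these aliases resolve to the LAN addresses above and not to 127.0.0.1 (a recurring pitfall later in these notes); one quick check:
getent hosts platformA platformB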
master (platformB)
cd /etc/yum.repos.d/
sudo wget http://archive.cloudera.com/redhat/cdh/cloudera-cdh3.repo

sudo yum install hadoop-0.20
sudo yum install hadoop-0.20-namenode
sudo yum install hadoop-0.20-datanode
sudo yum install hadoop-0.20-jobtracker
sudo yum install hadoop-0.20-tasktracker

Make sure the master and the slave use the same username:
sudo /usr/sbin/adduser dev
passwd dev
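ssh-copy-id below assumes the dev user already has an SSH key pair; if not, generate one first (a sketch assuming RSA and an empty passphrase so the Hadoop start scripts can log in without prompting):
ssh-keygen -t rsa -P "" -f $HOME/.ssh/id_rsa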

ssh-copy-id -i $HOME/.ssh/id_rsa.pub dev@platformA
ssh dev@platformA

ssh-copy-id -i $HOME/.ssh/id_rsa.pub dev@platformB
ssh dev@platformB

slave (platformA)

sudo rpm -Uvh ./jdk-6u24-linux-amd64.rpm

cd /etc/yum.repos.d/
sudo wget http://archive.cloudera.com/redhat/cdh/cloudera-cdh3.repo
sudo yum install hadoop-0.20

#install daemon
sudo yum install hadoop-0.20-datanode
sudo yum install hadoop-0.20-tasktracker

sudo chmod -R 777 /usr/lib/hadoop/logs
sudo chmod -R 777 /usr/lib/hadoop/pids

cd /var/lib
sudo mkdir hadoop-0.20
sudo chmod 777 -R  hadoop-0.20

Edit the configuration: sudo vi conf/masters and add the master server:
platformB

test
hadoop-0.20 jar /usr/lib/hadoop-0.20/hadoop-*-examples.jar grep input output20 'con[a-z.]+'
(This assumes an input directory already exists in HDFS; see the "Run an MR job on the cluster" section at the end for creating one.)

config

A. platformB (master)

1. conf/masters: set the master node here. If there is a second master, add it as well; at startup the second master is started by the primary master (the node where you run bin/start-dfs.sh automatically becomes the primary master).
platformB

2. conf/slaves: list the slave nodes; we currently have two:
platformB
platformA

B. platformA / platformB (all machines)
Important: You have to change the configuration files conf/core-site.xml, conf/mapred-site.xml and conf/hdfs-site.xml on ALL machines as follows.

on platformB:
Copy the sample configuration:
cd conf.pseudo/
sudo cp *.* ../../conf/

Make a few changes:
<!-- In: conf/core-site.xml -->
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://platformB:8020</value><!-- Use the host alias here; it must match the slaves' configuration. Also mind the hosts file: a hostname must map to exactly one IP, otherwise communication with the slaves fails. -->
</property>

<property>
<name>hadoop.tmp.dir</name>
<value>/var/lib/hadoop-0.20/cache/${user.name}</value>
</property>
</configuration>

<!-- In: conf/mapred-site.xml -->

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>platformB:8021</value>
</property>

<!-- Enable Hue plugins -->
<property>
<name>mapred.jobtracker.plugins</name>
<value>org.apache.hadoop.thriftfs.ThriftJobTrackerPlugin</value>
<description>Comma-separated list of jobtracker plug-ins to be activated.
</description>
</property>
<property>
<name>jobtracker.thrift.address</name>
<value>0.0.0.0:9290</value>
</property>
</configuration>

<!-- In: conf/hdfs-site.xml -->
Adjust the dfs.replication count as needed.
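For reference, a minimal hdfs-site.xml sketch for this two-node cluster, assuming each block should be replicated to both datanodes:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>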

Cluster
Copy the configuration files above to platformA.
1. Format the namenode
Create the cache dir (on platformA and platformB):
cd /var/lib
sudo mkdir hadoop-0.20
sudo chmod 777 -R  hadoop-0.20
/usr/lib/hadoop/bin/hadoop namenode -format

If you hit an error like the following:
Re-format filesystem in /var/lib/hadoop-0.20/cache/hadoop/dfs/name ? (Y or N) y
Format aborted in /var/lib/hadoop-0.20/cache/hadoop/dfs/name
Note that the prompt appears to accept only an uppercase Y (a lowercase y aborts, as in the transcript above). Delete the old directories under the cache dir and format again.
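A sketch of that cleanup, assuming the cache location configured in hadoop.tmp.dir above and the path shown in the prompt:
sudo rm -rf /var/lib/hadoop-0.20/cache/hadoop/dfs/name
/usr/lib/hadoop/bin/hadoop namenode -format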

2. Start the cluster (on platformB)
sudo chmod -R 777 /usr/lib/hadoop/logs
sudo chmod -R 777 /usr/lib/hadoop/pids
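Then start HDFS from the master (this is the counterpart of the bin/stop-dfs.sh used in the shutdown section below; the path assumes the package layout used so far):
/usr/lib/hadoop/bin/start-dfs.sh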

If you hit an error like:
touch: cannot touch `/usr/lib/hadoop/bin/../logs/hadoop-dev-datanode-platformA.out': Permission denied
run this on platformA:
sudo chmod -R 777 /usr/lib/hadoop/logs
sudo chmod -R 777 /usr/lib/hadoop/pids

Check that the service port is listening and the daemons are up:
netstat -ano|grep 8020
jps

Check the master's log to make sure DFS on the master started successfully (platformB):
cat logs/hadoop-dev-datanode-platformB.log

Check the slave's log to make sure DFS on the slave started successfully (platformA):
cat logs/hadoop-dev-datanode-platformA.log

3. Start the MapReduce services
bin/start-mapred.sh (on platformB, the master)

Check the logs on each server:
cat  logs/hadoop-dev-tasktracker-platformB.log
cat  logs/hadoop-dev-tasktracker-platformA.log

Error:
2011-02-18 18:03:48,924 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: platformB/10.129.8.74:8021. Already tried 6 time(s).
This is the same hostname/IP binding problem; fix it as described above.
At this point the installation is complete. Run jps:
jps

master:
[dev@platformB hadoop]$ jps
11805 JobTracker
11095 DataNode
11902 TaskTracker
10992 NameNode
11233 SecondaryNameNode

slave:
7938 DataNode
8255 TaskTracker

Stopping the cluster is the reverse of starting it:
bin/stop-mapred.sh (on master)
bin/stop-dfs.sh (on master)

Adding more nodes
Install and configure the new slave as above, stop the cluster, add the new slave's address to the slaves file on the master, and start the cluster again (or see the sketch below for starting only the new node's daemons).
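A sketch of bringing up only the new node's daemons via the CDH3 service scripts instead of restarting everything, assuming the hadoop-0.20-datanode and hadoop-0.20-tasktracker packages were installed on it as above:
sudo service hadoop-0.20-datanode start
sudo service hadoop-0.20-tasktracker start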

A few web console addresses for checking live nodes, logs, and so on:
http://localhost:50030/ – web UI for MapReduce job tracker(s)
http://localhost:50060/ – web UI for task tracker(s)
http://localhost:50070/ – web UI for HDFS name node(s)

Run an MR job on the cluster

Create the input directory:

[dev@platformB hadoop]$ hadoop fs -mkdir input

Copy test files into the input directory on DFS; here we use a few XML config files:

[dev@platformB hadoop]$ hadoop fs -copyFromLocal conf/*.xml input

Run the MapReduce job:

[dev@platformB hadoop]$ bin/hadoop jar hadoop-0.20.2+737-examples.jar wordcount input/* output1
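Once the job finishes, inspect the output; a sketch assuming the usual part-file naming produced by this wordcount example:
hadoop fs -ls output1
hadoop fs -cat output1/part-r-00000 | head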

If you hit an error like this:

16:17:48,327 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201102201606_0001_m_000008_0'
2011-02-20 16:17:51,203 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201102201606_0001_r_000002_0: Error initializing attempt_201102201606_0001_r_000002_0:
java.lang.IllegalArgumentException: Wrong FS: hdfs://10.129.8.74/var/lib/hadoop-0.20/cache/dev/mapred/system/job_201102201606_0001/jobToken, expected: hdfs://platformB
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:385)
    at org.apache.hadoop.hdfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:106)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:162)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:515)
    at org.apache.hadoop.mapred.TaskTracker.localizeJobTokenFile(TaskTracker.java:3968)
    at org.apache.hadoop.mapred.TaskTracker.localizeJobFiles(TaskTracker.java:1020)
    at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:967)
    at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2209)
    at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2174)
2011-02-20 16:17:51,204 ERROR org.apache.hadoop.mapred.TaskStatus: Trying to set finish time for task attempt_201102201606_0001_r_000002_0 when no start time is set, stackTrace is : java.lang.Exception
    at org.apache.hadoop.mapred.TaskStatus.setFinishTime(TaskStatus.java:145)
    at org.apache.hadoop.mapred.ReduceTaskStatus.setFinishTime(ReduceTaskStatus.java:64)
    at org.apache.hadoop.mapred.TaskInProgress.incompleteSubTask(TaskInProgress.java:665)
    at org.apache.hadoop.mapred.JobInProgress.failedTask(JobInProgress.java:2729)
    at org.apache.hadoop.mapred.JobInProgress.updateTaskStatus(JobInProgress.java:1069)
    at org.apache.hadoop.mapred.JobTracker.updateTaskStatuses(JobTracker.java:4481)
    at org.apache.hadoop.mapred.JobTracker.processHeartbeat(JobTracker.java:3455)
    at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:3154)
    at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:528)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1319)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1315)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1313)

Check your /etc/hosts configuration: if one hostname maps to more than one IP, you get this error. Here the platformB entry pointed to both 127.0.0.1 and 10.129.8.74; remove the 127.0.0.1 mapping and restart the cluster.
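For illustration, /etc/hosts on platformB should end up looking roughly like this, with the loopback line no longer mentioning platformB (the exact localhost aliases may differ):
127.0.0.1       localhost localhost.localdomain
10.129.8.58     platformA
10.129.8.74     platformB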