Recording Life
Category: Hadoop

How to run a Hadoop Streaming job on Brisk

<Category: Hadoop>
/usr/local/brisk-1.0/bin/brisk hadoop jar /usr/local/brisk-1.0/resources/hadoop/hadoop-streaming-0.20.203.1-brisk1-beta2.jar \
-file /tmp/testmr/mapper.py \
-file /tmp/testmr/reducer.py \
-reducer /tmp/testmr/reducer.py \
-mapper /tmp/testmr/mapper.py \
-input /test_input/test_userId -output /test_output1

-----
[root@platformD testmr]# ./job.sh
rmr: cannot remove /test_output: No such file or directory.
File: /tmp/testmr/-Dbrisk.job.tracker=10.129.6.36:8012 does not exist, or is not readable

/usr/local/brisk-1.0/bin/brisk hadoop jar /hadoop-0.21.0-streaming.jar \
-file /tmp/testmr/mapper.py -mapper /tmp/testmr/mapper.py \
-file /tmp/testmr/reducer.py -reducer /tmp/testmr/reducer.py \
-input /test_input/test_userId  -output /test_output
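
The "does not exist, or is not readable" message in the transcript suggests that -Dbrisk.job.tracker=... ended up where the streaming jar expected a -file path. Hadoop Streaming expects generic options such as -D to come before the streaming options, so one possible layout for job.sh is sketched below. The real job.sh is not shown in this post, so the ordering problem is an assumption; the paths and the job tracker address are copied from the transcript, and the sketch only prints the command instead of running it.

```shell
#!/bin/sh
# Sketch of a job.sh. The argument ordering is the assumed fix;
# everything else is copied from the commands and error output above.
BRISK=/usr/local/brisk-1.0/bin/brisk
JAR=/usr/local/brisk-1.0/resources/hadoop/hadoop-streaming-0.20.203.1-brisk1-beta2.jar

streaming_args() {
    # Generic options (-D...) go first
    printf '%s\n' -Dbrisk.job.tracker=10.129.6.36:8012
    # Streaming-specific options follow
    printf '%s\n' \
        -file /tmp/testmr/mapper.py -mapper /tmp/testmr/mapper.py \
        -file /tmp/testmr/reducer.py -reducer /tmp/testmr/reducer.py \
        -input /test_input/test_userId -output /test_output
}

# Dry run: print the command. Remove "echo" to actually submit the job.
echo "$BRISK" hadoop jar "$JAR" $(streaming_args)
```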


Brisk debugging and deployment: a complete record

<Category: cassandra, Hadoop, nosql>

Notes from a quick test of Brisk.
Reference link:

http://www.datastax.com/docs/0.8//about_pig


What on earth is stream computing?

<Category: Hadoop, Distributed>

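Stream computing seems to be catching on right now. Stream computing (or stream processing) is mainly used for real-time data analysis, for example on live transaction data, ads, and queries.

As we know, offline analysis with Hadoop always comes with some latency: you have to wait until the data has been collected and run through a whole series of processing steps, and by the time the report comes out the news is already stale. Stream computing fills exactly this gap by analyzing, in real time, the data produced by events as they happen. FlumeBase is one such project. It is built on top of Flume (Cloudera's distributed log collection system) and provides a SQL-like query language (rtsql).

FlumeBase lets users dynamically insert queries into a Flume log collection environment. These queries inspect the incoming log events, and any event that matches the query conditions is processed accordingly, for tasks such as continuous monitoring, data format conversion, and filtering.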
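
To make the idea of a standing query concrete, here is a rough shell analogy. It is not FlumeBase or rtsql syntax; the function name standing_query and the "timestamp level message" log layout are assumptions made up for this illustration. The function keeps reading events from standard input, filters them by level, and converts each match to a "level,message" format.

```shell
# Hypothetical stand-in for a standing query over a log stream.
# Input lines are assumed to look like: "<timestamp> <level> <message...>"
standing_query() {
    level=$1
    awk -v level="$level" '$2 == level {
        # Rebuild the message from the third field onwards
        msg = $3
        for (i = 4; i <= NF; i++) msg = msg " " $i
        # Emit the event in a converted "level,message" format
        print $2 "," msg
    }'
}
```

It could be attached to a live log with something like `tail -F /var/log/app.log | standing_query ERROR`, where the log path is again only an example.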

https://github.com/cloudera/flume

https://github.com/flumebase/flumebase

http://blog.flumebase.org/?p=14

http://flumebase.org/documentation/0.2.0/UserGuide.html#d0e7

http://www.docin.com/p-152156266.html

Yahoo's S4 is a similar open-source stream computing framework. S4 seems considerably more mature than Flume, but both are worth watching.

http://s4.io/

S4 was originally developed for Yahoo's personalized advertising products and is claimed to process thousands of events per second. http://docs.s4.io/manual/overview.html


Hadoop and MapReduce: Big Data Analytics [gartner]

<Category: Hadoop>

Bookmarked. Download link: http://dl.medcl.com/get.php?id=29&path=books%2Fgartner%2CHadoop+and+MapReduce+Big+Data+Analytics.7z


Hive Derby lock and directory permission errors

<Category: Hadoop>

FAILED: Error in metadata: javax.jdo.JDOFatalDataStoreException: Cannot get a connection, pool error Could not create a validated object, cause: A read-only user or a user in a read-only database is not permitted to disable read-only mode on a connection.
NestedThrowables:
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Could not create a validated object, cause: A read-only user or a user in a read-only database is not permitted to disable read-only mode on a connection.
FAILED: Execution Error, return code 1 from org.apache...ql.exec.DDLTask
Hive history file=/tmp/dev/hive_job_log_dev_201107062337_381665684.txt
FAILED: Error in semantic analysis: line 1:83 Exception while processing raw_daily_stats_table: Unable to fetch table raw_daily_stats_table

Check the Hive configuration file /etc/hive/conf/hive-default.xml to find the warehouse directory, which is where the table data is stored:

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
  <description>location of default database for the warehouse</description>
</property>

Listing that HDFS directory shows:
/user/hive/warehouse

raw_daily_pagecounts_table	dir				2011-03-28 15:39	rwxr-xr-x	dev	supergroup
raw_daily_stats_table	dir				2011-07-06 23:03	rwxr-xr-x	root	supergroup

The raw_daily_stats_table directory has ended up owned by root, but I run Hive as dev.

So I ran:

hadoop fs -chown -R dev:dev /user/hive/warehouse/raw_daily_stats_table

But the same error still came back. Good grief.

FAILED: Error in metadata: javax.jdo.JDOFatalDataStoreException: Cannot get a connection, pool error Could not create a validated object, cause: A read-only user or a user in a read-only database is not permitted to disable read-only mode on a connection.
NestedThrowables:
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Could not create a validated object, cause: A read-only user or a user in a read-only database is not permitted to disable read-only mode on a connection.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

Opening the configuration file /etc/hive/conf/hive-site.xml turns up the following property:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/var/lib/hive/metastore/metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

Then go to the corresponding directory:

[dev@platformB metastore]$ ls -al
total 16
drwxrwxrwt 4 root root 4096 Mar 28 15:25 .
drwxr-xr-x 3 root root 4096 Mar 28 15:22 ..
drwxrwxr-x 5 dev  dev  4096 Jul  6 23:51 metastore_db
drwxrwxrwt 3 root root 4096 Mar 28 15:22 scripts
[dev@platformB metastore]$ cd metastore_db/
[dev@platformB metastore_db]$ ls
dbex.lck  db.lck  log  seg0  service.properties  tmp
[dev@platformB metastore_db]$ ls -al
total 32
drwxrwxr-x 5 dev  dev  4096 Jul  6 23:51 .
drwxrwxrwt 4 root root 4096 Mar 28 15:25 ..
-rw-rw-r-- 1 dev  dev     4 Jul  6 23:03 dbex.lck
-rw-r--r-- 1 root root   38 Jul  6 23:03 db.lck
drwxrwxr-x 2 dev  dev  4096 Jul  6 15:22 log
drwxrwxr-x 2 dev  dev  4096 Apr  6 13:04 seg0
-rw-rw-r-- 1 dev  dev   860 Mar 28 15:25 service.properties
drwxrwxr-x 2 dev  dev  4096 Jul  6 23:51 tmp
[dev@platformB metastore_db]$

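Note that db.lck is owned by root and carries the same timestamp (Jul 6 23:03) as the root-owned table directory, so it was presumably left behind by a Hive session run as root. The dev user cannot take over that lock, which would explain why Derby reports a read-only database.

Delete db.lck and dbex.lck.

Run the Hadoop scripts again and everything works.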
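
Before deleting anything, it can help to see which lock files exist and who owns them. The helper below is a small sketch for that; the name list_derby_locks is made up for this example, and it assumes GNU find and GNU stat as found on a typical Linux box.

```shell
# Print "<owner> <path>" for every Derby lock file (*.lck) directly inside
# the given metastore directory. Prints nothing if there are no lock files.
list_derby_locks() {
    find "$1" -maxdepth 1 -type f -name '*.lck' -exec stat -c '%U %n' {} +
}
```

For the setup above it would be called as `list_derby_locks /var/lib/hive/metastore/metastore_db`. Lock files should only be removed when no Hive or Derby process is still using the metastore.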


Trending topics: handling dates and empty directories

<Category: Hadoop, Linux>

First check how many files are in the Hadoop directory, then decide whether to add that directory to the job's input:
[dev@platformB dailyrawdata]$ hadoop fs -ls / | wc -l
3

Ways to compute dates:
[dev@platformB dailyrawdata]$ lastdate=20110619
[dev@platformB dailyrawdata]$ echo $lastdate
20110619
[dev@platformB dailyrawdata]$ date --date "$lastdate + 1 day" +"%Y%m%d"
20110620

[dev@platformB dailyrawdata]$ echo D9=`date --date "now -20 day" +"%Y%m%d"`
D9=20110530

 

[dev@platformB dailyrawdata]$ D1=`date --date "now" +"%Y/%m/%d"`
[dev@platformB dailyrawdata]$ echo $D1
2011/06/20

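Note: there must be no space after the equals sign. Otherwise bash treats D1= as an empty temporary assignment and tries to execute the output of date as a command, as shown here: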

[dev@platformB dailyrawdata]$ D1= `date --date "now" +"%Y/%m/%d"`
-bash: 2011/06/20: No such file or directory
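
The date arithmetic above can be wrapped in a small helper. This is a minimal sketch that relies on GNU date's --date parsing; the name shift_day is made up for this example.

```shell
# Usage: shift_day YYYYMMDD OFFSET_IN_DAYS [FORMAT]
# OFFSET_IN_DAYS is a signed number such as +1 or -20.
# FORMAT defaults to %Y%m%d.
shift_day() {
    date --date "$1 $2 day" +"${3:-%Y%m%d}"
}

# The day after 20110619, in the default format
shift_day 20110619 +1
```

A third argument such as `%Y/%m/%d` gives the slash-separated form used for the HDFS directory layout.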

 

Copy today's files to the target directory:

DAYSTR=`date --date "now" +"%Y/%m/%d"`

hadoop fs -copyFromLocal dailyrawdata/* /trendingtopics/data/raw/$DAYSTR

 

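But wait: when the directory is empty, a Hadoop Streaming job throws an exception because it cannot find any files matching the input pattern you specified, and the job fails.

I searched for a long time without finding a good solution (if anyone knows a better one, please share). For now the only option is to check whether the directory is empty first, and if it is, redirect that input to an empty file.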

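# Create an empty placeholder file on HDFS
BLANK="/your folder/temp/blank"
hadoop fs -touchz "$BLANK"

# check_hdfs_files PATTERN VARNAME
# If no HDFS file matches PATTERN, point the variable named VARNAME
# at the blank file instead.
function check_hdfs_files() {
    # hadoop fs -ls exits non-zero when nothing matches the pattern
    if ! hadoop fs -ls "$1" &>/dev/null
    then
        # printf -v avoids eval, which breaks on paths containing spaces
        printf -v "$2" '%s' "$BLANK"
        echo "can't find any files, use blank file instead"
    fi

    # Always succeed, as the original did; callers only need the variable
    return 0
}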

 

--

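D0=$(date --date "now" +"/your folder/%Y/%m/%d/${APPNAME}-${TENANT}*")
D1=$(date --date "now -1 day" +"/your folder/%Y/%m/%d/${APPNAME}-${TENANT}*")

# Fall back to the blank file when a pattern matches nothing.
# Quote the patterns so the local shell does not split or expand them.
check_hdfs_files "$D0" "D0"
check_hdfs_files "$D1" "D1"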
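
A possible variation on the same idea, offered only as a sketch: instead of replacing each missing pattern with the blank file, build the list of -input arguments from the patterns that do match, and use the blank file only when none of them match. The names build_inputs and HDFS_LS are made up for this example; HDFS_LS exists only so that the listing command can be swapped out, and the sketch assumes paths without whitespace.

```shell
# Command used to test whether a pattern matches anything on HDFS.
HDFS_LS=${HDFS_LS:-"hadoop fs -ls"}

# Usage: build_inputs BLANK_FILE PATTERN...
# Prints one "-input <pattern>" line per pattern that matches,
# or a single "-input <blank file>" line if none of them match.
build_inputs() {
    local blank=$1
    shift
    local found=0
    local pattern
    for pattern in "$@"; do
        # $HDFS_LS is left unquoted on purpose so "hadoop fs -ls" splits into words
        if $HDFS_LS "$pattern" >/dev/null 2>&1; then
            printf -- '-input %s\n' "$pattern"
            found=1
        fi
    done
    if [ "$found" -eq 0 ]; then
        printf -- '-input %s\n' "$blank"
    fi
}
```

The output can then be spliced into the streaming command line, for example with `$(build_inputs /tmp/blank "$D0" "$D1")`, where /tmp/blank is a placeholder path.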


hadoop thrift client

<Category: Hadoop>

http://code.google.com/p/-sharp/

Doesn't seem up to the job, pass.

http://wiki.apache.org/hadoop/-APIs

http://wiki.apache.org/hadoop/MountableHDFS

http://wiki.apache.org/hadoop/Hbase/Stargate

http://hadoop.apache.org/hdfs/docs/r0.21.0/hdfsproxy.html

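None of these are up to the job, so Thrift it is. I looked through the svn repository: there are ready-made bindings for Cocoa and the like, so why is there none for C#? Faint.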

Hive installation tips

<Category: Hadoop>

Installing Hive

Download link:

http://.apache.org/releases.html


Hadoop cluster configuration (CentOS, CDH3)

<Category: Hadoop>

ref:http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/

https://docs.cloudera.com/display/DOC/CDH3+Installation

Noting this down: I haven't touched Hadoop in a long while and have gotten rusty.

Advanced Hadoop Tuning & Optimisation

<Category: Hadoop, nosql, Distributed>

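Over the weekend I attended Milind Bhandarkar's talk "Hadoop application performance tuning case studies" and also toured Yahoo's R&D center. Quite a few people turned up, and I learned a lot. Milind Bhandarkar covered Hadoop configuration and tuning along with some tips from experience at Yahoo, and also introduced a diagnostic and analysis framework called Hadoop Vaidya. The slides download link is attached.
Below is another slide deck on Hadoop tuning that I found; I'm studying it now.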

PPT on Advanced Hadoop Tuning n Optimisation
