write down, forget
Category: Hadoop

Modifying cloudera-manager to use a custom repository


We use cloudera-manager to manage our Hadoop cluster, but the official repository is far too slow, so we set up a local mirror. One catch: the repo address is hard-coded inside the package, so unpack the package, edit it, and replace the repo address with the local mirror.
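A minimal sketch of that unpack-and-edit step, assuming the hard-coded address sits in a bundled .repo file inside an RPM; the package name, file path, and mirror URL below are all made-up placeholders:

# unpack the RPM payload into the current directory
rpm2cpio cloudera-manager-agent.rpm | cpio -idmv
# swap the official repo address for the local mirror
sed -i 's|http://archive.cloudera.com|http://mirror.example.local|g' etc/yum.repos.d/cloudera-manager.repo
# then push the edited .repo file out to the cluster nodes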

A China-hosted mirror for Cloudera CDH3


Sharing a Cloudera CDH3 mirror hosted in China. How do you use it?
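The excerpt cuts off before the address itself, so the baseurl below is only a placeholder; the usual way to consume a yum mirror like this is to drop in a .repo file, roughly:

# /etc/yum.repos.d/cloudera-cdh3.repo -- baseurl is a placeholder, not the real mirror
cat > /etc/yum.repos.d/cloudera-cdh3.repo <<'EOF'
[cloudera-cdh3]
name=Cloudera CDH3 (local mirror)
baseurl=http://your-mirror.example.com/cdh/3/
enabled=1
gpgcheck=0
EOF
yum makecache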


How to run a Hadoop streaming job over Brisk


---error---
[root@platformD testmr]# ./job.sh
rmr: cannot remove /test_output: No such file or directory.
File: /tmp/testmr/-Dbrisk.job.tracker=10.129.6.36:8012 does not exist, or is not readable
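The rmr line is only the pre-run cleanup failing harmlessly because the output directory does not exist yet; the real problem is the second line: the streaming jar is treating -Dbrisk.job.tracker=... as a file argument. In Hadoop streaming, generic -D options must come right after the jar, before any streaming-specific flags, so a corrected invocation should look roughly like this (the jar path, input/output paths, mapper, and reducer are made-up placeholders):

brisk hadoop jar $BRISK_HOME/resources/hadoop/contrib/streaming/hadoop-streaming-*.jar \
    -D brisk.job.tracker=10.129.6.36:8012 \
    -input /test_input \
    -output /test_output \
    -mapper /bin/cat \
    -reducer /usr/bin/wc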


Complete notes on deploying and debugging Brisk


Quick notes from testing Brisk.
Reference:
http://www.datastax.com/docs/0.8/brisk/about_pig

What exactly is stream computing?


Stream computing seems to be all the rage right now. Stream (or streaming) computation is mainly used for real-time data analysis: live transaction data, advertising, queries, and so on.

As we know, offline analysis with Hadoop always carries some latency: you have to wait for the data to be collected and run through a whole series of processing steps, and by the time the report comes out it is already stale. Stream computing fills exactly this gap by analyzing event data as it is produced. FlumeBase is one such project: it is built on top of Flume (Cloudera's distributed log collection system) and offers a SQL-like query language called rtsql.

FlumeBase lets you dynamically register queries against a running Flume log-collection environment. These queries inspect incoming log records, and whatever matches the query conditions gets the corresponding treatment: continuous monitoring, format conversion, filtering, and other such tasks.
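As a taste of the query style, a hypothetical example (the stream name, field names, and the shell invocation are all assumptions, not taken from the FlumeBase docs):

# submit a continuous rtsql query through the FlumeBase shell
flumebase shell <<'EOF'
SELECT timestamp, host, message FROM weblogs WHERE message LIKE '%ERROR%';
EOF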

https://github.com/cloudera/flume

https://github.com/flumebase/flumebase

http://blog.flumebase.org/?p=14

http://flumebase.org/documentation/0.2.0/UserGuide.html#d0e7

http://www.docin.com/p-152156266.html

Yahoo's S4 is a similar open-source stream-computing framework; S4 looks quite a bit more mature than FlumeBase, but both are worth keeping an eye on.

http://s4.io/

S4 was originally developed for Yahoo's personalized advertising products and is claimed to handle thousands of events per second. http://docs.s4.io/manual/overview.html


Hadoop and MapReduce: Big Data Analytics [gartner]


Bookmarked. Download: http://dl.medcl.com/get.php?id=29&path=books%2Fgartner%2CHadoop+and+MapReduce+Big+Data+Analytics.7z


Hive Derby lock and directory permission errors


FAILED: Error in metadata: javax.jdo.JDOFatalDataStoreException: Cannot get a connection, pool error Could not create a validated object, cause: A read-only user or a user in a read-only database is not permitted to disable read-only mode on a connection.
NestedThrowables:
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Could not create a validated object, cause: A read-only user or a user in a read-only database is not permitted to disable read-only mode on a connection.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
Hive history file=/tmp/dev/hive_job_log_dev_201107062337_381665684.txt
FAILED: Error in semantic analysis: line 1:83 Exception while processing raw_daily_stats_table: Unable to fetch table raw_daily_stats_table

Check the Hive config file /etc/hive/conf/hive-default.xml to find where your metastore data lives.
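For instance (javax.jdo.option.ConnectionURL is the standard JDO property holding the metastore connection string):

grep -A 1 "javax.jdo.option.ConnectionURL" /etc/hive/conf/hive-default.xml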

Opening the HDFS warehouse directory /user/hive/warehouse showed that the raw_daily_stats_table directory had become owned by root, even though I was running as dev.

So I ran:
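(The command itself is missing from this excerpt; judging from the listing above it was presumably an ownership fix along these lines, with dev as the running user:)

hadoop fs -chown -R dev /user/hive/warehouse/raw_daily_stats_table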

Only to find the same error again. Heavens.

FAILED: Error in metadata: javax.jdo.JDOFatalDataStoreException: Cannot get a connection, pool error Could not create a validated object, cause: A read-only user or a user in a read-only database is not permitted to disable read-only mode on a connection.
NestedThrowables:
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Could not create a validated object, cause: A read-only user or a user in a read-only database is not permitted to disable read-only mode on a connection.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

Opening the config file /etc/hive/conf/hive-site.xml revealed the following node:
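(The node itself was stripped from this excerpt; for an embedded Derby metastore it is typically javax.jdo.option.ConnectionURL, with a value like jdbc:derby:;databaseName=metastore_db;create=true pointing at the Derby database directory.)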

Then I went to the corresponding directory

and killed db.lck and dbex.lck.
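(A sketch of that cleanup, where /path/to/metastore_db stands for whatever directory the ConnectionURL above points at, and assuming no other Hive session still holds the locks:)

rm /path/to/metastore_db/db.lck
rm /path/to/metastore_db/dbex.lck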

Ran the Hadoop scripts again, and everything was OK.


Trending topics: handling dates and empty directories


First check how many files are in the Hadoop directory, then decide whether to add that directory to the job input:
[dev@platformB dailyrawdata]$  hadoop fs -ls /trendingtopics |wc -l
3

How to compute the dates:
[dev@platformB dailyrawdata]$ lastdate=20110619
[dev@platformB dailyrawdata]$ echo $lastdate
20110619
[dev@platformB dailyrawdata]$ echo `date --date "$lastdate +1 day" +"%Y%m%d"`
20110620

[dev@platformB dailyrawdata]$ echo D9=`date --date "now -20 day" +"%Y%m%d"`
D9=20110530

[dev@platformB dailyrawdata]$ D1=`date --date "now" +"%Y/%m/%d"`
[dev@platformB dailyrawdata]$ echo $D1
2011/06/20

Note: there must be no space after the equals sign, otherwise:

[dev@platformB dailyrawdata]$ D1= `date --date "now" +"%Y/%m/%d"`
-bash: 2011/06/20: No such file or directory

Copy today's files to the target directory:

DAYSTR=`date --date "now" +"%Y/%m/%d"`

hadoop fs -copyFromLocal dailyrawdata/* /trendingtopics/data/raw/$DAYSTR

Hold on: when a directory is empty and the Hadoop streaming job cannot find any files matching the input pattern you specified, it throws an exception and the whole job fails.

I hunted around for a long time without finding a good solution (if anyone knows a better way, please do share); the workaround is to first check whether the directory is empty and, if it is, redirect the input to an empty placeholder file, as in the script below.

#touch a blank placeholder file
BLANK="/your folder/temp/blank"
hadoop fs -touchz "$BLANK"

#define a function to check hdfs files
function check_hdfs_files(){

#run an hdfs command to see whether anything matches the pattern
hadoop fs -ls $1 &>/dev/null

#if nothing matched, point the named variable at the blank file
if [ $? -ne 0 ]
then
eval "$2=\"$BLANK\""
echo "can't find any files, use blank file instead"
fi
}

D0=`date --date "now" +"/your folder/%Y/%m/%d/${APPNAME}-${TENANT}*"`
D1=`date --date "now -1 day" +"/your folder/%Y/%m/%d/$APPNAME-$TENANT*"`

#check whether the files exist
check_hdfs_files $D0 "D0"
check_hdfs_files $D1 "D1"


hadoop thrift client


http://code.google.com/p/hadoop-sharp/
Doesn't look like it cuts it; pass.

http://wiki.apache.org/hadoop/HDFS-APIs
http://wiki.apache.org/hadoop/MountableHDFS
http://wiki.apache.org/hadoop/Hbase/Stargate
http://hadoop.apache.org/hdfs/docs/r0.21.0/hdfsproxy.html

None of these are much use either, so let's go with Thrift. Looking at the svn tree, there are ready-made bindings for Cocoa and the like, but why is there no C#? Faint.
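(Bindings can also be generated directly with the Thrift compiler; a sketch, where the location of the HDFS IDL inside the Hadoop source tree is an assumption:)

thrift --gen csharp src/contrib/thriftfs/if/hadoopfs.thrift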

Hive Installation Tips


Installing Hive

Download:
http://hive.apache.org/releases.html
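(The excerpt stops here; for reference, a minimal install sketch from that era, where the version number is an assumption and a configured Hadoop is taken as given:)

tar xzf hive-0.7.0.tar.gz
export HIVE_HOME=$PWD/hive-0.7.0
export PATH=$HIVE_HOME/bin:$PATH
hive    # launches the CLI; assumes HADOOP_HOME points at a working Hadoop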