<Date: 2011-09-13>
<Author: medcl>
<Category: Hadoop>
/usr/local/brisk-1.0/bin/brisk hadoop jar /usr/local/brisk-1.0/resources/hadoop/hadoop-streaming-0.20.203.1-brisk1-beta2.jar \
-file /tmp/testmr/mapper.py \
-file /tmp/testmr/reducer.py \
-reducer /tmp/testmr/reducer.py \
-mapper /tmp/testmr/mapper.py \
-input /test_input/test_userId -output /test_output1
--error---
[root@platformD testmr]# ./job.sh
rmr: cannot remove /test_output: No such file or directory.
File: /tmp/testmr/-Dbrisk.job.tracker=10.129.6.36:8012 does not exist, or is not readable
/usr/local/brisk-1.0/bin/brisk hadoop jar /hadoop-0.21.0-streaming.jar \
-file /tmp/testmr/mapper.py -mapper /tmp/testmr/mapper.py \
-file /tmp/testmr/reducer.py -reducer /tmp/testmr/reducer.py \
-input /test_input/test_userId -output /test_output

From: how 2 run hadoop streaming job over brisk
<Date: 2011-09-06>
<Author: medcl>
<Category: cassandra, Hadoop, nosql>
Notes from a quick test of Brisk.
Reference links:
http://www.datastax.com/docs/0.8/brisk/about_pig

From: Complete record of debugging and deploying Brisk
<Date: 2011-09-05>
<Author: medcl>
<Category: cassandra>
https://github.com/riptano/brisk/archives/brisk1
wget https://github.com/downloads/riptano/brisk/brisk-1.0~beta2-bin.tar.gz
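// The tarball contains all the components: Brisk 1.0, Pig, Hive, Hadoop, Cassandra
Alternatively, install from packages.
On Red Hat or CentOS:
Step 1: install EPEL (Extra Packages for Enterprise Linux), which provides the packages Brisk depends on, such as jna and jpackage-utils.
If you are not sure whether EPEL is installed, look for the epel.repo and epel-testing.repo files under /etc/yum.repos.d.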
rpm -Uvh http://download.fedora.redhat.com/pub/epel/5/i386/epel-release-5-4.noarch.rpm
If you see the warning "RPM-GPG-KEY-EPEL key not being found", you can ignore it or download the key from https://fedoraproject.org/keys
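The EPEL check described above could be scripted roughly as follows. This is only a sketch: epel_installed is a hypothetical helper name, and the optional directory argument is an assumption added so the check can be tried against a scratch directory.

```shell
# Sketch: report whether both EPEL repo files are present.
# epel_installed is a hypothetical helper, not part of yum or EPEL.
epel_installed() {
  repo_dir="${1:-/etc/yum.repos.d}"
  [ -f "$repo_dir/epel.repo" ] && [ -f "$repo_dir/epel-testing.repo" ]
}

if epel_installed; then
  echo "EPEL already configured"
else
  echo "EPEL not found; install the epel-release package first"
fi
```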
OK, now install Brisk itself.
Add the repository:
vi /etc/yum.repos.d/datastax.repo
Replace the distribution part of baseurl with the one for your own system, either EL or Fedora:
[datastax]
name= DataStax Repo for Apache Cassandra
baseurl=http://rpm.datastax.com//$releasever
enabled=1
gpgcheck=0
The repo file after the substitution:
[datastax]
name= DataStax Repo for Apache Cassandra
baseurl=http://rpm.datastax.com/EL/6
enabled=1
gpgcheck=0
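As a rough sketch, the substituted repo file could be generated from a small helper instead of being edited by hand. make_datastax_repo is a hypothetical name, and the assumption is that the baseurl always has the form shown above (distribution family, then release number).

```shell
# Sketch: print a datastax.repo body for a given distro family and release.
# make_datastax_repo is a hypothetical helper, not part of Brisk or yum.
make_datastax_repo() {
  distro="$1"    # EL or Fedora
  release="$2"   # release number, for example 6
  cat <<EOF
[datastax]
name=DataStax Repo for Apache Cassandra
baseurl=http://rpm.datastax.com/$distro/$release
enabled=1
gpgcheck=0
EOF
}

make_datastax_repo EL 6
```

To use it for real, redirect the output to /etc/yum.repos.d/datastax.repo as root.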
Install:
yum install brisk-full
yum install brisk-demos
On Debian:
Edit the file /etc/apt/sources.list
The distribution can be one of lenny, lucid, maverick or squeeze:
deb http://debian.datastax.com/ main
On Debian 5.0, use the following:
deb http://backports.debian.org/debian-backports lenny-backports main
Add the DataStax key:
wget -O - http://debian.datastax.com/debian/repo_key | sudo apt-key add -
Install:
sudo aptitude update
sudo aptitude install brisk-full
sudo aptitude install brisk-demos

From: Installing DataStax Brisk
<Date: 2011-07-06>
<Author: medcl>
<Category: Hadoop>
FAILED: Error in metadata: javax.jdo.JDOFatalDataStoreException: Cannot get a connection, pool error Could not create a validated object, cause: A read-only user or a user in a read-only database is not permitted to disable read-only mode on a connection.
NestedThrowables:
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Could not create a validated object, cause: A read-only user or a user in a read-only database is not permitted to disable read-only mode on a connection.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
Hive history file=/tmp/dev/hive_job_log_dev_201107062337_381665684.txt
FAILED: Error in semantic analysis: line 1:83 Exception while processing raw_daily_stats_table: Unable to fetch table raw_daily_stats_table
Check the Hive config file /etc/hive/conf/hive-default.xml to find where your warehouse data is stored:
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
<description>location of default database for the warehouse</description>
</property>
Listing that HDFS directory shows:
/user/hive/warehouse
raw_daily_pagecounts_table dir 2011-03-28 15:39 rwxr-xr-x dev supergroup
raw_daily_stats_table dir 2011-07-06 23:03 rwxr-xr-x root supergroup
The raw_daily_stats_table directory is now owned by root, but I run the job as dev,
so run:
hadoop fs -chown -R dev:dev /user/hive/warehouse/raw_daily_stats_table
But the same error still shows up:
FAILED: Error in metadata: javax.jdo.JDOFatalDataStoreException: Cannot get a connection, pool error Could not create a validated object, cause: A read-only user or a user in a read-only database is not permitted to disable read-only mode on a connection.
NestedThrowables:
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Could not create a validated object, cause: A read-only user or a user in a read-only database is not permitted to disable read-only mode on a connection.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
Opening the config file /etc/hive/conf/hive-site.xml reveals the following property:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=/var/lib/hive/metastore/metastore_db;create=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
Then go to that directory:
[dev@platformB metastore]$ ls -al
total 16
drwxrwxrwt 4 root root 4096 Mar 28 15:25 .
drwxr-xr-x 3 root root 4096 Mar 28 15:22 ..
drwxrwxr-x 5 dev dev 4096 Jul 6 23:51 metastore_db
drwxrwxrwt 3 root root 4096 Mar 28 15:22 scripts
[dev@platformB metastore]$ cd metastore_db/
[dev@platformB metastore_db]$ ls
dbex.lck db.lck log seg0 service.properties tmp
[dev@platformB metastore_db]$ ls -al
total 32
drwxrwxr-x 5 dev dev 4096 Jul 6 23:51 .
drwxrwxrwt 4 root root 4096 Mar 28 15:25 ..
-rw-rw-r-- 1 dev dev 4 Jul 6 23:03 dbex.lck
-rw-r--r-- 1 root root 38 Jul 6 23:03 db.lck
drwxrwxr-x 2 dev dev 4096 Jul 6 15:22 log
drwxrwxr-x 2 dev dev 4096 Apr 6 13:04 seg0
-rw-rw-r-- 1 dev dev 860 Mar 28 15:25 service.properties
drwxrwxr-x 2 dev dev 4096 Jul 6 23:51 tmp
[dev@platformB metastore_db]$
Delete db.lck and dbex.lck.
Run the Hadoop scripts again and everything works.
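The lock cleanup above might be wrapped in a small function like the one below. This is a sketch under assumptions: clear_derby_locks is a hypothetical helper, and it is only safe to run when no Hive or Derby process is still using the metastore.

```shell
# Sketch: remove leftover Derby lock files from a metastore directory.
# clear_derby_locks is a hypothetical helper; run it only when nothing
# else has the embedded Derby database open.
clear_derby_locks() {
  db_dir="$1"
  for lock in db.lck dbex.lck; do
    if [ -e "$db_dir/$lock" ]; then
      rm -f "$db_dir/$lock" && echo "removed $db_dir/$lock"
    fi
  done
}
```

With the paths from this post it would be called as clear_derby_locks /var/lib/hive/metastore/metastore_db.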

From: Hive Derby lock and directory permission errors
<Date: 2011-07-01>
<Author: medcl>
<Category: Hadoop, Linux>
First check how many files are in the Hadoop directory, then decide whether to add that directory to the input.
[dev@platformB dailyrawdata]$ hadoop fs -ls /trendingtopics |wc -l
3
How to compute dates:
[dev@platformB dailyrawdata]$ lastdate=20110619
[dev@platformB dailyrawdata]$ echo $lastdate
20110619
[dev@platformB dailyrawdata]$ echo `date --date "-d $lastdate + 1day" +"%Y%m%d" `
20110620
[dev@platformB dailyrawdata]$ echo D9=`date --date "now -20 day" +"%Y%m%d"`
D9=20110530
[dev@platformB dailyrawdata]$ D1=`date --date "now" +"%Y/%m/%d"`
[dev@platformB dailyrawdata]$ echo $D1
2011/06/20
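Note: there must be no space after the equals sign. With a space, the shell tries to run the date output as a command: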
[dev@platformB dailyrawdata]$ D1= `date --date "now" +"%Y/%m/%d"`
-bash: 2011/06/20: No such file or directory
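The date arithmetic above can be collected into small helpers. This is a minimal sketch assuming GNU date (the -d option is not portable to BSD date); next_day and days_ago are hypothetical names, and TZ=UTC is added only to keep the day arithmetic independent of local daylight-saving changes.

```shell
# Sketch: date helpers built on GNU date.
# next_day and days_ago are hypothetical helper names.
next_day() {
  # $1 is a date in YYYYMMDD form; prints the following day in the same form
  TZ=UTC date -d "$1 +1 day" +"%Y%m%d"
}

days_ago() {
  # $1 is a number of days; prints that many days before today as YYYY/MM/DD
  date -d "now -$1 day" +"%Y/%m/%d"
}

lastdate=20110619
nextdate=$(next_day "$lastdate")   # no spaces around "="
echo "$nextdate"
```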
Copy today's files to the target directory:
DAYSTR=`date --date "now" +"%Y/%m/%d"`
hadoop fs -copyFromLocal dailyrawdata/* /trendingtopics/data/raw/$DAYSTR
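But wait: when the directory is empty, the Hadoop Streaming job cannot find any file matching the input pattern you specify, throws an exception, and the whole job fails.
I searched for a long time without finding a good solution (if anyone knows a better one, please share). For now I check whether the directory is empty first and, if it is, point the input at an empty file instead.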
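#create an empty placeholder file
BLANK="/your folder/temp/blank"
hadoop fs -touchz "$BLANK"
#define a function to check hdfs files:
#$1 is the path or pattern to look for, $2 is the name of the variable to update
function check_hdfs_files(){
#run hdfs command to check the files; it exits non-zero when nothing matches
hadoop fs -ls "$1" &>/dev/null
if [ $? -ne 0 ]
then
#point the named variable at the blank file (escaped so the path is not re-split)
eval "$2=\$BLANK"
echo "can't find any files, use blank file instead"
fi
return 0
}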
--
D0=`date --date "now" +"/your folder/%Y/%m/%d/${APPNAME}-${TENANT}*"`
D1=`date --date "now -1 day" +"/your folder/%Y/%m/%d/$APPNAME-$TENANT*"`
#check file exists (quoted so the local shell does not split or expand the pattern)
check_hdfs_files "$D0" "D0"
check_hdfs_files "$D1" "D1"
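The same fallback can also be written without eval. The sketch below is one possible variant, not the method used above: pick_input and LIST_CMD are hypothetical names, and it assumes, as the check above does, that hadoop fs -ls exits non-zero when nothing matches.

```shell
# Sketch: choose an input path without eval.
# LIST_CMD and pick_input are hypothetical names; LIST_CMD defaults to
# "hadoop fs -ls" and can be overridden for a dry run.
LIST_CMD="${LIST_CMD:-hadoop fs -ls}"

pick_input() {
  pattern="$1"   # HDFS path or glob to look for
  blank="$2"     # fallback: an empty file created with hadoop fs -touchz
  # $LIST_CMD is left unquoted on purpose so it splits into command and arguments
  if $LIST_CMD "$pattern" >/dev/null 2>&1; then
    printf '%s\n' "$pattern"
  else
    printf '%s\n' "$blank"
  fi
}
```

It would be used as D0=$(pick_input "$D0" "$BLANK"), which keeps the variable assignment in the caller.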

From: Trending topics: handling dates and empty directories
<Date: 2011-03-29>
<Author: medcl>
<Category: Hadoop>
http://code.google.com/p/hadoop-sharp/
Does not seem up to the job; pass.
http://wiki.apache.org/hadoop/HDFS-APIs
http://wiki.apache.org/hadoop/MountableHDFS
http://wiki.apache.org/hadoop/Hbase/Stargate
http://hadoop.apache.org/hdfs/docs/r0.21.0/hdfsproxy.html
None of these are up to it either, so Thrift it is. A look at the svn repository shows ready-made bindings for Cocoa and the like, so why is there none for C#?

From: hadoop thrift client
<Date: 2011-03-28>
<Author: medcl>
<Category: Hadoop>
Hive installation
Download page:
http://hive.apache.org/releases.html

From: Hive installation tips
<Date: 2011-03-08>
<Author: medcl>
<Category: Gossip>
https://github.com/datawrangling/trendingtopics
https://github.com/datawrangling/spatialanalytics
Steps for setting up trendingtopics.
Environment preparation:
sudo apt-get install ruby
sudo gem install rails --include-dependencies
/home/cloudera/.gem/ruby/1.8/bin
git clone git://github.com/datawrangling/trendingtopics.git
Configuration files:
cd trendingtopics
cp config/config.yml.example config/config.yml
cp config/database.yml.example config/database.yml
Install:
If you get the error: undefined local variable or method `version_requirements'
vi config/environment.rb
Add at the top of the file:
if Gem::VERSION >= "1.3.6"
module Rails
class GemDependency
def requirement
r = super
(r == Gem::Requirement.default) ? nil : r
end
end
end
end
Install the MySQL client and the mysql gem
// Unpack the MySQL source tarball
./configure
//./configure --prefix=/usr/local/mysql
make install
// Error: sys/ttydefaults.h: No such file or directory
//http://phaseshiftllc.com/archives/2008/10/26/installing-mysql-gem-on-windows-cygwin-for-rails
//make distclean
//./configure --without-readline CFLAGS=-O2
//./configure --prefix=/usr/local/mysql --without-server --without-readline --without-libedit CFLAGS=-O2 CXXFLAGS=-O2
//make install
//gem install mysql
cp support-files/my-medium.cnf /etc/my.cnf
cd /usr/local/mysql
vi /etc/my.cnf
Under [client], add protocol=TCP
//ref:http://www.phpvim.net/os/windows/build-mysql-client-on-cygwin.html
mysql -h localhost -u root
gem install mysql --include-dependencies
mysqld_safe --user=mysql &
mysql -h localhost -u root -p
Configure the database connection in config/database.yml:
replace
socket: /tmp/mysql.sock
with
username: root
password: 555555
host: localhost
Create the database:
rake db:create
rake db:migrate
Generate 100 articles as demo data
// Start the server
script/server
// It fails because gems are missing; run the following
rake gems:install
gem sources -a http://gems.github.com
gem install jpignata-bossman
// Then run it again
script/server
Once the server is up, visit http://localhost:3000/
$ script/server
=> Booting WEBrick
=> Rails 2.3.2 application starting on http://0.0.0.0:3000
=> Call with -d to detach
=> Ctrl-C to shutdown server
[2011-03-24 13:50:11] INFO WEBrick 1.3.1
[2011-03-24 13:50:11] INFO ruby 1.8.7 (2008-08-11) [i386-cygwin]
[2011-03-24 13:50:11] INFO WEBrick::HTTPServer#start: pid=4760 port=3000
Error:
242093716 [main] bash 4404 exception::handle: Exception: STATUS_ACCESS_VIOLATION
242095173 [main] bash 4404 open_stackdumpfile: Dumping stack trace to bash.exe.stackdump
242305333 [main] bash 2116 exception::handle: Exception: STATUS_ACCESS_VIOLATION
242306088 [main] bash 2116 open_stackdumpfile: Dumping stack trace to bash.exe.stackdump
242617570 [main] bash 4032 exception::handle: Exception: STATUS_ACCESS_VIOLATION
242619190 [main] bash 4032 open_stackdumpfile: Dumping stack trace to bash.exe.stackdump
243121910 [main] bash 3596 exception::handle: Exception: STATUS_ACCESS_VIOLATION
243123323 [main] bash 3596 open_stackdumpfile: Dumping stack trace to bash.exe.stackdump
243458891 [main] bash 4968 fork: child -1 - died waiting for longjmp before initialization, retry 0, exit code 0x600, errno 11
bash: fork: Resource temporarily unavailable
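// Cause: temp was on a virtual disk, and Cygwin did not have sufficient permission to access it
// If the cause is something else, try the following: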
rebaseall: only ash processes are allowed during rebasing
Exit all Cygwin processes and stop all Cygwin services.
Execute ash from Start/Run... or a cmd or command window.
Execute '/bin/rebaseall' from ash.
from:http://cygwin.com/ml/cygwin/2005-09/msg00919.html
Create the table:
CREATE TABLE raw_daily_stats_table1 (redirect_title STRING, dates STRING, pageviews STRING, total_pageviews BIGINT, monthly_trend DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
Load the data:
LOAD DATA INPATH '/home/dev/finalresult-a' INTO TABLE raw_daily_stats_table;
// The path is an HDFS path; the one above corresponds to hdfs://platformB/home/dev/finalresult-a
hive> LOAD DATA INPATH '/home/dev/finalresult-a' INTO TABLE raw_daily_stats_table;
Loading data to table raw_daily_stats_table
OK
Time taken: 4.927 seconds
If the load reports a failure, check your HDFS: you will find a file named after yours with _copy_1 appended. Load that file instead and it works.
hive> show tables
> ;
FAILED: Error in metadata: javax.jdo.JDOFatalDataStoreException: Failed to start database '/var/lib/hive/metastore/metastore_db', see the next exception for details.
NestedThrowables:
java.sql.SQLException: Failed to start database '/var/lib/hive/metastore/metastore_db', see the next exception for details.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
hive> cat derby.log
============= begin nested exception, level (3) ===========
ERROR XSDB6: Another instance of Derby may have already booted the database /var/lib/hive/metastore/metastore_db.
at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSLockOnDB(Unknown Source)
at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.getJBMSLockOnDB(Unknown Source)
at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.boot(Unknown Source)
at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)
It turns out that an abnormal exit left the earlier process that was accessing Derby still running. Derby is file-based storage and only one process can open it at a time, so you get the picture. For production, MySQL is clearly the way to go. Open the config file hive-default.xml:
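<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive_remote/warehouse</value>
</property>
<property>
<name>hive.metastore.local</name>
<value>true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/hive_remote?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>dandan</value>
</property>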
Hive queries and sorting:
select * from raw_daily_stats_table sort by monthly_trend;
select * from raw_daily_stats_table sort by monthly_trend desc limit 10;
http://www.fuzhijie.me/?p=377
http://wiki.apache.org/hadoop/Hive/AdminManual/MetastoreAdmin

From: Setting up trendingtopics
<Date: 2011-03-04>
<Author: medcl>
<Category: Mahout>