Recording Life
Tag: Hadoop

How to run a Hadoop streaming job over Brisk

<Category: Hadoop>
/usr/local/brisk-1.0/bin/brisk hadoop jar /usr/local/brisk-1.0/resources/hadoop/hadoop-streaming-0.20.203.1-brisk1-beta2.jar \
-file /tmp/testmr/mapper.py \
-file /tmp/testmr/reducer.py \
-reducer /tmp/testmr/reducer.py \
-mapper /tmp/testmr/mapper.py \
-input /test_input/test_userId -output /test_output1

-----
[root@platformD testmr]# ./job.sh
rmr: cannot remove /_output: No such file or directory.
File: /tmp/testmr/-Dbrisk.job.tracker=10.129.6.36:8012 does not exist, or is not readable

/usr/local/brisk-1.0/bin/brisk hadoop jar /hadoop-0.21.0-streaming.jar \
-file /tmp/testmr/mapper.py -mapper /tmp/testmr/mapper.py \
-file /tmp/testmr/reducer.py -reducer /tmp/testmr/reducer.py \
-input /test_input/test_userId  -output /test_output
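
The error above, where -Dbrisk.job.tracker=... is reported as an unreadable file, suggests the -D option was placed among the streaming options. Hadoop streaming expects generic options such as -D directly after the jar, before -file, -mapper and the rest. The following is a minimal sketch under that assumption: build_streaming_cmd is a hypothetical helper that only prints the command, and the paths and job tracker address are the ones shown in this post.

```shell
# Print a streaming command with the generic -D option placed directly
# after the jar, ahead of the streaming-specific options.
# build_streaming_cmd is a hypothetical helper; it prints, it does not run.
BRISK=/usr/local/brisk-1.0/bin/brisk
STREAMING_JAR=/usr/local/brisk-1.0/resources/hadoop/hadoop-streaming-0.20.203.1-brisk1-beta2.jar

build_streaming_cmd() {
    # $1 = job tracker host:port, $2 = HDFS input, $3 = HDFS output
    printf '%s hadoop jar %s' "$BRISK" "$STREAMING_JAR"
    printf ' -Dbrisk.job.tracker=%s' "$1"
    printf ' -file /tmp/testmr/mapper.py -mapper /tmp/testmr/mapper.py'
    printf ' -file /tmp/testmr/reducer.py -reducer /tmp/testmr/reducer.py'
    printf ' -input %s -output %s\n' "$2" "$3"
}

build_streaming_cmd 10.129.6.36:8012 /test_input/test_userId /test_output1
```

Once the printed command looks right, it can be pasted into job.sh or run directly.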


Brisk debugging and deployment: a complete record

<Category: cassandra, Hadoop, nosql>

Notes from a quick test of Brisk.
Reference links:

http://www..com/docs/0.8//about_pig


Installing DataStax Brisk

<Category: cassandra>

https://github.com/riptano//archives/1)

wget https://github.com/downloads/riptano/brisk/brisk-1.0~beta2-bin.tar.gz

// The archive bundles all of the components: Brisk 1.0, Pig,

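Alternatively, install from packages.
On Red Hat or CentOS:
Step one: install EPEL (Extra Packages for Enterprise Linux), which provides packages that Brisk depends on, such as jna and jpackage-utils.
If you are not sure whether EPEL is installed, check for the files epel.repo and epel-testing.repo under /etc/yum.repos.d.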

rpm -Uvh http://download.fedora.redhat.com/pub/epel/5/i386/epel-release-5-4.noarch.rpm

If you see the warning "RPM-GPG-KEY-EPEL key not being found", you can ignore it or download the key here: https://fedoraproject.org/keys
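
The EPEL check described above can be scripted. This is a minimal sketch: epel_configured is a hypothetical helper, and it only looks for the two repo files named in this post.

```shell
# Report whether the EPEL repo files are present in a yum repo directory.
# epel_configured is a hypothetical helper; the directory defaults to
# /etc/yum.repos.d and can be overridden for testing.
epel_configured() {
    repo_dir=${1:-/etc/yum.repos.d}
    [ -f "$repo_dir/epel.repo" ] && [ -f "$repo_dir/epel-testing.repo" ]
}

if epel_configured; then
    echo "EPEL already configured"
else
    echo "EPEL not found; install the epel-release package first"
fi
```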

OK, now install Brisk itself.

Add the repository:

vi /etc/yum.repos.d/datastax.repo

Replace the distribution part of baseurl with the value for your own system; it is either EL or Fedora.

[datastax]
name= DataStax Repo for Apache Cassandra
baseurl=http://rpm.datastax.com//$releasever
enabled=1
gpgcheck=0

After the substitution, the repo file looks like this:

[datastax]
name= DataStax Repo for Apache Cassandra
baseurl=http://rpm.datastax.com/EL/6
enabled=1
gpgcheck=0
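
The substitution can also be done by a small generator. This is only a sketch: datastax_repo is a hypothetical helper, and the URL layout is copied from the finished example above (distribution, then release).

```shell
# Print a datastax.repo definition for a given distribution and release.
# datastax_repo is a hypothetical helper; "EL" and "6" match the example above.
datastax_repo() {
    distro=$1    # EL or Fedora
    release=$2   # e.g. 6
    cat <<EOF
[datastax]
name= DataStax Repo for Apache Cassandra
baseurl=http://rpm.datastax.com/$distro/$release
enabled=1
gpgcheck=0
EOF
}

datastax_repo EL 6
```

To install the result, redirect the output to /etc/yum.repos.d/datastax.repo as root.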

Install:

yum install brisk-full
yum install brisk-demos

On Debian:
Edit the file /etc/apt/sources.list

vi /etc/apt/sources.list

The distribution can be lenny, lucid, maverick or squeeze:

 deb http://debian.datastax.com/  main

On Debian 5.0, use the following:

 deb http://backports.debian.org/debian-backports lenny-backports main

Add the DataStax key:

 wget -O - http://debian.datastax.com/debian/repo_key | sudo apt-key add -

Install:

 sudo aptitude update
 sudo aptitude install brisk-full
 sudo aptitude install brisk-demos


Hadoop and MapReduce: Big Data Analytics [gartner]

<Category: Hadoop>

Bookmarked. Download link: http://dl.medcl.com/get.php?id=29&path=books%2Fgartner%2CHadoop+and+MapReduce+Big+Data+Analytics.7z


Hive Derby lock and directory permission errors

<Category: Hadoop>

FAILED: Error in metadata: javax.jdo.JDOFatalDataStoreException: Cannot get a connection, pool error Could not create a validated object, cause: A read-only user or a user in a read-only database is not permitted to disable read-only mode on a connection.
NestedThrowables:
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Could not create a validated object, cause: A read-only user or a user in a read-only database is not permitted to disable read-only mode on a connection.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
Hive history file=/tmp/dev/hive_job_log_dev_201107062337_381665684.txt
FAILED: Error in semantic analysis: line 1:83 Exception while processing raw_daily_stats_table: Unable to fetch table raw_daily_stats_table

Check the Hive configuration file /etc/hive/conf/hive-default.xml and find where the warehouse data is stored:

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
  <description>location of default database for the warehouse</description>
</property>

Browsing that HDFS directory shows:
/user/hive/warehouse

raw_daily_pagecounts_table	dir				2011-03-28 15:39	rwxr-xr-x	dev	supergroup
raw_daily_stats_table	dir				2011-07-06 23:03	rwxr-xr-x	root	supergroup

The raw_daily_stats_table directory has ended up owned by root, but I run Hive as dev.
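
Ownership mismatches like this one can be spotted by filtering a listing. The sketch below assumes the usual eight-column `hadoop fs -ls` output (permissions, replication, owner, group, size, date, time, path), which differs from the web view pasted above; not_owned_by is a hypothetical helper, and the sample lines are modelled on the directory in this post.

```shell
# Print paths from `hadoop fs -ls` output that are not owned by a given user.
# Assumes the 8-column listing format; shorter lines such as the
# "Found N items" header are skipped. not_owned_by is a hypothetical helper.
not_owned_by() {
    awk -v owner="$1" 'NF >= 8 && $3 != owner { print $8 }'
}

# Sample listing modelled on the warehouse directory shown above.
printf '%s\n' \
  'Found 2 items' \
  'drwxr-xr-x   - dev  supergroup 0 2011-03-28 15:39 /user/hive/warehouse/raw_daily_pagecounts_table' \
  'drwxr-xr-x   - root supergroup 0 2011-07-06 23:03 /user/hive/warehouse/raw_daily_stats_table' |
  not_owned_by dev
```

Against a live cluster the same filter would be fed from `hadoop fs -ls /user/hive/warehouse`.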

Run:

hadoop fs -chown -R dev:dev /user/hive/warehouse/raw_daily_stats_table

The same error is still reported. Good grief.

FAILED: Error in metadata: javax.jdo.JDOFatalDataStoreException: Cannot get a connection, pool error Could not create a validated object, cause: A read-only user or a user in a read-only database is not permitted to disable read-only mode on a connection.
NestedThrowables:
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Could not create a validated object, cause: A read-only user or a user in a read-only database is not permitted to disable read-only mode on a connection.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

Opening the configuration file /etc/hive/conf/hive-site.xml reveals the following node:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/var/lib/hive/metastore/metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

Then go to that directory:

[dev@platformB metastore]$ ls -al
total 16
drwxrwxrwt 4 root root 4096 Mar 28 15:25 .
drwxr-xr-x 3 root root 4096 Mar 28 15:22 ..
drwxrwxr-x 5 dev  dev  4096 Jul  6 23:51 metastore_db
drwxrwxrwt 3 root root 4096 Mar 28 15:22 scripts
[dev@platformB metastore]$ cd metastore_db/
[dev@platformB metastore_db]$ ls
dbex.lck  db.lck  log  seg0  service.properties  tmp
[dev@platformB metastore_db]$ ls -al
total 32
drwxrwxr-x 5 dev  dev  4096 Jul  6 23:51 .
drwxrwxrwt 4 root root 4096 Mar 28 15:25 ..
-rw-rw-r-- 1 dev  dev     4 Jul  6 23:03 dbex.lck
-rw-r--r-- 1 root root   38 Jul  6 23:03 db.lck
drwxrwxr-x 2 dev  dev  4096 Jul  6 15:22 log
drwxrwxr-x 2 dev  dev  4096 Apr  6 13:04 seg0
-rw-rw-r-- 1 dev  dev   860 Mar 28 15:25 service.properties
drwxrwxr-x 2 dev  dev  4096 Jul  6 23:51 tmp
[dev@platformB metastore_db]$

db.lck is owned by root, so the dev user cannot take over the lock. Delete both db.lck and dbex.lck.

Rerun the Hadoop scripts: OK.
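
The listing above shows the giveaway: a lock file owned by a different user than the one running Hive. A minimal sketch for spotting that situation follows; foreign_derby_locks is a hypothetical helper, and it only lists files, it does not delete anything.

```shell
# List Derby lock files in a metastore directory that are not owned by the
# user who runs Hive. foreign_derby_locks is a hypothetical helper.
foreign_derby_locks() {
    db_dir=$1      # e.g. /var/lib/hive/metastore/metastore_db
    hive_user=$2   # user name or numeric uid, e.g. dev
    find "$db_dir" -maxdepth 1 -type f -name '*.lck' ! -user "$hive_user"
}

# Example call (commented out because the directory is machine-specific):
# foreign_derby_locks /var/lib/hive/metastore/metastore_db dev
```

Before removing anything it reports, it is worth confirming that no other Hive or Derby process still has the database open, since a live process would legitimately hold these locks.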


Trending topics: handling dates and empty directories

<Category: Hadoop, Linux>

 

First count the files in the Hadoop directory, then decide whether to add that directory to the job input:
[dev@platformB dailyrawdata]$ hadoop fs -ls / | wc -l
3
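
One caveat with counting lines this way: `hadoop fs -ls` normally prints a "Found N items" header first, so the raw `wc -l` figure is likely one higher than the number of entries. That is an assumption about the listing format of this Hadoop version. The sketch below counts only the entries; count_hdfs_entries is a hypothetical helper that reads the listing on stdin.

```shell
# Count entries in `hadoop fs -ls` output read from stdin, ignoring the
# "Found N items" header line. count_hdfs_entries is a hypothetical helper.
count_hdfs_entries() {
    awk '!/^Found [0-9]+ items$/ { n++ } END { print n + 0 }'
}

# Sample listing with a header and two entries.
printf '%s\n' \
  'Found 2 items' \
  'drwxr-xr-x   - dev supergroup 0 2011-06-20 10:00 /a' \
  'drwxr-xr-x   - dev supergroup 0 2011-06-20 10:00 /b' |
  count_hdfs_entries
```

On a cluster this would be used as `hadoop fs -ls / | count_hdfs_entries`.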

Ways to compute dates:
[dev@platformB dailyrawdata]$ lastdate=20110619
[dev@platformB dailyrawdata]$ echo $lastdate
20110619
[dev@platformB dailyrawdata]$ date --date "$lastdate +1 day" +"%Y%m%d"
20110620

[dev@platformB dailyrawdata]$ echo D9=`date --date "now -20 day" +"%Y%m%d"`
D9=20110530

 

[dev@platformB dailyrawdata]$ D1=`date --date "now" +"%Y/%m/%d"`
[dev@platformB dailyrawdata]$ echo $D1
2011/06/20

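Note: there must be no space after the equals sign. With a space, the shell tries to run the command output as a command, as shown below: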

[dev@platformB dailyrawdata]$ D1= `date --date "now" +"%Y/%m/%d"`
-bash: 2011/06/20: No such file or directory

 

Copy today's files to the target directory:

DAYSTR=`date --date "now" +"%Y/%m/%d"`

hadoop fs -copyFromLocal dailyrawdata/* /trendingtopics/data/raw/$DAYSTR
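
The date arithmetic above can be wrapped in two small functions. This is a sketch that assumes GNU date (as used throughout this post); next_day and hdfs_day_dir are hypothetical helper names.

```shell
# Date helpers built on GNU date (assumes GNU coreutils).
# next_day and hdfs_day_dir are hypothetical helper names.

# next_day YYYYMMDD -> the following day in the same format
next_day() {
    date --date "$1 +1 day" +"%Y%m%d"
}

# hdfs_day_dir BASE [DATE-EXPRESSION] -> BASE/YYYY/MM/DD (defaults to today)
hdfs_day_dir() {
    printf '%s/%s\n' "$1" "$(date --date "${2:-now}" +"%Y/%m/%d")"
}

next_day 20110619
hdfs_day_dir /trendingtopics/data/raw 20110620
```

The second helper produces the same kind of path as the DAYSTR example, so its output can be used directly as the copyFromLocal target.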

 

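But wait: when the directory is empty, a Hadoop streaming job throws an exception because no file matches the input pattern you specified, and the job fails.

I searched for a long time without finding a good solution (if you know a better one, please do tell me), so for now I check whether the directory is empty and, if it is, point the input at an empty file instead.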

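# create an empty placeholder file on HDFS
BLANK="/your folder/temp/blank"
hadoop fs -touchz "$BLANK"

# check whether any HDFS file matches a pattern
# $1 = HDFS path or glob, $2 = name of the variable to overwrite
function check_hdfs_files(){
    # hadoop fs -ls exits non-zero when nothing matches
    hadoop fs -ls "$1" &>/dev/null
    local status=$?

    if [ $status -ne 0 ]
    then
        # nothing matched: point the named variable at the blank file
        eval "$2=\$BLANK"
        echo "can't find any files, use blank file instead"
    fi

    # report the listing status instead of the status of the if block
    return $status
}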

 

--

D0=`date --date "now" +"/your folder/%Y/%m/%d/${APPNAME}-${TENANT}*"`
D1=`date --date "now -1 day" +"/your folder/%Y/%m/%d/$APPNAME-$TENANT*"`

# check that the files exist; fall back to the blank file otherwise
check_hdfs_files "$D0" "D0"
check_hdfs_files "$D1" "D1"
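
The same fallback can be written without eval by printing the chosen path and capturing it. This is a sketch, not the approach used above: pick_input and HDFS_LS are hypothetical names, and HDFS_LS exists only so the function can be tried without a cluster.

```shell
# Choose an input path for a streaming job: the pattern itself when the
# listing command succeeds, otherwise a fallback (blank) file.
# pick_input and HDFS_LS are hypothetical names; HDFS_LS defaults to
# "hadoop fs -ls" and can be overridden for a dry run.
pick_input() {
    pattern=$1
    fallback=$2
    if ${HDFS_LS:-hadoop fs -ls} "$pattern" >/dev/null 2>&1; then
        printf '%s\n' "$pattern"
    else
        printf '%s\n' "$fallback"
    fi
}

# Dry run without a cluster: "false" simulates a pattern that matches nothing.
HDFS_LS=false pick_input '/data/2011/06/20/app-*' /temp/blank
```

In a job script this would be used as `D0=$(pick_input "$D0" "$BLANK")` before passing "$D0" to -input.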


hadoop thrift client

<Category: Hadoop>

http://code.google.com/p/-sharp/

Doesn't look up to the job; pass.

http://wiki.apache.org/hadoop/-APIs

http://wiki.apache.org/hadoop/MountableHDFS

http://wiki.apache.org/hadoop/Hbase/Stargate

http://hadoop.apache.org/hdfs/docs/r0.21.0/hdfsproxy.html

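None of these is good enough, so Thrift it is. I looked through the svn tree: bindings for Cocoa and the like already exist, so why is there none for C#? Sigh.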

Hive installation tips

<Category: Hadoop>

Installing Hive

Download link:

http://.apache.org/releases.html


Setting up trendingtopics

<Category: Grapevine>

https://github.com/datawrangling/trendingtopics
https://github.com/datawrangling/spatialanalytics

Steps for setting up trendingtopics.

Preparing the environment

sudo apt-get install ruby
sudo gem install rails --include-dependencies
 
/home/cloudera/.gem/ruby/1.8/bin
git clone git://github.com/datawrangling/trendingtopics.git

Configuration files

cd trendingtopics
cp config/config.yml.example  config/config.yml
cp config/database.yml.example  config/database.yml

Install

rake gems:install

If you get the error: undefined local variable or method `version_requirements'
vi config/environment.rb
and add the following at the top:

if Gem::VERSION >= "1.3.6"
    module Rails
        class GemDependency
            def requirement
                r = super
                (r == Gem::Requirement.default) ? nil : r
            end
        end
    end
end

Install the MySQL client and MySQL

// unpack the MySQL source package
./configure
//./configure --prefix=/usr/local/mysql
make install  
 
// on the error 'sys/ttydefaults.h: No such file or directory', see:
//http://phaseshiftllc.com/archives/2008/10/26/installing-mysql-gem-on-windows-cygwin-for-rails
//make distclean
//./configure --without-readline CFLAGS=-O2
//./configure --prefix /usr/local/mysql --without-server --without-readline --without-libedit CFLAGS=-O2 CXXFLAGS=-O2
//make install
//gem install mysql
 
cp support-files/my-medium.cnf /etc/my.cnf
cd /usr/local/mysql
 
vi /etc/my.cnf
under [client], add protocol=TCP
//ref:http://www.phpvim.net/os/windows/build-mysql-client-on-cygwin.html
 
mysql -h localhost -u root
 
gem install mysql --include-dependencies
 
mysqld_safe --user=mysql &
 
mysql -h localhost -u root -p

Configure the database connection (config/database.yml)

replace
 socket: /tmp/mysql.sock
with
  username: root
  password: 555555
  host: localhost

Create the database

rake db:create
rake db:migrate

Generate 100 articles as demo data

rake db:develop
// start the server
script/server
 
// it fails because gems are missing; run the following
rake gems:install
 
gem sources -a http://gems.github.com
gem install jpignata-bossman 
 
// then run it again
script/server

Once the server is up, browse to http://localhost:3000/

$ script/server
=> Booting WEBrick
=> Rails 2.3.2 application starting on http://0.0.0.0:3000
=> Call with -d to detach
=> Ctrl-C to shutdown server
[2011-03-24 13:50:11] INFO  WEBrick 1.3.1
[2011-03-24 13:50:11] INFO  ruby 1.8.7 (2008-08-11) [i386-cygwin]
[2011-03-24 13:50:11] INFO  WEBrick::HTTPServer#start: pid=4760 port=3000

Errors:

242093716 [main] bash 4404 exception::handle: Exception: STATUS_ACCESS_VIOLATION
242095173 [main] bash 4404 open_stackdumpfile: Dumping stack trace to bash.exe.stackdump
242305333 [main] bash 2116 exception::handle: Exception: STATUS_ACCESS_VIOLATION
242306088 [main] bash 2116 open_stackdumpfile: Dumping stack trace to bash.exe.stackdump
242617570 [main] bash 4032 exception::handle: Exception: STATUS_ACCESS_VIOLATION
242619190 [main] bash 4032 open_stackdumpfile: Dumping stack trace to bash.exe.stackdump
243121910 [main] bash 3596 exception::handle: Exception: STATUS_ACCESS_VIOLATION
243123323 [main] bash 3596 open_stackdumpfile: Dumping stack trace to bash.exe.stackdump
243458891 [main] bash 4968 fork: child -1 - died waiting for longjmp before initialization, retry 0, exit code 0x600, errno 11
bash: fork: Resource temporarily unavailable
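// Cause: temp was located on a virtual disk that Cygwin did not have sufficient permission to access
// If the cause is something else, try the following: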
rebaseall: only ash processes are allowed during rebasing
    Exit all Cygwin processes and stop all Cygwin services.
    Execute ash from Start/Run... or a cmd or command window.
    Execute '/bin/rebaseall' from ash.
from:http://cygwin.com//cygwin/2005-09/msg00919.html

Create the table:

CREATE TABLE raw_daily_stats_table1 (redirect_title STRING, dates STRING, pageviews STRING, total_pageviews BIGINT, monthly_trend DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

Load the data:

LOAD DATA INPATH '/home/dev/finalresult-a' INTO TABLE raw_daily_stats_table;

// The path is an HDFS path; the one above corresponds to hdfs://platformB/home/dev/finalresult-a

hive> LOAD DATA INPATH '/home/dev/finalresult-a' INTO TABLE raw_daily_stats_table;
Loading data to table raw_daily_stats_table
OK
Time taken: 4.927 seconds

If the load reports a failure, check HDFS: you will find a new file named after yours with _copy_1 appended. Load that file instead and it works.

hive> show tables
    > ;
FAILED: Error in metadata: javax.jdo.JDOFatalDataStoreException: Failed to start database '/var/lib/hive/metastore/metastore_db', see the next exception for details.
NestedThrowables:
java.sql.SQLException: Failed to start database '/var/lib/hive/metastore/metastore_db', see the next exception for details.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
hive>

cat derby.log

============= begin nested exception, level (3) ===========
XSDB6: Another instance of Derby may have already booted the database /var/lib/hive/metastore/metastore_db.
    at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
    at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSLockOnDB(Unknown Source)
    at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.getJBMSLockOnDB(Unknown Source)
    at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.boot(Unknown Source)
    at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
    at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
    at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
    at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)

It turns out that an earlier abnormal exit left the previous process that was accessing Derby still running. Embedded Derby is file-based storage that only one process can open at a time, so the new session is locked out. For production, MySQL is clearly the way to go. Open the configuration file hive-default.xml:

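<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive_remote/warehouse</value>
</property>

<property>
  <name>hive.metastore.local</name>
  <value>true</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost/hive_remote?createDatabaseIfNotExist=true</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>dandan</value>
</property>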
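
The metastore settings above are name/value pairs in Hive's property XML format, the same shape as the hive.metastore.warehouse.dir node quoted in the Derby lock post. As a rough sketch, such blocks can be generated from the shell; xml_escape and hive_property are hypothetical helpers, and the escaping matters once a JDBC URL carries more than one parameter joined by an ampersand.

```shell
# Emit Hive <property> blocks from name/value pairs.
# xml_escape and hive_property are hypothetical helpers.

# Escape the characters that are special inside XML text.
xml_escape() {
    printf '%s' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# hive_property NAME VALUE -> one <property> element on stdout
hive_property() {
    printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' \
        "$(xml_escape "$1")" "$(xml_escape "$2")"
}

hive_property javax.jdo.option.ConnectionURL \
    'jdbc:mysql://localhost/hive_remote?createDatabaseIfNotExist=true'
hive_property javax.jdo.option.ConnectionDriverName com.mysql.jdbc.Driver
```

The output is meant to be pasted inside the configuration element of the Hive config file being edited.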

Hive queries and sorting:

select * from raw_daily_stats_table sort by monthly_trend;
select * from raw_daily_stats_table sort by monthly_trend desc limit 10;

http://www.fuzhijie.me/?p=377
http://wiki.apache.org/hadoop/Hive/AdminManual/MetastoreAdmin


Mahout slide deck collection

<Category: Mahout>

A collection of Mahout-related slide decks.