write down, forget

Handling Chinese in Python regular expressions

<Category: Python>

When matching Chinese, the regular-expression pattern and the target string must use the same encoding.

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 18: ordinal not in range(128)

print text raises the error above.
Explanation: the console output window emits ASCII (the default encoding on an English-language system is ASCII), while the string in the code is Unicode, so the error occurs on output.
Changing it to print(word.encode('utf8')) fixes it.
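A minimal sketch of the encoding point, using Python 3, where both the pattern and the target are unicode `str`, so their encodings agree by construction (the sample text is made up; under Python 2 the same idea means decoding both sides to `unicode` first):

```python
import re

# In Python 3, pattern and target are both unicode str, so the
# "same encoding on both sides" requirement holds automatically.
text = "Python正则的中文处理示例"

# \u4e00-\u9fa5 covers the common CJK Unified Ideographs
chinese_runs = re.findall(r"[\u4e00-\u9fa5]+", text)
print(chinese_runs)  # each element is one run of consecutive Chinese characters
```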

Source: Handling Chinese in Python regular expressions

hadoop thrift client

<Category: Hadoop>

http://code.google.com/p/hadoop-sharp/
Doesn't look promising; pass.

http://wiki.apache.org/hadoop/HDFS-APIs
http://wiki.apache.org/hadoop/MountableHDFS
http://wiki.apache.org/hadoop/Hbase/Stargate
http://hadoop.apache.org/hdfs/docs/r0.21.0/hdfsproxy.html

None of these are much good either, so Thrift it is. A look at the SVN repo shows ready-made clients for Cocoa and the like, but why is there no C# one? Faint.

Source: hadoop thrift client

Two Chinese word segmentation tools for Python

<Category: NLP>

http://code.google.com/p/pychseg/
A Python implementation of the MMSEG Chinese word segmentation algorithm: forward maximum matching plus several disambiguation rules.

It requires installing psyco, which is a bit of a hassle. Usage:
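This is not pychseg's actual API, but the forward-maximum-matching step that MMSEG builds on can be sketched in a few lines (the toy dictionary and function name below are illustrative):

```python
def fmm_segment(text, dictionary, max_len=4):
    """Toy forward maximum matching: at each position, take the longest
    dictionary word starting there; fall back to a single character."""
    result, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + size]
            if size == 1 or word in dictionary:
                result.append(word)
                i += size
                break
    return result

words = fmm_segment("中文分词算法", {"中文", "分词", "算法"})
print(words)
```

MMSEG adds disambiguation rules on top of this base step when several forward matches are possible.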

http://code.google.com/p/pymmseg-cpp/
https://github.com/pluskid/pymmseg-cpp/

Source: Two Chinese word segmentation tools for Python

Hive installation tips

<Category: Hadoop>

Hive installation

Download:
http://hive.apache.org/releases.html

Source: Hive installation tips

.NET AOP implementation

<Category: .NET>

http://www.uml.org.cn/net/201004213.asp

Source: .NET AOP implementation

Compiling and using netcat under Cygwin

<Category: Linux>

cd /tmp
wget http://ftp.de.debian.org/debian/pool/main/n/netcat/netcat_1.10.orig.tar.gz
wget http://ftp.de.debian.org/debian/pool/main/n/netcat/netcat_1.10-38.diff.gz

tar vxzf netcat_1.10.orig.tar.gz
cd netcat-1.10.orig/

zcat ../netcat_1.10-38.diff.gz | patch -p1

// vi Makefile, line 11
DFLAGS = -DDEBIAN_VERSION='"1.10-38"' -DGAPING_SECURITY_HOLE -DIP_TOS -DTELNET

mv nc.exe /bin/
mv nc.1 /usr/share/man/man1/

nc -h

$ nc -h
[v1.10-38]
connect to somewhere: nc [-options] hostname port[s] [ports] …
listen for inbound: nc -l -p port [-options] [hostname] [port]
options:
-c shell commands as `-e'; use /bin/sh to exec [dangerous!!]
-e filename program to exec after connect [dangerous!!]
-b allow broadcasts
-g gateway source-routing hop point[s], up to 8
-G num source-routing pointer: 4, 8, 12, …
-h this cruft
-i secs delay interval for lines sent, ports scanned
-k set keepalive option on socket
-l listen mode, for inbound connects
-n numeric-only IP addresses, no DNS
-o file hex dump of traffic
-p port local port number
-r randomize local and remote ports
-q secs quit after EOF on stdin and delay of secs
-s addr local source address
-T tos set Type Of Service
-t answer TELNET negotiation
-u UDP mode
-v verbose [use twice to be more verbose]
-w secs timeout for connects and final net reads
-z zero-I/O mode [used for scanning]
port numbers can be individual or ranges: lo-hi [inclusive];
hyphens in port names must be backslash escaped (e.g. 'ftp\-data').

TEST

// Connect to a specified port
nc -nvv 192.168.x.x 80

// Port forwarding with traffic capture
nc -l -p 1234 -c 'tee 1234.txt | nc 192.168.x.x 22 | tee ssh.txt'
Then log in with PuTTY to localhost port 1234.

// Port scanning
nc -v -n -z -w1 192.168.x.x 1-65535
nc -nvv -w2 -z 192.168.x.x 80-445
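The zero-I/O scan above can be approximated in a few lines of Python when nc is not at hand (the function name and host are illustrative; `connect_ex` returns 0 on a successful connect and sends no data):

```python
import socket

def scan_ports(host, ports, timeout=1.0):
    """Zero-I/O TCP scan, roughly what `nc -z -w1` does: attempt a
    connect on each port, record those that accept, transfer nothing."""
    open_ports = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            if s.connect_ex((host, port)) == 0:  # 0 means connected
                open_ports.append(port)
    return open_ports
```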

// Listen on a local port: a simple two-machine chat
HOSTA: nc -l -p 800
HOSTB: nc 192.168.x.x 800

Source: Compiling and using netcat under Cygwin

Gephi cluster deployment

<Category: Data Visualization>

A single machine's processing power only goes so far: analyzing 5M nodes already ate all 4 GB of memory, and I still have several million more nodes waiting to be processed. If a Gephi cluster could be deployed, could it support analyzing more nodes? The answer is yes: install the cluster plugin on Gephi and you're set.

But not everything can be distributed. First, the algorithm has to support distributed graph analysis and layout (distributed graph layouts). Happily, Gephi already supports a plugin for the OpenOrd layout, a new force-directed layout algorithm that is multi-core and parallel and is claimed to be the fastest rendering method at present; the more machines, the faster. Cool.

OpenOrd plugin: http://gephi.org/plugins/openord-layout/

Basic setup

First, install a 64-bit JDK. Then edit the configuration file C:\Program Files (x86)\Gephi-0.7\etc\gephi07beta.conf and change the JVM heap's initial and maximum sizes to match your machine.
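For reference, the line to change looks roughly like this, assuming the NetBeans-platform conf format that Gephi 0.7 uses (the key name and heap values below are examples, not taken from the original post):

```
# etc/gephi07beta.conf
# -J-Xms sets the JVM's initial heap, -J-Xmx its maximum
default_options="--branding gephi -J-Xms512m -J-Xmx3072m"
```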

Configuring OpenOrd
Actually I haven't found how to configure it yet; I searched everywhere without finding a word about it, and I don't know how to set up the etc/gephi07beta.clusters file either. Oh well, it will turn up eventually; let it rest for now.


gephi wiki: http://wiki.gephi.org/index.php/Main_Page
gephi dev: https://launchpad.net/gephi
gephi forum: http://forum.gephi.org

Source: Gephi cluster deployment

Custom node Renderers in flare

<Category: Data Visualization>


Source: Custom node Renderers in flare

Setting up trendingtopics

<Category: Gossip>

https://github.com/datawrangling/trendingtopics
https://github.com/datawrangling/spatialanalytics

Steps to set up trendingtopics.

Prepare the environment

Configuration files

Install

If you get the error: undefined local variable or method `version_requirements'
vi config/environment.rb
Add at the top:

Install the MySQL client and the mysql gem

Configure the database connection

Set up the database

Generate 100 articles as demo data

Once the server is up, visit http://localhost:3000/

Error:

Create the table:
CREATE TABLE raw_daily_stats_table (redirect_title STRING, dates STRING, pageviews STRING, total_pageviews BIGINT, monthly_trend DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

Load the data:
LOAD DATA INPATH '/home/dev/finalresult-a' INTO TABLE raw_daily_stats_table;
// This is an HDFS path; the path above corresponds to hdfs://platformB/home/dev/finalresult-a

If the load fails, check your HDFS: you will find that a file named <your file>_copy_1 was created; load that file instead and it works.

hive> show tables;
FAILED: Error in metadata: javax.jdo.JDOFatalDataStoreException: Failed to start database '/var/lib/hive/metastore/metastore_db', see the next exception for details.
NestedThrowables: java.sql.SQLException: Failed to start database '/var/lib/hive/metastore/metastore_db', see the next exception for details.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

$ cat derby.log
============= begin nested exception, level (3) ===========
ERROR XSDB6: Another instance of Derby may have already booted the database /var/lib/hive/metastore/metastore_db.
at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSLockOnDB(Unknown Source)
at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.getJBMSLockOnDB(Unknown Source)
at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.boot(Unknown Source)
at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)

It turns out a previous session had exited abnormally and its Derby process was still running. Derby is a file-based store that only one process can open at a time, so, you know. For a production environment, using MySQL as the metastore really is the way to go. Open the configuration file hive-default.xml.

Hive query and sorting:
select * from raw_daily_stats_table sort by monthly_trend;
select * from raw_daily_stats_table sort by monthly_trend desc limit 10;
http://www.fuzhijie.me/?p=377
http://wiki.apache.org/hadoop/Hive/AdminManual/MetastoreAdmin

Source: Setting up trendingtopics

Sending email with Python

<Category: Python>
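A minimal sketch of sending mail with Python's standard library (the addresses, server name, and helper function below are placeholders, not from the original post):

```python
import smtplib  # used for the actual send, shown commented out below
from email.mime.text import MIMEText
from email.header import Header

def build_message(sender, receiver, subject, body):
    """Build a UTF-8 plain-text message so Chinese subject/body survive."""
    msg = MIMEText(body, "plain", "utf-8")
    msg["Subject"] = Header(subject, "utf-8")
    msg["From"] = sender
    msg["To"] = receiver
    return msg

msg = build_message("me@example.com", "you@example.com", "测试", "你好")

# The actual send needs a reachable SMTP server, e.g.:
# with smtplib.SMTP("smtp.example.com", 25) as server:
#     server.sendmail("me@example.com", ["you@example.com"], msg.as_string())
```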

Source: Sending email with Python