a tiny data analyst

Tuesday, June 1, 2010

Prior and Posterior probability

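UserID  LabelID  Prior    Likelihood  Posterior
1       1        71/206   15/71       .073
1       2        27/206   2/27        .010
1       3        108/206  1/108       .005
2       1        71/206   21/71       .102
2       2        27/206   15/27       .073
2       3        108/206  7/108       .034
3       1        71/206   35/71       .170
3       2        27/206   0/27        0
3       3        108/206  100/108     .485
4       1        71/206   0/71        0
4       2        27/206   10/27       .049
4       3        108/206  0/108       0

Posterior = Prior * Likelihood(category)
71/206 * 15/71 = 15/206 ≈ .073

Strictly speaking, Prior * Likelihood is the unnormalized posterior (the joint probability of label and user); dividing by the sum over the three labels for a given user gives the normalized posterior.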
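
To make the arithmetic reproducible, here is a minimal Python sketch that recomputes the Posterior column from the counts in the table. The names `LABEL_TOTALS`, `USER_COUNTS`, `unnormalized_posterior`, and `normalized_posterior` are my own, chosen for illustration; the normalization step is an addition that the original notes do not perform.

```python
from fractions import Fraction

# Number of items per label, read off the Prior column (71 + 27 + 108 = 206).
LABEL_TOTALS = {1: 71, 2: 27, 3: 108}

# Items per user within each label, read off the Likelihood numerators.
USER_COUNTS = {
    1: {1: 15, 2: 2, 3: 1},
    2: {1: 21, 2: 15, 3: 7},
    3: {1: 35, 2: 0, 3: 100},
    4: {1: 0, 2: 10, 3: 0},
}


def unnormalized_posterior(user_id, label_id):
    """Prior * Likelihood, exactly as in the table."""
    total = sum(LABEL_TOTALS.values())
    prior = Fraction(LABEL_TOTALS[label_id], total)
    likelihood = Fraction(USER_COUNTS[user_id][label_id], LABEL_TOTALS[label_id])
    return prior * likelihood


def normalized_posterior(user_id):
    """Divide by the evidence (sum over labels) so the values sum to 1 per user."""
    scores = {label: unnormalized_posterior(user_id, label) for label in LABEL_TOTALS}
    evidence = sum(scores.values())
    return {label: score / evidence for label, score in scores.items()}


if __name__ == "__main__":
    for user in USER_COUNTS:
        for label in LABEL_TOTALS:
            value = unnormalized_posterior(user, label)
            print(user, label, value, round(float(value), 3))
```

Using `Fraction` keeps the intermediate values exact, so the only rounding happens when printing.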

in reference to: How I Would Use the Google Prediction API (To Find Your Musical Profile) | The Data Scientist

Wednesday, April 28, 2010

Taobao Mall's situation and positioning, from Zhang Yong, Taobao CFO and head of Taobao Mall

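1: Taobao Mall does B2C. It is like a fortified special economic zone: a merchant that wants to operate inside it must be a registered company, must meet certain qualification requirements, and is subject to stricter management rules and stricter service requirements.

2: Taobao Mall in numbers: 20 malls (categories) and 12,000 merchants. The draw: 190 million registered members, out of 385 million internet users in China.

3: How a merchant or company positions itself in Taobao Mall matters a great deal: "With 190 million users and 40 million visitors a day, how do you find the target users for your product brand, and how do you market to those target users specifically?"

* What data mining can do here: analyze the shopping needs of Taobao users and provide demand analysis to companies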

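4: A very important problem in e-commerce is the back end, that is, the e-commerce solution, for example order fulfillment. Example: a merchant running a promotion has to ship 40,000 packages in one day. "How do you design the process so that these packages are packed well and checked well, with none sent out wrongly, and how do you make sure the courier gets them to the consumer on time?"

5: How a company allocates resources strategically within Taobao Mall
6: Restocking
* Data mining: should be able to predict likely sales volume well, helping companies prepare stock ahead of time

7: Taobao Mall's annual revenue grew 500% in one year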

in reference to: Photos and text: speech by Taobao Mall head Zhang Yong (Internet, Tech Times, Sina.com)

Wednesday, April 21, 2010

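Installing Hive

Hive is a Hadoop subproject founded by Facebook; see the introduction written by a senior colleague on the Taobao data platform (淘宝数据平台) team.

I am recording the problems I ran into while installing Hive so that those who come after me can learn from them.

I started with the official tutorial: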
http://wiki.apache.org/hadoop/Hive/GettingStarted#Hive_introduction_videos_From_Cloudera

$ svn co http://svn.apache.org/repos/asf/hadoop/hive/trunk hive
$ cd hive
$ ant package
$ cd build/dist
$ ls
README.txt
bin/ (all the shell scripts)
lib/ (required jar files)
conf/ (configuration files)
examples/ (sample input and query files)

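During the ant step, however, an ivy:retrieve ..... message kept appearing. I guessed it needed to download something from the internet, and a closer look at the install output showed: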

[ivy:retrieve] downloading http://mirror.facebook.net/facebook/hive-deps/hadoop/core/hadoop-0.17.2.1/hadoop-0.17.2.1.tar.gz ...

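Behind the Great Chinese LAN, you want to download something from Facebook? Climb over the wall first.
Since I have SSH access, I used proxychains, naively assuming that wrapping the build in proxychains would take care of everything: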

proxychains ant package

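It did not work; the same problem appeared.
I don't know whether proxychains is simply not up to the job or whether there is some other configuration I did not think of.
In the end I gave up on that route. As the proverb goes, a tree dies when it is moved, but a person thrives.
Then I remembered that the Taobao data platform blog has installation steps (I will be interning at Taobao this summer, in this very department, so this is my senior colleagues' work):
Taobao data platform (淘宝数据平台)
And sure enough: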

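Downloading, configuring, and installing Hive
Please refer to the getting-started guide; here is only the most basic outline:

* Install and configure Hadoop.
* Install and configure a database (mysql, etc.).
* Get the Hive source or binaries. wget http://www.apache.org/dist/hadoop/hive/hive-0.5.0/hive-0.5.0-bin.tar.gz
* tar xzf hive-0.5.0-bin.tar.gz
* cd hive-0.5.0
* Configure how Hive accesses the database and how it accesses Hadoop.
* Run Hive.

When you see the Hive prompt 'Hive>', congratulations, you can begin your Hive journey.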

In the end I followed this method: downloaded the binary tarball, extracted it, and set $HADOOP_HOME.
Finally, done:

hive>>
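
Before launching Hive, it can help to check the two things the steps above depend on. The following is a small Python sketch of such a preflight check, not part of the original instructions. The function name `hive_preflight` is hypothetical, and the assumption that the launcher lives at `bin/hive` under the extracted directory is mine, based on the `bin/` directory listed earlier. It does not check the database configuration.

```python
import os
from pathlib import Path


def hive_preflight(hive_dir, env=None):
    """Return a list of problems; an empty list means the basics look fine.

    hive_dir: the extracted Hive directory (assumed to contain bin/hive).
    env: a mapping of environment variables; defaults to os.environ.
    """
    env = os.environ if env is None else env
    problems = []

    # Hive needs to know where Hadoop is installed.
    hadoop_home = env.get("HADOOP_HOME")
    if not hadoop_home:
        problems.append("HADOOP_HOME is not set")
    elif not Path(hadoop_home).is_dir():
        problems.append(f"HADOOP_HOME is not a directory: {hadoop_home}")

    # The launcher script is assumed to be bin/hive inside the Hive directory.
    launcher = Path(hive_dir) / "bin" / "hive"
    if not launcher.is_file():
        problems.append(f"Hive launcher not found: {launcher}")

    return problems


if __name__ == "__main__":
    for problem in hive_preflight("hive-0.5.0"):
        print("problem:", problem)
```

Passing `env` explicitly makes the check easy to try without touching the real shell environment.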

Wednesday, March 31, 2010

Able to "criticize" but not to "think": beware the "resentful-wife" turn (怨妇化) in online opinion

February 26, 2010, 02:27. Source: 侨报. Author: 南桥 (Nan Qiao)

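Wesleyan University president Michael Roth recently wrote an essay, "Beyond Critical Thinking," warning students not to become smug good-for-nothings who can criticize but cannot think. The term "critical thinking" goes back to a 1962 article by Robert H. Ennis in the Harvard Educational Review; from there it spread quickly and has been a much-pursued topic in education for many years.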

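When Ennis first proposed critical thinking, the emphasis was on "thinking." As Roth describes, however, many people use "criticism" to show off their own cleverness and have pushed "thinking" to the margins. In 2002 Ennis restated some of the characteristics of critical thinking, such as "being open-minded and well informed about the strengths and weaknesses of multiple options," "seeking verification from many sources," "judging information sources well," "identifying the conclusions, inferences, and underlying assumptions of an argument," "being able to form a reasonable position," and "asking good questions that clarify the essence of a problem." In short, he wants everyone to develop rigorous habits of thought, so that they are not easily fooled and do not simply repeat what others say.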

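Half a century on, Ennis is returning to an old theme and stressing once more that critical thinking should put the weight on "thinking"; the change in how information spreads is what makes this necessary. In 2009 the Chinese internet played a notably positive role in many public incidents, but the internet is no paradise: plenty of people jeer blindly and gawk blindly. Yu Jianrong, a professor at the Chinese Academy of Social Sciences, said in his analysis of "anger-venting incidents" that "since the arrival of the internet and mobile text messaging, present-day China no longer has authoritative information." The absence of authoritative information is not necessarily a bad thing; the danger is that wrong "authoritative information" emerges and dominates. What is happening now is precisely that the "authoritative information" that used to come from the government is being replaced by the "authoritative information" of online "opinion leaders." Online promoters can stir up waves and manufacture all kinds of pseudo-hot topics that netizens flock to, turning the internet into a place of strife.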

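Meanwhile, fans who worship online superstars willingly exile themselves into tunnel vision, watching the sky from the bottom of a well: instead of reading scholars' real insights, they gawk at the eating, drinking, and bodily functions of a few so-called "internet celebrities." Hu Yong, a professor at Peking University's School of Journalism and Communication, once pointed out that Chinese society as a whole is becoming increasingly infantilized; he was referring to the regulation of public opinion. In fact there is another kind of "infantilization": the "brainlessness" that comes from a lack of thought when taking in information and choosing information sources.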

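Because negative news and criticism easily cause a sensation, some people in the rough-and-tumble world of the internet who could have put their influence to good use have degenerated into people who negate for the sake of negating, such as some people overseas who oppose anything to do with China regardless of right or wrong. While negating, they offer no constructive suggestions, and so they turn into "resentful wives."

"Resentful-wife" "opinion leaders," plus an army of "brainless" fans: that is the biggest spectacle on the Chinese internet today.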

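Yu Jianrong, whom I mentioned above, always criticizes constructively, because his insights always come from investigation and research. Nowadays society casually bestows the title "intellectual" on someone on the grounds that they dare to speak out, without caring whether what they say has any quality. After seeing a few negative news items, people call all experts "砖家" (a mocking pun on the word for "expert") and all professors "叫兽" ("howling beasts," a pun on the word for "professor"). This anti-intellectual tendency is worrying.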

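That the internet influences and even steers public opinion is an established fact. The internet can make stupid people more stupid and smart people smarter: those who use it well tell black from white and sift the true from the false, while those who do not drift with the tide and let others hypnotize them. As Professor Roth stresses, at a time when online speech is of uneven quality and grumbling is excessive, readers should perhaps shift the emphasis of "critical thinking" from "criticism" to "thinking."

(The author is a Chinese scholar living in the United States.)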

Wednesday, November 25, 2009

3 dimensions of Behavioral targeting

* CRM Dimension : Customer Retention
* Branding Dimension : Brandwashing
* Direct response : Customer Acquisition

in reference to: http://www.clickz.com/3401511

Tuesday, October 27, 2009

DataMining Tools

Weka
R
Excel

What is funny is the exchange that follows...

A:
SAS Base does a great job there. SPSS Modeler as well. An SPSS Statistics trial is available at http://www.spss.com.

Java is cool. But you are wasting time with programming. A data miner has more important tasks to do than generating tons of code.
Doesn't (s)he?

B:
I disagree for two reasons:
- I cannot count the algorithms I can give you an understandable description of in 1 minute, but when it comes to a real data analysis you will meet special cases where you have to know EXACTLY how the algorithm is implemented. That is the reason I could never work with non-open-source programs.
- If you are not able to write code (at least to change the behavior of existing algorithms or create new ones), you restrict yourself to using only what's available. Are you sure your data mining environment is prepared for every possible data analysis problem?

@tools: You forgot RapidMiner (formerly YALE), which does an excellent job of handling large datasets and data preparation (its key focus). It is free, it is open source, and it is written in Java.

in reference to: Data into results » Data mining tools

Clusters on Twitter users

The post uses the k-means method to extract clusters of Twitter users.
The clusters include: common people, geeks, professional managers, online addicts, and more.
It also provides the results:

http://www.dataintoresults.com/pub/twitter-seg-01.php
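
For readers who have not seen k-means before, here is a minimal pure-Python sketch of the algorithm (Lloyd's iteration). It is not the code used in the linked segmentation; the function name `kmeans`, the fixed starting centroids, and the toy 2-D points are all made up for illustration and have nothing to do with real Twitter data.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on tuples of floats, starting from the given centroids."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        # (squared Euclidean distance).
        assignment = [
            min(
                range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
            )
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [pt for pt, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment


# Made-up toy points forming two obvious groups.
TOY_POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

if __name__ == "__main__":
    final_centroids, labels = kmeans(TOY_POINTS, [(0.0, 0.0), (10.0, 10.0)])
    print(final_centroids)
    print(labels)
```

In a real segmentation each point would be a vector of per-user features, and the starting centroids would usually be chosen at random or with a seeding heuristic rather than fixed by hand.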

in reference to: Data into results » A Twitter users segmentation
