a tiny data analyst: October 2009

Tuesday, October 27, 2009

Data Mining Tools

Weka
R
Excel

What's funny is what follows...

A:
SAS Base does a great job there, and so does SPSS Modeler. An SPSS Statistics trial is available at http://www.spss.com.

Java is cool, but you are wasting time with programming. A data miner has more important tasks than generating tons of code.
Doesn't (s)he?

B:
I disagree for two reasons:
- I cannot count the algorithms I can give you an understandable description of in one minute, but when it comes to real data analysis you will meet special cases where you have to know EXACTLY how the algorithm is implemented. That is the reason I could never work with non-open-source programs.
- If you are not able to write code (at least to change the behavior of existing algorithms or create new ones), you restrict yourself to using only what's available. Are you sure your data mining environment is prepared for every possible data analysis problem?

@tools: You forgot RapidMiner (formerly YALE), which does an excellent job of handling large datasets and data preparation (its key focus). It is free, it is open source, and it is written in Java.

in reference to:

- Data into results » Data mining tools (view on Google Sidewiki)

Clusters of Twitter users

The k-means method is used to extract the clusters.
The clusters include ordinary people, geeks, professional managers, online addicts, and more.
The results are also provided:

http://www.dataintoresults.com/pub/twitter-seg-01.php
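
To make the k-means idea concrete, here is a minimal sketch in plain Python. It is not the code behind the linked segmentation. The two features (tweets per day, follower count), the sample users, and the simple "evenly spaced points" initialization are all assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
        for p in points
    ]


def kmeans(points, k, max_iterations=100):
    """Lloyd's algorithm; assumes len(points) >= k."""
    # Assumption: start from evenly spaced points instead of random ones,
    # so the example is deterministic.
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    for _ in range(max_iterations):
        labels = assign(points, centroids)
        updated = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                # New centroid is the per-feature mean of the cluster members.
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[i])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # assignments are stable
        centroids = updated
    return centroids, assign(points, centroids)


# Hypothetical users: (tweets per day, followers)
users = [(1, 10), (2, 12), (1, 15), (40, 900), (45, 950), (50, 1000)]
centroids, labels = kmeans(users, k=2)
print(labels)
```

On real data the features would normally be rescaled first, since an unscaled follower count dominates the distance.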

in reference to:

"A Twitter users segmentation"
- Data into results » A Twitter users segmentation (view on Google Sidewiki)

Monday, October 26, 2009

Null hypothesis

One cannot make decisions or draw conclusions that assume the truth of the null hypothesis. Just as failing to reject it does not "prove" the null hypothesis, one does not conclude that the alternative hypothesis is disproven or rejected, even though this seems reasonable. One simply concludes that the null hypothesis is not rejected. Not rejecting the null hypothesis still allows for getting new data to test the alternative hypothesis again. On the other hand, rejecting the null hypothesis only means that the alternative hypothesis may be true, pending further testing.
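
A small worked example may help here. The sketch below runs an exact two-sided binomial test of the null hypothesis "the coin is fair". The coin-flip scenario, the function names, and the 0.05 threshold are assumptions chosen for illustration. Note that the only two outcomes are "reject H0" and "fail to reject H0"; nothing is ever "accepted".

```python
from math import comb


def binomial_p_value(successes, trials, p0=0.5):
    """Exact two-sided p-value: the total probability, under H0, of every
    outcome that is no more likely than the observed one."""
    def pmf(k):
        return comb(trials, k) * p0 ** k * (1 - p0) ** (trials - k)

    observed = pmf(successes)
    # Small tolerance so equally likely outcomes are not lost to rounding.
    return sum(pmf(k) for k in range(trials + 1) if pmf(k) <= observed * (1 + 1e-9))


def decide(p_value, alpha=0.05):
    """The test never 'accepts' H0; it rejects it or fails to reject it."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


print(decide(binomial_p_value(6, 10)))   # 6 heads in 10 flips
print(decide(binomial_p_value(10, 10)))  # 10 heads in 10 flips
```

Six heads in ten flips is entirely compatible with a fair coin, so H0 is not rejected, but that is not evidence that the coin is fair: more flips could still reveal a bias.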

in reference to: Null hypothesis - Wikipedia, the free encyclopedia (view on Google Sidewiki)

Friday, October 23, 2009

Stanford open course on data mining ...

This course goes deeper.

1: It involves a lot of real applications, such as PageRank, recommender systems, and anti-spam, so it is more applied.
2: I should do the exercises in my textbook, Introduction to Data Mining, first,
and then study more deeply with this course.

3: A lot of data sets are available for public use, such as netFlixprize.com, a competition to predict missing ratings from 100 million given rating records. What a huge data set. I may try to do something with it once my skills reach a high enough level.
Add to that the Yahoo data, the ACM Multimedia challenge ... big data sets are available everywhere.

4: Practice, make mistakes, hit errors, and solve the problems. That is how I will improve at data mining.

The open course from Stanford is here:

http://www.stanford.edu/class/cs345a/

Thursday, October 22, 2009

Pro MySQL on the effect of random primary keys on InnoDB insert performance

It turns out that the description of InnoDB in Pro MySQL is wrong. Then again, Pro MySQL was published many years ago, so perhaps InnoDB used to be implemented the way the book describes. Judging from the test results, InnoDB (as shipped with MySQL 5.0.84) really does use a clustered index, and the physical location of the data follows the clustered index.

in reference to:

- 悔恨+懒惰=进步的动力 » Blog Archive » The effect of random primary keys on InnoDB insert performance (view on Google Sidewiki)

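Four MySQL architectures

1: Single database
2: Master-slave (MS)
3: Master-master (MM)
4: Composite: built on the single-database layout, with individual nodes replaced by 2 or 3

With limited funding, the DBA has to play seesaw among three factors: high performance, high stability, and scalability.

in reference to:

- 悔恨+懒惰=进步的动力 » Blog Archive » MySQL database architecture options (view on Google Sidewiki)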

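51.com MySQL disaster recovery

Full backup + incremental backup + master-slave replication backup + consistency checks + tape library backup + regular disaster recovery drills

Built as a platform, with plugins

in reference to:

- 悔恨+懒惰=进步的动力 » Blog Archive » MySQL disaster recovery at 51.com (view on Google Sidewiki)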

MySQL exclusive lock bug

key: Next-key lock

Transaction 1 takes an exclusive lock on primary key a=7, and then transaction 2 cannot take an exclusive lock on primary key a<=6.
My initial guess is that the next-key lock causes this. The exact cause remains to be confirmed.

in reference to:

- 悔恨+懒惰=进步的动力 » Blog Archive » A question about exclusive locks on a primary key (view on Google Sidewiki)

Tuesday, October 20, 2009

web 3.0

http://www.theeasybee.com/

in reference to: HimmiH (view on Google Sidewiki)

diaper and beer

http://www.dssresources.com/newsletters/66.php

The most famous data mining example.

Fathers sent out to buy diapers tend to buy something for themselves as a reward... beer is a good choice.

in reference to: HimmiH (view on Google Sidewiki)

one & zero attribute rule

One-attribute-rule
The one-attribute-rule, or OneR, is an algorithm for finding association rules. According to Ross, very simple association rules, involving just one attribute in the condition part, often work well in practice with real-world data.[17] The idea of the OneR (one-attribute-rule) algorithm is to find the one attribute to use to classify a novel datapoint that makes the fewest prediction errors.

For example, to classify a car you haven't seen before, you might apply the following rule: If Fast Then Sportscar, as opposed to a rule with multiple attributes in the condition: If Fast And Softtop And Red Then Sportscar.

The algorithm is as follows:

For each attribute A:
    For each value V of that attribute, create a rule:
        1. count how often each class appears
        2. find the most frequent class, c
        3. make a rule "if A=V then C=c"
    Calculate the error rate of these rules
Pick the attribute whose rules produce the lowest error rate
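
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. The car data set and the names `one_r`, `rows`, and `target` are made up for the example, and ties between classes are broken by whichever class was seen first.

```python
from collections import Counter, defaultdict


def one_r(rows, target):
    """Return (best_attribute, rules, error_rate) for a list of dict rows."""
    best = None
    attributes = [name for name in rows[0] if name != target]
    for attr in attributes:
        # Steps 1-3: for each value, count the classes and keep the majority.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[target]] += 1
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Error rate of this attribute's rule set on the training data.
        errors = sum(1 for row in rows if rules[row[attr]] != row[target])
        error_rate = errors / len(rows)
        if best is None or error_rate < best[2]:
            best = (attr, rules, error_rate)
    return best


# Hypothetical training data in the spirit of the sports car example.
cars = [
    {"fast": "yes", "softtop": "yes", "color": "red", "class": "sportscar"},
    {"fast": "yes", "softtop": "no", "color": "blue", "class": "sportscar"},
    {"fast": "no", "softtop": "yes", "color": "red", "class": "other"},
    {"fast": "no", "softtop": "no", "color": "blue", "class": "other"},
    {"fast": "yes", "softtop": "no", "color": "red", "class": "sportscar"},
    {"fast": "no", "softtop": "no", "color": "red", "class": "other"},
]
attribute, rules, error_rate = one_r(cars, target="class")
print(attribute, rules, error_rate)
```

On this toy data the attribute `fast` separates the classes perfectly, so OneR picks the rule "If Fast Then Sportscar".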

Zero-attribute-rule
The zero-attribute-rule, or ZeroR, does not involve any attribute in the condition part, and always returns the most frequent class in the training set. This algorithm is frequently used as a baseline against which the classification success of other algorithms is measured.
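
As a rough sketch of how ZeroR serves as a baseline, the snippet below always predicts the majority class and reports the resulting accuracy. The labels and the function names are assumptions for illustration.

```python
from collections import Counter


def zero_r(training_labels):
    """Return a predictor that ignores its input and always answers
    with the most frequent class in the training set."""
    majority = Counter(training_labels).most_common(1)[0][0]
    return lambda _row=None: majority


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted class matches the true label."""
    hits = sum(1 for row, label in zip(rows, labels) if predict(row) == label)
    return hits / len(labels)


# Hypothetical labels: three ordinary cars and one sports car.
labels = ["other", "other", "other", "sportscar"]
rows = [None] * len(labels)  # ZeroR never looks at the attributes
predict = zero_r(labels)
baseline = accuracy(predict, rows, labels)
print(predict(), baseline)
```

Any classifier that cannot beat this baseline accuracy on the same data has learned nothing useful from the attributes.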

in reference to: HimmiH (view on Google Sidewiki)

3 association algorithms

Many algorithms for generating association rules were presented over time.

Some well-known algorithms are Apriori, Eclat and FP-Growth, but they only do half the job, since they are algorithms for mining frequent itemsets. Another step needs to be done afterwards to generate rules from the frequent itemsets found in a database.
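
That second step can be sketched as follows. The function assumes it is handed a complete table of frequent itemsets with their support counts (so every subset of a frequent itemset is present), and it keeps the rules whose confidence reaches a threshold. The itemsets, counts, and names are hypothetical.

```python
from itertools import combinations


def generate_rules(frequent, min_confidence):
    """Turn {frozenset: support_count} into (antecedent, consequent, confidence)
    triples. Assumes every subset of a frequent itemset is also in the table."""
    rules = []
    for itemset, support in frequent.items():
        if len(itemset) < 2:
            continue  # a rule needs both a condition and a conclusion
        for size in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), size):
                antecedent = frozenset(combo)
                # confidence(A -> B) = support(A and B) / support(A)
                confidence = support / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules


# Hypothetical support counts from four shopping baskets.
frequent = {
    frozenset({"diapers"}): 3,
    frozenset({"beer"}): 3,
    frozenset({"diapers", "beer"}): 2,
}
rules = generate_rules(frequent, min_confidence=0.6)
for antecedent, consequent, confidence in rules:
    print(set(antecedent), "->", set(consequent), round(confidence, 2))
```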

Apriori algorithm
Apriori[5] is the best-known algorithm to mine association rules. It uses a breadth-first search strategy to count the support of itemsets and uses a candidate generation function which exploits the downward closure property of support.
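
A compact sketch of the level-wise search, assuming transactions are given as Python sets and support is an absolute count. The basket data is made up. A real implementation would count candidates far more efficiently, but the structure (count one level, join survivors, prune by downward closure) is the same.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Breadth-first: count every candidate of size k in one pass.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Candidate generation: join frequent itemsets into size-k sets...
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # ...and prune with downward closure: every (k-1)-subset must be frequent.
        current = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical shopping baskets.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
]
result = apriori(baskets, min_support=2)
print(result)
```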

Eclat algorithm
Eclat[6] is a depth-first search algorithm using set intersection.
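
For comparison, here is a minimal sketch of the same mining task done depth-first. Each item is mapped to the set of transaction ids that contain it (its tidset), and an itemset's support is the size of the intersection of its items' tidsets. The data and names are assumptions for illustration.

```python
def eclat(transactions, min_support):
    """Return {frozenset(itemset): support_count} using tidset intersection."""
    # Vertical layout: item -> ids of the transactions containing it.
    tidsets = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            tidsets.setdefault(item, set()).add(tid)

    frequent = {}

    def extend(prefix, candidates):
        for index, (item, tids) in enumerate(candidates):
            itemset = prefix | {item}
            frequent[itemset] = len(tids)
            # Intersect with the remaining candidates to grow the itemset.
            narrowed = []
            for other, other_tids in candidates[index + 1:]:
                shared = tids & other_tids
                if len(shared) >= min_support:
                    narrowed.append((other, shared))
            if narrowed:
                extend(itemset, narrowed)  # depth-first recursion

    start = sorted(
        ((item, tids) for item, tids in tidsets.items() if len(tids) >= min_support),
        key=lambda pair: pair[0],
    )
    extend(frozenset(), start)
    return frequent


# Hypothetical shopping baskets.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
]
result = eclat(baskets, min_support=2)
print(result)
```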

FP-growth algorithm
FP-growth (frequent pattern growth)[16] uses an extended prefix-tree (FP-tree) structure to store the database in a compressed form. FP-growth adopts a divide-and-conquer approach to decompose both the mining tasks and the databases. It uses a pattern fragment growth method to avoid the costly process of candidate generation and testing used by Apriori.

in reference to: HimmiH (view on Google Sidewiki)

Friday, October 16, 2009

a world in motion @google acquires Gapminder's Trendalyzer

Facts are stubborn, but statistics are more pliable.

in reference to: Official Google Blog: A world in motion (view on Google Sidewiki)

The Most Intelligent Java IDE

JAVA IDE

in reference to:

"The Most Intelligent Java IDE"
- IntelliJ IDEA :: Best Java IDE with smart Java editor, Java debugger, Java code generator, automatic Java code coverage measurement, Java GUI builder, fully supporting Java EE, for productive Java programming (view on Google Sidewiki)

TagSchema

He says he invented Data 2.0.

in reference to: TagSchema (view on Google Sidewiki)

trendistic.com (http://trendistic.com)

in reference to: “google” trends in Twitter with Trendistic (view on Google Sidewiki)

Neural Data Mining for Credit Card Fraud Detection

in reference to:

"Neural Data Mining for Credit Card Fraud Detection"
- Free PDF: Neural Data Mining for Credit Card Fraud Detection - Free ebook manual download - PDFee.com (view on Google Sidewiki)

Thursday, October 15, 2009

Data mining training course

in reference to: BI-Quotient » Training (view on Google Sidewiki)

trendistic.com (http://trendistic.com)

The GFW trend in tweets...

in reference to: “gfw” trends in Twitter with Trendistic (view on Google Sidewiki)

freebase developer

Use your own programming language to query Freebase's data through the API.

in reference to: Freebase - Code Search (view on Google Sidewiki)

AWS open data

AWS hosts a lot of open data sets, especially Freebase (www.freebase.com).

in reference to: Public Data Sets on Amazon Web Services (AWS) (view on Google Sidewiki)

Wednesday, October 14, 2009

1

When you run into trouble using a piece of free software, you should not blame its author, because they owe you nothing. You should not see yourself as a picky customer, but as an adviser to the software and a kindly source of suggestions. Only then can you understand the joy the author felt in writing the program. 2009-08-01 00:47 (Category: Default)

next:

When you hit a problem, report it to the author and help improve the software, and become a happy participant. It is like your older brother giving you his used bicycle: you should cherish the gesture rather than curse your brother when the bike breaks or when you take a fall while riding. If you really cannot adopt this cooperative attitude, you had better not use the software.

in reference to:

- Renren (Xiaonei) - Blog view - When you run into trouble using a piece of free software, you should not blame its author, because they owe you nothing. (view on Google Sidewiki)

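Yesterday I happened to see a migrant worker's account book, and I cried!!! (reposted from 拍砖@Douban)

Yesterday at noon, after he had moved some things at our company, he squatted at the entrance writing something down. Writing in a squat looked like hard work, so I told him to sit at my desk and write there. I happened to notice that he was keeping accounts, which caught my interest (I had no intention of prying into his privacy; it was pure curiosity), so I picked the book up and had a look. His accounts were a running ledger (in effect just one entry added after another). I roughly added things up in my head and organized them so everyone can take a look, along with some explanations that I noted down after asking him. (photo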
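The accounts are for December.

Total income: about 770 yuan (approximate, but no more than 800)

Rent: 50 yuan (four people share one room)

Management fee: 20 yuan (collected by the subdistrict office, including a 10-yuan temporary residence fee)

Meals: 140 yuan (1 yuan for breakfast, 4 yuan for lunch, the kind that fills you up but is not good)

Groceries: 27 yuan (the four of them take turns buying groceries each day and cook together)

Rice: 15 yuan (his family has its own rice, but the round-trip fare costs more than buying rice)

Daily necessities: 30 yuan (including oil, salt, paper, etc.)

Cigarettes: 21 yuan (the kind that costs 0.7 yuan a pack, one pack a day)

Communications: 17 yuan (including a 10-yuan pager service fee)

Transport: 3 yuan (he mostly gets around on foot)

Living expenses for his son: 200 yuan (his son attends high school in the county town)

A piece of clothing for his wife: 20 yuan (probably bought at a street stall; "I haven't bought her new clothes in half a year," he said, full of guilt)

Sent home: 150 yuan (saved for his son's schooling)

Sent for his mother's medical care: 50 yuan (the cost of her medicine is split among three siblings)

Unexpected expenses: 60 yuan (once he was fined 10 yuan for jaywalking while rushing to get a job; once, while carrying a load, he bumped into a young man and was extorted out of 50 yuan for "laundry costs")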

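Looking at his expenses, I felt a pang of sadness. He said the people at our company are all very kind and often give him things he can sell (waste newspapers, unwanted packing boxes, and out-of-date promotional materials). Once a girl even gave him a piece of clothing (a promotional T-shirt). Whenever he works here there is water to drink, and sometimes even good cigarettes to smoke (I was speechless; when we have him do a job, we sometimes give him a cigarette). What he fears most is falling ill, even with a cold or a fever. What he wants most is for his son to get into university and for his mother's health to improve. What he wants least is a visit from the township officials, who mostly come to ask for money. He goes out looking for work at 6 every morning and does not get back until 8 in the evening. His happiest time is going to the corner shop to watch TV after dinner. I asked him why he did not contract a fish pond or an orchard back home. He smiled simply and said that was not something people like him could get, because the good spots all went to those with connections. He does not know what civil rights are, and in all his life he has never seen a ballot.

He knows about the WTO, since it is often on the news, but he understands neither politics nor economics. He just wants to earn 10 yuan more a day, so that each month there is more money to buy better medicine for his mother, send his son a bit more for living expenses, and buy his wife one more nice piece of clothing. He said he is very afraid of dying, because he has to struggle on for this family; his mother, wife, and son depend on him. His greatest wish is to save a little money and start a small business so that money is a little less tight. This is the monthly account book and testimony of an ordinary migrant worker. There are 700 million people like him across the country, and within that group he is probably about average. They have no grand ambitions. They live at the bottom of this country, and they are its foundation. They have received no aid from this country and enjoyed none of its welfare, and at critical moments they are also the group most easily forgotten. We are even unwilling to regard them as part of this civilization of ours with its thousands of years of history. We seem to have given them far too little thought.

Everyone can ask themselves: have you ever noticed them, have you ever thought about them? When an ordinary migrant worker stands next to you and the smell of his sweat drifts into your nostrils, do you cover your mouth and nose? I used to, but I do not know whether I will in future. I consider myself very patriotic, but now I think that before I was only shouting slogans. I cannot help them. All I can do is go to the "Love 1-to-1" campaign and donate a little money to help a child who has dropped out of school. I think they need more than this kind of help, they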

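in reference to:

"Yesterday I happened to see a migrant worker's account book, and I cried!!! (reposted from 拍砖@Douban)"
- Renren (Xiaonei) - Blog view - Yesterday I happened to see a migrant worker's account book, and I cried!!! (reposted from 拍砖@Douban) (view on Google Sidewiki)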

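Whipping the corpse: a game for geniuses!

The world's most cold-blooded community killer, the world's most fanatical ("性球") domain name collector, the world's most smooth-talking capital merchant. Chen Yizhou has today started playing a new game: whipping the corpse.

This way of playing, which runs through the entire chain of the game, is a stroke of genius! And anyone who can afford to play it is a genius among geniuses!
Let's look at how the corpse-whipping game is played:

Step 1: Acquire a thriving website;
Step 2: Start copying concepts to raise more capital;
Step 3: Ravage the thriving website to death;
Step 4: Seal away the dead site's domain name and turn it into a specimen;
Step 5: Drag the specimen out of the morgue and violate it again;
Step 6: Keep hyping concepts and return to step 2. Repeat ...

Let's see how many more rounds Killer Chen can play at the very least: mop.com, xiaonei.com, dudu.com, kaixin.com, 5q.com, uume.com, ...

The game is fun enough. I don't think he plans to go home for dinner.

Reposted from: http://uicom.net/blog/?p=840

in reference to:

"Whipping the corpse: a game for geniuses!"
- Renren (Xiaonei) - Blog view - Whipping the corpse: a game for geniuses! (view on Google Sidewiki)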

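Don't tell me what you went to was a university @rt

USA, Massachusetts, computer education, science and engineering, major

After five years as an undergraduate at Tsinghua, striving hard like many of my comrades for the dream of venturing out into the world, I finally set foot in North America in 1998 after countless hardships. These two years have left me with all sorts of feelings. I know study life at Tsinghua deeply, and I can say I have also had a real taste of study life at a North American university. I believe many diligent and ambitious younger students (DDMM) back home are still at the stage of dreams and longing, as I was two years ago. Here I will do what I can on my own to offer a comparison and an introduction, so that people inside China can understand the similarities, differences, and gaps between study life at Tsinghua and in North America. My point of entry is the top-brand schools of China and the US: Tsinghua and MIT.

In terms of student intake, no American school can scoop up most of the top science scorers and top-ten students from every province in the country. Although MIT has always led in science and engineering, elite schools such as Caltech, Stanford, and Berkeley are not far behind. Stanford is especially attractive for its Silicon Valley location and for producing Jerry Yang-style capitalists. In terms of the concentration of outstanding students, MIT and Stanford together probably cannot match Tsinghua.

The TOEFL and GRE were not taken in vain after all: I could follow 80 to 90 percent of the lectures. The cassette player I had intended to use for recording lectures came along only once, and I did not even record then; I never brought it again. Reading the course texts was not too difficult either (nowhere near as many obscure words as the GRE). American students really do have weak foundations and do not work hard; those who score less than 50 on a 100-point assignment are generally Americans. (The English names of people from Japan, Indonesia, and other countries do not read as smoothly as Americans' names.) But the course load here is truly not light. Generally, taking three courses is the standard (many dare to take only two); those who can take four are remarkable; five? Better not to think about it.

In my first semester here I took an undergraduate course called Computer System Design, and how grueling it was is hard to put into words. In fifteen weeks I handed in ten assignments and did six course design projects. Some projects had several parts with separate design reports, so I handed in about ten design reports as well. The most terrifying stretch was when six assignments or design reports were due within ten days, right when several other courses had midterms. Complaining was useless. The professor said: "I'm very sorry. But this course is important, so please keep working." Students start by designing ordinary sequential logic circuits (I had forgotten most of my digital electronics). The core is to design your own "small but complete" ALU, a single-cycle CPU, and a multi-cycle CPU, all the way to finally implementing a pipelined 32-bit MIPS CPU and cache. By the end of the course, everything related to the CPU had come together, and my hardware design skills had improved a lot (it was just too exhausting).

Among Tsinghua's undergraduate courses you really cannot find one this solid that integrates theory and practice. The computer science department's TEC-II experimental computer is practically the only one of its kind in the country, but the lab assignments do not touch the core (the TEC-II is a microprogrammed machine). The lack of hands-on design leaves students with a rather shallow understanding of how a CPU works.

When I did the microinstruction experiments back then I felt a sudden enlightenment, but only now do I realize it was far from enough. As for the university-wide elective Principles of Microcomputers, the less said the better. One sentence from the teacher still pains me when I recall it: "Once you have mastered the 8-bit machine, the principles of 16-bit and 32-bit machines are the same." How can MIPS be compared with a 16-bit CPU? At Stanford, undergraduates have a similar computer system design course, but students are required to implement it in VHDL (today's standard language for IC design). How many Tsinghua undergraduates can use VHDL? At Tsinghua I took the Operating Systems course, which required four projects built on Linux, in groups of six, all of which could be handed in together at the end of the term. Until two or three weeks before the end of term, my classmates in the computer science department were still telling me not to panic, saying that some hotshot would surely get it done in the last week or two and everyone could ride along. My situation here is different: also four projects, but in groups of three, with one project due every three weeks.

If there are too many similarities, you are questioned on the spot. The penalty is that both the copier and the copied lose 50 points (out of 100), and there are already precedents of people being penalized. The deadline for handing in the assignment was 11 one Sunday night

in reference to:

"Don't tell me what you went to was a university @rt"
- Renren (Xiaonei) - Blog view - Don't tell me what you went to was a university @rt (view on Google Sidewiki)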

Sunday, October 11, 2009

glocal.cn (http://glocal.cn)

In the context of Glocal Project, Digital Artist Jer Thorp has produced 2 abstract search tools to visually navigate through Glocal's large database of photos. The first image represents the latest tool created by Jer, entitled Glocal Similarity Map Engine, which shows the compositional similarity between a particular image (shown in the center) and other images in the Glocal Pool. The second image is from Glocal Image Breeder, a tool that allows users to breed images - and look for 'children' that may contain common elements from both images. The result is a non-goal-oriented search engine that takes the user through a myriad of possible 'relational maps' within the Glocal Database. As additional people use the Image Breeder, more and more relationships between images are exposed.

in reference to:

- visualcomplexity.com | Glocal (view on Google Sidewiki)

data beauty

mark it

in reference to: The Beauty of Data 4: 20 outstanding infographic websites - powered by COMSHARP CMS (view on Google Sidewiki)

Why Many Eyes

Transferring a large amount of information from a database into an individual's head.

Visualizations become even more powerful when multiple people access them for collaborative sensemaking.

in reference to: Visual Communication Lab - VCL (view on Google Sidewiki)

go to where ...

http://vizlab.nytimes.com/page/About.html
--------------->>>>>>>>
http://www.research.ibm.com/visual

in reference to: Visualization Lab: About (view on Google Sidewiki)

How to visualize in an efficient and inspiring way

Every form of visualization should tell a story. Unfortunately there is limited attention and time to process all the stories. So the gist of the story, or its immediate impact, should be visible right away. The term I like to use for this principle is “glanceability.”

in reference to:

"You can explore more examples of data visualization in web and print publications such as Information Aesthetics, Flowing Data, Many Eyes, Wired, The New York Times, and right here at GOOD."
- How Might We Visualize Data in More Effective and Inspiring Ways? | GOOD (view on Google Sidewiki)

Saturday, October 10, 2009

who rules the social web?

It is obvious that women enjoy social activity more...

Complete data about many SNS websites is here.

in reference to:

"formation Is"
- Who Rules The Social Web? | Information Is Beautiful (view on Google Sidewiki)

Friday, October 9, 2009

the call to action @obama

This morning, Michelle and I awoke to some surprising and humbling news. At 6 a.m., we received word that I'd been awarded the Nobel Peace Prize for 2009.

To be honest, I do not feel that I deserve to be in the company of so many of the transformative figures who've been honored by this prize -- men and women who've inspired me and inspired the entire world through their courageous pursuit of peace.

But I also know that throughout history the Nobel Peace Prize has not just been used to honor specific achievement; it's also been used as a means to give momentum to a set of causes.

That is why I've said that I will accept this award as a call to action, a call for all nations and all peoples to confront the common challenges of the 21st century. These challenges won't all be met during my presidency, or even my lifetime. But I know these challenges can be met so long as it's recognized that they will not be met by one person or one nation alone.

This award -- and the call to action that comes with it -- does not belong simply to me or my administration; it belongs to all people around the world who have fought for justice and for peace. And most of all, it belongs to you, the men and women of America, who have dared to hope and have worked so hard to make our world a little better.

So today we humbly recommit to the important work that we've begun together. I'm grateful that you've stood with me thus far, and I'm honored to continue our vital work in the years to come.

Thank you,

President Barack Obama

Tuesday, October 6, 2009

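[You want to remember, but the pain is unbearable]

From a tweet by Chen Xiaoqing this week; it was probably a text message:

There are three things a person cannot hide: a cough, poverty, and love; the more you try to hide them, the more they show. There are three things a person should not squander: health, money, and love; you try to squander them and lose more than you gain. There are three things a person cannot hold on to: life, time, and love; you try to hold on, but they drift further away. There are three things a person should not recall: disaster, death, and love; you want to remember, but the pain is unbearable.

in reference to:

"[You want to remember, but the pain is unbearable]"
- They managed to turn an Audrey Hepburn into a streetwalker (view on Google Sidewiki)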

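Data quality

To some extent, the quality of a country's statistical data largely reflects that country's level of development.

The OECD countries are basically the most developed countries in the world (though a few are a bit of a stretch). Their data are clean and clear, well standardized and highly comparable, and very comfortable to analyze, which is why so many cross-country studies use OECD data.

Anyone who often uses Chinese data knows that China's National Bureau of Statistics and the various ministries publish enormous volumes of data; in sheer quantity there is no shortage. The problem is that you often cannot find even the most commonly used figures: quarterly consumption, investment, and net exports (growth rates have recently started to be published), the true unemployment rate, and so on. These are the most basic macro data, yet the National Bureau of Statistics still faces technical difficulties in compiling them (this is no joke, and it is not a case of the bureau knowing the numbers and deliberately withholding them). Many Chinese data also fail to match up. Take employment again: the US has employment data from household surveys and from employer surveys, and although the two are not identical, they do not differ by much. In China, employment measured by household surveys exceeds employment measured by employer surveys by several hundred million people.

But however poor China's data quality is, it is still far better than that of many other developing countries. Let me take the data of a certain developing country I looked at recently, which had me in stitches:

First I wanted to know this country's GDP per capita, and I found that in one particular year it had doubled. Looking more closely, I saw that the country's population had gone from N (in units of ten thousand) to N/2 that year. There had been no war, plague, or anything of the sort; the recorded population had simply been cut in half within a year.

Then I wanted to look at the country's balance of payments, and what made me laugh was that the most important item in it was: errors and omissions. In other words, this country has no idea how money and goods flow in and out, so in the end everything gets attributed to errors and omissions.

All right, let's look at money and banking. I found that the country has very complete interest rate series, monthly at that, which cheered me up for a while. Then I plotted them, and to my astonishment the interest rate was a horizontal straight line. Only then did I realize that this country's interest rates, for deposits and loans alike, had not moved in many years. So the interest rates carry no information at all. Then let's look at the money supply, M1, M2, and the like. The country does publish them, but after looking at some of the figures something felt off: the economy is said to be growing, yet the money supply is falling, which is not very plausible. Then I discovered that a large amount of foreign currency circulates within the country, and the central bank has no idea how much, so M1 and M2 carry no information either.

This country's data are not even the worst; the data of some African countries are said to be more frightening still. After looking at this country's numbers, I went back to China's and felt much better.

Source: Bullogger International (牛博国际) http://www.bullogger.com/blogs/kaiecon/archives/343549.aspx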

in reference to:

- Data quality (view on Google Sidewiki)

Monday, October 5, 2009

GFW fucked very very much!!!

cheer ssh
