Word Cloud in R

本文参考自One R Tip A Day- Word Cloud in R


研究CRAN中R Packages所关注的最热门的领域

以下的例子将从Available CRAN Packages页面上抓取目前CRAN上所有的R包,提取当中的title,及各个R包的简介,用来分析R包所涵盖的热门领域。

1
2
3
4
5
6
7
8
9
# 载入必要的R包
require(XML)
require(tm)
u = "http://cran.r-project.org/web/packages/available_packages_by_date.html"
# 这个readHTMLTable可以直接读取网页中的table并转换为data frame
t = readHTMLTable(u)[[1]]
# 观察一下这个table
head(t)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
## Date Package
## 1 2013-11-21 bbmle
## 2 2013-11-21 BSagri
## 3 2013-11-21 dbstats
## 4 2013-11-21 Hmisc
## 5 2013-11-21 lestat
## 6 2013-11-21 ncf
## Title
## 1 Tools for general maximum likelihood estimation
## 2 Statistical methods for safety assessment in agricultural field\ntrials
## 3 Distance-based statistics (dbstats)
## 4 Harrell Miscellaneous
## 5 A package for LEarning STATistics
## 6 spatial nonparametric covariance functions
1
2
3
4
5
6
7
8
9
10
11
12
# 利用Title建立语料库
ap.corpus <- Corpus(DataframeSource(data.frame(as.character(t[, 3]))))
# 可以用tm_map函数对语料库中的词汇进行批量修改 移除标点
ap.corpus <- tm_map(ap.corpus, removePunctuation)
# 英文字母转小写
ap.corpus <- tm_map(ap.corpus, tolower)
# 删除一些stopwords
ap.corpus <- tm_map(ap.corpus, function(x) removeWords(x, stopwords("english")))
# 可以看看stopwords都是什么
head(stopwords("english"))
1
## [1] "i" "me" "my" "myself" "we" "our"

下面要把语料库转换成一个Term-Document Matrix[1]。一个document可能是一句话,可以可能是一篇文章。Term-Document Matrix的row表示某个一个document,column表示该document各个词汇出现的次数。下面是一个Term-Document Matrix的简单例子:

  • D1 = “I like databases”
  • D2 = “I hate databases”

则Term-Document Matrix将是这样:

I like hate databases
D1 1 1 0 1
D2 1 0 1 1
1
2
3
4
5
6
7
ap.tdm <- TermDocumentMatrix(ap.corpus)
ap.m <- as.matrix(ap.tdm)
# 计算Term-Document Matrix中每个单词的词频,并且按照词频降序排序
ap.v <- sort(rowSums(ap.m), decreasing = TRUE)
ap.d <- data.frame(word = names(ap.v), freq = ap.v)
head(ap.d)
1
2
3
4
5
6
7
## word freq
## data data 716
## analysis analysis 594
## models models 430
## functions functions 331
## package package 288
## regression regression 246
1
2
3
4
5
6
7
8
9
require(wordcloud)
require(RColorBrewer)
# 设置一个调色板
pal2 <- brewer.pal(8, "Dark2")
# 导出成png
png("wordcloud_packages.png", width = 1280, height = 800)
wordcloud(ap.d$word, ap.d$freq, scale = c(8, 0.2), min.freq = 3, max.words = Inf,
random.order = FALSE, rot.per = 0.15, colors = pal2)
dev.off()
1
2
## pdf
## 2

最终的输出效果如图:

wordcloud


Reference:

1.Term-Document Matrix

喜欢就分享一下吧