本文参考自One R Tip A Day- Word Cloud in R
研究CRAN中R Packages所关注的最热门的领域
以下的例子将从Available CRAN Packages页面上抓取目前CRAN上所有的R包,提取当中的title,及各个R包的简介,用来分析R包所涵盖的热门领域。
1 2 3 4 5 6 7 8 9
| require(XML) require(tm) u = "http://cran.r-project.org/web/packages/available_packages_by_date.html" t = readHTMLTable(u)[[1]] head(t)
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14
| ## Date Package ## 1 2013-11-21 bbmle ## 2 2013-11-21 BSagri ## 3 2013-11-21 dbstats ## 4 2013-11-21 Hmisc ## 5 2013-11-21 lestat ## 6 2013-11-21 ncf ## Title ## 1 Tools for general maximum likelihood estimation ## 2 Statistical methods for safety assessment in agricultural field\ntrials ## 3 Distance-based statistics (dbstats) ## 4 Harrell Miscellaneous ## 5 A package for LEarning STATistics ## 6 spatial nonparametric covariance functions
|
1 2 3 4 5 6 7 8 9 10 11 12
| ap.corpus <- Corpus(DataframeSource(data.frame(as.character(t[, 3])))) ap.corpus <- tm_map(ap.corpus, removePunctuation) ap.corpus <- tm_map(ap.corpus, tolower) ap.corpus <- tm_map(ap.corpus, function(x) removeWords(x, stopwords("english"))) head(stopwords("english"))
|
1
| ## [1] "i" "me" "my" "myself" "we" "our"
|
下面要把语料库转换成一个Term-Document Matrix[1]。一个document可能是一句话,可以可能是一篇文章。Term-Document Matrix的row表示某个一个document,column表示该document各个词汇出现的次数。下面是一个Term-Document Matrix的简单例子:
- D1 = “I like databases”
- D2 = “I hate databases”
则Term-Document Matrix将是这样:
|
I |
like |
hate |
databases |
D1 |
1 |
1 |
0 |
1 |
D2 |
1 |
0 |
1 |
1 |
1 2 3 4 5 6 7
| ap.tdm <- TermDocumentMatrix(ap.corpus) ap.m <- as.matrix(ap.tdm) ap.v <- sort(rowSums(ap.m), decreasing = TRUE) ap.d <- data.frame(word = names(ap.v), freq = ap.v) head(ap.d)
|
1 2 3 4 5 6 7
| ## word freq ## data data 716 ## analysis analysis 594 ## models models 430 ## functions functions 331 ## package package 288 ## regression regression 246
|
1 2 3 4 5 6 7 8 9
| require(wordcloud) require(RColorBrewer) pal2 <- brewer.pal(8, "Dark2") png("wordcloud_packages.png", width = 1280, height = 800) wordcloud(ap.d$word, ap.d$freq, scale = c(8, 0.2), min.freq = 3, max.words = Inf, random.order = FALSE, rot.per = 0.15, colors = pal2) dev.off()
|
最终的输出效果如图:
Reference:
1.Term-Document Matrix