2016-05-24

Data Analysis Notes

MIT: The Analytics Edge, Notes 07: Visualization

Study notes for Unit 7 of the MIT course 15.071x The Analytics Edge.

Visualization

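The topic of Unit 7 is visualization.

1. Introduction

plot vs. ggplot2
plot: draws only simple points and lines; adding other elements is awkward.
ggplot2: builds a plot from layers, so other elements are easy to add.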

ggplot2

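The three elements of ggplot2:

Data
The data, supplied as a data.frame.
Aesthetic mapping
Specifies how variables in the data.frame map to visual properties of the plot, such as color, shape, scale, x/y coordinates, and grouping.
Geometric objects
Determine the form in which the data is drawn, such as points, lines, box plots, bar charts, and polygons.

In the command below, the argument WHO is the data.frame that supplies the data, the aes() argument is the aesthetic mapping, and the functions joined on with a plus sign, such as geom_point(), are the geometric objects.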

# Form
# ggplot(data = NULL, mapping = aes(), ..., environment = parent.frame())
# Example
ggplot(WHO, aes(x = GNI, y = FertilityRate, color = Region)) + geom_point()

aes() can be passed as an argument either to ggplot() or to a geom_XXXX() function.
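As a minimal sketch of this point, assuming the built-in mtcars data set as a stand-in for WHO (it is not part of the course material), the plots below supply the mapping in the two different places:

```r
library(ggplot2)

# mtcars ships with R; it is used here only as a stand-in for the WHO data.
# Mapping given to ggplot(): inherited by every layer added afterwards.
p1 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Mapping given to the geom: applies to that layer only.
p2 <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg))

# A mix: x and y are shared by all layers, color is mapped only for the points.
p3 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl))) +
  geom_smooth(method = "lm", se = FALSE)
```

p1 and p2 draw the same points. In p3 the regression line inherits x and y from ggplot() but not the color mapping, so a single line is fitted to all points.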

Aesthetic mapping

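Coordinates

aes(x, y, xmin, xmax, ymin, ymax, xend, yend)
# x and y each name the column of the data.frame to use for that coordinate

Note: coordinate aesthetics are usually passed to ggplot(); the others can all be passed to a geom_XXXX() function.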

Geometric objects

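Color

aes(colour, fill, alpha)
# colour  color of points and lines
# fill    fill color, usually mapped to a column of the data.frame. It also acts as a grouping: a factor column with two levels is filled with two different colors
# alpha   transparency, a number between 0 and 1

Grouping

aes(group)
# group  grouping variable; set it to 1 to put all rows in a single group, or to a column of the data.frame

Shape and line style

aes(linetype, size, shape)
# linetype  line type (lty in base graphics)
# size      point size or line width, given as a number
# shape     point shape

Point shapes, i.e. the values n can take in geom_point(shape = n)
[Figure: shapes]

Line types, i.e. the values n can take in geom_line(linetype = n)
[Figure: line-types]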

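Geometries

geom_point()    points
geom_line()     lines
geom_tile()     tiles (rectangles, as used for heatmaps)
geom_bar()      bar chart (geom_histogram() draws histograms)
geom_polygon()  polygons

Notes:
binwidth = 5: the width of each histogram bin, in units of the x variable
geom_bar(stat = "identity"): use the value of the y variable as is
geom_histogram(position = "identity"): do not stack the histograms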
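The three notes above can be illustrated with a small sketch. The data frames counts and obs are invented for this example and are not part of the course data:

```r
library(ggplot2)

# Hypothetical pre-aggregated counts, invented for this illustration.
counts <- data.frame(day = c("Mon", "Tue", "Wed"), n = c(5, 9, 3))

# stat = "identity": bar heights are taken directly from the y column (n)
# instead of being computed by counting rows.
bars <- ggplot(counts, aes(x = day, y = n)) +
  geom_bar(stat = "identity")

# Two overlapping groups of made-up measurements.
set.seed(1)
obs <- data.frame(
  value = c(rnorm(200, mean = 50, sd = 10), rnorm(200, mean = 60, sd = 10)),
  group = rep(c("a", "b"), each = 200)
)

# binwidth = 5: each bin spans 5 units of `value`.
# position = "identity": the two histograms are overlaid rather than stacked,
# so alpha is lowered to keep both visible.
hists <- ggplot(obs, aes(x = value, fill = group)) +
  geom_histogram(binwidth = 5, position = "identity", alpha = 0.5)
```

Without position = "identity", geom_histogram() would stack group b on top of group a, which makes the two distributions harder to compare.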

2. Hands-on

Plotting

# Read in data
WHO = read.csv("WHO.csv")
str(WHO)

# Plot from Week 1
plot(WHO$GNI, WHO$FertilityRate)

# Let's redo this using ggplot 
# Install and load the ggplot2 library:
install.packages("ggplot2")
library(ggplot2)

# Create the ggplot object with the data and the aesthetic mapping:
scatterplot = ggplot(WHO, aes(x = GNI, y = FertilityRate))

# Add the geom_point geometry
scatterplot + geom_point()

# Make a line graph instead:
scatterplot + geom_line()

# Switch back to our points:
scatterplot + geom_point()

# Redo the plot with blue triangles instead of circles:
scatterplot + geom_point(color = "blue", size = 3, shape = 17) 

# Another option:
scatterplot + geom_point(color = "darkred", size = 3, shape = 8) 

# Add a title to the plot:
scatterplot + geom_point(colour = "blue", size = 3, shape = 17) + ggtitle("Fertility Rate vs. Gross National Income")

Grouping

# Factor variable: one distinct color per level
# Color the points by region: 
ggplot(WHO, aes(x = GNI, y = FertilityRate, color = Region)) + geom_point()

# Numeric variable: a color gradient from light to dark
# Color the points according to life expectancy:
ggplot(WHO, aes(x = GNI, y = FertilityRate, color = LifeExpectancy)) + geom_point()

Fitting

# Is the fertility rate of a country a good predictor of the percentage of the population under 15?
ggplot(WHO, aes(x = FertilityRate, y = Under15)) + geom_point()

# Let's try a log transformation:
ggplot(WHO, aes(x = log(FertilityRate), y = Under15)) + geom_point()

# Simple linear regression model to predict the percentage of the population under 15, using the log of the fertility rate:
mod = lm(Under15 ~ log(FertilityRate), data = WHO)
summary(mod)

# Add this regression line to our plot:
ggplot(WHO, aes(x = log(FertilityRate), y = Under15)) + geom_point() + stat_smooth(method = "lm")

# 99% confidence interval
ggplot(WHO, aes(x = log(FertilityRate), y = Under15)) + geom_point() + stat_smooth(method = "lm", level = 0.99)

# No confidence interval in the plot
ggplot(WHO, aes(x = log(FertilityRate), y = Under15)) + geom_point() + stat_smooth(method = "lm", se = FALSE)

# Change the color of the regression line:
ggplot(WHO, aes(x = log(FertilityRate), y = Under15)) + geom_point() + stat_smooth(method = "lm", colour = "orange")

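Heatmap

The heatmap effect (the more data, the darker the color) is produced by scale_fill_gradient(): low and high set the colors at the two ends of the scale, and the gradient between them is generated automatically. The legend beside the plot is selected with the argument guide = "legend".
The final command is shown below; how the data was generated is not repeated here.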

# Change the color scheme
ggplot(DayHourCounts, aes(x = Hour, y = Var1)) + geom_tile(aes(fill = Freq)) + scale_fill_gradient(name="Total MV Thefts", low="white", high="red") + theme(axis.title.y = element_blank())
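Since the data preparation is skipped in these notes, here is a minimal sketch of one way to build a DayHourCounts-style table. It assumes synthetic events rather than the mvt crime data, so the events data frame and the legend title "Events" are made up for illustration:

```r
library(ggplot2)

# Synthetic events: the real notes use the mvt crime data instead.
set.seed(42)
days <- c("Sunday", "Monday", "Tuesday", "Wednesday",
          "Thursday", "Friday", "Saturday")
events <- data.frame(
  Weekday = sample(days, 1000, replace = TRUE),
  Hour    = sample(0:23, 1000, replace = TRUE)
)

# Count events per (weekday, hour); table() names the columns Var1, Var2, Freq.
DayHourCounts <- as.data.frame(table(events$Weekday, events$Hour))

# Var2 is a factor; convert it to a real number via character.
DayHourCounts$Hour <- as.numeric(as.character(DayHourCounts$Var2))

# Put the weekdays in calendar order rather than alphabetical order.
DayHourCounts$Var1 <- factor(DayHourCounts$Var1, ordered = TRUE, levels = days)

heat <- ggplot(DayHourCounts, aes(x = Hour, y = Var1)) +
  geom_tile(aes(fill = Freq)) +
  scale_fill_gradient(name = "Events", low = "white", high = "red",
                      guide = "legend") +
  theme(axis.title.y = element_blank())
```

Printing heat draws one tile per weekday and hour, shaded from white (few events) to red (many events).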

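Geographic heatmap

As the name suggests, a geographic heatmap is a heatmap drawn on top of a map.
The maps package ships with maps of the United States, the world, France, Italy, and others. A map works much like an image: an image is divided into pixels at some granularity and stores the color of each pixel, while a map is divided into points by latitude and longitude and stores information about each point (for example, which state the point belongs to, which is how a map of the United States is formed).
Where the previous plot used ggplot() + geom_tile() + scale_fill_gradient(),
this one uses ggmap() + geom_point() + scale_colour_gradient().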

# Install and load two new packages:
install.packages("maps")
install.packages("ggmap")
library(maps)
library(ggmap)

# Load a map of Chicago into R:
chicago = get_map(location = "chicago", zoom = 11)

# Look at the map
ggmap(chicago)

# Plot the first 100 motor vehicle thefts:
ggmap(chicago) + geom_point(data = mvt[1:100,], aes(x = Longitude, y = Latitude))

# Round our latitude and longitude to 2 digits of accuracy, and create a crime counts data frame for each area:
LatLonCounts = as.data.frame(table(round(mvt$Longitude,2), round(mvt$Latitude,2)))

str(LatLonCounts)

# Convert our Longitude and Latitude variable to numbers:
LatLonCounts$Long = as.numeric(as.character(LatLonCounts$Var1))
LatLonCounts$Lat = as.numeric(as.character(LatLonCounts$Var2))

# Plot these points on our map:
ggmap(chicago) + geom_point(data = LatLonCounts, aes(x = Long, y = Lat, color = Freq, size=Freq))

# Change the color scheme:
ggmap(chicago) + geom_point(data = LatLonCounts, aes(x = Long, y = Lat, color = Freq, size=Freq)) + scale_colour_gradient(low="yellow", high="red")

# We can also use the geom_tile geometry
ggmap(chicago) + geom_tile(data = LatLonCounts, aes(x = Long, y = Lat, alpha = Freq), fill="red")
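The rounding step used to build LatLonCounts can be tried without the map or the crime data. The sketch below uses made-up coordinates; the final filtering step is an optional addition that is not in the notes above:

```r
# Made-up coordinates near Chicago; the notes use mvt$Longitude / mvt$Latitude.
set.seed(7)
lon <- runif(500, min = -87.80, max = -87.55)
lat <- runif(500, min = 41.70, max = 41.95)

# Rounding to 2 decimals snaps every point onto a grid of 0.01-degree cells;
# table() then counts the points in each cell.
cells <- as.data.frame(table(round(lon, 2), round(lat, 2)))

# The grid labels come back as factors, so go through character to get numbers.
cells$Long <- as.numeric(as.character(cells$Var1))
cells$Lat  <- as.numeric(as.character(cells$Var2))

# table() also lists every empty cell (Freq == 0); dropping them is an
# optional extra step that keeps empty cells from being drawn.
nonEmpty <- subset(cells, Freq > 0)
```

Passing nonEmpty instead of cells to geom_point() or geom_tile() avoids drawing markers for grid cells that contain no points.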

Word cloud

# Prepare the data first; we need a lot of words.
# As in the text analytics unit, we use the tweets data again, but this time without stemming.
library(tm)
tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
corpus = Corpus(VectorSource(tweets$Tweet))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, PlainTextDocument)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords("english"))
frequencies = DocumentTermMatrix(corpus)
allTweets = as.data.frame(as.matrix(frequencies))

# The words we need are the column names
colnames(allTweets)
# The other thing we need is the frequency of each word
colSums(allTweets)

# Now load the wordcloud package
library(wordcloud)
wordcloud(colnames(allTweets), colSums(allTweets), scale=c(2, .25))

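# The scale argument sets the range of font sizes
# scale=c(2, .25) means the most frequent word is drawn at size 2 and the least frequent word at size 0.25
wordcloud(colnames(allTweets), colSums(allTweets))
# is equivalent to
wordcloud(colnames(allTweets), colSums(allTweets), scale=c(4, 0.5))

# min.freq
# Words with a frequency below this value are not drawn

# max.words
# Draw at most this many words

# random.order = FALSE
# Draw words in decreasing order of frequency, most frequent first

# rot.per = 0.5
# Half of the words are drawn vertically. The default is 0.1.

# random.color = TRUE
# Pick colors at random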

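Colors
The RColorBrewer package provides a set of color palettes; run display.brewer.all() to see them all.

library(RColorBrewer)
display.brewer.all()

# Use one like this (pass colors by name, since the third positional argument of wordcloud() is scale)
colors = brewer.pal(9, "Blues")[5:9]
wordcloud(colnames(allTweets), colSums(allTweets), colors = colors)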

Saving

# Save our plot:
fertilityGNIplot = scatterplot + geom_point(colour = "blue", size = 3, shape = 17) + ggtitle("Fertility Rate vs. Gross National Income")
pdf("MyPlot.pdf")
print(fertilityGNIplot)
dev.off()

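Appendix

Weekday names in R

On a Chinese-locale system, weekdays() returns names such as "星期二 星期六 星期日 星期三 星期四 星期五 星期一". How can the output be made to read "Friday Monday Saturday Sunday Thursday Tuesday Wednesday" instead? Switch the LC_TIME locale to English: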

# Convert the Date variable to a format that R will recognize:
mvt$Date = strptime(mvt$Date, format="%m/%d/%y %H:%M")
mvt$Weekday = weekdays(mvt$Date)

table(mvt$Weekday)
星期二 星期六 星期日 星期三 星期四 星期五 星期一 
26791  27118  26316  27416  27319  29284  27397 

Sys.getlocale()
"zh_CN.UTF-8/zh_CN.UTF-8/zh_CN.UTF-8/C/zh_CN.UTF-8/zh_CN.UTF-8"
Sys.setlocale("LC_TIME", "en_US.UTF-8")
"en_US"
Sys.getlocale()
"zh_CN.UTF-8/zh_CN.UTF-8/zh_CN.UTF-8/C/en_US.UTF-8/zh_CN.UTF-8"

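# Recompute the weekday names under the new locale before tabulating
mvt$Weekday = weekdays(mvt$Date)
table(mvt$Weekday)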
Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday 
29284     27397     27118     26316     27319     26791     27416

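Note also that in both Chinese and English the result is sorted in alphabetical (collation) order rather than in a meaningful calendar order. Converting the names to an ordered factor fixes this: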

WeekdayCounts = as.data.frame(table(mvt$Weekday))
WeekdayCounts$Var1 = factor(WeekdayCounts$Var1, ordered=TRUE, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday","Saturday"))
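The effect of the ordered factor can be checked with a small sketch. The counts here are made up; only the weekday names matter:

```r
# Made-up counts; table() on real data would list them alphabetically like this.
WeekdayCounts <- data.frame(
  Var1 = c("Friday", "Monday", "Saturday", "Sunday",
           "Thursday", "Tuesday", "Wednesday"),
  Freq = c(12, 9, 15, 14, 10, 8, 11)
)

# Declare the calendar order explicitly.
WeekdayCounts$Var1 <- factor(
  WeekdayCounts$Var1, ordered = TRUE,
  levels = c("Sunday", "Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday")
)

# Sorting (and axis order in ggplot2) now follows the level order.
sorted <- WeekdayCounts[order(WeekdayCounts$Var1), ]
print(as.character(sorted$Var1))
```

After the conversion, sorting starts at Sunday and ends at Saturday instead of running from Friday to Wednesday.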

Converting a factor to numbers

Convert the factor to character first, then to numeric.

# Convert the second variable, Var2, to numbers and call it Hour:
DayHourCounts$Hour = as.numeric(as.character(DayHourCounts$Var2))
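The reason for the detour through character is that as.numeric() applied directly to a factor returns its internal level codes, not the values shown in the labels. A minimal sketch with a made-up factor f:

```r
# A factor whose labels look like numbers (hours of the day, stored as text).
f <- factor(c("10", "3", "22", "3"))

# Pitfall: as.numeric() on a factor returns the internal level codes.
# The levels sort as text ("10" < "22" < "3"), so the codes are 1, 3, 2, 3.
codes <- as.numeric(f)

# Going through character recovers the actual values: 10, 3, 22, 3.
values <- as.numeric(as.character(f))
```

Only values holds the hours themselves; codes holds positions in the level list, which is rarely what is wanted.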
