MIT: The Analytics Edge Notes 06 - Clustering

Study notes for Unit 6 of the MIT course 15.071x The Analytics Edge.


Clustering

The topic of Unit 6 is clustering, which is used to find similarities within data.

1. Theory

Recommendation Systems

Collaborative filtering:
Filters by commonalities/similarities between users. It uses only user information and has nothing to do with the content of the movies themselves.

Content filtering:
Uses information about the movies themselves to filter for films that share a director, actor, or genre. It has nothing to do with other users.

Clustering

Clustering is a form of "unsupervised learning": data points that share common features are grouped together.

Hierarchical clustering

The steps of hierarchical clustering:

  1. Compute the pairwise distances
  2. Build the cluster tree (hclust)
  3. Cut the tree into groups (cutree)

Note 1: Computing the distances can exhaust memory. The pairwise distances between n points yield n*(n-1)/2 values, and this result must be stored; if n is large, the matrix holding it is large as well and may overflow memory.
Note 2: Three ways to measure distance:
Euclidean distance: the straight-line distance between two points
Manhattan distance: the sum of the absolute differences of the coordinates
Maximum coordinate: the largest single-coordinate difference between the two points
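All three metrics are available in base R through the method argument of dist(). A small sketch with two made-up 2-D points:

```r
# Two example points: (0, 0) and (3, 4)
x = rbind(c(0, 0), c(3, 4))

dist(x, method = "euclidean")  # sqrt(3^2 + 4^2) = 5
dist(x, method = "manhattan")  # |3| + |4| = 7
dist(x, method = "maximum")    # max(|3|, |4|) = 4
```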

K-means clustering

The steps of k-means clustering:

  1. Choose the number of clusters, k
  2. Randomly assign every point to a cluster
  3. Compute the centroid of each cluster
  4. Compute each point's distance to the centroids and reassign it to the nearest one
  5. Recompute the centroids
  6. Repeat steps 4 and 5 until the assignments stop improving

Note: the centroid distance between two clusters is the distance between their centroids, where a centroid is the mean of all points in the cluster.
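On a toy 1-D example (hypothetical data), kmeans() illustrates the loop above; the centers it reports are exactly the per-cluster means:

```r
set.seed(1)
# Two well-separated groups of 1-D points
pts = c(1, 2, 3, 101, 102, 103)
km = kmeans(pts, centers = 2)
km$centers   # the cluster means, e.g. 2 and 102
km$cluster   # the cluster assignment of each point
```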

Normalize

If different columns are on different orders of magnitude, the smaller values can be drowned out in the computation, so the columns should be rescaled to a common scale.

library(caret)
preproc = preProcess(airlines)
airlinesNorm = predict(preproc, airlines)

The effect is that every column has mean 0 and standard deviation 1 (preProcess centers and scales by default).
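The same standardization can be done in base R with scale(), shown here on a hypothetical two-column data frame with very different scales:

```r
# Hypothetical data: one column in the thousands, one in single digits
df = data.frame(miles = c(1000, 2000, 3000), bonus = c(1, 2, 3))

# scale() subtracts each column's mean and divides by its standard deviation
dfNorm = as.data.frame(scale(df))

colMeans(dfNorm)       # all columns now have mean 0
apply(dfNorm, 2, sd)   # and standard deviation 1
```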

2. Modeling and Evaluation

Hierarchical clustering

# After following the steps in the video, load the data into R
movies = read.table("movieLens.txt", header=FALSE, sep="|",quote="\"")
# Add column names
colnames(movies) = c("ID", "Title", "ReleaseDate", "VideoReleaseDate", "IMDB", "Unknown", "Action", "Adventure", "Animation", "Childrens", "Comedy", "Crime", "Documentary", "Drama", "Fantasy", "FilmNoir", "Horror", "Musical", "Mystery", "Romance", "SciFi", "Thriller", "War", "Western")
# Remove unnecessary variables
movies$ID = NULL
movies$ReleaseDate = NULL
movies$VideoReleaseDate = NULL
movies$IMDB = NULL
# Remove duplicates
movies = unique(movies)

# Compute distances, using only the genre columns (2 to 20)
distances = dist(movies[2:20], method = "euclidean")

# Hierarchical clustering
# Older versions of R accepted method = "ward"; current versions use "ward.D"
clusterMovies = hclust(distances, method = "ward.D")

# Plot the dendrogram
plot(clusterMovies)

# Assign points to clusters
clusterGroups = cutree(clusterMovies, k = 10)
# Create a new data set with just the movies from cluster 2
cluster2 = subset(movies, clusterGroups==2)
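To interpret the clusters, the course averages each genre indicator within each cluster (on the real data: tapply(movies$Action, clusterGroups, mean)). A self-contained toy sketch of the same idea, with a hypothetical 0/1 genre column and cluster labels:

```r
# Hypothetical data: a 0/1 "Action" indicator and a cluster label per movie
action = c(1, 1, 0, 0, 0, 1)
groups = c(1, 1, 1, 2, 2, 2)

table(groups)                      # cluster sizes
res = tapply(action, groups, mean) # per-cluster fraction of Action movies
res
```

A cluster where this mean is high can be read as an "Action cluster".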

K-means clustering

healthy = read.csv("healthy.csv", header=FALSE)
# Note:
# data.frame -> matrix -> vector flattens into one long vector (e.g. 50*50 -> length 2500)
# data.frame -> vector leaves it unchanged: as.vector() does not flatten a 50*50 data.frame
healthyMatrix = as.matrix(healthy)
healthyVector = as.vector(healthyMatrix)

# Specify number of clusters
k = 5
# Run k-means
set.seed(1)
KMC = kmeans(healthyVector, centers = k, iter.max = 1000)

# Extract clusters
healthyClusters = KMC$cluster

# Plot the image with the clusters
dim(healthyClusters) = c(nrow(healthyMatrix), ncol(healthyMatrix))

image(healthyClusters, axes = FALSE, col=rainbow(k))

# Apply to a test image
tumor = read.csv("tumor.csv", header=FALSE)
tumorMatrix = as.matrix(tumor)
tumorVector = as.vector(tumorMatrix)

# Apply clusters from before to new image, using the flexclust package
# kcca K-Centroids Cluster Analysis
install.packages("flexclust")
library(flexclust)
KMC.kcca = as.kcca(KMC, healthyVector)
tumorClusters = predict(KMC.kcca, newdata = tumorVector)

# Visualize the clusters
dim(tumorClusters) = c(nrow(tumorMatrix), ncol(tumorMatrix))
image(tumorClusters, axes = FALSE, col=rainbow(k))