This project is based on the open adult.data dataset from the UCI repository.
GitHub: AdultBase - Truedick23
It covers the following analyses:
- Marital status
- Education level
- The relationship between marital status and education (marital status of doctorate holders)
- Obtaining representative records (K-Means clustering)
- Spark version: spark-2.3.1-bin-hadoop2.7
- Language: Scala 2.11.8
- Dataset: Adult Data Set
- sbt configuration: make sure scalaVersion, the Spark jars you import, and the libraryDependencies entries match exactly; Spark 2.3 only supports Scala 2.11
name := "AdultBase" version := "0.1" scalaVersion := "2.11.8" // https://mvnrepository.com/artifact/org.apache.spark/spark-core libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.1" // https://mvnrepository.com/artifact/org.apache.spark/spark-streaming libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.3.1" // https://mvnrepository.com/artifact/org.apache.spark/spark-mllib libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.3.1" // https://mvnrepository.com/artifact/org.apache.spark/spark-mllib-local libraryDependencies += "org.apache.spark" %% "spark-mllib-local" % "2.3.1" // https://mvnrepository.com/artifact/org.scalanlp/breeze-viz libraryDependencies += "org.scalanlp" %% "breeze-viz" % "0.13.2"
Analyzing the data format
Here are the first three rows of adult.data. Each row has 15 fields; the only difference from a typical CSV is that the delimiter is ", ", i.e. a comma followed by a space:
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K
....
Reading and lightly formatting the data
Since we are not working inside the Spark shell, we first need to create a SparkContext to read the data:
import org.apache.spark.SparkContext

val sc = new SparkContext("local[2]", "Stumble Upon")
val raw_data = sc.textFile("./data/machine-learning-databases/adult.data")
val data = raw_data.map(line => line.split(", ")).filter(fields => fields.length == 15)
data.cache()
- Note the filter call, which keeps only rows with exactly 15 fields; this avoids ArrayIndexOutOfBoundsException on malformed or empty lines
- cache() keeps data in memory; since Spark evaluates RDDs lazily, this saves re-reading and re-parsing the file every time the RDD is reused (a quick sanity check follows below)
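As a quick sanity check (my addition, not part of the original write-up), an action such as count forces evaluation; because data is cached, later actions reuse the in-memory copy:

// the first action triggers parsing; subsequent actions hit the cache
println(s"Parsed records: ${data.count()}")  // 32561 rows; the marital-status counts below sum to this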
Basic data analysis
To warm up with Spark, let's compute a few fun statistics. We build three pair RDDs: counts by marital status, counts by education level, and marital-status counts for people holding a doctorate.
val marriage_data = data.map { fields => (fields(5), 1) }.reduceByKey((x, y) => x + y)
val education_data = data.map { fields => (fields(3), 1) }.reduceByKey((x, y) => x + y)
val doc_marriage_data = data.filter(fields => fields(3) == "Doctorate")
  .map(fields => (fields(5), 1))
  .reduceByKey((x, y) => x + y)
These three lines are a nice showcase of Scala's (and Spark's) functional style:
- map applies a function to every record; here it picks out the marital status (or education level) by index and emits a (category, 1) pair, ready for counting
- reduceByKey aggregates all elements that share the same key; here it sums the values, which amounts to counting how often each category occurs
- filter keeps only records that satisfy a predicate; here only rows whose education level is "Doctorate" are passed on
Printing the results:
println("Marriage data:") marriage_data.foreach(println) println("Education data:") education_data.foreach(println) println("Doctor Marriage data:") doc_marriage_data.foreach(println)
foreach applies the given function to every element of the RDD; there is not much to explain here, apart from one caveat covered by the sketch below.
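The caveat (my addition): foreach on an RDD runs on the executors, so on a real cluster the println output ends up in the executor logs rather than on the driver console. Collecting first, and sorting by count while we are at it, gives a more readable printout in all cases:

// hedged variant: bring the counts to the driver and print them, largest first
marriage_data.sortBy(_._2, ascending = false).collect().foreach(println)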
Results
Scala's visualization tooling is not as polished as Python's, so histograms drawn with matplotlib may be added later; in the meantime, a rough breeze-viz sketch is given below.
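Since breeze-viz is already among the sbt dependencies, here is a minimal, untested sketch (my addition) of how the education counts could be plotted as indexed points; the variable names, labels, and output file name are made up for illustration:

import breeze.linalg.DenseVector
import breeze.plot._

// hypothetical sketch: plot the education-level counts as indexed points
val counts = education_data.collect().sortBy(-_._2)           // (label, count), largest first
val fig = Figure()
val plt = fig.subplot(0)
plt += plot(DenseVector.tabulate(counts.length)(_.toDouble),  // x: category index
            DenseVector(counts.map(_._2.toDouble)),           // y: count
            style = '+')
plt.xlabel = "education category (index)"
plt.ylabel = "count"
fig.saveas("education_counts.png")                            // hypothetical output file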
Marital status first:
Marriage data:
(Married-spouse-absent,418)
(Married-AF-spouse,23)
(Divorced,4443)
(Never-married,10683)
(Married-civ-spouse,14976)
(Widowed,993)
(Separated,1025)
To make sense of these categories, it helps to know the standardized classification of marital status: ACS Data Definitions
Given the wide age range in the dataset, the large never-married group is understandable. An analysis that factors in age may be added later (a rough sketch follows below); beyond that, interpretation is left to the reader.
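A minimal sketch of such an age breakdown (my addition, assuming fields(0) is the age column, which matches the sample rows above):

// never-married rate per 10-year age bracket
val neverMarriedByAge = data
  .map { fields =>
    val bracket = (fields(0).toInt / 10) * 10
    (bracket, (if (fields(5) == "Never-married") 1 else 0, 1))
  }
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
  .mapValues { case (nm, total) => nm.toDouble / total }
neverMarriedByAge.sortByKey().collect().foreach(println)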
Education results:
Education data:
(1st-4th,168)
(Preschool,51)
(Doctorate,413)
(12th,433)
(Bachelors,5355)
(9th,514)
(Masters,1723)
(10th,933)
(Assoc-acdm,1067)
(Prof-school,576)
(HS-grad,10501)
(Some-college,7291)
(7th-8th,646)
(11th,1175)
(Assoc-voc,1382)
(5th-6th,333)
High-school graduates (HS-grad) and people with some college but no degree (Some-college) clearly dominate.
Marital status among doctorate holders:
Doctor Marriage data:
(Married-spouse-absent,7)
(Divorced,33)
(Never-married,73)
(Married-civ-spouse,286)
(Widowed,7)
(Separated,7)
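As a quick back-of-the-envelope comparison (my arithmetic from the two tables above): 286 / 413 ≈ 69% of doctorate holders are Married-civ-spouse, versus 14976 / 32561 ≈ 46% across the whole dataset.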
Indexing the data values
First extract the possible values of each categorical feature; distinct returns the set of unique values:
val number_set = data.map(fields => fields(2).toInt).collect().toSet
val education_types = data.map(fields => fields(3)).distinct.collect()
val marriage_types = data.map(fields => fields(5)).distinct.collect()
val family_condition_types = data.map(fields => fields(7)).distinct.collect()
val occupation_category_types = data.map(fields => fields(1)).distinct.collect()
val occupation_types = data.map(fields => fields(6)).distinct.collect()
Define a function that builds a value-to-index map from an array of distinct values:
def acquireDict(types: Array[String]): Map[String, Int] = {
  var idx = 0
  var dict: Map[String, Int] = Map()
  for (item <- types) {
    dict += (item -> idx)
    idx += 1
  }
  dict
}
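As a design note (my addition), the same mapping can be written in one line with zipWithIndex:

// equivalent, more idiomatic version
def acquireDict(types: Array[String]): Map[String, Int] = types.zipWithIndex.toMap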
Call the function to generate the feature-value mappings:
val education_dict = acquireDict(education_types)
val marriage_dict = acquireDict(marriage_types)
val family_condition_dict = acquireDict(family_condition_types)
val occupation_category_dict = acquireDict(occupation_category_types)
val occupation_dict = acquireDict(occupation_types)
val sex_dict = Map("Male" -> 1, "Female" -> 0)
Use the mappings to turn each record into a numeric vector, ready for clustering:
import org.apache.spark.mllib.linalg.Vectors

val data_set = data.map { fields =>
  val number = fields(2).toInt
  val education = education_dict(fields(3))
  val marriage = marriage_dict(fields(5))
  val family_condition = family_condition_dict(fields(7))
  val occupation_category = occupation_category_dict(fields(1))
  val occupation = occupation_dict(fields(6))
  val sex = sex_dict(fields(9))
  // note: occupation_category and sex are computed here but not included in the feature vector
  Vectors.dense(number, education, marriage, family_condition, occupation)
}
The result looks like this:
[77516.0,11.0,3.0,0.0,3.0]
[83311.0,11.0,4.0,4.0,11.0]
[215646.0,1.0,2.0,0.0,9.0]
[234721.0,4.0,4.0,4.0,9.0]
[338409.0,11.0,4.0,1.0,1.0]
[284582.0,13.0,4.0,1.0,11.0]
......
K-Means clustering
For the theory, see the Wikipedia article: K-平均算法 (k-means clustering)
First train a model on data_set, here with k = 4 clusters and at most 80 iterations:
import org.apache.spark.mllib.clustering.KMeans

val kMeansModel = KMeans.train(data_set, k = 4, maxIterations = 80)
kMeansModel.clusterCenters.foreach { println }
The resulting cluster centers:
[190629.99686269276,5.085174554435619,3.435484947600294,2.743007809892531,5.400173553167345]
[87510.50314879218,5.166369019644703,3.448538396465833,2.7395431901494502,5.49920105273052]
[510475.8384236453,5.2738916256157635,3.419704433497537,2.716256157635468,5.548768472906404]
[316299.05332433345,5.1130610867364155,3.4340195747553155,2.6560917988525143,5.421532230847115]
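The choice of k = 4 above is arbitrary; as a hedged aside (my addition), one common way to compare candidate values of k is the within-set sum of squared errors exposed by KMeansModel.computeCost:

// sketch: compare WSSSE for a few values of k
for (k <- Seq(2, 4, 6, 8)) {
  val model = KMeans.train(data_set, k, maxIterations = 80)
  println(s"k = $k, WSSSE = ${model.computeCost(data_set)}")
}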
Notice that the four centers differ almost entirely in the first coordinate (the fnlwgt census weight), whose scale dwarfs the other features; the raw centroid coordinates are therefore hard to interpret on their own. Instead, we define a helper that finds, for each centroid, the actual fnlwgt value closest to it:
def nearestNumber(cores: Set[Int], numbers: Set[Int]): Set[Int] = {
  var arr: Set[Int] = Set()
  for (core: Int <- cores) {
    var r = 0
    val num = core.toInt
    // widen the search radius until a value from numbers is found
    while (!numbers.contains(num + r) && !numbers.contains(num - r)) {
      r += 1
    }
    if (numbers.contains(num + r)) arr = arr + (num + r)
    else arr = arr + (num - r)
  }
  arr
}
Call the helper, pull out the records whose fnlwgt value is closest to each centroid, and print them:
val numbers = kMeansModel.clusterCenters.map(centers => centers(0).toInt).toSet
val core_numbers = nearestNumber(numbers, number_set)
val core_data = data.filter(fields => core_numbers.contains(fields(2).toInt))
for (core <- core_data) {
  for (field <- core) {
    print(field + ", ")
  }
  println()
}
Output:
58, Self-emp-not-inc, 87510, 10th, 6, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K,
40, Private, 510072, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 55, United-States, >50K,
25, Private, 190628, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, Columbia, <=50K,
59, Private, 87510, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K,
42, Private, 510072, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K,
41, Federal-gov, 510072, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K,
36, Private, 316298, Bachelors, 13, Never-married, Tech-support, Own-child, White, Male, 0, 0, 40, United-States, <=50K,
From these representative records we can see that working in the private sector, having a high-school or some-college education, and living with one's spouse are fairly common traits.
References:
- 《Spark快速大数据分析》
- 《Spark数据分析》
- K-平均算法-维基百科
- Marital Status - American Community Survey
- Spark机器学习2:K-Means聚类算法