Overview: this project is based on adult.data, an open dataset from UCI.
GitHub: AdultBase - Truedick23
It carries out the following analyses:
- Marital status
- Education level
- The relationship between marital status and education (the marital status of doctorate holders)
- Extracting representative records (K-Means clustering)
- Spark version: spark-2.3.1-bin-hadoop2.7
- Language: Scala 2.11.8
- Dataset: Adult Data Set
- build.sbt contents (note that scalaVersion, the Spark jars you import, and the libraryDependencies entries must match exactly; Spark 2.3 only supports Scala 2.11):
name := "AdultBase" version := "0.1" scalaVersion := "2.11.8" // https://mvnrepository.com/artifact/org.apache.spark/spark-core libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.1" // https://mvnrepository.com/artifact/org.apache.spark/spark-streaming libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.3.1" // https://mvnrepository.com/artifact/org.apache.spark/spark-mllib libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.3.1" // https://mvnrepository.com/artifact/org.apache.spark/spark-mllib-local libraryDependencies += "org.apache.spark" %% "spark-mllib-local" % "2.3.1" // https://mvnrepository.com/artifact/org.scalanlp/breeze-viz libraryDependencies += "org.scalanlp" %% "breeze-viz" % "0.13.2"
Analyzing the data format
Here are the first three records of adult.data. Each line has 15 fields; the difference from a typical CSV is that the delimiter is ", ", i.e. a comma followed by a space:
```
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K
...
```
Reading and lightly formatting the data
Since we are not writing this inside the spark shell, we first need to create a SparkContext to read the data:
```scala
import org.apache.spark.SparkContext

val sc = new SparkContext("local[2]", "Stumble Upon")
val raw_data = sc.textFile("./data/machine-learning-databases/adult.data")
val data = raw_data.map(line => line.split(", ")).filter(fields => fields.length == 15)
data.cache()
```
- Note the use of filter here: it checks that each record has exactly 15 fields, which guards against ArrayIndexOutOfBoundsException on malformed lines (a quick sanity check follows this list).
- cache() keeps data in memory, so Spark's lazy evaluation does not recompute the whole pipeline every time an action runs on it.
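As that sanity check, we can compare the line count before and after filtering to see how many malformed lines were dropped (a small sketch; the exact number depends on your copy of the file):

```scala
// The difference between the raw and filtered counts is the number of
// malformed (e.g. empty or truncated) lines the filter removed.
val dropped = raw_data.count() - data.count()
println(s"Dropped $dropped malformed lines")
```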
Basic data analysis
Let's first warm up with Spark and compute a few interesting statistics. We build three datasets: counts of marital status, counts of education level, and the marital status of doctorate holders.
```scala
val marriage_data = data.map { fields => (fields(5), 1) }.reduceByKey((x, y) => x + y)
val education_data = data.map { fields => (fields(3), 1) }.reduceByKey((x, y) => x + y)
val doc_marriage_data = data.filter(fields => fields(3) == "Doctorate")
  .map(fields => (fields(5), 1))
  .reduceByKey((x, y) => x + y)
```
These three lines are a nice showcase of Scala's functional style:
- map transforms every record in the dataset; here it picks out the marital status (or education level) by field index and emits a (value, 1) key-value pair, ready for counting.
- reduceByKey is very handy: it aggregates all elements that share the same key, and the function we pass sums their values, which yields frequency counts.
- filter does the filtering: only records whose education field equals "Doctorate" are passed downstream (an alternative counting sketch follows this list).
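Incidentally, the same frequency counts can be obtained with the RDD method countByValue, which returns the counts to the driver as a Map; that is fine here because the number of distinct categories is small (an equivalent sketch, not the original code):

```scala
// Equivalent frequency count: countByValue returns a driver-side
// scala.collection.Map[String, Long] rather than a distributed RDD.
val marriage_counts = data.map(fields => fields(5)).countByValue()
marriage_counts.foreach(println)
```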
Print the results:
println("Marriage data:") marriage_data.foreach(println) println("Education data:") education_data.foreach(println) println("Doctor Marriage data:") doc_marriage_data.foreach(println)
foreach applies the given function to every element of the dataset; we won't dwell on it here.
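One caveat: foreach runs on the executors, so the println output shows up in our console only because we run in local mode; on a real cluster it would land in the executor logs. Collecting the (small) result first prints it on the driver (a sketch):

```scala
// Bring the small result set to the driver before printing, so the
// output is visible regardless of where the executors run.
marriage_data.collect().foreach(println)
```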
Results
Since Scala's visualization tooling is less polished than Python's, histograms made with matplotlib will be added later.
First, marital status:
```
Marriage data:
(Married-spouse-absent,418)
(Married-AF-spouse,23)
(Divorced,4443)
(Never-married,10683)
(Married-civ-spouse,14976)
(Widowed,993)
(Separated,1025)
```
To read these, it helps to know the standardized marital-status categories: see ACS Data Definitions.
Since the dataset spans a wide range of ages, the high never-married count is understandable; an analysis that factors in age will be added later. Interpretations may vary.
Education results:
```
Education data:
(1st-4th,168)
(Preschool,51)
(Doctorate,413)
(12th,433)
(Bachelors,5355)
(9th,514)
(Masters,1723)
(10th,933)
(Assoc-acdm,1067)
(Prof-school,576)
(HS-grad,10501)
(Some-college,7291)
(7th-8th,646)
(11th,1175)
(Assoc-voc,1382)
(5th-6th,333)
```
High-school graduates (HS-grad) and people with some college but no degree (Some-college) make up the largest groups.
Marital status of doctorate holders:
```
Doctor Marriage data:
(Married-spouse-absent,7)
(Divorced,33)
(Never-married,73)
(Married-civ-spouse,286)
(Widowed,7)
(Separated,7)
```
Indexing the data
First, extract the set of values each feature takes; distinct returns the dataset with duplicates removed, and collect() brings the (small) result back to the driver.
```scala
val number_set = data.map(fields => fields(2).toInt).collect().toSet
val education_types = data.map(fields => fields(3)).distinct.collect()
val marriage_types = data.map(fields => fields(5)).distinct.collect()
val family_condition_types = data.map(fields => fields(7)).distinct.collect()
val occupation_category_types = data.map(fields => fields(1)).distinct.collect()
val occupation_types = data.map(fields => fields(6)).distinct.collect()
```
Define a function that builds a value-to-index map from an array of distinct values:
```scala
def acquireDict(types: Array[String]): Map[String, Int] = {
  var idx = 0
  var dict: Map[String, Int] = Map()
  for (item <- types) {
    dict += (item -> idx)
    idx += 1
  }
  dict
}
```
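As an aside, the same map can be built in one idiomatic line, behaviorally equivalent to acquireDict above (the name acquireDictIdiomatic is ours, for illustration):

```scala
// zipWithIndex pairs each category with its array position;
// toMap turns the (value, index) pairs into the dictionary.
def acquireDictIdiomatic(types: Array[String]): Map[String, Int] =
  types.zipWithIndex.toMap
```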
Call the function to generate the feature-value maps:
```scala
val education_dict = acquireDict(education_types)
val marriage_dict = acquireDict(marriage_types)
val family_condition_dict = acquireDict(family_condition_types)
val occupation_category_dict = acquireDict(occupation_category_types)
val occupation_dict = acquireDict(occupation_types)
val sex_dict = Map("Male" -> 1, "Female" -> 0)
```
Use these maps to index the data values, turning each record into a numeric vector suitable for clustering:
```scala
import org.apache.spark.mllib.linalg.Vectors

val data_set = data.map { fields =>
  val number = fields(2).toInt
  val education = education_dict(fields(3))
  val marriage = marriage_dict(fields(5))
  val family_condition = family_condition_dict(fields(7))
  val occupation_category = occupation_category_dict(fields(1))
  val occupation = occupation_dict(fields(6))
  val sex = sex_dict(fields(9))
  // Note: only five features go into the vector; occupation_category
  // and sex are computed here but not used.
  Vectors.dense(number, education, marriage, family_condition, occupation)
}
```
The result:
```
[77516.0,11.0,3.0,0.0,3.0]
[83311.0,11.0,4.0,4.0,11.0]
[215646.0,1.0,2.0,0.0,9.0]
[234721.0,4.0,4.0,4.0,9.0]
[338409.0,11.0,4.0,1.0,1.0]
[284582.0,13.0,4.0,1.0,11.0]
......
```
K-Means clustering
For the theory, see Wikipedia: K-平均算法-维基百科.
First, train a model on data_set; here we ask for 4 clusters and at most 80 iterations:
```scala
import org.apache.spark.mllib.clustering.KMeans

val kMeansModel = KMeans.train(data_set, k = 4, maxIterations = 80)
kMeansModel.clusterCenters.foreach { println }
```
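To get a rough sense of whether k = 4 is a good choice, MLlib's KMeansModel also provides computeCost, the within-set sum of squared errors (WSSSE); comparing it across several k values and looking for an "elbow" is a common heuristic (a sketch, not part of the original run):

```scala
// WSSSE always shrinks as k grows, so look for the point of
// diminishing returns rather than the minimum.
val wssse = kMeansModel.computeCost(data_set)
println(s"WSSSE for k = 4: $wssse")
```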
The resulting cluster centers:
```
[190629.99686269276,5.085174554435619,3.435484947600294,2.743007809892531,5.400173553167345]
[87510.50314879218,5.166369019644703,3.448538396465833,2.7395431901494502,5.49920105273052]
[510475.8384236453,5.2738916256157635,3.419704433497537,2.716256157635468,5.548768472906404]
[316299.05332433345,5.1130610867364155,3.4340195747553155,2.6560917988525143,5.421532230847115]
```
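Notice that the four centers differ almost entirely in the first coordinate (the large number on the order of 10^5), while the other features barely change: because the features are unscaled, that one large column dominates the Euclidean distance. Standardizing the features first would let the other dimensions matter too; here is a sketch using MLlib's StandardScaler (an addition, not part of the original pipeline):

```scala
import org.apache.spark.mllib.feature.StandardScaler

// Scale each feature to zero mean and unit variance so no single
// column dominates the distance metric used by K-Means.
val scaler = new StandardScaler(withMean = true, withStd = true).fit(data_set)
val scaled_set = scaler.transform(data_set).cache()
val scaledModel = KMeans.train(scaled_set, k = 4, maxIterations = 80)
```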
We can't extract much useful information from these numbers alone, so we define a function that finds, for each center, the nearest value that actually occurs in the data:
```scala
def nearestNumber(cores: Set[Int], numbers: Set[Int]): Set[Int] = {
  var arr: Set[Int] = Set()
  for (core <- cores) {
    // Grow the search radius r until an actual data value is found.
    var r = 0
    while (!numbers.contains(core + r) && !numbers.contains(core - r)) {
      r += 1
    }
    if (numbers.contains(core + r)) arr = arr + (core + r)
    else arr = arr + (core - r)
  }
  arr
}
```
Call the function to get the data values nearest each cluster center, then print the matching records:
```scala
val numbers = kMeansModel.clusterCenters.map(center => center(0).toInt).toSet
val core_numbers = nearestNumber(numbers, number_set)
val core_data = data.filter(fields => core_numbers.contains(fields(2).toInt))
for (record <- core_data) {
  for (field <- record) {
    print(field + ", ")
  }
  println()
}
```
The output:
```
58, Self-emp-not-inc, 87510, 10th, 6, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K,
40, Private, 510072, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 55, United-States, >50K,
25, Private, 190628, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, Columbia, <=50K,
59, Private, 87510, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K,
42, Private, 510072, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K,
41, Federal-gov, 510072, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K,
36, Private, 316298, Bachelors, 13, Never-married, Tech-support, Own-child, White, Male, 0, 0, 40, United-States, <=50K,
```
We can see that working for a private employer, a high-school or college education, and living with one's spouse are common traits among these representative records.
References:
- 《Spark快速大数据分析》
- 《Spark数据分析》
- K-平均算法-维基百科
- Marital Status - American Community Survey
- Spark机器学习2:K-Means聚类算法