Spark ML: Data Modeling and Combined Regression and Clustering Analysis on the Iris Dataset (Spark Commercial ML in Practice)

Category: Programming Tools · Published: 7 years ago

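Summary: This technical column is a summary and distillation of the author's (秦凯新, Qin Kaixin) day-to-day work. It shares cases drawn from real commercial environments, together with tuning advice for commercial applications and capacity planning for cluster environments. Please keep following this blog series. Copyright notice: reposting is prohibited; reading for study is welcome. QQ email: 1120746959@qq.com. Feel free to get in touch for any business inquiries.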

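1 The Iris dataset (getting started)

The Iris dataset is a widely used benchmark for classification experiments, compiled by Fisher in 1936. Also known as the iris flower dataset, it is a dataset for multivariate analysis.
The dataset contains 150 samples in 3 classes, 50 samples per class, and each sample has 4 attributes. The measurements are features of iris flowers, and the dataset is commonly used for classification tasks. One of the species is linearly separable from the other two; the other two are not linearly separable from each other.

The following 4 attributes are typically used to predict which of the three species (Setosa, Versicolour, Virginica) an iris flower belongs to. The four attributes: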

Sepal.Length (sepal length), in cm
Sepal.Width  (sepal width), in cm
Petal.Length (petal length), in cm
Petal.Width  (petal width), in cm

The three species:

Iris Setosa (bristle-pointed iris)
Iris Versicolour (versicolor iris)
Iris Virginica (Virginia iris)

2 A look at the dataset

Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
5.1           3.5          1.4           0.2          Iris-setosa
4.9           3            1.4           0.2          Iris-setosa
4.7           3.2          1.3           0.2          Iris-setosa
4.6           3.1          1.5           0.2          Iris-setosa
5             3.6          1.4           0.2          Iris-setosa
5.4           3.9          1.7           0.4          Iris-setosa
4.6           3.4          1.4           0.3          Iris-setosa

3 Data preprocessing and analysis

3.1 Preprocessing from a CSV file

1 Reading the data
// Read the CSV with a header row and let Spark infer the column types
val df = spark.read.format("csv").option("sep", ",").option("inferSchema", "true").option("header", "true").load("/data/iris.csv")
df.show()
+------------+-----------+------------+-----------+-----------+
|Sepal.Length|Sepal.Width|Petal.Length|Petal.Width|    Species|
+------------+-----------+------------+-----------+-----------+
|         5.1|        3.5|         1.4|        0.2|Iris-setosa|
|         4.9|        3.0|         1.4|        0.2|Iris-setosa|
|         4.7|        3.2|         1.3|        0.2|Iris-setosa|
|         4.6|        3.1|         1.5|        0.2|Iris-setosa|
|         5.0|        3.6|         1.4|        0.2|Iris-setosa|
|         5.4|        3.9|         1.7|        0.4|Iris-setosa|

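2 Converting the label column to an index
import org.apache.spark.ml.feature.StringIndexer
// Map each distinct Species string to a numeric index in a new categoryIndex column
val indexer = new StringIndexer().setInputCol("Species").setOutputCol("categoryIndex")
val model = indexer.fit(df)
val indexed = model.transform(df)
// show() is a DataFrame method, so call it on the transformed data rather than on the fitted model
indexed.show()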

+------------+-----------+------------+-----------+-----------+-------------+
|Sepal.Length|Sepal.Width|Petal.Length|Petal.Width|    Species|categoryIndex|
+------------+-----------+------------+-----------+-----------+-------------+
|         5.1|        3.5|         1.4|        0.2|Iris-setosa|          0.0|
|         4.9|        3.0|         1.4|        0.2|Iris-setosa|          0.0|
|         4.7|        3.2|         1.3|        0.2|Iris-setosa|          0.0|
|         4.6|        3.1|         1.5|        0.2|Iris-setosa|          0.0|
|         5.0|        3.6|         1.4|        0.2|Iris-setosa|          0.0|
|         5.4|        3.9|         1.7|        0.4|Iris-setosa|          0.0|
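To make the indexing step more concrete, here is a minimal plain-Scala sketch of frequency-based label indexing, which is the default ordering StringIndexer uses (most frequent label gets 0.0). It is not Spark's implementation: the object name LabelIndexSketch, its methods, and the alphabetical tie-breaking rule are assumptions made for illustration.

```scala
// Plain-Scala sketch of frequency-based label indexing (illustrative, not Spark's code).
object LabelIndexSketch {
  // Build a label -> index mapping: the most frequent label gets 0.0.
  // Assumption: labels with equal counts are ordered alphabetically.
  def fit(labels: Seq[String]): Map[String, Double] =
    labels
      .groupBy(identity)
      .map { case (label, group) => (label, group.size) }
      .toSeq
      .sortBy { case (label, count) => (-count, label) }
      .zipWithIndex
      .map { case ((label, _), idx) => label -> idx.toDouble }
      .toMap

  // Replace each label by its index, like the categoryIndex column above.
  def transform(labels: Seq[String], mapping: Map[String, Double]): Seq[Double] =
    labels.map(mapping)
}
```

Under that tie-breaking assumption, three equally frequent species would be indexed in alphabetical order, which is consistent with Iris-setosa receiving 0.0 in the output shown above.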

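3.2 Preprocessing from a txt file

1 Reading the data
// Schema for one row of the tab-separated file
case class Iris(Sepal_Length: Double, Sepal_Width: Double, Petal_Length: Double, Petal_Width: Double, Species: String)
val data = sc.textFile("/data/iris.txt")
// The first line is the header; drop it before parsing
val header = data.first
// toDF relies on spark.implicits._, which spark-shell imports automatically
val df2 = data.filter(_ != header).map(_.split("\t")).map(l => Iris(l(0).toDouble, l(1).toDouble, l(2).toDouble, l(3).toDouble, l(4))).toDF
df2.show()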

+------------+-----------+------------+-----------+-----------+
|Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|    Species|
+------------+-----------+------------+-----------+-----------+
|         5.1|        3.5|         1.4|        0.2|Iris-setosa|
|         4.9|        3.0|         1.4|        0.2|Iris-setosa|
|         4.7|        3.2|         1.3|        0.2|Iris-setosa|
|         4.6|        3.1|         1.5|        0.2|Iris-setosa|
|         5.0|        3.6|         1.4|        0.2|Iris-setosa|
|         5.4|        3.9|         1.7|        0.4|Iris-setosa|

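2 Converting the label column to an index
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer().setInputCol("Species").setOutputCol("categoryIndex")
val model = indexer.fit(df2)
// Keep only the classes indexed 0.0 and 1.0 so that the problem becomes binary.
// Alternative: filter by species name instead of by index:
//val indexed = model.transform(df2).filter(!$"Species".equalTo("Iris-virginica"))
val indexed = model.transform(df2).filter("categoryIndex<2.0")
indexed.show()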

+------------+-----------+------------+-----------+-----------+-------------+
|Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|    Species|categoryIndex|
+------------+-----------+------------+-----------+-----------+-------------+
|         5.1|        3.5|         1.4|        0.2|Iris-setosa|          0.0|
|         4.9|        3.0|         1.4|        0.2|Iris-setosa|          0.0|
|         4.7|        3.2|         1.3|        0.2|Iris-setosa|          0.0|
|         4.6|        3.1|         1.5|        0.2|Iris-setosa|          0.0|
|         5.0|        3.6|         1.4|        0.2|Iris-setosa|          0.0|

4 Regression analysis on the Iris dataset

1 Getting the column indices of the features and the label
val features = List("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width").map(indexed.columns.indexOf(_))
features: List[Int] = List(0, 1, 2, 3)
 
val targetInd = indexed.columns.indexOf("categoryIndex") 
targetInd: Int = 5 

2 Converting the features to vectors
import org.apache.spark.mllib.linalg.{Vector, Vectors}  
import org.apache.spark.mllib.regression.LabeledPoint
val labeledPointIris = indexed.rdd.map(r => LabeledPoint(r.getDouble(targetInd),Vectors.dense(features.map(r.getDouble(_)).toArray)))

scala> labeledPointIris.foreach(println)
(0.0,[5.1,3.5,1.4,0.2])
(0.0,[4.9,3.0,1.4,0.2])
(0.0,[4.7,3.2,1.3,0.2])
(0.0,[4.6,3.1,1.5,0.2])
(0.0,[5.0,3.6,1.4,0.2])
(0.0,[5.4,3.9,1.7,0.4])
(0.0,[4.6,3.4,1.4,0.3])
(0.0,[5.0,3.4,1.5,0.2])
(0.0,[4.4,2.9,1.4,0.2])
(0.0,[4.9,3.1,1.5,0.1])

scala> println(labeledPointIris.first.features)
[5.1,3.5,1.4,0.2]
scala> println(labeledPointIris.first.label)
0.0

3 Splitting into training and test sets
val splits = labeledPointIris.randomSplit(Array(0.8, 0.2), seed = 11L)
val trainingData = splits(0).cache
val testData = splits(1).cache
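
The split weights are probabilities rather than exact sizes: each row is assigned at random, so an 80/20 split of a small dataset rarely yields exactly 80% and 20%. The following plain-Scala sketch illustrates that idea on an ordinary Seq. It is not Spark's implementation, and the name SplitSketch and its signature are hypothetical.

```scala
import scala.util.Random

// Plain-Scala sketch of a weighted random split (illustrative, not Spark's code).
object SplitSketch {
  // Each element draws a uniform number in [0, 1) and goes to the bucket
  // whose cumulative (normalized) weight range contains that number.
  def randomSplit[A](data: Seq[A], weights: Seq[Double], seed: Long): Seq[Seq[A]] = {
    val total = weights.sum
    // Cumulative upper bounds, e.g. Seq(0.8, 1.0) for weights Seq(0.8, 0.2)
    val bounds = weights.scanLeft(0.0)(_ + _).tail.map(_ / total)
    val rng = new Random(seed)
    val tagged = data.map { x =>
      val u = rng.nextDouble()
      val bucket = bounds.indexWhere(u < _)
      // Fall back to the last bucket if rounding leaves the final bound below u
      (if (bucket >= 0) bucket else weights.length - 1, x)
    }
    weights.indices.map(i => tagged.collect { case (b, x) if b == i => x })
  }
}
```

With the same seed the sketch returns the same partition every time, which is the reason for passing a fixed seed such as 11L when results need to be reproducible.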


4 Logistic regression prediction 1: LogisticRegressionWithSGD

import org.apache.spark.mllib.classification.{LogisticRegressionWithSGD,LogisticRegressionWithLBFGS}
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm

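 // LogisticRegressionWithSGD has been deprecated since Spark 2.0; also fit an intercept term
 val lr = new LogisticRegressionWithSGD().setIntercept(true)
 // Gradient descent settings: step size, no regularization, at most 20 iterations, convergence tolerance
 lr.optimizer.setStepSize(10.0).setRegParam(0.0).setNumIterations(20).setConvergenceTol(0.0005)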
  
 scala>      val model = lr.run(trainingData)
 model: org.apache.spark.mllib.classification.LogisticRegressionModel =  org.apache.spark.mllib.classification.LogisticRegressionModel: intercept = -0.24895905804746296, numFeatures = 4, numClasses = 2, threshold = 0.5  
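
The intercept and threshold printed above determine how a binary logistic regression model turns a feature vector into a class: the weighted sum of the features plus the intercept is passed through the sigmoid function, and the result is compared with the threshold. A minimal sketch of that rule in plain Scala follows; the object name LogisticSketch and the example weights are hypothetical, not values from the trained model.

```scala
// Plain-Scala sketch of binary logistic regression prediction (illustrative only).
object LogisticSketch {
  // Logistic (sigmoid) function: maps any real margin to a score in (0, 1).
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Predict 1.0 when the score exceeds the threshold, otherwise 0.0.
  def predict(weights: Array[Double],
              intercept: Double,
              features: Array[Double],
              threshold: Double = 0.5): Double = {
    require(weights.length == features.length, "weights and features must have the same length")
    val margin = weights.zip(features).map { case (w, x) => w * x }.sum + intercept
    if (sigmoid(margin) > threshold) 1.0 else 0.0
  }
}
```

For example, with hypothetical weights Array(0.0, 0.0, 1.0, 0.0) and intercept -2.5, a flower with petal length 1.4 has a negative margin and is assigned class 0.0, while one with petal length 4.7 is assigned class 1.0.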


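5 Logistic regression prediction 2: LogisticRegressionWithLBFGS
 // Number of target classes (binary here, since only two species remain after filtering)
 val numClasses = 2
 val model = new LogisticRegressionWithLBFGS().setNumClasses(numClasses).run(trainingData)
 
 // Pair each test label with the model's prediction
 val labelAndPreds = testData.map { point =>
   val prediction = model.predict(point.features)
   (point.label, prediction)
 }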
 
 model: org.apache.spark.mllib.classification.LogisticRegressionModel =
 org.apache.spark.mllib.classification.LogisticRegressionModel: intercept = 0.0, numFeatures = 4, numClasses = 2, threshold = 0.5
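
The (label, prediction) pairs in labelAndPreds are computed but not evaluated in the listing above. One common follow-up is the test error rate, the fraction of pairs whose prediction differs from the label. The sketch below works on a plain Seq so that it runs without a cluster; the name EvalSketch is hypothetical. On the RDD itself, the same idea can be written as labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count.

```scala
// Plain-Scala sketch of a test error rate over (label, prediction) pairs.
object EvalSketch {
  // Fraction of pairs where the prediction does not match the label.
  def errorRate(labelAndPreds: Seq[(Double, Double)]): Double = {
    require(labelAndPreds.nonEmpty, "need at least one (label, prediction) pair")
    val wrong = labelAndPreds.count { case (label, pred) => label != pred }
    wrong.toDouble / labelAndPreds.size
  }
}
```

For instance, EvalSketch.errorRate(Seq((0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 0.0))) gives 0.25, since one of the four predictions is wrong.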
