Spark ML 5. Feature Transformation 1


1. Tokenizer

1.1 Algorithm Introduction

  • Category: transformer
  • Tokenizer

Tokenization splits text into individual words. The example below shows how to split sentences into words.

RegexTokenizer provides more tokenization options based on regular expressions. By default, the parameter "pattern" specifies the delimiter used to split the text. Alternatively, the user can set the parameter "gaps" to false, indicating that the regex "pattern" describes the tokens to extract rather than the delimiters, so that all matches of the pattern become the tokenization result.

In the example below, the difference between setting the pattern to "\W" and to "\w+" is illustrated by the following two figures:

[Figures omitted: tokenization results with pattern "\W" vs. pattern "\w+"]
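
Since the figures are not reproduced here, the following minimal sketch shows the two equivalent configurations (it assumes the same sentence column used in the example below):

import org.apache.spark.ml.feature.RegexTokenizer

// Treat the pattern as a delimiter: split on runs of non-word characters (gaps = true is the default)
val splitOnDelimiters = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("words")
  .setPattern("\\W")

// Treat the pattern as the tokens themselves: every match of \w+ becomes a token (gaps = false)
val matchTokens = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("words")
  .setPattern("\\w+")
  .setGaps(false)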

1.2 Code Example

package hnbian.spark.ml.feature.transforming

import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.feature.{RegexTokenizer, Tokenizer}

/**
  * @author hnbian
  * @Description Feature transformation: tokenizer example
  * @Date 2018/12/27 15:02
  **/
object Tokenizer extends App {
  val conf = new SparkConf().setAppName("Tokenizer")
  //Set master to local[4]: run in local mode with 4 simulated worker threads
  conf.setMaster("local[4]")
  //Create the SparkContext
  val sc = new SparkContext(conf)
  val spark = SparkSession.builder().getOrCreate()
  sc.setLogLevel("Error")


  //Define the data; each row can be viewed as one document
  val sentenceDF = spark.createDataFrame(Seq(
    (0, "Hi I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic,regression,models,are,neat")
  )).toDF("label", "sentence")

  sentenceDF.show(false)
  /**
    * +-----+-----------------------------------+
    * |label|sentence                           |
    * +-----+-----------------------------------+
    * |0    |Hi I heard about Spark             |
    * |1    |I wish Java could use case classes |
    * |2    |Logistic,regression,models,are,neat|
    * +-----+-----------------------------------+
    */

  //Define the default Tokenizer and set its input/output columns
  val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
  //Use the tokenizer (a transformer) to transform the data
  val tokenized = tokenizer.transform(sentenceDF)
  //Inspect the transformed result
  tokenized.select("words", "label").take(3).foreach(println)
  /**
    * [WrappedArray(hi, i, heard, about, spark),0]
    * [WrappedArray(i, wish, java, could, use, case, classes),1]
    * [WrappedArray(logistic,regression,models,are,neat),2]
    */
  tokenized.show(false)

  /**
    * +-----+-----------------------------------+------------------------------------------+
    * |label|sentence                           |words                                     |
    * +-----+-----------------------------------+------------------------------------------+
    * |0    |Hi I heard about Spark             |[hi, i, heard, about, spark]              |
    * |1    |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|
    * |2    |Logistic,regression,models,are,neat|[logistic,regression,models,are,neat]     |
    * +-----+-----------------------------------+------------------------------------------+
    */

  //Define a RegexTokenizer and set its input/output columns and delimiter pattern
  val regexTokenizer = new RegexTokenizer()
    .setInputCol("sentence")
    .setOutputCol("words")
    .setPattern("\\W") // alternatively: .setPattern("\\w+").setGaps(false)


  //Transform with the RegexTokenizer
  val regexTokenized = regexTokenizer.transform(sentenceDF)
  //Inspect the transformed data
  regexTokenized.select("words", "label").take(3).foreach(println)
  /**
    * [WrappedArray(hi, i, heard, about, spark),0]
    * [WrappedArray(i, wish, java, could, use, case, classes),1]
    * [WrappedArray(logistic, regression, models, are, neat),2]
    */
  regexTokenized.show(false)
  /**
    * +-----+-----------------------------------+------------------------------------------+
    * |label|sentence                           |words                                     |
    * +-----+-----------------------------------+------------------------------------------+
    * |0    |Hi I heard about Spark             |[hi, i, heard, about, spark]              |
    * |1    |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|
    * |2    |Logistic,regression,models,are,neat|[logistic, regression, models, are, neat] |
    * +-----+-----------------------------------+------------------------------------------+
    */
}

2. Stop Words Removal

2.1 Algorithm Introduction

  • Category: transformer
  • StopWordsRemover

Stop words are words that occur frequently in documents but carry little actual meaning; such words should not be included in the algorithm's input.

StopWordsRemover takes a sequence of strings as input (e.g. the output of a Tokenizer) and removes all stop words from it. The stop word list is supplied via the stopWords parameter. Default stop word lists for several languages can be loaded by calling StopWordsRemover.loadDefaultStopWords(language). The Boolean parameter caseSensitive controls whether matching is case sensitive (false by default, i.e. not case sensitive).
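
For example, a minimal sketch of loading the built-in English stop word list and passing it explicitly (equivalent to the default behaviour):

import org.apache.spark.ml.feature.StopWordsRemover

// Load Spark's built-in English stop word list
val englishStopWords = StopWordsRemover.loadDefaultStopWords("english")

val remover = new StopWordsRemover()
  .setInputCol("raw")
  .setOutputCol("filtered")
  .setStopWords(englishStopWords) // explicit list; same as the default
  .setCaseSensitive(false)        // default: matching is not case sensitive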

Suppose we have the following DataFrame with columns id and raw:

id  raw
0   [I, saw, the, red, baloon]
1   [Mary, had, a, little, lamb]

Applying StopWordsRemover with raw as the input column, we obtain the following filtered result column:

id  raw                           filtered
0   [I, saw, the, red, baloon]    [saw, red, baloon]
1   [Mary, had, a, little, lamb]  [Mary, little, lamb]

Here "I", "the", "had", and "a" have been removed.

2.2 Code Example

package hnbian.spark.ml.feature.transforming

import hnbian.spark.SparkUtils
import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}

/**
  * @author hnbian
  * @Description Feature transformation: stop words removal example
  * @Date 2018/12/27 15:37
  **/
object StopWordsRemover extends App {
  //Obtain the SparkSession
  val spark = SparkUtils.getSparkSession("StopWordsRemover",4)

  import org.apache.spark.ml.feature.StopWordsRemover

  val remover = new StopWordsRemover()
    .setInputCol("raw")
    .setOutputCol("filtered")
    //.setStopWords(Array("saw")) // specify a custom stop word list; only these words will be filtered out
    //.setCaseSensitive(true) // whether matching is case sensitive; default false (not case sensitive)

  //Define the data set
  val dataDF = spark.createDataFrame(Seq(
    (0, Seq("I", "saw", "the", "red", "baloon")),
    (1, Seq("Mary", "had", "a", "little", "lamb"))
  )).toDF("id", "raw")

  dataDF.show(false)
  //Transform the data and inspect the result
  val modelDF = remover.transform(dataDF)

  modelDF.show(false)
  /**
    * +---+----------------------------+--------------------+
    * |id |raw                         |filtered            |
    * +---+----------------------------+--------------------+
    * |0  |[I, saw, the, red, baloon]  |[saw, red, baloon]  |
    * |1  |[Mary, had, a, little, lamb]|[Mary, little, lamb]|
    * +---+----------------------------+--------------------+
    */
}

3. n-gram

3.1 Algorithm Introduction

Category: transformer

An n-gram is a sequence of n consecutive tokens, where n is an integer. NGram can be used to transform input features into n-grams.
NGram takes a sequence of strings as input (e.g. the output of a Tokenizer). The parameter n determines the number of tokens in each n-gram. The output is a sequence of n-grams, where each n-gram is a string of n consecutive tokens joined by spaces. If the input contains fewer than n strings, no output is produced.
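
For instance, a minimal sketch that produces 3-grams instead of the default 2-grams (assuming a words column like the one in the example below):

import org.apache.spark.ml.feature.NGram

// With n = 3, ["Hi", "I", "heard", "about", "Spark"] yields
// ["Hi I heard", "I heard about", "heard about Spark"]
val trigram = new NGram()
  .setInputCol("words")
  .setOutputCol("ngrams")
  .setN(3)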

3.2 Usage Example

package hnbian.spark.ml.feature.transforming

import hnbian.spark.SparkUtils
import org.apache.spark.ml.feature.NGram
/**
  * @author hnbian
  * @Description Feature transformation: n-gram example
  * @Date 2018/12/27 16:16
  **/
object NGram extends App {

  val spark = SparkUtils.getSparkSession("NGram",4)
  val wordDF = spark.createDataFrame(Seq(
    (0, Array("Hi", "I", "heard", "about", "Spark")),
    (1, Array("I", "wish", "Java", "could", "use", "case", "classes")),
    (2, Array("Logistic", "regression", "models", "are", "neat"))
  )).toDF("label", "words")

  wordDF.show(false)
  /**
    * +-----+------------------------------------------+
    * |label|words                                     |
    * +-----+------------------------------------------+
    * |0    |[Hi, I, heard, about, Spark]              |
    * |1    |[I, wish, Java, could, use, case, classes]|
    * |2    |[Logistic, regression, models, are, neat] |
    * +-----+------------------------------------------+
    */
  val ngram = new NGram()
    .setInputCol("words")
    .setOutputCol("ngrams")
    //.setN(3) // number of consecutive tokens per n-gram; default is 2

  val ngramDF = ngram.transform(wordDF)

  ngramDF.take(3).map(_.getAs[Seq[String]]("ngrams").toList).foreach(println)

  ngramDF.show(false)
  /**
    * +-----+------------------------------------------+------------------------------------------------------------------+
    * |label|words                                     |ngrams                                                            |
    * +-----+------------------------------------------+------------------------------------------------------------------+
    * |0    |[Hi, I, heard, about, Spark]              |[Hi I, I heard, heard about, about Spark]                         |
    * |1    |[I, wish, Java, could, use, case, classes]|[I wish, wish Java, Java could, could use, use case, case classes]|
    * |2    |[Logistic, regression, models, are, neat] |[Logistic regression, regression models, models are, are neat]    |
    * +-----+------------------------------------------+------------------------------------------------------------------+
    */
}

4. Binarization

4.1 Algorithm Introduction

Category: transformer

Binarization (Binarizer) is the process of converting continuous numerical features into 0/1 features according to a threshold.

Binarizer has three parameters: the input column, the output column, and the threshold. Feature values greater than the threshold are mapped to 1.0; all other values are mapped to 0.0.
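
In other words, the rule applied to each feature value is simply the following (an illustrative sketch of the rule, not Spark code):

// Binarization rule applied element-wise
def binarize(value: Double, threshold: Double): Double =
  if (value > threshold) 1.0 else 0.0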

4.2 Usage Example

package hnbian.spark.ml.feature.transforming

import hnbian.spark.SparkUtils
import org.apache.spark.ml.feature.Binarizer

/**
  * @author hnbian
  * @Description Feature transformation: binarization example
  * @Date 2018/12/27 16:39
  **/
object Binarizer extends App {
  val spark = SparkUtils.getSparkSession("Binarizer", 4)

  val dataDF = spark
    .createDataFrame(Array((0, 0.1), (1, 0.8), (2, 0.2), (3, 0.6), (4, 0.5)))
    .toDF("label", "feature")
  dataDF.show(false)
  /**
    * +-----+-------+
    * |label|feature|
    * +-----+-------+
    * |0    |0.1    |
    * |1    |0.8    |
    * |2    |0.2    |
    * |3    |0.6    |
    * |4    |0.5    |
    * +-----+-------+
    */

  val binarizer: Binarizer = new Binarizer()
    .setInputCol("feature")
    .setOutputCol("binarized_feature")
    .setThreshold(0.5) // set the threshold: if (value > 0.5) 1.0 else 0.0


  val binarizedDF = binarizer.transform(dataDF)
  binarizedDF.show(false)
  /**
    * +-----+-------+-----------------+
    * |label|feature|binarized_feature|
    * +-----+-------+-----------------+
    * |0    |0.1    |0.0              |
    * |1    |0.8    |1.0              |
    * |2    |0.2    |0.0              |
    * |3    |0.6    |1.0              |
    * |4    |0.5    |0.0              |
    * +-----+-------+-----------------+
    */
  val binarizedFeatures = binarizedDF.select("binarized_feature")
  binarizedFeatures.collect().foreach(println)
  /**
    * [0.0]
    * [1.0]
    * [0.0]
    * [1.0]
    * [0.0]
    */
  binarizedFeatures.show(false)
  /**
    * +-----------------+
    * |binarized_feature|
    * +-----------------+
    * |0.0              |
    * |1.0              |
    * |0.0              |
    * |1.0              |
    * |0.0              |
    * +-----------------+
    */
}

5. Dimensionality Reduction: Principal Component Analysis

5.1 Algorithm Introduction

Category: estimator

PCA (principal component analysis) is a widely used dimensionality reduction method. PCA finds the most informative directions in the feature space and replaces the original n features with k (k < n) new features, removing noise and redundancy in the process. Dimensionality reduction is a preprocessing step for high-dimensional data: it keeps the most important features of the high-dimensional data and discards noise and unimportant features, thereby speeding up subsequent processing. In real production use, reducing dimensionality within an acceptable range of information loss can save a great deal of time and cost, which is why it has become such a common preprocessing technique.

Dimensionality reduction has the following advantages:

1. Makes the data set easier to work with

2. Reduces the computational cost of algorithms

3. Removes noise

4. Makes the results easier to interpret

In PCA, the data is transformed from its original coordinate system into a new one that is determined by the data itself. When choosing the new coordinate system, the direction of maximum variance is taken as the first axis, because the largest variance carries the most important information about the data. The first new axis is the direction of greatest variance in the original data; the second new axis is the direction orthogonal to the first with the next-largest variance. The process is repeated as many times as there are feature dimensions in the original data.

In the new coordinate system obtained this way, most of the variance is concentrated in the first few axes, while the variance along the remaining axes is close to zero. We can therefore ignore the remaining axes and keep only the leading axes that contain most of the variance. In effect, this keeps only the dimensions that carry most of the variance and drops those whose variance is almost zero, which is exactly a dimensionality reduction of the data features.

The implementation of PCA in Spark ML follows these steps (see the sketch after the list):

The original data of 3 rows and 4 columns is converted into a matrix $A_{3 \times 4}$

Compute the covariance matrix $B_{4 \times 4}$ of $A_{3 \times 4}$

Compute the (right) eigenvectors of the covariance matrix $B_{4 \times 4}$

Take the eigenvectors corresponding to the K largest eigenvalues (e.g. k = 2) to form the matrix $C_{4 \times 2}$

Reduce the dimensionality of $A_{3 \times 4}$ to obtain $A'_{3 \times 2} = A_{3 \times 4} \cdot C_{4 \times 2}$
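
A minimal sketch of these steps using the RDD-based RowMatrix API (an alternative to the ml.feature.PCA estimator used in 5.2; it assumes a SparkContext sc is available, and the data values are only illustrative):

import org.apache.spark.mllib.linalg.{Matrix, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Original data: 3 rows x 4 columns, i.e. matrix A(3x4)
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0, 4.0),
  Vectors.dense(2.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 3.0, 2.0, 1.0)
))
val matA = new RowMatrix(rows)

// Eigenvectors of the covariance matrix for the k = 2 largest eigenvalues, i.e. C(4x2)
val matC: Matrix = matA.computePrincipalComponents(2)

// Project onto the principal components: A'(3x2) = A(3x4) * C(4x2)
val reducedA = matA.multiply(matC)
reducedA.rows.foreach(println)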

5.2 Code Example

package hnbian.spark.ml.feature.transforming

import hnbian.spark.SparkUtils
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors


/**
  * @author hnbian
  * @Description Feature transformation: PCA example
  * @Date 2018/12/27 17:40
  **/
object PCA extends App {
  val spark = SparkUtils.getSparkSession("PCA", 4)

  val data = Array(
    Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
    Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
    Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
  )

  val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
  df.show(false)
  /**
    * +---------------------+
    * |features             |
    * +---------------------+
    * |(5,[1,3],[1.0,7.0])  |
    * |[2.0,0.0,3.0,4.0,5.0]|
    * |[4.0,0.0,0.0,6.0,7.0]|
    * +---------------------+
    */
   val pca = new PCA()
    .setInputCol("features")
    .setOutputCol("pcaFeatures")
    .setK(3) // project onto the top 3 principal components
    .fit(df)

  pca.transform(df).show(false)

  /**
    * +---------------------+-----------------------------------------------------------+
    * |features             |pcaFeatures                                                |
    * +---------------------+-----------------------------------------------------------+
    * |(5,[1,3],[1.0,7.0])  |[1.6485728230883807,-4.013282700516296,-5.524543751369388] |
    * |[2.0,0.0,3.0,4.0,5.0]|[-4.645104331781534,-1.1167972663619026,-5.524543751369387]|
    * |[4.0,0.0,0.0,6.0,7.0]|[-6.428880535676489,-5.337951427775355,-5.524543751369389] |
    * +---------------------+-----------------------------------------------------------+
    */
}

6. Dimensionality Expansion: Polynomial Expansion

6.1 Algorithm Introduction

Category: transformer

Polynomial expansion (PolynomialExpansion) expands the original features into a polynomial space by generating the degree-n combinations of the features.

The example below expands the feature set into a degree-3 polynomial space.
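
Concretely, for a two-element feature vector $(x, y)$, the degree-3 expansion produces the nine terms

$ (x,\ x^2,\ x^3,\ y,\ xy,\ x^2y,\ y^2,\ xy^2,\ y^3) $

which is exactly the ordering of the polyFeatures column in the output below; for example, for $(-2.0, 2.3)$ the first three values are $-2.0$, $4.0$ and $-8.0$.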

6.2 Code Example

package hnbian.spark.ml.feature.transforming
import org.apache.spark.ml.feature.PolynomialExpansion
import org.apache.spark.ml.linalg.Vectors
import hnbian.spark.SparkUtils


object PolynomialExpansion extends App {
  val spark = SparkUtils.getSparkSession("PolynomialExpansion",4)
  //Define an array of data; each row has two elements
  val data = Array(
    Vectors.dense(-2.0, 2.3),
    Vectors.dense(0.0, 0.0),
    Vectors.dense(0.6, -1.1)
  )
  //Create the data set
  val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
  //Print the data set
  df.show(false)
  /**
    * +----------+
    * |features  |
    * +----------+
    * |[-2.0,2.3]|
    * |[0.0,0.0] |
    * |[0.6,-1.1]|
    * +----------+
    */
  //Define a polynomial expansion (dimensionality-expanding) transformer
  val polynomialExpansion = new PolynomialExpansion()
    .setInputCol("features")
    .setOutputCol("polyFeatures")
    .setDegree(3) // expand to a degree-3 polynomial (not a 3-dimensional vector)

  //Call the transformer's transform() method on the data set
  val polyDF = polynomialExpansion.transform(df)
  //Inspect the transformed result
  polyDF.show(false)

  /**
    * +----------+--------------------------------------------------------------------------------------------+
    * |features  |polyFeatures                                                                                |
    * +----------+--------------------------------------------------------------------------------------------+
    * |[-2.0,2.3]|[-2.0,4.0,-8.0,2.3,-4.6,9.2,5.289999999999999,-10.579999999999998,12.166999999999996]       |
    * |[0.0,0.0] |[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]                                                       |
    * |[0.6,-1.1]|[0.6,0.36,0.216,-1.1,-0.66,-0.396,1.2100000000000002,0.7260000000000001,-1.3310000000000004]|
    * +----------+--------------------------------------------------------------------------------------------+
    */
}
