A DataFrame makes it convenient to process large-scale structured data. In the Scala API, DataFrame is simply a type alias for Dataset[Row]. (See the original reference.)
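As a minimal sketch of what the alias means in practice, a DataFrame can be handed to code that expects a `Dataset[Row]` without any conversion. This assumes Spark 2.x or later; the object name `AliasSketch`, the helper `asDataset`, and the two-row dataset are made up for illustration and do not come from the article's data.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

object AliasSketch {
  // Compiles without a cast because Spark declares `type DataFrame = Dataset[Row]`.
  def asDataset(df: DataFrame): Dataset[Row] = df

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local").appName("AliasSketch").getOrCreate()
    import spark.implicits._

    // Illustrative data only; not taken from the article's CSV file.
    val df: DataFrame = Seq((1, "a"), (2, "b")).toDF("id", "label")
    val ds: Dataset[Row] = asDataset(df)

    // Each element is a generic Row, so fields are read by position or by name.
    println(ds.first().getInt(0))

    spark.stop()
  }
}
```

Because the rows are untyped, a mistake such as reading a string field with `getInt` only surfaces at run time, which is the main trade-off against a typed `Dataset[T]`.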
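The following shows a few basic DataFrame operations, suitable for beginners:
- Creating a DataFrame
- Setting new column names
- Adding a new column
- Changing a column's element type
- Selecting columns
Example program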
import org.apache.spark.sql.SparkSession

object DataFrame_test {
  def main(args: Array[String]): Unit = {
    println("-------------------------------------Create a DataFrame directly from a file-------------------------------------------")
    val path = "F:/ScalaProject/test/collaborativeFilter/src/main/resources/S1LQG.cvs"
    val spark = SparkSession.builder.master("local").appName("Spark CSV Reader").getOrCreate()
    // No schema is supplied, so every column is read as a string
    val df = spark.read.format("csv").option("header", "true").load(path)
    df.show()
    println("-----------------------------------------Set new column names--------------------------------------------------")
    val newNames = (0 to 16).map(_.toString) // column names "0" through "16"
    val dfRename = df.toDF(newNames: _*)
    dfRename.show()
    println("------------------------------------------Add a new column---------------------------------------------------")
    // Multiply column "2" by 2 and store the result in a new column
    val df2 = dfRename.withColumn("newColumn", dfRename("2") * 2)
    df2.show()
    println("--------------------------------------Change a column's element type-------------------------------------------")
    val df3 = df2.withColumn("newColumn", df2("newColumn").cast("int")) // cast to int
    val df4 = df3.select("newColumn") // select the column to return
    df4.show()
    spark.stop()
  }
}
Output
-------------------------------------Create a DataFrame directly from a file-------------------------------------------
+---------+---------+-------+--------+---------+-----+--------+---------+---------+-------+------+-------+--------+---------+------+-------+-------+
|受端设备侧主轨低频|受端设备侧主轨电压|送端电缆侧电流|接收入口主轨低频|受端电缆侧主轨低频| 功出低频|接收入口主轨电压|受端电缆侧主轨载频|受端电缆侧主轨电压|送端电缆侧载频| 功出电压|送端电缆侧电压|接收入口主轨载频|受端设备侧主轨载频| 功出电流| 功出载频|送端电缆侧低频|
+---------+---------+-------+--------+---------+-----+--------+---------+---------+-------+------+-------+--------+---------+------+-------+-------+
| 235.0| 62340.0| 2460.0| 235.0| 235.0|235.0| 4860.0| 14380.0| 199.0|14380.0|1386.0| 793.0| 14390.0| 14380.0|3490.0|14390.0| 235.0|
| 235.0| 60740.0| 2450.0| 235.0| 235.0|235.0| 4860.0| 14380.0| 199.0|14380.0|1386.0| 793.0| 14390.0| 14380.0|3490.0|14390.0| 235.0|
| 235.0| 60740.0| 2450.0| 235.0| 235.0|235.0| 4860.0| 14380.0| 199.0|14380.0|1386.0| 793.0| 14390.0| 14380.0|3490.0|14390.0| 235.0|
-----------------------------------------Set new column names--------------------------------------------------
+-----+-------+------+-----+-----+-----+------+-------+-----+-------+------+-----+-------+-------+------+-------+-----+
| 0| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 11| 12| 13| 14| 15| 16|
+-----+-------+------+-----+-----+-----+------+-------+-----+-------+------+-----+-------+-------+------+-------+-----+
|235.0|62340.0|2460.0|235.0|235.0|235.0|4860.0|14380.0|199.0|14380.0|1386.0|793.0|14390.0|14380.0|3490.0|14390.0|235.0|
|235.0|60740.0|2450.0|235.0|235.0|235.0|4860.0|14380.0|199.0|14380.0|1386.0|793.0|14390.0|14380.0|3490.0|14390.0|235.0|
|235.0|60740.0|2450.0|235.0|235.0|235.0|4860.0|14380.0|199.0|14380.0|1386.0|793.0|14390.0|14380.0|3490.0|14390.0|235.0|
------------------------------------------Add a new column---------------------------------------------------
+-----+-------+------+-----+-----+-----+------+-------+-----+-------+------+-----+-------+-------+------+-------+-----+---------+
| 0| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 11| 12| 13| 14| 15| 16|newColumn|
+-----+-------+------+-----+-----+-----+------+-------+-----+-------+------+-----+-------+-------+------+-------+-----+---------+
|235.0|62340.0|2460.0|235.0|235.0|235.0|4860.0|14380.0|199.0|14380.0|1386.0|793.0|14390.0|14380.0|3490.0|14390.0|235.0| 4920.0|
|235.0|60740.0|2450.0|235.0|235.0|235.0|4860.0|14380.0|199.0|14380.0|1386.0|793.0|14390.0|14380.0|3490.0|14390.0|235.0| 4900.0|
|235.0|60740.0|2450.0|235.0|235.0|235.0|4860.0|14380.0|199.0|14380.0|1386.0|793.0|14390.0|14380.0|3490.0|14390.0|235.0| 4900.0|
--------------------------------------Change a column's element type-------------------------------------------
+---------+
|newColumn|
+---------+
| 4920|
| 4900|
| 4900|
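
One detail in the output is worth a closer look: the new column first appears as 4920.0 and only becomes 4920 after the cast. The CSV reader treats every column as a string unless the `inferSchema` option is enabled, and the decimal values in the output suggest that Spark converted the string column to double when it was multiplied by 2. The sketch below compares the two schemas on a pair of in-memory CSV lines and then repeats the multiply-and-cast step on the inferred (numeric) column. It assumes Spark 2.2 or later, where `DataFrameReader.csv` accepts a `Dataset[String]`; the names `SchemaSketch`, `readCsv`, `doubledAsInt`, and the columns `a` and `b` are hypothetical, and the sample values simply mirror two columns of the output above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.IntegerType

object SchemaSketch {
  // Parse in-memory CSV lines; `infer` toggles the reader's inferSchema option.
  def readCsv(spark: SparkSession, lines: Seq[String], infer: Boolean): DataFrame = {
    import spark.implicits._
    spark.read
      .option("header", "true")
      .option("inferSchema", infer.toString)
      .csv(lines.toDS())
  }

  // Double column `b`, cast the result to int, and keep only the new column.
  def doubledAsInt(df: DataFrame): DataFrame =
    df.withColumn("doubled", (df("b") * 2).cast(IntegerType)).select("doubled")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local").appName("SchemaSketch").getOrCreate()
    val lines = Seq("a,b", "235.0,2460.0", "235.0,2450.0")

    // Without inference the reader keeps both columns as strings.
    readCsv(spark, lines, infer = false).printSchema()

    // With inference the reader picks numeric types from the values.
    val typed = readCsv(spark, lines, infer = true)
    typed.printSchema()

    doubledAsInt(typed).show()
    spark.stop()
  }
}
```

Enabling `inferSchema` costs an extra pass over the data, so for large files an explicit schema passed to the reader may be the better choice.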