1. Reading a local txt file

When reading a file on the local filesystem, the path must be prefixed with `file://`. Without a scheme, Spark resolves the path against the cluster's default filesystem (`fs.defaultFS`, HDFS here), not the local disk. The transcript below shows both the failing and the working call:

[WBQ@westgisB068 ~]$ spark-shell

Setting default log level to "WARN".

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

23/04/17 11:09:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Spark context Web UI available at http://westgisB068:4040

Spark context available as 'sc' (master = local[*], app id = local-1681700998461).

Spark session available as 'spark'.

Welcome to

      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.2
      /_/

Using Scala version 2.12.15 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_271)

Type in expressions to have them evaluated.

Type :help for more information.

scala> import org.apache.spark.sql.SparkSession

import org.apache.spark.sql.SparkSession

scala> val df = sc.textFile("/home/WBQ/1.txt")

df: org.apache.spark.rdd.RDD[String] = /home/WBQ/1.txt MapPartitionsRDD[1] at textFile at <console>:24

scala> df.first()

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://westgisB064:8020/home/WBQ/1.txt

at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:304)

at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:244)

at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:332)

at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:208)

at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:292)

at scala.Option.getOrElse(Option.scala:189)

at org.apache.spark.rdd.RDD.partitions(RDD.scala:288)

at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)

at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:292)

at scala.Option.getOrElse(Option.scala:189)

at org.apache.spark.rdd.RDD.partitions(RDD.scala:288)

at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1449)

at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)

at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)

at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)

at org.apache.spark.rdd.RDD.take(RDD.scala:1443)

at org.apache.spark.rdd.RDD.$anonfun$first$1(RDD.scala:1484)

at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)

at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)

at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)

at org.apache.spark.rdd.RDD.first(RDD.scala:1484)

... 47 elided

Caused by: java.io.IOException: Input path does not exist: hdfs://westgisB064:8020/home/WBQ/1.txt

at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:278)

... 67 more

scala> val df = sc.textFile("file:///home/WBQ/1.txt")

df: org.apache.spark.rdd.RDD[String] = file:///home/WBQ/1.txt MapPartitionsRDD[3] at textFile at <console>:24

scala> df.first()

res2: String = 1 2 3 gfv 4 ff1 2 3 gfv 4 ff

scala>
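The scheme handling above can be sketched in plain Scala. `toLocalUri` below is a hypothetical helper (not part of Spark's API) that makes the `file://` prefix explicit before a path is handed to `sc.textFile`:

```scala
// Hypothetical helper: treat a path without a scheme as a local-filesystem
// path. Paths that already carry a scheme (file://, hdfs://, ...) are left alone.
def toLocalUri(path: String): String =
  if (path.contains("://")) path   // already has an explicit scheme
  else "file://" + path            // no scheme: mark it as local

println(toLocalUri("/home/WBQ/1.txt"))           // file:///home/WBQ/1.txt
println(toLocalUri("hdfs://nn:8020/data/1.txt")) // unchanged
```

With such a helper, `sc.textFile(toLocalUri("/home/WBQ/1.txt"))` reads from the local filesystem regardless of what `fs.defaultFS` is set to.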

2. Reading a local CSV file

2.1 Transferring the data

First copy the data files to the node where spark-shell will run:

[WBQ@westgisB064 soft]$ scp -r traffic/ WBQ@10.103.105.68:/home/WBQ/soft/

2.2 Reading the data in spark-shell

[WBQ@westgisB068 ~]$ spark-shell

Setting default log level to "WARN".

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

23/04/17 11:23:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Spark context Web UI available at http://westgisB068:4040

Spark context available as 'sc' (master = local[*], app id = local-1681701827307).

Spark session available as 'spark'.

Welcome to

      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.2
      /_/

Using Scala version 2.12.15 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_271)

Type in expressions to have them evaluated.

Type :help for more information.

scala> import org.apache.spark.sql.SparkSession

import org.apache.spark.sql.SparkSession

scala> val df = spark.read.format("csv").option("header", "true").option("encoding", "utf-8").load("file:///home/WBQ/soft/data/traffic/part-*")

df: org.apache.spark.sql.DataFrame = [卡号: string, 交易时间: string ... 2 more fields]

scala> df.show(5)

+-------+--------------------+--------+--------+

| 卡号| 交易时间|线路站点|交易类型|

+-------+--------------------+--------+--------+

|3697647|2018-10-01T18:47:...| 大新|地铁出站|

|3697647|2018-10-01T18:35:...|宝安中心|地铁入站|

|3697647|2018-10-01T13:49:...| 大新|地铁入站|

|3697647|2018-10-01T14:03:...|宝安中心|地铁出站|

|5344820|2018-10-17T09:34:...| 罗湖|地铁入站|

+-------+--------------------+--------+--------+

only showing top 5 rows

scala>
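What `option("header", "true")` does can be illustrated in plain Scala: the first line supplies the column names, and the remaining lines become rows keyed by those names. This is a conceptual sketch only (Spark's real CSV reader also handles quoting, escapes, and encodings); the sample row mirrors the output above, with the truncated timestamp completed arbitrarily for the example:

```scala
// Conceptual sketch of header handling: first line = column names,
// remaining lines = data rows keyed by those names.
val lines = Seq(
  "卡号,交易时间,线路站点,交易类型",
  "3697647,2018-10-01T18:47:00,大新,地铁出站"
)
val header = lines.head.split(",")
val rows   = lines.tail.map(line => header.zip(line.split(",")).toMap)

println(rows.head("线路站点"))  // 大新
println(rows.head("交易类型"))  // 地铁出站
```

Without the header option, Spark would instead treat the first line as data and assign generic column names (`_c0`, `_c1`, ...).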

3. Why reading the local file failed

Error message: the input path does not exist. Note that because the `file://` prefix was omitted, Spark looked for the path on HDFS (`hdfs://westgisB064:8020/...`), not on the local filesystem.

Check: the file indeed does not exist at that path.

Fix: transfer the data from another node to the local node (as in section 2.1), then read it with the `file://` prefix.

scala> df.first()

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://westgisB064:8020/home/WBQ/1.txt

... (stack trace omitted; identical to the trace in section 1)

scala>
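To fail fast with a clearer message than the `InvalidInputException` above, the driver can check that the file exists locally before calling `sc.textFile`. A minimal sketch using the JDK's `java.nio.file` (the helper name and message text are my own, not Spark's):

```scala
import java.nio.file.{Files, Paths}

// Verify a local path exists on this (driver) node before handing it to
// Spark; otherwise sc.textFile only fails later, when an action runs.
def requireLocalFile(path: String): String = {
  require(Files.exists(Paths.get(path)), s"local file not found: $path")
  "file://" + path
}
```

Usage: `sc.textFile(requireLocalFile("/home/WBQ/1.txt"))`. Note that when running against a real cluster (rather than `local[*]`), a `file://` path must exist on every executor node, which is why copying the data to each node (section 2.1) or putting it on HDFS is the usual fix.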
