[Touge (EduCoder) Lab] PySpark Streaming Data Sources: MySQL and Kafka

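Contents

Level 1: MySQL Data Source
    Task Description
    Background Knowledge: PySpark JDBC Overview; PySpark JDBC; PySpark Streaming JDBC
    Programming Requirements
    Test Notes
    Answer Code

Level 2: Kafka Data Source
    Task Description
    Background Knowledge: Kafka Overview; Kafka Basics; PySpark Streaming Kafka
    Programming Requirements
    Test Notes
    Answer Code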

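Level 1: MySQL Data Source

Task Description

Task for this level: read data from a socket stream, compute word frequencies, and write the results to MySQL.

Background Knowledge

To complete this level, you need to understand:

PySpark JDBC overview; PySpark JDBC; PySpark Streaming JDBC.

PySpark JDBC Overview

PySpark can connect to external databases through JDBC and load the data as a DataFrame; Spark SQL can likewise read from and write to those databases. Besides JDBC, the supported data sources include Parquet, JSON, and Hive.

PySpark JDBC

Before turning to PySpark Streaming JDBC, let's first see how JDBC is used in PySpark.

Requirements:

Read data from MySQL; write data to MySQL.

First, open the command-line window on the right and wait for it to connect. Then enter MySQL, create a database, create a table in it, and insert a few rows. The names and data are up to you; the commands below use a test database and a student table.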

# Start the MySQL service
service mysql start

# Log in to MySQL
mysql -uroot -p123123

# Create the test database
create database if not exists test;

# Create the table
use test;
create table if not exists student(
    id int,
    name varchar(50),
    class varchar(50));

# Insert data
insert into student values(1,"zhangsan","A");
insert into student values(2,"lisi","B");
insert into student values(3,"wangwu","C");

Once that is done, exit the MySQL client and enter the python3 shell.

python3

Now write the program. Step 1: import the required packages.

from findspark import init
init()
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

Step 2: create the Spark session.

spark = SparkSession.builder.appName("read_mysql").master("local[*]").getOrCreate()

Step 3: read the data from MySQL.

dataFrame = (
    spark.read.format("jdbc")
    .option("driver", "com.mysql.jdbc.Driver")
    .option("url", "jdbc:mysql://localhost:3306/test")
    .option("dbtable", "student")
    .option("user", "root")
    .option("password", "123123")
    .load()
)

Step 4: display the data that was read.

# Note: show() displays only the first 20 rows by default.
dataFrame.show()

The output shows the rows of the student table.

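Step 5: append the data that was just read back to the same table.

(
    dataFrame.write.format("jdbc")
    .option("driver", "com.mysql.jdbc.Driver")
    .option("url", "jdbc:mysql://localhost:3306/test")
    .option("dbtable", "student")
    .option("user", "root")
    .option("password", "123123")
    .mode(saveMode="append")
    .save()
)

Log in to MySQL and check the result. Because the rows are appended to the table they were read from, each row now appears twice.

# Log in to MySQL
mysql -uroot -p123123

# Query the data
select * from test.student;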
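The read and write calls above repeat the same five connection options. A minimal sketch of factoring them out, assuming the same driver class, URL, and credentials used in this walkthrough; `jdbc_options`, `read_table`, and `append_table` are hypothetical helper names, not part of PySpark:

```python
def jdbc_options(database, table, user="root", password="123123",
                 host="localhost", port=3306):
    """Build the option dict shared by the JDBC reader and writer."""
    return {
        "driver": "com.mysql.jdbc.Driver",  # driver class used in this walkthrough
        "url": f"jdbc:mysql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
    }


def read_table(spark, database, table):
    """Load a MySQL table as a DataFrame; `spark` is an existing SparkSession."""
    return spark.read.format("jdbc").options(**jdbc_options(database, table)).load()


def append_table(df, database, table):
    """Append the rows of DataFrame `df` to a MySQL table."""
    df.write.format("jdbc").options(**jdbc_options(database, table)).mode("append").save()
```

With these helpers, steps 3 and 5 would reduce to `dataFrame = read_table(spark, "test", "student")` and `append_table(dataFrame, "test", "student")`.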

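PySpark Streaming JDBC

Having seen how JDBC is used from PySpark, we now look at how a PySpark Streaming job connects to MySQL. The example below writes through the pymysql client library.

Requirement: read a socket stream, compute word frequencies, and write the results to MySQL.

First, open the command-line window on the right and wait for it to connect. Then enter MySQL, create the spark database, and create the wordcount table in it.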

# Start the MySQL service
service mysql start

# Log in to MySQL
mysql -uroot -p123123

# Create the spark database
create database if not exists spark;

# Create the table
use spark;
create table if not exists wordcount(
    word varchar(50),
    count int);

Once that is done, go to the home directory /root, create the code file mysql.py, and edit it.

cd /root
vi mysql.py

Now write the program. Step 1: import the required packages.

from findspark import init
init()
import time
import pymysql
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

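Step 2: create the Spark environment and set the checkpoint.

sc = SparkContext(appName="mysql_streaming", master="local[*]")
# Batch interval of 10 seconds
ssc = StreamingContext(sc, 10)

# Configure the socket stream
inputStream = ssc.socketTextStream("localhost", 7777)

# Set the checkpoint directory (updateStateByKey requires one)
ssc.checkpoint("/usr/local/spark")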

Step 3: process the data.

# State update function: adds this batch's counts to the running total
def updateFunction(newValues, runningCount):
    if runningCount is None:
        runningCount = 0
    return sum(newValues, runningCount)

pairs = inputStream.flatMap(lambda x: x.split(" ")).filter(lambda x: x != "").map(lambda x: (x, 1))
wordCounts = pairs.updateStateByKey(updateFunction)
wordCounts.pprint(100)
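How updateStateByKey drives updateFunction can be traced without Spark. The sketch below is a plain-Python simulation under simplifying assumptions, not Spark's implementation: `simulate_batches` is a hypothetical helper, and each inner list stands for the lines received in one batch interval. On every batch, the update function is called for each known key with the list of new values (possibly empty) and the previous state (None for a key seen for the first time).

```python
def updateFunction(newValues, runningCount):
    # Same function as in the tutorial: add the new 1s to the running total.
    if runningCount is None:
        runningCount = 0
    return sum(newValues, runningCount)


def simulate_batches(batches):
    """Apply the flatMap/filter/map/updateStateByKey pipeline to lists of lines."""
    state = {}
    for lines in batches:
        # flatMap + filter + map: one (word, 1) pair per non-empty token,
        # grouped by word so each key sees its list of new values.
        grouped = {}
        for line in lines:
            for word in line.split(" "):
                if word != "":
                    grouped.setdefault(word, []).append(1)
        # Every key, old or new, is updated on every batch.
        for key in set(state) | set(grouped):
            state[key] = updateFunction(grouped.get(key, []), state.get(key))
    return state


final_state = simulate_batches([
    ["hello pyspark", "hello pyspark streaming"],  # first batch
    ["hello jdbc"],                                # second batch
])
print(sorted(final_state.items()))
# [('hello', 3), ('jdbc', 1), ('pyspark', 2), ('streaming', 1)]
```

Words that receive no new values in a batch keep their previous total, which is why the state stream still contains them.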

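Step 4: write the results to MySQL.

def dbfunc(records):
    # One connection per partition; PyMySQL 1.0+ requires keyword arguments
    db = pymysql.connect(host="localhost", user="root", password="123123", database="spark")
    cursor = db.cursor()

    def doinsert(p):
        # Parameterized query: the driver escapes the values
        sql = "insert into wordcount(word,count) values (%s, %s)"
        try:
            cursor.execute(sql, (str(p[0]), int(p[1])))
            db.commit()
        except Exception:
            db.rollback()

    for item in records:
        doinsert(item)
    db.close()

# Repartition each RDD and write every partition to MySQL
def func(rdd):
    repartitionedRDD = rdd.repartition(3)
    repartitionedRDD.foreachPartition(dbfunc)

wordCounts.foreachRDD(func=func)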
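The per-partition pattern in dbfunc (open one connection, insert every record with a parameterized statement, commit, close) can be tried without a MySQL server. The sketch below uses Python's built-in sqlite3 as a stand-in, which is an assumption made for portability: sqlite3 uses `?` placeholders where pymysql uses `%s`, and `create_table`, `write_partition`, `read_all`, and `db_path` are hypothetical names.

```python
import os
import sqlite3
import tempfile


def create_table(db_path):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS wordcount (word VARCHAR(50), count INT)")
    conn.commit()
    conn.close()


def write_partition(records, db_path):
    """Write one partition's (word, count) records over a single connection."""
    conn = sqlite3.connect(db_path)
    try:
        conn.executemany(
            "INSERT INTO wordcount (word, count) VALUES (?, ?)",
            [(str(word), int(count)) for word, count in records],
        )
        conn.commit()
    except sqlite3.Error:
        conn.rollback()
        raise
    finally:
        conn.close()


def read_all(db_path):
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT word, count FROM wordcount ORDER BY word").fetchall()
    conn.close()
    return rows


db_path = os.path.join(tempfile.mkdtemp(), "wordcount.db")
create_table(db_path)
# Each list plays the role of one RDD partition handed to foreachPartition.
for partition in [[("hello", 3), ("pyspark", 2)], [("jdbc", 1)]]:
    write_partition(iter(partition), db_path)
print(read_all(db_path))
# [('hello', 3), ('jdbc', 1), ('pyspark', 2)]
```

Opening the connection inside the partition function matters because the function runs on the workers, so a connection created on the driver could not be shipped to them.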

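Step 5: start and stop the stream.

ssc.start()
# Let the stream run for 30 seconds
time.sleep(30)
ssc.stop()

Step 6: open a new command-line window and start the data stream service.

nc -l -p 7777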
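If nc is not available, a few lines of Python can play the same role for testing. This is only a sketch under simplifying assumptions: it serves a single client, sends a fixed list of lines as soon as the client connects, and then closes the connection, whereas an interactive nc session stays open. `open_server` and `serve_lines` are hypothetical names.

```python
import socket


def open_server(host="localhost", port=7777):
    """Listen on host:port, the address socketTextStream connects to."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    return server


def serve_lines(server, lines):
    """Wait for one client, then send each line terminated by a newline."""
    conn, _addr = server.accept()
    with conn:
        for line in lines:
            conn.sendall((line + "\n").encode("utf-8"))
    server.close()
```

A call such as `serve_lines(open_server(), ["hello pyspark", "hello jdbc"])` blocks until the streaming program connects, so it would be started before running the program, just like nc.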

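Step 7: return to the code file window and run the program.

python3 /root/mysql.py

Step 8: once the program has started, switch to the data stream window and enter the following data:

hello pyspark
hello pyspark streaming
hello jdbc

After the program finishes, log in to MySQL and check the result:

# Log in to MySQL
mysql -uroot -p123123

# Query the data
select distinct(word),count from spark.wordcount;

The query returns the distinct (word, count) pairs that were written to the table.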
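Because wordCounts carries the full running state, every 10-second batch writes the current total of every word, so the table accumulates one row per word per batch. `select distinct` collapses identical (word, count) rows but still returns several rows for a word whose total changed between batches. The sketch below illustrates this with Python's built-in sqlite3 standing in for MySQL (an assumption made for portability; the batch snapshots are illustrative, not captured output) and compares it with a `group by` query that keeps only the largest, and therefore latest, total per word:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wordcount (word VARCHAR(50), count INT)")

# Illustrative snapshots of the running state, one list per batch.
batches = [
    [("hello", 2), ("pyspark", 2), ("streaming", 1)],
    [("hello", 3), ("pyspark", 2), ("streaming", 1), ("jdbc", 1)],
    [("hello", 3), ("pyspark", 2), ("streaming", 1), ("jdbc", 1)],
]
for snapshot in batches:
    conn.executemany("INSERT INTO wordcount (word, count) VALUES (?, ?)", snapshot)
conn.commit()

# Same shape as the tutorial's query: distinct (word, count) pairs.
distinct_rows = conn.execute(
    "SELECT DISTINCT word, count FROM wordcount ORDER BY word, count"
).fetchall()

# Alternative: one row per word with its final total.
latest_rows = conn.execute(
    "SELECT word, MAX(count) FROM wordcount GROUP BY word ORDER BY word"
).fetchall()

print(distinct_rows)
# [('hello', 2), ('hello', 3), ('jdbc', 1), ('pyspark', 2), ('streaming', 1)]
print(latest_rows)
# [('hello', 3), ('jdbc', 1), ('pyspark', 2), ('streaming', 1)]
```

Taking the maximum works here because a running word count never decreases.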

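Programming Requirements

Open the code file window on the right and complete the code between Begin and End. The program should read the socket stream, split each line on spaces, and compute word frequencies. In MySQL, create the work database and, in it, the wordcount table with the fields word (string) and count (integer); write the word counts to that table.

Code file: /data/workspace/myshixun/project/step1/work.py

Socket stream settings:

Address: localhost
Port: 8888
Input data: the text below.

After the program has started (about 5 s), enter the data within 60 seconds. To change the time window, modify time.sleep(60) in the code file.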

When summer comes, people like to go to the beach and play in the seawater.

It is such a good way to drive away the hotness.

But it has been reported that many people drawn while they were swimming on the beach.

The people who died were good at swimming, the reason they got killed was the invisible demon under the seawater.

In the afternoon, there are some vortexes under the seawater, which people can’t see.

When people go swimming, they will be absorbed by the vortexes, even though they are good at swimming, they can’t resist the strong power.

So when we go to play in the beach, we must take care.

After entering the text, remember to press Enter.
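Since the program splits only on spaces, punctuation stays attached to the words, so "beach," and "beach" are counted as different keys. A small plain-Python illustration on the last input sentence, using collections.Counter in place of the streaming pipeline:

```python
from collections import Counter

line = "So when we go to play in the beach, we must take care."

# Same tokenization as the streaming job: split on single spaces, drop empty tokens.
words = [w for w in line.split(" ") if w != ""]
counts = Counter(words)

print(counts["we"])      # 2
print(counts["beach,"])  # 1
print(counts["beach"])   # 0
```

The exercise expects exactly this space-based splitting, so the punctuation should be left as it is.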

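MySQL settings:

Account: root
Password: 123123
Address: localhost
Port: 3306

Run the evaluation only after the program has finished; otherwise the final result will be affected.

Test Notes

The platform evaluates the code you wrote. If the output matches the expected result, you pass the level; otherwise the test fails.

Answer Code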

from findspark import init
init()
import time
import pymysql
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="mysql_streaming", master="local[*]")
ssc = StreamingContext(sc, 10)

# Set the checkpoint directory
ssc.checkpoint("/usr/local/work")

# State update function: adds this batch's counts to the running total
def updateFunction(newValues, runningCount):
    if runningCount is None:
        runningCount = 0
    return sum(newValues, runningCount)

# Configure the socket stream
############### Begin ###############
inputStream = ssc.socketTextStream("localhost", 8888)
############### End ###############

pairs = inputStream.flatMap(lambda x: x.split(" ")).filter(lambda x: x != "").map(lambda word: (word, 1))
wordCounts = pairs.updateStateByKey(updateFunction)
wordCounts.pprint(100)

def dbfunc(records):
    # Write the records passed in to MySQL
    ############### Begin ###############
    # Connect to the MySQL database
    connection = pymysql.connect(
        host='localhost',
        user='root',
        password='123123',
        database='work',
        port=3306,
    )
    with connection.cursor() as cursor:
        # Insert each (word, count) record with a parameterized statement
        for record in records:
            word, count = record
            cursor.execute('INSERT INTO wordcount (word, count) VALUES (%s, %s)', (word, count))
    connection.commit()
    connection.close()
    ############### End ###############

# Repartition each RDD and write every partition to MySQL
def func(rdd):
    repartitionedRDD = rdd.repartition(3)
    repartitionedRDD.foreachPartition(dbfunc)

wordCounts.foreachRDD(func=func)

ssc.start()
time.sleep(60)
ssc.stop()

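Open a command-line window.

# Start the MySQL service
service mysql start

# Log in to MySQL
mysql -uroot -p123123

# Create the work database
create database if not exists work;

# Create the table
use work;
create table if not exists wordcount(
    word varchar(50),
    count int
);

# Exit MySQL
exit

# Create the checkpoint directory
mkdir -p /usr/local/work/

# Start the data stream service on port 8888
nc -l -p 8888

Open a second window.

chmod 777 /data/workspace/myshixun/project/step1/work.py
python3 /data/workspace/myshixun/project/step1/work.py # The program is now running; enter the data below within 60 seconds

Return to the first window, paste the data below, and press Enter once more.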

When summer comes, people like to go to the beach and play in the seawater.

It is such a good way to drive away the hotness.

But it has been reported that many people drawn while they were swimming on the beach.

The people who died were good at swimming, the reason they got killed was the invisible demon under the seawater.

In the afternoon, there are some vortexes under the seawater, which people can’t see.

When people go swimming, they will be absorbed by the vortexes, even though they are good at swimming, they can’t resist the strong power.

So when we go to play in the beach, we must take care.

Level 2: Kafka Data Source

Task Description

Task for this level: read the data produced by Kafka and print it.
