Part 1: Installing Ubuntu and VMware Tools

During installation, choose "Something else" (manual partitioning).

Create a first partition of 512 MB as swap space.

Create a second partition from the remaining space, with mount point /.

If the installer window runs off the screen, hold the Alt key and drag it upward.

Reboot when the installation finishes.

Install VMware Tools:

When the VMware Tools CD pops up, right-click inside it and open a terminal.

Right-click the archive and extract it to the home folder.

After extraction, run the .pl installer with ./ (typically ./vmware-install.pl).

Reboot for the changes to take effect.
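Later steps read the installer archives from a VMware shared folder mounted at /mnt/hgfs/share (the folder name share is an assumption; use whatever name you configured in the VM settings). A quick sanity check after the reboot that VMware Tools mounted it:

# List the shared folders exposed by VMware Tools; "share" is the name assumed by later steps
ls /mnt/hgfs/
ls /mnt/hgfs/share/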

Part 2: Configuring User Information

1. Create a hadoop user:

sudo useradd -m hadoop -s /bin/bash

2. Set a password for the hadoop user:

sudo passwd hadoop

3. Grant the hadoop user sudo privileges:

sudo adduser hadoop sudo

4. Switch to the hadoop user:

sudo -su hadoop

5. Install vim:

sudo apt install vim

6. Install the SSH server:

sudo apt-get install openssh-server

Part 3: Configuring SSH

1. Check the IP address:

ifconfig

ssh localhost

Setting a static IP on Ubuntu

Step 1:

Run sudo vi /etc/network/interfaces to open the file and change its contents to:

auto lo

iface lo inet loopback

auto ens33

iface ens33 inet static

address 192.168.52.141

netmask 255.255.255.0

gateway 192.168.52.1

dns-nameservers 114.114.114.114

Fill in the values above according to your own environment; the interface name, static IP address, and gateway must match your actual setup. Save and exit.

Run reboot to restart the system; once it comes back up it will be using the static IP address you configured.
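A quick check after the reboot, assuming the interface name (ens33) and addresses used in the file above:

# The interface should now show the static address 192.168.52.141
ifconfig ens33
# The gateway from the interfaces file should answer
ping -c 3 192.168.52.1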

2. Clone two more machines (create full clones).

3. Grant the hadoop user ownership of /usr/local/hadoop (on all three machines):

sudo -su root

cd /usr/local

chown -R hadoop:hadoop hadoop/

sudo -su hadoop

4. Change the hostname:

sudo vim /etc/hostname

Set it to master, slave1, and slave2 respectively.

5. Edit the hosts file (it must be identical on all three machines); a name-resolution check follows the entries below:

sudo vim /etc/hosts

127.0.0.1 localhost

192.168.52.141 master

192.168.52.142 slave1

192.168.52.143 slave2
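With the hosts entries above in place on all three machines, each hostname should resolve and answer:

# Run on any of the three machines
ping -c 2 master
ping -c 2 slave1
ping -c 2 slave2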

6. Generate a key pair:

ssh-keygen -t rsa -P ""

7. Append the public key to the local authorized_keys:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

8. Copy the public key to slave1 and slave2 with scp:

scp ~/.ssh/id_rsa.pub hadoop@slave1:/home/hadoop/

scp ~/.ssh/id_rsa.pub hadoop@slave2:/home/hadoop/

9. On slave1 and slave2, append the received key and delete the copy; a passwordless-login check follows:

cat ~/id_rsa.pub >> ~/.ssh/authorized_keys

rm ~/id_rsa.pub
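Back on master, each of the following should print the remote hostname without asking for a password (master must also accept its own key, since the Hadoop start scripts ssh into every node, including itself):

ssh master hostname
ssh slave1 hostname
ssh slave2 hostname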

Part 4: Installing and Configuring the JDK

1. Go to the directory that contains the archive:

cd /mnt/hgfs/share/

2. Create the target directory:

sudo mkdir -p /usr/lib/jvm

3. Extract the archive:

sudo tar -zxvf jdk-8u162-linux-x64.tar.gz -C /usr/lib/jvm/

4. After extraction, rename the JDK directory:

cd /usr/lib/jvm

sudo mv jdk1.8.0_162/ jdk

5. Edit the ~/.bashrc file:

sudo vim ~/.bashrc

Append the following at the end:

export JAVA_HOME=/usr/lib/jvm/jdk

export JRE_HOME=${JAVA_HOME}/jre

export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib

export PATH=${JAVA_HOME}/bin:$PATH

6. Make the environment variables take effect:

source ~/.bashrc
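A minimal check that the JDK is picked up from the new variables:

# Both should point at the JDK unpacked under /usr/lib/jvm/jdk
echo $JAVA_HOME
java -version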

7. Then repeat the steps above on slave1 and slave2,

or package the JDK and send it over:

cd /usr/lib/jvm

tar -zcf ~/jdk.master.tar.gz ./jdk

cd ~

scp ./jdk.master.tar.gz hadoop@slave1:/home/hadoop

scp ./jdk.master.tar.gz hadoop@slave2:/home/hadoop

8. On slave1 and slave2, create the directory and extract:

sudo mkdir -p /usr/lib/jvm

sudo tar -zxf ~/jdk.master.tar.gz -C /usr/lib/jvm

9. Edit the ~/.bashrc file:

sudo vim ~/.bashrc

Append the following at the end:

export JAVA_HOME=/usr/lib/jvm/jdk

export JRE_HOME=${JAVA_HOME}/jre

export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib

export PATH=${JAVA_HOME}/bin:$PATH

10. Make the environment variables take effect:

source ~/.bashrc

Part 5: Installing Hadoop

(1) Go to the directory that contains the archive:

cd /mnt/hgfs/share/

(2) Extract the archive:

sudo tar -zxvf hadoop-2.7.1.tar.gz -C /usr/local/

(3) Go to the target directory /usr/local and rename the folder:

cd /usr/local

sudo mv hadoop-2.7.1/ hadoop

(4) Edit the ~/.bashrc file:

sudo vim ~/.bashrc

Append the following at the end:

export HADOOP_HOME=/usr/local/hadoop

export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
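Reload the shell configuration so the hadoop command is on the PATH, then ask it for its version as a quick check:

source ~/.bashrc
hadoop version    # should report Hadoop 2.7.1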

(5) Go to /usr/local/hadoop/etc/hadoop and edit the slaves file:

cd /usr/local/hadoop/etc/hadoop

sudo vim slaves

Change its contents to:

slave1

slave2

(6) Edit the Hadoop configuration files

1. Back up core-site.xml, then edit it:

sudo cp core-site.xml core-site-cp.xml

sudo vim core-site.xml

Change it to:

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
    </property>
</configuration>

2. Back up hdfs-site.xml, then edit it:

sudo cp hdfs-site.xml hdfs-site-cp.xml

sudo vim hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/data</value>
    </property>
</configuration>

3. Edit mapred-site.xml (copy mapred-site.xml.template to mapred-site.xml first):

sudo cp mapred-site.xml.template mapred-site.xml

sudo vim mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

4. Back up yarn-site.xml, then edit it:

sudo cp yarn-site.xml yarn-site-cp.xml

sudo vim yarn-site.xml

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
    </property>
</configuration>

5. Edit hadoop-env.sh (sudo vim hadoop-env.sh) and add at the end:

export JAVA_HOME=/usr/lib/jvm/jdk

(7) Send the finished configuration to the slave nodes

cd /usr/local/

Delete the temporary files:

rm -rf ./hadoop/tmp

Delete the log files:

rm -rf ./hadoop/logs/*

sudo tar -zcf ~/hadoop.master.tar.gz ./hadoop

cd ~

scp ./hadoop.master.tar.gz slave1:/home/hadoop

scp ./hadoop.master.tar.gz slave2:/home/hadoop

On slave1 and slave2 run:

sudo tar -zxf ~/hadoop.master.tar.gz -C /usr/local

On the master host run:

cd /usr/local/hadoop

(8) Format the NameNode and start the cluster

!!! (Taking a VM snapshot first is recommended) !!!

bin/hdfs namenode -format

sbin/start-all.sh

Verify:

jps
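Beyond jps, a rough health check of HDFS and YARN from the master (paths relative to /usr/local/hadoop; the web UI ports are the Hadoop 2.x defaults):

bin/hdfs dfsadmin -report    # should list two live datanodes (slave1, slave2)
bin/yarn node -list          # should list two running NodeManagers
# Web UIs: http://master:50070 (HDFS NameNode), http://master:8088 (YARN ResourceManager)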

--------- Environment variables (full set of ~/.bashrc additions, for reference) ---------

export JAVA_HOME=/usr/lib/jvm/jdk

export JRE_HOME=${JAVA_HOME}/jre

export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib

export PATH=${JAVA_HOME}/bin:$PATH

export HADOOP_HOME=/usr/local/hadoop

# export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

export PATH=$PATH:/usr/local/hadoop/bin:/usr/local/hadoop/sbin

export SPARK_HOME=/usr/local/spark

export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib:$HADOOP_COMMON_LIB_NATIVE_DIR"

export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH

export PYSPARK_PYTHON=python3

----------------------

Part 6: Installing Spark

1. Go to the directory that contains the archive:

cd /mnt/hgfs/share/

Extract the archive:

sudo tar -zxvf spark-2.4.0-bin-without-hadoop.tgz -C /usr/local/

2. Change ownership and rename:

cd /usr/local

sudo -su root

chown -R hadoop:hadoop spark-2.4.0-bin-without-hadoop/

mv spark-2.4.0-bin-without-hadoop spark

sudo -su hadoop

3. Edit the ~/.bashrc file:

sudo vim ~/.bashrc

Append the following at the end:

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib:$HADOOP_COMMON_LIB_NATIVE_DIR"

export SPARK_HOME=/usr/local/spark

export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH

export PYSPARK_PYTHON=python3

export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

4. Make the environment variables take effect:

source ~/.bashrc
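A quick check that the Spark variables are visible and that the py4j archive named in PYTHONPATH actually exists for this Spark release:

echo $SPARK_HOME                  # should print /usr/local/spark
ls /usr/local/spark/python/lib/   # should contain py4j-0.10.7-src.zip for Spark 2.4.0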

5. Configure the spark-env.sh file:

cd /usr/local/spark/conf

sudo cp spark-env.sh.template spark-env.sh

sudo vim spark-env.sh

Edit spark-env.sh and add the following:

export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)

export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop

export SPARK_MASTER_IP=192.168.52.141

Replace the IP address above with your own master IP.

6. After configuring, copy the /usr/local/spark folder from the master host to each node. On the master host run:

cd /usr/local/

tar -zcf ~/spark.master.tar.gz ./spark

cd ~

scp ./spark.master.tar.gz hadoop@slave1:/home/hadoop

scp ./spark.master.tar.gz hadoop@slave2:/home/hadoop

On slave1 and slave2, run the following:

sudo tar -zxf ~/spark.master.tar.gz -C /usr/local

Then check that the spark folder is owned by hadoop; if it is not, assign ownership again:

sudo chown -R hadoop /usr/local/spark
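A minimal way to perform that check:

ls -ld /usr/local/spark    # the owner column should read hadoop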

Lab 1

1. Verify that Spark is installed. Before starting Spark, Hadoop must be started first; this only needs to be run on the master node,

as the hadoop user:

cd /usr/local/hadoop

sbin/start-all.sh

Check on each node that the expected processes have started:

jps

The master node should show four processes:

8036 ResourceManager

7876 SecondaryNameNode

7657 NameNode

3349 Jps

slave1 should show three:

3352 Jps

3144 DataNode

3241 NodeManager

slave2 should also show three.

2. Verify local (single-machine) mode

Spark does not need to be started; run directly on master (from /usr/local/spark):

bin/run-example SparkPi

The output is verbose, so pipe it through grep to keep the useful line:

bin/run-example SparkPi 2>&1 | grep "Pi is roughly"

It should print something like: Pi is roughly 3.1460957304786525

To run pyspark in local mode on 4 CPU cores:

cd /usr/local/spark

./bin/pyspark --master local[4]

3. Develop a standalone Spark application

Write the program.

Open a Linux terminal (or write the file on Windows and upload it), create the directory, and edit the file:

sudo mkdir -p /usr/local/spark/mycode/python

sudo vim /usr/local/spark/mycode/python/WordCount.py

The contents are as follows:

#/usr/local/spark/mycode/python/WordCount.py

if __name__ == '__main__':

  from pyspark import SparkConf, SparkContext

  # Note: a master URL set here on SparkConf takes precedence over the
  # --master flag passed to spark-submit.
  conf = SparkConf().setMaster("local").setAppName("My App")

  sc = SparkContext(conf = conf)

  logFile = "file:///usr/local/spark/README.md"

  logData = sc.textFile(logFile, 2).cache()

  numAs = logData.filter(lambda line: 'a' in line).count()

  numBs = logData.filter(lambda line: 'b' in line).count()

  print('Lines with a: %s, Lines with b: %s' % (numAs, numBs))

------------------------

Run the Python script above:

cd /usr/local/spark/mycode/python

python3 WordCount.py

Part 7: Starting Spark as a Standalone Cluster

1. Start Hadoop first.

We modify two files (only on the master node; after editing, distribute them to the other worker nodes).

2. First, edit the spark-env.sh file:

cd /usr/local/spark/conf

sudo vim spark-env.sh

Add at the end:

export JAVA_HOME=/usr/lib/jvm/jdk

3. Second, edit the slaves file:

sudo cp slaves.template slaves

sudo vim slaves

Add to it:

slave1

slave2

4. Then distribute the files to slave1 and slave2:

scp spark-env.sh slaves hadoop@slave1:/usr/local/spark/conf

scp spark-env.sh slaves hadoop@slave2:/usr/local/spark/conf

5. Start the master: as the hadoop user on the master node, run:

cd /usr/local/spark

sbin/start-master.sh

It prints something like:

starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark/logs/spark-hadoop-org.apache.spark.deploy.master.Master-1-master.out

6. Then, on the master node, run:

sbin/start-slaves.sh

It prints something like:

slave2: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-slave2.out

slave1: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-slave1.out

7. After startup, check the processes on each node with jps.

jps

Master node:

8816 Master

8337 SecondaryNameNode

8900 Jps

8149 NameNode

8519 ResourceManager

slave1 node:

5777 Worker

5463 DataNode

5596 NodeManager

5838 Jps

slave2:

5472 NodeManager

5715 Jps

5652 Worker

5338 DataNode
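The standalone master also serves a status page on port 8080 (the stock default), which should list both workers; a quick reachability check from the master, assuming curl is installed:

curl -s -o /dev/null -w "%{http_code}\n" http://master:8080    # 200 means the UI is up
# Or open http://master:8080 in a browser on the VM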

Running the lab

(1) Run an application on the cluster

To submit an application to the standalone cluster manager, pass spark://master:7077 as the master argument to spark-submit:

cd /usr/local/spark/

(this is a single command, even if it wraps across lines)

bin/spark-submit --master spark://master:7077 /usr/local/spark/mycode/python/WordCount.py 2>&1 | grep "Lines"

(2) Run pyspark on the cluster

You can also connect pyspark to the standalone cluster manager.

First, upload the README to HDFS:

cd /usr/local/hadoop

./bin/hdfs dfs -put /usr/local/spark/README.md /
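A quick listing confirms the upload before moving on to pyspark:

./bin/hdfs dfs -ls /    # README.md should appear at the HDFS root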

Then start pyspark against the cluster:

cd /usr/local/spark/

bin/pyspark --master spark://master:7077

Then enter the following at the pyspark prompt:

textFile = sc.textFile("hdfs://master:9000/README.md")

textFile.count()

The output is 105.

textFile.first()

The output is '# Apache Spark'.

At this point, the Spark installation is complete.

Coming up next: RDD programming.
