# hive-on-spark
**Repository Path**: chenshiming2/hive-on-spark
## Basic Information
- **Project Name**: hive-on-spark
- **Description**: No description available
- **Primary Language**: Java
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 1
- **Forks**: 0
- **Created**: 2022-03-20
- **Last Updated**: 2023-10-19
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# hive-on-spark
## Introduction
Based on apache-hive 3.1.2. Version incompatibilities between components are a well-known problem in the Apache Hadoop ecosystem. This project aims to solve them for the Hive on Spark architecture by patching the apache-hive source code, fixing jar conflicts and related issues at the root.
+ Pre-built release: https://pan.baidu.com/s/13F34y6NUJPXaITeCUUTCrQ?pwd=hive
  (extraction code: hive)
## Currently Compatible
### hadoop:3.2.2
Hadoop 3.3 ships protobuf 3.7.1, while the most widely used protobuf version is still 2.5.0, so Hadoop 3.3 introduces incompatibilities; its APIs also changed substantially. From an ecosystem point of view, Hadoop 3.2 is supported instead.
+ A pre-built package is available: https://pan.baidu.com/s/1S0FCh92dMgRqq5vKevk3_w?pwd=hado
  (extraction code: hado; all native dependency libraries are bundled, see the check below)
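If you use this package, you can confirm that the bundled native libraries are actually loaded with the standard `hadoop checknative` command; a quick check, run on any node where the package is installed:
```shell
# Lists the hadoop/zlib/snappy/... native libraries and whether each was found
hadoop checknative -a
```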
### spark:3.1.3
Spark 3.2 targets Hadoop 3.3, while Spark 3.1 targets Hadoop 3.2, so the latest 3.1.x release, 3.1.3, is used here.
### hbase:2.4.10
### guava-27.0
Hadoop 3.2.2 uses guava 27.0-jre while Hive ships guava 19.0 by default, which leads to compatibility conflicts. The usual approach is to simply overwrite Hive's guava jar with Hadoop's, but the version gap is large and many APIs changed, so this only treats the symptom and errors may still appear at runtime; the Hive source code therefore has to be adjusted (the jar-swap workaround is sketched below for reference).
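For reference only, the jar-swap workaround mentioned above looks roughly like this sketch (paths assume `HIVE_HOME=/opt/module/hive` and `HADOOP_HOME=/opt/module/hadoop` as used later in this README); with the patched build from this project it should not be necessary:
```shell
# Common workaround: overwrite Hive's guava 19.0 with Hadoop's guava 27.0-jre.
# As noted above, this does not fix all API incompatibilities.
rm $HIVE_HOME/lib/guava-19.0.jar
cp $HADOOP_HOME/share/hadoop/common/lib/guava-27.0-jre.jar $HIVE_HOME/lib/
```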
## Docker Setup
+ `Dockerfile`
```dockerfile
FROM ubuntu:20.04
MAINTAINER csmsoledad<2584139809@qq.com>
ENV MYPATH /opt/module
WORKDIR $MYPATH
RUN apt-get update
# wget and ca-certificates are needed later to download the MySQL JDBC driver
RUN apt-get install -y vim openssh-server net-tools wget ca-certificates
RUN sed -i '35 s/^/PermitRootLogin yes\n/' /etc/ssh/sshd_config
RUN echo service ssh start >> /root/.bashrc
ADD server-jre-8u321-linux-x64.tar.gz $MYPATH
RUN mv jdk1.8.0_321/ jdk
ENV JAVA_HOME $MYPATH/jdk
ENV CLASSPATH $JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
ENV PATH=$PATH:$JAVA_HOME/bin
ADD hadoop-3.2.2.tar.gz $MYPATH
RUN mv hadoop-3.2.2 hadoop
ENV HADOOP_HOME=$MYPATH/hadoop
ENV PATH=$PATH:$HADOOP_HOME/bin
ENV PATH=$PATH:$HADOOP_HOME/sbin
RUN sed -i '55 s?^?export JAVA_HOME=/opt/module/jdk?' $HADOOP_HOME/etc/hadoop/hadoop-env.sh
ADD apache-hive-3.1.2-bin.tar.gz $MYPATH
RUN mv apache-hive-3.1.2-bin/ hive
ENV HIVE_HOME $MYPATH/hive
ENV PATH=$PATH:$HIVE_HOME/bin
RUN wget https://cdn.mysql.com/archives/mysql-connector-java-5.1/mysql-connector-java-5.1.49.tar.gz \
&& tar -xzvf mysql-connector-java-5.1.49.tar.gz \
&& cp mysql-connector-java-5.1.49/mysql-connector-java-5.1.49.jar $HIVE_HOME/lib \
&& rm -rf mysql-connector-java-5.1.49 \
&& rm mysql-connector-java-5.1.49.tar.gz
ADD spark-3.1.3-bin-hadoop3.2.tgz $MYPATH
RUN mv spark-3.1.3-bin-hadoop3.2/ spark
ENV SPARK_HOME=$MYPATH/spark
ENV PATH=$PATH:$SPARK_HOME/bin
RUN cp spark/conf/spark-env.sh.template spark/conf/spark-env.sh \
&& sed -i '1 a export JAVA_HOME=/opt/module/jdk' spark/conf/spark-env.sh \
&& sed -i '2 a export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop' spark/conf/spark-env.sh \
&& sed -i '3 a export SPARK_DIST_CLASSPATH=$(hadoop classpath)' spark/conf/spark-env.sh
```
+ Build the image
```
docker build -t hive-on-spark:1.0 .
```
+ Create a container
```shell
docker run -it --name HiveSpark --net bigdata --ip 192.168.1.7 hive-on-spark:1.0
```
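The `docker run` command above assumes a user-defined network named `bigdata` already exists and that its subnet contains `192.168.1.7`. A minimal sketch of creating such a network (the subnet and gateway values are assumptions, adjust them to your cluster):
```shell
docker network create --subnet 192.168.1.0/24 --gateway 192.168.1.1 bigdata
```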
## Basic Configuration Files
+ Copy all the configuration files under `etc/hadoop` from your Hadoop cluster into this container's `$HADOOP_HOME/etc/hadoop` so that Spark can read them (e.g. with `docker cp`, as sketched below)
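A minimal sketch of that copy using `docker cp` (the host-side directory `./hadoop-conf` is an assumption; use wherever your cluster configuration files live):
```shell
# Copy the contents of the local hadoop-conf directory into the running container
docker cp ./hadoop-conf/. HiveSpark:/opt/module/hadoop/etc/hadoop/
```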
+ Create `hive-site.xml` under `$HIVE_HOME/conf`
```xml
<?xml version="1.0"?>
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://mysqlmock:3306/metastore?useSSL=false&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>0000</value>
    </property>
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>
    <property>
        <name>hive.metastore.event.db.notification.api.auth</name>
        <value>false</value>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>hdfs://hadbase1:8020/hive/warehouse</value>
    </property>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://HiveSpark:9083</value>
    </property>
    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>HiveSpark</value>
    </property>
    <property>
        <name>hive.server2.thrift.port</name>
        <value>10000</value>
    </property>
    <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.cli.print.current.db</name>
        <value>true</value>
    </property>
    <property>
        <name>spark.yarn.jars</name>
        <value>hdfs://hadbase1:8020/spark-jars/*</value>
    </property>
    <property>
        <name>hive.execution.engine</name>
        <value>spark</value>
    </property>
</configuration>
```
+ Configure Spark: `vim /opt/module/hive/conf/spark-defaults.conf` (see `$SPARK_HOME/conf/spark-defaults.conf.template` for reference)
```editorconfig
spark.master yarn
spark.eventLog.enabled true
spark.eventLog.dir hdfs://hadbase1:8020/spark-history
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 2g
spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
```
## First Startup
+ With MySQL configured as the metastore database, initialize the schema (the `metastore` database referenced in `hive-site.xml` must already exist; see the note after the command)
```sh
schematool -initSchema -dbType mysql -verbose
```
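Note that `schematool` expects the `metastore` database from the JDBC URL to exist already (unless `createDatabaseIfNotExist=true` is appended to the URL). A minimal sketch of creating it, assuming the MySQL host `mysqlmock` and the credentials configured above, run from any machine with the `mysql` client:
```shell
mysql -h mysqlmock -uroot -p0000 -e "CREATE DATABASE IF NOT EXISTS metastore;"
```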
+ Connect to MySQL and run the following statements to fix character-set issues
```sql
use metastore;
alter table COLUMNS_V2 modify column COMMENT varchar(256) character set utf8;
alter table TABLE_PARAMS modify column PARAM_VALUE mediumtext character set utf8;
```
+ Create the Spark history directory in HDFS
```
hadoop fs -mkdir /spark-history
```
+ Upload the Spark jars to HDFS so that Spark can be loaded in YARN mode
```
wget https://dlcdn.apache.org/spark/spark-3.1.3/spark-3.1.3-bin-without-hadoop.tgz
tar xzvf spark-3.1.3-bin-without-hadoop.tgz
hadoop fs -mkdir /spark-jars
hadoop fs -put spark-3.1.3-bin-without-hadoop/jars/* /spark-jars
```
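An optional sanity check before running queries, to confirm the jars referenced by `spark.yarn.jars` actually landed in HDFS:
```shell
hadoop fs -ls /spark-jars | head
hadoop fs -count /spark-jars
```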
+ Test
Start the Hive client:
```
root@fdee241c9199:/opt/module/hive/conf# hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/module/hive/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = 5fdc6d56-02d9-4b8f-808e-74f976c3ff73
Logging initialized using configuration in jar:file:/opt/module/hive/lib/hive-common-3.1.2.jar!/hive-log4j2.properties Async: true
Hive Session ID = f7b9a256-1b28-4cf5-9ca0-1496dd3c7777
hive (default)> show tables;
```
Create a table:
```
create table test (id int, name string);
```
Insert data into the table:
```shell
hive (default)> insert into table test values (1,"csm");
Query ID = root_20220320133310_46a816c9-7b9e-49bf-8898-efbebeb26537
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Running with YARN Application = application_1647759040353_0001
Kill Command = /opt/module/hadoop/bin/yarn application -kill application_1647759040353_0001
Hive on Spark Session Web UI URL: http://c39027038dd1:41831
Query Hive on Spark job[0] stages: [0, 1]
Spark job[0] status = RUNNING
--------------------------------------------------------------------------------------
STAGES ATTEMPT STATUS TOTAL COMPLETED RUNNING PENDING FAILED
--------------------------------------------------------------------------------------
Stage-0 ........ 0 FINISHED 1 1 0 0 0
Stage-1 ........ 0 FINISHED 1 1 0 0 0
--------------------------------------------------------------------------------------
STAGES: 02/02 [==========================>>] 100% ELAPSED TIME: 6.07 s
--------------------------------------------------------------------------------------
Spark job[0] finished successfully in 6.07 second(s)
Loading data to table default.test
OK
col1 col2
Time taken: 28.737 seconds
```
Output like the above means the Hive on Spark setup is working.
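To double-check that the row was written, query the table; a simple `SELECT` like this is typically served by a fetch task and does not launch a new Spark job:
```shell
hive -e 'select * from test;'
```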
## Control Script
+ `vim /usr/local/bin/myhive`
```shell
#!/bin/bash
HIVE_LOG_DIR=$HIVE_HOME/logs
HIVE_LOG=/tmp/root
if [ ! -d $HIVE_LOG_DIR ]
then
    mkdir -p $HIVE_LOG_DIR
fi

# Check whether a process is running: argument 1 is the process name, argument 2 its port
function check_process()
{
    # ps -ef lists all processes; grep -v grep drops the grep process itself;
    # awk prints field 2, which is the PID
    pid=$(ps -ef 2>/dev/null | grep -v grep | grep -i $1 | awk '{print $2}')
    # netstat lists listening ports; awk prints field 7 (pid/java);
    # cut splits on "/" and keeps the PID part
    ppid=$(netstat -nltp 2>/dev/null | grep $2 | awk '{print $7}' | cut -d '/' -f 1)
    echo $pid
    [[ "$pid" =~ "$ppid" ]] && [ "$ppid" ] && return 0 || return 1
}

function hive_start()
{
    metapid=$(check_process HiveMetastore 9083)
    cmd="nohup hive --service metastore >$HIVE_LOG_DIR/metastore.log 2>&1 &"
    # -z tests for an empty string: if no PID was found, eval $cmd starts the service
    # (eval is used so that variables such as $HIVE_LOG_DIR are expanded before execution)
    [ -z "$metapid" ] && eval $cmd && echo "Metastore service started" || echo "Metastore service is already running"
    server2pid=$(check_process HiveServer2 10000)
    cmd="nohup hiveserver2 >$HIVE_LOG_DIR/hiveServer2.log 2>&1 &"
    [ -z "$server2pid" ] && eval $cmd && echo "HiveServer2 is starting, please wait..." && hive_sleep && echo "HiveServer2 service started" || echo "HiveServer2 service is already running"
}

function hive_sleep()
{
    # Wait until the last line of hive.log mentions the Spark client factory
    test1='spark.HiveSparkClientFactory:'
    test=$(tail -1 $HIVE_LOG/hive.log | awk '{print $4}')
    while [[ ! "$test" =~ "$test1" ]]
    do
        sleep 1
        test=$(tail -1 $HIVE_LOG/hive.log | awk '{print $4}')
    done
}

function hive_stop()
{
    metapid=$(check_process HiveMetastore 9083)
    # [ "$var" ] tests whether a PID was found
    [ "$metapid" ] && kill -9 $metapid || echo "Metastore service is not running"
    server2pid=$(check_process HiveServer2 10000)
    [ "$server2pid" ] && kill -9 $server2pid || echo "HiveServer2 service is not running"
}

case $1 in
"start")
    hive_start
    ;;
"stop")
    hive_stop
    ;;
"restart")
    hive_stop
    sleep 2
    hive_start
    ;;
"status")
    check_process HiveMetastore 9083 >/dev/null && echo "Metastore service is running" || echo "Metastore service is down or abnormal"
    check_process HiveServer2 10000 >/dev/null && echo "HiveServer2 service is running" || echo "HiveServer2 service is down or abnormal"
    ;;
*)
    echo "Invalid args!"
    echo "Usage: $(basename $0) start|stop|restart|status"
    ;;
esac
```
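After saving the script, make it executable; it can then be used as follows:
```shell
chmod +x /usr/local/bin/myhive
myhive start     # start metastore and HiveServer2 in the background
myhive status    # check whether both services are up
myhive stop      # stop both services
```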
## Connecting to Hive Remotely
DataGrip is recommended as the client.
+ The first insert in a session is slow (roughly 20 seconds) because the Spark processes have to be started.
+ The following errors may occur during execution:
> FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask. Unable to fetch table students. Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

Solution: start the metastore service, i.e. `hive --service metastore`
> FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.StatsTask

Cause: this can happen on a non-`overwrite` insert when the target table already contains data.
Workaround:
```sql
set hive.stats.column.autogather=false;
```