# hive-on-spark

**Repository Path**: chenshiming2/hive-on-spark

## Basic Information

- **Project Name**: hive-on-spark
- **Description**: No description available
- **Primary Language**: Java
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 0
- **Created**: 2022-03-20
- **Last Updated**: 2023-10-19

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# hive-on-spark

## Introduction

From apache-hive:3.1.2.

Version incompatibility between components of the Apache Hadoop ecosystem is a well-known problem. This project aims to solve the version-compatibility problems of the Hive on Spark architecture at the source level: apache-hive is patched and rebuilt so that jar conflicts and related issues are fixed at the root rather than worked around.

+ Pre-built release: https://pan.baidu.com/s/13F34y6NUJPXaITeCUUTCrQ?pwd=hive (extraction code: hive)

## Currently compatible versions

### hadoop:3.2.2

Hadoop 3.3 uses protobuf 3.7.1, while the most widely deployed protobuf version is still 2.5.0, so Hadoop 3.3 introduces incompatibilities; its API also changed substantially. From an ecosystem point of view, Hadoop 3.2 is the supported line.

+ An optional pre-built package (with all native libraries bundled): https://pan.baidu.com/s/1S0FCh92dMgRqq5vKevk3_w?pwd=hado (extraction code: hado)

### spark:3.1.3

Spark 3.2 targets Hadoop 3.3, while Spark 3.1 targets Hadoop 3.2, so the latest 3.1.x release, 3.1.3, is used here.

### hbase:2.4.10

### guava-27.0

Hadoop 3.2.2 ships guava 27.0-jre, while stock Hive uses guava 19.0, which leads to compatibility conflicts. The usual workaround is to overwrite Hive's guava jar with Hadoop's, but the two versions are far apart and a number of APIs changed, so errors may still show up at runtime. This project therefore adjusts the Hive source code instead.

## Docker setup

+ `Dockerfile`

```dockerfile
FROM ubuntu:20.04
MAINTAINER csmsoledad<2584139809@qq.com>

ENV MYPATH /opt/module
WORKDIR $MYPATH

RUN apt-get update
# wget is needed later to download the MySQL connector
RUN apt-get install -y vim openssh-server net-tools wget
RUN sed -i '35 s/^/PermitRootLogin yes\n/' /etc/ssh/sshd_config
RUN echo service ssh start >> /root/.bashrc

# JDK
ADD server-jre-8u321-linux-x64.tar.gz $MYPATH
RUN mv jdk1.8.0_321/ jdk
ENV JAVA_HOME $MYPATH/jdk
ENV CLASSPATH $JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
ENV PATH=$PATH:$JAVA_HOME/bin

# Hadoop
ADD hadoop-3.2.2.tar.gz $MYPATH
RUN mv hadoop-3.2.2 hadoop
ENV HADOOP_HOME=$MYPATH/hadoop
ENV PATH=$PATH:$HADOOP_HOME/bin
ENV PATH=$PATH:$HADOOP_HOME/sbin
RUN sed -i '55 s?^?export JAVA_HOME=/opt/module/jdk?' $HADOOP_HOME/etc/hadoop/hadoop-env.sh

# Hive (the rebuilt apache-hive-3.1.2 from this project) plus the MySQL JDBC driver
ADD apache-hive-3.1.2-bin.tar.gz $MYPATH
RUN mv apache-hive-3.1.2-bin/ hive
ENV HIVE_HOME $MYPATH/hive
ENV PATH=$PATH:$HIVE_HOME/bin
RUN wget https://cdn.mysql.com/archives/mysql-connector-java-5.1/mysql-connector-java-5.1.49.tar.gz \
    && tar -xzvf mysql-connector-java-5.1.49.tar.gz \
    && cp mysql-connector-java-5.1.49/mysql-connector-java-5.1.49.jar $HIVE_HOME/lib \
    && rm -rf mysql-connector-java-5.1.49 \
    && rm mysql-connector-java-5.1.49.tar.gz

# Spark
ADD spark-3.1.3-bin-hadoop3.2.tgz $MYPATH
RUN mv spark-3.1.3-bin-hadoop3.2/ spark
ENV SPARK_HOME=$MYPATH/spark
ENV PATH=$PATH:$SPARK_HOME/bin
RUN cp spark/conf/spark-env.sh.template spark/conf/spark-env.sh \
    && sed -i '1 a export JAVA_HOME=/opt/module/jdk' spark/conf/spark-env.sh \
    && sed -i '2 a export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop' spark/conf/spark-env.sh \
    && sed -i '3 a export SPARK_DIST_CLASSPATH=$(hadoop classpath)' spark/conf/spark-env.sh
```

+ Build the image

```
docker build -t hive-on-spark:1.0 .
```
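As a quick, optional smoke test of the freshly built image you can check the bundled component versions. This is only a minimal sketch: the image tag `hive-on-spark:1.0` is taken from the build command above, and the version commands are the standard CLI entry points of the JDK, Hadoop, Hive, and Spark distributions installed by the Dockerfile.

```shell
# Optional smoke test of the image built above (not part of the project itself).
docker run --rm hive-on-spark:1.0 bash -c '
  java -version 2>&1 | head -n 1                    # expect 1.8.0_321
  hadoop version | head -n 1                        # expect Hadoop 3.2.2
  hive --version 2>/dev/null | head -n 1            # expect Hive 3.1.2 (the rebuilt jars)
  spark-submit --version 2>&1 | grep -m 1 version   # expect Spark 3.1.3
'
```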
+ Create a container instance

```shell
docker run -it --name HiveSpark --net bigdata --ip 192.168.1.7 hive-on-spark:1.0
```

## Basic configuration files

+ Copy all configuration files under the Hadoop cluster's `etc/hadoop` directory into the container's `$HADOOP_HOME/etc/hadoop`, so that Spark can read them.
+ Create `hive-site.xml` under `$HIVE_HOME/conf`:

```xml
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://mysqlmock:3306/metastore?useSSL=false&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>0000</value>
    </property>
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>
    <property>
        <name>hive.metastore.event.db.notification.api.auth</name>
        <value>false</value>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>hdfs://hadbase1:8020/hive/warehouse</value>
    </property>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://HiveSpark:9083</value>
    </property>
    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>HiveSpark</value>
    </property>
    <property>
        <name>hive.server2.thrift.port</name>
        <value>10000</value>
    </property>
    <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.cli.print.current.db</name>
        <value>true</value>
    </property>
    <property>
        <name>spark.yarn.jars</name>
        <value>hdfs://hadbase1:8020/spark-jars/*</value>
    </property>
    <property>
        <name>hive.execution.engine</name>
        <value>spark</value>
    </property>
</configuration>
```

+ Configure Spark: `vim /opt/module/hive/conf/spark-defaults.conf` (see `cat $SPARK_HOME/conf/spark-defaults.conf.template` for reference)

```editorconfig
spark.master                     yarn
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://hadbase1:8020/spark-history
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory              2g
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
```

## First startup

+ Initialize the metastore schema in MySQL

```sh
schematool -initSchema -dbType mysql -verbose
```

+ Connect to MySQL and run the following statements to fix character-set issues

```sql
use metastore;
alter table COLUMNS_V2 modify column COMMENT varchar(256) character set utf8;
alter table TABLE_PARAMS modify column PARAM_VALUE mediumtext character set utf8;
```

+ Create the history directory in HDFS

```
hadoop fs -mkdir /spark-history
```

+ Upload the Spark jars to HDFS so that Spark can be loaded in YARN mode

```
wget https://dlcdn.apache.org/spark/spark-3.1.3/spark-3.1.3-bin-without-hadoop.tgz
tar xzvf spark-3.1.3-bin-without-hadoop.tgz
hadoop fs -mkdir /spark-jars
hadoop fs -put spark-3.1.3-bin-without-hadoop/jars/* /spark-jars
```
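Before running the first test it can help to confirm that the HDFS paths referenced in `hive-site.xml` and `spark-defaults.conf` above exist and are populated. A minimal sketch, assuming the NameNode address `hdfs://hadbase1:8020` used in those example configs:

```shell
# Sanity-check the HDFS paths referenced by the configuration (adjust the NameNode address to your cluster).
hadoop fs -ls hdfs://hadbase1:8020/spark-jars | head -n 5                        # jars uploaded for spark.yarn.jars
hadoop fs -test -d hdfs://hadbase1:8020/spark-history && echo "spark-history OK" # Spark event log directory
hadoop fs -test -d hdfs://hadbase1:8020/hive/warehouse && echo "warehouse OK"    # Hive warehouse directory
```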
+ Test: start the Hive client

```
root@fdee241c9199:/opt/module/hive/conf# hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/module/hive/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = 5fdc6d56-02d9-4b8f-808e-74f976c3ff73

Logging initialized using configuration in jar:file:/opt/module/hive/lib/hive-common-3.1.2.jar!/hive-log4j2.properties Async: true
Hive Session ID = f7b9a256-1b28-4cf5-9ca0-1496dd3c7777
hive (default)> show tables;
```

Create a table

```
create table test (id int, name string);
```

Insert data into the table

```shell
hive (default)> insert into table test values (1,"csm");
Query ID = root_20220320133310_46a816c9-7b9e-49bf-8898-efbebeb26537
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Running with YARN Application = application_1647759040353_0001
Kill Command = /opt/module/hadoop/bin/yarn application -kill application_1647759040353_0001
Hive on Spark Session Web UI URL: http://c39027038dd1:41831

Query Hive on Spark job[0] stages: [0, 1]
Spark job[0] status = RUNNING
--------------------------------------------------------------------------------------
          STAGES   ATTEMPT        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
--------------------------------------------------------------------------------------
Stage-0 ........         0      FINISHED      1          1        0        0       0
Stage-1 ........         0      FINISHED      1          1        0        0       0
--------------------------------------------------------------------------------------
STAGES: 02/02    [==========================>>] 100%  ELAPSED TIME: 6.07 s
--------------------------------------------------------------------------------------
Spark job[0] finished successfully in 6.07 second(s)
Loading data to table default.test
OK
col1	col2
Time taken: 28.737 seconds
```

If the job finishes as above, the Hive on Spark setup is working.

## Control script

+ `vim /usr/local/bin/myhive`

```shell
#!/bin/bash
HIVE_LOG_DIR=$HIVE_HOME/logs
HIVE_LOG=/tmp/root
if [ ! -d $HIVE_LOG_DIR ]
then
    mkdir -p $HIVE_LOG_DIR
fi

# Check whether a service is running: argument 1 is the process name, argument 2 is its port
function check_process()
{
    # ps -ef lists all processes; grep -v grep drops the grep process itself;
    # awk prints field 2, i.e. the PID
    pid=$(ps -ef 2>/dev/null | grep -v grep | grep -i $1 | awk '{print $2}')
    # netstat lists listening ports; field 7 is "pid/java", cut keeps the part before the /
    ppid=$(netstat -nltp 2>/dev/null | grep $2 | awk '{print $7}' | cut -d '/' -f 1)
    echo $pid
    [[ "$pid" =~ "$ppid" ]] && [ "$ppid" ] && return 0 || return 1
}

function hive_start()
{
    metapid=$(check_process HiveMetastore 9083)
    cmd="nohup hive --service metastore >$HIVE_LOG_DIR/metastore.log 2>&1 &"
    # -z tests for an empty string: if no PID was found, eval the start command
    # (eval expands variables such as $HIVE_LOG_DIR before the command is run)
    [ -z "$metapid" ] && eval $cmd && echo "Metastore service started" || echo "Metastore service is already running"
    server2pid=$(check_process HiveServer2 10000)
    cmd="nohup hiveserver2 >$HIVE_LOG_DIR/hiveServer2.log 2>&1 &"
    [ -z "$server2pid" ] && eval $cmd && echo "HiveServer2 is starting, please wait..." && hive_sleep && echo "HiveServer2 service started" || echo "HiveServer2 service is already running"
}

# Wait until the last line of hive.log mentions the Spark client factory
function hive_sleep()
{
    test1='spark.HiveSparkClientFactory:'
    test=$(tail -1 $HIVE_LOG/hive.log | awk '{print $4}')
    while [[ ! "$test" =~ "$test1" ]]
    do
        sleep 1
        test=$(tail -1 $HIVE_LOG/hive.log | awk '{print $4}')
    done
}

function hive_stop()
{
    metapid=$(check_process HiveMetastore 9083)
    # [ "$var" ] tests whether a PID was found
    [ "$metapid" ] && kill -9 $metapid || echo "Metastore service is not running"
    server2pid=$(check_process HiveServer2 10000)
    [ "$server2pid" ] && kill -9 $server2pid || echo "HiveServer2 service is not running"
}

case $1 in
"start")
    hive_start
    ;;
"stop")
    hive_stop
    ;;
"restart")
    hive_stop
    sleep 2
    hive_start
    ;;
"status")
    check_process HiveMetastore 9083 >/dev/null && echo "Metastore service is running" || echo "Metastore service is abnormal or stopped"
    check_process HiveServer2 10000 >/dev/null && echo "HiveServer2 service is running" || echo "HiveServer2 service is abnormal or stopped"
    ;;
*)
    echo Invalid Args!
    echo 'Usage: '$(basename $0)' start|stop|restart|status'
    ;;
esac
```

## Connecting to Hive remotely

DataGrip is the recommended client.

+ The first insert in a session is slow (about 20 seconds) because the Spark processes have to be started.
+ The following errors may appear during execution:

> FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask. Unable to fetch table students. Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

Fix: start the metastore service, i.e. `hive --service metastore`

> FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.StatsTask

Cause: this can occur on a non-overwrite insert into a table that already contains data. Fix:

```sql
set hive.stats.column.autogather=false;
```