1 背景
线上Hive任务偶尔出现hang住的现象.
经排查确认是触发了Hive的bug, 该bug在Hive-10569(https://issues.apache.org/jira/browse/HIVE-10569中已修复.
2 现象
用户beeline任务运行至Ended Job之后,一直未结束,查看yarn日志,该job已经运行成功.
INFO : 2020-03-20 20:25:21,401 Stage-36 map = 100%, reduce = 0%, Cumulative CPU 3.96 sec
INFO : 2020-03-20 20:25:30,224 Stage-36 map = 100%, reduce = 100%, Cumulative CPU 9.44 sec
INFO : MapReduce Total cumulative CPU time: 9 seconds 440 msec
INFO : Ended Job = job_1582793079899_13676666INFO : 2020-03-20 20:25:21,401 Stage-36 map = 100%, reduce = 0%, Cumulative CPU 3.96 sec
INFO : 2020-03-20 20:25:30,224 Stage-36 map = 100%, reduce = 100%, Cumulative CPU 9.44 sec
INFO : MapReduce Total cumulative CPU time: 9 seconds 440 msec
INFO : Ended Job = job_1582793079899_13676666
3 排查步骤
3.1 查看hiveserver2日志
找到该job在hive server端的日志,没有发现任何异常.
3.2 抓取现场
分别通过jstack -l <pid>
和jmap -dump:[live,]format=b,file=<filename> <pid>
获取server端的堆栈信息.
3.3 定位排查
-
首先, 确定session ID.
根据 job id 定位到该MR对应的hive session id(session id会在hive conf中记录); -
然后, 确定线程ID.
通过MAT分析工具打开之前获取到的heap文件, 根据session id查找到对应的HiveSessionImpl对象, 进而追踪到对应的线程ID; -
最后, 确定线程栈.
在jstack文件中找到对应的线程,从而定位到具体的线程栈.
本案例中线程栈如下:
"HiveServer2-Background-Pool: Thread-10663039" #10663039 prio=5 os_prio=0 tid=0x00007fcd94c87000 nid=0x213ed in Object.wait() [0x00007fcbec2fb000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at org.apache.hadoop.hive.ql.DriverContext.pollFinished(DriverContext.java:118)
- locked <0x00007fd29eff91d0> (a org.apache.hadoop.hive.ql.DriverContext)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1574)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1262)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1113)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1108)
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:157)
at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:70)
at org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:209)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1923)
at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:221)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
3.4 源码剖析
查看源码位置, 分析可知runner.isRunning状态未更新会导致hang住不结束.
/**
* Polls running tasks to see if a task has ended.
*
* @return The result object for any completed/failed task
*/
public synchronized TaskRunner pollFinished() throws InterruptedException {
while (!shutdown) {
Iterator<TaskRunner> it = running.iterator();
while (it.hasNext()) {
TaskRunner runner = it.next();
if (runner != null && !runner.isRunning()) {
it.remove();
return runner;
}
}
wait(SLEEP_TIME);
}
return null;
}
4 解决方案
查找社区issue发现, 该bug已经在1.3版本中进行了修复.(https://issues.apache.org/jira/browse/HIVE-10569
后续 backport 该patch即可.