Preface

The problem came from a Java discussion group,

from a friend who works on Flink development.

Tracking it down took quite a bit of time, mainly because he was frequently on-site with customers, which made communication inconvenient.

Fortunately, the problem did get solved in the end.

The business logic itself is ordinary: read a batch of data from TDengine, process it, and finally emit a result to the client.
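For orientation, the rough shape of such a job is sketched below. This is only a minimal sketch under assumptions: the connection URL, credentials, table and columns are made up, and the real job used a CEP operator in front of its print sink (as the failure log later shows), whereas the sketch just uses a plain map.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext;

// Minimal sketch of the job shape (NOT the real code): read rows from TDengine
// over JDBC, do some processing, print the result. All names below are made up.
public class AnalysisJobSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.addSource(new RichSourceFunction<String>() {
            private volatile boolean running = true;

            @Override
            public void run(SourceContext<String> ctx) throws Exception {
                // TDengine JDBC driver and its default credentials; adjust to the real cluster
                Class.forName("com.taosdata.jdbc.TSDBDriver");
                try (Connection conn = DriverManager.getConnection(
                         "jdbc:TAOS://localhost:6030/demo", "root", "taosdata");
                     Statement stmt = conn.createStatement();
                     ResultSet rs = stmt.executeQuery("SELECT ts, val FROM metrics")) {
                    while (running && rs.next()) {
                        ctx.collect(rs.getTimestamp("ts") + "," + rs.getDouble("val"));
                    }
                }
            }

            @Override
            public void cancel() {
                running = false;
            }
        })
        .map(line -> "processed: " + line)   // stand-in for the real processing / CEP logic
        .returns(Types.STRING)
        .print();                            // the real job also ended in a print sink

        env.execute("analysis");
    }
}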

Problem Symptoms

The symptoms were as follows.

An analysis.jar job jar was submitted to the flink cluster with bin/flink and left to run.
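The submission itself was an ordinary CLI run, roughly along the lines of the command below (the entry-point class name here is only a placeholder):

bin/flink run -c com.example.Analysis analysis.jar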

Two or three minutes later, the job failed with the following error:

2024-03-21 14:42:29,846 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - CepOperator -> Sink: Print to Std. Out (1/1) (4ae89192645b022da9401c80342b93f2_90bea66de1c231edf33913ecd54406c1_0_0) switched from RUNNING to FAILED on localhost:36777-01398c @ localhost (dataPort=43443).
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id localhost:36777-01398c timed out.
    at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1464) ~[flink-dist-1.18.1.jar:1.18.1]
    at org.apache.flink.runtime.heartbeat.DefaultHeartbeatMonitor.run(DefaultHeartbeatMonitor.java:158) ~[flink-dist-1.18.1.jar:1.18.1]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_402]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_402]
    at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRunAsync$4(PekkoRpcActor.java:451) ~[flink-rpc-akkad7858f66-cf4a-427d-b780-4a9fcc376868.jar:1.18.1]
    at org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) ~[flink-dist-1.18.1.jar:1.18.1]
    at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRunAsync(PekkoRpcActor.java:451) ~[flink-rpc-akkad7858f66-cf4a-427d-b780-4a9fcc376868.jar:1.18.1]
    at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRpcMessage(PekkoRpcActor.java:218) ~[flink-rpc-akkad7858f66-cf4a-427d-b780-4a9fcc376868.jar:1.18.1]
    at org.apache.flink.runtime.rpc.pekko.FencedPekkoRpcActor.handleRpcMessage(FencedPekkoRpcActor.java:85) ~[flink-rpc-akkad7858f66-cf4a-427d-b780-4a9fcc376868.jar:1.18.1]
    at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleMessage(PekkoRpcActor.java:168) ~[flink-rpc-akkad7858f66-cf4a-427d-b780-4a9fcc376868.jar:1.18.1]
    at org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:33) [flink-rpc-akkad7858f66-cf4a-427d-b780-4a9fcc376868.jar:1.18.1]
    at org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:29) [flink-rpc-akkad7858f66-cf4a-427d-b780-4a9fcc376868.jar:1.18.1]
    at scala.PartialFunction.applyOrElse(PartialFunction.scala:127) [flink-rpc-akkad7858f66-cf4a-427d-b780-4a9fcc376868.jar:1.18.1]
    at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126) [flink-rpc-akkad7858f66-cf4a-427d-b780-4a9fcc376868.jar:1.18.1]
    at org.apache.pekko.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:29) [flink-rpc-akkad7858f66-cf4a-427d-b780-4a9fcc376868.jar:1.18.1]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:175) [flink-rpc-akkad7858f66-cf4a-427d-b780-4a9fcc376868.jar:1.18.1]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176) [flink-rpc-akkad7858f66-cf4a-427d-b780-4a9fcc376868.jar:1.18.1]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176) [flink-rpc-akkad7858f66-cf4a-427d-b780-4a9fcc376868.jar:1.18.1]
    at org.apache.pekko.actor.Actor.aroundReceive(Actor.scala:547) [flink-rpc-akkad7858f66-cf4a-427d-b780-4a9fcc376868.jar:1.18.1]
    at org.apache.pekko.actor.Actor.aroundReceive$(Actor.scala:545) [flink-rpc-akkad7858f66-cf4a-427d-b780-4a9fcc376868.jar:1.18.1]
    at org.apache.pekko.actor.AbstractActor.aroundReceive(AbstractActor.scala:229) [flink-rpc-akkad7858f66-cf4a-427d-b780-4a9fcc376868.jar:1.18.1]
    at org.apache.pekko.actor.ActorCell.receiveMessage(ActorCell.scala:590) [flink-rpc-akkad7858f66-cf4a-427d-b780-4a9fcc376868.jar:1.18.1]
    at org.apache.pekko.actor.ActorCell.invoke(ActorCell.scala:557) [flink-rpc-akkad7858f66-cf4a-427d-b780-4a9fcc376868.jar:1.18.1]
    at org.apache.pekko.dispatch.Mailbox.processMailbox(Mailbox.scala:280) [flink-rpc-akkad7858f66-cf4a-427d-b780-4a9fcc376868.jar:1.18.1]
    at org.apache.pekko.dispatch.Mailbox.run(Mailbox.scala:241) [flink-rpc-akkad7858f66-cf4a-427d-b780-4a9fcc376868.jar:1.18.1]
    at org.apache.pekko.dispatch.Mailbox.exec(Mailbox.scala:253) [flink-rpc-akkad7858f66-cf4a-427d-b780-4a9fcc376868.jar:1.18.1]
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) [?:1.8.0_402]
    at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) [?:1.8.0_402]
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) [?:1.8.0_402]
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175) [?:1.8.0_402]

Investigating the Problem

From the exception and its context we can tell that, roughly, something went wrong with the heartbeat replies the taskmanager sends back to the master side (the exception is raised by the JobMaster's heartbeat monitor once the TaskManager's heartbeat times out), and that is what produced this error.

Is the problem on the sending side, the resourcemanager, or on the side that is supposed to respond, the taskmanager?

Looking at the resourcemanager's logs, we can see that it has been sending heartbeat requests to the taskmanager and the jobmaster on a regular schedule the whole time.

The exchange between the resourcemanager and the jobmaster looks normal: requests go out and responses come back.

The last heartbeat the resourcemanager received from the taskmanager was at 2024-03-21 14:41:39,838; it can be found by searching the log for "Received heartbeat from localhost:36777-01398c."
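It is worth noting that the job failure at 14:42:29,846 comes almost exactly 50 seconds after that last received heartbeat, which is consistent with Flink's default heartbeat settings in flink-conf.yaml (the values below are the defaults, in milliseconds, not something tuned on this cluster):

heartbeat.interval: 10000
heartbeat.timeout: 50000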

Now let's look at the taskmanager's logs.

We can see that up to 2024-03-21 14:41:57,997 it still seemed to be handling requests normally.

The very next log line, however, is at 2024-03-21 14:44:25,899. What was the taskmanager doing during that roughly two-and-a-half-minute gap?

We tried to attach jmap to the taskmanager to pull information about the JVM heap, but the attach failed.

The overall picture was that the taskmanager had simply hung.
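For reference, these are the kinds of commands involved; the pid is a placeholder for the taskmanager process id found with jps. When a JVM is stuck in long GC pauses or is otherwise unresponsive, attaching tools such as jmap can fail exactly as they did here, whereas jstat usually still works because it only reads the JVM's performance counters rather than requiring the target process to respond:

jps -l                    # find the taskmanager pid (org.apache.flink.runtime.taskexecutor.TaskManagerRunner)
jmap -heap <pid>          # heap summary; this is the kind of call whose attach failed
jstat -gcutil <pid> 2000  # GC utilization sampled every 2 seconds; back-to-back Full GCs show up here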

Confirming the Problem

For all sorts of reasons, the problem was not confirmed until 2024.03.27.

From the point where the screenshot of the GC statistics begins, the taskmanager was already running one full GC after another, and each full GC freed up very little memory.

The taskmanager was spending most of its time in back-to-back full GCs, so the whole process was effectively hung.

Generally speaking, this happens when the iteration between spark/flink and the database server goes wrong: the client ends up pulling far too much data in a single interaction. Configuring it to fetch over multiple round trips, one batch per round trip, usually resolves this kind of problem (a sketch of that idea follows below).

A similar case is covered in 记一次 spark 读取大数据表 OOM OutOfMemoryError: GC overhead limit exceeded.
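To make the "multiple round trips, one batch per round trip" idea concrete, here is a minimal JDBC-level sketch. It is only an illustration, not the fix that was actually applied: the connection URL, credentials, table, columns and page size are all assumptions, and it pages by the timestamp column, which is usually the natural key in a TDengine table.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

// Sketch: pull a large time-series table in pages keyed by the timestamp column,
// so the client only ever holds one page of rows in memory at a time.
public class PagedReadSketch {

    private static final int PAGE_SIZE = 10_000;   // one batch per round trip

    public static void main(String[] args) throws Exception {
        Class.forName("com.taosdata.jdbc.TSDBDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:TAOS://localhost:6030/demo", "root", "taosdata");
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT ts, val FROM metrics WHERE ts > ? ORDER BY ts LIMIT " + PAGE_SIZE)) {

            Timestamp lastTs = new Timestamp(0L);   // start from the beginning of the table
            while (true) {
                ps.setTimestamp(1, lastTs);
                int rowsInPage = 0;
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        rowsInPage++;
                        lastTs = rs.getTimestamp("ts");
                        // process the row here; nothing is accumulated across pages
                    }
                }
                if (rowsInPage < PAGE_SIZE) {
                    break;                          // last (possibly partial) page
                }
            }
        }
    }
}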

Solving the Problem

Fin
