HBase exceptions caused by unsynchronized node clocks


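1. Background

This morning I came into the office and found the HBase cluster in an abnormal state. The logs showed that the HMaster had failed to connect to one of the RegionServers, so I set out to find the cause.

2. Problem logs

RegionServer log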

2019-11-28 10:01:26,051 INFO  [regionserver/node3:16020] hbase.ChoreService: Chore service for: regionserver/node3:16020 had [[ScheduledChore: Name: CompactedHFilesCleaner Period: 120000 Unit: MILLISECONDS], [ScheduledChore: Name: CompactionThroughputTuner Period: 60000 Unit: MILLISECONDS], [ScheduledChore: Name: MovedRegionsCleaner for region node3,16020,1551923919360 Period: 120000 Unit: MILLISECONDS], [ScheduledChore: Name: MemstoreFlusherChore Period: 10000 Unit: MILLISECONDS]] on shutdown
2019-11-28 10:01:26,051 INFO  [regionserver/node3:16020.logRoller] regionserver.LogRoller: LogRoller exiting.
2019-11-28 10:01:26,051 INFO  [regionserver/node3:16020] regionserver.CompactSplit: Waiting for Split Thread to finish...
2019-11-28 10:01:26,052 INFO  [regionserver/node3:16020] regionserver.CompactSplit: Waiting for Large Compaction Thread to finish...
2019-11-28 10:01:26,052 INFO  [regionserver/node3:16020] regionserver.CompactSplit: Waiting for Small Compaction Thread to finish...
2019-11-28 10:01:26,053 INFO  [regionserver/node3:16020] ipc.NettyRpcServer: Stopping server on /ip:16020
2019-11-28 10:01:26,106 INFO  [regionserver/node3:16020] zookeeper.ZooKeeper: Session: 0x36952745d40005a closed
2019-11-28 10:01:26,106 INFO  [main-EventThread] zookeeper.ClientCnxn: EventThread shut down
2019-11-28 10:01:26,106 INFO  [regionserver/node3:16020] regionserver.HRegionServer: Exiting; stopping=node3,16020,1551923919360; zookeeper connection closed.
2019-11-28 10:01:26,106 INFO  [shutdown-hook-0] regionserver.ShutdownHook: Starting fs shutdown hook thread.
2019-11-28 10:01:26,107 INFO  [shutdown-hook-0] regionserver.ShutdownHook: Shutdown hook finished.

HMaster log

2019-11-28 10:01:23,193 INFO  [RpcServer.default.FPBQ.Fifo.handler=94,queue=4,port=16000] client.RpcRetryingCallerImpl: Call exception, tries=20, retries=106, started=229940 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not online on node3,16020,1551923919360
        at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3273)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3250)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1414)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2446)
        at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41998)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:131)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
, details=row 'ATLAS_ENTITY_AUDIT_EVENTS' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=node3,16020,1551860939615, seqNum=-1
2019-11-28 10:01:26,078 INFO  [RegionServerTracker-0] master.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [node4,16020,1551922317034]
......
......
2019-11-28 10:04:42,834 INFO  [RpcServer.default.FPBQ.Fifo.handler=99,queue=9,port=16000] client.RpcRetryingCallerImpl: Call exception, tries=6, retries=106, started=4349 ms ago, cancelled=false, msg=Call to node3/ip:16020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: syscall:getsockopt(..) failed: 拒绝连接: node3/ip:16020, details=row 'ATLAS_ENTITY_AUDIT_EVENTS' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=node3,16020,1551860939615, seqNum=-1
2019-11-28 10:04:46,867 INFO  [RpcServer.default.FPBQ.Fifo.handler=99,queue=9,port=16000] client.RpcRetryingCallerImpl: Call exception, tries=7, retries=106, started=8382 ms ago, cancelled=false, msg=Call to node3/ip:16020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: syscall:getsockopt(..) failed: 拒绝连接: node3/ip:16020, details=row 'ATLAS_ENTITY_AUDIT_EVENTS' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=node3,16020,1551860939615, seqNum=-1
2019-11-28 10:04:56,955 INFO  [RpcServer.default.FPBQ.Fifo.handler=99,queue=9,port=16000] client.RpcRetryingCallerImpl: Call exception, tries=8, retries=106, started=18470 ms ago, cancelled=false, msg=Call to node3/ip:16020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: syscall:getsockopt(..) failed: 拒绝连接: node3/ip:16020, details=row 'ATLAS_ENTITY_AUDIT_EVENTS' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=node3,16020,1551860939615, seqNum=-1

(In these entries, 拒绝连接 is the operating system's localized message for "Connection refused".)

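3. Root cause

After investigating, I determined that the problem was caused by the clocks on the HBase nodes being out of sync. Time synchronization was probably never set up when the cluster was first deployed, so the fix is simply to add it.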
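To confirm a diagnosis like this, it can help to compare each node's clock against the machine you are logged in to. The script below is a minimal sketch, not the procedure used in this post: it assumes passwordless ssh to the hosts passed as arguments and GNU `date` on every node. The function names and the `MAX_SKEW_MS` variable are made up for illustration; the 30000 ms default mirrors the default of HBase's `hbase.master.maxclockskew` setting. The measured difference also includes ssh latency, so treat small values as noise.

```shell
#!/usr/bin/env bash
# Sketch: report how far each node's clock is from this machine's clock.
# Assumption: passwordless ssh to every host given on the command line.

# Illustrative threshold; 30000 ms mirrors the default of hbase.master.maxclockskew.
MAX_SKEW_MS=${MAX_SKEW_MS:-30000}

# Print the absolute difference between two integers.
abs_diff() {
  local d=$(( $1 - $2 ))
  echo $(( d < 0 ? -d : d ))
}

# Print "OK <n>ms" or "SKEWED <n>ms" for two epoch-millisecond timestamps.
classify_skew() {
  local diff
  diff=$(abs_diff "$1" "$2")
  if [ "$diff" -gt "$MAX_SKEW_MS" ]; then
    echo "SKEWED ${diff}ms"
  else
    echo "OK ${diff}ms"
  fi
}

# Hosts are only contacted when some are passed as arguments.
for host in "$@"; do
  # %3N (milliseconds) is a GNU date extension.
  remote_ms=$(ssh "$host" 'date +%s%3N') || { echo "$host: unreachable"; continue; }
  local_ms=$(date +%s%3N)
  echo "$host: $(classify_skew "$local_ms" "$remote_ms")"
done
```

Saved as, say, `check_skew.sh`, it could be run as `./check_skew.sh node1 node2 node3`; any node reported as SKEWED is a candidate for the fix described in the next section.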

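4. Solution

All that is needed is to synchronize the clocks on every node. The steps below show how to do this with ntp.

  1. Install the ntp service
yum install ntp
  2. Enable ntpd at boot
chkconfig ntpd on
  3. Start the ntp service
service ntpd start
  4. Check the status of ntpd
service ntpd status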

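  5. With internet access
Synchronize against an internet time server (pick whichever one you prefer) by running the command below on every node. If ntpd is already running on a node, ntpdate cannot bind the NTP port; stop ntpd first or use ntpdate -u.

ntpdate ntp1.aliyun.com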

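  6. Without internet access

Pick the machine whose clock is closest to the real current time as the time server, then synchronize the other machines to it.

6.1 Only the machine acting as the time server needs to run ntpd; the other machines do not. Start it with:

service ntpd start

6.2 Run the sync command on each of the other machines in turn

ntpdate <time-server-ip>

Once these steps are done, the clocks are synchronized.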
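With more than a handful of machines, step 6.2 can be driven from one place instead of logging in to each node. The following is a rough sketch under assumptions the post does not state: passwordless ssh as a user allowed to set the clock (typically root), and ntpdate already installed on every client. The script name and the `sync_cmd` helper are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: run step 6.2 on several machines from a single host.
# Assumption: passwordless ssh as a user permitted to set the clock.

# Build the command each client runs; $1 is the time server's IP.
sync_cmd() {
  echo "ntpdate $1"
}

# Usage: ./sync_clients.sh <time-server-ip> <client-host>...
# Nothing is contacted unless a server and at least one client are given.
if [ "$#" -ge 2 ]; then
  time_server=$1
  shift
  for host in "$@"; do
    echo "syncing $host against $time_server"
    ssh "$host" "$(sync_cmd "$time_server")" || echo "$host: sync failed" >&2
  done
fi
```

Since ntpdate adjusts the clock only once per run, the same command would need to be repeated (for example from cron) if the clients are to stay in sync over time.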


Author: hnbian
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit hnbian as the source when reposting.