Several Ways to Migrate HBase Data


Published: 2019-07-26


1. CopyTable

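HBase's CopyTable is a utility for copying one table to another. It can copy a table within the same HBase cluster or between different HBase clusters. CopyTable can restrict the copy by timestamp range, row-key range, number of versions, and column family.

CopyTable is mainly used for:

Table backup: copy a table's data to another table so that it can be restored when needed.
Data migration: copy tables between different HBase clusters, either to migrate data or to take table-level backups.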

[root@node1 ~]# hbase org.apache.hadoop.hbase.mapreduce.CopyTable
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.0.0-78/phoenix/phoenix-5.0.0.3.1.0.0-78-server.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.0.0-78/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Usage: CopyTable [general options] [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] <tablename>

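Parameter description

rs.class	hbase.regionserver.class of the target cluster; only needed when copying across clusters
rs.impl	hbase.regionserver.impl of the target cluster
startrow	the start row
stoprow	the stop row
starttime	beginning of the time range (Unix time in milliseconds); without an end time, the range runs from the start time onwards
endtime	end of the time range; ignored if no start time is specified
versions	number of cell versions to copy
new.name	name of the target table the data is copied to
peer.adr	address of the target cluster, in the format hbase.zookeeper.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent
families	comma-separated list of column families to copy; to copy from cf1 to cf2, give sourceCfName:destCfName; to keep the same name, just give "cfName"
all.cells	also copy delete markers and deleted cells
bulkload	write the input into HFiles and bulk load them into the target table
tablename	name of the table to copy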

Examples:
 To copy 'TestTable' to a cluster that uses replication for a 1 hour window:
 $ hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1265875194289 --endtime=1265878794289 --peer.adr=server1,server2,server3:2181:/hbase --families=myOldCf:myNewCf,cf2,cf3 TestTable
For performance consider the following general option:
  It is recommended that you set the following to >=100. A higher value uses more memory but
  decreases the round trip time to the server and may increase performance.
    -Dhbase.client.scanner.caching=100
  The following should always be set to false, to prevent writing data twice, which may produce
  inaccurate results.
    -Dmapreduce.map.speculative=false
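
The --starttime and --endtime values are Unix timestamps in milliseconds, which are awkward to type by hand. The following is a minimal sketch of deriving a one-hour window in the shell. It assumes GNU date; the window start (2019-06-21 09:00:00 UTC) and the helper name to_millis are illustrative choices, not part of the original notes. The command is only printed, so it can be reviewed before it is run.

```shell
# Convert a UTC wall-clock time to Unix epoch milliseconds (GNU date assumed).
to_millis() {
  echo "$(( $(date -u -d "$1" +%s) * 1000 ))"
}

start_ms=$(to_millis "2019-06-21 09:00:00")   # illustrative window start
end_ms=$(( start_ms + 60 * 60 * 1000 ))       # one hour later

# Print the CopyTable command instead of running it.
echo "hbase org.apache.hadoop.hbase.mapreduce.CopyTable" \
     "--starttime=${start_ms} --endtime=${end_ms}" \
     "--families=info --new.name=copy3 test:users"
```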

create 'copy2users',{NAME => 'info', TTL=>'604800',COMPRESSION => 'SNAPPY',VERSIONS => 1,BLOCKCACHE => true}

Copy data to another table in the same cluster
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --families=test:users:info copy2users


hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=node1,node2,node3:2181:/hbase --families=users:info copy2users

hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=Student copy2users
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=node1,node2,node3:2181:/hbase-unsecure --new.name=test:users --families=info copy2users




hbase org.apache.hadoop.hbase.mapreduce.CopyTable  --new.name=<target_table>  <source_table>

hbase org.apache.hadoop.hbase.mapreduce.CopyTable  --new.name=copy2users test:users


hbase org.apache.hadoop.hbase.mapreduce.CopyTable  --new.name=copy2 --families=Sage Student


hbase org.apache.hadoop.hbase.mapreduce.CopyTable --families=Sage,Sname  --new.name=copy2  Student

put 'copy2','Sage','grade:','5'



Map input records=8133896
    Map output records=8133896
    Input split bytes=962
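
A quick sanity check on a CopyTable run is that the map input and map output record counters agree. The sketch below parses the two counter lines shown above; the helper name counter is hypothetical, and in practice job_log would hold the saved job output rather than an inline string.

```shell
# The two counter lines as they appear in the job output above.
job_log='Map input records=8133896
    Map output records=8133896'

# Print the number that follows "<counter name>=" in the log.
counter() {
  printf '%s\n' "$job_log" | sed -n "s/^ *$1=\([0-9][0-9]*\)\$/\1/p"
}

in_records=$(counter "Map input records")
out_records=$(counter "Map output records")

if [ "$in_records" -eq "$out_records" ]; then
  echo "counters match: $in_records"
else
  echo "mismatch: $in_records read, $out_records written"
fi
```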

create 'copy2',{NAME => 'Sage', TTL=>'604800',COMPRESSION => 'SNAPPY',VERSIONS => 1,BLOCKCACHE => true},{NAME => 'Sname', TTL=>'604800',COMPRESSION => 'SNAPPY',VERSIONS => 1,BLOCKCACHE => true}

create 'copy1',{NAME => 'cf1', TTL=>'604800',COMPRESSION => 'SNAPPY',VERSIONS => 1,BLOCKCACHE => true}
put 'copy1','rk001','cf1:c1','001'
put 'copy1','rk002','cf1:c1','002'
put 'copy1','rk003','cf1:c1','003'
put 'copy1','rk004','cf1:c1','004'

create 'copy2',{NAME => 'cf1', TTL=>'604800',COMPRESSION => 'SNAPPY',VERSIONS => 1,BLOCKCACHE => true}

hbase org.apache.hadoop.hbase.mapreduce.CopyTable  --new.name=copy2 copy1

# Copy all data
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --families=info  --new.name=copy3 test:users

# Create the target table
create 'copy3',{NAME => 'info', TTL=>'604800',COMPRESSION => 'SNAPPY',VERSIONS => 1,BLOCKCACHE => true},{NAME => 'header', TTL=>'604800',COMPRESSION => 'SNAPPY',VERSIONS => 1,BLOCKCACHE => true}

create 'copy3',{NAME=>'info',COMPRESSION=>'SNAPPY',VERSIONS=>1,BLOCKCACHE=>true},{NAME=>'header',COMPRESSION=>'SNAPPY',VERSIONS=>1,BLOCKCACHE=>true},SPLITS=>['20190621|','20190622|','20190623|','20190624|','20190625|']
create 'copy4',{NAME=>'info',COMPRESSION=>'SNAPPY',VERSIONS=>1,BLOCKCACHE=>true},{NAME=>'header',COMPRESSION=>'SNAPPY',VERSIONS=>1,BLOCKCACHE=>true}
create 'copy5',{NAME=>'info',COMPRESSION=>'SNAPPY',VERSIONS=>1,BLOCKCACHE=>true}
create 'copy6',{NAME => 'info', TTL=>'604800',COMPRESSION => 'SNAPPY',VERSIONS => 1,BLOCKCACHE => true},{NAME => 'header', TTL=>'604800',COMPRESSION => 'SNAPPY',VERSIONS => 1,BLOCKCACHE => true} # failed

create 'copy6',{NAME => 'info', COMPRESSION => 'SNAPPY',VERSIONS => 1,BLOCKCACHE => true},{NAME => 'header', COMPRESSION => 'SNAPPY',VERSIONS => 1,BLOCKCACHE => true}

create 'copy7',{NAME => 'info', COMPRESSION => 'SNAPPY',VERSIONS => 1,BLOCKCACHE => true},{NAME => 'header', COMPRESSION => 'SNAPPY',VERSIONS => 1,BLOCKCACHE => true}
create 'copy8',{NAME => 'info', TTL=>'6048000',COMPRESSION => 'SNAPPY',VERSIONS => 1,BLOCKCACHE => true},{NAME => 'header', TTL=>'6048000',COMPRESSION => 'SNAPPY',VERSIONS => 1,BLOCKCACHE => true} 


# Truncate the table (deletes all data; use with caution)
truncate 'copy3'


hbase org.apache.hadoop.hbase.mapreduce.CopyTable  --families=info  --new.name=copy3 test:users 


# Copy part of the data by specifying start and stop row keys

hbase org.apache.hadoop.hbase.mapreduce.CopyTable --startrow=20190621090001000-102 --stoprow=20190621090001000-115  --families=info  --new.name=copy3 test:users 

hbase org.apache.hadoop.hbase.mapreduce.CopyTable --startrow=20190621090001000-102 --stoprow=20190621090001000-115  --families=info  --new.name=copy4 test:users 

hbase org.apache.hadoop.hbase.mapreduce.CopyTable --startrow=20190621090001000-102 --stoprow=20190621090001000-115  --families=info  --new.name=copy5 test:users 

hbase org.apache.hadoop.hbase.mapreduce.CopyTable --startrow=20190621090001000-102 --stoprow=20190621090001000-115  --families=info  --new.name=copy6 test:users 

# Reason for the failure: the table was created with a TTL, and the copied data was older than that TTL, so it was deleted automatically
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --startrow=20190621090001000-102 --stoprow=20190621090001000-115  --families=info  --new.name=copy7 test:users 
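
CopyTable keeps the original cell timestamps, and a column family's TTL is measured from the cell timestamp, so old data copied into a table with a short TTL is already expired on arrival. The sketch below works through that arithmetic. The two timestamps are assumptions for illustration (the row keys suggest data from 2019-06-21, and the notes were taken around 2019-07-16); ttl_status is a hypothetical helper, and GNU date is assumed.

```shell
# Report whether a cell of the given age (seconds) is past the given TTL (seconds).
ttl_status() {
  if [ "$1" -gt "$2" ]; then
    echo "expired"
  else
    echo "kept"
  fi
}

# Assumed values: cell written 2019-06-21 09:00:01, copy run 2019-07-16 16:00:00 (UTC).
cell_ts=$(date -u -d "2019-06-21 09:00:01" +%s)
copy_ts=$(date -u -d "2019-07-16 16:00:00" +%s)
age=$(( copy_ts - cell_ts ))

ttl_status "$age" 604800     # TTL of copy6: 7 days
ttl_status "$age" 6048000    # TTL of copy8: 70 days
```

With these assumed dates the data is about 25 days old, so it falls outside the 7-day TTL but inside the 70-day one.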

# Count the copied rows
hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'copy3'

2. Export and Import

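In HBase, Export and Import are two utilities for backing up and restoring table data. Export writes an HBase table's data to files on the Hadoop Distributed File System (HDFS), and Import restores the data from those files when needed.

Export: a MapReduce tool that exports the data of an HBase table to HDFS. The output is stored as SequenceFiles in which each record holds a row key and the contents of that row. Export can restrict the data by timestamp range, number of versions, and a row-key filter.
Import: a MapReduce tool that loads data written by Export into an HBase table. The data can go into an existing table or into a newly created one.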

[root@node1 ~]# hbase org.apache.hadoop.hbase.mapreduce.Export --help
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.0.0-78/phoenix/phoenix-5.0.0.3.1.0.0-78-server.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.0.0-78/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
ERROR: Wrong number of arguments: 1

Usage: Export [-D <property=value>]* <tablename> <outputdir> [<versions> [<starttime> [<endtime>]] [^[regex pattern] or [Prefix] to filter]]

  Note: -D properties will be applied to the conf used.

-D mapreduce.output.fileoutputformat.compress=true	whether to compress the output
-D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec	compression codec
-D mapreduce.output.fileoutputformat.compress.type=BLOCK	how the output SequenceFiles are compressed: NONE, RECORD or BLOCK

Scan parameters

-D hbase.mapreduce.scan.column.family=<family1>,<family2>, ...	column families to export
-D hbase.mapreduce.include.deleted.rows=true
-D hbase.mapreduce.scan.row.start=<ROWSTART>	start row key
-D hbase.mapreduce.scan.row.stop=<ROWSTOP>	stop row key
-D hbase.client.scanner.caching=100	number of rows cached by the client scanner
-D hbase.export.visibility.labels=<labels>	visibility labels

For tables with very wide rows, consider setting the batch size as follows

-D hbase.export.scanner.batch=10	number of cells fetched per batch
-D hbase.export.scanner.caching=100
-D mapreduce.job.name=jobName	use the specified MapReduce job name for the export

MapReduce settings

-D mapreduce.map.speculative=false	speculative execution of map tasks
-D mapreduce.reduce.speculative=false	speculative execution of reduce tasks

2.1 Exporting data

hbase org.apache.hadoop.hbase.mapreduce.Export <tableName> <output_hdfs_path> <versions> <starttime> <endtime> 

hbase org.apache.hadoop.hbase.mapreduce.Export copy5 /user/root/copytable/copy5
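
The -D properties listed earlier go before the table name. As a sketch of how the compression options combine with the positional arguments, the snippet below assembles such a command and prints it without running it. The output directory copy5_gz is a hypothetical path, and versions=1 simply mirrors the VERSIONS => 1 setting of the tables above.

```shell
table=copy5
out_dir=/user/root/copytable/copy5_gz    # hypothetical output directory
versions=1

# Build the command piece by piece: -D options first, then the positional arguments.
export_cmd="hbase org.apache.hadoop.hbase.mapreduce.Export"
export_cmd="$export_cmd -D mapreduce.output.fileoutputformat.compress=true"
export_cmd="$export_cmd -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec"
export_cmd="$export_cmd -D mapreduce.output.fileoutputformat.compress.type=BLOCK"
export_cmd="$export_cmd $table $out_dir $versions"

echo "$export_cmd"
```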

Check whether the export succeeded.
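
One way to do that check from the shell: by default a successful MapReduce job leaves an empty _SUCCESS marker file in its output directory, so testing for that file is a reasonable first check before listing the part files. This is a sketch; export_succeeded is a hypothetical helper name, and it assumes the hdfs client is on the PATH.

```shell
# Succeeds if the job left its _SUCCESS marker in the given output directory.
export_succeeded() {
  hdfs dfs -test -e "$1/_SUCCESS"
}

# Usage on a node with an HDFS client (path taken from the export above):
#   export_succeeded /user/root/copytable/copy5 && hdfs dfs -ls /user/root/copytable/copy5
```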

2.2 Importing data

[root@node1 ~]# hbase org.apache.hadoop.hbase.mapreduce.Import --help
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.0.0-78/phoenix/phoenix-5.0.0.3.1.0.0-78-server.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.0.0-78/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
ERROR: Wrong number of arguments: 1

Usage:

 Import [options] <tablename> <inputdir>


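# By default Import loads data directly into HBase. To generate HFiles instead, in preparation for a bulk load, pass the option:
  -Dimport.bulk.output=/path/for/output # path for the output files

# If a large result contains too many Cells, the in-memory sort in the reducer can cause an OOME; in that case pass the option:
  -D import.bulk.hasLargeResult=true

 To apply a generic org.apache.hadoop.hbase.filter.Filter to the input, use
  -Dimport.filter.class=<name of filter class>
  -Dimport.filter.args=<comma separated list of args for filter>

 Note: the filter is applied before key renames are done via the HBASE_IMPORTER_RENAME_CFS property. Further, filters only use the Filter#filterRowKey(byte[] buffer, int offset, int length) method to decide whether the current row should be skipped entirely, and the Filter#filterCell(Cell) method to decide whether a Cell should be added; Filter.ReturnCode#INCLUDE and #INCLUDE_AND_NEXT_COL are treated as including the Cell.

To import data exported from HBase 0.94, use
  -Dhbase.import.version=0.94
  -D mapreduce.job.name=jobName  use the specified MapReduce job name for the import

For performance consider the following options:
  -Dmapreduce.map.speculative=false # speculative execution of map tasks
  -Dmapreduce.reduce.speculative=false # speculative execution of reduce tasks
  -Dimport.wal.durability=<Used while writing data to hbase. Allowed values are the supported durability values like SKIP_WAL/ASYNC_WAL/SYNC_WAL/...> # durability used when writing the data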



 hbase org.apache.hadoop.hbase.mapreduce.Import  copy6  /user/root/copytable/copy5

 hbase org.apache.hadoop.hbase.mapreduce.Import -Dimport.bulk.output=/user/root/copytable/copy7/output  copy7  /user/root/copytable/copy5
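
With -Dimport.bulk.output the Import job only writes HFiles; they still have to be loaded into the table in a separate step. The sketch below prints a plausible follow-up command. It assumes an HBase 2.x installation, where the bulk-load tool is the class org.apache.hadoop.hbase.tool.LoadIncrementalHFiles; the class name differs in other versions, so treat it as an assumption to check against the installed release.

```shell
hfile_dir=/user/root/copytable/copy7/output   # HFile directory from the Import command above
table=copy7

# Assumed tool class for HBase 2.x; printed for review rather than executed.
load_cmd="hbase org.apache.hadoop.hbase.tool.LoadIncrementalHFiles $hfile_dir $table"
echo "$load_cmd"
```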

truncate 'copy5'


hbase org.apache.hadoop.hbase.mapreduce.Export copy3 /user/root/copytable/copy3
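8133896 rows exported
8133896 / 246 = 33064.6179 rows per second (246 appears to be the elapsed time in seconds)

1172760639 / 33064 = 35469.412 seconds to export 1172760639 rows at that rate
35469 / 60 / 60 = 9.8525 hours

2019-07-16 16:29:43 - 2019-07-16 16:34:04 = 4'21''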
-rw-r--r--   1 root hdfs      2.8 G 2019-07-16 16:31 /user/root/copytable/copy3/part-m-00000
-rw-r--r--   1 root hdfs      1.6 G 2019-07-16 16:34 /user/root/copytable/copy3/part-m-00001
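
The figures noted after the copy3 export read as a throughput estimate, and the same arithmetic can be reproduced in the shell. The interpretation is an assumption: 246 is taken to be elapsed seconds and 1172760639 the row count of a larger table to be migrated; neither is stated explicitly in the notes.

```shell
rows_exported=8133896      # rows in the copy3 export
elapsed_seconds=246        # assumed: elapsed seconds used in the notes
total_rows=1172760639      # assumed: row count of the full table to migrate

# Integer division, matching the truncated values used in the notes.
rows_per_second=$(( rows_exported / elapsed_seconds ))
estimated_seconds=$(( total_rows / rows_per_second ))
estimated_hours=$(awk -v s="$estimated_seconds" 'BEGIN { printf "%.2f", s / 3600 }')

echo "rate: ${rows_per_second} rows/s"
echo "estimate: ${estimated_seconds} s (about ${estimated_hours} h)"
```

Under these assumptions a full export at the measured rate would take a little under ten hours.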


 hbase org.apache.hadoop.hbase.mapreduce.Import copy5 /user/root/copytable/copy3
2019-07-16 16:36:14 - 2019-07-16 16:41:48 = 5'34''

create 'copy9',{NAME=>'info',COMPRESSION=>'SNAPPY',VERSIONS=>1,BLOCKCACHE=>true},{NAME=>'header',COMPRESSION=>'SNAPPY',VERSIONS=>1,BLOCKCACHE=>true},SPLITS=>['20190621|','20190622|','20190623|','20190624|','20190625|']

 hbase org.apache.hadoop.hbase.mapreduce.Import copy9 /user/root/copytable/copy3
2019-07-16 16:45:38 - 2019-07-16 16:55:23  10'
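
The durations above were worked out by hand from the job start and end times. A small sketch that does the subtraction, assuming GNU date; elapsed_seconds and as_min_sec are hypothetical helper names.

```shell
# Seconds between two "YYYY-MM-DD HH:MM:SS" timestamps (GNU date assumed).
elapsed_seconds() {
  echo $(( $(date -u -d "$2" +%s) - $(date -u -d "$1" +%s) ))
}

# Format a number of seconds in the minutes'seconds'' notation used in these notes.
as_min_sec() {
  printf "%d'%02d''\n" $(( $1 / 60 )) $(( $1 % 60 ))
}

as_min_sec "$(elapsed_seconds "2019-07-16 16:36:14" "2019-07-16 16:41:48")"   # Import into copy5
as_min_sec "$(elapsed_seconds "2019-07-16 16:45:38" "2019-07-16 16:55:23")"   # Import into copy9
```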

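3. Snapshot

A snapshot is a lightweight way to back up and restore table data in HBase. It captures the state of a table at a point in time without affecting the online service. A snapshot does not copy data; it records the table's metadata and references to its data files as of that moment. Because only metadata and references are stored, taking a snapshot is very fast and uses very little storage.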

# 1. Create a snapshot
snapshot 'copy8', 'snapshot_copy8'

# 2. List snapshots
list_snapshots

# 3. Export the snapshot
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot MySnapshot -copy-to hdfs://srv2:8020/hbase -mappers 16

hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot snapshot_copy8 -copy-to hdfs://node1:50070/apps/hbase/data  
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot snapshot_copy8 \
-copy-from hdfs://node1:8082/apps/hbase/data  \
-copy-to hdfs://node1:50070/apps/hbase/data  

hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
    -snapshot MySnapshot -copy-from hdfs://srv2:8082/hbase \
    -copy-to hdfs://node1:50070/hbase -mappers 16 -bandwidth 1024


put 'copy8','999','info:tt','999999'

# 4. Restore the snapshot (the table must be disabled first, then enabled again afterwards)
disable 'copy8' 
restore_snapshot 'snapshot_copy8'
enable 'copy8'

usage: hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot

Options:

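--snapshot	Snapshot to restore.
--copy-to	Remote destination to copy the data to, hdfs://
--copy-from	Input folder, hdfs:// (default hbase.rootdir)
--target	Target name for the snapshot.
--no-checksum-verify	Do not verify checksum, use name+length only.
--no-target-verify	Do not verify the integrity of the exported snapshot.
--overwrite	Rewrite the snapshot manifest if it already exists.
--chuser	Change the owner of the files to the specified one.
--chgroup	Change the group of the files to the specified one.
--chmod	Change the permission of the files to the specified one.
--mappers	Number of mappers to use during the copy (mapreduce.job.maps).
--bandwidth	Limit bandwidth to this value, in MB/second.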

Examples:
  hbase snapshot export \
    --snapshot MySnapshot --copy-to hdfs://srv2:8082/hbase \
    --chuser MyUser --chgroup MyGroup --chmod 700 --mappers 16

  hbase snapshot export \
    --snapshot MySnapshot --copy-from hdfs://srv2:8082/hbase \
    --copy-to hdfs://node1:50070/hbase

4. distcp

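The steps for migrating HBase data with the DistCp (distributed copy) tool are as follows:

Stop HBase: before migrating, make sure the HBase service is stopped, so that no new data is written during the migration.
Export the HBase data to HDFS: use the hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot command to export an HBase snapshot to HDFS. Assuming the snapshot is named mysnapshot and the export directory is /hbase-export: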

hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot mysnapshot -copy-to hdfs://namenode:port/hbase-export

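Copy the data with DistCp: run the following command on the source cluster to copy the data to the target cluster's HDFS. Assuming the target cluster's HDFS address is hdfs://target-namenode:port/hbase-import: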

hadoop distcp hdfs://namenode:port/hbase-export hdfs://target-namenode:port/hbase-import

Import the data into HBase on the target cluster: on the target cluster, use the hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot command to bring the data from HDFS into HBase:

hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot mysnapshot -copy-from hdfs://target-namenode:port/hbase-import -copy-to hdfs://target-namenode:port/hbase
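
The three commands of this workflow differ only in the snapshot name and the two HDFS addresses, so they can be generated from those values. The sketch below prints them for review rather than running them; print_migration_plan is a hypothetical helper, and the namenode:port values are the same placeholders used above.

```shell
# Print the export, distcp and import commands for a snapshot and two HDFS URIs.
print_migration_plan() {
  snapshot=$1
  source_fs=$2
  target_fs=$3
  echo "hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot $snapshot -copy-to $source_fs/hbase-export"
  echo "hadoop distcp $source_fs/hbase-export $target_fs/hbase-import"
  echo "hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot $snapshot -copy-from $target_fs/hbase-import -copy-to $target_fs/hbase"
}

print_migration_plan mysnapshot hdfs://namenode:port hdfs://target-namenode:port
```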