Goals
- Replicate application data, metadata and system metadata to another cluster
- Cross Replication between two or more active clusters
- Replicate a single namespace or multiple spaces to different clusters
- Master-Slave replication, source of thruth truth is Master and all entities data and metadata replicated to 1 or more slaves
...
HDFS:
- Hadoop Distcp is a tool used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.
Hadoop Distributed Copy Command: http://hadoop.apache.org/docs/r1.2.1/distcp2.html
Cloudera Distcp page: https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_admin_distcp_data_cluster_migrate.html
HortonWorks: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.3/bk_Sys_Admin_Guides/content/using_distcp.html
How to iteratively copy data? What is data quantum to copy data iteratively? How to define replication complete ?
- distcp allows an option to copy files, could we copy individual files at certain time boundaries ? End of each day ?
- distcp also allows -append option which can append to a destination file if the source file is bigger than the destination file. [only sending the diff.]
- There is also another -diff snapshot option to copy differences of two snapshots.
- distcp performace analysis: https://developer.ibm.com/hadoop/2016/02/05/fast-can-data-transferred-hadoop-clusters-using-distcp/Cloudera Manager Replication is built on a hardened version of distcp. It uses the scalability and availability of MapReduce and YARN to copy files in parallel, using a specialized MapReduce job or YARN application that runs diffs and transfers only changed files from each mapper to the replica side. Files are selected for copying based on their size and checksums. http://www.cloudera.com/documentation/enterprise/latest/topics/cm_bdr_about.html
Cloudera Manager provides some additional capabilities on top of distcp, they are: https://blog.cloudera.com/blog/2016/08/considerations-for-production-environments-running-cloudera-backup-and-disaster-recovery-for-apache-hive-and-hdfs/
Cloudera has introduced quite a few enhancements on top of Hadoop’s native tooling (such as Distcp) :
Consistency guarantees via HDFS snapshots
Scheduled execution support
Support for multiple Kerberos realm
These three can actually be achieved from distcp also:
Constant-space parallelized copy listing (Distcp has a similar enhancement but does not optimize for disk space or memory consumption.)
Replication between clusters on different Hadoop versions (including different major versions i.e. CDH4/CDH5)
Selective path exclusion
HBase:
a. HBase Supports replication to multiple clusters in multiple topologies. Documentation: http://hbase.apache.org/book.html#_cluster_replication
How to check Replication is complete when customer is ready to switch over the cluster:Check if this replication metric can be used to determine the above:
source.sizeOfLogQueue
number of WALs to process (excludes the one which is being processed) at the Replication source
- CopyTable Usage steps from Pivotal: https://blog.pivotal.io/pivotal/products/migrating-an-apache-hbase-table-between-different-clusters
- Kafka:
- FileSets