Goals

Replicate application data, metadata and system metadata to another cluster
Cross Replication between two or more active clusters
Replicate a single namespace or multiple namespaces to different clusters
Master-slave replication: the master is the source of truth, and all entity data and metadata are replicated to one or more slaves

...

HDFS:
1. Hadoop Distcp is a tool used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.
2. Hadoop Distributed Copy Command: http://hadoop.apache.org/docs/r1.2.1/distcp2.html
3. Cloudera Distcp page: https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_admin_distcp_data_cluster_migrate.html
4. HortonWorks: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.3/bk_Sys_Admin_Guides/content/using_distcp.html
5. How to copy data iteratively? What is the data quantum for each iterative copy? How do we define when replication is complete?
  - distcp has an option to copy specific files; could we copy individual files at certain time boundaries, for example at the end of each day?
  - distcp also has an -append option, which appends to a destination file if the source file is bigger than the destination file (sending only the diff).
  - There is also a -diff option, which copies the differences between two snapshots.
  - distcp performance analysis: https://developer.ibm.com/hadoop/2016/02/05/fast-can-data-transferred-hadoop-clusters-using-distcp/
6. Cloudera Manager Replication is built on a hardened version of distcp. It uses the scalability and availability of MapReduce and YARN to copy files in parallel, using a specialized MapReduce job or YARN application that runs diffs and transfers only changed files from each mapper to the replica side. Files are selected for copying based on their size and checksums. http://www.cloudera.com/documentation/enterprise/latest/topics/cm_bdr_about.html
7. Cloudera Manager provides some additional capabilities on top of distcp: https://blog.cloudera.com/blog/2016/08/considerations-for-production-environments-running-cloudera-backup-and-disaster-recovery-for-apache-hive-and-hdfs/
  Cloudera has introduced quite a few enhancements on top of Hadoop’s native tooling (such as Distcp):
  - Consistency guarantees via HDFS snapshots
  - Scheduled execution support
  - Support for multiple Kerberos realms
  - These three can also be achieved with distcp:
    - Constant-space parallelized copy listing (Distcp has a similar enhancement but does not optimize for disk space or memory consumption.)
    - Replication between clusters on different Hadoop versions (including different major versions i.e. CDH4/CDH5)
    - Selective path exclusion
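
The incremental-copy questions in item 5 can be made concrete with a small sketch that only assembles a distcp command line; it does not run Hadoop. It assumes that both -append and -diff are used together with -update, as the DistCp documentation describes. The function name `build_distcp_command`, the snapshot names, and the cluster URIs are hypothetical. Whether -append and -diff can be combined was not verified here, so the sketch conservatively rejects that combination.

```python
from typing import List, Optional


def build_distcp_command(
    source: str,
    target: str,
    from_snapshot: Optional[str] = None,
    to_snapshot: Optional[str] = None,
    append: bool = False,
) -> List[str]:
    """Return an argv list for one incremental `hadoop distcp` run (hypothetical helper)."""
    if (from_snapshot is None) != (to_snapshot is None):
        raise ValueError("from_snapshot and to_snapshot must be given together")
    if append and from_snapshot is not None:
        # Assumption of this sketch: keep the two incremental modes separate.
        raise ValueError("this sketch does not combine -append with -diff")

    # Both incremental modes are layered on top of -update.
    argv = ["hadoop", "distcp", "-update"]
    if append:
        # Append only the missing tail when the source file has grown.
        argv.append("-append")
    if from_snapshot is not None:
        # Copy only what changed between the two named snapshots.
        argv += ["-diff", from_snapshot, to_snapshot]
    argv += [source, target]
    return argv


if __name__ == "__main__":
    # Hypothetical daily quantum: one snapshot per day, copy the delta between them.
    cmd = build_distcp_command(
        "hdfs://source-nn:8020/data",
        "hdfs://target-nn:8020/data",
        from_snapshot="day-2016-08-01",
        to_snapshot="day-2016-08-02",
    )
    print(" ".join(cmd))
```

With one snapshot taken at each day boundary, "replication complete" for a given day could be defined as the successful exit of the run that copies the diff ending at that day's snapshot. This is one possible definition, not a settled design.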
HBase:
a. HBase supports replication to multiple clusters in multiple topologies. Documentation: http://hbase.apache.org/book.html#_cluster_replication
How to check that replication is complete when the customer is ready to switch over to the other cluster:
1. Check whether this replication metric can be used to determine the above:
  1. source.sizeOfLogQueue
    Number of WALs to process (excluding the one currently being processed) at the replication source
c. HBase’s CopyTable utility allows an efficient and scalable copy to a destination HBase cluster, while keeping the current table online.
1. CopyTable Usage steps from Pivotal: https://blog.pivotal.io/pivotal/products/migrating-an-apache-hbase-table-between-different-clusters
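
The switch-over check based on source.sizeOfLogQueue could be sketched roughly as follows. The metric name comes from the notes above. Everything else is an assumption: the payload shape (a dict with a "beans" list, as a region server's JMX JSON output might look after parsing) and the helper names `pending_wal_count` and `replication_queues_drained`. The metric excludes the WAL currently being processed, so an empty queue is a necessary condition here, not proof that replication is complete.

```python
from typing import Any, Dict, Iterable

QUEUE_METRIC = "source.sizeOfLogQueue"


def pending_wal_count(jmx_payload: Dict[str, Any]) -> int:
    """Sum the queued-WAL metric over all beans of one region server payload.

    Raises KeyError when no bean carries the metric, so that a missing
    metric is never mistaken for an empty queue.
    """
    total = 0
    found = False
    for bean in jmx_payload.get("beans", []):
        if QUEUE_METRIC in bean:
            found = True
            total += int(bean[QUEUE_METRIC])
    if not found:
        raise KeyError(QUEUE_METRIC)
    return total


def replication_queues_drained(payloads: Iterable[Dict[str, Any]]) -> bool:
    """True when every region server reports zero queued WALs."""
    return all(pending_wal_count(payload) == 0 for payload in payloads)


if __name__ == "__main__":
    # Hand-written sample payloads; real ones would come from each region server.
    servers = [
        {"beans": [{"name": "replication-bean", QUEUE_METRIC: 0}]},
        {"beans": [{"name": "replication-bean", QUEUE_METRIC: 2}]},
    ]
    print(replication_queues_drained(servers))
```

Because the WAL in flight is not counted, a switch-over procedure would probably need a second signal after writes to the source cluster have stopped; which signal to use is still open.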
Kafka:
FileSets
