Hadoop – Migrate Hive Database Files

When working on a Hadoop ecosystem where you have tables stored in Hive, there may be times when you need to copy/migrate the Hive data files from one cluster to another.

Simply copying the files between the HDFS file systems of the source and target clusters by hand will not work.

A simple solution here is to use the command-line tool "distcp", which uses MapReduce to copy data from source to target.

The steps are:

  • Create a migration script similar to the one below. You can generate this from the MySQL metastore using SQL such as this:

select concat('hadoop distcp ', db_location_uri, '/', tbl_name, ' ', 'hdfs://newhost:8020/apps/hive/warehouse/test.db/', tbl_name)
from hive.DBS db,
     hive.TBLS tbl
where db.DB_ID = tbl.DB_ID
and db.name = 'test.db'

With the output now looking like this:

hadoop distcp hdfs://oldhost:8020/apps/hive/warehouse/rms.db/deps_tmp hdfs://newhost:8020/apps/hive/warehouse/test.db/deps_tmp
hadoop distcp hdfs://oldhost:8020/apps/hive/warehouse/rms.db/class_tmp hdfs://newhost:8020/apps/hive/warehouse/test.db/class_tmp
hadoop distcp hdfs://oldhost:8020/apps/hive/warehouse/rms.db/subclass_tmp hdfs://newhost:8020/apps/hive/warehouse/test.db/subclass_tmp
hadoop distcp hdfs://oldhost:8020/apps/hive/warehouse/rms.db/rpm_zone_location_tmp hdfs://newhost:8020/apps/hive/warehouse/test.db/rpm_zone_location_tmp
hadoop distcp hdfs://oldhost:8020/apps/hive/warehouse/rms.db/uda_values_tmp hdfs://newhost:8020/apps/hive/warehouse/test.db/uda_values_tmp
hadoop distcp hdfs://oldhost:8020/apps/hive/warehouse/rms.db/diff_ids_tmp hdfs://newhost:8020/apps/hive/warehouse/test.db/diff_ids_tmp
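If querying the metastore directly is not convenient, a similar command list can be built straight from the shell. This is only a sketch: the source and target URIs are the example hosts used above, and the table list is illustrative — substitute your own values.

```shell
#!/bin/sh
# Build one "hadoop distcp" command per table into a script file.
# The cluster URIs below are the example hosts from this post.
SRC="hdfs://oldhost:8020/apps/hive/warehouse/rms.db"
TGT="hdfs://newhost:8020/apps/hive/warehouse/test.db"

# Illustrative table names; in practice feed in your real table list.
for tbl in deps_tmp class_tmp subclass_tmp; do
  echo "hadoop distcp ${SRC}/${tbl} ${TGT}/${tbl}"
done > migrate_tables.sh

chmod +x migrate_tables.sh
```

Each line of the generated migrate_tables.sh is then identical in form to the commands shown above.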

  • Insert the distcp commands into a shell script and run it in the background. This will slowly copy all the files associated with each table in the database to the new cluster.
  • If firewalls are in place, be sure to open ports 8020 (NameNode) and 50010 (DataNode) between the source and target clusters.
  • Once the files are copied, create the tables in the new cluster, i.e. run the CREATE TABLE DDL; this places the schema over the copied files.
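Running the copy in the background can be as simple as the sketch below; migrate_tables.sh is a placeholder name for the script holding the distcp commands, and the log file name is arbitrary.

```shell
# Launch the distcp script in the background and capture all output,
# so the copy keeps running after you log out of the shell.
nohup sh migrate_tables.sh > migrate_tables.log 2>&1 &
echo "Migration started in background, pid $!"
```

Progress can then be followed with `tail -f migrate_tables.log`.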

You should now find that all your data files have been copied, and the data is available for query in the new cluster via HiveQL.
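As a final check, the DDL-over-files step and verification might look something like the following in HiveQL. The table name matches the examples above, but the column definitions are purely illustrative and must match your real schema.

```sql
-- Hypothetical schema: replace the columns with your actual definitions.
CREATE TABLE IF NOT EXISTS test.deps_tmp (
  dept_id INT,
  dept_name STRING
)
LOCATION 'hdfs://newhost:8020/apps/hive/warehouse/test.db/deps_tmp';

-- Confirm the copied files are visible through the new table.
SELECT COUNT(*) FROM test.deps_tmp;
```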
