Need to look at the other types of compression:
https://www.cloudera.com/documentation/enterprise/5-3-x/topics/admin_data_compression_performance.html
Although bzip2 is splittable and gives the best compression ratio, it is also the slowest codec at both compressing and decompressing.
More Testing
Test both Spark and Hadoop Streaming, compressing a 182 MB file and a 14 GB file with the following codecs:
Snappy : org.apache.hadoop.io.compress.SnappyCodec
LZO : com.hadoop.compression.lzo.LzoCodec
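For the Spark side of the test, one way to re-write an existing file with one of the codecs above is `saveAsTextFile` with a codec class. A minimal sketch, assuming a running cluster with the Snappy native libraries installed; all HDFS paths are hypothetical:

```shell
# Sketch only: paths are placeholders, and LZO would need the separately
# shipped com.hadoop.compression.lzo.LzoCodec on the classpath.
spark-shell <<'EOF'
import org.apache.hadoop.io.compress.SnappyCodec
// Read the existing uncompressed file and write a Snappy-compressed copy
sc.textFile("/data/test/182mb_file")
  .saveAsTextFile("/data/test/182mb_file_snappy", classOf[SnappyCodec])
EOF
```

Swapping `SnappyCodec` for `com.hadoop.compression.lzo.LzoCodec` would cover the second codec in the test matrix.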
Scala Script to Iterate over Files in HDFS
https://community.hortonworks.com/questions/77130/how-to-iterate-multiple-hdfs-files-in-spark-scala.html
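A shell-side alternative to the Scala iteration linked above: `hdfs dfs -ls -C` prints one path per line, which can drive a loop over the files to compress. A sketch only (needs a cluster; the directory path is hypothetical):

```shell
# List every file under the test directory and act on each path in turn.
# Here the action is just an echo; a real run would submit a compression
# job per file (or per directory) instead.
hdfs dfs -ls -C /data/test | while read -r f; do
  echo "would compress: $f"
done
```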
Some useful links to sites offering solutions for compressing existing HDFS content in place:
Hadoop Streaming – compress files
https://stackoverflow.com/questions/7153087/hadoop-compress-file-in-hdfs
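The approach from the Stack Overflow link above is an identity, map-only streaming job: `/bin/cat` as the mapper, zero reducers, and compression enabled on the output, so the file is read back and written compressed. A sketch, assuming a running cluster; the streaming jar path and HDFS paths are hypothetical, and the original uncompressed file is left in place and must be removed separately:

```shell
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -Dmapreduce.output.fileoutputformat.compress=true \
  -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
  -Dmapreduce.job.reduces=0 \
  -input  /data/test/14gb_file \
  -output /data/test/14gb_file_snappy \
  -mapper /bin/cat
```

Note the map-only job also changes the file layout: the output directory holds one compressed part-file per input split rather than a single file.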
Another Streaming Example
You could use FUSE (mountable HDFS)
http://www.cloudera.com/documentation/cdh/5-1-x/CDH5-Installation-Guide/cdh5ig_hdfs_mountable.html
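With the CDH `hadoop-hdfs-fuse` package, HDFS can be mounted like a local filesystem, after which ordinary tools can compress files "in place". A sketch only: the NameNode host/port, mount point, and file path are hypothetical, and since HDFS files are write-once, tools like gzip effectively write a new file and remove the old one rather than rewriting in place:

```shell
# Mount HDFS via FUSE (per the Cloudera mountable-HDFS guide linked above)
sudo mkdir -p /mnt/hdfs
sudo hadoop-fuse-dfs dfs://namenode.example.com:8020 /mnt/hdfs

# HDFS now appears as a local tree, so standard tools apply, e.g.:
gzip /mnt/hdfs/data/test/182mb_file
```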