Need to look at the other types of compression:
https://www.cloudera.com/documentation/enterprise/5-3-x/topics/admin_data_compression_performance.html
Although bzip2 is splittable and gives the best compression ratio, it is also the slowest codec at both compressing and decompressing.
More Testing
Test both Spark and Hadoop Streaming, compressing a 182 MB file and a 14 GB file with the following codecs:
Snappy : org.apache.hadoop.io.compress.SnappyCodec
LZO : com.hadoop.compression.lzo.LzoCodec
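For the Spark side of the test, one way to re-write an existing file with one of the codecs above is `saveAsTextFile` with a codec class. A minimal sketch, assuming a running cluster with the Snappy native libraries installed; all HDFS paths are hypothetical:

```shell
# Sketch only: paths are placeholders, and LZO would need the separately
# shipped com.hadoop.compression.lzo.LzoCodec on the classpath.
spark-shell <<'EOF'
import org.apache.hadoop.io.compress.SnappyCodec
// Read the existing uncompressed file and write a Snappy-compressed copy
sc.textFile("/data/test/182mb_file")
  .saveAsTextFile("/data/test/182mb_file_snappy", classOf[SnappyCodec])
EOF
```

Swapping `SnappyCodec` for `com.hadoop.compression.lzo.LzoCodec` would cover the second codec in the test matrix.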
Scala Script to Iterate over Files in HDFS
https://community.hortonworks.com/questions/77130/how-to-iterate-multiple-hdfs-files-in-spark-scala.html
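A shell-side alternative to the Scala iteration linked above: `hdfs dfs -ls -C` prints one path per line, which can drive a loop over the files to compress. A sketch only (needs a cluster; the directory path is hypothetical):

```shell
# List every file under the test directory and act on each path in turn.
# Here the action is just an echo; a real run would submit a compression
# job per file (or per directory) instead.
hdfs dfs -ls -C /data/test | while read -r f; do
  echo "would compress: $f"
done
```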
Some useful links to sites offering solutions for compressing existing HDFS content in place:
Hadoop Streaming – compress files
https://stackoverflow.com/questions/7153087/hadoop-compress-file-in-hdfs
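The approach from the Stack Overflow link above is an identity, map-only streaming job: `/bin/cat` as the mapper, zero reducers, and compression enabled on the output, so the file is read back and written compressed. A sketch, assuming a running cluster; the streaming jar path and HDFS paths are hypothetical, and the original uncompressed file is left in place and must be removed separately:

```shell
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -Dmapreduce.output.fileoutputformat.compress=true \
  -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
  -Dmapreduce.job.reduces=0 \
  -input  /data/test/14gb_file \
  -output /data/test/14gb_file_snappy \
  -mapper /bin/cat
```

Note the map-only job also changes the file layout: the output directory holds one compressed part-file per input split rather than a single file.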
Another Streaming Example
You could use FUSE (mountable HDFS)
http://www.cloudera.com/documentation/cdh/5-1-x/CDH5-Installation-Guide/cdh5ig_hdfs_mountable.html
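With the CDH `hadoop-hdfs-fuse` package, HDFS can be mounted like a local filesystem, after which ordinary tools can compress files "in place". A sketch only: the NameNode host/port, mount point, and file path are hypothetical, and since HDFS files are write-once, tools like gzip effectively write a new file and remove the old one rather than rewriting in place:

```shell
# Mount HDFS via FUSE (per the Cloudera mountable-HDFS guide linked above)
sudo mkdir -p /mnt/hdfs
sudo hadoop-fuse-dfs dfs://namenode.example.com:8020 /mnt/hdfs

# HDFS now appears as a local tree, so standard tools apply, e.g.:
gzip /mnt/hdfs/data/test/182mb_file
```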