Recent versions of HBase (0.94 has it) come with an option to "short-circuit" Hadoop when reading data. This is supposed to increase performance. I tried to activate it recently and ran into some issues, so I have decided to share how to activate it properly.
First, let's do a row count on a 10M-row table: 11m1.013s. This will be our baseline.
Basically, only 2 things are mandatory, and one more is recommended.
The first mandatory thing is to update your hdfs-site.xml to add something like this:
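A minimal sketch of that entry, assuming the legacy short-circuit implementation (the one behind the getBlockLocalPathInfo call that shows up in the error further down), where the property is named dfs.block.local-path-access.user. The value is an assumption and should match your own setup:

```xml
<!-- hdfs-site.xml (sketch): users allowed to read block files directly from local disk -->
<property>
  <name>dfs.block.local-path-access.user</name>
  <!-- assumption: the HBase processes run as the "hbase" user -->
  <value>hbase</value>
</property>
```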
Where hbase is the user ID running your HBase process. BUT... if you are running MR jobs under the hadoop user, and those MR jobs use HBase too, you will have to run those jobs as the HBase user instead, because otherwise they will be denied access to HDFS. Not doing so will give you errors like:
org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException: Can't continue with getBlockLocalPathInfo() authorization. The user hadoop is not allowed to call getBlockLocalPathInfo
The 2nd thing you need to do is to update your HBase configuration. In the hbase-site.xml file, add this:
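A minimal sketch of that entry, assuming the standard client-side switch dfs.client.read.shortcircuit is the one your Hadoop version uses:

```xml
<!-- hbase-site.xml (sketch): let the HBase DFS client read local blocks directly -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
```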
Depending on the way your users are configured, you might also need to add one user to the other user's group, with something like:
usermod -a -G hbase hadoop
With those 2 entries modified, you can restart HBase and Hadoop and try it out.
I re-ran the row count baseline and got this response time: 6m27.983s
That's 41% less time (about 388s instead of 661s). A significant improvement!
Now, there is another thing to look at. Hadoop maintains checksums on its side, and it's recommended to deactivate that and move the checksum verification to the HBase side.
This is done by updating hbase-site.xml to add:
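A minimal sketch of that entry, assuming the HBase-level checksum switch available since 0.94, hbase.regionserver.checksum.verify:

```xml
<!-- hbase-site.xml (sketch): verify checksums stored inside HFile blocks -->
<property>
  <name>hbase.regionserver.checksum.verify</name>
  <value>true</value>
</property>
```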
This tells HBase to verify the data checksums itself instead of asking Hadoop to do it, which reduces I/O.
I ran a major compaction again to make sure all the checksums are written by HBase, and the new response time is now: 5m56.803s
Which is now 46% faster than the baseline!
Based on those results, I highly recommend activating this if the versions of HBase and Hadoop you are using permit it.