Tuesday, March 26, 2013

Compactions Q&A

On the user mailing list, questions about compaction are probably the most frequently asked.

I try to summarize some answers below. They're by no means complete.

How to check if a major_compact is done?

JMX exposes a metric for compaction time.
In HBASE-6033 (Adding some function to check if a table/region is in compaction), the following API was added to HBaseAdmin:

  public CompactionState getCompactionState(final String tableNameOrRegionName)
      throws IOException, InterruptedException

Here is a picture depicting the compaction state associated with a table.


This feature is in 0.95 and beyond.
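
A minimal sketch of how one might poll for completion with this API. To keep it runnable without an HBase cluster, the state is read through a plain Supplier; the local CompactionState enum is a stand-in whose values (NONE, MINOR, MAJOR, MAJOR_AND_MINOR) are my assumption of what HBase reports, and CompactionWaiter is a hypothetical helper, not part of HBase.

```java
import java.util.function.Supplier;

// Local stand-in for HBase's CompactionState (values are an assumption).
enum CompactionState { NONE, MINOR, MAJOR, MAJOR_AND_MINOR }

// Hypothetical helper, not part of HBase.
class CompactionWaiter {
    /**
     * Polls stateSource until no major compaction is reported,
     * or until maxPolls attempts have been made.
     * Returns true if the major compaction was seen to finish in time.
     */
    static boolean waitForMajorCompaction(Supplier<CompactionState> stateSource,
                                          int maxPolls, long sleepMillis) {
        for (int i = 0; i < maxPolls; i++) {
            CompactionState state = stateSource.get();
            if (state != CompactionState.MAJOR
                    && state != CompactionState.MAJOR_AND_MINOR) {
                return true;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }
}
```

In real use the supplier would wrap a call to getCompactionState on HBaseAdmin and translate its checked exceptions. Since the major_compact request is asynchronous, the first poll may still see NONE before the compaction has started, so a short delay before polling is probably wise.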

Should a custom script be written to compact regions one by one?

Major compactions are needed if there are many writes or deletions to your table.

Since the command for triggering a major compaction is asynchronous, a compaction storm may result if the commands are not issued to the regions with proper timing. Jean-Daniel suggested compacting a subset of the regions at a time.
One can monitor the compaction queue length on a region server using JMX.
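
The "subset at a time" idea can be sketched roughly as follows. RegionCompactor is a hypothetical interface I made up for illustration: in a real script its two methods would wrap the asynchronous major compaction request and a JMX read of the compaction queue length.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HBase.
class BatchedCompaction {
    // Hypothetical hooks; real code would wrap HBaseAdmin and JMX.
    interface RegionCompactor {
        void requestMajorCompact(String regionName); // asynchronous trigger
        int compactionQueueLength();                 // e.g. read via JMX
    }

    // Splits the region list into consecutive batches of at most batchSize.
    static List<List<String>> partition(List<String> regions, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < regions.size(); i += batchSize) {
            int end = Math.min(i + batchSize, regions.size());
            batches.add(new ArrayList<>(regions.subList(i, end)));
        }
        return batches;
    }

    // Triggers one batch, then waits for the queue to drain before the next.
    static void compactInBatches(List<String> regions, int batchSize,
                                 RegionCompactor compactor, long pollMillis) {
        for (List<String> batch : partition(regions, batchSize)) {
            for (String region : batch) {
                compactor.requestMajorCompact(region);
            }
            while (compactor.compactionQueueLength() > 0) {
                try {
                    Thread.sleep(pollMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return; // stop issuing further compactions
                }
            }
        }
    }
}
```

An empty queue does not by itself prove that running compactions have finished, so treat the wait condition here as a rough throttle rather than a completion check.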

Are there new algorithms being developed to improve major compaction?

One of the initiatives is stripe compaction. See the parent JIRA: HBASE-7667

Instead of creating a table with a large number of small regions, the proposal combines LevelDB ideas with the many-region initiative. Basically, the key space of one large region is partitioned into multiple sub-ranges that are non-overlapping and contiguous.
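
To make the partitioning concrete, here is a small sketch of mapping a row key to its sub-range (stripe). It only illustrates the idea of contiguous, non-overlapping ranges and is not the HBASE-7667 implementation. StripeLookup is a hypothetical name, and I use String keys for readability, whereas HBase row keys are byte arrays.

```java
import java.util.Arrays;

// Hypothetical illustration, not the HBASE-7667 code.
class StripeLookup {
    /**
     * boundaries holds the sorted start keys of stripes 1..n; stripe 0
     * begins at the region's start key. Each stripe covers
     * [its start key, next stripe's start key), so the ranges are
     * contiguous and never overlap.
     * Returns the index of the stripe that contains rowKey.
     */
    static int stripeFor(String[] boundaries, String rowKey) {
        int pos = Arrays.binarySearch(boundaries, rowKey);
        // Exact hit: rowKey is the first key of stripe pos + 1.
        // Miss: binarySearch returns -(insertionPoint) - 1, and the
        // insertion point equals the number of boundaries below rowKey.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }
}
```

With boundaries {"g", "n", "t"} the region is split into four stripes, and every key falls into exactly one of them.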

Here is the design doc:

Another improvement is in HBASE-7842, prior to which bulk loaded files were not handled correctly by the compaction selection algorithm. Compacted files kept getting bigger and yet were still picked up by compaction, which led to longer and longer compaction times.
When all the files are chosen for compaction, a minor compaction is promoted to a major compaction.
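
The promotion rule can be expressed as a one-line check. This is only a sketch of the rule as stated above, with store files represented by hypothetical file names; it does not reproduce the actual selection algorithm.

```java
import java.util.List;

// Hypothetical illustration of the promotion rule only.
class CompactionPromotion {
    // A compaction becomes major when the selection covers every store file.
    static boolean isPromotedToMajor(List<String> storeFiles, List<String> selected) {
        return !storeFiles.isEmpty()
                && selected.size() == storeFiles.size()
                && selected.containsAll(storeFiles);
    }
}
```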

What are the config parameters that I should watch out for?

hbase.hstore.compactionThreshold (Note: in 0.95 and beyond, this becomes hbase.hstore.compaction.min)
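
Because the parameter was renamed, a tool that reads the setting across versions may want to check both names. Below is a sketch using a plain Map in place of the HBase Configuration object; CompactionConfig is a hypothetical helper, and the fallback default of 3 is my assumption, so please check the book for your version.

```java
import java.util.Map;

// Hypothetical helper; a Map stands in for the HBase Configuration.
class CompactionConfig {
    static final String NEW_KEY = "hbase.hstore.compaction.min";
    static final String OLD_KEY = "hbase.hstore.compactionThreshold";
    static final int ASSUMED_DEFAULT = 3; // assumption, verify for your version

    // Prefers the new name, falls back to the old one, then to the default.
    static int minFilesToCompact(Map<String, String> conf) {
        String value = conf.containsKey(NEW_KEY) ? conf.get(NEW_KEY) : conf.get(OLD_KEY);
        return value == null ? ASSUMED_DEFAULT : Integer.parseInt(value.trim());
    }
}
```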

Compaction is closely related to flushing (from memstore):


You can find explanations of the above parameters in the HBase book: http://hbase.apache.org/book.html