Ted Yu's blog

Tuesday, March 26, 2013

Compactions Q&A

On user mailing list, questions about compaction are probably the most frequently asked.

I try to summarize some answers below. They're by no means complete.

How to check if a major_compact is done?
(http://search-hadoop.com/m/heoc617XV29/otis+compactions&subj=Re+How+to+check+if+a+major_compact+is+done+)

JMX exposes metric about compaction time.
In HBASE-6033, Adding some fuction to check if a table/region is in compaction, the following API was added to HBaseAdmin:

public CompactionState getCompactionState(final String tableNameOrRegionName)
throws IOException, InterruptedException {

Here is picture depicting compaction associated with a table.

https://issues.apache.org/jira/secure/attachment/12528264/table_ui.png

This feature is in 0.95 and beyond.

Should custom script be written to compact regions one by one ?

Major compactions are needed if there're many writes / deletions to your table.

Since command for triggering major compaction is asynchronous, compaction storm may result if the commands are not properly issued to the regions (w.r.t. timing). Jean-Daniel suggested compacting subset of the regions at a time.
One can monitor compaction queue length on region server using JMX.

Are there new algorithms being developed to improve major compaction ?

Yes.
One of the initiatives is the stripe compaction. See parent JIRA: HBASE-7667

Instead of creating table with large number of small regions, the proposal combines LevelDB ideas with many-region initiative. Basically the key space of one large region is partitioned into multiple sub-ranges which are non-overlapping and contiguous.

Here is the design doc:
https://issues.apache.org/jira/secure/attachment/12575449/Stripe%20compactions.pdf

Another improvement is in HBASE-7842 prior to which bulk loaded files were not handled correctly by the compaction selection algorithm. Compacted files are getting bigger and yet still picked up by compaction. This leads to longer and longer compaction time.
When all the files are chosen for compaction, minor compaction is promoted to a major compaction.

What are the config parameters that I should watch out ?

hbase.hstore.compactionThreshold (Note: in 0.95 and beyond, this becomes hbase.hstore.compaction.min)
hbase.hstore.compaction.max
hbase.hregion.majorcompaction
hbase.hstore.blockingStoreFiles

Compaction is closely related to flushing (from memstore):

hbase.regionserver.global.memstore.lowerlimit
hbase.regionserver.global.memstore.upperlimit

You can find explanation for the above parameters in http://hbase.apache.org/book.html

Sunday, September 25, 2011

Streamlining patch submission

I have spent considerable time fixing up HBase builds on Apache Jenkins. I want to share a few points for HBase contributors so that it is easier to maintain stable builds.

For contributors, I understand that it takes so much time to run whole test suite that he/she may not have the luxury of doing this - Apache Jenkins wouldn't do it when you press Submit Patch button.
If this is the case, please use Eclipse (or other tool) to identify tests that exercise the classes/methods in your patch and run them. Also clearly state what tests you ran in the JIRA.

If you have a Linux box where you can run whole test suite, it would be nice to utilize such resource and run whole suite. Then please state this fact on the JIRA as well.
Considering Todd's suggestion of holding off commit for 24 hours after code review, 2 hour test run isn't that long.

Sometimes you may see the following (from 0.92 build 18):

Tests run: 1004, Failures: 0, Errors: 0, Skipped: 21
[INFO] ----------------------------------
[INFO] BUILD FAILURE
[INFO] ----------------------------------
[INFO] Total time: 1:51:41.797s

You should examine the test summary above these lines and find out which
 test(s) hung. For this case it was TestMasterFailover:

Running org.apache.hadoop.hbase.master.TestMasterFailoverRunning org.apache.hadoop.hbase.master.TestMasterRestartAfterDisablingTable
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 32.265 sec
 
Write deterministic test cases
Sometimes the new tests you added passed for you but fail when committer runs them.
Here is analysis for one such case - the original version of TestRegionServerCoprocessorExceptionWithAbort.

The following call would create 25 regions for the test table:

TEST_UTIL.createMultiRegions(table, TEST_FAMILY);

If we look at HBaseTestingUtility:

public static final byte[][] KEYS = {
    HConstants.EMPTY_BYTE_ARRAY, Bytes.toBytes("bbb"),

We can see that the row in the test

final byte[] ROW = Bytes.toBytes("bbb");

actually doesn't belong to the first region.
The following call was used to track the underlying region server:
HRegionServer regionServer =
TEST_UTIL.getRSForFirstRegionInTable(TEST_TABLE);

This means regionServer which we were waiting for may not host the region where NPE happened.
I modified the above line slightly and the test passed reliably:

final byte[] ROW = Bytes.toBytes("aaa");

The takeaway is that scoping the test scenario properly and deterministically would simply patch submission.

Friday, April 22, 2011

Managing connections in HBase 0.90 and beyond

Users of HBase have complained about high number of connections to zookeeper after upgrading to 0.90. Jean-Daniel, responding to user comments, did some initial work in HBASE-3734.

In the following discussion, the term connection refers to the connection between HBase client and HBase, managed by HConnectionManager.

In the early days of 0.90 release, some decisions were made in HBASE-2925 where a new connection would be established given a Configuration instance without looking at the connection-specific properties in it.

I made a comment in HBASE-3734 at 05/Apr/11 05:20 with the following two ideas:

We should reuse connection based on connection-specific properties, such as "hbase.zookeeper.quorum"

In order for HConnectionManager.deleteConnection() to work, reference counting has to be used.

I want to thank Karthick who is brave to bite the bullet and try to nail this issue through HBASE-3777.
He and I worked together for over a week to come up with the solution - patch version 6.

We discovered a missing connection property, HConstants.ZOOKEEPER_ZNODE_PARENT, which caused TestHBaseTestingUtility to fail.

We refined the implementation several times based on the outcome of test results.

Here is summary of what we did:

Reference counting based connection sharing is implemented
There were 33 references to HConnectionManager.getConnection(), we make sure that all of those references are properly deleted (released)
Some modification in unit tests is made to illustrate the recommended approach - see TestTableMapReduce.java

For connection sharing, Karthick introduced the following:
static class HConnectionKey {
    public static String[] CONNECTION_PROPERTIES = new String[] {
        HConstants.ZOOKEEPER_QUORUM,
        HConstants.ZOOKEEPER_ZNODE_PARENT,
        HConstants.ZOOKEEPER_CLIENT_PORT,
        HConstants.ZOOKEEPER_RECOVERABLE_WAITTIME,
        HConstants.HBASE_CLIENT_PAUSE,
        HConstants.HBASE_CLIENT_RETRIES_NUMBER,
        HConstants.HBASE_CLIENT_RPC_MAXATTEMPTS,
        HConstants.HBASE_RPC_TIMEOUT_KEY,
        HConstants.HBASE_CLIENT_PREFETCH_LIMIT,
        HConstants.HBASE_META_SCANNER_CACHING,
        HConstants.HBASE_CLIENT_INSTANCE_ID };

In a Configuration, if any of the above connection properties is unique, we would create a new connection. Otherwise an existing connection whose underlying connection properties carry the same values would be returned.

Initially attempt was made to use Java finalizer to clean up unused connections. It turned out object finalization is tricky - client may get a closed connection if multiple HTable instances are involved and some of them may go out of scope, leading to finalizer execution.

So for HTable, we expect user to explicitly call close() method once the HTable instance would no longer be used. Take a look at the modified src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableMapReduce.java

Finally, we changed how TableOutputFormat closes connection. Previously HConnectionManager.deleteAllConnections(true) is called in TableRecordWriter() because it was an easy way to deal with connection leak. Now, calling table.close() is enough.

In order to make 0.90.3 release stable, HBASE-3777 wouldn't be integrated into 0.90.3

Epilog:
When Ramkrishna worked on HBASE-4052, he discovered a problem in TRUNK which was not in 0.90 branch. Namely, after Master failover, HBaseAdmin.getConnection() would still get the shared connection which points to the previous active master. Using this stale connection results in an IOException wrapped in UndeclaredThrowableException.

I provided fix for this problem through HBASE-4087 where HBaseAdmin constructor would detect such issue and ask HConnectionManager to remove the stale connection from its cache.

Here is the code snippet:
    for (; tries < numRetries; ++tries) {
      try {
        this.connection.getMaster();
        break;
      } catch (UndeclaredThrowableException ute) {
        HConnectionManager.deleteStaleConnection(this.connection);
        this.connection = HConnectionManager.getConnection(this.conf);
      }

Thursday, April 14, 2011

Load Balancer in HBase 0.90

Working with Stanislav Barton on Load Balancer in HBase 0.90, he asked for document on how load balancer works. This writeup would touch the internals of load balancer and how it evolved over time.

Master code (including load balancer) has been rewritten for HBase 0.90

When a region receives many writes and is split, the daughter regions are placed on the same region server as the parent region. Stan proposed to change this behavior and I summarized in HBASE-3779.

HBASE-3586 tried to solve the problem where load balancer moves inactive regions off an overloaded region server by randomly selecting regions to offload. This is to handle the potential problem of moving too many hot regions onto a region server which recently joined the cluster.

But this random selection isn't optimal. For Stan's cluster, there're around 600 regions on each region server. When 30 new regions were created on the same region server, random selector only chose 3 out of the 30 new regions for reassignment. The other region selection was from inactive (old) regions. This is expected behavior because new and old regions were selected equally probably.

Basically we traded some optimization for safety of not overloading a newly discovered region server.

So I continued enhancement using HBASE-3609 where one of the goals is to remove randomness from LoadBalancer so that we can deterministically produce near-optimal balancing actions.

If at least one region server joined the cluster just before the current balancing action, both new and old regions from overloaded region servers would be moved onto underloaded region servers. Otherwise, I find the new regions and put them on different underloaded servers. Previously one underloaded server would be filled up before the next underloaded server is considered.

I also utilize the randomizer which shuffles the list of underloaded region servers.
This way we can avoid distributing offloaded regions to few region servers.

HBASE-3609 has been integrated into trunk as of Apr 18th, 2011.

HBASE-3704 would help users observe the distribution of regions. It is currently only in HBase trunk code.

Also related is HBASE-3373. Stan proposed to make it more general. A new policy for load balancer can be added to have balanced number of regions per RS per table and not in total number of regions from all tables.

If you're interested in more detail, please take a look at the javadoc for LoadBalancer.balanceCluster()

For HBase trunk, I implemented HBASE-3681 upon Jean-Daniel Cryans's request. For 0.90.2 and later, the default value of sloppiness is 0.

I am planning for the next generation of load balancer where request histogram would play an important role in deciding which regions to move. Please take a look at HBASE-3679

HBaseWD project introduced multiple scanners for bucketed writes. I plan to accommodate this new feature through HBASE-3811 where additional attributes in Scan object would allow balancer to group the scanners generated by HBaseWD.

HBASE-3943 is supposed to solve the problem where region reassignment disrupts (potentially long) compaction.

HBASE-3945 tries to give regions more stability by not reassigning region(s) in consecutive balancing actions.

Thursday, March 31, 2011

Genericizing EndpointCoprocessor

Our new release will use coprocessor framework of HBase
HBASE-1512 provided reference implementation for aggregation.

Originally value for every column is interpreted as Long.
I made the implementation more generic through introduction of ColumnInterpreter which understands the schema of the underlying table (by examining column family:column qualifier, e.g.).
Here're 3 guidelines I followed during development:

User shouldn't modify HbaseObjectWritable directly for the interpreter class which is to be executed on region server. This is achieved by making ColumnInterpreter extend Serializable
We (plan to) store objects of MeasureWritable, a relatively complex class, in HBase. Using interpreter would give us flexibility in computing aggregates.
We load AggregateProtocolImpl.class into CoprocessorHost. Interpreter feeds various values (such as Long.MIN_VALUE) of concrete type (Long) into AggregateProtocolImpl. This simplifies class loading for CoprocessorHost

During code review, we tried to distinguish the return value for the case where there is no result from a particular region by using null.

However, we got the following due to type erasure:

2011-04-26 17:55:48,229 INFO [IPC Server handler 3 on 64132] coprocessor.

AggregateImplementation(66): Maximum from this region is TestTable,,1303840188042.18ec4a1af1b0931be64fc084d2eb9309.: null
2011-04-26 17:55:48,229 ERROR [IPC Server handler 3 on 64132] io.HbaseObjectWritable(336): Unsupported type class java.lang.Object
2011-04-26 17:55:48,229 ERROR [IPC Server handler 3 on 64132] io.HbaseObjectWritable(339): writeClassCode
2011-04-26 17:55:48,229 ERROR [IPC Server handler 3 on 64132] io.HbaseObjectWritable(339): write
2011-04-26 17:55:48,229 ERROR [IPC Server handler 3 on 64132] io.HbaseObjectWritable(339): writeObject
2011-04-26 17:55:48,229 ERROR [IPC Server handler 3 on 64132] io.HbaseObjectWritable(339): write
2011-04-26 17:55:48,229 ERROR [IPC Server handler 3 on 64132] io.HbaseObjectWritable(339): writeObject
2011-04-26 17:55:48,229 ERROR [IPC Server handler 3 on 64132] io.HbaseObjectWritable(339): write
2011-04-26 17:55:48,229 ERROR [IPC Server handler 3 on 64132] io.HbaseObjectWritable(339): run2011-04-26 17:55:48,229 WARN [IPC Server handler 3 on 64132] ipc.HBaseServer$Handler(1122): IPC Server handler 3 on 64132 caught: java.lang.UnsupportedOperationException: No code for unexpected class java.lang.Object
        at org.apache.hadoop.hbase.io.HbaseObjectWritable.writeClassCode(HbaseObjectWritable.java:343)
        at org.apache.hadoop.hbase.io.HbaseObjectWritable$NullInstance.write(HbaseObjectWritable.java:311)
        at org.apache.hadoop.hbase.io.HbaseObjectWritable.writeObject(HbaseObjectWritable.java:449)
        at org.apache.hadoop.hbase.client.coprocessor.ExecResult.write(ExecResult.java:74)
        at org.apache.hadoop.hbase.io.HbaseObjectWritable.writeObject(HbaseObjectWritable.java:449)
        at org.apache.hadoop.hbase.io.HbaseObjectWritable.write(HbaseObjectWritable.java:284)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1092)

The solution is to apply Writable.class for null value.

We didn't consider race condition in callbacks on the client side. See HBASE-3862