Friday, April 22, 2011

Managing connections in HBase 0.90 and beyond

Users of HBase have complained about a high number of connections to ZooKeeper after upgrading to 0.90.  Jean-Daniel, responding to user comments, did some initial work in HBASE-3734.

In the following discussion, the term connection refers to the connection between the HBase client and the HBase cluster, managed by HConnectionManager.

In the early days of the 0.90 release, a decision was made in HBASE-2925 to establish a new connection for each Configuration instance without examining the connection-specific properties it carried.

I made a comment in HBASE-3734 at 05/Apr/11 05:20 with the following two ideas:
  • We should reuse connections based on connection-specific properties, such as "hbase.zookeeper.quorum"
  • In order for HConnectionManager.deleteConnection() to work, reference counting has to be used.
I want to thank Karthick, who was brave enough to bite the bullet and nail this issue through HBASE-3777. He and I worked together for over a week to come up with the solution - patch version 6.

We discovered a missing connection property, HConstants.ZOOKEEPER_ZNODE_PARENT, which caused TestHBaseTestingUtility to fail.

We refined the implementation several times based on the outcome of test results.

Here is a summary of what we did:
  • Reference counting based connection sharing is implemented
  • There were 33 references to HConnectionManager.getConnection(); we made sure that each of those references is properly deleted (released)
  • Some unit tests were modified to illustrate the recommended approach - see TestTableMapReduce.java
For connection sharing, Karthick introduced the following:
  static class HConnectionKey {
    public static String[] CONNECTION_PROPERTIES = new String[] {
        HConstants.ZOOKEEPER_QUORUM,
        HConstants.ZOOKEEPER_ZNODE_PARENT,
        HConstants.ZOOKEEPER_CLIENT_PORT,
        HConstants.ZOOKEEPER_RECOVERABLE_WAITTIME,
        HConstants.HBASE_CLIENT_PAUSE,
        HConstants.HBASE_CLIENT_RETRIES_NUMBER,
        HConstants.HBASE_CLIENT_RPC_MAXATTEMPTS,
        HConstants.HBASE_RPC_TIMEOUT_KEY,
        HConstants.HBASE_CLIENT_PREFETCH_LIMIT,
        HConstants.HBASE_META_SCANNER_CACHING,
        HConstants.HBASE_CLIENT_INSTANCE_ID };

    // equals() and hashCode() are computed from the values these properties
    // have in the Configuration, so Configurations carrying the same values
    // map to the same connection.
  }

In a Configuration, if any of the above connection properties differs from those of the existing connections, we create a new connection. Otherwise an existing connection whose underlying connection properties carry the same values is returned.
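To illustrate, here is a minimal sketch of what the sharing looks like from client code once the patch is applied. It assumes the 0.90-era HConnectionManager API (getConnection()/deleteConnection()) and a running cluster; the alternate znode parent value is made up:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HConnection;
  import org.apache.hadoop.hbase.client.HConnectionManager;

  public class ConnectionSharingSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf1 = HBaseConfiguration.create();
      Configuration conf2 = HBaseConfiguration.create();
      // conf1 and conf2 carry identical connection properties, so the same
      // underlying connection is shared (its reference count becomes 2).
      HConnection c1 = HConnectionManager.getConnection(conf1);
      HConnection c2 = HConnectionManager.getConnection(conf2);
      System.out.println("shared: " + (c1 == c2));   // expected: true

      // A configuration that differs in one of the connection properties
      // (here the znode parent, a made-up value) gets its own connection.
      Configuration conf3 = HBaseConfiguration.create();
      conf3.set("zookeeper.znode.parent", "/hbase-secondary");
      HConnection c3 = HConnectionManager.getConnection(conf3);
      System.out.println("shared: " + (c1 == c3));   // expected: false

      // Each getConnection() call is balanced by a deleteConnection() call
      // so that the reference count can eventually drop to zero.
      HConnectionManager.deleteConnection(conf1, false);
      HConnectionManager.deleteConnection(conf2, false);
      HConnectionManager.deleteConnection(conf3, false);
    }
  }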

Initially, an attempt was made to use a Java finalizer to clean up unused connections. It turned out that object finalization is tricky: a client may get a closed connection when multiple HTable instances share a connection and some of them go out of scope, triggering finalizer execution.

So for HTable, we expect the user to explicitly call the close() method once the HTable instance is no longer needed. Take a look at the modified src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableMapReduce.java
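The recommended pattern looks roughly like the following sketch (this is not code from the test; the table and row names are made up, and close() releasing the connection reference assumes the HBASE-3777 patch):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class HTableCloseSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "mytable");
      try {
        Result r = table.get(new Get(Bytes.toBytes("row1")));
        System.out.println("row exists: " + !r.isEmpty());
      } finally {
        // Releases this table's reference; the shared connection is torn
        // down once the last reference to it is released.
        table.close();
      }
    }
  }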

Finally, we changed how TableOutputFormat closes the connection. Previously, HConnectionManager.deleteAllConnections(true) was called in TableRecordWriter because it was an easy way to deal with connection leaks. Now, calling table.close() is enough.

In order to keep the 0.90.3 release stable, HBASE-3777 won't be integrated into 0.90.3.

Epilog:
When Ramkrishna worked on HBASE-4052, he discovered a problem in TRUNK which was not present in the 0.90 branch: after a Master failover, HBaseAdmin.getConnection() would still return the shared connection pointing to the previous active master. Using this stale connection results in an IOException wrapped in an UndeclaredThrowableException.

I provided a fix for this problem through HBASE-4087, where the HBaseAdmin constructor detects such a stale connection and asks HConnectionManager to remove it from its cache.

Here is the code snippet:
    for (; tries < numRetries; ++tries) {
      try {
        // Probe the cached connection; this succeeds only if the master
        // it points to is still the active one.
        this.connection.getMaster();
        break;
      } catch (UndeclaredThrowableException ute) {
        // The cached connection is stale: evict it and obtain a fresh one.
        HConnectionManager.deleteStaleConnection(this.connection);
        this.connection = HConnectionManager.getConnection(this.conf);
      }
    }

Thursday, April 14, 2011

Load Balancer in HBase 0.90

While working with Stanislav Barton on the load balancer in HBase 0.90, I was asked for a document on how the load balancer works. This writeup touches on the internals of the load balancer and how it evolved over time.

Master code (including the load balancer) has been rewritten for HBase 0.90.

When a region receives many writes and is split, the daughter regions are placed on the same region server as the parent region. Stan proposed changing this behavior, and I summarized the proposal in HBASE-3779.

HBASE-3586 tried to solve the problem of moving inactive regions off an overloaded region server by having the load balancer randomly select the regions to offload. This guards against the potential problem of moving too many hot regions onto a region server which recently joined the cluster.

But this random selection isn't optimal. On Stan's cluster there are around 600 regions on each region server. When 30 new regions were created on the same region server, the random selector chose only 3 of the 30 new regions for reassignment; the rest of the selection came from inactive (old) regions. This is expected behavior because new and old regions were selected with equal probability.

Basically we traded some optimality for the safety of not overloading a newly discovered region server.

So I continued the enhancement in HBASE-3609, where one of the goals is to remove randomness from the LoadBalancer so that it deterministically produces near-optimal balancing actions.

If at least one region server joined the cluster just before the current balancing action, both new and old regions from overloaded region servers are moved onto underloaded region servers. Otherwise, the new regions are found and placed on different underloaded servers. Previously, one underloaded server would be filled up before the next underloaded server was considered.

I also utilize the randomizer, which shuffles the list of underloaded region servers. This way we avoid distributing the offloaded regions to only a few region servers.
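Here is a simplified, made-up sketch of this distribution strategy. It is not the actual LoadBalancer code; the class and method names are fictitious, and it only illustrates shuffling the underloaded servers and assigning the offloaded regions round-robin:

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  // Made-up sketch: spread the regions selected for offloading across the
  // underloaded servers round-robin, after shuffling the server list,
  // instead of filling one server completely before moving to the next.
  public class RoundRobinOffloadSketch {
    static Map<String, List<String>> assignRoundRobin(
        List<String> regionsToOffload, List<String> underloadedServers) {
      Map<String, List<String>> plan = new HashMap<String, List<String>>();
      // Shuffle so that consecutive balancing actions do not always favor
      // the same few servers.
      Collections.shuffle(underloadedServers);
      int i = 0;
      for (String region : regionsToOffload) {
        String server = underloadedServers.get(i % underloadedServers.size());
        List<String> assigned = plan.get(server);
        if (assigned == null) {
          assigned = new ArrayList<String>();
          plan.put(server, assigned);
        }
        assigned.add(region);
        i++;
      }
      return plan;
    }
  }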

HBASE-3609 has been integrated into trunk as of Apr 18th, 2011.

HBASE-3704 helps users observe the distribution of regions. It is currently only in the HBase trunk code.

Also related is HBASE-3373. Stan proposed making it more general: a new policy for the load balancer could balance the number of regions per region server per table, rather than the total number of regions across all tables.
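To see the difference, consider a made-up example: the total region count per server can be balanced while each table is still concentrated on a single server, which is the case the proposed per-table policy would correct. The server and table names below are fictitious:

  import java.util.Arrays;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  public class PerTableBalanceSketch {
    public static void main(String[] args) {
      // server -> table name of each region hosted on that server
      Map<String, List<String>> regionsByServer = new HashMap<String, List<String>>();
      regionsByServer.put("rs1", Arrays.asList("t1", "t1", "t1", "t1"));
      regionsByServer.put("rs2", Arrays.asList("t2", "t2", "t2", "t2"));

      // Both servers host 4 regions, so the cluster is balanced by total
      // count; per table, each table sits entirely on one server, which is
      // what balancing per region server per table would correct.
      for (Map.Entry<String, List<String>> e : regionsByServer.entrySet()) {
        Map<String, Integer> perTable = new HashMap<String, Integer>();
        for (String table : e.getValue()) {
          Integer count = perTable.get(table);
          perTable.put(table, count == null ? 1 : count + 1);
        }
        System.out.println(e.getKey() + ": total=" + e.getValue().size()
            + ", perTable=" + perTable);
      }
    }
  }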

If you're interested in more detail, please take a look at the javadoc for LoadBalancer.balanceCluster().

For HBase trunk, I implemented HBASE-3681 upon Jean-Daniel Cryans's request. For 0.90.2 and later, the default value of sloppiness is 0.

I am planning the next generation of the load balancer, where a request histogram would play an important role in deciding which regions to move. Please take a look at HBASE-3679.

The HBaseWD project introduced multiple scanners for bucketed writes. I plan to accommodate this new feature through HBASE-3811, where additional attributes in the Scan object would allow the balancer to group the scanners generated by HBaseWD.

HBASE-3943 is supposed to solve the problem where region reassignment disrupts a (potentially long) compaction.

HBASE-3945 tries to give regions more stability by not reassigning the same region in consecutive balancing actions.