Intro
At LivingSocial, one of the key components we use in the Hadoop ecosystem is Hive. In my time here I've seen us migrate from 0.7 up to (currently) 0.13. One of the problems I've encountered over the years has been HiveServer (1 or 2) or the Hive Metastore "locking up" - i.e., calls to the service just hang. Usually when this happens, someone from our warehouse team will log into the server and manually restart the init.d service (we are not using Ambari or Cloudera Manager). Depending on response times, this can cause issues with long-running ETL jobs overnight.
This post covers a new method I've recently discovered for emulating Hive service lockups. It will probably be old hat for many Java devs, but it was new to me.
Background
Over the years we've tried various monitoring scripts to check whether Hive has stopped responding. Some of the methods we've used include:
- Checking for excessive CPU usage (usually Hive pegging one or more cores at 100%)
- A real-time scan of the log looking for errors, restarting if a particular error was encountered more than 20 times in a 2-minute period
- A "simple query" (`select * from table limit 5`) executed every 30 minutes
- An every-{{ unit of time }} restart of the underlying service (usually once a day, but sometimes more frequently)
These all work to varying degrees, but we still encounter the occasional lockup that slips through the checks. It would be great to detect these lockups as soon as they occur and restart the service immediately.
What I found
Basically, I wanted a way to lock up Hive in a controlled environment. Googling "how to lock up a jvm" was…interesting, and not very fruitful. Eventually, a coworker suggested - "why not just use Thread.sleep()?" - which made a lot of sense to me.
But - I needed a way of injecting Thread.sleep() into the running hive-metastore process. So - I looked into jdb, the Java debugger. At first, I tried attaching jdb to the running process, but I quickly found out that doing so results in a read-only jdb connection.
So - I decided to try starting up the Hive Metastore under jdb directly.
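One way to work out the exact command to run is to grab the full java command line from a box where the metastore is already running and swap java for jdb (a sketch - the classpath and options on your install will differ):

```
# Find the running metastore's full command line
$ ps -ef | grep org.apache.hadoop.hive.metastore.HiveMetaStore

# Re-run the same thing with jdb in place of java, e.g.:
$ jdb -classpath "<classpath from the ps output>" \
      org.apache.hadoop.hive.metastore.HiveMetaStore
```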
I ran the following in jdb to "lock up" the metastore.
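Roughly, the session comes down to letting the metastore start, listing the threads once a client (HiveServer2 in my test) is connected, and suspending the thrift worker thread that's handling it - or just running suspend with no arguments to freeze every thread. The thread names and IDs below are illustrative:

```
> run
run org.apache.hadoop.hive.metastore.HiveMetaStore
... (metastore starts up and listens on port 9083) ...
> threads
Group main:
  (java.lang.Thread)0x8a1 pool-3-thread-1 running
  ...
> suspend 0x8a1
```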
By suspending the thread, I could now see how other apps would respond. I proceeded to issue a “desc table” command via beeline. It hung! So - now I’ve got something which appears to emulate a “metastore lockup”.
So - what can I do with this info?
How can I tell if the metastore has locked up?
I've played around with rbhive and knew that "thrift_socket" was the lowest point in its stack for HS2, so why not start there? Instead of looking at thrift_socket, though, I figured - let's just try a simple network socket. My first thought was - let's just say "hi" over a socket connection to the running metastore instance (i.e., before suspending anything).
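A minimal version of that check, assuming the metastore host and port from my test setup (192.168.50.2:9083):

```ruby
require 'socket'

socket = TCPSocket.new('192.168.50.2', 9083)
socket.write('hello')       # the metastore won't understand this...
puts socket.read.inspect    # => "" against a healthy metastore (empty string, not nil)
socket.close
```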
Hmmm - I've got an empty string back. Not nil. Interesting. What happens when I try this with the thread suspended?
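The same check, with the read wrapped in a timeout (Ruby's Timeout module and the 10-second value below are just one way to do it):

```ruby
require 'socket'
require 'timeout'

socket = TCPSocket.new('192.168.50.2', 9083)
socket.write('hello')

begin
  Timeout.timeout(10) { puts socket.read.inspect }
rescue Timeout::Error
  puts 'read timed out - the metastore looks locked up'
ensure
  socket.close
end
```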
Great! Now we've got a socket that times out when I try to read back from it! I also tried shutting down the metastore and connecting to the port - that ended with `Errno::ECONNREFUSED: Connection refused - connect(2) for 192.168.50.2:9083`.
So - now we’ve got some relatively simple logic to determine whether the metastore has locked up!
The rest of the way
Now that I've got my logic, I wrote a simple Ruby script that daemonizes the check above and is controlled via a SysV init script (our servers run CentOS).
My script runs the check every 30 seconds and, on timeout, attempts a restart - first by shutting down via the `service` command, then via `kill -15`.
One issue I found right after the initial deploy was that the monitor was continuously restarting the metastore (oops…). It turns out I needed to call close_write on the socket after writing "hello" - presumably, without the half-close, the read was blocking even against a healthy metastore, so every check looked like a lockup. After adding that to the script, the monitor has been running successfully (for the last 2+ days so far).
After these changes, my code is pretty much this.
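(A sketch of it, anyway - the host, port, timeout value, and service/process names below are illustrative rather than our exact values, and the restart escalation is simplified.)

```ruby
#!/usr/bin/env ruby
require 'socket'
require 'timeout'

METASTORE_HOST = '192.168.50.2'  # illustrative - ours comes from config
METASTORE_PORT = 9083
CHECK_INTERVAL = 30              # seconds between checks
READ_TIMEOUT   = 10              # illustrative value

# Returns true if the metastore answers our "hello"; false on a read
# timeout (locked up) or a refused connection (not running at all).
def metastore_alive?
  socket = TCPSocket.new(METASTORE_HOST, METASTORE_PORT)
  socket.write('hello')
  socket.close_write             # the half-close that stopped the false restarts
  Timeout.timeout(READ_TIMEOUT) { socket.read }
  true
rescue Timeout::Error, Errno::ECONNREFUSED
  false
ensure
  socket.close if socket && !socket.closed?
end

# Try a clean restart via the init script first, then fall back to SIGTERM.
def restart_metastore!
  return if system('service hive-metastore restart')
  system('pkill -15 -f org.apache.hadoop.hive.metastore.HiveMetaStore')
  system('service hive-metastore start')
end

loop do
  restart_metastore! unless metastore_alive?
  sleep CHECK_INTERVAL
end
```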
Hopefully this will help us avoid additional downtime with hive-metastore.