Josh Stanfield

Unintended Consequences

Fix Adium in Yosemite 10.10.4

I recently (finally) upgraded to Yosemite on my work laptop. One of the programs I use frequently - Adium - would no longer start; it froze immediately on launch.

I found the issue - Bonjour multicast advertisement (enabled by default on Yosemite) seems to interfere with Adium. However, the fixes listed online all applied to pre-10.10.4 Yosemite (i.e. when Yosemite was still using discoveryd).

I fixed it by editing the mDNSResponder plist file:

cp /System/Library/LaunchDaemons/com.apple.mDNSResponder.plist ~
sudo plutil -convert json /System/Library/LaunchDaemons/com.apple.mDNSResponder.plist
sudo vim /System/Library/LaunchDaemons/com.apple.mDNSResponder.plist

When you edit the plist file, add “-NoMulticastAdvertisements” to the ProgramArguments array, so it looks like "ProgramArguments":["\/usr\/sbin\/mDNSResponder","-NoMulticastAdvertisements"]

Then - convert the plist back to binary and restart mDNSResponder:

sudo plutil -convert binary1 /System/Library/LaunchDaemons/com.apple.mDNSResponder.plist
sudo killall -HUP mDNSResponder

Restart Adium (you may need to force quit or kill -9 it), and all should be good.

Pork Tenderloin Experience

I’ve been watching Good Eats a bit of late

There’s a handful of episodes on Netflix now. So - I tried one of Alton Brown’s pork tenderloin recipes yesterday. But - I wasn’t quite sure what to serve as a side. In the past, I’ve been partial to pineapple (love some grilled pineapple!). However, I also had some leftover rice in the rice cooker, still hot from the previous day.

So - I tried something a little different. I added the rice to a mixing bowl, then added the marinade reserved from the post-grill marinating of the pork. I then took about ½ of the grilled pineapple, cut it into chunks, and added the chunks to the rice mixture.

I personally really enjoyed the meal. The pineapple adds a sweetness that offsets the strong sour flavor from the lime. There was also enough salt in the leftover marinade for me; this is something that would be easy to overlook though, and I wouldn’t be afraid to add additional salt if I felt it necessary. The rice was a bit strong by itself (it’s quite sour - the pineapple really is necessary; err on the side of adding more pineapple than you think you need).

For dinner, I served the rice/pineapple/marinade mix separately; for leftovers, however, I chopped up some leftover pork and added it to the rice mix (it’s a nice addition).

Pineapple/Pork marinade rice side recipe (makes about 4 servings)

Ingredients
  • leftover marinade from pork tenderloin (the half used to marinate after grilling the tenderloin)
  • about 3 cups cooked rice (probably could have used a little more, as the marinade is quite strong)
  • about ½ of a grilled pineapple, cut into bite size chunks
Directions
  • Mix all of the above together in a mixing bowl right before serving

Notes

  • Do not use the marinade used with the raw pork
  • I served with a slice of pork tenderloin, and the rest of the pineapple as a second side
  • I might experiment with using all of the pineapple in the rice bowl in the future
  • Trying 4-6 cups of cooked rice, the whole pineapple, and the marinade might be a good idea
  • Reheats decently in the microwave; not the best I’ve ever had, but certainly not the worst (in terms of “how good reheated is” vs “how good original was”)

Random Burrito Filling Recipe

I love burritos

And I always have. They’re quick and delicious, though maybe not the healthiest thing in the world. Years ago I came across a recipe which has become a staple of my diet for about a decade now. Combine this with a rice cooker and you have a very hands-off way to cook up a big meal.

This recipe makes about 2.5 pounds of shredded chicken, which is great for adding to burritos, tacos and other items. I believe I got this recipe from the old Ars Technica Bachelor Chow pdf that was sent around their forums a long time ago, though I could be mistaken there (I’ve since lost the pdf).

Chicken filling Recipe

Ingredients
  • one 2½ pound bag of frozen chicken breasts (you can get these at most grocery stores)
    • defrost the chicken breasts overnight for best results
  • one 24 oz jar of salsa (I prefer a medium strength one)
  • 1 packet of taco seasoning
Equipment
  • 1 slow cooker (I’ve used 4 qt and 6.5 qt ones; I think the 4 qt works better, though the 6.5 qt works quite well)
Directions
  • put the defrosted chicken, taco seasoning and salsa in the slow cooker
  • cook for 8-10 hours on low
  • shred the chicken
  • integrate some of the leftover salsa into the shredded chicken for best taste

Notes

  • I’ve used 16 oz jars of salsa in the past (they’re more common than 24 oz jars). Personally I find 16 oz is not enough to cover the chicken or add enough flavor; a 16 oz jar works better in a 4 qt slow cooker if you want to use that size though
  • According to this site cooked chicken will last 3-4 days before going bad, so use it up before then
  • I usually put a small amount (a couple of forkfuls) in a container when reheating, and reheat for 1m30s
  • I live at high altitude (5430 feet); supposedly I should be adjusting my slow cooking for this
  • If you don’t defrost the chicken ahead of time, a lot of the chicken juice will end up in the slow cooker, watering down the salsa

Automating HiveServer/Hive Metastore Restarts After Lockup

Intro

At LivingSocial, one of the key components we use in the Hadoop ecosystem is Hive. I’ve been working here through our migration from Hive 0.7 up to (currently) 0.13. One of the problems I’ve encountered over the years has been HiveServer (1 or 2) or the Hive Metastore “locking up” - i.e. calls to the service just hang. Usually when this happens, someone from our warehouse team will go into the server and manually restart the init.d service (as we are not using Ambari or Cloudera Manager). However, depending on response times, this can cause issues when we have long-running ETL jobs overnight.

This post covers a method I’ve recently discovered for emulating Hive service lockups, and the monitoring logic I built around it. These tricks will probably be old hat for many Java devs, but they were new to me.

Background

Over the years we’ve tried various monitoring scripts to check whether Hive has stopped responding. Some of the methods we’ve used include:

  • A check for excessive CPU usage (usually Hive pegging one or more cores at 100%)
  • A real-time scan of the log looking for errors, with a restart if a particular error was encountered more than 20 times in a 2 minute period
  • A “simple query” executed every 30 min (select * from table limit 5)
  • An every-{{ unit of time }} restart of the underlying service (usually once a day, but sometimes more frequent)

These all work to varying degrees, but we still encounter the occasional lockup that slips through the various checks. It would be great to be able to detect these lockups as soon as they occur, and immediately restart.

What I found

Basically, I wanted to find a way to lock up Hive in a controlled environment. Looking up “how to lock up a jvm” on Google was…interesting, and not very fruitful. Eventually, a coworker asked - “why not just use Thread.sleep()?” - which made a lot of sense to me.

But - I needed a way of injecting Thread.sleep() into the running hive-metastore process. So - I looked into jdb. At first, I tried attaching jdb to the running process. However, I quickly found out that doing so results in a read-only jdb connection.

So - I decided to try starting up the Hive Metastore under jdb directly (the run line in the transcript below shows the command I ended up with).

I ran the following in jdb to “lock up” the metastore

Initializing jdb ...
> run
run org.apache.hadoop.util.RunJar /usr/lib/hive/lib/hive-service-0.13.1-cdh5.3.0.jar org.apache.hadoop.hive.metastore.HiveMetaStore
Set uncaught java.lang.Throwable
Set deferred uncaught java.lang.Throwable
>
VM Started:
> threads
Group system:
(java.lang.ref.Reference$ReferenceHandler)0x160         Reference Handler                         cond. waiting
(java.lang.ref.Finalizer$FinalizerThread)0x15f          Finalizer                                 cond. waiting
(java.lang.Thread)0x15e                                 Signal Dispatcher                         running
(java.lang.Thread)0x45e                                 process reaper                            cond. waiting
Group main:
(java.lang.Thread)0x1                                   main                                      running
(org.apache.hadoop.hive.metastore.HiveMetaStore$3)0x552 Thread-4                                  cond. waiting
(com.google.common.base.internal.Finalizer)0x72c        com.google.common.base.internal.Finalizer cond. waiting
(java.lang.Thread)0x744                                 BoneCP-keep-alive-scheduler               cond. waiting
(java.lang.Thread)0x746                                 BoneCP-pool-watch-thread                  cond. waiting
(com.google.common.base.internal.Finalizer)0x84f        com.google.common.base.internal.Finalizer cond. waiting
(java.lang.Thread)0x850                                 BoneCP-keep-alive-scheduler               cond. waiting
(java.lang.Thread)0x851                                 BoneCP-pool-watch-thread                  cond. waiting
> suspend 0x1
>

By suspending the thread, I could now see how other apps would respond. I proceeded to issue a “desc table” command via beeline. It hung! So - now I’ve got something which appears to emulate a “metastore lockup”.

So - what can I do with this info?

How can I tell if the metastore locked up?

I’ve played around with rbhive and knew that “thrift_socket” was the lowest point in its stack for HS2, so why not start there? Instead of looking at thrift_socket though, I figured - let’s just try a simple network socket. My first thought was - let’s just say “hi” over a socket connection to the running metastore instance (i.e. before suspending)
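
The sockaddr in the pry session below isn’t shown being constructed; presumably it was something along these lines (the host here is just the address from the connection-refused error further down, and 9083 is the metastore’s thrift port - treat both as placeholders):

require 'socket'

# Assumed setup for the pry session below: a sockaddr pointing at the
# metastore's thrift port. The host and port are placeholders, not gospel.
hive_metastore_server = '192.168.50.2'
hive_metastore_port   = 9083
sockaddr = Socket.sockaddr_in(hive_metastore_port, hive_metastore_server)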

[44] pry(main)> socket = Socket.new(:INET, :STREAM)
=> #<Socket:fd 15>
[45] pry(main)> socket.connect(sockaddr)
=> 0
[46] pry(main)> socket.write("GET / HTTP/1.0\r\n\r\n")
=> 18
[47] pry(main)> socket.read
=> ""

Hmmm - I’ve got an empty string back. Not nil. Interesting. What happens when I try this with the thread suspended?

[71] pry(main)> Timeout::timeout(15) {
[71] pry(main)*   socket = Socket.new(:INET, :STREAM)
[71] pry(main)*   socket.connect(sockaddr)
[71] pry(main)*
[71] pry(main)*   socket.write("hello")
[71] pry(main)*   socket.read
[71] pry(main)* }
Timeout::Error: execution expired
from (pry):84:in `read'

Great! Now we’ve got a socket read that times out! I also tried shutting down the metastore and connecting to the port - that ended with Errno::ECONNREFUSED: Connection refused - connect(2) for 192.168.50.2:9083. So - now we’ve got some relatively simple logic to determine whether the metastore has locked up!

The rest of the way

Now that I’ve got my check, I wrote a simple Ruby script which daemonizes the above logic and is controlled via a SysV init script (our servers are running CentOS). My script runs the check every 30 seconds and - on timeout - attempts a restart: first by shutting down via service, then via kill -15, and finally via kill -9 (if needed).
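
The restart escalation itself isn’t shown in the snippet further down (it’s just a comment there); a minimal sketch of that part might look something like this - the sleep intervals and the metastore_pid helper are my own placeholders, not the production script:

# Hedged sketch of the escalating restart: service stop, then kill -15, then kill -9.
# The sleeps and the metastore_pid helper are assumptions for illustration.
def restart_metastore!
  system('sudo service hive-metastore stop')
  sleep 30

  if (pid = metastore_pid)
    Process.kill('TERM', pid) # kill -15
    sleep 30
  end

  if (pid = metastore_pid)
    Process.kill('KILL', pid) # kill -9
  end

  system('sudo service hive-metastore start')
end

# Hypothetical helper: find the metastore's pid, or nil if it isn't running.
def metastore_pid
  pid = `pgrep -f org.apache.hadoop.hive.metastore.HiveMetaStore`.split.first
  pid && pid.to_i
end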

One issue I found right after the initial deploy was that the monitor was continuously restarting the metastore (oops…). It turns out that I needed to close_write the socket after writing “hello” - presumably the server was waiting for more bytes, so socket.read never saw an EOF and every check looked like a timeout. After adding that to the script, the monitor has run successfully (for the last 2+ days so far).

After these changes, my code is pretty much this

hive_metastore_restart.rb
require 'socket'
require 'timeout'

begin
  Timeout::timeout(30) do
    socket = TCPSocket.new @hive_metastore_server, @hive_metastore_port
    begin
      socket.write("hello")
      socket.close_write
      x = socket.read

      # not sure when this would happen....
      if x.nil?
        @monitored_app_state = :unknown
        ### raise something eventually
      else
        conditional_log(:running, "hive metastore appears to be running ok")
        @monitored_app_state = :running
      end
    rescue Errno::ECONNRESET, Errno::ECONNREFUSED => e
      conditional_log(:dead, "exception #{e} found. This typically occurs when hive-metastore is not running.")
      conditional_log(:dead, 'try running `sudo service hive-metastore status`')
      @monitored_app_state = :dead
    ensure
      socket.close
    end
  end
rescue Timeout::Error
  ## restart hive-metastore!!
end

Hopefully this will help us avoid additional downtime with hive-metastore.

Understanding Your Geoip Data

Intro

One problem I have encountered in my time working with “big data” has been data quality. There have been many times where I needed to apply some form of cleansing to data used in a query, or help data scientists clean up data used in one of their models. This post addresses one form of data cleansing I have to perform with some regularity; I call it “The Kansas problem”.

The problem is that a GeoIP lookup returns the GPS coordinates (38.0, -97.0) as the location for “US”; this gives a false impression of the precision of the data point, relative to its actual accuracy (somewhere within the US). The accuracy can be somewhat imputed from the additional metadata contained within the geoip_locations table from MaxMind, but it is not explicitly stated. This issue is not directly documented in the source data, and as near as I can tell it is little discussed online, so I thought it would be useful to do a quick blog post.

Data Sources

The source data I have used in the past is MaxMind’s free GeoLite database. This database lets you look up an IP address and get back a set of GPS coordinates; at a high level, MaxMind provides you with data to perform [lat, long] = f(IP_ADDR). The accuracy of the GeoLite database is described on MaxMind’s website: it is reasonably accurate (approximately 78% of matches in the US are accurate to within 40 km).

For the input IP addresses, I grabbed 50000 IP addresses from the list of Wikipedia revisions from April 2011. These are all anonymous edits (as anonymous edits leave an IP address rather than a username in the edit history). I then translated the IPs to 32-bit integers and looked up the location_id from MaxMind. Using the location_id, I was then able to pull the GPS coordinates; I did so for US locations only (for demonstration purposes). This dataset contained 28016 entries, of which 1848 resolved to (38.0, -97.0), or approximately 6.6% of the entries.
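
For the IP-to-integer step and the location_id lookup, the idea is roughly the following (the geoip_blocks table and its column names are assumptions based on the legacy GeoLite CSV layout, not necessarily how I loaded the data at work):

require 'ipaddr'

# Convert a dotted-quad IPv4 address to the 32-bit integer used by the
# legacy GeoLite blocks table.
def ip_to_int(ip)
  IPAddr.new(ip).to_i
end

ip_to_int('1.2.3.4') # => 16909060

# The lookup is then a range query against the blocks table, joined to locations.
# Table/column names here are assumptions based on the GeoLite legacy CSVs.
lookup_sql = <<-SQL
  SELECT l.latitude, l.longitude
  FROM my_data.geoip_blocks b
  JOIN my_data.geoip_locations l ON l.location_id = b.location_id
  WHERE #{ip_to_int('1.2.3.4')} BETWEEN b.start_ip_num AND b.end_ip_num
SQL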

What the data tells us

If you happen to look up (38.0, -97.0) on Google Maps, you won’t really find much there (see below). Essentially you’re 42 miles from Wichita by road (or 27 miles as the crow flies). The single decimal place in the GPS coordinates would imply that the data is accurate to within approximately 11.1 km (or approximately 6.9 miles) - see the Stack Overflow discussion on GPS coordinate precision.
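
As a quick sanity check on that 11.1 km figure - one degree of latitude is roughly 111 km, so a single decimal place pins you down to about a tenth of that:

# Back-of-the-envelope coordinate precision: one degree of latitude is
# roughly Earth's circumference / 360, and one decimal place is a tenth of that.
km_per_degree = 40_075.0 / 360      # ~111.3 km
precision_km  = km_per_degree * 0.1 # one decimal place => +/- ~11.1 km
puts precision_km.round(1)          # => 11.1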

This level of precision could include the town of Potwin, KS (population 449), but is out of range of Whitewater, KS (population 718), the only other town within the ~7 mile radius of (38.0, -97.0). This seems like a somewhat unusual place for 6.6% of the Wikipedia edits in my sample to occur.



Looking into this, the raw data does not include any state, zip code, dma code, etc. It simply says that this is part of the US. This is at odds with the precision indicated within the GPS coordinates, as mentioned above.

mysql -e "select * from my_data.geoip_locations where location_id=223"
+-------------+---------+--------+------+-------------+----------+-----------+----------+-----------+
| location_id | country | region | city | postal_code | latitude | longitude | dma_code | area_code |
+-------------+---------+--------+------+-------------+----------+-----------+----------+-----------+
|         223 | US      |        |      |             |       38 |       -97 |          |           |
+-------------+---------+--------+------+-------------+----------+-----------+----------+-----------+

It appears that this location_id is only precise to the level of “this location is somewhere in the US”. If you happen to rely upon the GPS coordinates to convey that precision, you will be badly misled.

So - we’ve got a problem - a decent chunk of our GeoIP coordinates return very high precision coordinates for some very low accuracy data. How big of a problem is this though? Is 6.6% of our data points really that much?

Here are a couple of very quick and dirty heatmaps of GeoIP locations that hopefully illustrate the issue. The before image contains (38, -97); the after image has these points removed from the dataset.

Heatmap links

With (38, -97)

Without

If you look closely at the center of Kansas, you’ll see either a huge heat cluster northeast of Wichita or none at all, depending on which heatmap you’re looking at. In the first image, this cluster is the largest in the country - which is crazy.

So - how do I go about fixing this? There are a couple of options I can think of:

  1. Remove the low-precision data points (a minimal sketch of this is shown after this list). This fixes the problem in the short term, but requires query writers/developers to be very active about knowledge dissemination. Probably the most common scenario I’ve encountered.
  2. Provide an additional data point to indicate what precision level the data really has - something akin to Google’s zoom level for maps, or to “this point is precise to plus or minus XX mi/km”. This would enable query writers to determine what level of accuracy they require for their particular use.
    • Note that the raw MaxMind data sort of provides this information. Combinations of blank fields appear to indicate the accuracy level of a particular location
    • Normally I would suggest that MaxMind only provide the tens digit in their data set, which would indicate that it is accurate to within ~1,000 km (still a bit too precise for the US, but much better than the current tenths digit). However, I am unsure how to properly represent this within the CSVs provided
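
A minimal sketch of option 1, assuming the rows have already been joined to their coordinates (the hash format here is made up for illustration):

# Option 1: drop the country-level-only points before building the heatmap.
# (38.0, -97.0) is the "somewhere in the US" location discussed above.
US_CENTROID = [38.0, -97.0]

def drop_country_level_points(rows)
  rows.reject { |row| [row[:latitude], row[:longitude]] == US_CENTROID }
end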

What lessons can we learn here?

  1. Query writers - look at your data and ask questions! I came across this issue when I noticed that a lot of the raw data points in a table at work had these particular coordinates

  2. It is important to understand whether your data has differing levels of precision, and how that is represented. GPS coordinates are supposed to convey a level of precision, but in MaxMind’s case, that turns out not to hold.

    • For the MaxMind dataset, blank fields appear to indicate different levels of precision in the GeoLiteCity-Location.csv file
    • The (38,-97) entry, for example, contains only { "country": "US", "latitude": 38, "longitude": -97 }
    • Virginia (the state where I grew up) contains {"id":12884,"country":"US","region":"VA","city":"","postal_code":"", "latitude":37.768,"longitude":-78.2057,"dma_code":"","area_code":""} - note the presence of the third and fourth decimal places, which would indicate a precision level of +/- 110 m and +/- 11 m respectively.
    • Whereas Reston (the town where I grew up) contains the following {"id":651,"country":"US","region":"VA","city":"Reston", "postal_code":"20190","latitude":38.9599, "longitude":-77.3428,"dma_code":"511","area_code":"703"}
    • If I were attempting to aggregate purchases to the state level of accuracy, I could include the second and third examples here; however, if I wanted to aggregate purchases down to the city level, I really should only use the third example

  3. Developers - if you notice a problem like this, perhaps attempt to “override” the level of precision. A column called “accurate_to”, measured in meters/km/miles and provided along with the GPS coordinates, would go a long way towards preventing bad analysis (a rough sketch of imputing such a column follows this list).

  4. Organizations - provide a good communication path between your query writers and developers when there are questions about how data is formed. Having people who can bridge the gap between developer and query writer (someone who knows how to code and also how to work with data) goes a long way to help remedy these sorts of problems.
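
As mentioned in lesson 3 above, an explicit accuracy column helps a lot. A rough sketch of imputing one from the blank fields in GeoLiteCity-Location.csv follows - the distance buckets are my own guesses for illustration, not anything MaxMind documents:

# Hedged sketch: impute an "accurate_to" value (in km) from which MaxMind
# fields are populated. The bucket values are assumptions, not MaxMind's spec.
def accurate_to_km(row)
  if row[:city].to_s != '' || row[:postal_code].to_s != ''
    25     # city/postal level
  elsif row[:region].to_s != ''
    500    # state/region level
  else
    5_000  # country level, e.g. the (38, -97) "US" entry
  end
end

row = { country: 'US', region: '', city: '', postal_code: '',
        latitude: 38.0, longitude: -97.0 }
accurate_to_km(row) # => 5000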


This post mentions GeoLite data created by MaxMind, available from MaxMind