I recently (finally) upgraded to Yosemite on my work laptop. One of the programs I use frequently - Adium - would no longer start; it froze immediately on launch.
I tracked down the issue - Bonjour multicast advertisements (enabled by default on Yosemite) seem to interfere with Adium. However, the fixes listed online all applied to pre-10.10.4 Yosemite (i.e. when Yosemite was still using discoveryd); 10.10.4 and later went back to mDNSResponder, so the fix needs to target mDNSResponder's launchd plist instead.
When you edit the plist file, add "-NoMulticastAdvertisements" to the ProgramArguments array, so that it looks like "ProgramArguments": ["/usr/sbin/mDNSResponder", "-NoMulticastAdvertisements"].
Then - convert the plist back to binary and restart mDNSResponder.
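A rough sketch of the whole procedure (the plist path and the plutil/launchctl invocations below are from my notes - double-check them on your system):

# convert the launchd plist to editable XML
sudo plutil -convert xml1 /System/Library/LaunchDaemons/com.apple.mDNSResponder.plist
# add -NoMulticastAdvertisements to the ProgramArguments array
sudo vim /System/Library/LaunchDaemons/com.apple.mDNSResponder.plist
# convert the plist back to binary
sudo plutil -convert binary1 /System/Library/LaunchDaemons/com.apple.mDNSResponder.plist
# restart mDNSResponder by reloading the job
sudo launchctl unload /System/Library/LaunchDaemons/com.apple.mDNSResponder.plist
sudo launchctl load /System/Library/LaunchDaemons/com.apple.mDNSResponder.plist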
There are a handful of episodes on Netflix now, so I tried one of Alton Brown's pork tenderloin recipes yesterday.
But - I wasn’t quite sure what to serve as a side.
In the past, I’ve been partial to pineapple (love some grilled pineapple!).
However, I also had some leftover rice in the rice cooker, still hot from the previous day.
So - I tried something a little different. I added the rice to a mixing bowl, then poured in the marinade reserved from the post-grill marinating of the pork.
I then took about ½ of the grilled pineapple, cut it into chunks, and added them to the rice mixture.
I personally really enjoyed the meal. The pineapple adds a sweetness that offsets the strong sour flavor from the lime.
There was also enough salt in the leftover marinade for my taste, though that's something that would be easy to overlook.
I wouldn’t be afraid to add additional salt if I felt it necessary.
The rice was a bit strong by itself - it's quite sour, so the pineapple really is necessary; err on the side of more pineapple than you think you need.
For dinner, I served the rice/pineapple/marinade mix on its own; for leftovers, I chopped up some leftover pork and added it to the rice mix, which made a nice addition.
Pineapple/Pork marinade rice side recipe (makes about 4 servings)
Ingredients
leftover marinade from pork tenderloin (the half used to marinate after grilling the tenderloin)
about 3 cups cooked rice (probably could have used a little more, as the marinade is quite strong)
about ½ of a grilled pineapple, cut into bite size chunks
Directions
Mix all of the above together in a mixing bowl right before serving
Notes
Do not use the marinade used with the raw pork
I served with a slice of pork tenderloin, and the rest of the pineapple as a second side
I might experiment with using all of the pineapple in the rice bowl in the future
Trying 4-6 cups cooked rice, whole pineapple and the marinade might be a good idea
Reheats decently in the microwave; not the best I've ever had, but certainly not the worst (in terms of how good it is reheated vs. how good it was fresh)
I've always been a fan of burritos and tacos - they're quick and delicious, though maybe not the healthiest thing in the world.
Years ago I came across a recipe which has become a staple of my diet for about a decade now.
Combine this with a rice cooker and you have a very hands-off way to cook up a big meal.
This recipe makes about 2.5 pounds of shredded chicken, which is great for adding to burritos, tacos and other items.
I believe I got this recipe from the old Ars Technica Bachelor Chow pdf that was sent around their forums a long time ago, though I could be mistaken there (I’ve since lost the pdf).
Chicken filling Recipe
Ingredients
1 bag (2 ½ lb) of frozen chicken breasts (you can get these at most grocery stores).
defrost the chicken breasts overnight for best results
1 jar (24 oz) of salsa (I prefer a medium-strength one)
1 packet of taco seasoning
Equipment
1 slow cooker (I've used both 4 qt and 6.5 qt models; I think the 4 qt works a bit better, though the 6.5 qt works quite well)
Directions
put the defrosted chicken, taco seasoning and salsa in the slow cooker
cook for 8-10 hours on low
shred the chicken
stir some of the salsa left in the cooker back into the shredded chicken for best taste
Notes
I've used 16 oz jars of salsa in the past (they're more common than 24 oz jars). Personally, I find 16 oz isn't enough to cover the chicken or add enough flavor; if you do use that size, it works better in a 4 qt slow cooker.
According to this site cooked chicken will last 3-4 days before going bad, so use it up before then
I usually put a small amount (a couple of forkfuls) in a container when reheating, and reheat for about 1 minute 30 seconds
At LivingSocial, one of the key components we use in the Hadoop ecosystem is Hive. In my time here, I've seen us migrate from 0.7 up to (currently) 0.13.
One of the problems I’ve encountered over the years has been HiveServer (1 or 2) or the Hive Metastore “locking up” - i.e. calls to the service just hang.
Usually when this happens, someone from our warehouse team will log into the server and manually restart the init.d service (as we are not using Ambari or Cloudera Manager).
However, depending on how quickly someone can respond, this can cause issues with our long-running overnight ETL jobs.
This post covers a method I recently discovered for emulating Hive service lockups (and a way to detect them). It will probably be old hat for many Java devs, but it was new to me.
Background
Over the years we've tried various monitoring scripts to check whether Hive has stopped responding. Some of the methods we've used include:
Checking for excessive CPU usage (usually Hive pegging one or more cores at 100%)
Real-time scanning of the log, restarting if a particular error was encountered more than 20 times in a 2-minute period
A "simple query" executed every 30 minutes (select * from table limit 5; sketched below)
A restart of the underlying service every {{ unit of time }} (usually once a day, but sometimes more frequently)
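For reference, the "simple query" check boils down to something like the following (the JDBC URL and table name are placeholders for whatever your environment uses):

# fail the check if the query errors out or takes longer than two minutes
timeout 120 beeline -u 'jdbc:hive2://hiveserver.example.com:10000' \
  -e 'select * from some_small_table limit 5' \
  || echo 'HiveServer2 health check failed' >&2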
These all work to varying degrees, but we still encounter the occasional lockup that slips through the various checks.
It would be great to be able to detect these lockups as soon as they occur, and immediately restart.
What I found
Basically, I wanted to find a way to lock up Hive in a controlled environment.
Looking up "how to lock up a jvm" on Google was…interesting, and not very fruitful.
Eventually, a coworker asked, "why not just use Thread.sleep()?" - which made a lot of sense to me.
But - I needed a way of injecting Thread.sleep() into the running hive-metastore process. So - I looked into jdb.
At first - I tried attaching jdb to the running process. However, I quickly found that doing so results in a read-only jdb connection.
So - I decided to try starting the Hive Metastore under jdb directly. Here's how I figured out exactly what command to run.
So - I did some digging on our hive-metastore server.
First - I looked at /etc/init.d/hive-metastore, and found the startup command for hive-metastore (which is effectively su -s /bin/bash hive -c "hive --service metastore").
From here - I looked at the hive command in vim (vim $(which hive)), which led me to /usr/lib/hive/bin/ext/metastore.sh.
This file, it turns out, calls hadoop jar org.apache.hadoop.hive.metastore.HiveMetaStore, so I took a look at the hadoop command.
vim $(which hadoop) led me to /usr/lib/hadoop/bin/hadoop. In here - I finally found the actual java call. However, it used a mix of environment variables, making it painful to reconstruct by hand.
So - I decided to just print the call to stderr (in addition to calling the program as normal) rather than trace all the variables by hand.
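In rough terms, that meant adding a couple of echo lines right before the final exec near the bottom of /usr/lib/hadoop/bin/hadoop (the variable names below are approximate - check your copy of the script):

# print the classpath and the fully-expanded java invocation to stderr...
echo "CLASSPATH=$CLASSPATH" >&2
echo "$JAVA $JAVA_HEAP_MAX $HADOOP_OPTS -classpath $CLASSPATH $CLASS $@" >&2
# ...then run the program as normal
exec "$JAVA" $JAVA_HEAP_MAX $HADOOP_OPTS -classpath "$CLASSPATH" $CLASS "$@"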
I then started the metastore class under jdb, with $CLASSPATH set to the same value echoed above.
I ran the following in jdb to “lock up” the metastore
Initializing jdb ...
> run
run org.apache.hadoop.util.RunJar /usr/lib/hive/lib/hive-service-0.13.1-cdh5.3.0.jar org.apache.hadoop.hive.metastore.HiveMetaStore
Set uncaught java.lang.Throwable
Set deferred uncaught java.lang.Throwable
>
VM Started:
> threads
Group system:
(java.lang.ref.Reference$ReferenceHandler)0x160 Reference Handler cond. waiting
(java.lang.ref.Finalizer$FinalizerThread)0x15f Finalizer cond. waiting
(java.lang.Thread)0x15e Signal Dispatcher running
(java.lang.Thread)0x45e process reaper cond. waiting
Group main:
(java.lang.Thread)0x1 main running
(org.apache.hadoop.hive.metastore.HiveMetaStore$3)0x552 Thread-4 cond. waiting
(com.google.common.base.internal.Finalizer)0x72c com.google.common.base.internal.Finalizer cond. waiting
(java.lang.Thread)0x744 BoneCP-keep-alive-scheduler cond. waiting
(java.lang.Thread)0x746 BoneCP-pool-watch-thread cond. waiting
(com.google.common.base.internal.Finalizer)0x84f com.google.common.base.internal.Finalizer cond. waiting
(java.lang.Thread)0x850 BoneCP-keep-alive-scheduler cond. waiting
(java.lang.Thread)0x851 BoneCP-pool-watch-thread cond. waiting
> suspend 0x1
>
By suspending the main thread, I could now see how other apps would respond. I issued a "desc table" command via beeline - it hung!
So - now I’ve got something which appears to emulate a “metastore lockup”.
So - what can I do with this info?
How can I tell if the metastore locked up?
I'd played around with rbhive and knew that "thrift_socket" was the lowest point in its stack for HS2, so why not start there?
Instead of looking at thrift_socket though, I figured - let’s just try a simple network socket.
My first thought was - let's just say "hi" over a socket connection to the running metastore instance (i.e. before suspending the main thread).
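Here's a minimal sketch of that first experiment in Ruby (9083 is the default metastore thrift port, and the host is just my test box):

require 'socket'
require 'timeout'

# write something to the metastore port and try to read a response;
# against a suspended (locked-up) metastore, the read hangs until the timeout fires
Timeout.timeout(10) do
  socket = TCPSocket.new('192.168.50.2', 9083)
  socket.write('hello')      # not valid thrift, but good enough for this test
  puts socket.read.inspect
  socket.close
end

(As it turns out, this version has a subtle flaw - more on that below.)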
Great! Now we've got a read that times out against a locked-up metastore. I also tried shutting down the metastore entirely and connecting to the port - that ended with Errno::ECONNREFUSED: Connection refused - connect(2) for 192.168.50.2:9083.
So - now we’ve got some relatively simple logic to determine whether the metastore has locked up!
The rest of the way
Now that I had my check logic, I wrote a simple Ruby script that daemonizes it and is controlled via a SysV init script (our servers run CentOS).
My script runs the check every 30 seconds and, on timeout, attempts a restart - first by shutting down via the service command, then via kill -15, and finally via kill -9 if needed.
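The escalation itself looks roughly like this (an illustrative sketch rather than the actual script - in particular, finding the PID via pgrep -f HiveMetaStore is just one way to do it):

# try init.d first, then SIGTERM, then SIGKILL, then bring the service back up
def restart_metastore!
  system('service hive-metastore stop')
  pid = `pgrep -f HiveMetaStore`.to_i
  unless pid.zero?
    Process.kill('TERM', pid)        # kill -15
    sleep 30
    begin
      Process.kill('KILL', pid)      # kill -9, if it's still around
    rescue Errno::ESRCH
      # already gone after the TERM
    end
  end
  system('service hive-metastore start')
end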
One issue I found right after the initial deploy was that the monitor was continuously restarting the metastore (oops…).
It turns out that I needed to close_write the socket after writing "hello" - without that, the read blocks (and times out) even against a perfectly healthy metastore.
After adding that, the monitor has run successfully (for the last 2+ days so far). The core of the check looks like this:
begin
  Timeout::timeout(30) do
    begin
      socket = TCPSocket.new(@hive_metastore_server, @hive_metastore_port)
      socket.write("hello")
      socket.close_write
      x = socket.read
      # not sure when this would happen....
      if x.nil?
        @monitored_app_state = :unknown ### raise something eventually
      else
        conditional_log(:running, "hive metastore appears to be running ok")
        @monitored_app_state = :running
      end
    rescue Errno::ECONNRESET, Errno::ECONNREFUSED => e
      conditional_log(:dead, "exception #{e} found. This typically occurs when hive-metastore is not running.")
      conditional_log(:dead, 'try running `sudo service hive-metastore status`')
      @monitored_app_state = :dead
    ensure
      socket.close if socket
    end
  end
rescue Timeout::Error
  ## restart hive-metastore!!
end
Hopefully this will help us avoid additional downtime with hive-metastore.
One problem I have encountered in my time working with “big data” has been data quality issues.
There have been many times when I've needed to apply some form of data cleansing to the data used in a query, or help data scientists clean up data used in one of their models.
This post addresses one form of data cleansing I have to perform with some regularity; I call it “The Kansas problem”.
The problem is that a GeoIP lookup returns the GPS coordinates (38.0, -97.0) as the location for "US"; this gives a false impression of precision relative to the intended accuracy of the data point (somewhere within the US).
The accuracy can be somewhat inferred from the additional metadata in the geoip_locations table from MaxMind, but it is not explicitly stated.
This issue is not directly documented in the source of the data used, and is little discussed online as near as I can tell, so I thought it would be useful to do a quick blog post.
Data Sources
The source data I have used in the past is the free GeoLite database from MaxMind.
This database allows you to look up an IP address and get back a set of GPS coordinates; at a high level, MaxMind provides you with the data to perform [lat, long] = f(IP_ADDR).
The accuracy of the GeoLite database is described on MaxMind's website: the database is reasonably accurate, with approximately 78% of matches in the US accurate to within 40 km.
For the input IP addresses, I grabbed 50,000 IP addresses from the list of Wikipedia revisions from April 2011.
These are all anonymous edits (anonymous edits leave an IP address rather than a username in the edit history).
I then translated the IPs to 32-bit integers, and looked up the location_id from MaxMind.
Using the location_id, I could then pull the GPS coordinates; for demonstration purposes, I did so for US locations only.
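For reference, with the legacy GeoLite City CSVs loaded into MySQL, the lookup looks roughly like this (the blocks table name and its columns below are assumptions about my load scripts; geoip_locations matches the table shown later in this post):

-- translate the dotted IP to a 32-bit integer and find the block that contains it,
-- then join to the locations table for the coordinates (the IP below is just an example)
SELECT l.location_id, l.latitude, l.longitude
FROM my_data.geoip_blocks b
JOIN my_data.geoip_locations l ON l.location_id = b.location_id
WHERE INET_ATON('192.0.2.44') BETWEEN b.start_ip_num AND b.end_ip_num;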
This dataset contained 28016 entries, of which 1848 entries resolved to (38.0, -97.0), or approximately 6.6% of the entries.
What the data tells us
If you happen to look up (38.0, -97.0) on Google Maps, you won't really find much there.
Essentially, you're 42 miles from Wichita by road (or 27 miles as the crow flies).
The single decimal place in the GPS coordinates would imply that the data is accurate to within approximately 11.1 km (about 6.9 miles) - there's a good Stack Overflow discussion of GPS decimal precision.
This level of precision could include the town of Potwin, KS (population 449), but falls just short of Whitewater, KS (population 718), the only other town within roughly 7 miles of (38.0, -97.0).
This seems like a rather unusual place for 6.6% of the Wikipedia edits in my sample to occur.
Looking into this, I found that the raw data for this location does not include any state, zip code, DMA code, etc. - it simply says the location is somewhere in the US. That is at odds with the precision implied by the GPS coordinates, as mentioned above.
mysql -e "select * from my_data.geoip_locations where location_id=223"
+-------------+---------+--------+------+-------------+----------+-----------+----------+-----------+
| location_id | country | region | city | postal_code | latitude | longitude | dma_code | area_code |
+-------------+---------+--------+------+-------------+----------+-----------+----------+-----------+
|         223 | US      |        |      |             |       38 |       -97 |          |           |
+-------------+---------+--------+------+-------------+----------+-----------+----------+-----------+
It appears that this location_id is only precise to the level of "this location is somewhere in the US". If you rely on the GPS coordinates themselves to convey that precision, you will be badly misled.
So - we've got a problem - a decent chunk of our GeoIP lookups return very high-precision coordinates for very low-accuracy data. How big of a problem is this, though? Is 6.6% of our data points really that much?
Here are a couple of very quick and dirty heatmaps of GeoIP locations that hopefully illustrate the issue.
The "before" image includes the (38, -97) points; the "after" image removes them from the dataset.
If you look closely at the center of Kansas, you'll see either a huge heat cluster northeast of Wichita or nothing at all, depending on which image you're looking at.
In the first image, that cluster is the largest in the country - which is crazy.
So - how do I go about fixing this? There are a couple of options I can think of:
Remove the country-level data points (like (38, -97)). This fixes the problem in the short term, but requires query writers/developers to be very active about knowledge dissemination. This is probably the most common approach I've encountered.
Provide an additional field to indicate how precise the data point really is - something akin to Google Maps' zoom levels, or simply "this point is accurate to plus or minus XX mi/km". This would let query writers determine what level of accuracy they require for their particular use.
Note that the raw MaxMind data sort of provides this information already: combinations of blank fields appear to indicate the accuracy level of a particular location.
Normally I would suggest that MaxMind round such entries to the tens digit (e.g. (40, -100)), which would indicate accuracy only to within ~1,000 km (still a bit too precise for "somewhere in the US", but much better than the current tenths digit). However, I am unsure how to properly represent this within the CSVs provided.
What lessons can we learn here?
Query writers - look at your data and ask questions! I came across this issue when I noticed that a lot of the raw data points in a table at work had these exact coordinates.
It is important to understand whether your data has differing levels of precision, and how that is represented. GPS coordinates are supposed to convey a level of precision, but in MaxMind's case they do not.
For the MaxMind dataset, blank fields appear to indicate different levels of precision in the GeoLiteCity-Location.csv file
The (38,-97) entry, for example, contains only { "country": "US", "latitude": 38, "longitude": -97 }
Virginia (the state where I grew up) contains
{"id":12884,"country":"US","region":"VA","city":"","postal_code":"",
"latitude":37.768,"longitude":-78.2057,"dma_code":"","area_code":""}
note the presence of the third and fourth decimal places, which would imply precision of roughly +/- 110 m and +/- 11 m, respectively.
Whereas Reston (the town where I grew up) contains the following
{"id":651,"country":"US","region":"VA","city":"Reston",
"postal_code":"20190","latitude":38.9599,
"longitude":-77.3428,"dma_code":"511","area_code":"703"}
If I were aggregating purchases at the state level, I could include both the second and third examples here; however, if I wanted to aggregate purchases down to the city level, I really should only use the third example.
Developers - if you notice a problem like this, consider "overriding" the level of precision. A column called "accurate_to", measured in meters/km/miles and provided alongside the GPS coordinates, would go a long way towards preventing bad analysis (see the sketch after this list).
Organizations - provide a good communication path between your query writers and developers when there are questions about how data is formed. Having people who can bridge the gap between developer and query writer (someone who knows how to code and also how to work with data) goes a long way toward remedying these sorts of problems.
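As a rough sketch of the "accurate_to" idea from the Developers point above (table and column names are illustrative, the numbers are placeholders, and the 40 km figure simply borrows MaxMind's stated city-level accuracy):

-- add an explicit accuracy column alongside the coordinates
ALTER TABLE geoip_locations ADD COLUMN accurate_to_km INT;

-- country-level rows (blank region/city) are only accurate to "somewhere in the US"
UPDATE geoip_locations SET accurate_to_km = 5000
WHERE country = 'US' AND region = '' AND city = '';

-- city-level rows: roughly the 40 km accuracy MaxMind quotes
UPDATE geoip_locations SET accurate_to_km = 40
WHERE city <> '';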
This post mentions GeoLite data created by MaxMind, available from MaxMind