Sometimes on stat1005 the OOM killer kicks in due to heavy user activity. After this event the /mnt/hdfs mountpoint is still mounted but shows no data, and consumers like dataset1001's rsync do not get new files.
Event Timeline
Change 416442 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::client: force remount of HDFS mountpoint
The current changeset would check the mount point at every puppet run, i.e. every 30 minutes, it seems. Is that often enough?
I wonder what other jobs use that mountpoint, it might be nice to find out.
Note that in at least one recent case there was no nice message from ls; instead we had
root@stat1005:/mnt/hdfs# ls -l
ls: cannot access 'wmf': No such file or directory
...
d????????? ? ? ? ? ? wmf
It might be enough for the moment; a downtime of 30 minutes should be fine in my opinion, but we can always revisit it later on with a cron.
Definitely. One alternative could be to do something like ls -l /mnt/hdfs and expect a 0 return code?
+1 for checking the return code.
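The return-code check could be sketched roughly like this (a minimal illustration; the helper name and messages are placeholders, not the merged check):

```shell
# Sketch of the return-code check discussed above: ls exits non-zero
# both when the path is missing and when the FUSE mount is stale
# ("Transport endpoint is not connected"), so the exit code alone is
# enough to detect the broken state.
check_mount_readable() {
    ls -l "$1" > /dev/null 2>&1
}

# Example (path assumed from this task):
if check_mount_readable /mnt/hdfs; then
    echo "mountpoint readable"
else
    echo "mountpoint NOT readable"
fi
```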
The other issue is that sometimes users might have file descriptors open to /mnt/hdfs/something, preventing a clean umount (without force). In that case, though, I'd say a puppet failure and a manual fix are good enough.
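When open file descriptors block a clean umount, a lazy (or forced) unmount is the usual escape hatch. A hedged sketch, not the actual puppet change; it assumes an fstab entry exists for the mountpoint:

```shell
# Sketch of a forced remount for a stale FUSE mount. Processes still
# holding file descriptors will start getting EIO/ENOTCONN once the
# filesystem is lazily detached.
force_remount() {
    local mp="$1"
    # fusermount -z / umount -l: lazy unmount, i.e. detach the
    # filesystem now and clean up once the last fd is closed.
    fusermount -uz "$mp" 2>/dev/null || umount -l "$mp"
    mount "$mp"
}

# Usage (requires root): force_remount /mnt/hdfs
```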
Change 416442 abandoned by Elukey:
role::analytics_cluster::client: force remount of HDFS mountpoint
Reason:
I am not convinced about this approach anymore; I'll file another code review to add an alarm instead.
Change 420335 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::cluster::client: add check for /mnt/hdfs
Change 420335 merged by Elukey:
[operations/puppet@production] profile::analytics::cluster::client: add check for /mnt/hdfs
Change 420639 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::statistics::private: enable alarm for /mnt/hdfs
Change 420639 merged by Elukey:
[operations/puppet@production] role::statistics::private: enable alarm for /mnt/hdfs
elukey@stat1005:~$ /usr/local/lib/nagios/plugins/check_mountpoint_readability /mnt/hdfs/
OK
Enabled it for stat1005, and it seems to be working fine. This is the use case we care about the most at the moment (an important rsync relies on /mnt/hdfs), but we'll probably extend the alarm to all the hosts using it.
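For reference, a minimal sketch of what a readability plugin along these lines might look like; the real check_mountpoint_readability in operations/puppet may differ, and the timeout value here is an assumption:

```shell
# Hypothetical Nagios-style readability check: print OK and exit 0 when
# the mountpoint lists cleanly, print CRITICAL and exit 2 otherwise.
# The timeout guards against a hung FUSE mount blocking NRPE forever.
check_mountpoint_readability() {
    local mp="$1"
    if timeout 10 ls "$mp" > /dev/null 2>&1; then
        echo "OK"
        return 0
    else
        echo "CRITICAL - ${mp} is not readable"
        return 2
    fi
}
```

Note that no sudo is needed for a plain readability check, which matches the follow-up cleanup below.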
Change 420980 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::cluster::client: remove useless sudo for nrpe check
Change 420980 merged by Elukey:
[operations/puppet@production] profile::analytics::cluster::client: remove useless sudo for nrpe check