
Alarm when /mnt/hdfs is mounted but showing no files/dirs in there
Closed, Resolved | Public | 5 Estimated Story Points

Description

Sometimes on stat1005 the OOM killer kicks in because of heavy user activity. After such an event the /mnt/hdfs mountpoint is still mounted but shows no data, and consumers such as dataset1001's rsync stop getting new files.
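The symptom above (mountpoint still present, listing empty) could be detected with a minimal sketch like the one below. The function name `mounted_but_empty` is hypothetical and is not taken from any patch on this task. It assumes that a healthy /mnt/hdfs always contains at least one entry.

```python
import os


def mounted_but_empty(path):
    """Return True when `path` is a mountpoint whose listing is empty.

    Hypothetical helper: assumes a healthy mount always has at least
    one entry, so an empty listing on a live mountpoint is suspicious.
    """
    # os.path.ismount() reports whether `path` is a mount point
    # (on POSIX it compares the device of `path` with its parent's).
    if not os.path.ismount(path):
        return False
    # os.listdir() omits '.' and '..', so an empty list means no data.
    return len(os.listdir(path)) == 0
```

Calling `mounted_but_empty("/mnt/hdfs")` would return True only in the broken state described here. An unmounted directory returns False, so that case would need its own check.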

Event Timeline

Change 416442 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::client: force remount of HDFS mountpoint

https://gerrit.wikimedia.org/r/416442

The current changeset would check the mount point at every puppet run, i.e. every 30 minutes, it seems. Is that often enough?
I wonder what other jobs use that mountpoint; it would be nice to find out.

Note that in at least one recent case there was no nice message from ls; instead we had

root@stat1005:/mnt/hdfs# ls -l
ls: cannot access 'wmf': No such file or directory
...
d?????????   ? ?    ?         ?            ? wmf
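A row of question marks like this means ls could read the entry's name from the directory but could not stat it. A rough sketch of detecting that condition follows. The helper name `unstattable_entries` is made up for illustration and does not come from the changeset.

```python
import os


def unstattable_entries(path):
    """Return the names listed in `path` for which lstat() fails.

    Hypothetical helper: these are the entries that `ls -l` would
    print with '?' in place of permissions, owner, size and date.
    """
    broken = []
    for name in os.listdir(path):
        try:
            # lstat() does not follow symlinks, so a dangling symlink
            # is not reported as broken.
            os.lstat(os.path.join(path, name))
        except OSError:
            broken.append(name)
    return sorted(broken)
```

On a healthy directory the result is an empty list. In the case shown above it would presumably contain `wmf`.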

> The current changeset would check the mount point at every puppet run, i.e. every 30 minutes, it seems. Is that often enough?

It might be enough for the moment. A downtime of 30 minutes should be fine in my opinion, but we can always revisit it later with a cron job.

> I wonder what other jobs use that mountpoint, it might be nice to find out.

> Note that in at least one recent case there was no nice message from ls; instead we had
>
> root@stat1005:/mnt/hdfs# ls -l
> ls: cannot access 'wmf': No such file or directory
> ...
> d?????????   ? ?    ?         ?            ? wmf

Definitely. One alternative could be to run something like ls -l /mnt/hdfs and expect an exit code of 0?

> Definitely, one alternative could be to do something like ls -l /mnt/hdfs and expect a 0 in return?

+1 for checking the return code.

The other issue is that users might sometimes have file descriptors open under /mnt/hdfs/something, preventing a clean umount (without force). In that case, though, I'd say that a puppet failure followed by a manual fix is good enough.
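The return-code idea discussed above could look roughly like the sketch below. It is not the implementation of any plugin deployed for this task. The function name `check_readable` and the 30-second timeout are assumptions, and the exit codes follow the usual Nagios plugin convention (0 for OK, 2 for CRITICAL).

```python
import subprocess

OK, CRITICAL = 0, 2  # conventional Nagios plugin exit codes


def check_readable(path, timeout=30):
    """Run `ls -l` on `path` and map its exit status to a Nagios result.

    Hypothetical sketch: returns a (exit_code, message) tuple.
    """
    try:
        result = subprocess.run(
            ["ls", "-l", path],
            stdout=subprocess.DEVNULL,  # only the exit status matters
            stderr=subprocess.PIPE,
            timeout=timeout,  # assumed value, guards against a hung mount
        )
    except subprocess.TimeoutExpired:
        return CRITICAL, "CRITICAL: ls timed out on " + path
    if result.returncode != 0:
        detail = result.stderr.decode(errors="replace").strip()
        return CRITICAL, "CRITICAL: " + detail
    return OK, "OK"
```

A wrapper would print the message and exit with the returned code. Note that ls exits with 0 on an empty directory, so this check alone would likely not catch the mounted-but-empty case from the description.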

Change 416442 abandoned by Elukey:
role::analytics_cluster::client: force remount of HDFS mountpoint

Reason:
I am not convinced about this path anymore; I'll file another code review to add an alarm instead.

https://gerrit.wikimedia.org/r/416442

Change 420335 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::cluster::client: add check for /mnt/hdfs

https://gerrit.wikimedia.org/r/420335

Change 420335 merged by Elukey:
[operations/puppet@production] profile::analytics::cluster::client: add check for /mnt/hdfs

https://gerrit.wikimedia.org/r/420335

Change 420639 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::statistics::private: enable alarm for /mnt/hdfs

https://gerrit.wikimedia.org/r/420639

Change 420639 merged by Elukey:
[operations/puppet@production] role::statistics::private: enable alarm for /mnt/hdfs

https://gerrit.wikimedia.org/r/420639

elukey@stat1005:~$ /usr/local/lib/nagios/plugins/check_mountpoint_readability /mnt/hdfs/
OK

Enabled it for stat1005, and it seems to be working fine. This is the use case that we care about the most at the moment (because an important rsync relies on /mnt/hdfs), but we'll probably extend the alarm to all the hosts that use the mountpoint.

elukey set the point value for this task to 5.

Change 420980 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::cluster::client: remove useless sudo for nrpe check

https://gerrit.wikimedia.org/r/420980

Change 420980 merged by Elukey:
[operations/puppet@production] profile::analytics::cluster::client: remove useless sudo for nrpe check

https://gerrit.wikimedia.org/r/420980