Alarm when /mnt/hdfs is mounted but showing no files/dirs in there
Closed, Resolved · Public · 5 Estimated Story Points

Authored By: elukey, Feb 12 2018, 2:02 PM

Description

Sometimes on stat1005 the OOM killer kicks in because of heavy user activity. After this event the /mnt/hdfs mountpoint is still mounted but shows no data, and consumers such as dataset1001's rsync do not get new files.

Details

Subject	Repo	Branch	Lines +/-
profile::analytics::cluster::client: remove useless sudo for nrpe check	operations/puppet	production	+1 -1
role::statistics::private: enable alarm for /mnt/hdfs	operations/puppet	production	+1 -0
profile::analytics::cluster::client: add check for /mnt/hdfs	operations/puppet	production	+44 -1
role::analytics_cluster::client: force remount of HDFS mountpoint	operations/puppet	production	+14 -6

Event Timeline

elukey created this task. Feb 12 2018, 2:02 PM

Restricted Application added a subscriber: Aklapper. Feb 12 2018, 2:02 PM

elukey added a project: Analytics-Kanban. Feb 12 2018, 2:03 PM

elukey added a subscriber: Ottomata.

elukey moved this task from Backlog to Analytics Backlog on the User-Elukey board. Feb 16 2018, 12:01 PM

Change 416442 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::client: force remount of HDFS mountpoint

https://gerrit.wikimedia.org/r/416442

gerritbot added a project: Patch-For-Review. Mar 5 2018, 2:21 PM

elukey moved this task from Next Up to In Code Review on the Analytics-Kanban board. Mar 5 2018, 2:35 PM

The current changeset would check the mount point at every puppet run, i.e. every 30 minutes, it seems. Is that often enough?
I wonder what other jobs use that mountpoint, it might be nice to find out.

Note that in at least one recent case there was no nice message from ls; instead we had

root@stat1005:/mnt/hdfs# ls -l
ls: cannot access 'wmf': No such file or directory
...
d?????????   ? ?    ?         ?            ? wmf
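
The `d?????????` line above is what `ls -l` prints when the directory still lists a name but the entry itself can no longer be stat'ed. A minimal sketch of how a check could detect that state is shown here; the function name `unreadable_entries` and its injectable `lstat` parameter are assumptions made for illustration, not part of any existing plugin.

```python
import os


def unreadable_entries(mountpoint, lstat=os.lstat):
    """Return names that the directory lists but that cannot be stat'ed.

    `lstat` is injectable only so the failure can be simulated in a test;
    in normal use the default os.lstat is what gets called.
    """
    broken = []
    for name in sorted(os.listdir(mountpoint)):
        try:
            lstat(os.path.join(mountpoint, name))
        except OSError:
            # The name is visible but its metadata is not: the same
            # situation that makes ls print question marks.
            broken.append(name)
    return broken
```

On a healthy mountpoint the function returns an empty list; a non-empty result would mean the mount is in the broken state shown above.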

In T187073#4023938, @ArielGlenn wrote:

> The current changeset would check the mount point at every puppet run, i.e. every 30 minutes, it seems. Is that often enough?

It might be enough for the moment; a downtime of 30 minutes should be fine in my opinion, but we can always revisit it later with a cron job.

> I wonder what other jobs use that mountpoint, it might be nice to find out.
>
> Note that in at least one recent case there was no nice message from ls; instead we had
> root@stat1005:/mnt/hdfs# ls -l
> ls: cannot access 'wmf': No such file or directory
> ...
> d?????????   ? ?    ?         ?            ? wmf

Definitely, one alternative could be to do something like ls -l /mnt/hdfs and expect a return code of 0?
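
The return-code idea can be sketched as follows. This is only an illustration under assumptions: the helper name `ls_succeeds` and the 10-second timeout are hypothetical, and the timeout is a best-effort guard in case the mount hangs instead of failing.

```python
import subprocess


def ls_succeeds(path, timeout=10):
    """Return True only if `ls -l path` exits with status 0 in time."""
    try:
        result = subprocess.run(
            ["ls", "-l", path],
            stdout=subprocess.DEVNULL,  # only the exit status matters here
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # A listing that never finishes is treated as a failure too.
        return False
    return result.returncode == 0
```

Note that `ls` also exits with 0 on an empty directory, so a return-code check alone would not catch a mountpoint that is mounted but shows no entries at all.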

In T187073#4024045, @elukey wrote:

> Definitely, one alternative could be to do something like ls -l /mnt/hdfs and expect a 0 in return?

+1 for checking the return code.

The other issue is that users might sometimes have file descriptors open under /mnt/hdfs/something, preventing a clean umount (without force). In this case, though, I'd say that a puppet failure and a manual fix is good enough.
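
For the manual fix, it helps to know which processes are holding files open under the mountpoint. Tools such as `lsof` or `fuser` are the usual way to find out; the sketch below shows the same idea on Linux by reading the `/proc/<pid>/fd` symlinks. The function name `pids_with_open_files` is an assumption for illustration, and the sketch only looks at open file descriptors, not at processes whose working directory is inside the mount.

```python
import os


def pids_with_open_files(mountpoint, proc="/proc"):
    """Return sorted PIDs that have a file descriptor open under mountpoint."""
    # Trailing separator so that /mnt/hdfs does not also match /mnt/hdfs2.
    prefix = os.path.join(os.path.realpath(mountpoint), "")
    holders = set()
    for pid in filter(str.isdigit, os.listdir(proc)):
        fd_dir = os.path.join(proc, pid, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            # Process exited in the meantime, or we lack permission to look.
            continue
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue
            if target.startswith(prefix):
                holders.add(int(pid))
    return sorted(holders)
```

Without root privileges only the caller's own processes are visible, so a real investigation would run this (or `lsof`) as root.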

elukey claimed this task. Mar 7 2018, 5:04 PM

Change 416442 abandoned by Elukey:
role::analytics_cluster::client: force remount of HDFS mountpoint

Reason:
I am not convinced about this path anymore, I'll file another code review to have an alarm.

https://gerrit.wikimedia.org/r/416442

elukey moved this task from In Code Review to In Progress on the Analytics-Kanban board. Mar 15 2018, 11:01 AM

Change 420335 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::cluster::client: add check for /mnt/hdfs

https://gerrit.wikimedia.org/r/420335

Change 420335 merged by Elukey:
[operations/puppet@production] profile::analytics::cluster::client: add check for /mnt/hdfs

https://gerrit.wikimedia.org/r/420335

Change 420639 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::statistics::private: enable alarm for /mnt/hdfs

https://gerrit.wikimedia.org/r/420639

Change 420639 merged by Elukey:
[operations/puppet@production] role::statistics::private: enable alarm for /mnt/hdfs

https://gerrit.wikimedia.org/r/420639

elukey@stat1005:~$ /usr/local/lib/nagios/plugins/check_mountpoint_readability /mnt/hdfs/
OK

Enabled it for stat1005, and it seems to be working fine. This is the use case that we care about the most at the moment (due to an important rsync relying on /mnt/hdfs), but we'll probably extend the alarm to all the hosts using it.
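
The source of check_mountpoint_readability is not shown in this task, so the following is not that plugin. It is a minimal sketch of what a Nagios-style check for this problem could look like, assuming the standard plugin exit codes (0 for OK, 2 for CRITICAL); the function name `check_mountpoint` and the message wording are hypothetical.

```python
import os

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def check_mountpoint(path):
    """Return (exit_code, message) describing whether path is readable."""
    try:
        entries = os.listdir(path)
    except OSError as err:
        return CRITICAL, f"CRITICAL: cannot list {path}: {err.strerror}"
    if not entries:
        # Mounted but showing no files/dirs: the case this task is about.
        return CRITICAL, f"CRITICAL: {path} is empty"
    return OK, "OK"
```

A wrapper script would print the message and pass the code to `sys.exit()`, which is what lets NRPE turn the result into an alarm.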

elukey moved this task from In Progress to Done on the Analytics-Kanban board. Mar 20 2018, 7:05 AM

elukey set the point value for this task to 5.

elukey moved this task from Analytics Backlog to Done on the User-Elukey board. Mar 20 2018, 10:18 AM

Change 420980 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::cluster::client: remove useless sudo for nrpe check

https://gerrit.wikimedia.org/r/420980

Change 420980 merged by Elukey:
[operations/puppet@production] profile::analytics::cluster::client: remove useless sudo for nrpe check

https://gerrit.wikimedia.org/r/420980

Nuria closed this task as Resolved. Mar 26 2018, 9:27 PM
