
Homedir/UID info breaks after a while in Tools Kubernetes (can't read replica.my.cnf)
Closed, Resolved (Public)

Description

https://tools.wmflabs.org/multidesc/index.php?raw&image=File%3AFagaras.jpg

Warning: parse_ini_file(/data/project//replica.my.cnf): failed to open stream: No such file or directory in /data/project/magnustools/public_html/php/common.php on line 57

Warning: mysqli::mysqli(): (HY000/1045): Access denied for user ''@'10.68.18.25' (using password: NO) in /data/project/magnustools/public_html/php/common.php on line 126

Fatal error: Call to a member function query() on boolean in /data/project/multidesc/public_html/index.php on line 171

parse_ini_file(/data/project//replica.my.cnf) suggests the tool is looking in the wrong place (maybe using an empty 'project' variable?)


Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2017-06-03T15:59:41Z] <bd808> Restarted webservice to see if PHP user detection error is transient (T166949)

Restarting the kubernetes webservice process seems to have fixed the app. It looks like the PHP [[https://secure.php.net/get_current_user|get_current_user()]] function started returning an empty string. get_current_user() checks the ownership of the script, so maybe this was caused by some sort of NFS communication error.
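For context, a minimal sketch of the kind of pattern that would produce the /data/project//replica.my.cnf path from the description, assuming the tool derives the project name from get_current_user() and strips the tools. prefix (the actual code in common.php is not shown in this task and may differ):

// Hypothetical reconstruction; the real code lives in
// /data/project/magnustools/public_html/php/common.php.
$user = get_current_user();                       // normally something like "tools.multidesc"
$tool = preg_replace( '/^tools\./', '', $user );
// If get_current_user() starts returning '' (e.g. after an NFS/NSS hiccup),
// the path collapses to "/data/project//replica.my.cnf", as in the warning above.
$cnf = parse_ini_file( '/data/project/' . $tool . '/replica.my.cnf' );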

It might be more robust to find the tool name based on the running process:

$ webservice --backend=kubernetes php5.6 shell
$ php -a
Interactive mode enabled

php > echo posix_getpwuid(posix_geteuid())['name'];
tools.multidesc
php >
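A minimal sketch of that approach (illustrative only; the variable names are not taken from any tool's code):

// Resolve the account running the current process instead of the script owner.
$info = posix_getpwuid( posix_geteuid() );
$user = $info['name'];                              // e.g. "tools.multidesc"
$tool = preg_replace( '/^tools\./', '', $user );    // e.g. "multidesc"
$cnf  = parse_ini_file( $info['dir'] . '/replica.my.cnf' );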

I ran into the same problem with tools.orphantalk. https://tools.wmflabs.org/orphantalk/ was consistently failing for over a week (according to user reports) with the following error:

Connection to 's7.labsdb' failed: Failed to fetch credentials from replica.my.cnf

Today I looked into it and couldn't reproduce it from tools-login using php -a. The file is definitely there, readable, and working. I even tried webservice shell to try it from within the same Kubernetes container, and it worked there too, without issue. The live web service was still failing consistently. webservice restart fixed the problem.

Code in question:

	public function getLocalHome() {
		$info = posix_getpwuid(posix_geteuid());
		return $info['dir'];
	}
/* .. */
	protected function fetchDbCredentials() {
		$file = $this->getLocalHome() . '/replica.my.cnf';
		if ( !is_readable( $file ) || !is_file( $file ) ) {
			throw new Exception( 'Failed to fetch credentials from replica.my.cnf' );
		}
		$cnf = parse_ini_file( $file );
		if ( !$cnf || !$cnf['user'] || !$cnf['password'] ) {
			throw new Exception( 'Failed to fetch credentials from replica.my.cnf' );
		}

Looks like it was probably the same issue. Somehow the Kubernetes pod was no longer able to resolve the UID (or something along those lines). It didn't seem to be intermittent: it was working, and then at some point something caused it to break and stay broken.

Krinkle renamed this task from "multidesc: cannot connect to database due to reading wrong replica.my.cnf file" to "Homedir/UID info breaks after a while in Tools Kubernetes (can't read replica.my.cnf)". Jun 16 2017, 2:29 PM
Krinkle edited projects, added Tools-Kubernetes, Cloud-VPS; removed Tools.
Krinkle triaged this task as High priority. Jun 16 2017, 2:29 PM

Yes, I have been restarting at least a dozen tools because of this over the last week or two (?).

> Restarting the kubernetes webservice process seems to have fixed the app. It looks like the PHP [[https://secure.php.net/get_current_user|get_current_user()]] function started returning an empty string. get_current_user() checks the ownership of the script, so maybe this was caused by some sort of NFS communication error.

Possibly an LDAP issue as well?

I can consider migrating this towards using the HOME env variable instead. Is that something we may expect to be set correctly for all grid jobs, webservices and k8s pods in Toolforge?

> Possibly an LDAP issue as well?

This is actually much more likely than my earlier attempt to scapegoat NFS.

> I can consider migrating this towards using the HOME env variable instead. Is that something we may expect to be set correctly for all grid jobs, webservices and k8s pods in Toolforge?

I believe that is a safe assumption for the current generation of Toolforge. I'm not sure if it will always be the case in the future, but I think I can assure you that any change that removes a valid $HOME from the environment will be treated either as a planned feature (in which case we will announce it and have a migration period) or as a regression (in which case we would obviously try to fix it).
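Based on that, a possible sketch of an HOME-based getLocalHome() (names reused from the snippet quoted above purely for illustration; the actual change is the Gerrit patch mentioned below):

	public function getLocalHome() {
		// Trust the environment set up for the webservice instead of an NSS lookup.
		$home = getenv( 'HOME' );
		if ( $home === false || $home === '' ) {
			throw new Exception( 'Unable to find HOME directory' );
		}
		return $home;
	}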

Mentioned in SAL (#wikimedia-cloud) [2019-07-09T20:55:19Z] <Krinkle> Restarting webservice of GUC to hopefully fix fatal errors from reading replica.my.cnf, ref T166949

Change 521690 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[labs/tools/guc@master] Settings: Switch from posix_getpwuid() to getenv('HOME')

https://gerrit.wikimedia.org/r/521690

Change 521690 merged by jenkins-bot:
[labs/tools/guc@master] Settings: Switch from posix_getpwuid() to getenv('HOME')

https://gerrit.wikimedia.org/r/521690

Well, apparently the environment variable can go missing the same way. The tool now regularly returns:

Error: Unable to find HOME directory

Here's the full output of getenv() in PHP from a web request in the php7.2 k8s pod:

array (size=36)
  'PATH' => string '/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games' (length=56)
  'SHELL' => string '/bin/bash' (length=9)
  'USER' => string 'tools.guc' (length=9)
  'PHP_FCGI_CHILDREN' => string '2' (length=1)
  'PHP_FCGI_MAX_REQUESTS' => string '500' (length=3)
  'HTTP_COOKIE' => string ###
  'HTTP_ACCEPT_LANGUAGE' => string 'en-GB,en-US;q=0.9,en;q=0.8,nl;q=0.7' (length=35)
  'HTTP_ACCEPT_ENCODING' => string 'gzip, deflate, br' (length=17)
  'HTTP_ACCEPT' => string ##
  'HTTP_USER_AGENT' => string##
  'HTTP_UPGRADE_INSECURE_REQUESTS' => string '1' (length=1)
  'HTTP_X_ORIGINAL_URI' => string '/guc/'
  'HTTP_X_FORWARDED_PROTO' => string 'https' (length=5)
  'HTTP_HOST' => string 'tools.wmflabs.org' (length=17)
  'HTTP_CONNECTION' => string 'close' (length=5)
  'REMOTE_PORT' => string '51312' (length=5)
  'REMOTE_ADDR' => string '172.16.6.39' (length=11)
  'SERVER_NAME' => string 'tools.wmflabs.org' (length=17)
  'SERVER_ADDR' => string '172.16.1.68' (length=11)
  'SERVER_PORT' => string '49117' (length=5)
  'GATEWAY_INTERFACE' => string 'CGI/1.1' (length=7)
  'SERVER_SOFTWARE' => string 'lighttpd/1.4.45' (length=15)
  'SERVER_PROTOCOL' => string 'HTTP/1.1' (length=8)
  'REQUEST_METHOD' => string 'GET' (length=3)
  'DOCUMENT_ROOT' => string '/data/project/guc/public_html/' (length=30)
  'SCRIPT_FILENAME' => string '/data/project/guc/public_html//index.php' (length=40)
  'SCRIPT_NAME' => string '/guc/index.php' (length=14)
  'REDIRECT_STATUS' => string '200' (length=3)
  'REQUEST_URI' => string '/guc/?by=date&user=KrinkleBot&krinkle=1' (length=39)
  'QUERY_STRING' => string 'by=date&user=KrinkleBot&krinkle=1' (length=33)
  'CONTENT_LENGTH' => string '0' (length=1)
  'FCGI_ROLE' => string 'RESPONDER' (length=9)
  'PHP_SELF' => string '/guc/index.php' (length=14)
  'REQUEST_TIME_FLOAT' => float 1562711051.215
  'REQUEST_TIME' => int 1562711051

Change 521787 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[labs/tools/guc@master] Revert "Settings: Switch from posix_getpwuid() to getenv('HOME')"

https://gerrit.wikimedia.org/r/521787

Change 521787 merged by jenkins-bot:
[labs/tools/guc@master] Revert "Settings: Switch from posix_getpwuid() to getenv('HOME')"

https://gerrit.wikimedia.org/r/521787

While HOME exists via webservice shell for the php7.2 image, it seems consistently undefined via PHP CGI, even after numerous webservice restarts, stops and starts. So I guess that's not a race condition, but an issue elsewhere.

For now I'm reverting to posix_getpwuid(), which means GUC will continue to sometimes get stuck until I or a Cloud admin can restart the tools-guc webservice.
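One defensive pattern would be to try both sources and only fail when neither works; this is a hedged sketch only (resolveToolHome() is a hypothetical helper, not something implemented in this task):

	protected function resolveToolHome() {
		// Prefer the environment variable, which webservice/k8s is supposed to set...
		$home = getenv( 'HOME' );
		if ( is_string( $home ) && $home !== '' ) {
			return $home;
		}
		// ...but fall back to the passwd entry when HOME is missing, as in the
		// php7.2 CGI environment dumped earlier in this task.
		$info = posix_getpwuid( posix_geteuid() );
		if ( $info !== false && !empty( $info['dir'] ) ) {
			return $info['dir'];
		}
		throw new Exception( 'Unable to determine tool home directory' );
	}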

This is supposed to be set in the container spec: https://github.com/wikimedia/operations-software-tools-webservice/blob/master/toollabs/webservice/backends/kubernetesbackend.py#L438

But I can confirm that it is indeed not set in currently running pods.

A describe on a running pod would tell us whether that made it into the pod spec... Oh, but you CAN confirm it isn't. Got it.

The replag tool seems to be having this issue or some variation of it right now. Some investigation:

$ kubectl get po
NAME                      READY     STATUS    RESTARTS   AGE
replag-2249201953-tpzb9   1/1       Running   0          10d
$ kubectl describe pod replag-2249201953-tpzb9
Name:           replag-2249201953-tpzb9
Namespace:      replag
Node:           tools-worker-1012.tools.eqiad.wmflabs/172.16.3.78
Start Time:     Thu, 04 Jul 2019 23:53:48 +0000
Labels:         name=replag
                pod-template-hash=2249201953
                tools.wmflabs.org/webservice=true
                tools.wmflabs.org/webservice-version=1
Status:         Running
IP:             192.168.216.24
Controllers:    ReplicaSet/replag-2249201953
Containers:
  webservice:
    Container ID:       docker://efeb999fb8c6b25638d121be5b92c9022adc4726f03cd92bf17886cfa7cfa040
    Image:              docker-registry.tools.wmflabs.org/toollabs-php72-web:latest
    Image ID:           docker://sha256:e988f3376ad73854dbf5521e84ee9e890d9749d6280880cb0d52fd6faf111f20
    Port:               8000/TCP
    Command:
      /usr/bin/webservice-runner
      --type
      lighttpd
      --port
      8000
    Limits:
      cpu:      2
      memory:   2Gi
    Requests:
      cpu:              125m
      memory:           256Mi
    State:              Running
      Started:          Thu, 04 Jul 2019 23:54:59 +0000
    Ready:              True
    Restart Count:      0
    Volume Mounts:
      /data/project/ from home (rw)
      /data/scratch/ from scratch (rw)
      /etc/ldap.conf from etcldap-conf-s4l4s (rw)
      /etc/ldap.yaml from etcldap-yaml-c75xw (rw)
      /etc/novaobserver.yaml from etcnovaobserver-yaml-72svq (rw)
      /etc/wmcs-project from wmcs-project (rw)
      /mnt/nfs/ from nfs (rw)
      /public/dumps/ from dumps (rw)
      /var/run/nslcd/socket from varrunnslcdsocket-xuipu (rw)
    Environment Variables:
      HOME:     /data/project/replag/
Conditions:
  Type          Status
  Initialized   True
  Ready         True
  PodScheduled  True
Volumes:
  dumps:
    Type:       HostPath (bare host directory volume)
    Path:       /public/dumps/
  home:
    Type:       HostPath (bare host directory volume)
    Path:       /data/project/
  wmcs-project:
    Type:       HostPath (bare host directory volume)
    Path:       /etc/wmcs-project
  nfs:
    Type:       HostPath (bare host directory volume)
    Path:       /mnt/nfs/
  scratch:
    Type:       HostPath (bare host directory volume)
    Path:       /data/scratch/
  etcnovaobserver-yaml-72svq:
    Type:       HostPath (bare host directory volume)
    Path:       /etc/novaobserver.yaml
  varrunnslcdsocket-xuipu:
    Type:       HostPath (bare host directory volume)
    Path:       /var/run/nslcd/socket
  etcldap-conf-s4l4s:
    Type:       HostPath (bare host directory volume)
    Path:       /etc/ldap.conf
  etcldap-yaml-c75xw:
    Type:       HostPath (bare host directory volume)
    Path:       /etc/ldap.yaml
QoS Class:      Burstable
Tolerations:    <none>
No events.
$ kubectl exec -it replag-2249201953-tpzb9 -- /bin/bash
$ id
uid=52774 gid=0(root) groups=0(root),52774
$ whoami
whoami: cannot find name for user ID 52774: No such file or directory
$ ls -lh
total 8.9M
-rw-r--r-- 1 52774 52774 8.6M Jul 15 14:53 access.log
-rw-r--r-- 1 52774 52774 245K Jul 15 14:53 error.log
drwxrwxr-x 2 52774 52774 4.0K Dec  6  2016 logs
lrwxrwxrwx 1 52774 52774   18 Sep 13  2017 public_html -> tool-replag/public
-r-------- 1 52774 52774   52 Nov 27  2015 replica.my.cnf
-rw-r--r-- 1 52774 52774  22K May 25  2016 service.log
-rw-r--r-- 1 52774 52774  135 Mar 24 20:37 service.manifest
drwxr-sr-x 4 52774 52774 4.0K Sep 13  2017 tool-replag
$ echo $HOME
/data/project/replag/
$ ls -lh /var/run/nslcd/socket
srw-rw-rw- 0 root root 0 Apr  9 17:20 /var/run/nslcd/socket
$ ls -lh /etc/nslcd.conf
-rw-r----- 1 root nslcd 685 May 18 09:07 /etc/nslcd.conf
$ exit

We automount /var/run/nslcd/socket from the host that is running the pod (T134748). In this case, that host is tools-worker-1012. On tools-worker-1012:

$ ls -lh /var/run/nslcd/socket
srw-rw-rw- 1 root root 0 Jul  9 17:25 /var/run/nslcd/socket=

This timestamp is after the pod was launched. My hunch is that when nslcd gets restarted, the inode of /var/run/nslcd/socket changes on the host, but this is not reflected inside already-running pods. It would be interesting to revisit T134748: Setup NSS inside containers used in Tool Labs to see if sssd could be used instead of nslcd and whether that would change anything here.

Mentioned in SAL (#wikimedia-cloud) [2019-07-15T15:46:17Z] <bd808> Restarted webservice to clear problem with nss lookup of user and homedir (T166949)

> Oh, but you CAN confirm it isn't. Got it.

Sorry, I guess I was ambiguous. I meant that the environment variable was not set; I did not do a describe.

The 2020 Kubernetes cluster is using sssd for NSS lookups rather than nslcd. We believe this class of problems will go away when the legacy k8s cluster is shut down.

Bstorm claimed this task.

I believe this is now fixed. Please re-open if I'm wrong, @bd808.