
`webservice restart` regression with backend=kubernetes in webservice 0.51
Closed, Resolved (Public)

Description

Using webservice 0.47 from inside a pod works as expected:

$ webservice restart
******************************************************************************
Note that access.log is no longer enabled by default (see https://w.wiki/9go)
******************************************************************************
Restarting webservice...
$ 

Using webservice 0.51 from tools-sgebastion-08:

$ webservice restart
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 230, in <module>
    start(job, 'Your job is not running, starting')
  File "/usr/local/bin/webservice", line 95, in start
    job.request_start()
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 642, in request_start
    pykube.Deployment(self.api, self._get_deployment()).create()
  File "/usr/lib/python2.7/dist-packages/pykube/objects.py", line 76, in create
    self.api.raise_for_status(r)
  File "/usr/lib/python2.7/dist-packages/pykube/http.py", line 104, in raise_for_status
    raise HTTPError(payload["message"])
pykube.exceptions.HTTPError: deployments.extensions "fourohfour" already exists

The pod actually seems to be restarted as hoped. However, it looks like the delete of the Deployment that is done in KubernetesBackend.request_stop() is failing. The guard in KubernetesBackend.request_start() that checks for an existing Deployment also seems to be failing, so maybe the root problem is that something changed such that KubernetesBackend._find_obj(pykube.Deployment, self.webservice_label_selector) always fails?
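A minimal sketch of the suspected failure mode, using an in-memory stand-in rather than pykube or the real KubernetesBackend. FakeCluster, AlreadyExists, and this request_start are hypothetical names for illustration only, and the label keys are made up. The point is just that if the lookup guard never matches, the create call runs against an object that already exists and the API rejects it.

```python
class AlreadyExists(Exception):
    """Stands in for pykube.exceptions.HTTPError in this sketch."""


class FakeCluster:
    """Hypothetical in-memory stand-in for the Kubernetes API."""

    def __init__(self):
        self.objects = {}  # (kind, name) -> labels dict

    def create(self, kind, name, labels):
        # Object names are unique per kind, whatever the labels are.
        if (kind, name) in self.objects:
            raise AlreadyExists('%s "%s" already exists' % (kind, name))
        self.objects[(kind, name)] = dict(labels)

    def find_obj(self, kind, selector):
        """Return the name of the first object of `kind` whose labels
        contain every key/value pair in `selector`, or None."""
        for (obj_kind, name), labels in self.objects.items():
            if obj_kind != kind:
                continue
            if all(labels.get(k) == v for k, v in selector.items()):
                return name
        return None


def request_start(cluster, name, labels, selector):
    """Create the deployment only when the guard finds nothing."""
    if cluster.find_obj("deployments", selector) is None:
        cluster.create("deployments", name, labels)
```

With a selector that the existing object does not satisfy, the guard returns None and create() raises, which matches the shape of the traceback above even though the pod itself is fine.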

Related, but separate: the Docker images do not seem to have been updated to use the latest webservice package. That is actually kind of nice in this instance, as it let me make this comparison and see that this is a regression in webservice and not some other problem with the legacy k8s cluster.

Event Timeline

This seems to affect webservice --backend=kubernetes ... start as well:

# verify that nothing is currently running
$ kubectl get deployments
$ kubectl get replicasets
$ kubectl get pods

# start up a python3.5 webservice
$ webservice --backend=kubernetes python3.5 start
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 218, in <module>
    start(job, 'Starting webservice')
  File "/usr/local/bin/webservice", line 95, in start
    job.request_start()
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 647, in request_start
    pykube.Service(self.api, self._get_svc()).create()
  File "/usr/lib/python2.7/dist-packages/pykube/objects.py", line 76, in create
    self.api.raise_for_status(r)
  File "/usr/lib/python2.7/dist-packages/pykube/http.py", line 104, in raise_for_status
    raise HTTPError(payload["message"])
pykube.exceptions.HTTPError: services "fourohfour" already exists

# Check on the state of k8s
$ kubectl get deployments
NAME         DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
fourohfour   1         1         1            1           9s
$ kubectl get replicasets
NAME                    DESIRED   CURRENT   READY     AGE
fourohfour-2046109565   1         1         1         15s
$ kubectl get pods
NAME                          READY     STATUS    RESTARTS   AGE
fourohfour-2046109565-nom8v   1/1       Running   0          21s

So things were created as expected, but webservice blew up in the process. And this crash has another very unfortunate side effect of preventing $HOME/service.manifest from being written/updated. This means that subsequent webservice reload calls will also blow up, even if done from inside a pod where there is a working webservice command. It also means that webservice status ends up really confused.

Having run this every which way repeatedly in testing, I now think I *did* see this, but I thought it was an odd one-off because it didn't seem consistent.
In the second case the problem isn't the deployment; it's the service object, and only in the old cluster. Running kubectl delete service --all clears it up. That's why you missed it in the start phase: you checked for deployments and their descendants, but not for services. It also shows up in the start traceback above.

It seems that somehow the service find_obj doesn't correctly find the service or doesn't delete it. I honestly cannot quite tell why because it is using the same code in pykube for every object...and it works on the newer cluster. The only difference is an additional label. Now that you've pointed this out it is even more baffling to me because it also missed a deployment in the restart.

We should probably roll back the deployment in tools (there must be some way to do that in aptly), and figure out what the issue is with service objects. The labels are set object-wide, so it shouldn't be possible for them to be different unless there is a regression in pykube with this number of labels--or maybe in the way we concatenate them into a label matcher?
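For reference, a sketch of what "concatenating labels into a label matcher" can look like. build_label_selector is a hypothetical helper, not the function webservice actually uses, and the label keys in the usage note are assumptions. It relies only on the standard Kubernetes equality-based selector syntax, where comma-separated key=value requirements are ANDed together.

```python
def build_label_selector(labels):
    """Join a labels dict into an equality-based selector string.

    Every entry becomes one `key=value` requirement; the commas act
    as a logical AND, so each extra label makes the match stricter.
    """
    return ",".join(
        "%s=%s" % (key, value) for key, value in sorted(labels.items())
    )
```

For example, build_label_selector({"name": "fourohfour", "webservice": "true"}) gives "name=fourohfour,webservice=true". Adding a third label to the dict adds a third requirement that every matching object must also carry.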

I can try to figure out how to rollback now.

Wait...maybe that's it, you restarted, but the labels it looks for are different because it doesn't use the name of the object, it uses labels. There is absolutely no reason not to use the name of the object unless pykube is incapable of it.

Yes, that makes sense: by adding a new label, I broke deletion of old objects that don't have the new label, because the lookup matches on the ENTIRE list of labels. Maybe the easiest fix is a quick patch and deploy.

Also, by running again with the old version, you changed the labels a second time. Yes, it all makes sense. Fix coming.
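The diagnosis above can be sketched in a few lines. This is an illustration of equality-based label matching in general, not the webservice code; the matches helper and all label keys and values are hypothetical. A selector built from the full new label set misses objects created before the new label existed, while a selector built from a stable subset finds both generations.

```python
def matches(selector, labels):
    """True when `labels` contains every key/value pair in `selector`."""
    return all(labels.get(key) == value for key, value in selector.items())


# Labels on an object created by the older release (hypothetical keys).
OLD_OBJECT_LABELS = {"name": "fourohfour", "webservice": "true"}

# Labels on an object created by the newer release, with one extra label.
NEW_OBJECT_LABELS = {
    "name": "fourohfour",
    "webservice": "true",
    "extra": "added-in-new-release",
}

# Selecting on the ENTIRE new label list: old objects are invisible.
full_selector = dict(NEW_OBJECT_LABELS)

# Selecting on a subset that both generations share: both are found.
subset_selector = {"name": "fourohfour", "webservice": "true"}
```

Here matches(full_selector, OLD_OBJECT_LABELS) is False, so a stop or restart never deletes the old object, while matches(subset_selector, ...) is True for both label sets.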

Workaround for users until the fix is deployed:
Delete ALL existing objects belonging to the webservice:

  • kubectl delete service <toolname>
  • kubectl delete deployment <toolname>
  • kubectl delete rs --all -- only if the webservice's replicasets are the only replicasets the tool has
  • kubectl delete pod --all -- only if the webservice's pods are the only pods the tool has

Then run webservice as normal. Everything will then have the same labels--unless webservice in the pod changes them for some reason.

Change 549990 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/software/tools-webservice@master] breakfix: set the label selector to a subset of actual labels

https://gerrit.wikimedia.org/r/549990

Change 549990 merged by Bstorm:
[operations/software/tools-webservice@master] breakfix: set the label selector to a subset of actual labels

https://gerrit.wikimedia.org/r/549990

Mentioned in SAL (#wikimedia-cloud) [2019-11-10T01:45:13Z] <bstorm_> deploying bugfix for webservice in tools and toolsbeta T237836

In tests the new version is able to correctly find all webservice pods and delete them in a sensible fashion (old and new cluster).

Mentioned in SAL (#wikimedia-cloud) [2019-11-10T02:10:20Z] <bd808> Building new Docker images for T237836

Mentioned in SAL (#wikimedia-cloud) [2019-11-10T02:17:14Z] <bd808> Building new Docker images for T237836 (retrying after cleaning out old images on tools-docker-builder-06)

bd808 assigned this task to Bstorm.

Docker containers are updated:

$ kubectl exec -it fourohfour-2046109565-7x8v1 -- /bin/bash
$ dpkg -l|grep webservice
ii  toollabs-webservice               0.52                           all          Infrastructure for running webservices on tools.wmflabs.org

Restarts are obviously needed for running containers to pick up the new images, but I think we can let that happen organically.