
Improve handling Dataproc cluster creation with ERROR state #9593

Merged

Conversation

turbaszek
Member

@turbaszek turbaszek commented Jun 30, 2020

This PR brings back some features added in #4064 that were lost during a refactor.
Now, if the cluster is created but ends up in the ERROR state, we run a diagnosis on it and then delete it.


Make sure to mark the boxes below before creating PR:

  • Description above provides context of the change
  • Unit tests coverage for changes (not needed for documentation changes)
  • Target Github ISSUE in description if exists
  • Commits follow "How to write a good git commit message"
  • Relevant documentation is updated including usage instructions.
  • I will engage committers as explained in Contribution Workflow Example.

In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.
Read the Pull Request Guidelines for more information.

@boring-cyborg boring-cyborg bot added the provider:google Google (including GCP) related issues label Jun 30, 2020
@turbaszek
Member Author

@dossett happy to hear your opinion!

@dossett
Contributor

dossett commented Jul 1, 2020

@turbaszek Thanks! One other small behavior that I added in AIRFLOW-3149 was that if the cluster already existed in the DELETING state, the operator would wait for the DELETE to finish and then create a new cluster. (A rare condition, but one we hit in production more than once.)

Thank you very much!

@turbaszek
Member Author

@dossett I've added handling for this case.

@turbaszek turbaszek marked this pull request as ready for review July 2, 2020 12:34
@turbaszek turbaszek force-pushed the improve-dataproc-create-cluster-op branch from 9239ac8 to 90b85e8 Compare July 2, 2020 12:36
@turbaszek turbaszek requested review from mik-laj and potiuk July 2, 2020 12:38
@turbaszek
Member Author

@olchas would you mind taking a look?

@turbaszek turbaszek force-pushed the improve-dataproc-create-cluster-op branch from 90b85e8 to 727f4fa Compare July 2, 2020 13:56
self._handle_error_state(hook)
elif cluster.status.state == cluster.status.DELETING:
# Wait for cluster to delete
for time_to_sleep in exponential_sleep_generator(initial=10, maximum=120):
Contributor

I know that we are waiting here for an operation that is supposed to finish within a finite amount of time, but what do you think about adding a maximum amount of total sleep before raising an exception?

Member Author

I've added 5m timeout, @dossett do you think it should be ok in most cases?

Contributor

Honestly, as an Airflow user and Airflow admin I would rather there not be a timeout (or just use whatever global timeouts might exist). I've seen DELETE ops take 45-60 minutes before and if they are taking that long, there's nothing to do except wait for it to complete.

Member Author

Hm, not sure if this is the right way for "create operator". We even started a discussion about this on devlist:
https://lists.apache.org/thread.html/r9a6833ebafa3f00f79f86d9688f77a958a73ab7b6d9eccd1f0998fe2%40%3Cdev.airflow.apache.org%3E
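
For context, the wait under discussion can be sketched as a bounded poll loop. This is a rough, self-contained sketch only: the real operator uses `exponential_sleep_generator` from `google.api_core`, and `wait_for_cluster_deletion` / `get_cluster_state` are hypothetical names, not the PR's API.

```python
import time


def exponential_sleep_generator(initial, maximum, multiplier=2):
    """Yield sleep intervals growing exponentially up to a cap
    (mirrors the google.api_core helper referenced in the diff)."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * multiplier, maximum)


def wait_for_cluster_deletion(get_cluster_state, timeout=300):
    """Poll until the cluster is gone, giving up after `timeout` seconds.

    `get_cluster_state` is a hypothetical callable returning the current
    state string, or None once the cluster no longer exists.
    """
    deadline = time.monotonic() + timeout
    for interval in exponential_sleep_generator(initial=10, maximum=120):
        if get_cluster_state() is None:
            return  # deletion finished
        if time.monotonic() + interval > deadline:
            raise TimeoutError(f"Cluster was not deleted within {timeout}s")
        time.sleep(interval)
```

With `timeout=300` this matches the 5-minute default discussed above; the thread's conclusion was to make the bound configurable rather than hard-coded.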

@turbaszek turbaszek force-pushed the improve-dataproc-create-cluster-op branch from 80aae20 to a155601 Compare July 4, 2020 09:29
@turbaszek
Member Author

@dossett @olchas I've added your suggestions

airflow/providers/google/cloud/operators/dataproc.py — outdated review thread (resolved)
self.log.info("Cluster already exists.")
cluster = self._get_cluster(hook)
Contributor

If the cluster is in CREATING state, should you block until it reaches RUNNING?

Contributor

If the cluster already exists (I assume this is checked by cluster id / name), then you should assert that it matches any configuration explicitly specified in this operator by the user (e.g. a cluster could exist with this ID but have a different Dataproc version / missing init actions / etc.). You would not want to consider this a successful run of this operator, as it did not meet its contract of creating a cluster with the explicit config provided by the user.

IMO this should result in a task failure, as it is not clear what the operator should do in this scenario: delete the existing cluster? Add a uuid suffix to avoid the name clash and create a new cluster?

Member Author

@jaketf is there any bulletproof, simple way to compare cluster configurations? Comparing dicts doesn't sound like the way to go, because the created cluster includes more information than the user-provided config.

Contributor

I think you'd just compare the keys of the dict that were specified by the user.
This might be easier said than done.
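
As a sketch of the "compare only user-specified keys" idea, a recursive subset check over nested dicts might look like this. This is illustrative only; the PR deliberately does not implement config comparison, and the example cluster payloads below are made up:

```python
def is_subset(expected, actual):
    """Return True if every key/value in `expected` also appears in
    `actual`, descending recursively into nested dicts. Keys present
    only in `actual` (server-populated defaults) are ignored."""
    if isinstance(expected, dict) and isinstance(actual, dict):
        return all(
            key in actual and is_subset(value, actual[key])
            for key, value in expected.items()
        )
    return expected == actual


# Hypothetical user-provided config vs. the richer live cluster resource.
user_config = {"config": {"master_config": {"num_instances": 1}}}
live_cluster = {
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
    },
    "status": {"state": "RUNNING"},
}

assert is_subset(user_config, live_cluster)
assert not is_subset({"config": {"master_config": {"num_instances": 2}}}, live_cluster)
```

Even this simple version hints at why it is "easier said than done": lists (e.g. init actions), server-normalized values, and semantically equivalent defaults would all need special handling.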

Member

Yeah. See my "terraform" comment above. I think we are pretty good with manual deletion of the cluster in case we want to change configuration, I don't think we should handle all potential complexity of computing difference between expected/actual cluster configuration.

@@ -437,6 +437,9 @@ class DataprocCreateClusterOperator(BaseOperator):
:type project_id: str
:param region: leave as 'global', might become relevant in the future. (templated)
:type region: str
:param delete_on_error: If true, the cluster will be deleted if created with ERROR state. Default
value is true.
:type delete_on_error: bool
Contributor

should there be a similar configurable parameter like "delete_on_configuration_mismatch" that defaults to False?

Member

I think this case will be difficult to handle correctly in all cases. You will not always need to create a new cluster when the configuration is different, e.g. additional installed components do not affect the usability of the cluster.

Member

I think this is not really something that should be handled by the operator itself. I'd argue that if you really want to change configuration of cluster you can simply delete it manually and let it be re-created. I think working in a "terraformy" or "kubectly" "apply" fashion in this case should be left to terraform. I.e. if you really want to use this kind of approach, why not write a terraform script and run terraform.

BTW. Offtop - but should not we think about adding a Terraform/Terragrunt operator to Airflow ? I'd say it might be a good idea to have such an operator with some pre-defined ways on how to get terraform/terragrunt scripts in and how to integrate with airflow's JINJA templating.

@turbaszek
Member Author

Related #10014

@turbaszek
Member Author

@jaketf @dossett I made some changes. The timeout is 1h and configurable, and the logic handles:

  • create a completely new cluster
  • use an existing cluster
  • reattach to a cluster in CREATING state
  • delete the cluster if it is in ERROR state
  • if the cluster is in DELETING state, wait and create a new one
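
A minimal sketch of how such a dispatch over cluster states might look. The state strings follow Dataproc's naming, but the function and action names are illustrative, not the operator's actual internals:

```python
def handle_existing_cluster(state, delete_on_error=True):
    """Map a pre-existing cluster's state to an action, mirroring the
    scenarios listed above (action names are illustrative)."""
    if state == "ERROR":
        if not delete_on_error:
            raise RuntimeError("Cluster is in ERROR state")
        return "diagnose_delete_and_fail"   # gather diagnostics, delete, raise
    if state == "DELETING":
        return "wait_then_create"           # wait for deletion, then recreate
    if state == "CREATING":
        return "wait_until_running"         # reattach to the ongoing creation
    return "use_existing"                   # e.g. RUNNING: reuse the cluster
```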

@dossett
Contributor

dossett commented Jul 28, 2020

LGTM, thank you!

Contributor

@jaketf jaketf left a comment

LGTM.
I think trying to go down the "does this existing cluster at least match the config specified by the user" may be a rabbit hole and certainly shouldn't hold up the improvements in this PR.

@olchas
Contributor

olchas commented Jul 30, 2020

I have one more suggestion, but maybe it is material for another PR: maybe we could let the user decide if they want to reuse an existing cluster via another bool argument (let's call it reuse_if_exists for now)? I think the logic in case the cluster exists could be as follows:

  • if it is in ERROR state, it is ok to delete it regardless of the reuse_if_exists flag, as it is unusable anyway. Then, on retry we get the opportunity to recreate it.
  • if it is in DELETING state, it is ok to wait for its deletion regardless of the reuse_if_exists flag, as we will be creating a new cluster anyway.
  • in all other scenarios we should proceed only if reuse_if_exists was set to True, and raise an exception otherwise.

WDYT?
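
A minimal sketch of the proposed decision logic, assuming the hypothetical reuse_if_exists argument (it was only suggested here, not implemented in this PR; action names are illustrative):

```python
def decide(state, reuse_if_exists):
    """Sketch of the proposal: ERROR and DELETING are handled
    unconditionally; any other state requires reuse_if_exists=True."""
    if state == "ERROR":
        return "delete"             # unusable either way; retry can recreate it
    if state == "DELETING":
        return "wait_for_deletion"  # a new cluster is created anyway
    if not reuse_if_exists:
        raise ValueError("Cluster already exists and reuse_if_exists=False")
    return "reuse"
```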

@olchas
Contributor

olchas commented Jul 30, 2020

Also, do you think we should add a note about the new exceptions being raised to UPDATING.md? I figure that users could be handling ERROR state clusters on their own in a later task, and now they would receive an exception in DataprocCreateClusterOperator.


@turbaszek
Member Author

BTW. Offtop - but should not we think about adding a Terraform/Terragrunt operator to Airflow ? I'd say it might be a good idea to have such an operator with some pre-defined ways on how to get terraform/terragrunt scripts in and how to integrate with airflow's JINJA templating.

I think that's a good idea. Terraform on its own will be able to handle updates and things like that. However, I'm not sure if terraform allows users to use all config options.

@potiuk
Member

potiuk commented Aug 3, 2020

BTW. Offtop - but should not we think about adding a Terraform/Terragrunt operator to Airflow ? I'd say it might be a good idea to have such an operator with some pre-defined ways on how to get terraform/terragrunt scripts in and how to integrate with airflow's JINJA templating.

I think that's a good idea. Terraform on its own will be able to handle updates and things like that. However, I'm not sure if terraform allows users to use all config options.

Agree. We should have both. It's great to have dedicated operators where you have explicit configuration options etc. But having a generic Terraform operator, if you are familiar with Terraform and use it elsewhere, would be a great addition.

@jaketf
Contributor

jaketf commented Aug 4, 2020

BTW. Offtop - but should not we think about adding a Terraform/Terragrunt operator to Airflow ? I'd say it might be a good idea to have such an operator with some pre-defined ways on how to get terraform/terragrunt scripts in and how to integrate with airflow's JINJA templating.

Having two levels of rendering (JINJA / HCL string interpolation) sounds like a great way to end up with hard-to-debug situations of "which level of rendering is going wrong?". And what really would be dynamic between task runs from an infra perspective?

I think a terraform hook might be a nice feature, but would take some careful design. I vaguely remember this being brought up on slack or the dev list, but can't seem to find it. I found myself wanting it to bring up / tear down a CI Composer environment during sleeping hours as a cost-cutting measure.

For OSS terraform / terragrunt this could be really tricky:

  1. It means the execution environment must have terraform / terragrunt binaries (plus downloaded remote modules referenced, or source for local modules). Would we provide an airflow extra for this? Would this be up to the user? The latter seems sad and is not easy to do with Composer specifically (it could be easier if you build your own airflow images to toss terraform in and run on kubernetes).
  2. You need an easy way to sync terraform source from a repo (unless you imagine single-resource-type things that might be embedded in DAG code).

I think it would mostly end up as a Kubernetes Pod Operator that happens to run terraform.

However, for terraform enterprise a hook might be much simpler (as the execution environment, source syncing, and state management become not airflow's problem).

@turbaszek turbaszek closed this Aug 5, 2020
@turbaszek turbaszek reopened this Aug 5, 2020
@potiuk
Member

potiuk commented Aug 5, 2020

@jaketf - just the fact that Terraform operator is tricky, makes me even more convinced that we should add it :). All the words about "careful design" and how to bring the scripts etc. show that it's very far from trivial to use Terraform as a step in Airflow - but IMHO conceptually it makes perfect sense :). I can immediately start thinking about - for example - bringing the terraform scripts from Git repo by the operator.. But you are completely right - it is a big, separate discussion that seems like devlist might be the right place for :). I do not want to open another stream of discussion now, but once we get Airflow 2.0 planning under full control and scheduled, I will for sure open one :).

@turbaszek
Member Author

@potiuk @jaketf are we good with this PR? Terraform sounds like the way to go in my opinion (and is/was already used for system tests?) as it would provide a unified way of managing infrastructure from Airflow level.

Handle cluster in DELETING state

Extend tests

fixup! Extend tests

fixup! fixup! Extend tests

fixup! fixup! fixup! Extend tests
@turbaszek turbaszek force-pushed the improve-dataproc-create-cluster-op branch from 850153f to 09d1ff8 Compare August 5, 2020 09:26
@potiuk
Member

potiuk commented Aug 5, 2020

Sure - we are good - it was offtop :)

@jaketf
Contributor

jaketf commented Aug 5, 2020

@turbaszek yes this PR is good to go!
@potiuk indeed it would be good to make all the right decisions once :) (never seen so many smileys in a GH comment)
cc: @brandonjbjelland who's "seen it all" from terraform perspective (might help us in designing the right integration)

@potiuk
Member

potiuk commented Aug 5, 2020

@turbaszek yes this PR is good to go!
@potiuk indeed it would be good to make all the right decisions once :) (never seen so many smileys in a GH comment)

I was smiling all the time when I was writing it.

@turbaszek turbaszek merged commit 0103226 into apache:master Aug 6, 2020
@turbaszek turbaszek deleted the improve-dataproc-create-cluster-op branch August 6, 2020 08:31