
Improve handling Dataproc cluster creation with ERROR state #9593

Merged

Conversation

turbaszek
Member

@turbaszek turbaszek commented Jun 30, 2020

This PR brings back some features added in #4064 that were lost during a refactor.
Now, if the cluster is created but ends up in the ERROR state, we run a diagnosis on it and then delete it.


Make sure to mark the boxes below before creating PR:

  • Description above provides context of the change
  • Unit tests coverage for changes (not needed for documentation changes)
  • Target Github ISSUE in description if exists
  • Commits follow "How to write a good git commit message"
  • Relevant documentation is updated including usage instructions.
  • I will engage committers as explained in Contribution Workflow Example.

In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.
Read the Pull Request Guidelines for more information.

@boring-cyborg boring-cyborg bot added the provider:google Google (including GCP) related issues label Jun 30, 2020
@turbaszek
Member Author

@dossett happy to hear your opinion!

@dossett
Contributor

dossett commented Jul 1, 2020

@turbaszek Thanks! One other small behavior that I added in AIRFLOW-3149 was that if the cluster already existed in the DELETING state, the operator would wait for the DELETE to finish and then create a new cluster. (A rare condition, but one we hit in production more than once.)

Thank you very much!

@turbaszek
Member Author

@dossett I've added handling for this case.

@turbaszek turbaszek marked this pull request as ready for review July 2, 2020 12:34
@turbaszek turbaszek force-pushed the improve-dataproc-create-cluster-op branch from 9239ac8 to 90b85e8 Compare July 2, 2020 12:36
@turbaszek turbaszek requested review from mik-laj and potiuk July 2, 2020 12:38
@turbaszek
Member Author

@olchas would you mind taking a look?

@turbaszek turbaszek force-pushed the improve-dataproc-create-cluster-op branch from 90b85e8 to 727f4fa Compare July 2, 2020 13:56
self._handle_error_state(hook)
elif cluster.status.state == cluster.status.DELETING:
# Wait for cluster to delete
for time_to_sleep in exponential_sleep_generator(initial=10, maximum=120):
Contributor

I know that we are waiting here for an operation that is supposed to finish within a finite amount of time, but what do you think about adding a maximum amount of total sleep before raising an exception?

Member Author

I've added 5m timeout, @dossett do you think it should be ok in most cases?

Contributor

Honestly, as an Airflow user and Airflow admin I would rather there not be a timeout (or just use whatever global timeouts might exist). I've seen DELETE ops take 45-60 minutes before and if they are taking that long, there's nothing to do except wait for it to complete.

Member Author

Hm, not sure if this is the right way for "create operator". We even started a discussion about this on devlist:
https://lists.apache.org/thread.html/r9a6833ebafa3f00f79f86d9688f77a958a73ab7b6d9eccd1f0998fe2%40%3Cdev.airflow.apache.org%3E
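
For context, the wait under discussion can be sketched as a bounded poll loop. This is a rough, self-contained sketch only: the real operator uses `exponential_sleep_generator` from `google.api_core`, and `wait_for_cluster_deletion` / `get_cluster_state` are hypothetical names, not the PR's API.

```python
import time


def exponential_sleep_generator(initial, maximum, multiplier=2):
    """Yield sleep intervals growing exponentially up to a cap
    (mirrors the google.api_core helper referenced in the diff)."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * multiplier, maximum)


def wait_for_cluster_deletion(get_cluster_state, timeout=300):
    """Poll until the cluster is gone, giving up after `timeout` seconds.

    `get_cluster_state` is a hypothetical callable returning the current
    state string, or None once the cluster no longer exists.
    """
    deadline = time.monotonic() + timeout
    for interval in exponential_sleep_generator(initial=10, maximum=120):
        if get_cluster_state() is None:
            return  # deletion finished
        if time.monotonic() + interval > deadline:
            raise TimeoutError(f"Cluster was not deleted within {timeout}s")
        time.sleep(interval)
```

With `timeout=300` this matches the 5-minute default discussed above; the thread's conclusion was to make the bound configurable rather than hard-coded.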

@turbaszek turbaszek force-pushed the improve-dataproc-create-cluster-op branch from 80aae20 to a155601 Compare July 4, 2020 09:29
@turbaszek
Member Author

@dossett @olchas I've added your suggestions

airflow/providers/google/cloud/operators/dataproc.py — outdated review thread (resolved)
self.log.info("Cluster already exists.")
cluster = self._get_cluster(hook)
Contributor

If the cluster is in CREATING state, should you block until it reaches RUNNING?

Contributor

If the cluster already exists (I assume this is checked by cluster id / name), then you should assert that it matches any configuration explicitly specified in this operator by the user (e.g. a cluster could exist with this ID but have a different Dataproc version / missing init actions / etc.). You would not want to consider this a successful run of this operator, as it did not meet its contract of creating a cluster with the explicit config provided by the user.

IMO this should result in a task failure, as it is not clear what the operator should do in this scenario: delete the existing cluster? Add a uuid suffix to avoid the name clash and create a new cluster?

Member Author

@jaketf is there any bulletproof, simple way to compare cluster configurations? Comparing dicts doesn't sound like the way to go, because the created cluster includes more information than the user-provided config.

Contributor

I think you'd just compare the keys of the dict that were specified by the user.
This might be easier said than done.
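
As a sketch of the "compare only user-specified keys" idea, a recursive subset check over nested dicts might look like this. This is illustrative only; the PR deliberately does not implement config comparison, and the example cluster payloads below are made up:

```python
def is_subset(expected, actual):
    """Return True if every key/value in `expected` also appears in
    `actual`, descending recursively into nested dicts. Keys present
    only in `actual` (server-populated defaults) are ignored."""
    if isinstance(expected, dict) and isinstance(actual, dict):
        return all(
            key in actual and is_subset(value, actual[key])
            for key, value in expected.items()
        )
    return expected == actual


# Hypothetical user-provided config vs. the richer live cluster resource.
user_config = {"config": {"master_config": {"num_instances": 1}}}
live_cluster = {
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
    },
    "status": {"state": "RUNNING"},
}

assert is_subset(user_config, live_cluster)
assert not is_subset({"config": {"master_config": {"num_instances": 2}}}, live_cluster)
```

Even this simple version hints at why it is "easier said than done": lists (e.g. init actions), server-normalized values, and semantically equivalent defaults would all need special handling.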

Member

Yeah. See my "terraform" comment above. I think we are pretty good with manual deletion of the cluster in case we want to change configuration, I don't think we should handle all potential complexity of computing difference between expected/actual cluster configuration.

@@ -437,6 +437,9 @@ class DataprocCreateClusterOperator(BaseOperator):
:type project_id: str
:param region: leave as 'global', might become relevant in the future. (templated)
:type region: str
:param delete_on_error: If true, the cluster will be deleted if created with ERROR state. Default
value is true.
:type delete_on_error: bool
Contributor

should there be a similar configurable parameter like "delete_on_configuration_mismatch" that defaults to False?

Member

I think this case will be difficult to handle correctly in all cases. You will not always need to create a new cluster when the configuration is different, e.g. additional installed components do not affect the usability of the cluster.

Member

I think this is not really something that should be handled by the operator itself. I'd argue that if you really want to change configuration of cluster you can simply delete it manually and let it be re-created. I think working in a "terraformy" or "kubectly" "apply" fashion in this case should be left to terraform. I.e. if you really want to use this kind of approach, why not write a terraform script and run terraform.

BTW. Offtop - but should not we think about adding a Terraform/Terragrunt operator to Airflow ? I'd say it might be a good idea to have such an operator with some pre-defined ways on how to get terraform/terragrunt scripts in and how to integrate with airflow's JINJA templating.

@turbaszek
Member Author

Related #10014

@turbaszek
Member Author

@jaketf @dossett I made some changes. The timeout is 1h and configurable, and the logic handles:

  • create a completely new cluster
  • use an existing cluster
  • reattach to a cluster in CREATING state
  • delete the cluster if it is in ERROR state
  • if the cluster is in DELETING state, wait and create a new one
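
A minimal sketch of how such a dispatch over cluster states might look. The state strings follow Dataproc's naming, but the function and action names are illustrative, not the operator's actual internals:

```python
def handle_existing_cluster(state, delete_on_error=True):
    """Map a pre-existing cluster's state to an action, mirroring the
    scenarios listed above (action names are illustrative)."""
    if state == "ERROR":
        if not delete_on_error:
            raise RuntimeError("Cluster is in ERROR state")
        return "diagnose_delete_and_fail"   # gather diagnostics, delete, raise
    if state == "DELETING":
        return "wait_then_create"           # wait for deletion, then recreate
    if state == "CREATING":
        return "wait_until_running"         # reattach to the ongoing creation
    return "use_existing"                   # e.g. RUNNING: reuse the cluster
```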

@dossett
Contributor

dossett commented Jul 28, 2020

LGTM, thank you!

Contributor

@jaketf jaketf left a comment

LGTM.
I think trying to go down the "does this existing cluster at least match the config specified by the user" may be a rabbit hole and certainly shouldn't hold up the improvements in this PR.

@olchas
Contributor

olchas commented Jul 30, 2020

I have one more suggestion, but maybe it is material for another PR: maybe we could let the user decide if they want to reuse an existing cluster via another bool argument (let's call it reuse_if_exists for now)? I think the logic in case the cluster exists could be as follows:

  • if it is in ERROR state, it is ok to delete it regardless of the reuse_if_exists flag, as it is unusable anyway. Then, on retry we get the opportunity to recreate it.
  • if it is in DELETING state, it is ok to wait for its deletion regardless of the reuse_if_exists flag, as we will be creating a new cluster anyway.
  • in all other scenarios we should proceed only if reuse_if_exists was set to True, and raise an exception otherwise.

WDYT?
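
A minimal sketch of the proposed decision logic, assuming the hypothetical reuse_if_exists argument (it was only suggested here, not implemented in this PR; action names are illustrative):

```python
def decide(state, reuse_if_exists):
    """Sketch of the proposal: ERROR and DELETING are handled
    unconditionally; any other state requires reuse_if_exists=True."""
    if state == "ERROR":
        return "delete"             # unusable either way; retry can recreate it
    if state == "DELETING":
        return "wait_for_deletion"  # a new cluster is created anyway
    if not reuse_if_exists:
        raise ValueError("Cluster already exists and reuse_if_exists=False")
    return "reuse"
```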

@olchas
Contributor

olchas commented Jul 30, 2020

Also, do you think we should add a note about the new exceptions being raised to UPDATING.md? I figure that users could be handling ERROR state clusters on their own in a later task, and now they would receive an exception in DataprocCreateClusterOperator.


@turbaszek
Member Author

BTW. Offtop - but should not we think about adding a Terraform/Terragrunt operator to Airflow ? I'd say it might be a good idea to have such an operator with some pre-defined ways on how to get terraform/terragrunt scripts in and how to integrate with airflow's JINJA templating.

I think that's a good idea. Terraform on its own will be able to handle updates and things like that. However, I'm not sure if terraform allows users to use all config options.

@potiuk
Member

potiuk commented Aug 3, 2020

BTW. Offtop - but should not we think about adding a Terraform/Terragrunt operator to Airflow ? I'd say it might be a good idea to have such an operator with some pre-defined ways on how to get terraform/terragrunt scripts in and how to integrate with airflow's JINJA templating.

I think that's a good idea. Terraform on its own will be able to handle updates and things like that. However, I'm not sure if terraform allows users to use all config options.

Agree. We should have both. It's great to have dedicated operators where you have explicit configuration options etc. But having a generic Terraform operator, if you are familiar with Terraform and use it elsewhere, would be a great addition.

@jaketf
Contributor

jaketf commented Aug 4, 2020

BTW. Offtop - but should not we think about adding a Terraform/Terragrunt operator to Airflow ? I'd say it might be a good idea to have such an operator with some pre-defined ways on how to get terraform/terragrunt scripts in and how to integrate with airflow's JINJA templating.

Having two levels of rendering (JINJA / HCL string interpolation) sounds like a great way to end up with hard-to-debug situations of "which level of rendering is going wrong?". And what really would be dynamic between task runs from an infra perspective?

I think a terraform hook might be a nice feature, but would take some careful design. I vaguely remember this being brought up on slack or the dev list, but can't seem to find it. I found myself wanting it to bring up / tear down a CI Composer environment during sleeping hours as a cost-cutting measure.

For OSS terraform / terragrunt this could be really tricky:

  1. It means the execution environment must have terraform / terragrunt binaries (plus downloaded remote modules referenced, or source for local modules). Would we provide an airflow extra for this? Would this be up to the user? The latter seems sad and is not easy to do with Composer specifically (it could be easier if you build your own airflow images to toss terraform in and run on kubernetes).
  2. You need an easy way to sync terraform source from a repo (unless you imagine single-resource-type things that might be embedded in DAG code).

I think it would mostly end up as a Kubernetes Pod Operator that happens to run terraform.

However, for terraform enterprise a hook might be much simpler (as the execution environment, source syncing, and state management become not airflow's problem).

@turbaszek turbaszek closed this Aug 5, 2020
@turbaszek turbaszek reopened this Aug 5, 2020
@potiuk
Member

potiuk commented Aug 5, 2020

@jaketf - just the fact that Terraform operator is tricky, makes me even more convinced that we should add it :). All the words about "careful design" and how to bring the scripts etc. show that it's very far from trivial to use Terraform as a step in Airflow - but IMHO conceptually it makes perfect sense :). I can immediately start thinking about - for example - bringing the terraform scripts from Git repo by the operator.. But you are completely right - it is a big, separate discussion that seems like devlist might be the right place for :). I do not want to open another stream of discussion now, but once we get Airflow 2.0 planning under full control and scheduled, I will for sure open one :).

@turbaszek
Member Author

@potiuk @jaketf are we good with this PR? Terraform sounds like the way to go in my opinion (and is/was already used for system tests?) as it would provide a unified way of managing infrastructure from Airflow level.

Handle cluster in DELETING state

Extend tests

fixup! Extend tests

fixup! fixup! Extend tests

fixup! fixup! fixup! Extend tests
@turbaszek turbaszek force-pushed the improve-dataproc-create-cluster-op branch from 850153f to 09d1ff8 Compare August 5, 2020 09:26
@potiuk
Member

potiuk commented Aug 5, 2020

Sure - we are good - it was offtop :)

@jaketf
Contributor

jaketf commented Aug 5, 2020

@turbaszek yes this PR is good to go!
@potiuk indeed it would be good to make all the right decisions once :) (never seen so many smileys in a GH comment)
cc: @brandonjbjelland who's "seen it all" from terraform perspective (might help us in designing the right integration)

@potiuk
Member

potiuk commented Aug 5, 2020

@turbaszek yes this PR is good to go!
@potiuk indeed it would be good to make all the right decisions once :) (never seen so many smileys in a GH comment)

I was smiling all the time when I was writing it.

@turbaszek turbaszek merged commit 0103226 into apache:master Aug 6, 2020
@turbaszek turbaszek deleted the improve-dataproc-create-cluster-op branch August 6, 2020 08:31