Improve handling Dataproc cluster creation with ERROR state #9593
Conversation
@dossett happy to hear your opinion!

@turbaszek Thanks! One other small behavior that I added in AIRFLOW-3149 was that if the cluster already existed in the DELETING state, the operator would wait for the DELETE to finish and then create a new cluster. (A rare condition, but one we hit in production more than once.) Thank you very much!

@dossett I've added handling of this case
Force-pushed from 9239ac8 to 90b85e8
@olchas would you mind taking a look?
Force-pushed from 90b85e8 to 727f4fa
    self._handle_error_state(hook)
elif cluster.status.state == cluster.status.DELETING:
    # Wait for cluster to delete
    for time_to_sleep in exponential_sleep_generator(initial=10, maximum=120):
I know that we are waiting here for an operation that is supposed to finish within a finite amount of time, but what do you think about adding a maximum amount of total sleep before raising an exception?
I've added a 5m timeout. @dossett do you think it should be ok in most cases?
Honestly, as an Airflow user and Airflow admin I would rather there not be a timeout (or just use whatever global timeouts might exist). I've seen DELETE ops take 45-60 minutes before and if they are taking that long, there's nothing to do except wait for it to complete.
Hm, not sure if this is the right way for "create operator". We even started a discussion about this on devlist:
https://lists.apache.org/thread.html/r9a6833ebafa3f00f79f86d9688f77a958a73ab7b6d9eccd1f0998fe2%40%3Cdev.airflow.apache.org%3E
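To make the trade-off concrete, here is a minimal sketch of the wait-for-deletion loop being discussed, with an optional cap on total sleep. The generator mirrors the `exponential_sleep_generator(initial=10, maximum=120)` call from the quoted diff but is defined locally here; `wait_for_cluster_deletion` and its `get_cluster` callable are illustrative stand-ins, not the operator's actual API.

```python
import time


def exponential_sleep_generator(initial, maximum, multiplier=2):
    """Yield sleep intervals: initial, initial*2, ... capped at maximum."""
    delay = initial
    while True:
        yield min(delay, maximum)
        delay *= multiplier


def wait_for_cluster_deletion(get_cluster, timeout=None):
    """Poll until get_cluster() returns None (i.e. the cluster is gone).

    get_cluster: callable returning the cluster object, or None once deleted.
    timeout: optional cap on total seconds slept. None waits indefinitely,
             which is what the reviewers preferred for long-running DELETEs.
    """
    total_slept = 0
    for interval in exponential_sleep_generator(initial=10, maximum=120):
        if get_cluster() is None:
            return
        if timeout is not None and total_slept + interval > timeout:
            raise TimeoutError(
                "Cluster was not deleted within %s seconds" % timeout
            )
        time.sleep(interval)
        total_slept += interval
```

With `timeout=None` the loop simply backs off from 10s up to a 120s ceiling forever, matching the "just wait for it to complete" position above.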
Force-pushed from 80aae20 to a155601
self.log.info("Cluster already exists.")
cluster = self._get_cluster(hook)
If the cluster is in CREATING state, should you block until it reaches RUNNING?
If the cluster already exists (I assume this is checked by cluster id / name), then you should assert that it matches any configuration explicitly specified in this operator by the user (e.g. a cluster could exist with this ID but have a different Dataproc version / missing init actions / etc.). You would not want to consider this a successful run of this operator, as it did not meet its contract of creating a cluster with the explicit XYZ config provided by the user.
IMO this should result in a task failure, as it is not clear what the operator should do in this scenario: delete the existing cluster? Add a uuid suffix to avoid the name clash and create a new cluster?
@jaketf is there any bulletproof, simple way to compare cluster configuration? Comparing dicts doesn't sound like the way to go, because the created cluster includes more information than the user-provided config
I think you'd just compare the keys of the dict that were specified by the user.
This might be easier said than done.
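For illustration, the "compare only the keys the user specified" idea could be sketched as a recursive subset check like the one below. This is a hypothetical helper, not anything in the provider, and it shows why this is easier said than done: defaulted values, list ordering, and server-normalized fields (e.g. zone URIs vs short names) can all produce false mismatches.

```python
def is_subset(expected, actual):
    """Return True if every key/value in `expected` (the user-provided
    config) appears with the same value in `actual` (the live cluster's
    config), recursing into nested dicts.

    Server-populated fields that the user never specified are ignored,
    but this check is deliberately naive and not bulletproof.
    """
    if isinstance(expected, dict):
        return isinstance(actual, dict) and all(
            key in actual and is_subset(value, actual[key])
            for key, value in expected.items()
        )
    return expected == actual
```

Even this simple version fails as soon as the API rewrites a value the user provided (e.g. expanding a short zone name into a full URI), which is the core objection raised in this thread.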
Yeah. See my "terraform" comment above. I think we are pretty good with manual deletion of the cluster in case we want to change configuration; I don't think we should handle all the potential complexity of computing the difference between expected/actual cluster configuration.
@@ -437,6 +437,9 @@ class DataprocCreateClusterOperator(BaseOperator):
    :type project_id: str
    :param region: leave as 'global', might become relevant in the future. (templated)
    :type region: str
    :param delete_on_error: If true the cluster will be deleted if created with ERROR state. Default
        value is true.
    :type delete_on_error: bool
should there be a similar configurable parameter like "delete_on_configuration_mismatch" that defaults to False?
I think this case will be difficult to handle correctly in all cases. You will not always need to create a new cluster when the configuration is different, e.g. additional installed components do not affect the usability of the cluster.
I think this is not really something that should be handled by the operator itself. I'd argue that if you really want to change the configuration of a cluster, you can simply delete it manually and let it be re-created. Working in a "terraformy" or "kubectly" "apply" fashion in this case should be left to Terraform, i.e. if you really want to use this kind of approach, why not write a terraform script and run terraform?
BTW, off-topic: shouldn't we think about adding a Terraform/Terragrunt operator to Airflow? I'd say it might be a good idea to have such an operator with some pre-defined ways of getting terraform/terragrunt scripts in and of integrating with Airflow's Jinja templating.
Related #10014

LGTM, thank you!
LGTM.
I think trying to go down the "does this existing cluster at least match the config specified by the user" may be a rabbit hole and certainly shouldn't hold up the improvements in this PR.
I have one more suggestion, but maybe it is material for another PR: maybe we could let the user decide if they want to reuse an existing cluster via another bool argument (let's call it …)
Also, do you think we should add a note about the new exceptions being raised to UPDATING.md? I figure that users could be handling ERROR state clusters on their own in a later task, and now they would receive an exception in …
I think that's a good idea. Terraform on its own will be able to handle updates and stuff like that. However, I'm not sure if Terraform allows users to use all config options
Agree. We should have both. It's great to have dedicated operators where you have explicit configuration options etc. But having a generic Terraform operator, if you are familiar with Terraform and use it elsewhere, would be a great addition.
Having two levels of rendering (Jinja / HCL string interpolation) sounds like a great way to have hard-to-debug situations of "what level of this rendering is going wrong". And what would really be dynamic between task runs from an infra perspective? I think a Terraform hook might be a nice feature but would take some careful design. I vaguely remember this being brought up on Slack or the dev list but can't seem to find it. I found myself wanting it to bring up / tear down a CI Composer environment during sleeping hours as a cost-cutting measure. For OSS terraform / terragrunt this could be really tricky.
However, for Terraform Enterprise a hook might be much simpler (as the execution environment, source syncing, and state management become not Airflow's problem).
@jaketf - just the fact that a Terraform operator is tricky makes me even more convinced that we should add it :). All the words about "careful design" and how to bring the scripts in etc. show that it's very far from trivial to use Terraform as a step in Airflow - but IMHO conceptually it makes perfect sense :). I can immediately start thinking about - for example - the operator bringing the terraform scripts from a Git repo. But you are completely right - it is a big, separate discussion for which the devlist seems like the right place :). I do not want to open another stream of discussion now, but once we get Airflow 2.0 planning under full control and scheduled, I will for sure open one :).
Force-pushed from 850153f to 09d1ff8 (Handle cluster in DELETING state; Extend tests, plus fixup! commits)
Sure - we are good - it was off-topic :)

@turbaszek yes this PR is good to go!

I was smiling all the time when I was writing it.
This PR brings back some features added in #4064 that were lost during a refactor.
Now if the cluster is created but its state is ERROR, we run diagnostics on it and delete it.
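The diagnose-then-delete flow described above can be sketched as follows. This is a simplified illustration, not the operator's actual code: the hook methods, the `ClusterCreationError` exception, and the string state constant are stand-ins for the google-cloud-dataproc client types the real operator uses, and `delete_on_error` mirrors the new parameter from the quoted docstring diff.

```python
class ClusterCreationError(Exception):
    """Raised when the cluster ends up in ERROR state after creation."""


# Hypothetical state constant; the real operator compares against
# cluster.status.ERROR from the Dataproc client library.
ERROR = "ERROR"


def create_cluster_or_fail(hook, delete_on_error=True):
    """Create a cluster; if it comes up in ERROR state, diagnose it,
    optionally delete it so a retry starts from a clean slate, and fail."""
    cluster = hook.create_cluster()
    if cluster.state == ERROR:
        # Gather diagnostic output before any cleanup, so the
        # information survives the deletion of the broken cluster.
        diagnostics_uri = hook.diagnose_cluster()
        if delete_on_error:
            hook.delete_cluster()
        raise ClusterCreationError(
            "Cluster created in ERROR state, diagnostics at %s" % diagnostics_uri
        )
    return cluster
```

Raising after diagnosis is what the UPDATING.md question above is about: tasks that previously inspected ERROR state clusters themselves now see a failed create task instead.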
Make sure to mark the boxes below before creating PR: [x]
- In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
- In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
- In case of backwards incompatible changes please leave a note in UPDATING.md.
- Read the Pull Request Guidelines for more information.