0.14 contains a breaking change that silently migrates the DB schema by default #15
Thanks very much for letting us know about the upgrade issue. When introducing the schema migration feature, we discussed the migration scenarios internally as well. We hope the discussion here in the thread will help other users in the community. The rationale for the current behavior is that:
On the other hand, it does bring issues when multiple pipelines use the same database. We discussed how to address this, as it may occur in the following important scenarios:
There is an MLMD plan to support these deployment paradigms with the following approaches:
These features will be available in the next release.
For Kubeflow Pipelines users, we are deploying the MLMD gRPC server to sit on top of the database itself. In an upcoming version, TFX pipelines in Kubeflow will read/write metadata using the API on the gRPC server instead of talking to the database directly. This will stop pipelines from being able to make schema changes directly. It will also insulate pipelines from schema upgrades. The schema upgrade itself can then happen on startup of the gRPC server, or, using options 1 and 3 above, be deferred to a later point in time when the admin has determined no pipelines are running.
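To make the deployment concrete, here is a minimal sketch (not from this thread) of how a pipeline could talk to an MLMD gRPC server instead of opening the database itself; the service name and port are placeholders for whatever the actual Kubeflow deployment exposes.

```python
# Sketch: connect to a deployed MLMD gRPC server rather than the database.
# Host/port are assumptions about the deployment, not values from this issue.
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

client_config = metadata_store_pb2.MetadataStoreClientConfig()
client_config.host = "metadata-grpc-service"  # placeholder service name
client_config.port = 8080                     # placeholder port

# The client only issues API calls; it never runs schema DDL, so any
# schema migration has to happen on the server side.
store = metadata_store.MetadataStore(client_config)
print(store.get_artifact_types())
```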
Hi @hughmiao, thanks for the detailed response! Glad to hear about these new features in the next release; that will help us a lot.
Will this also be in the next TFX release?
Thanks for the quick response; the upcoming features you mentioned sound great. I have a question regarding this:
This is interesting to me. I know TFX also runs on Airflow (an implementation I'm less familiar with), but doesn't support for Kubeflow necessarily imply an environment with multiple pipelines using the same database (even assuming only a single team is using the Kubeflow instance)? I'm just a bit surprised that the average case for TFX would expect one MLMD instance per pipeline.
Fantastic! Thanks!
Yes, TFX 0.13.0 by default uses one db (sqlite/mysql) per pipeline. The lifecycle of the db is the same as the pipeline's. It is used to power orchestration during pipeline runs and to introspect the lineage of artifacts/executions per pipeline. In addition, MLMD 0.13.2 does not have a pipeline concept, so if multiple pipelines share the same db, their artifacts and executions are mixed together and hard to query on a per-pipeline basis, unless a pipeline-identifier custom property is populated in all saved entities. Starting from MLMD 0.14.0, we introduced Context in the data model, which can be used to model a pipeline, an experiment, ownership, etc., to group artifacts/executions by pipeline, and we added APIs to support querying entities through Context. From TFX 0.14.0, MLMD 0.14.0 is used and mlmd::Context is starting to be integrated. In the next release, sharing a db across pipeline lifecycles will be better supported. For Kubeflow, we are moving to a managed metadata service as the preferred setup, where many pipelines talk to a metadata service. As @neuromage mentioned above, it will be in the next release.
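As a rough illustration of the Context-based grouping described above, here is a hedged sketch using the MLMD Python API; the connection values, type names, and URIs are placeholders invented for the example, not anything TFX itself creates.

```python
# Sketch: group artifacts/executions by pipeline using MLMD Contexts.
# All names and connection values below are placeholders.
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

config = metadata_store_pb2.ConnectionConfig()
config.sqlite.filename_uri = "/tmp/mlmd.db"  # placeholder; could be mysql
store = metadata_store.MetadataStore(config)

# Register minimal types and nodes to attach to the context.
artifact_type_id = store.put_artifact_type(
    metadata_store_pb2.ArtifactType(name="Examples"))
[artifact_id] = store.put_artifacts(
    [metadata_store_pb2.Artifact(type_id=artifact_type_id, uri="/tmp/examples")])

execution_type_id = store.put_execution_type(
    metadata_store_pb2.ExecutionType(name="Trainer"))
[execution_id] = store.put_executions(
    [metadata_store_pb2.Execution(type_id=execution_type_id)])

# A context type that models a pipeline, and a context for one pipeline.
pipeline_type_id = store.put_context_type(
    metadata_store_pb2.ContextType(name="pipeline"))
[context_id] = store.put_contexts(
    [metadata_store_pb2.Context(type_id=pipeline_type_id, name="my_tfx_pipeline")])

# Attribute the artifact and associate the execution with that pipeline context.
attribution = metadata_store_pb2.Attribution(
    artifact_id=artifact_id, context_id=context_id)
association = metadata_store_pb2.Association(
    execution_id=execution_id, context_id=context_id)
store.put_attributions_and_associations([attribution], [association])

# Query entities on a per-pipeline basis through the context.
print(store.get_artifacts_by_context(context_id))
```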
Thanks for the clarification!
Closing this now. In the new release, 0.15.1, the methods listed above are provided. We now disable the migration by default and give users the option to enable it explicitly when connecting to the backend. We also provide a downgrade utility to revert in case the db is accidentally upgraded. The server-side migration options are added too. Please also see the updated documentation related to this. Please feel free to reopen if there are other issues related to this.
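For readers landing here later, a minimal sketch of the 0.15.1 behavior described above: the migration has to be enabled explicitly when connecting, and a downgrade utility can revert an accidental upgrade. The connection values and the schema version number are placeholders; check the ml-metadata docs for the exact options in your version.

```python
# Sketch of the opt-in migration and downgrade utility (placeholder values).
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

config = metadata_store_pb2.ConnectionConfig()
config.sqlite.filename_uri = "/tmp/mlmd.db"  # placeholder path

# By default the connection fails if the db schema is older than the library
# expects; an admin can opt in to the upgrade explicitly.
store = metadata_store.MetadataStore(config, enable_upgrade_migration=True)

# If a db was upgraded by accident, it can be downgraded back to the schema
# version an older library expects (the version number here is only an example).
metadata_store.downgrade_schema(config, downgrade_to_schema_version=5)
```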
My team serves a centralized Kubeflow cluster for various customers at our company, who use TFX to run ML workloads (we are using TFX instead of the KFP DSL specifically because metadata tracking is already included).
We started looking into updating our tools to TFX 0.14 and ran our integration tests with the change on our development cluster, which (as far as we're aware) applied an MLMD schema migration, ultimately renaming the column is_artifact_type to type_kind. We later found out this was breaking the pipelines of other users who were still running TFX 0.13. Since this was a dev cluster, nothing critical was impacted, but it brought up a serious concern: any end user who decides to try a newer version of TFX could effectively take down dev/prod because of the automated DB migration. We currently have a solution for the immediate issue, which is to create a limited-privilege DB user to be used by MLMD (we have a wrapper around TFX for internal use where we can enforce this) so that schema migrations fail unless triggered explicitly by an admin.
This, however, is not the only problem: we're left with the issue of getting all of our teams to "jump at the same time" and upgrade wholesale to TFX 0.14 when we're ready. While we don't have any production pipelines running on Kubeflow, this is a consideration for future breaking changes: if there are existing daily pipelines that are not actively maintained and use an older version of TFX, how do we manage them coexisting on the same cluster as newer versions if there is an incompatible schema?