
0.14 contains a breaking change that silently migrates the DB schema by default #15

Closed
rclough opened this issue Sep 17, 2019 · 9 comments

Comments

@rclough

rclough commented Sep 17, 2019

My team serves a centralized Kubeflow cluster for various customers at our company, who use TFX to run ML workloads (we are using TFX instead of the KFP DSL specifically because metadata tracking is already included).

We started looking into updating our tools to TFX 0.14 and ran our integration tests with the change on our development cluster, which (as far as we're aware) applied an MLMD schema migration, ultimately renaming the column is_artifact_type to type_kind. We later found out this was breaking the pipelines of other users who were still running TFX 0.13. Since this was a dev cluster, nothing critical was impacted, but it raised a serious concern: any end user who decides to try a newer version of TFX could effectively take down dev/prod because there's an automated DB migration.

We currently have a solution for the immediate issue, which is to create a limited-privilege DB user to be used by MLMD (we have a wrapper around TFX for internal use where we can enforce this) so that schema migrations fail unless triggered explicitly by an admin.
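For reference, a minimal sketch of that workaround, assuming a MySQL-backed MLMD store (the user name, host, and grants below are illustrative, not our actual setup):

```python
# Hypothetical admin step: create a MySQL user that can read/write rows but
# has no DDL privileges, so a newer MLMD library cannot ALTER the schema:
#
#   CREATE USER 'mlmd_rw'@'%' IDENTIFIED BY '<password>';
#   GRANT SELECT, INSERT, UPDATE, DELETE ON mlmd.* TO 'mlmd_rw'@'%';
#
# Pipelines then connect with the restricted user, so an attempted schema
# migration fails with a permission error instead of silently upgrading.
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

connection_config = metadata_store_pb2.ConnectionConfig()
connection_config.mysql.host = "mlmd-db.example.com"  # hypothetical host
connection_config.mysql.port = 3306
connection_config.mysql.database = "mlmd"
connection_config.mysql.user = "mlmd_rw"              # restricted user
connection_config.mysql.password = "<password>"

store = metadata_store.MetadataStore(connection_config)
```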

This, however, is not the only problem: we're left with the issue of getting all of our teams to "jump at the same time" and upgrade wholesale to TFX 0.14 when we're ready. While we don't have any production pipelines running on Kubeflow, this is a consideration for future breaking changes. If there are existing daily pipelines that are not actively maintained, and they use an older version of TFX, how do we manage them coexisting on the same cluster as newer versions if there is an incompatible schema?

@hughmiao
Contributor

Thanks very much for letting us know about this upgrade issue.

When introducing the schema migration feature, we discussed these migration scenarios internally as well. Hopefully the discussion here in the thread will help other users in the community.

The rationale for the current behavior is:

  • the user does not need to be aware of lower-level details
  • there is no data loss during the migration process, as it is atomic and tested, and there is a check failure when an older version of the library is used with the upgraded database.
  • for TFX 0.13.0 users, we assumed that in most cases each pipeline has its own database, so upgrading a single pipeline will not affect other pipelines.

On the other hand, it indeed brings issues when multiple pipelines use the same database.

We discussed how to address this, as it may occur in the following important scenarios:

  • multiple pipelines with mixed versions use the same database.
  • development from head, where some MLMD commit at head has a newer schema version.

There is an MLMD plan to support these deployment paradigms with the following approaches:

  1. Give the user an option to disable the auto schema migration, and upgrade it explicitly via utility tools or library connection options (see the sketch below)

    • this removes the need to set special DB permissions for TFX / MLMD users.
  2. Allow a downgrade option

    • for the case where a production DB is accidentally upgraded by a dev pipeline, this downgrade option can bring the DB back to its original schema version.
  3. Support N+1 compatible schema migrations

    • Instead of a check failure when an N-1 lib_version sees a db_version = N, the N-1 lib_version will be able to work with the newer db_version whenever possible.
    • For incompatible neighboring versions, where N-1 does not work with N, we will mark breaking changes in the release notes.
    • This will enable low (hopefully zero) downtime migrations when mixing library versions in pipelines sharing the same DB.

These features will be available in the next release.
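As a rough sketch of what option 1 looks like from the MLMD Python client, based on how it eventually shipped (the `enable_upgrade_migration` flag; treat the exact option name as subject to the release notes):

```python
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

connection_config = metadata_store_pb2.ConnectionConfig()
connection_config.mysql.host = "mlmd-db.example.com"  # hypothetical host
connection_config.mysql.database = "mlmd"
connection_config.mysql.user = "mlmd_admin"           # hypothetical admin user
connection_config.mysql.password = "<password>"

# Default: no auto migration. Connecting a newer library to a DB with an older
# schema raises an error instead of silently upgrading the schema.
store = metadata_store.MetadataStore(connection_config)

# Explicit opt-in: an admin upgrades the schema in a controlled maintenance
# window by passing the migration flag when connecting.
store = metadata_store.MetadataStore(
    connection_config, enable_upgrade_migration=True)
```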

@hughmiao
Contributor

For Kubeflow Pipelines users, we are deploying the MLMD gRPC server to sit on top of the database itself. In an upcoming version, TFX pipelines in Kubeflow will read/write metadata using the API on the gRPC server instead of directly talking to the database. This will stop pipelines from being able to make schema changes directly. It will also insulate pipelines from schema upgrades. The schema upgrade itself can then happen on startup of the gRPC server, or, using options 1 and 3 above, be deferred to a later point in time when the admin has determined no pipelines are running.
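A rough sketch of what that looks like from the Python client once the gRPC service is in place (the host and port below are hypothetical; note the pipeline never holds DB credentials):

```python
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Pipelines connect to the MLMD gRPC service instead of the database itself,
# so they cannot issue schema changes; only the server touches the DB.
client_config = metadata_store_pb2.MetadataStoreClientConfig()
client_config.host = "metadata-service.kubeflow"  # hypothetical service host
client_config.port = 8080                         # hypothetical gRPC port

store = metadata_store.MetadataStore(client_config)
artifacts = store.get_artifacts()  # same MetadataStore API, now server-backed
```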

@andrewsmartin

andrewsmartin commented Sep 18, 2019

Hi @hughmiao, thanks for the detailed response! Glad to hear about these new features in the next release; that will help us a lot.

In an upcoming version, TFX pipelines in Kubeflow will read/write metadata using the API on the gRPC server instead of directly talking to the database.

Will this also be in the next TFX release?

@rclough
Author

rclough commented Sep 18, 2019

Thanks for the quick response; the upcoming features you mentioned sound great.

I have a question regarding this:

for TFX 0.13.0 users, we assumed that in most cases each pipeline has its own database, so upgrading a single pipeline will not affect other pipelines.

On the other hand, it indeed brings issues when multiple pipelines use the same database.

This is interesting to me. I know TFX also runs on Airflow (an implementation I'm less familiar with), but doesn't support for Kubeflow necessarily imply an environment with multiple pipelines using the same database (even assuming only a single team was using the Kubeflow instance)? I'm just a bit surprised that the average case for TFX would expect one MLMD instance per pipeline.

@neuromage

Will this also be in the next TFX release?
Yes, we already have a change being reviewed, so it should make it to the 0.15 release.

@andrewsmartin

Fantastic! Thanks!

@hughmiao
Contributor

@rclough

This is interesting to me. I know TFX also runs on Airflow (an implementation I'm less familiar with), but doesn't support for Kubeflow necessarily imply an environment with multiple pipelines using the same database (even assuming only a single team was using the Kubeflow instance)? I'm just a bit surprised that the average case for TFX would expect one MLMD instance per pipeline.

Yes, TFX 0.13.0 by default uses one DB (SQLite/MySQL) per pipeline. The lifecycle of the DB is the same as the pipeline's. It is used to power the orchestration during pipeline runs and to introspect the lineage of artifacts/executions per pipeline. In addition, MLMD 0.13.2 does not have a pipeline concept, so when multiple pipelines use the same DB, the artifacts and executions are mixed together and hard to query on a per-pipeline basis, unless a pipeline-identifier custom property is populated in all saved entities.

Starting from MLMD 0.14.0, we introduced Context in the data model, which can be used to model a pipeline, an experiment, ownership, etc., in order to group artifacts/executions by pipeline, and we added APIs to support querying entities through Context.

From TFX 0.14.0, MLMD 0.14.0 is used and mlmd::Context is starting to be integrated. In the next release, sharing a DB across pipeline lifecycles will be better supported.
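To illustrate the Context idea, here is a minimal sketch with the MLMD Python API that groups artifacts and executions under a per-pipeline context (the type and pipeline names are just examples):

```python
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

# A local SQLite store for illustration; a shared MySQL config works the same.
connection_config = metadata_store_pb2.ConnectionConfig()
connection_config.sqlite.filename_uri = "/tmp/mlmd.sqlite"
connection_config.sqlite.connection_mode = 3  # READWRITE_OPENCREATE
store = metadata_store.MetadataStore(connection_config)

# Register minimal types and one artifact/execution to attach (illustrative).
artifact_type_id = store.put_artifact_type(
    metadata_store_pb2.ArtifactType(name="Examples"))
[artifact_id] = store.put_artifacts(
    [metadata_store_pb2.Artifact(type_id=artifact_type_id)])
execution_type_id = store.put_execution_type(
    metadata_store_pb2.ExecutionType(name="Trainer"))
[execution_id] = store.put_executions(
    [metadata_store_pb2.Execution(type_id=execution_type_id)])

# One Context per pipeline groups everything that pipeline produced.
context_type_id = store.put_context_type(
    metadata_store_pb2.ContextType(name="pipeline"))
pipeline_ctx = metadata_store_pb2.Context(
    type_id=context_type_id, name="taxi_pipeline")  # hypothetical pipeline name
[ctx_id] = store.put_contexts([pipeline_ctx])

# Attach the artifact and execution to the pipeline's context...
store.put_attributions_and_associations(
    [metadata_store_pb2.Attribution(context_id=ctx_id, artifact_id=artifact_id)],
    [metadata_store_pb2.Association(context_id=ctx_id, execution_id=execution_id)])

# ...and query them back on a per-pipeline basis.
pipeline_artifacts = store.get_artifacts_by_context(ctx_id)
pipeline_executions = store.get_executions_by_context(ctx_id)
```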

For Kubeflow, we are moving to a managed metadata service as the preferred setup, where many pipelines talk to a single metadata service. As @neuromage mentioned above, it will be in the next release.

@rclough
Author

rclough commented Sep 18, 2019

Thanks for the clarification!

@hughmiao
Contributor

Closing this now. The new release, 0.15.1, provides the methods listed above. We now disable the migration by default and give the user options to enable it explicitly when connecting to the backend. We also provide a downgrade utility to revert the DB in case it is accidentally upgraded. Server-side migration options have been added too. Please also see the updated documentation related to this, and feel free to reopen if there are other issues related to this.
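For reference, a minimal sketch of the downgrade utility, assuming the `downgrade_schema` helper in the MLMD Python package (double-check the exact name and the target schema version against the MLMD docs for your versions):

```python
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

connection_config = metadata_store_pb2.ConnectionConfig()
connection_config.mysql.host = "mlmd-db.example.com"  # hypothetical host
connection_config.mysql.database = "mlmd"
connection_config.mysql.user = "mlmd_admin"           # hypothetical admin user
connection_config.mysql.password = "<password>"

# Revert an accidentally upgraded database back to the schema version that the
# still-deployed (older) MLMD library expects (4 here is purely illustrative).
metadata_store.downgrade_schema(connection_config, downgrade_to_schema_version=4)
```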

hughmiao pushed a commit that referenced this issue Aug 25, 2020
Adds ReadNodesViaContextEdges workload for mlmd_bench.