
0.14 contains a breaking change that silently migrates the DB schema by default #15

Closed
rclough opened this issue Sep 17, 2019 · 9 comments

Comments

@rclough

rclough commented Sep 17, 2019

My team serves a centralized Kubeflow cluster for various customers at our company, who use TFX to run ML workloads (we are using TFX instead of the KFP DSL specifically because metadata tracking is already included).

We started looking into updating our tools to TFX 0.14 and ran our integration tests with the change on our development cluster, which (as far as we're aware) applied an MLMD schema migration, ultimately renaming the column is_artifact_type to type_kind. We later found out this was breaking the pipelines of other users who were still running TFX 0.13. Since this was a dev cluster, nothing critical was impacted, but it raised a serious concern: any end user who decides to try a newer version of TFX could effectively take down dev/prod because there's an automated DB migration.

We currently have a solution for the immediate issue, which is to create a limited-privilege DB user to be used by MLMD (we have a wrapper around TFX for internal use where we can enforce this) so that schema migrations fail unless triggered explicitly by an admin.
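For reference, a minimal sketch of that workaround, assuming a MySQL-backed MLMD store (the user name, host, and grants below are illustrative, not our actual setup):

```python
# Hypothetical admin step: create a MySQL user that can read/write rows but
# has no DDL privileges, so a newer MLMD library cannot ALTER the schema:
#
#   CREATE USER 'mlmd_rw'@'%' IDENTIFIED BY '<password>';
#   GRANT SELECT, INSERT, UPDATE, DELETE ON mlmd.* TO 'mlmd_rw'@'%';
#
# Pipelines then connect with the restricted user, so an attempted schema
# migration fails with a permission error instead of silently upgrading.
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

connection_config = metadata_store_pb2.ConnectionConfig()
connection_config.mysql.host = "mlmd-db.example.com"  # hypothetical host
connection_config.mysql.port = 3306
connection_config.mysql.database = "mlmd"
connection_config.mysql.user = "mlmd_rw"              # restricted user
connection_config.mysql.password = "<password>"

store = metadata_store.MetadataStore(connection_config)
```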

This, however, is not the only problem: we're left with the issue of getting all of our teams to "jump at the same time" and upgrade wholesale to TFX 0.14 when we're ready. While we don't have any production pipelines running on Kubeflow, this is a consideration for future breaking changes. If there are existing daily pipelines that are not actively maintained, and they use an older version of TFX, how do we manage them coexisting on the same cluster as newer versions if there is an incompatible schema?

@hughmiao
Contributor

Thanks very much for letting us know about this upgrade issue.

When introducing the schema migration feature, we discussed these migration scenarios internally as well. Hopefully the discussion here in the thread will help other users in the community.

The rationale for the current behavior is:

  • the user does not need to be aware of lower-level details
  • there is no data loss during the migration process, as it is atomic and tested, and there is a check failure when an older version of the library is used with the upgraded database.
  • for TFX 0.13.0 users, we assumed that in most cases each pipeline has its own database, so upgrading a single pipeline will not affect other pipelines.

On the other hand, it indeed brings issues when multiple pipelines use the same database.

We discussed how to address this, as it may occur in the following important scenarios:

  • multiple pipelines with mixed versions use the same database.
  • development from head, where some MLMD commit at head has a newer schema version.

There is an MLMD plan to support these deployment paradigms with the following approaches:

  1. Give the user an option to disable the auto schema migration, and upgrade it explicitly via utility tools or library connection options (see the sketch below)

    • this removes the need to set special DB permissions for TFX / MLMD users.
  2. Allow a downgrade option

    • for the case where a production DB is accidentally upgraded by a dev pipeline, this downgrade option can bring the DB back to its original schema version.
  3. Support N+1 compatible schema migrations

    • Instead of a check failure when an N-1 lib_version sees a db_version = N, the N-1 lib_version will be able to work with the newer db_version whenever possible.
    • For incompatible neighboring versions, where N-1 does not work with N, we will mark breaking changes in the release notes.
    • This will enable low (hopefully zero) downtime migrations when mixing library versions in pipelines sharing the same DB.

These features will be available in the next release.
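As a rough sketch of what option 1 looks like from the MLMD Python client, based on how it eventually shipped (the `enable_upgrade_migration` flag; treat the exact option name as subject to the release notes):

```python
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

connection_config = metadata_store_pb2.ConnectionConfig()
connection_config.mysql.host = "mlmd-db.example.com"  # hypothetical host
connection_config.mysql.database = "mlmd"
connection_config.mysql.user = "mlmd_admin"           # hypothetical admin user
connection_config.mysql.password = "<password>"

# Default: no auto migration. Connecting a newer library to a DB with an older
# schema raises an error instead of silently upgrading the schema.
store = metadata_store.MetadataStore(connection_config)

# Explicit opt-in: an admin upgrades the schema in a controlled maintenance
# window by passing the migration flag when connecting.
store = metadata_store.MetadataStore(
    connection_config, enable_upgrade_migration=True)
```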

@hughmiao
Contributor

For Kubeflow Pipelines users, we are deploying the MLMD gRPC server to sit on top of the database itself. In an upcoming version, TFX pipelines in Kubeflow will read/write metadata using the API on the gRPC server instead of directly talking to the database. This will stop pipelines from being able to make schema changes directly. It will also insulate pipelines from schema upgrades. The schema upgrade itself can then happen on startup of the gRPC server, or, using options 1 and 3 above, be deferred to a later point in time when the admin has determined no pipelines are running.
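A rough sketch of what that looks like from the Python client once the gRPC service is in place (the host and port below are hypothetical; note the pipeline never holds DB credentials):

```python
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Pipelines connect to the MLMD gRPC service instead of the database itself,
# so they cannot issue schema changes; only the server touches the DB.
client_config = metadata_store_pb2.MetadataStoreClientConfig()
client_config.host = "metadata-service.kubeflow"  # hypothetical service host
client_config.port = 8080                         # hypothetical gRPC port

store = metadata_store.MetadataStore(client_config)
artifacts = store.get_artifacts()  # same MetadataStore API, now server-backed
```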

@andrewsmartin

andrewsmartin commented Sep 18, 2019

Hi @hughmiao, thanks for the detailed response! Glad to hear about these new features in the next release; that will help us a lot.

In an upcoming version, TFX pipelines in Kubeflow will read/write metadata using the API on the gRPC server instead of directly talking to the database.

Will this also be in the next TFX release?

@rclough
Author

rclough commented Sep 18, 2019

Thanks for the quick response; the upcoming features you mentioned sound great.

I have a question regarding this:

for TFX 0.13.0 users, we assumed that in most cases each pipeline has its own database, so upgrading a single pipeline will not affect other pipelines.

On the other hand, it indeed brings issues when multiple pipelines use the same database.

This is interesting to me. I know TFX also runs on Airflow (an implementation I'm less familiar with), but doesn't support for Kubeflow necessarily imply an environment with multiple pipelines using the same database (even assuming only a single team was using the Kubeflow instance)? I'm just a bit surprised that the average case for TFX would expect one MLMD instance per pipeline.

@neuromage

Will this also be in the next TFX release?
Yes, we already have a change being reviewed, so it should make it to the 0.15 release.

@andrewsmartin

Fantastic! Thanks!

@hughmiao
Contributor

@rclough

This is interesting to me. I know TFX also runs on Airflow (an implementation I'm less familiar with), but doesn't support for Kubeflow necessarily imply an environment with multiple pipelines using the same database (even assuming only a single team was using the Kubeflow instance)? I'm just a bit surprised that the average case for TFX would expect one MLMD instance per pipeline.

Yes, TFX 0.13.0 by default uses one DB (SQLite/MySQL) per pipeline. The lifecycle of the DB is the same as the pipeline's. It is used to power the orchestration during pipeline runs and to introspect the lineage of artifacts/executions per pipeline. In addition, MLMD 0.13.2 does not have a pipeline concept, so when multiple pipelines use the same DB, the artifacts and executions are mixed together and hard to query on a per-pipeline basis, unless a pipeline-identifier custom property is populated in all saved entities.

Starting from MLMD 0.14.0, we introduced Context in the data model, which can be used to model a pipeline, an experiment, ownership, etc., in order to group artifacts/executions by pipeline, and we added APIs to support querying entities through Context.

From TFX 0.14.0, MLMD 0.14.0 is used and mlmd::Context is starting to be integrated. In the next release, sharing a DB across pipeline lifecycles will be better supported.
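To illustrate the Context idea, here is a minimal sketch with the MLMD Python API that groups artifacts and executions under a per-pipeline context (the type and pipeline names are just examples):

```python
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

# A local SQLite store for illustration; a shared MySQL config works the same.
connection_config = metadata_store_pb2.ConnectionConfig()
connection_config.sqlite.filename_uri = "/tmp/mlmd.sqlite"
connection_config.sqlite.connection_mode = 3  # READWRITE_OPENCREATE
store = metadata_store.MetadataStore(connection_config)

# Register minimal types and one artifact/execution to attach (illustrative).
artifact_type_id = store.put_artifact_type(
    metadata_store_pb2.ArtifactType(name="Examples"))
[artifact_id] = store.put_artifacts(
    [metadata_store_pb2.Artifact(type_id=artifact_type_id)])
execution_type_id = store.put_execution_type(
    metadata_store_pb2.ExecutionType(name="Trainer"))
[execution_id] = store.put_executions(
    [metadata_store_pb2.Execution(type_id=execution_type_id)])

# One Context per pipeline groups everything that pipeline produced.
context_type_id = store.put_context_type(
    metadata_store_pb2.ContextType(name="pipeline"))
pipeline_ctx = metadata_store_pb2.Context(
    type_id=context_type_id, name="taxi_pipeline")  # hypothetical pipeline name
[ctx_id] = store.put_contexts([pipeline_ctx])

# Attach the artifact and execution to the pipeline's context...
store.put_attributions_and_associations(
    [metadata_store_pb2.Attribution(context_id=ctx_id, artifact_id=artifact_id)],
    [metadata_store_pb2.Association(context_id=ctx_id, execution_id=execution_id)])

# ...and query them back on a per-pipeline basis.
pipeline_artifacts = store.get_artifacts_by_context(ctx_id)
pipeline_executions = store.get_executions_by_context(ctx_id)
```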

For Kubeflow, we are moving to a managed metadata service as the preferred setup, where many pipelines talk to a single metadata service. As @neuromage mentioned above, it will be in the next release.

@rclough
Author

rclough commented Sep 18, 2019

Thanks for the clarification!

@hughmiao
Contributor

Closing this now. The new release, 0.15.1, provides the methods listed above. We now disable the migration by default and give the user options to enable it explicitly when connecting to the backend. We also provide a downgrade utility to revert the DB in case it is accidentally upgraded. Server-side migration options have been added too. Please also see the updated documentation related to this, and feel free to reopen if there are other issues related to this.
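For reference, a minimal sketch of the downgrade utility, assuming the `downgrade_schema` helper in the MLMD Python package (double-check the exact name and the target schema version against the MLMD docs for your versions):

```python
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

connection_config = metadata_store_pb2.ConnectionConfig()
connection_config.mysql.host = "mlmd-db.example.com"  # hypothetical host
connection_config.mysql.database = "mlmd"
connection_config.mysql.user = "mlmd_admin"           # hypothetical admin user
connection_config.mysql.password = "<password>"

# Revert an accidentally upgraded database back to the schema version that the
# still-deployed (older) MLMD library expects (4 here is purely illustrative).
metadata_store.downgrade_schema(connection_config, downgrade_to_schema_version=4)
```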

hughmiao pushed a commit that referenced this issue Aug 25, 2020
Adds ReadNodesViaContextEdges workload for mlmd_bench.