Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: raise NoDefaultIndexError from read_gbq on clustered/partitioned tables with no index_col or filters set #631

Merged
merged 39 commits into from
May 2, 2024

Conversation

tswast
Copy link
Collaborator

@tswast tswast commented Apr 22, 2024

Please review #636 first, as this PR builds on that.

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes internal issue 335727141
🦕

@product-auto-label product-auto-label bot added size: m Pull request size is medium. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Apr 22, 2024
@tswast tswast changed the title feat: raise NoDefaultIndexError from read_gbq on clustered/partitioned tables with no index_col set. Set force=True to allow default sequential index feat: raise NoDefaultIndexError from read_gbq on clustered/partitioned tables with no index_col set. Set bigframes.options.compute.allow_default_index_on_clustered_partitioned_tables = True to allow default sequential index Apr 22, 2024
@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: m Pull request size is medium. labels Apr 23, 2024
tswast added 3 commits April 24, 2024 16:06
This ensures the cached `primary_keys` is more likely to be
correct, in case the user called ALTER TABLE after we originally
cached the snapshot time.
@tswast tswast changed the title feat: raise NoDefaultIndexError from read_gbq on clustered/partitioned tables with no index_col set. Set bigframes.options.compute.allow_default_index_on_clustered_partitioned_tables = True to allow default sequential index feat: raise NoDefaultIndexError from read_gbq on clustered/partitioned tables with no index_col set Apr 24, 2024
@tswast tswast marked this pull request as ready for review May 1, 2024 16:07
@tswast tswast requested review from a team as code owners May 1, 2024 16:07
@tswast tswast requested a review from milkshakeiii May 1, 2024 16:07
@tswast tswast changed the title feat: raise NoDefaultIndexError from read_gbq on clustered/partitioned tables with no index_col set feat: raise NoDefaultIndexError from read_gbq on clustered/partitioned tables with no index_col or filters set May 1, 2024
Comment on lines 129 to 131
**New in bigframes version 1.4.0**: Support
:class:`bigframes.enums.TypeKind` to override default index
behavior.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TypeKind or DefaultIndexKind?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DefaultIndexKind. Good catch. I had tried a few names for this but the refactor option didn't catch docstrings I think.

Comment on lines 48 to 50
rows via pandas-like outer join behavior. Operations like
``cumsum()`` that window across a non-unique index can have some
unpredictability due to ambiguous ordering.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The part about non-determinism I don't think is correct? My understanding is if the index is non-unique, we fall back to hidden hashes to ensure total ordering.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped this.

@@ -25,6 +27,8 @@
"BigQueryOptions",
"get_global_session",
"close_session",
"enums",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is enums an intuitive module, or would a domain-related term be better, eg indexing.IndexType or directly putting the enum in the main module, bigframes.pandas.IndexType?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to find some guidance on this, but Python community doesn't seem particularly prescriptive about module names.

PEP-8 has this to say:

Modules should have short, all-lowercase names. Underscores can be used in the module name if it improves readability.

https://peps.python.org/pep-0008/#package-and-module-names

Google Python style guide has a bit more to say:

Place related classes and top-level functions together in a module. Unlike Java, there is no need to limit yourself to one class per module.

Use CapWords for class names, but lower_with_under.py for module names.

https://google.github.io/styleguide/pyguide.html#3162-naming-conventions

I tried a few of these options out locally (bigframes.indexes.DefaultIndexKind and bigframes.pandas.DefaultIndexKind), but it feels strange to have something not really mimicking pandas in the pandas sub-package and bigframes.indexes.DefaultIndexKind would imply that we should move the Index and MultiIndex classes there, which is kinda the opposite of what we want to do.

The other option we could try is bigframes.pandas.core.indexes, but in pandas "core" is how they signify that an API is private an not to be relied on.

IMO, determining if classes are "related" by type for the basic types (e.g. exceptions, enums, ...) will be less effort for us long-term than having to figure out which public package to place these things if it doesn't fit in an existing API.

@@ -107,11 +117,18 @@ def read_gbq(
`project.dataset.tablename` or `dataset.tablename`.
Can also take wildcard table name, such as `project.dataset.table_prefix*`.
In tha case, will read all the matched table as one DataFrame.
index_col (Iterable[str] or str):
index_col (Iterable[str], str, bigframes.enums.IndexKind):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DefaultIndexKind?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Done.

@tswast tswast requested a review from TrevorBergeron May 1, 2024 17:59
@tswast tswast enabled auto-merge (squash) May 1, 2024 18:17
@tswast
Copy link
Collaborator Author

tswast commented May 1, 2024

Looks like test_read_csv_bq_engine_throws_not_implemented_error is a real failure. Will address.

@tswast tswast disabled auto-merge May 2, 2024 15:50
@tswast
Copy link
Collaborator Author

tswast commented May 2, 2024

Looks like test_read_csv_bq_engine_throws_not_implemented_error is a real failure. Will address.

Done! Looks like from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html index_col=False is equivalent to our DefaultIndexKind.SEQUENTIAL_INT64. I see no reason to raise NotImplementedError in that case, so I've updated those tests and made some slight tweaks so that they are treated the same.

@tswast tswast enabled auto-merge (squash) May 2, 2024 15:53
@tswast tswast merged commit 73064dd into main May 2, 2024
15 of 16 checks passed
@tswast tswast deleted the b335727141-clustered-or-partitioned-default-index-error branch May 2, 2024 22:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: l Pull request size is large.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants