feat: raise `NoDefaultIndexError` from `read_gbq` on clustered/partitioned tables with no `index_col` or `filters` set #631

tswast · 2024-04-22T19:17:47Z

Please review #636 first, as this PR builds on that.

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

- Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
- Ensure the tests and linter pass
- Code coverage does not decrease (if any source code was changed)
- Appropriate docs were updated (if necessary)

Fixes internal issue 335727141
🦕
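A rough sketch of the behavior described in the title, for reviewers who want the decision order at a glance. This is not the code in this PR: `resolve_index`, its parameters, and the order of the checks are hypothetical, and the two classes are local stand-ins for `DefaultIndexKind` and `NoDefaultIndexError`.

```python
import enum
from typing import Optional, Sequence, Union


class DefaultIndexKind(enum.Enum):
    """Local stand-in for the enum discussed in this PR (value is illustrative)."""

    SEQUENTIAL_INT64 = enum.auto()


class NoDefaultIndexError(ValueError):
    """Local stand-in for the error named in the PR title."""


IndexCol = Union[str, Sequence[str], DefaultIndexKind]


def resolve_index(
    *,
    index_col: Optional[IndexCol] = None,
    filters: Sequence[tuple] = (),
    primary_keys: Sequence[str] = (),
    is_clustered_or_partitioned: bool = False,
):
    """Hypothetical helper: pick index columns or a default index kind."""
    # An explicit choice by the caller always wins.
    if isinstance(index_col, DefaultIndexKind):
        return index_col
    if isinstance(index_col, str):
        return [index_col]
    if index_col:
        return list(index_col)
    # Assumption: primary keys, when present, serve as the default index.
    if primary_keys:
        return list(primary_keys)
    # No index_col and no filters on a clustered/partitioned table: refuse
    # to build a default index rather than silently scanning the whole table.
    if is_clustered_or_partitioned and not filters:
        raise NoDefaultIndexError(
            "Table is clustered or partitioned; set index_col or filters."
        )
    return DefaultIndexKind.SEQUENTIAL_INT64


try:
    resolve_index(is_clustered_or_partitioned=True)
except NoDefaultIndexError as exc:
    print(f"raised: {exc}")
```

Passing `index_col=DefaultIndexKind.SEQUENTIAL_INT64` in this sketch opts back in to the sequential index explicitly, which is the escape hatch the docstring changes describe.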


TrevorBergeron · 2024-05-01T16:35:37Z

third_party/bigframes_vendored/pandas/io/gbq.py

+                **New in bigframes version 1.4.0**: Support
+                :class:`bigframes.enums.TypeKind` to override default index
+                behavior.


TypeKind or DefaultIndexKind?

DefaultIndexKind. Good catch. I had tried a few names for this but the refactor option didn't catch docstrings I think.

TrevorBergeron · 2024-05-01T16:38:12Z

third_party/bigframes_vendored/pandas/io/gbq.py

+          rows via pandas-like outer join behavior. Operations like
+          ``cumsum()`` that window across a non-unique index can have some
+          unpredictability due to ambiguous ordering.


The part about non-determinism I don't think is correct? My understanding is if the index is non-unique, we fall back to hidden hashes to ensure total ordering.

Dropped this.
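For context, a toy illustration of the idea in the review comment above: a tiebreaker derived from row content gives a total ordering, so a windowed operation over a non-unique index returns the same result regardless of input order. This is plain Python, not BigQuery DataFrames internals; the SHA-256 tiebreaker and both function names are assumptions made for the sketch.

```python
import hashlib
from itertools import accumulate


def total_order(rows):
    """Sort (index, value) rows by index, breaking ties with a content hash."""

    def tiebreak(row):
        # Stable across processes, unlike the built-in hash() for strings.
        return hashlib.sha256(repr(row).encode()).hexdigest()

    return sorted(rows, key=lambda row: (row[0], tiebreak(row)))


def cumsum_by_index(rows):
    """Cumulative sum of values, taken in the total order defined above."""
    ordered = total_order(rows)
    running = accumulate(value for _, value in ordered)
    return [(idx, total) for (idx, _), total in zip(ordered, running)]


# Same rows, different arrival order, non-unique index "x".
rows_a = [("x", 1), ("x", 5), ("y", 2)]
rows_b = [("y", 2), ("x", 5), ("x", 1)]
print(cumsum_by_index(rows_a) == cumsum_by_index(rows_b))
```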

TrevorBergeron · 2024-05-01T16:45:10Z

bigframes/__init__.py

@@ -25,6 +27,8 @@
    "BigQueryOptions",
    "get_global_session",
    "close_session",
+    "enums",


Is enums an intuitive module, or would a domain-related term be better, eg indexing.IndexType or directly putting the enum in the main module, bigframes.pandas.IndexType?

I tried to find some guidance on this, but the Python community doesn't seem particularly prescriptive about module names.

PEP-8 has this to say:

Modules should have short, all-lowercase names. Underscores can be used in the module name if it improves readability.

https://peps.python.org/pep-0008/#package-and-module-names

Google Python style guide has a bit more to say:

Place related classes and top-level functions together in a module. Unlike Java, there is no need to limit yourself to one class per module.

Use CapWords for class names, but lower_with_under.py for module names.

https://google.github.io/styleguide/pyguide.html#3162-naming-conventions

I tried a few of these options out locally (bigframes.indexes.DefaultIndexKind and bigframes.pandas.DefaultIndexKind), but it feels strange to have something not really mimicking pandas in the pandas sub-package and bigframes.indexes.DefaultIndexKind would imply that we should move the Index and MultiIndex classes there, which is kinda the opposite of what we want to do.

The other option we could try is bigframes.pandas.core.indexes, but in pandas "core" is how they signify that an API is private and not to be relied on.

IMO, determining if classes are "related" by type for the basic types (e.g. exceptions, enums, ...) will be less effort for us long-term than having to figure out which public package to place these things in when they don't fit in an existing API.

TrevorBergeron · 2024-05-01T16:52:48Z

third_party/bigframes_vendored/pandas/io/gbq.py

@@ -107,11 +117,18 @@ def read_gbq(
                `project.dataset.tablename` or `dataset.tablename`.
                Can also take wildcard table name, such as `project.dataset.table_prefix*`.
                In tha case, will read all the matched table as one DataFrame.
-            index_col (Iterable[str] or str):
+            index_col (Iterable[str], str, bigframes.enums.IndexKind):


DefaultIndexKind?

Good catch. Done.

tswast · 2024-05-01T21:53:53Z

Looks like test_read_csv_bq_engine_throws_not_implemented_error is a real failure. Will address.

tswast · 2024-05-02T15:52:07Z

Looks like test_read_csv_bq_engine_throws_not_implemented_error is a real failure. Will address.

Done! According to https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html, index_col=False is equivalent to our DefaultIndexKind.SEQUENTIAL_INT64. I see no reason to raise NotImplementedError in that case, so I've updated those tests and made some slight tweaks so that the two values are treated the same.
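To make the equivalence concrete, here is a minimal sketch of treating the two spellings alike. It is not the code from this PR: `normalize_read_csv_index_col` is a hypothetical helper, and the enum is a local stand-in for `DefaultIndexKind`.

```python
import enum


class DefaultIndexKind(enum.Enum):
    """Local stand-in for the enum discussed in this PR (value is illustrative)."""

    SEQUENTIAL_INT64 = enum.auto()


def normalize_read_csv_index_col(index_col):
    """Hypothetical helper: map index_col=False onto the sequential default."""
    # Use identity checks so that 0 (a valid column position in pandas)
    # is not mistaken for False.
    if index_col is False or index_col is DefaultIndexKind.SEQUENTIAL_INT64:
        return DefaultIndexKind.SEQUENTIAL_INT64
    # Everything else (None, column names, positions) passes through unchanged.
    return index_col


print(normalize_read_csv_index_col(False))
print(normalize_read_csv_index_col("id"))
```

The identity check matters because `0 == False` in Python, so an equality comparison would misread a request for the first column as a request for the default index.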

tswast added 8 commits April 19, 2024 21:32

docs: document index as a best practice

b303999

docs: set index_cols in read_gbq as a best practice

0ddd86b

feat: support primary key(s) in read_gbq by using as the `index_col` by default

994a8f1

revert WIP commit

5fcc5a0

Merge branch 'main' into b335727141-primary_key

6b6a5ab

address type error in tests

8c4e31c

Merge branch 'b335727141-primary_key' into b335727141-clustered-or-partitioned-default-index-error

dd940bd

document behaviors

b96cba3

product-auto-label bot added size: m api: bigquery labels Apr 22, 2024

Merge branch 'b335727141-docs' into b335727141-clustered-or-partitioned-default-index-error

fb3b508

update docs to reflect new default index behavior

477a516

product-auto-label bot added size: l and removed size: m labels Apr 23, 2024

tswast added 3 commits April 24, 2024 16:06

add DefaultIndexKind to allowed index_col values

2c5a0dd

Merge remote-tracking branch 'origin/main' into b335727141-clustered-or-partitioned-default-index-error

d485be6

refactor: cache table metadata alongside snapshot time

d816db3

This ensures the cached `primary_keys` is more likely to be correct, in case the user called ALTER TABLE after we originally cached the snapshot time.

tswast mentioned this pull request Apr 24, 2024

refactor: cache table metadata alongside snapshot time #636

Merged

Merge branch 'b335727141-snapshot-save-metadata' into b335727141-clustered-or-partitioned-default-index-error

d3f0891

tswast added 9 commits April 25, 2024 19:01

add unit tests

241dc60

parametrize tables with clustered and partitioned

613e660

Merge remote-tracking branch 'origin/main' into b335727141-clustered-or-partitioned-default-index-error

2c782ca

refactor: split read_gbq_table implementation into functions and move to separate module add todos

f437dcf

refactor progress

0090dc0

add index_cols function

850db7a

maybe ready for review

ab98d4a

Merge remote-tracking branch 'origin/main' into b335727141-refactor-read_gbq_table

5b665dd

Update bigframes/session/__init__.py

0577131

tswast added 4 commits April 30, 2024 22:28

add error raising plus todos

adaf664

Merge remote-tracking branch 'origin/main' into b335727141-clustered-or-partitioned-default-index-error

e8bdded

add TODO for ROW_NUMBER() in the query we generate

d028bc5

remove filters unit test for now

658f61d

tswast marked this pull request as ready for review May 1, 2024 16:07

tswast requested review from a team as code owners May 1, 2024 16:07

tswast requested a review from milkshakeiii May 1, 2024 16:07

blunderbuss-gcf bot assigned Genesis929 May 1, 2024

tswast changed the title ~~feat: raise NoDefaultIndexError from read_gbq on clustered/partitioned tables with no index_col set~~ feat: raise NoDefaultIndexError from read_gbq on clustered/partitioned tables with no index_col or filters set May 1, 2024

TrevorBergeron reviewed May 1, 2024

docstring fixes

f1b3f88

tswast requested a review from TrevorBergeron May 1, 2024 17:59

TrevorBergeron approved these changes May 1, 2024

tswast enabled auto-merge (squash) May 1, 2024 18:17

Merge branch 'main' into b335727141-clustered-or-partitioned-default-index-error

6b0e63c

tswast added 3 commits May 2, 2024 15:12

Merge remote-tracking branch 'origin/main' into b335727141-clustered-or-partitioned-default-index-error

40fab82

feat: support index_col=False in read_csv and engine="bigquery"

9f3e149

Merge remote-tracking branch 'origin/b335727141-clustered-or-partitioned-default-index-error' into b335727141-clustered-or-partitioned-default-index-error

722abbb

tswast disabled auto-merge May 2, 2024 15:50

tswast enabled auto-merge (squash) May 2, 2024 15:53

tswast added 3 commits May 2, 2024 18:48

revert typo

e7c4d93

attempt 2

d136bc0

Merge remote-tracking branch 'origin/main' into b335727141-clustered-or-partitioned-default-index-error

586cca2

tswast merged commit 73064dd into main May 2, 2024
15 of 16 checks passed

tswast deleted the b335727141-clustered-or-partitioned-default-index-error branch May 2, 2024 22:51

release-please bot mentioned this pull request May 2, 2024

chore(main): release 1.5.0 #645

Merged

tswast mentioned this pull request May 3, 2024

fix: downgrade NoDefaultIndexError to DefaultIndexWarning #658

Merged

4 tasks
