
feat: Use session temp tables for all ephemeral storage #1569

Merged
TrevorBergeron merged 10 commits into main from use_temp_session_tables on Apr 2, 2025

Conversation

TrevorBergeron (Contributor) opened this pull request:

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

product-auto-label bot added the size: m (Pull request size is medium) and api: bigquery (Issues related to the googleapis/python-bigquery-dataframes API) labels on Mar 28, 2025
product-auto-label bot added the size: l (Pull request size is large) label and removed the size: m label on Mar 31, 2025
TrevorBergeron marked this pull request as ready for review on March 31, 2025 at 21:03
TrevorBergeron requested review from a team as code owners on March 31, 2025 at 21:03
TrevorBergeron requested review from tswast and removed the request for drylks-work on March 31, 2025 at 21:03
@@ -57,8 +58,13 @@ def test_bq_session_create_temp_table_clustered(bigquery_client: bigquery.Client
    assert result_table.clustering_fields == cluster_cols

    session_resource_manager.close()
    with pytest.raises(google.api_core.exceptions.NotFound):
tswast (Collaborator) commented on Apr 1, 2025:

FWIW: if you sync to main I made a similar fix in https://github.com/googleapis/python-bigquery-dataframes/pull/1572/files. I think we can keep the pytest.raises and just do the loop.
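For reference, a minimal sketch of that pattern (the helper name, retry count, and delay are illustrative assumptions, not the exact code from #1572):

    import time

    import google.api_core.exceptions
    import pytest

    def assert_table_eventually_deleted(bigquery_client, table_id, attempts=10, delay=1.0):
        # Keep the pytest.raises and just do the loop: poll until the backend
        # reports the table gone. get_table raising NotFound ends the loop and
        # satisfies the pytest.raises assertion; if the table never disappears,
        # the loop exits without raising and pytest.raises fails the test.
        with pytest.raises(google.api_core.exceptions.NotFound):
            for _ in range(attempts):
                bigquery_client.get_table(table_id)
                time.sleep(delay)  # deletion can lag behind close(); retry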

TrevorBergeron (Contributor, Author) replied:

OK, just merged main in and used your version.

@TrevorBergeron TrevorBergeron requested a review from tswast April 1, 2025 17:59
tswast (Collaborator) left a comment:

Thanks! Just a few nits.

Comment on lines 404 to 410
    anon_dataset_manager = getattr(self, "_anon_dataset_manager", None)
    if anon_dataset_manager:
        self._anon_dataset_manager.close()

    if getattr(self, "_session_resource_manager", None):
        if self._session_resource_manager is not None:
            self._session_resource_manager.close()
tswast (Collaborator):

Nit: We can make this a little more internally consistent. If you'd like to use getattr in the if statement, let's use the := (walrus operator) so that we aren't fetching from self more than once.

Suggested change:

    -    anon_dataset_manager = getattr(self, "_anon_dataset_manager", None)
    -    if anon_dataset_manager:
    -        self._anon_dataset_manager.close()
    -
    -    if getattr(self, "_session_resource_manager", None):
    -        if self._session_resource_manager is not None:
    -            self._session_resource_manager.close()
    +    if anon_dataset_manager := getattr(self, "_anon_dataset_manager", None):
    +        anon_dataset_manager.close()
    +    if session_resource_manager := getattr(self, "_session_resource_manager", None):
    +        session_resource_manager.close()

tswast (Collaborator):

Committed my own suggestion. 🤞 hopefully I didn't break anything.

@@ -906,8 +923,6 @@ def read_csv(
            engine=engine,
            write_engine=write_engine,
        )
        table = self._temp_storage_manager.allocate_temp_table()
tswast (Collaborator):

Does this mean the table isn't created right away? If so, I think we might need to supply a session ID in the load job, if available.

Edit: I see this was moved to read_bigquery_load_job, which makes sense to me. Aside: with more hybrid engine stuff in the future, I can imagine some cases where to_gbq() would be doing a load job to a user-managed table, but I suppose that would probably use a very different job config, anyway.
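For context, a hedged sketch of what supplying a session ID to a load job looks like with the google-cloud-bigquery client (the source URI and destination are hypothetical, this is not necessarily how bigframes wires it up, and create_session/connection_properties require a reasonably recent client library):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Start a session via a trivial query; its ID scopes later jobs.
    start_job = client.query(
        "SELECT 1", job_config=bigquery.QueryJobConfig(create_session=True)
    )
    start_job.result()
    session_id = start_job.session_info.session_id

    # Attach the load job to that session so its destination can live in
    # session-scoped storage.
    load_config = bigquery.LoadJobConfig(
        connection_properties=[bigquery.ConnectionProperty("session_id", session_id)]
    )
    load_job = client.load_table_from_uri(
        "gs://my-bucket/data.csv",           # hypothetical source
        "my-project.my_dataset.temp_table",  # hypothetical destination
        job_config=load_config,
    )
    load_job.result()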

@@ -166,8 +171,13 @@ def read_pandas_load_job(
        job_config.clustering_fields = cluster_cols

        job_config.labels = {"bigframes-api": api_name}
        job_config.schema_update_options = [
            google.cloud.bigquery.job.SchemaUpdateOption.ALLOW_FIELD_ADDITION
tswast (Collaborator):

I'm curious: why ALLOW_FIELD_ADDITION here, but using WRITE_TRUNCATE for read_gbq_load_job? Might be worthwhile to add some comments.

TrevorBergeron (Contributor, Author) replied:

I just want to make sure the ordering_col does not get overridden, as that is what is clustered.
Might still work with TRUNCATE as well?
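To make the two options being compared concrete, a small sketch (values are illustrative, not the exact configs in this PR):

    from google.cloud import bigquery

    # Append-style load: new columns may be added, but existing columns
    # (e.g. a clustered ordering column) are left in place.
    append_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        schema_update_options=[
            bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
        ],
    )

    # Truncate-style load: replaces the table's rows, and without an explicit
    # schema it can also replace the schema, which is the concern above.
    truncate_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )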

        )

        self._start_generic_job(load_job)
        table_id = f"{table.project}.{table.dataset_id}.{table.table_id}"

        # Update the table expiration so we aren't limited to the default 24
tswast (Collaborator):

I assume _storage_manager.create_temp_table handles this in the anonymous dataset case now?

Edit: Yes, I see we do set an expiration above:

        expiration = (
            datetime.datetime.now(datetime.timezone.utc) + constants.DEFAULT_EXPIRATION
        )
        table = bf_io_bigquery.create_temp_table(
            self.bqclient,
            self.allocate_temp_table(),
            expiration,
            schema=schema,
            cluster_columns=list(cluster_cols),
            kms_key=self._kms_key,
        )
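For anyone tracing the behavior, a minimal sketch of setting an expiration at table-creation time with the plain BigQuery client (project, dataset, table, and schema names are hypothetical; bf_io_bigquery.create_temp_table presumably does something similar internally):

    import datetime

    from google.cloud import bigquery

    client = bigquery.Client()
    expiration = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(days=7)

    table = bigquery.Table(
        "my-project.my_dataset.my_temp_table",
        schema=[bigquery.SchemaField("id", "INT64")],
    )
    table.expires = expiration  # BigQuery drops the table at this timestamp
    table = client.create_table(table)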

TrevorBergeron enabled auto-merge (squash) on April 2, 2025 at 18:14
TrevorBergeron merged commit 9711b83 into main on Apr 2, 2025
17 of 24 checks passed
TrevorBergeron deleted the use_temp_session_tables branch on April 2, 2025 at 21:54