perf: Improve repr performance #918

Merged
merged 9 commits into main from faster_repr
Aug 30, 2024

Conversation

TrevorBergeron
Contributor

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@product-auto-label bot added labels size: l (Pull request size is large) and api: bigquery (Issues related to the googleapis/python-bigquery-dataframes API) on Aug 23, 2024
@TrevorBergeron requested a review from tswast on August 23, 2024 20:07
@TrevorBergeron marked this pull request as ready for review on August 23, 2024 20:07
@TrevorBergeron requested review from a team as code owners on August 23, 2024 20:07
@tswast added the owlbot:run label (triggers the Owlbot post processor) on Aug 28, 2024
@gcf-owl-bot bot removed the owlbot:run label on Aug 28, 2024
        return cls(node)

    @classmethod
    def from_cached(
Collaborator

I assume this was dead code?

Contributor Author

Yes, probably leftover from a recent refactor.



 ## Put ordering in here or just add order_by node above?
 @dataclass(frozen=True)
-class ReadTableNode(BigFrameNode):
+class ReadTableNode(LeafNode):
Collaborator

Do we want a row_count here? If there are no filters applied, we should be able to get this from table metadata (though it could be slightly inaccurate if there was anything in the streaming buffer, which can show up in query results before the count arrives at the table metadata).

Contributor Author

Yes, that is probably right. In the new revision I will stuff some more table metadata into these nodes and use it to provide row_count where possible.
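
For illustration, a minimal sketch of sourcing a row count from table metadata alone; the helper name is hypothetical, not code from this PR:

from google.cloud import bigquery

def metadata_row_count(client: bigquery.Client, table_id: str) -> int:
    # Metadata-only call; no query is run and no bytes are scanned.
    table = client.get_table(table_id)
    # Caveat from the discussion above: rows still in the streaming buffer
    # can show up in query results before they are reflected in num_rows.
    return table.num_rows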

@@ -398,6 +415,11 @@ def relation_ops_created(self) -> int:
         # Assume worst case, where readgbq actually has baked in analytic operation to generate index
         return 3

+    @property
+    def supports_fast_head(self) -> bool:
+        # TODO: Be more lenient for small tables, or those clustered on non-sequential order key
Collaborator

+1, LIMIT clause is quite cheap on clustered tables.

Contributor Author

From what I can tell, this is not actually the case yet. All the files still get dispatched for ORDER BY + LIMIT, so the full cost is paid. In my experiments, only a filter over materialized row numbers yields significant savings. This may change in time.
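
To make the distinction concrete, a rough sketch of the two query shapes being compared (table and column names are placeholders, and the cached table is assumed to be clustered on the row-number column):

# ORDER BY + LIMIT still dispatches all files before the limit applies,
# so the full scan cost is paid.
order_by_limit = """
SELECT *
FROM `project.dataset.source_table`
ORDER BY order_key
LIMIT 5
"""

# A filter over materialized row numbers lets clustering prune most
# blocks, which is where the observed savings come from.
filter_on_row_num = """
SELECT *
FROM `project.dataset.cached_table`
WHERE row_num < 5
"""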

"""Can get head fast if can push head operator down to leafs and operators preserve rows."""
if isinstance(node, nodes.LeafNode):
return node.supports_fast_head
# TODO: In theory we can push head down through concat, but requires some dedicated logic
Collaborator

Are you thinking something like: limit applied to each + an overall limit clause? Does the BigQuery engine not push down limit through a union if we apply it at the top level?

Contributor Author

I guess I didn't think too hard about this. I am not aware of an engine-level ordered-head optimization that would prevent dispatching all the underlying data.
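
A sketch of the per-input rewrite described above, with hypothetical LimitNode and ConcatNode classes standing in for the real BigFrames node types:

import dataclasses

@dataclasses.dataclass(frozen=True)
class LimitNode:  # hypothetical stand-in, not a real BigFrames node
    child: object
    n: int

@dataclasses.dataclass(frozen=True)
class ConcatNode:  # hypothetical stand-in, not a real BigFrames node
    children: tuple

def push_limit_through_concat(node: ConcatNode, n: int) -> LimitNode:
    # Each input contributes at most n rows to the first n rows of the
    # concatenation, so limiting every input individually is safe as long
    # as an outer limit restores the exact requested row count.
    limited = tuple(LimitNode(child, n) for child in node.children)
    return LimitNode(ConcatNode(limited), n)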

        self, array_value: bigframes.core.ArrayValue, n_rows: int
    ) -> tuple[bigquery.table.RowIterator, bigquery.QueryJob]:
        """
        A 'peek' efficiently accesses a small number of rows in the dataframe.
Collaborator

Update this docstring.

Contributor Author

done

        return self._run_execute_query(sql=sql)

    def get_row_count(self, array_value: bigframes.core.ArrayValue) -> int:
        # optimized plan less likely to have count-destroying operators like filter or join
Collaborator

What does this comment mean? Do you mean that if the count is not None, then we have an optimized plan?

Contributor Author

Removed the comment. The essential idea is to use the row-count metadata from cached executions where possible.
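
Roughly, the idea looks like this (attribute and helper names are guesses, not this PR's exact API):

def get_row_count(self, array_value) -> int:
    plan = self._get_optimized_plan(array_value.node)
    # If caching was substituted in, the leaf may carry an exact row count
    # taken from the materialized table's metadata; no query is needed.
    if isinstance(plan, nodes.LeafNode) and plan.row_count is not None:
        return plan.row_count
    # Otherwise fall back to executing a COUNT(*) over the compiled plan.
    sql = self._compile_row_count_sql(plan)  # hypothetical helper
    results, _ = self._run_execute_query(sql=sql)
    return next(iter(results))[0]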

@@ -218,7 +274,7 @@ def _run_execute_query(
         else:
             raise

-    def _with_cached_executions(self, node: nodes.BigFrameNode) -> nodes.BigFrameNode:
+    def _get_optimized_plan(self, node: nodes.BigFrameNode) -> nodes.BigFrameNode:
Collaborator

Could we get a docstring for this? Or maybe rename? I'd love to understand when we'd use this? It seems we'd always want to use an "optimized" plan, as no downsides are documented here.

Contributor Author

Added a docstring. The only optimization right now is to apply caching. There are no real downsides at this stage, other than mutating the structure of the tree, which is fine at the execution stage (we don't want to do it earlier, since it would interfere with implicit joining).
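
A minimal sketch of that caching substitution (the cache mapping and traversal helper are assumptions, not this PR's API):

def _get_optimized_plan(self, node: nodes.BigFrameNode) -> nodes.BigFrameNode:
    """Swap any subtree that has a cached execution for a leaf node that
    reads the materialized result. Applied only at execution time, since
    rewriting earlier would change tree structure and interfere with
    implicit joining."""
    cached = self._cached_executions.get(node)  # assumed dict of node -> cached leaf
    if cached is not None:
        return cached
    # Recurse into children; transform_children is an assumed traversal helper.
    return node.transform_children(self._get_optimized_plan)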

@TrevorBergeron requested a review from tswast on August 29, 2024 19:53
 @dataclass(frozen=True)
-class ReadTableNode(BigFrameNode):
+class GbqTable:
Collaborator

This is so we can get something hashable? So we can make sure we only have the fields we care about? A docstring with the purpose would be helpful here.
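
For context, a rough sketch of what such a struct could look like (field names are illustrative; the commit history below mentions a classmethod factory and BigQuery-client types for physical_schema):

from dataclasses import dataclass

import google.cloud.bigquery as bigquery

@dataclass(frozen=True)
class GbqTable:
    """Immutable snapshot of BigQuery table metadata.

    frozen=True makes instances hashable, so nodes carrying a GbqTable can
    be compared and used as dict keys (e.g. for cached executions).
    """
    project_id: str
    dataset_id: str
    table_id: str
    physical_schema: tuple  # tuple of bigquery.SchemaField, kept hashable
    n_rows: int

    @classmethod
    def from_table(cls, table: bigquery.Table) -> "GbqTable":
        return cls(
            project_id=table.project,
            dataset_id=table.dataset_id,
            table_id=table.table_id,
            physical_schema=tuple(table.schema),
            n_rows=table.num_rows,
        )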

@tswast enabled auto-merge (squash) on August 30, 2024 21:44
@tswast merged commit 46f2dd7 into main on Aug 30, 2024
21 of 23 checks passed
@tswast deleted the faster_repr branch on August 30, 2024 22:28
arwas11 pushed a commit that referenced this pull request on Sep 3, 2024
* perf: Improve repr performance

* extract gbq metadata from nodes to common struct

* clarify fast head

* fix physical_schema to be bq client types

* add classmethod annotation to GbqTable struct factory method

* add classmethod annotation to GbqTable struct factory method

---------

Co-authored-by: Tim Sweña (Swast) <swast@google.com>