perf: Improve repr performance #918
Conversation
Compare db383cc to b378198
        return cls(node)

    @classmethod
    def from_cached(
I assume this was dead code?
yes, probably from a recent refactor
bigframes/core/nodes.py (Outdated)
 ## Put ordering in here or just add order_by node above?
 @dataclass(frozen=True)
-class ReadTableNode(BigFrameNode):
+class ReadTableNode(LeafNode):
Do we want a row_count here? If there are no filters applied, we should be able to get this from table metadata (though it could be slightly inaccurate if there was anything in the streaming buffer, which can show up in query results before the count arrives at the table metadata).
Yes, that is probably right. In the new revision I will stuff some more table metadata into these nodes and use that to provide row_count where possible.
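For illustration, a minimal sketch of how a leaf node might surface a row count from table metadata when no filters are applied. The names and fields here are hypothetical stand-ins, not the PR's actual shape:

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ReadTableNodeSketch:
    """Hypothetical leaf node; field names do not match the PR."""

    # Captured from bigquery.Table.num_rows when the node is built.
    metadata_row_count: int
    # Filters pushed into the read; any filter invalidates the metadata count.
    filters: Tuple[str, ...] = ()

    @property
    def row_count(self) -> Optional[int]:
        # Table metadata is only trustworthy when no filters are applied,
        # and even then the streaming buffer can make it slightly stale,
        # as noted above.
        if not self.filters:
            return self.metadata_row_count
        return None
```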
bigframes/core/nodes.py (Outdated)
@@ -398,6 +415,11 @@ def relation_ops_created(self) -> int:
         # Assume worst case, where readgbq actually has baked in analytic operation to generate index
         return 3

+    @property
+    def supports_fast_head(self) -> bool:
+        # TODO: Be more lenient for small tables, or those clustered on non-sequential order key
+1, LIMIT clause is quite cheap on clustered tables.
From what I can tell, this is not actually the case yet. All the files still get dispatched for ORDER BY + LIMIT, so the full cost is paid. In my experiments, only a filter over materialized row numbers grants significant savings. This may change in time.
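A hedged sketch of what the more lenient rule in the TODO might look like; the byte threshold and field names are invented for illustration:

```python
from dataclasses import dataclass

# Arbitrary illustrative threshold, not from the PR.
_SMALL_TABLE_BYTES = 100 * 1024 * 1024


@dataclass(frozen=True)
class FastHeadSketch:
    """Hypothetical node fields, for illustration only."""

    physical_bytes: int
    has_sequential_order_key: bool

    @property
    def supports_fast_head(self) -> bool:
        # Today only a filter over materialized sequential row numbers
        # avoids dispatching all files; the TODO would additionally allow
        # small tables, where a full scan is cheap anyway.
        return self.has_sequential_order_key or (
            self.physical_bytes <= _SMALL_TABLE_BYTES
        )
```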
bigframes/core/tree_properties.py (Outdated)
"""Can get head fast if can push head operator down to leafs and operators preserve rows.""" | ||
if isinstance(node, nodes.LeafNode): | ||
return node.supports_fast_head | ||
# TODO: In theory we can push head down through concat, but requires some dedicated logic |
Are you thinking something like: limit applied to each + an overall limit clause? Does the BigQuery engine not push down limit through a union if we apply it at the top level?
I guess I didn't think too hard about this. I am not aware of an engine-level ordered-head optimization that would prevent dispatching all the underlying data.
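The reviewer's suggestion, sketched as a tree rewrite. The node classes below are hypothetical stand-ins for the real BigFrameNode types, and the rewrite assumes an unordered head (an ordered head would also need the ordering applied per branch):

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Leaf:
    name: str


@dataclass(frozen=True)
class Concat:
    children: Tuple["Plan", ...]


@dataclass(frozen=True)
class Limit:
    child: "Plan"
    n: int


Plan = Union[Leaf, Concat, Limit]


def push_limit_through_concat(plan: Plan) -> Plan:
    """Rewrite LIMIT n over CONCAT as per-child limits plus an overall limit.

    Each branch can contribute at most n rows to the head, so limiting
    every input first preserves the result while letting each leaf use
    its own fast-head path.
    """
    if isinstance(plan, Limit) and isinstance(plan.child, Concat):
        limited = tuple(Limit(c, plan.n) for c in plan.child.children)
        return Limit(Concat(limited), plan.n)
    return plan
```

Whether the engine pushes the outer LIMIT through the union on its own is the open question above; this rewrite would do it explicitly in the plan.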
bigframes/session/executor.py (Outdated)
        self, array_value: bigframes.core.ArrayValue, n_rows: int
    ) -> tuple[bigquery.table.RowIterator, bigquery.QueryJob]:
        """
        A 'peek' efficiently accesses a small number of rows in the dataframe.
Update this docstring.
done
bigframes/session/executor.py (Outdated)
        return self._run_execute_query(sql=sql)

    def get_row_count(self, array_value: bigframes.core.ArrayValue) -> int:
        # optimized plan less likely to have count-destroying operators like filter or join
What does this comment mean? Do you mean that if the count is not None, then we have an optimized plan?
Removed the comment. The essential idea is to use the row count metadata from cached executions where possible.
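A sketch of that idea: walk the optimized plan and use known counts (from table metadata or cached executions) before falling back to a COUNT(*) query. All names below are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class PlanNodeSketch:
    """Hypothetical plan node, for illustration only."""

    children: Tuple["PlanNodeSketch", ...] = ()
    # Known count, e.g. from table metadata or a cached execution result.
    known_row_count: Optional[int] = None
    # Projections preserve row count; filters and joins do not.
    row_preserving: bool = True


def row_count_if_known(node: PlanNodeSketch) -> Optional[int]:
    """Return the plan's row count without running a query, when possible."""
    if node.known_row_count is not None:
        return node.known_row_count
    if node.row_preserving and len(node.children) == 1:
        return row_count_if_known(node.children[0])
    return None  # caller falls back to a COUNT(*) query
```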
@@ -218,7 +274,7 @@ def _run_execute_query(
             else:
                 raise

-    def _with_cached_executions(self, node: nodes.BigFrameNode) -> nodes.BigFrameNode:
+    def _get_optimized_plan(self, node: nodes.BigFrameNode) -> nodes.BigFrameNode:
Could we get a docstring for this? Or maybe rename it? I'd love to understand when we'd use this. It seems we'd always want to use an "optimized" plan, as no downsides are documented here.
Added a docstring. The only optimization right now is to apply caching. There are no real downsides at this stage, other than mutating the structure of the tree, which is fine at the execution stage (we don't want to do it earlier, as it would interfere with implicit joining).
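The intent, sketched as a standalone rewrite over frozen-dataclass nodes shaped like the PlanNodeSketch above. The cache mapping and wording are mine, not the PR's:

```python
import dataclasses


def get_optimized_plan(node, cached_executions):
    """Apply execution-time rewrites to an expression tree.

    Currently the only rewrite is substituting subtrees whose results have
    already been executed and cached. This changes the tree's structure,
    so it should run only at execution time: before that, identical
    subtrees are used to drive implicit joins, and rewriting them would
    interfere with that matching.
    """
    # cached_executions: hypothetical mapping from subtree -> cached-read
    # leaf; lookup works because frozen dataclass nodes are hashable.
    if node in cached_executions:
        return cached_executions[node]
    optimized = tuple(
        get_optimized_plan(c, cached_executions) for c in node.children
    )
    return dataclasses.replace(node, children=optimized)
```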
 @dataclass(frozen=True)
-class ReadTableNode(BigFrameNode):
+class GbqTable:
This is so we can get something hashable? So we can make sure we only have the fields we care about? A docstring with the purpose would be helpful here.
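My read of the motivation, sketched below; the field choice is illustrative, not necessarily the PR's. A frozen dataclass is hashable, so plan nodes embedding it stay hashable for caching and deduplication, and it pins down exactly the table metadata fields the tree cares about:

```python
from dataclasses import dataclass
from typing import Tuple

from google.cloud import bigquery


@dataclass(frozen=True)
class GbqTableSketch:
    """Immutable, hashable snapshot of the table metadata plan nodes need.

    Fields shown are illustrative; tuples are used instead of lists so the
    struct stays hashable.
    """

    project_id: str
    dataset_id: str
    table_id: str
    physical_schema: Tuple[bigquery.SchemaField, ...]
    n_rows: int

    @classmethod
    def from_table(cls, table: bigquery.Table) -> "GbqTableSketch":
        return cls(
            project_id=table.project,
            dataset_id=table.dataset_id,
            table_id=table.table_id,
            physical_schema=tuple(table.schema),
            n_rows=table.num_rows,
        )
```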
* perf: Improve repr performance
* extract gbq metadata from nodes to common struct
* clarify fast head
* fix physical_schema to be bq client types
* add classmethod annotation to GbqTable struct factory method

Co-authored-by: Tim Sweña (Swast) <swast@google.com>