feat: Allow DataFrame.join for self-join on Null index #860

Merged Jul 30, 2024 (5 commits)

Conversation

TrevorBergeron
Contributor

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@TrevorBergeron TrevorBergeron requested review from a team as code owners July 24, 2024 23:24
@product-auto-label product-auto-label bot added the size: s Pull request size is small. label Jul 24, 2024
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. label Jul 24, 2024
Collaborator

@tswast tswast left a comment

Thanks Trevor!

Should we add an e2e test for the .fit(X, y) table with null index / unordered mode, too?

@TrevorBergeron
Contributor Author

> Thanks Trevor!
>
> Should we add an e2e test for the .fit(X, y) table with null index / unordered mode, too?

Good idea. Adding the test revealed that the ML modules were caching pre-join, which is invalidated by a row-identity join. I changed them to cache post-join instead.
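To illustrate one plausible reading of the comment above, here is a minimal toy model in plain Python. It is not bigframes code: `ToyFrame`, `source_id`, and `select` are hypothetical names invented for this sketch. It assumes that a frame with no index can only be joined to another frame derived from the same underlying rows, and that `cache()` materializes a frame into a new table, which severs that shared lineage.

```python
import itertools

_table_ids = itertools.count()  # hands out a fresh id per materialized table


class ToyFrame:
    """Toy stand-in for an index-less DataFrame (hypothetical, not bigframes)."""

    def __init__(self, columns, source_id=None):
        self.columns = dict(columns)  # column name -> list of values
        # Frames that share a source_id are views over the same rows.
        self.source_id = next(_table_ids) if source_id is None else source_id

    def select(self, names):
        # A projection keeps the lineage, so row identity is preserved.
        return ToyFrame({n: self.columns[n] for n in names}, self.source_id)

    def cache(self):
        # Assumption: materializing copies the data into a table with a new identity.
        return ToyFrame(self.columns)

    def join(self, other):
        # With no index to match on, only a self-join by row identity is defined.
        if self.source_id != other.source_id:
            raise ValueError("cannot join: frames do not share row identity")
        return ToyFrame({**self.columns, **other.columns}, self.source_id)


table = ToyFrame({"x": [1, 2, 3], "y": [10, 20, 30]})
X, y = table.select(["x"]), table.select(["y"])

# Cache after the join: both sides still trace back to the same rows.
joined = X.join(y).cache()

# Cache before the join: each side becomes an unrelated table.
try:
    X.cache().join(y.cache())
    pre_join_ok = True
except ValueError:
    pre_join_ok = False
```

In this toy model only the post-join cache succeeds, which matches the direction of the change described in the comment. How bigframes tracks row identity internally is not shown in this thread.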

@product-auto-label product-auto-label bot added size: m Pull request size is medium. and removed size: s Pull request size is small. labels Jul 25, 2024
@TrevorBergeron TrevorBergeron requested a review from tswast July 26, 2024 23:06
@@ -326,7 +326,7 @@ def create_model(
         if y_train is None:
             input_data = X_train.cache()
         else:
-            input_data = X_train.cache().join(y_train.cache(), how="outer")
+            input_data = X_train.join(y_train.cache(), how="outer").cache()
Contributor

this should be y_train without cache() as well?

Contributor Author

yes, fixed
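The updated line is not shown in this thread, so the following is only a sketch of what the branch might look like after this exchange, assuming the fix dropped the remaining `y_train.cache()` call. `prepare_input_data` and `RecordingFrame` are hypothetical names used for illustration; the stand-in frame records the order of calls so that the "join first, cache once" behavior can be checked without BigQuery.

```python
def prepare_input_data(X_train, y_train=None):
    """Hypothetical helper mirroring the branch in the diff above."""
    if y_train is None:
        return X_train.cache()
    # Join first, then cache once; neither side is cached on its own.
    return X_train.join(y_train, how="outer").cache()


class RecordingFrame:
    """Stand-in frame that records the order of calls made on it."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def cache(self):
        self.log.append(f"cache({self.name})")
        return RecordingFrame(f"cached[{self.name}]", self.log)

    def join(self, other, how="inner"):
        self.log.append(f"join({self.name}, {other.name}, how={how})")
        return RecordingFrame(f"{self.name}+{other.name}", self.log)


log = []
result = prepare_input_data(RecordingFrame("X", log), RecordingFrame("y", log))
print(log)  # ['join(X, y, how=outer)', 'cache(X+y)']
```

The recorded log shows a single cache call, made on the joined frame rather than on either input.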

def test_unordered_mode_logistic_regression_configure_fit_score(
    unordered_session, penguins_table_id, dataset_id
):
    model = bigframes.ml.linear_model.LogisticRegression()
Contributor

Not a problem, but when we pick only one model to test shared functionality, the usual choice is LinearRegression.

Contributor Author

ok, used linear regression model instead

@tswast tswast merged commit e950533 into main Jul 30, 2024
23 checks passed
@tswast tswast deleted the null_join branch July 30, 2024 17:33
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: m Pull request size is medium.
3 participants