feat!: reading JSON data as a custom arrow extension type #1458

Merged
tswast merged 9 commits into main from main_chelsealin_jsonarrowtypeonly
Mar 11, 2025

Conversation

chelsea-lin
Contributor

@chelsea-lin chelsea-lin commented Mar 5, 2025

We initially implemented a local pandas extension type (db_dtypes.JSONDtype) for handling JSON data. Subsequently, the Arrow project introduced a native JSON data type in pyarrow after v19.0. We've opted to adopt this native type as our primary solution (see go/bf-json2 for the internal design document). To stay compatible with users on older pyarrow versions, we've been using a custom Arrow extension type (db_dtypes.JSONArrowType) as a fallback. This change switches JSON data over to that custom Arrow extension type as a stepping stone toward fully integrating the native pyarrow JSON type.

Release-As: 1.40.0

  • Fixes internal issue 401054811
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

@chelsea-lin chelsea-lin requested review from a team as code owners March 5, 2025 23:04
@product-auto-label product-auto-label bot added the size: m Pull request size is medium. label Mar 5, 2025
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. label Mar 5, 2025
@chelsea-lin chelsea-lin force-pushed the main_chelsealin_jsonarrowtypeonly branch from 7f8d18c to 5dcbdc0 Compare March 6, 2025 00:06
@GarrettWu GarrettWu removed their assignment Mar 6, 2025
@chelsea-lin chelsea-lin force-pushed the main_chelsealin_jsonarrowtypeonly branch from 5dcbdc0 to 94ef33b Compare March 6, 2025 06:12
@tswast
Collaborator

tswast commented Mar 6, 2025

Getting some test failures in presubmit:

FAILED tests/system/small/test_dataframe.py::test_df_drop_duplicates_w_json[first]
FAILED tests/system/small/test_dataframe.py::test_df_drop_duplicates_w_json[last]
FAILED tests/system/small/test_dataframe.py::test_df_drop_duplicates_w_json[False]
3 failed, 2715 passed, 16 skipped, 43 xfailed, 2 xpassed, 418 warnings in 1083.21s (0:18:03)

Also, could we make sure we add a Release-As: footer to our final commit message to make sure this doesn't trigger the 2.0 release? See: https://github.com/googleapis/release-please/blob/main/README.md#how-do-i-change-the-version-number

@chelsea-lin chelsea-lin force-pushed the main_chelsealin_jsonarrowtypeonly branch from 94ef33b to 5a2baa6 Compare March 6, 2025 23:55
@chelsea-lin chelsea-lin force-pushed the main_chelsealin_jsonarrowtypeonly branch from 65770f3 to a2edcbf Compare March 7, 2025 00:08
Collaborator

@tswast tswast left a comment

Thanks! One suggestion, but otherwise looks good.

@@ -62,7 +62,7 @@
 # No arrow equivalent
 GEO_DTYPE = gpd.array.GeometryDtype()
 # JSON
-JSON_DTYPE = db_dtypes.JSONDtype()
+JSON_DTYPE = pd.ArrowDtype(db_dtypes.JSONArrowType())
Collaborator

Should we switch to pd.ArrowDtype(pyarrow.json_(pyarrow.string())) if pyarrow.json_ is available?

Also, would be good to make sure we align with OBJ_REF_DTYPE by creating a JSON_ARROW_TYPE variable to use here and there.

Contributor Author

I will follow up with b/401055693. Thanks for reviewing!
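
For illustration only, the review suggestion above (prefer pyarrow.json_ when it is available, otherwise keep the db_dtypes fallback, and share one JSON_ARROW_TYPE value) could be sketched roughly as follows. The names pick_json_arrow_type and build_json_dtype are hypothetical and not part of this PR, and the new_pa / old_pa objects are stand-ins that only demonstrate which branch is taken. The sketch assumes pyarrow.json_ accepts a storage type, as in the reviewer's expression; the actual follow-up tracked in b/401055693 may look different.

```python
from types import SimpleNamespace
from typing import Any, Callable


def pick_json_arrow_type(pa: Any, fallback_factory: Callable[[], Any]) -> Any:
    """Prefer the native Arrow JSON type when this pyarrow build exposes json_.

    `pa` is the pyarrow module (or a stand-in); `fallback_factory` builds the
    fallback extension type, e.g. db_dtypes.JSONArrowType.
    """
    json_factory = getattr(pa, "json_", None)
    if json_factory is not None:
        # Native JSON type backed by string storage.
        return json_factory(pa.string())
    # Older pyarrow: fall back to the custom Arrow extension type.
    return fallback_factory()


def build_json_dtype() -> Any:
    """Hypothetical wiring; imports are deferred so the helper above stays
    usable without these packages installed."""
    import db_dtypes
    import pandas as pd
    import pyarrow as pa

    # Assumption: one shared Arrow type, as the review comment proposes.
    json_arrow_type = pick_json_arrow_type(pa, db_dtypes.JSONArrowType)
    return pd.ArrowDtype(json_arrow_type)


# Stand-ins for a newer and an older pyarrow build (illustration only).
new_pa = SimpleNamespace(
    json_=lambda storage: ("json", storage),
    string=lambda: "string",
)
old_pa = SimpleNamespace(string=lambda: "string")
```

Checking for the attribute rather than comparing version strings keeps the selection independent of how a given pyarrow build reports its version.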

@tswast tswast enabled auto-merge (squash) March 10, 2025 21:16
@tswast tswast merged commit e720f41 into main Mar 11, 2025
22 of 23 checks passed
@tswast tswast deleted the main_chelsealin_jsonarrowtypeonly branch March 11, 2025 19:13
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: m Pull request size is medium.
4 participants