Spark / SPARK-37829

An outer-join using joinWith on DataFrames returns Rows with null fields instead of null values


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0
    • Fix Version/s: 3.3.3, 3.4.1, 3.5.0
    • Component/s: SQL
    • Labels: None

    Description

      Doing an outer-join using joinWith on DataFrames used to return missing values as null in Spark 2.4.8, but returns them as Rows with null values in Spark 3+.

      The issue can be reproduced with the following test that succeeds on Spark 2.4.8 but fails starting from Spark 3.0.0.
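
      For readers without access to the linked test, the scenario can be sketched roughly as follows. This is not the test from the report: the sample data, the column names (name, key) and the object name JoinWithOuterSketch are made up for illustration, and the sketch assumes a local SparkSession is available.

```scala
import org.apache.spark.sql.{Row, SparkSession}

object JoinWithOuterSketch {
  // Returns the right-hand element of the joinWith tuple for the left row
  // that has no match on the right side.
  def unmatchedRightSide(spark: SparkSession): Row = {
    import spark.implicits._

    // Hypothetical sample data: key 1 exists only on the left.
    val left = Seq(("a", 1), ("b", 2)).toDF("name", "key").as("left")
    val right = Seq(("x", 2), ("y", 3)).toDF("name", "key").as("right")

    // On DataFrames, joinWith yields a Dataset[(Row, Row)].
    val joined = left.joinWith(right, $"left.key" === $"right.key", "left_outer")

    // Pick the tuple whose left row has key 1 (column index 1) and
    // return its right-hand side.
    joined.collect().find(_._1.getInt(1) == 1).get._2
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("joinWith-outer-sketch")
      .getOrCreate()
    try {
      // Per the report: null on Spark 2.4.8, a Row whose fields are all
      // null on the affected 3.x versions.
      println(s"right side of unmatched row: ${unmatchedRightSide(spark)}")
    } finally spark.stop()
  }
}
```

      Checking whether the returned value is null, or a Row in which every field is null, is enough to tell the two behaviors apart.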

      The problem only arises when working with DataFrames: Datasets of case classes work as expected as demonstrated by this other test.
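
      The typed counterpart might look like the sketch below. Again, this is not the linked test: the Record case class, the sample data and the object name JoinWithTypedSketch are assumptions for illustration only.

```scala
import org.apache.spark.sql.SparkSession

object JoinWithTypedSketch {
  // Hypothetical record type, declared outside the method so that Spark
  // can derive an encoder for it.
  case class Record(name: String, key: Int)

  // Returns the right-hand element of the joinWith tuple for the left
  // record that has no match on the right side.
  def unmatchedRightSide(spark: SparkSession): Record = {
    import spark.implicits._

    // Hypothetical sample data: key 1 exists only on the left.
    val left = Seq(Record("a", 1), Record("b", 2)).toDS().as("left")
    val right = Seq(Record("x", 2), Record("y", 3)).toDS().as("right")

    // On typed Datasets, joinWith yields a Dataset[(Record, Record)].
    val joined = left.joinWith(right, $"left.key" === $"right.key", "left_outer")

    joined.collect().find(_._1.key == 1).get._2
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("joinWith-typed-sketch")
      .getOrCreate()
    try {
      // Per the report, the missing side is expected to be null here.
      println(s"right side of unmatched record: ${unmatchedRightSide(spark)}")
    } finally spark.stop()
  }
}
```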

      I couldn't find an explanation for this change in the Migration guide so I'm assuming this is a bug.

      A git bisect pointed me to that commit.

      Reverting the commit solves the problem.

      A similar solution, but without reverting, is shown here.

      Happy to help if you think of another approach / can provide some guidance.

            People

              Assignee: Jason Xu (kings129)
              Reporter: Clément de Groc (cdegroc)
              Votes: 0
              Watchers: 6
