- Crop values in natural language stats generator.
- Switch to using PyBind11 instead of SWIG for wrapping C++ libraries.
- CSV decoder support for multivalent columns by using tfx_bsl's decoder.
- When inferring a schema entry for a feature, do not add a shape with dim = 0 when min_num_values = 0.
- Add utility methods `tfdv.get_slice_stats` to get statistics for a slice and `tfdv.compare_slices` to compare statistics of two slices using Facets.
- Make `tfdv.load_stats_text` and `tfdv.write_stats_text` public.
- Add PTransforms `tfdv.WriteStatisticsToText` and `tfdv.WriteStatisticsToTFRecord` to write statistics proto to text and tfrecord files respectively.
- Modify `tfdv.load_statistics` to handle reading statistics from TFRecord and text files.
- Added an extra requirement group `mutual-information`. As a result, barebone TFDV does not require `scikit-learn` any more.
- Added an extra requirement group `visualization`. As a result, barebone TFDV does not require `ipython` any more.
- Added an extra requirement group `all` that specifies all the extra dependencies TFDV needs. Use `pip install tensorflow-data-validation[all]` to pull in those dependencies.
- Depends on `pyarrow>=0.16,<0.17`.
- Depends on `apache-beam[gcp]>=2.20,<3`.
- Depends on `ipython>=7,<8;python_version>="3"`.
- Depends on `scikit-learn>=0.18,<0.24`.
- Depends on `tensorflow>=1.15,!=2.0.*,<3`.
- Depends on `tensorflow-metadata>=0.22.0,<0.23`.
- Depends on `tensorflow-transform>=0.22,<0.23`.
- Depends on `tfx-bsl>=0.22,<0.23`.
- (Known issue resolution) It is no longer necessary to use Apache Beam 2.17 when running TFDV on Windows. The current release of Apache Beam will work.
- `tfdv.GenerateStatistics` now accepts a PCollection of `pa.RecordBatch` instead of `pa.Table`.
- All the TFDV coders now output a PCollection of `pa.RecordBatch` instead of a PCollection of `pa.Table`.
- `tfdv.validate_instances` and `tfdv.api.validation_api.IdentifyAnomalousExamples` now take `pa.RecordBatch` as input instead of `pa.Table`.
- The `StatsGenerator` interface (and all its sub-classes) now takes `pa.RecordBatch` as the input data instead of `pa.Table`.
- Custom slicing functions now accept a `pa.RecordBatch` instead of `pa.Table` as input and should output a tuple `(slice_key, record_batch)`.
- Deprecating Py2 support.