Python-snappy not found during execution of CSVExampleGen #6843

sagnik-t · 2024-06-23T10:40:19Z

System information

Have I specified the code to reproduce the issue (Yes, No): Yes
Environment in which the code is executed (Interactive Notebook, Google Cloud, etc): Zorin OS (Ubuntu 22.04)
TensorFlow version: 2.15.1
TFX Version: 1.15.1
Python version: 3.10.14
Python dependencies (from pip freeze output):
absl-py==1.4.0 annotated-types==0.7.0 anyio==4.4.0 apache-beam==2.56.0 argon2-cffi==23.1.0 argon2-cffi-bindings==21.2.0 array_record==0.5.1 arrow==1.3.0 asttokens==2.4.1 astunparse==1.6.3 async-lru==2.0.4 async-timeout==4.0.3 attrs==23.2.0 Babel==2.15.0 backcall==0.2.0 beautifulsoup4==4.12.3 bleach==6.1.0 cachetools==5.3.3 certifi==2024.6.2 cffi==1.16.0 charset-normalizer==3.3.2 click==8.1.7 cloudpickle==2.2.1 colorama==0.4.6 comm==0.2.2 contourpy==1.2.1 cramjam==2.8.3 crcmod==1.7 cycler==0.12.1 debugpy==1.8.1 decorator==5.1.1 defusedxml==0.7.1 dill==0.3.1.1 dm-tree==0.1.8 dnspython==2.6.1 docker==4.4.4 docopt==0.6.2 docstring_parser==0.16 etils==1.7.0 exceptiongroup==1.2.1 executing==2.0.1 facets-overview==1.1.1 fastavro==1.9.4 fasteners==0.19 fastjsonschema==2.20.0 flatbuffers==24.3.25 fonttools==4.53.0 fqdn==1.5.1 fsspec==2024.6.0 gast==0.5.4 google-api-core==2.19.0 google-api-python-client==1.12.11 google-apitools==0.5.31 google-auth==2.30.0 google-auth-httplib2==0.2.0 google-auth-oauthlib==1.2.0 google-cloud-aiplatform==1.56.0 google-cloud-bigquery==3.24.0 google-cloud-bigquery-storage==2.25.0 google-cloud-bigtable==2.24.0 google-cloud-core==2.4.1 google-cloud-dataproc==5.9.3 google-cloud-datastore==2.19.0 google-cloud-dlp==3.18.0 google-cloud-language==2.13.3 google-cloud-pubsub==2.21.3 google-cloud-pubsublite==1.10.0 google-cloud-recommendations-ai==0.10.10 google-cloud-resource-manager==1.12.3 google-cloud-spanner==3.47.0 google-cloud-storage==2.17.0 google-cloud-videointelligence==2.13.3 google-cloud-vision==3.7.2 google-crc32c==1.5.0 google-pasta==0.2.0 google-resumable-media==2.7.1 googleapis-common-protos==1.63.1 grpc-google-iam-v1==0.13.0 grpc-interceptor==0.15.4 grpcio==1.64.1 grpcio-status==1.48.2 h11==0.14.0 h5py==3.11.0 hdfs==2.7.3 httpcore==1.0.5 httplib2==0.22.0 httpx==0.27.0 idna==3.7 immutabledict==4.2.0 importlib_resources==6.4.0 ipykernel==6.29.4 ipython==8.25.0 ipython-genutils==0.2.0 ipywidgets==8.1.3 isoduration==20.11.0 jedi==0.19.1 
Jinja2==3.1.4 joblib==1.4.2 Js2Py==0.74 json5==0.9.25 jsonpickle==3.2.1 jsonpointer==3.0.0 jsonschema==4.22.0 jsonschema-specifications==2023.12.1 jupyter-events==0.10.0 jupyter-lsp==2.2.5 jupyter_client==8.2.0 jupyter_core==5.7.2 jupyter_server==2.14.1 jupyter_server_terminals==0.5.3 jupyterlab==4.2.2 jupyterlab_pygments==0.3.0 jupyterlab_server==2.27.2 jupyterlab_widgets==3.0.11 keras==2.15.0 keras-tuner==1.4.7 kiwisolver==1.4.5 kt-legacy==1.0.5 kubernetes==12.0.1 libclang==18.1.1 lxml==5.2.2 Markdown==3.6 MarkupSafe==2.1.5 matplotlib==3.9.0 matplotlib-inline==0.1.7 mistune==3.0.2 ml-dtypes==0.3.2 ml-metadata==1.15.0 ml-pipelines-sdk==1.15.1 mplcyberpunk==0.7.1 nbclient==0.10.0 nbconvert==7.16.4 nbformat==5.10.4 nest-asyncio==1.6.0 nltk==3.8.1 notebook==7.2.1 notebook_shim==0.2.4 numpy==1.26.4 oauth2client==4.1.3 oauthlib==3.2.2 objsize==0.7.0 opt-einsum==3.3.0 orjson==3.10.5 overrides==7.7.0 packaging==24.1 pandas==1.5.3 pandocfilters==1.5.1 parso==0.8.4 pathlib==1.0.1 pexpect==4.9.0 pickleshare==0.7.5 pillow==10.3.0 platformdirs==4.2.2 portalocker==2.8.2 portpicker==1.6.0 prometheus_client==0.20.0 promise==2.3 prompt_toolkit==3.0.47 proto-plus==1.23.0 protobuf==3.20.3 psutil==5.9.8 ptyprocess==0.7.0 pure-eval==0.2.2 pyarrow==10.0.1 pyarrow-hotfix==0.6 pyasn1==0.6.0 pyasn1_modules==0.4.0 pycparser==2.22 pydantic==2.7.4 pydantic_core==2.18.4 pydot==1.4.2 pyfarmhash==0.3.2 Pygments==2.18.0 pyjsparser==2.7.1 pymongo==4.7.3 pyparsing==3.1.2 python-dateutil==2.9.0.post0 python-json-logger==2.0.7 python-snappy==0.7.2 pytz==2024.1 PyYAML==6.0.1 pyzmq==26.0.3 redis==5.0.6 referencing==0.35.1 regex==2024.5.15 requests==2.32.3 requests-oauthlib==2.0.0 rfc3339-validator==0.1.4 rfc3986-validator==0.1.1 rouge_score==0.1.2 rpds-py==0.18.1 rsa==4.9 sacrebleu==2.4.2 scipy==1.12.0 seaborn==0.13.2 Send2Trash==1.8.3 shapely==2.0.4 simple_parsing==0.1.5 six==1.16.0 sniffio==1.3.1 soupsieve==2.5 sqlparse==0.5.0 stack-data==0.6.3 tabulate==0.9.0 tensorboard==2.15.2 
tensorboard-data-server==0.7.2 tensorflow==2.15.1 tensorflow-data-validation==1.15.1 tensorflow-datasets==4.9.6 tensorflow-estimator==2.15.0 tensorflow-hub==0.15.0 tensorflow-io-gcs-filesystem==0.37.0 tensorflow-metadata==1.15.0 tensorflow-serving-api==2.15.1 tensorflow-transform==1.15.0 tensorflow_model_analysis==0.46.0 termcolor==2.4.0 terminado==0.18.1 tfx==1.15.1 tfx-bsl==1.15.1 timeloop==1.0.2 tinycss2==1.3.0 toml==0.10.2 tomli==2.0.1 tornado==6.4.1 tqdm==4.66.4 traitlets==5.14.3 types-python-dateutil==2.9.0.20240316 typing_extensions==4.12.2 tzlocal==5.2 uri-template==1.3.0 uritemplate==3.0.1 urllib3==2.2.2 wcwidth==0.2.13 webcolors==24.6.0 webencodings==0.5.1 websocket-client==1.8.0 Werkzeug==3.0.3 widgetsnbextension==4.0.11 wrapt==1.14.1 zipp==3.19.2 zstandard==0.22.0

Current Behavior

I have a simple pipeline consisting of only one component (CsvExampleGen) that ingests CSV files and converts them to TFRecords.
However, upon running the pipeline, I get the following warning:
WARNING:apache_beam.io.tfrecordio:Couldn't find python-snappy so the implementation of _TFRecordUtil._masked_crc32c is not as fast as it could be.
I have already installed python-snappy and the corresponding C library using the following commands:

sudo apt-get install libsnappy-dev
pip install python-snappy

Expected behavior: The execution of this simple pipeline should be much faster and no such warnings should be produced.

Standalone code to reproduce the issue

Download any moderately sized CSV file with numerical data, save it as data/data.csv, and run the following code:

import os

from tfx.proto import example_gen_pb2
from tfx.components.example_gen.csv_example_gen.component import CsvExampleGen
from tfx.orchestration.pipeline import Pipeline
from tfx.orchestration.local.local_dag_runner import LocalDagRunner
from tfx.orchestration import metadata

pipeline_root = 'artifacts'
data_dir = 'data'

input_config = example_gen_pb2.Input(
    splits=[
        example_gen_pb2.Input.Split(name='data', pattern='data.csv')
    ]
)

output_config = example_gen_pb2.Output(
    split_config=example_gen_pb2.SplitConfig(
        splits=[
            example_gen_pb2.SplitConfig.Split(name='train', hash_buckets=8),
            example_gen_pb2.SplitConfig.Split(name='eval', hash_buckets=2)
        ]
    )
)

example_gen = CsvExampleGen(
    input_base=data_dir,
    input_config=input_config,
    output_config=output_config,
)

pipeline = Pipeline(
    pipeline_name='testing pipeline',
    pipeline_root=pipeline_root,
    components=[
        example_gen
    ],
    enable_cache=True,
    metadata_connection_config=metadata.sqlite_metadata_connection_config(
        os.path.join(pipeline_root, 'metadata.sqlite')
    )
)

LocalDagRunner().run(pipeline)


lego0901 · 2024-07-24T03:12:35Z

Hi, sorry for responding to this issue so late.

The warning appears when it fails to import python snappy, as per https://github.com/apache/beam/blob/v2.56.0/sdks/python/apache_beam/io/tfrecordio.py#L48.

Could you please run python3 -c 'import snappy' to check whether it imports properly? Thanks!
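If the one-liner succeeds but the warning persists, a slightly fuller check may help narrow things down. The sketch below is only a diagnostic helper, not part of TFX or Beam: `diagnose_import` is a hypothetical name, and the private attribute names `_crc32c` and `_snappy` are an assumption about what the linked tfrecordio.py probes for, so please confirm them against that file. It reports which interpreter is running, where the module was loaded from, and which version of the distribution is installed. Run it with the same Python that runs the pipeline.

```python
import importlib
import importlib.metadata
import sys


def diagnose_import(module_name, dist_name=None, private_attrs=()):
    """Describe how `module_name` resolves in the current interpreter."""
    report = {
        "python": sys.executable,   # interpreter actually running this check
        "importable": False,
        "location": None,           # file the module was loaded from
        "version": None,            # installed distribution version, if known
        "attrs": {},                # presence of each probed attribute
        "error": None,
    }
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:  # ImportError, or e.g. a missing shared library
        report["error"] = repr(exc)
        return report

    report["importable"] = True
    report["location"] = getattr(module, "__file__", None)
    for name in private_attrs:
        report["attrs"][name] = hasattr(module, name)
    if dist_name is not None:
        try:
            report["version"] = importlib.metadata.version(dist_name)
        except importlib.metadata.PackageNotFoundError:
            pass  # module imported, but no distribution with this name
    return report


if __name__ == "__main__":
    # ASSUMPTION: "_crc32c" and "_snappy" are the attributes Beam looks for;
    # verify against the tfrecordio.py source linked above.
    result = diagnose_import(
        "snappy",
        dist_name="python-snappy",
        private_attrs=("_crc32c", "_snappy"),
    )
    for key, value in result.items():
        print(f"{key}: {value}")
```

If `importable` is False, the `error` field should show why. If it is True but `python` or `location` point outside the environment used for the pipeline, the package may have been installed for a different interpreter.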

sagnik-t added the type:bug label Jun 23, 2024

singhniraj08 assigned AnuarTB Jul 23, 2024

singhniraj08 added the stat:awaiting tensorflower label Jul 23, 2024
