Mismatch between TensorRT version used in TF 2.14 GPU docker images for tensorflow/serving and tensorflow/tensorflow causes segfault during inference
#2205
Open · saimidu opened this issue on Feb 22, 2024 · 1 comment
If you open a GitHub issue, here is our policy: it must be a bug, a feature request, or a significant problem with documentation (for small docs fixes please send a PR instead), and the form below must be filled out.
Here's why we have that policy: TensorFlow developers respond to issues.
We want to focus on work that benefits the whole community, e.g., fixing bugs
and adding features. Support only helps individuals. GitHub also notifies
thousands of people when issues are filed. We want them to see you communicating
an interesting problem, rather than being redirected to Stack Overflow.
Feature Request
If this is a feature request, please fill out the following form in full:
Describe the problem the feature is intended to solve
The TensorFlow Serving 2.14.1 GPU Dockerfile pins TensorRT 8.4.3, which does not match the TensorRT 8.6.1.6 installed in the tensorflow/tensorflow:2.14.0-gpu image.
This mismatch leads to segfaults with errors like the following:
2024-02-19 18:03:52.300479: E external/org_tensorflow/tensorflow/compiler/tf2tensorrt/utils/trt_logger.cc:87] DefaultLogger 1: [stdArchiveReader.cpp::StdArchiveReader::30] Error Code 1: Serialization (Serialization assertion magicTagRead == kMAGIC_TAG failed.Magic tag does not match)
2024-02-19 18:03:52.309645: E external/org_tensorflow/tensorflow/compiler/tf2tensorrt/utils/trt_logger.cc:87] DefaultLogger 4: [runtime.cpp::deserializeCudaEngine::50] Error Code 4: Internal Error (Engine deserialization failed.)
This can be avoided by using the tensorflow/tensorflow:2.13.0-gpu image to convert models for TensorRT, but that workaround may not work for every model and is not guaranteed to stay compatible with future TensorFlow versions.
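A quick way to confirm the mismatch is to inspect the TensorRT runtime packages inside each image. The sketch below assumes the libraries were installed as Debian packages (libnvinfer); if an image instead pulls TensorRT in via pip, a `pip list | grep -i tensorrt` check would show it.

```bash
# Sketch: list the TensorRT runtime (libnvinfer) packages baked into each image.
# Assumes apt-installed TensorRT; also checks pip for the TF image just in case.
docker run --rm --entrypoint bash tensorflow/tensorflow:2.14.0-gpu \
  -c "dpkg -l | grep -i -E 'libnvinfer|tensorrt'; pip list 2>/dev/null | grep -i tensorrt"
docker run --rm --entrypoint bash tensorflow/serving:2.14.1-gpu \
  -c "dpkg -l | grep -i -E 'libnvinfer|tensorrt'"
```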
Describe the solution
Please keep the TensorRT and libnvinfer versions used in the TensorFlow and TensorFlow Serving docker images consistent with one another.
Describe alternatives you've considered
An alternative would be to build TensorFlow from source against a TensorRT version that matches the Serving image, or to build a custom TensorFlow Serving image that upgrades its TensorRT packages to the version used in the TensorFlow image (see the sketch below).
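A rough sketch of the second alternative, layering a newer TensorRT runtime on top of the published serving image, follows. The output image tag, the pinned libnvinfer version string, and the availability of those packages in the base image's apt sources are all assumptions; verify what is actually offered with apt-cache madison libnvinfer8, and note that running a server built against TensorRT 8.4 on the 8.6 runtime libraries is not guaranteed to behave correctly.

```bash
# Sketch only: build a derived serving image with TensorRT 8.6 runtime libraries.
# The pinned version string (8.6.1.6-1+cuda11.8) is an assumption; check it with
# `apt-cache madison libnvinfer8` inside the base image before relying on it.
docker build -t tfserving-2.14.1-trt8.6 - <<'EOF'
FROM tensorflow/serving:2.14.1-gpu
RUN apt-get update && \
    apt-get install -y --allow-downgrades --allow-change-held-packages \
        libnvinfer8=8.6.1.6-1+cuda11.8 \
        libnvinfer-plugin8=8.6.1.6-1+cuda11.8 && \
    rm -rf /var/lib/apt/lists/*
EOF
```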
Additional context
Bug Report
If this is a bug report, please fill out the following form in full:
System information
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 20.04 on an AWS p3.16xlarge EC2 instance (also reproducible on g4dn.xlarge and p4d.24xlarge)
TensorFlow Serving installed from (source or binary): binary
TensorFlow Serving version: 2.14
Describe the problem
The tensorflow/serving:2.14.1-gpu image ships an older TensorRT than the tensorflow/tensorflow:2.14.0-gpu image used to convert TF SavedModels, which results in segfaults such as the one shown above.
Exact Steps to Reproduce
1. Pull the tensorflow/tensorflow:2.14.0-gpu image.
2. Run trt_convert.TrtGraphConverterV2 on a saved TF model, then save the converted model (a sketch of this step appears after the error output below).
3. Pull the tensorflow/serving:2.14.1-gpu or 2.14.0-gpu image.
4. Run the serving container and serve the saved TensorRT-converted model.
5. The container will stop, and docker logs -f will show an error that looks like:
E external/org_tensorflow/tensorflow/compiler/tf2tensorrt/utils/trt_logger.cc:87] DefaultLogger 1: [stdArchiveReader.cpp::StdArchiveReader::30] Error Code 1: Serialization (Serialization assertion magicTagRead == kMAGIC_TAG failed.Magic tag does not match)
E external/org_tensorflow/tensorflow/compiler/tf2tensorrt/utils/trt_logger.cc:87] DefaultLogger 4: [runtime.cpp::deserializeCudaEngine::50] Error Code 4: Internal Error (Engine deserialization failed.)
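For reference, here is a minimal sketch of the conversion step (step 2), run inside the tensorflow/tensorflow:2.14.0-gpu image. The mount paths, model directories, and input_fn shape are placeholders for illustration rather than the exact commands used in this report; converter.build() is included because engines pre-built and serialized this way are presumably what the 8.4 runtime in the serving image then fails to deserialize.

```bash
# Sketch of the TF-TRT conversion step; adjust paths and input shapes to your model.
docker run --rm -i --gpus all \
  -v "$(pwd)/models:/models" \
  tensorflow/tensorflow:2.14.0-gpu \
  python - <<'EOF'
import numpy as np
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Convert the original SavedModel with TF-TRT (links against TensorRT 8.6 here).
converter = trt.TrtGraphConverterV2(input_saved_model_dir="/models/saved_model")
converter.convert()

# Optionally pre-build engines so they are serialized into the output SavedModel.
# The input shape below is a placeholder.
def input_fn():
    yield (np.zeros((1, 224, 224, 3), dtype=np.float32),)

converter.build(input_fn=input_fn)
converter.save("/models/tftrt_saved_model")
EOF
```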
Source code / logs
Include any logs or source code that would be helpful to diagnose the problem.
If including tracebacks, please include the full traceback. Large logs and files
should be attached. Try to provide a reproducible test case that is the bare
minimum necessary to generate the problem.
ubuntu@ip-172-31-63-199:~$ nvidia-docker run -itd --name tfs_2.14 -p 8501:8501 --mount type=bind,source=$(pwd)/serving/tensorflow_serving/example/models/tftrt_saved_model,target=/models/tftrt_saved_model/1 -e TEST_MODE=1 -e MODEL_NAME=tftrt_saved_model tensorflow/serving:2.14.1-gpu
947d6dc1b31f4bacfa175798fba2b64bb3cb04179898594e957684c736afdfff
ubuntu@ip-172-31-63-199:~$ docker logs -f tfs_2.14
2024-02-22 17:40:32.874354: I tensorflow_serving/model_servers/server.cc:74] Building single TensorFlow model file config: model_name: tftrt_saved_model model_base_path: /models/tftrt_saved_model
2024-02-22 17:40:32.874680: I tensorflow_serving/model_servers/server_core.cc:467] Adding/updating models.
2024-02-22 17:40:32.874703: I tensorflow_serving/model_servers/server_core.cc:596] (Re-)adding model: tftrt_saved_model
2024-02-22 17:40:33.117052: I tensorflow_serving/core/basic_manager.cc:739] Successfully reserved resources to load servable {name: tftrt_saved_model version: 1}
2024-02-22 17:40:33.117107: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: tftrt_saved_model version: 1}
2024-02-22 17:40:33.117127: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: tftrt_saved_model version: 1}
2024-02-22 17:40:33.117219: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:83] Reading SavedModel from: /models/tftrt_saved_model/1
2024-02-22 17:40:33.117759: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:51] Reading meta graph with tags { serve }
2024-02-22 17:40:33.117783: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:146] Reading SavedModel debug info (if present) from: /models/tftrt_saved_model/1
2024-02-22 17:40:33.117882: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-02-22 17:40:35.482553: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.483125: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.483631: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.484161: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.484655: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.485148: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.485634: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.486120: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.523103: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.523737: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.524237: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.524723: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.525205: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.525684: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.526164: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.526644: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.527119: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.527603: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.528125: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.528614: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.529090: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.529568: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.530044: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.530521: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.603388: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.603972: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.604491: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.605008: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.605517: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.606024: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.606529: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.607036: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.607549: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.608095: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.608595: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.609087: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.609575: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.610061: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.610546: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.611033: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.611547: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.612051: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14791 MB memory: -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:17.0, compute capability: 7.0
2024-02-22 17:40:35.612768: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.613249: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 14791 MB memory: -> device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:18.0, compute capability: 7.0
2024-02-22 17:40:35.613840: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.614310: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 14791 MB memory: -> device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:19.0, compute capability: 7.0
2024-02-22 17:40:35.614890: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.615362: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 14791 MB memory: -> device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1a.0, compute capability: 7.0
2024-02-22 17:40:35.615955: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.616434: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:4 with 14791 MB memory: -> device: 4, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1b.0, compute capability: 7.0
2024-02-22 17:40:35.617027: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.617523: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:5 with 14791 MB memory: -> device: 5, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1c.0, compute capability: 7.0
2024-02-22 17:40:35.618086: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.618562: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:6 with 14791 MB memory: -> device: 6, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1d.0, compute capability: 7.0
2024-02-22 17:40:35.619105: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-22 17:40:35.619574: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:7 with 14791 MB memory: -> device: 7, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0
2024-02-22 17:40:35.663731: I external/org_tensorflow/tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:382] MLIR V1 optimization pass is not enabled
2024-02-22 17:40:35.664152: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:233] Restoring SavedModel bundle.
2024-02-22 17:40:36.014646: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:217] Running initialization op on SavedModel bundle at path: /models/tftrt_saved_model/1
2024-02-22 17:40:36.572758: E external/org_tensorflow/tensorflow/compiler/tf2tensorrt/utils/trt_logger.cc:87] DefaultLogger 1: [stdArchiveReader.cpp::StdArchiveReader::30] Error Code 1: Serialization (Serialization assertion magicTagRead == kMAGIC_TAG failed.Magic tag does not match)
2024-02-22 17:40:36.581946: E external/org_tensorflow/tensorflow/compiler/tf2tensorrt/utils/trt_logger.cc:87] DefaultLogger 4: [runtime.cpp::deserializeCudaEngine::50] Error Code 4: Internal Error (Engine deserialization failed.)
/usr/bin/tf_serving_entrypoint.sh: line 3: 7 Segmentation fault (core dumped) tensorflow_model_server --port=8500 --rest_api_port=8501 --model_name=${MODEL_NAME} --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME} "$@"
I can see TensorFlow is using TensorRT 8.6 and currently TF Serving uses 8.4.3. Let us keep this issue as a feature request to upgrade the TensorRT version to 8.6.1. I will forward this request to the team internally and update this issue once we have an update. Thank you!