forked from gilbertlee-amd/ucx
-
Notifications
You must be signed in to change notification settings - Fork 0
/
NEWS
232 lines (202 loc) · 9.32 KB
/
NEWS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
#
## Copyright (C) Mellanox Technologies Ltd. 2001-2015. ALL RIGHTS RESERVED.
## Copyright (C) UT-Battelle, LLC. 2014-2015. ALL RIGHTS RESERVED.
## Copyright (C) ARM Ltd. 2017-2018. ALL RIGHTS RESERVED.
##
## See file LICENSE for terms.
##
#
## Current
Features:
- TBD
Tested configurations:
- RDMA: MLNX_OFED 4.5, distribution inbox drivers, rdma-core 22.1
- CUDA: gdrcopy 1.3.2, cuda 9.2
- XPMEM: 2.6.2
- KNEM: 1.1.3
## 1.5.0 (February 14, 2019)
Features:
- New emulation mode enabling full UCX functionality (Atomic, Put, Get)
over TCP and RDMA-CORE interconnects that don't implement full RDMA semantics
- Non-blocking API for all one-sided operations. All blocking communication APIs marked
as deprecated
- New client/server connection establishment API, which allows connected handover between workers
- Support for rdma-core direct-verbs (DEVX) and DC with mlx5 transports
- GPU - Support for stream API and receive side pipelining
- Malloc hooks using binary instrumentation instead of symbol override
- Statistics for UCT tag API
- GPU-to-Infiniband HCA affinity support based on locality/distance (PCIe)
Bugfixes:
- Fix overflow in RC/DC flush operations
- Update description in SPEC file and README
- Fix RoCE source port for dc_mlx5 flow control
- Improve ucx_info help message
- Fix segfault in UCP, due to int truncation in count_one_bits()
- Multiple other bugfixes (full list on github)
Tested configurations:
- InfiniBand: MLNX_OFED 4.4-4.5, distribution inbox drivers, rdma-core
- CUDA: gdrcopy 1.2, cuda 9.1.85
- XPMEM: 2.6.2
- KNEM: 1.1.2
## 1.4.0-rc2 (October 23, 2018)
Features:
- Improved support for installation with latest ROCm
- Improved support for latest rdma-core
- Adding support for CUDA IPC for intra-node GPU
- Added support for CUDA memory allocation cache for mem-type detection
- Added support for latest Mellanox devices
- Added support for Nvidia GPU managed memory
- Added support for multiple connections between the same pair of workers
- Added support large worker address for client/server connection establishment
and INADDR_ANY
- Added support for bitwise atomics operations
Bugfixes:
- Performance fixes for rendezvous protocol
- Memory hook fixes
- Clang support fixes
- Self tl multi-rail fix
- Thread safety fixes in IB/RDMA transport
- Compilation fixes with upstream rdma-core
- Multiple minor bugfixes (full list on github)
- Segfault fix for a code generated by armclang compiler
- UCP memory-domain index fix for zero-copy active messages
Tested configurations:
- InfiniBand: MLNX_OFED 4.2-4.4, distribution inbox drivers, rdma-core
- CUDA: gdrcopy 1.2, cuda 9.1.85
- XPMEM: 2.6.2
- KNEM: 1.1.2
- Multiple bugfixes (full list on github)
Known issues:
#2919 - Segfault in CUDA support when KNEM not present and CMA is active
intra-node RMA transport. As a workaround user can disable CMA support at
compile time: --disable-cma. Alternatively user can remove CMA from UCX_TLS
list, for example: UCX_TLS=mm,rc,cuda_copy,cuda_ipc,gdr_copy.
## 1.3.1 (August 20, 2018)
Bugfixes:
- Prevent potential out-of-order sending in shared memory active messages
- CUDA: Include cudamem.h in source tarball, pass cudaFree memory size
- Registration cache: fix large range lookup, handle shmat(REMAP)/mmap(FIXED)
- Limit IB CQE size for specific ARM boards
- RPM: explicitly set gcc-c++ as requirement
- Multiple bugfixes (full list on github)
Tested configurations:
- InfiniBand: MLNX_OFED 4.2, inbox OFED drivers.
- CUDA: gdrcopy 1.2, cuda 9.1.85
- XPMEM: 2.6.2
- KNEM: 1.1.2
## 1.3.0 (February 15, 2018)
Features:
- Added stream-based communication API to UCP
- Added support for GPU platforms: Nvidia CUDA and AMD ROCM software stacks
- Added API for client/server based connection establishment
- Added support for TCP transport
- Support for InfiniBand tag-matching offload for DC and accelerated transports
- Multi-rail support for eager and rendezvous protocols
- Added support for tag-matching communications with CUDA buffers
- Added ucp_rkey_ptr() to obtain pointer for shared memory region
- Avoid progress overhead on unused transports
- Improved scalability of software tag-matching by using a hash table
- Added transparent huge-pages allocator
- Added non-blocking flush and disconnect for UCP
- Support fixed-address memory allocation via ucp_mem_map()
- Added ucp_tag_send_nbr() API to avoid send request allocation
- Support global addressing in all IB transports
- Add support for external epoll fd and edge-triggered events
- Added registration cache for knem
- Initial support for Java bindings
Bugfixes:
- Multiple bugfixes (full list on github)
Tested configurations:
- InfiniBand: MLNX_OFED 4.2, inbox OFED drivers.
- CUDA: gdrcopy 1.2, cuda 9.1.85
- XPMEM: 2.6.2
- KNEM: 1.1.2
Known issues:
#2047 - UCP: ucp_do_am_bcopy_multi drops data on UCS_ERROR_NO_RESOURCE
#2047 - failure in ud/uct_flush_test.am_zcopy_flush_ep_nb/1
#1977 - failure in shm/test_ucp_rma.blocking_small/0
#1926 - Timeout in mpi_test_suite with HW TM
#1920 - transport retry count exceeded in many-to-one tests
#1689 - Segmentation fault on memory hooks test in jenkins
## 1.2.2 (January 4, 2018)
Main:
- Support including UCX API headers from C++ code
- UD transport to handle unicast flood on RoCE fabric
- Compilation fixes for gcc 7.1.1, clang 3.6, clang 5
Details:
- When UD transport is used with RoCE, packets intended for other peers may
arrive on different adapters (as a result of unicast flooding).
- This change adds packet filtering based on destination GIDs. Now the packet
is silently dropped, if its destination GID does not match the local GID.
- Added a new device ID for InfiniBand HCA
- [packaging] Move `examples/` and `perftest/` into doc
- [packaging] Update spec to work on old distros while complaint with Fedora
guidelines
- [cleanup] Removed unused ptmalloc version (2.83)
- [cleanup] Fixup license headers
## 1.2.1 (August 28, 2017)
- Compilation fixes for gcc 7.1
- Spec file cleanups
- Versioning cleanups
## 1.2.0 (June 15, 2017)
Supported platforms
- Shared memory: KNEM, CMA, XPMEM, SYSV, Posix
- VERBs over InfiniBand and RoCE.
VERBS over other RDMA interconnects (iWarp, OmniPath, etc.) is available
for community evaluation and has not been tested in context of this release
- Cray Gemini and Aries
- Architectures: x86_64, ARMv8 (64bit), Power64
Features:
- Added support for InfiniBand DC and UD transports, including accelerated verbs for Mellanox devices
- Full support for PGAS/SHMEM interfaces, blocking and non-blocking APIs
- Support for MPI tag matching, both in software and offload mode
- Zero copy protocols and rendezvous, registration cache
- Handling transport errors
- Flow control for DC/RC
- Dataypes support: contiguous, IOV, generic
- Multi-threading support
- Support for ARMv8 64bit architecture
- A new API for efficient memory polling
- Support for malloc-hooks and memory registration caching
Bugfixes:
- Multiple bugfixes improving overall stability of the library
Known issues:
#1604 - Failure in ud/test_ud_slow_timer.retransmit1/1 with valgrind bug
#1588 - Fix reading cpuinfo timebase for ppc bug portability training
#1579 - Ud/test_ud.ca_md test takes too long too complete bug
#1576 - Failure in ud/test_ud_slow_timer.retransmit1/0 with valgrind bug
#1569 - Send completion with error with dc_verbs bug
#1566 - Segfault in malloc_hook.fork on arm bug
#1565 - Hang in udrc/test_ucp_rma.nonblocking_stream_get_nbi_flush_worker bug
#1534 - Wireup.c:473 Fatal: endpoint reconfiguration not supported yet bug
#1533 - Stack overflow under Valgrind 'rc_mlx5/uct_p2p_err_test.local_access_error/0' bug
#1513 - Hang in MPI_Finalize with UCX_TLS=rc[_x],sm on the bsend2 test bug
#1504 - Failure in cm/uct_p2p_am_test.am_bcopy/1 bug
#1492 - Hang when using polling fd bug
#1489 - Hang on the osu_fop_latency test with RoCE bug
#1005 - ROcE problem with OMPI direct modex - UD assertion
## 1.1.0 (September 1, 2015)
Workarounds:
Features:
- Added support for AM based on FIFO in `mm` shared memory transport
- Added support for UCT `knem` shared memory transport (http://knem.gforge.inria.fr)
- Added support for UCT `mm/xpmem` shared memory transport (https://github.com/hjelmn/xpmem)
Bugfixes:
Known issues:
## 1.0.0 (July 22, 2015)
Features:
- Added support for UCT `cma` shared memory transport (Cross-Memory Attatch)
- Added support for UCT `mm` shared memory transport with mmap/sysv APIs
- Added support for UCT `rc` transport based on Infiniband/RC with verbs
- Added support for UCT `mlx5_rc` transport based on Infiniband/RC with accelerated verbs
- Added support for UCT `cm` transport based on Infiniband/SIDR (Service ID Resolution)
- Added support for UCT `ugni` transport based on Cray/UGNI
- Added support for Doxygen based documentation generation
- Added support for UCP basic protocol layer to fit PGAS paradigm (RMA, AMO)
- Added ucx_perftest utility to exercise major UCX flows and provide performance metrics
- Added test script for jenkins (contrib/test_jenkins.sh)
- Added packaging for RPM/DEB based linux distributions (see contrib/buildrpm.sh)
- Added Unit-tests infractucture for UCX functionality based on Google Test framework (see test/gtest/)
- Added initial integration for OpenMPI with UCX for PGAS/SHMEM API
(see: https://github.com/openucx/ompi-mirror/pull/1)
- Added end-to-end testing infrastructure based on MTT (see contrib/mtt/README_MTT)