ESMX build failures on Discover/NAG nightlies #322

Open
theurich opened this issue Nov 14, 2024 · 1 comment

@theurich
Member

Our current Discover/NAG report summary looks like this:

ESMF hash: v8.8.0b04-7-g93a7b1a
Collection timestamp: 2024-11-12 23:28:42
Build timestamp: 2024-11-12 22:36:49
Clone timestamp: 2024-11-12 22:08:39
Test dir: /discover/nobackup/projects/gmao/SIteam/ESMF_Testing/nag_7.2.15_openmpi_O_develop
Machine: discover
Job: 40685210

RESULTS:
================================
build:			PASS
unit tests:		PASS 14208 FAIL 1
system tests:		PASS 51 FAIL 0
example tests:		PASS 81 FAIL 0
nuopc tests:		PASS 52 FAIL 4
esmpy install:		NONE
esmpy tests:		NONE

The one failing unit test is ESMF_ArrayRedistPerfUTest.F90, which fails because it does not meet our performance threshold:

   FAIL: openmpi/O:   Check ArrayRedistStore() performance - Test, ESMF_ArrayRedistPerfUTest.F90, line 248:  ArrayRedistStore() performance problem!    2.3102508769999996 >   2.0000000000000000
   FAIL: openmpi/O:   Check ArrayRedistStore() performance - Test, ESMF_ArrayRedistPerfUTest.F90, line 248:  ArrayRedistStore() performance problem!    2.3102568969999999 >   2.0000000000000000
   FAIL: openmpi/O:   Check ArrayRedistStore() performance - Test, ESMF_ArrayRedistPerfUTest.F90, line 248:  ArrayRedistStore() performance problem!    2.3102608169999996 >   2.0000000000000000
   FAIL: openmpi/O:   Check ArrayRedistStore() performance - Test, ESMF_ArrayRedistPerfUTest.F90, line 248:  ArrayRedistStore() performance problem!    2.3102768469999999 >   2.0000000000000000
   FAIL: openmpi/O:   Check ArrayRedistStore() performance - Test, ESMF_ArrayRedistPerfUTest.F90, line 248:  ArrayRedistStore() performance problem!    2.3103082779999999 >   2.0000000000000000
   FAIL: openmpi/O:   Check ArrayRedistStore() performance - Test, ESMF_ArrayRedistPerfUTest.F90, line 248:  ArrayRedistStore() performance problem!    2.3102446169999999 >   2.0000000000000000

This is essentially benign, because NAG is not the compiler to use when performance matters. People mainly use NAG because it is very strict wrt Fortran standard checking.

More interesting are the 4 ESMX failures. The NUOPC app prototypes report card looks like this:

Tue Nov 12 23:28:20 EST 2024
== TEST SUMMARY START ==
PASS: AsyncIOBlockingProto
PASS: AsyncIONonblockingProto
PASS: AtmOcnConOptsProto
PASS: AtmOcnConProto
PASS: AtmOcnCplListProto
PASS: AtmOcnCplSetProto
PASS: AtmOcnFDSynoProto
PASS: AtmOcnIceSimpleImplicitProto
PASS: AtmOcnImplicitProto
PASS: AtmOcnLndProto
PASS: AtmOcnLogNoneProto
PASS: AtmOcnMedIngestFromConfigProto
PASS: AtmOcnMedIngestFromInternalProto
PASS: AtmOcnMedPetListProto
PASS: AtmOcnMedPetListTimescalesProto
PASS: AtmOcnMedPetListTimescalesSplitFastProto
PASS: AtmOcnMedProto
PASS: AtmOcnMirrorFieldsProto
PASS: AtmOcnPetListProto
PASS: AtmOcnProto
PASS: AtmOcnRtmTwoTimescalesProto
PASS: AtmOcnScalarProto
PASS: AtmOcnSelectProto
PASS: AtmOcnSelectProto
PASS: AtmOcnSelectProto
PASS: AtmOcnSimpleImplicitProto
PASS: AtmOcnTransferGridProto
PASS: AtmOcnTransferLocStreamProto
PASS: AtmOcnTransferMeshProto
PASS: CustomFieldDictionaryProto
PASS: DriverInDriverDataDepProto
PASS: DriverInDriverProto
PASS: DynPhyProto
PASS: ExternalDriverAPIProto
PASS: ExternalDriverAPIWeakCplDAProto
PASS: GenericMediatorProto
PASS: HierarchyProto
PASS: NamespaceProto
PASS: NestingMultipleProto
PASS: NestingSingleProto
PASS: NestingTelescopeMultipleProto
PASS: SingleModelProto
PASS: SingleModelOpenMPProto
PASS: SingleModelOpenMPUnawareProto
PASS: ESMX_StartHereProto
PASS: ESMX_StartHereProto-Step1
PASS: ESMX_StartHereProto-Step2
PASS: ESMX_StartHereProto-Step3
PASS: ESMX_StartHereProto-Step4
PASS: ESMX_SingleModelInFortranProto
PASS: ESMX_SingleModelInFortranProto-DL
PASS: ESMX_SingleModelInCProto-DL
FAIL: ESMX_AtmOcnProto
FAIL: ESMX_AtmOcnProto-Alt
FAIL: ESMX_AtmOcnFortranAndCProto
FAIL: ESMX_ExternalDriverAPIProto
== TEST SUMMARY STOP ==
Tue Nov 12 23:28:20 EST 2024

Several ESMX app protos pass, indicating that this is NOT a general ESMX/CMake build issue with NAG. The 4 failing tests fail during the final link stage with duplicate symbol errors. The common theme in those 4 tests is the specification of an OpenMP dependency in esmxBuild.yaml via

application:
  link_packages: OpenMP

which leads to errors like the following during linking:

/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: /discover/nobackup/projects/gmao/SIteam/comp/SLES-15/nag/7.2.15/lib/NAG_Fortran/safefit.o: in function `__NAGf90_pAlloc':
newfit.c:(.text+0x17b30): multiple definition of `__NAGf90_pAlloc'; /discover/nobackup/projects/gmao/SIteam/comp/SLES-15/nag/7.2.15/lib/NAG_Fortran/safefit.o:newfit.c:(.text+0x17b30): first defined here
/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: /discover/nobackup/projects/gmao/SIteam/comp/SLES-15/nag/7.2.15/lib/NAG_Fortran/safefit.o: in function `__NAGf90_lpAlloc':
newfit.c:(.text+0x17b70): multiple definition of `__NAGf90_lpAlloc'; /discover/nobackup/projects/gmao/SIteam/comp/SLES-15/nag/7.2.15/lib/NAG_Fortran/safefit.o:newfit.c:(.text+0x17b70): first defined here
/usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: /discover/nobackup/projects/gmao/SIteam/comp/SLES-15/nag/7.2.15/lib/NAG_Fortran/safefit.o: in function `__NAGf90_oAlloc':
newfit.c:(.text+0x17bb0): multiple definition of `__NAGf90_oAlloc'; /discover/nobackup/projects/gmao/SIteam/comp/SLES-15/nag/7.2.15/lib/NAG_Fortran/safefit.o:newfit.c:(.text+0x17bb0): first defined here
...

After some searching I found that this seems to be a known issue in CMake. See: https://gitlab.kitware.com/cmake/cmake/-/issues/21280

Seems that @mathomp4 is aware of this CMake issue since he left a comment under the mentioned ticket.

I wonder if later CMake versions might handle this case better. We should see if there are later CMake versions on Discover we could test.
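To compare CMake versions, a small standalone probe project could print what FindOpenMP attaches to the imported Fortran target. This is only a diagnostic sketch, not part of ESMX: the project name `OpenMPProbe` is hypothetical, and it assumes (unconfirmed) that the duplicated NAG runtime objects reach the link line through the target's `INTERFACE_LINK_LIBRARIES` property.

```cmake
# Hypothetical probe project; no sources are needed because we only
# inspect the configure-time result of FindOpenMP.
cmake_minimum_required(VERSION 3.13)
project(OpenMPProbe LANGUAGES Fortran)

find_package(OpenMP COMPONENTS Fortran)

message(STATUS "CMake version:       ${CMAKE_VERSION}")
message(STATUS "Fortran compiler ID: ${CMAKE_Fortran_COMPILER_ID}")

if(TARGET OpenMP::OpenMP_Fortran)
  # get_target_property() yields "<var>-NOTFOUND" when a property is unset.
  get_target_property(_omp_libs OpenMP::OpenMP_Fortran INTERFACE_LINK_LIBRARIES)
  get_target_property(_omp_opts OpenMP::OpenMP_Fortran INTERFACE_COMPILE_OPTIONS)
  message(STATUS "OpenMP link libraries:   ${_omp_libs}")
  message(STATUS "OpenMP compile options:  ${_omp_opts}")
else()
  message(STATUS "OpenMP::OpenMP_Fortran target was not created")
endif()
```

Configuring this with each available CMake (for example `cmake -S . -B build -DCMAKE_Fortran_COMPILER=nagfor`) and comparing the printed link libraries would show whether a newer release still records NAG runtime objects such as `safefit.o` on the target.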

@mathomp4
Contributor

Ahhh. Yes, it is indeed an issue and I'm not sure if newer CMake fixes it yet (cc @bradking).

Currently in GEOS, I essentially hacked around it at some point, since even the latest CMake at that time had not fixed it. Perhaps you can implement the same workaround?

My workaround is:

# CMake has a bug with NAG and OpenMP:
#   https://gitlab.kitware.com/cmake/cmake/-/issues/21280
# so we work around it
if (OpenMP_Fortran_FOUND AND CMAKE_Fortran_COMPILER_ID STREQUAL "NAG")
  message(STATUS "NAG Fortran detected, resetting OpenMP flags to avoid CMake bug")
  set_property(TARGET OpenMP::OpenMP_Fortran PROPERTY INTERFACE_LINK_LIBRARIES "")
  set_property(TARGET OpenMP::OpenMP_Fortran PROPERTY INTERFACE_LINK_OPTIONS "-openmp")
endif()

where I just sort of..."shove" the -openmp flag into the OpenMP::OpenMP_Fortran target. Is this "good" CMake? No, but it seems to be working CMake!
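For context, here is a minimal sketch of where such a workaround might sit in a project, assuming it has to run after `find_package(OpenMP)` has created the imported target and before any target links against it. The project name `OpenMPNagDemo`, the executable `demo_app`, and the source `main.F90` are hypothetical placeholders, not names from ESMX or GEOS.

```cmake
# INTERFACE_LINK_OPTIONS requires CMake 3.13 or newer.
cmake_minimum_required(VERSION 3.13)
project(OpenMPNagDemo LANGUAGES Fortran)

# Creates OpenMP::OpenMP_Fortran when OpenMP support is detected.
find_package(OpenMP REQUIRED COMPONENTS Fortran)

# Workaround for https://gitlab.kitware.com/cmake/cmake/-/issues/21280:
# drop the detected link libraries and pass -openmp at link time instead.
if (OpenMP_Fortran_FOUND AND CMAKE_Fortran_COMPILER_ID STREQUAL "NAG")
  message(STATUS "NAG Fortran detected, resetting OpenMP flags to avoid CMake bug")
  set_property(TARGET OpenMP::OpenMP_Fortran PROPERTY INTERFACE_LINK_LIBRARIES "")
  set_property(TARGET OpenMP::OpenMP_Fortran PROPERTY INTERFACE_LINK_OPTIONS "-openmp")
endif()

# Hypothetical consumer: it links the (possibly patched) imported target.
add_executable(demo_app main.F90)
target_link_libraries(demo_app PRIVATE OpenMP::OpenMP_Fortran)
```

Because the properties are changed on the imported target itself, every consumer configured after this point picks up the patched link behavior without further changes. For other compilers the `if()` block is skipped and FindOpenMP's defaults are used.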
