Thread 'BOINC 7.18.x and later: Computation error oddly specific to ROCm'

Wedge009
Joined: 9 Jun 18
Posts: 13
Australia
Message 109857 - Posted: 22 Sep 2022, 3:14:15 UTC

This issue is oddly specific to AMD GPUs running ROCr-based OpenCL on Linux. It doesn't appear to be a problem for NVIDIA GPUs or AMD's legacy OpenCL on Linux, or for any Windows-based set-up. (AMD OpenCL support on Linux requires ROCm for Vega GPUs and later.)

Attempting to run GPU tasks for Einstein@Home results in a 'computation error' within ~10 seconds. An example:
<message>
process exited with code 69 (0x45, -187)</message>
<stderr_txt>
09:31:22 (11580): [normal]: This Einstein@home App was built at: Jan 16 2017 08:09:16

09:31:22 (11580): [normal]: Start of BOINC application '../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.18_x86_64-pc-linux-gnu__FGRPopencl1K-ati'.
09:31:22 (11580): [debug]: 1e+16 fp, 5.9e+09 fp/s, 1785112 s, 495h51m52s17
command line: ../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.18_x86_64-pc-linux-gnu__FGRPopencl1K-ati --inputfile ../../projects/einstein.phys.uwm.edu/LATeah3012L12220912.dat --alpha 2.59819959601 --delta -0.694603692878 --skyRadius 1.890770e-06 --ldiBins 15 --f0start 836.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 1.69860773e-15 --ephemdir ../../projects/einstein.phys.uwm.edu/JPLEPH --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile ../../projects/einstein.phys.uwm.edu/templates_LATeah3012L12220912_0844_11462382.dat --debug 0 --device 1 -o LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out
output files: 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out' '../../projects/einstein.phys.uwm.edu/LATeah3012L12220912_844.0_0_0.0_11462382_1_0' 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah3012L12220912_844.0_0_0.0_11462382_1_1'
09:31:22 (11580): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
09:31:22 (11580): [debug]: glibc version/release: 2.35/stable
09:31:22 (11580): [debug]: Set up communication with graphics process.
boinc_get_opencl_ids returned [0x1e97b40 , 0x7fabc0742d90]
Using OpenCL platform provided by: Advanced Micro Devices, Inc.
Using OpenCL device "gfx900:xnack-" by: Advanced Micro Devices, Inc.
Max allocation limit: 7287183768
Global mem size: 8573157376
Couldn't create OpenCL command queue (error: -6)!
OpenCL shutdown complete!
initialize_ocl returned error [2013]
OCL context null
OCL queue null
Error generating generic FFT context object [5]
09:31:22 (11580): [CRITICAL]: ERROR: MAIN() returned with error '5'
FPU status flags:
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out.cohfu': No such file or directory
09:31:34 (11580): [normal]: done. calling boinc_finish(69).
09:31:34 (11580): called boinc_finish

</stderr_txt>

I've determined that this issue appears to be tied to the BOINC version: I can confirm it's a problem with BOINC 7.18.1 and 7.20.2, but not with 7.16.17. All other hardware and software remains the same, even the same desktop session (i.e. no rebooting between BOINC installations). I wonder if it's a permissions issue, because of all the missing-file messages. Is there a change in how BOINC runs GPU tasks between 7.16.x and 7.18.x that ROCm might be sensitive to?

Here are some of my Linux hosts:
Ubuntu 20.04, ROCr-based OpenCL, can only run successfully up to BOINC 7.16.17:
https://einsteinathome.org/host/12803029

Ubuntu 22.04 (issue also occurs on 20.04), ROCr-based OpenCL, can only run successfully up to BOINC 7.16.17:
https://einsteinathome.org/host/12918837

Ubuntu 22.04, legacy OpenCL, running just fine with BOINC 7.20.2:
https://einsteinathome.org/host/12887570

On the other hand, I've found a host that's using BOINC 7.18.1 and appears to be running AMD GPU tasks fine, but I can't tell what its amdgpu set-up is. (I've attempted to contact the owner before but never got an answer.)
https://einsteinathome.org/host/12941414
Wedge009
Joined: 9 Jun 18
Posts: 13
Australia
Message 109915 - Posted: 30 Sep 2022, 4:51:53 UTC - in response to Message 109857.  

I did some digging. initialize_ocl() seems to be a function in the Einstein code, not in BOINC. For whatever reason, though, newer BOINC versions cause it to fail. According to the source code for Einstein BRP (which may well be out of date), error code 2013 is defined in demod_binary.h as RADPUL_OCL_MEM_ALLOC_DEVICE. It's one of the error codes the application returns when clCreateCommandQueue(), an OpenCL function, fails. The OpenCL error code -6 corresponds to CL_OUT_OF_HOST_MEMORY. That code seems to be returned for a variety of reasons, so I suspect the host isn't really out of memory and this is just some weird interaction between potentially old Einstein code and new BOINC versions. Why and how newer BOINC versions cause this is still a mystery to me, however.
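For anyone else reading a log like the one above, here is a minimal Python sketch that pulls the "(error: N)" values out of stderr text and maps them to OpenCL status names. The table is only a subset of the codes in the Khronos CL/cl.h header, and the function name decode_cl_errors is made up for this example; it is not part of BOINC or the Einstein@Home application.

```python
import re

# Subset of status codes from the Khronos OpenCL header (CL/cl.h).
CL_ERRORS = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -2: "CL_DEVICE_NOT_AVAILABLE",
    -3: "CL_COMPILER_NOT_AVAILABLE",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -30: "CL_INVALID_VALUE",
    -33: "CL_INVALID_DEVICE",
    -34: "CL_INVALID_CONTEXT",
    -35: "CL_INVALID_QUEUE_PROPERTIES",
}


def decode_cl_errors(log_text):
    """Return (code, name) pairs for every '(error: N)' found in log_text."""
    pairs = []
    for match in re.finditer(r"\(error:\s*(-?\d+)\)", log_text):
        code = int(match.group(1))
        # Codes missing from the subset above are reported as "unknown".
        pairs.append((code, CL_ERRORS.get(code, "unknown")))
    return pairs


if __name__ == "__main__":
    sample = "Couldn't create OpenCL command queue (error: -6)!"
    for code, name in decode_cl_errors(sample):
        print(code, name)
```

The name alone doesn't say why the queue creation failed; it only confirms which status the OpenCL runtime reported.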
Wedge009
Joined: 9 Jun 18
Posts: 13
Australia
Message 110044 - Posted: 8 Oct 2022, 0:22:13 UTC
Last modified: 8 Oct 2022, 0:22:30 UTC

It turns out that disabling some of the systemd hardening is a work-around for this issue. I consider it only a work-around because it wasn't necessary for BOINC 7.16.17, and presumably the hardening is there for good reason.

https://github.com/BOINC/boinc/issues/4948
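To see which hardening options a unit file sets before relaxing any of them, a rough Python sketch like the one below can list them. It is only an illustration: the sample unit text is hypothetical and is not the unit file BOINC ships, the prefix list covers just a few systemd sandboxing directives, and which directive actually matters for ROCm is not stated in this post (see the linked issue). On a real host the text could come from something like `systemctl cat boinc-client`, assuming the service is named boinc-client as on Ubuntu.

```python
# A few systemd sandboxing directive prefixes; not an exhaustive list.
HARDENING_PREFIXES = (
    "Protect",
    "Private",
    "Restrict",
    "NoNewPrivileges",
    "ReadOnlyPaths",
    "InaccessiblePaths",
)

# Hypothetical unit text, for illustration only.
SAMPLE_UNIT = """\
[Unit]
Description=Example client

[Service]
ExecStart=/usr/bin/example
ProtectSystem=strict
PrivateTmp=true
# ProtectHome=true
NoNewPrivileges=true
"""


def hardening_directives(unit_text):
    """Return {directive: value} for hardening options in the [Service] section."""
    found = {}
    section = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments (systemd accepts '#' and ';').
        if not line or line.startswith(("#", ";")):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section != "Service" or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key.startswith(HARDENING_PREFIXES):
            found[key] = value.strip()
    return found


if __name__ == "__main__":
    for key, value in hardening_directives(SAMPLE_UNIT).items():
        print(f"{key}={value}")
```

Any directive you decide to relax is better placed in a drop-in override (for example via `systemctl edit`) than edited into the packaged unit file, so a package upgrade doesn't silently undo or overwrite the change.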

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.