BOINC 7.18.x and later: Computation error oddly specific to ROCm
Author | Message |
---|---|
Joined: 9 Jun 18 Posts: 13 |
This issue is oddly specific to AMD GPUs running ROCr-based OpenCL on Linux. It doesn't appear to affect NVIDIA GPUs, AMD's legacy OpenCL on Linux, or any Windows-based set-up. (AMD OpenCL support on Linux requires ROCm for Vega GPUs and later.) When I attempt to run GPU tasks for Einstein@Home, they fail with a 'computation error' within ~10 seconds. An example:

<message>
process exited with code 69 (0x45, -187)</message>
<stderr_txt>
09:31:22 (11580): [normal]: This Einstein@home App was built at: Jan 16 2017 08:09:16
09:31:22 (11580): [normal]: Start of BOINC application '../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.18_x86_64-pc-linux-gnu__FGRPopencl1K-ati'.
09:31:22 (11580): [debug]: 1e+16 fp, 5.9e+09 fp/s, 1785112 s, 495h51m52s17
command line: ../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.18_x86_64-pc-linux-gnu__FGRPopencl1K-ati --inputfile ../../projects/einstein.phys.uwm.edu/LATeah3012L12220912.dat --alpha 2.59819959601 --delta -0.694603692878 --skyRadius 1.890770e-06 --ldiBins 15 --f0start 836.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 1.69860773e-15 --ephemdir ../../projects/einstein.phys.uwm.edu/JPLEPH --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile ../../projects/einstein.phys.uwm.edu/templates_LATeah3012L12220912_0844_11462382.dat --debug 0 --device 1 -o LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out
output files: 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out' '../../projects/einstein.phys.uwm.edu/LATeah3012L12220912_844.0_0_0.0_11462382_1_0' 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah3012L12220912_844.0_0_0.0_11462382_1_1'
09:31:22 (11580): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
09:31:22 (11580): [debug]: glibc version/release: 2.35/stable
09:31:22 (11580): [debug]: Set up communication with graphics process.
boinc_get_opencl_ids returned [0x1e97b40 , 0x7fabc0742d90]
Using OpenCL platform provided by: Advanced Micro Devices, Inc.
Using OpenCL device "gfx900:xnack-" by: Advanced Micro Devices, Inc.
Max allocation limit: 7287183768
Global mem size: 8573157376
Couldn't create OpenCL command queue (error: -6)!
OpenCL shutdown complete!
initialize_ocl returned error [2013]
OCL context null
OCL queue null
Error generating generic FFT context object [5]
09:31:22 (11580): [CRITICAL]: ERROR: MAIN() returned with error '5'
FPU status flags:
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out.cohfu': No such file or directory
09:31:34 (11580): [normal]: done. calling boinc_finish(69).
09:31:34 (11580): called boinc_finish
</stderr_txt>

I've determined that this issue appears to be specific to BOINC: I can confirm it's a problem with BOINC 7.18.1 and 7.20.2, but not with 7.16.17. All other hardware and software remains the same, even the same desktop session (i.e. no rebooting between BOINC installations). I wonder if it's a permissions issue, because of all the missing-file messages. Is there a change in how BOINC runs GPU tasks between 7.16.x and 7.18.x that ROCm might be sensitive to?

Here are some of my Linux hosts:
Ubuntu 20.04, ROCr-based OpenCL, can only run successfully up to BOINC 7.16.17: https://einsteinathome.org/host/12803029
Ubuntu 22.04 (issue also occurs on 20.04), ROCr-based OpenCL, can only run successfully up to BOINC 7.16.17: https://einsteinathome.org/host/12918837
Ubuntu 22.04, legacy OpenCL, running just fine with BOINC 7.20.2: https://einsteinathome.org/host/12887570

On the other hand, I've found a host that's using BOINC 7.18.1 and appears to be running an AMD GPU fine, but I can't tell what the amdgpu set-up is. (I've attempted to contact the owner before but never got an answer.) https://einsteinathome.org/host/12941414 |
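
For anyone comparing logs from other hosts, here is a minimal Python sketch that pulls the OpenCL status out of a stderr excerpt like the one above. The helper name `first_opencl_error` and the abridged excerpt are illustrative, not part of BOINC or the Einstein@Home app. The last lines only show that the three numbers in "code 69 (0x45, -187)" are arithmetically consistent with one value; that is an observation about the numbers, not a statement of how the client formats them.

```python
import re

# Abridged excerpt of the stderr text quoted in the post (illustrative only).
STDERR_EXCERPT = """\
Using OpenCL device "gfx900:xnack-" by: Advanced Micro Devices, Inc.
Couldn't create OpenCL command queue (error: -6)!
initialize_ocl returned error [2013]
09:31:22 (11580): [CRITICAL]: ERROR: MAIN() returned with error '5'
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out': No such file or directory
09:31:34 (11580): [normal]: done. calling boinc_finish(69).
"""


def first_opencl_error(stderr_text):
    """Return the first '(error: N)' value in the text as an int, or None."""
    match = re.search(r"\(error: (-?\d+)\)", stderr_text)
    return int(match.group(1)) if match else None


opencl_status = first_opencl_error(STDERR_EXCERPT)
print("first OpenCL status in log:", opencl_status)

# The exit code appears three ways in the <message> line: 69, 0x45, -187.
exit_code = 69
print(hex(exit_code), (-1 << 8) | exit_code)
```

The `mv: cannot stat` lines come after the command-queue failure, so they look like a consequence of the app never producing output rather than a separate problem.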
Joined: 9 Jun 18 Posts: 13 |
I did some digging: initialize_ocl() seems to be a function in the Einstein code, not in BOINC. For whatever reason, though, newer BOINC versions cause a problem in it. According to the source code for Einstein BRP (which may well be out of date), error code 2013 is defined in demod_binary.h as RADPUL_OCL_MEM_ALLOC_DEVICE. It is one of the error codes returned in response to clCreateCommandQueue(), which is an OpenCL function. Error code -6 corresponds to CL_OUT_OF_HOST_MEMORY. That code seems to be returned for a variety of reasons, so I suspect the host isn't really out of memory and this is just some weird interaction between potentially old Einstein code and new BOINC versions. Why and how newer BOINC versions cause this is still a mystery to me, however. |
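
The code-to-name mapping mentioned above can be written down as a small lookup. This is only a sketch: the table is a subset of the status codes in the Khronos `cl.h` header, limited to the values documented for clCreateCommandQueue(), and the helper `cl_error_name` is a hypothetical name for illustration.

```python
# Subset of OpenCL status codes (values as defined in the Khronos cl.h header),
# restricted to those documented as possible results of clCreateCommandQueue().
CL_ERROR_NAMES = {
    0: "CL_SUCCESS",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -30: "CL_INVALID_VALUE",
    -33: "CL_INVALID_DEVICE",
    -34: "CL_INVALID_CONTEXT",
    -35: "CL_INVALID_QUEUE_PROPERTIES",
}


def cl_error_name(code):
    """Translate a numeric OpenCL status into its symbolic name."""
    return CL_ERROR_NAMES.get(code, f"unknown OpenCL status {code}")


# The value reported in the log above.
print(cl_error_name(-6))
```

The name only says what the runtime reported, not why. A driver may return CL_OUT_OF_HOST_MEMORY when an allocation or mapping on the host side fails for reasons other than exhausted RAM.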
Joined: 9 Jun 18 Posts: 13 |
It turns out that disabling some of the systemd hardening is a work-around for this issue. I consider it only a work-around because it wasn't necessary for BOINC 7.16.17, and presumably the hardening is there for good reason. https://github.com/BOINC/boinc/issues/4948 |
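
To see which hardening options a unit file enables before deciding what to relax, a rough Python sketch like the following can list them. The unit text here is a made-up example, not the unit shipped with BOINC, and the prefix list is just a handful of real systemd sandboxing directive names. Which directive actually matters for ROCm is discussed in the linked GitHub issue, not here.

```python
# Prefixes/names of some systemd sandboxing directives (see systemd.exec).
HARDENING_PREFIXES = (
    "Protect", "Private", "Restrict",
    "NoNewPrivileges", "MemoryDenyWriteExecute",
    "ReadWritePaths", "ReadOnlyPaths", "SystemCallFilter",
)


def hardening_directives(unit_text):
    """Return (directive, value) pairs from the [Service] section."""
    found = []
    in_service = False
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blanks and comments
        if line.startswith("[") and line.endswith("]"):
            in_service = line == "[Service]"
            continue
        if in_service and "=" in line:
            key, value = line.split("=", 1)
            if key.strip().startswith(HARDENING_PREFIXES):
                found.append((key.strip(), value.strip()))
    return found


# Hypothetical unit text for illustration only.
EXAMPLE_UNIT = """\
[Unit]
Description=Hypothetical example unit

[Service]
ExecStart=/usr/bin/boinc
ProtectSystem=strict
PrivateTmp=true
NoNewPrivileges=true
"""

for name, value in hardening_directives(EXAMPLE_UNIT):
    print(f"{name} = {value}")
```

On a real host the text could come from `systemctl cat` on the BOINC service (the unit name depends on the distribution's packaging), and any change is better made in a drop-in override than by editing the packaged unit.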
Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.