Thread 'Problem accessing GPU memory'

Message boards : Questions and problems : Problem accessing GPU memory

Pandemonyum

Joined: 10 Mar 25
Posts: 2
Message 115581 - Posted: 10 Mar 2025, 1:26:22 UTC
Last modified: 10 Mar 2025, 1:38:42 UTC

Hi all,

I just set up boinc-client 8.0.2-715 amd64 on a CLI-only Debian 12 system with an Nvidia GTX 950 GPU and an Intel CPU. The CPU tasks were working fine, but I recently got 18 new Asteroids@home GPU tasks, which have all failed (with "state: compute error"). I was wondering whether this is a problem with the tasks themselves or with my GPU drivers, but I am strongly leaning towards the drivers.
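In case it is useful, the failed tasks and their states can also be listed from the CLI with boinccmd. This is just a rough sketch: it assumes boinccmd from the boinc-client package and that it is run from the data directory (/var/lib/boinc here) so it can read the GUI RPC password.

cd /var/lib/boinc
# list all tasks the client knows about, keeping only the name and state lines
boinccmd --get_tasks | grep -E "name:|state:"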

I checked the logs in /var/lib/boinc/stderrgpudetect.txt and found this:
OpenCL: libOpenCL.so.1: cannot open shared object file: No such file or directory
NVIDIA library reports 1 GPU
[coproc] cuMemGetInfo(0) returned 201
ATI: libaticalrt.so: cannot open shared object file: No such file or directory
The line "[coproc] cuMemGetInfo(0) returned 201" seems to be the most relevant, but I am not sure how to solve it. Most results online are from people using this function in their own code.
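As a side note, the libOpenCL.so.1 line above usually just means the OpenCL ICD loader is not installed. A rough way to check and fix this on Debian (the package names are my assumption, and the NVIDIA part needs the non-free repo enabled):

# see whether the OpenCL loader is known to the dynamic linker
ldconfig -p | grep libOpenCL
# if nothing shows up, install the loader and NVIDIA's OpenCL ICD
sudo apt install ocl-icd-libopencl1 nvidia-opencl-icd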

When running nvidia-smi, I also noticed that this message gets logged (unfortunately I cannot find which log file it comes from).
kernel: __vm_enough_memory: pid: 1533, comm: nvidia-smi, no enough memory for the allocation

Here are a few more details:
$ nvidia-smi
Sun Mar  9 21:16:43 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.01             Driver Version: 535.216.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 950         Off | 00000000:01:00.0 Off |                  N/A |
| 32%   31C    P8              15W /  75W |      1MiB /  2048MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

I tried installing the CUDA toolkit just in case.
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

Then I reinstalled all NVIDIA packages and drivers, including CUDA, and it has not helped. Kernel version is 6.1.0-31-amd64. All packages should be up to date.

Is there any chance that installing an older GPU driver could fix the issue? Which one should I try installing?

I just started learning Linux (and it has been a rough ride) so please be as clear and detailed as possible in your response. Any help would be appreciated in solving this issue. Thank you!
ID: 115581
hadron

Joined: 5 Sep 22
Posts: 41
Canada
Message 115610 - Posted: 13 Mar 2025, 19:15:03 UTC - in response to Message 115581.  

In reply to Pandemonyum's message of 10 Mar 2025:
kernel: __vm_enough_memory: pid: 1533, comm: nvidia-smi, no enough memory for the allocation

|   0  NVIDIA GeForce GTX 950         Off | 00000000:01:00.0 Off |                  N/A |
| 32%   31C    P8              15W /  75W |      1MiB /  2048MiB |      0%      Default |

I just started learning Linux (and it has been a rough ride) so please be as clear and detailed as possible in your response. Any help would be appreciated in solving this issue. Thank you!

This is not a problem with Linux. You only have 2 GB of memory on your GPU, but the tasks you are trying to run need more than that.
ID: 115610
Pandemonyum

Send message
Joined: 10 Mar 25
Posts: 2
Message 115622 - Posted: 15 Mar 2025, 2:24:16 UTC - in response to Message 115610.  

The __vm_enough_memory message shows up when simply running the nvidia-smi command, and nvidia-smi itself should need little or no GPU memory, so I don't think the tasks' memory requirements are the cause.

I noticed that persistence mode was off, which is strange because the nvidia-persistenced daemon was active. I could not enable persistence mode manually because the command kept throwing an error; it turns out the manual command only works if the daemon is not running. Killing the daemon and then enabling persistence mode by hand did the trick:

ps aux | grep nvidia-persistenced    # find the daemon's PID
sudo kill [PID]                      # replace [PID] with the number from the previous command
sudo nvidia-smi -pm 1                # enable persistence mode manually
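To double-check that persistence mode is actually on afterwards, something like this should work (the field name is taken from nvidia-smi's -q output, so treat it as a rough check):

# should print "Persistence Mode : Enabled"
nvidia-smi -q | grep -i "persistence mode"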

After running these commands, I ended up getting more GPU tasks and they were all successful. I am still getting the [coproc] cuMemGetInfo(0) returned 201 error in /var/lib/boinc/stderrgpudetect.txt, but it seems to be irrelevant to the issue at hand.

I will try to get the daemon working properly, but if I cannot, a quick and easy fix would be to disable the daemon and run a script at boot that turns persistence mode on manually (a rough sketch is below).
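For reference, here is a minimal sketch of that workaround (untested; it assumes the daemon runs as the nvidia-persistenced systemd service and that root's crontab supports @reboot, which Debian's default cron does):

# stop and disable the daemon so the manual command is allowed to run
sudo systemctl disable --now nvidia-persistenced
# then add this line to root's crontab (sudo crontab -e) so persistence
# mode is enabled once at every boot:
@reboot /usr/bin/nvidia-smi -pm 1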

I hope this helps someone else!
ID: 115622
