Message boards : BOINC client : The old scheduling problem strikes again
Author | Message |
---|---|
Joined: 29 Aug 05 Posts: 11 |
Running a PrimeGrid task on the GPU that was expected to take 20 hours, but the new GPU finishes them in about 8. The task got to 99.990% complete with an estimated 7 seconds left, having run for 8 hours 13 minutes 16 seconds straight, and then the client decided to suspend it and move on to other projects. Argh. I'll have to figure out where to put an if statement in the code: if (estimated_time_left < 60) // keep running task |
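A minimal sketch of the guard proposed above, written as a standalone C++ function. The struct, its field names, and `may_preempt` are hypothetical and are not taken from the BOINC client source; the 60-second grace period is simply the threshold from the post.

```cpp
#include <cassert>

// Hypothetical snapshot of a running task; the field names are
// illustrative and do not come from the BOINC client source.
struct TaskSnapshot {
    double elapsed_in_slice_secs;     // time run since the task was last scheduled
    double estimated_time_left_secs;  // the client's (possibly inexact) estimate
};

// Returns true if the scheduler may preempt the task now.
// Assumption: a task whose time slice has expired is normally preemptable,
// but a task estimated to finish within grace_secs is left running.
bool may_preempt(const TaskSnapshot& t, double time_slice_secs,
                 double grace_secs = 60.0) {
    assert(time_slice_secs >= 0.0 && grace_secs >= 0.0);
    if (t.estimated_time_left_secs < grace_secs) {
        return false;  // keep running task: it should be about to finish
    }
    return t.elapsed_in_slice_secs >= time_slice_secs;
}
```

With the numbers from the post (29596 seconds run, 7 seconds estimated remaining, a 3600-second slice), `may_preempt` returns false. The guard is only as good as the estimate it relies on: if the remaining-time figure is wrong, the task could hold the GPU for much longer than the grace period suggests.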
Joined: 29 Aug 05 Posts: 11 |
Yes, and was that 99.99% a checkpoint event? I've seen 7 seconds turn into hours and hours, and since those 'estimated' TTCs are notoriously inexact, the client just goes and applies the swap-app logic if it was another project's turn.

I would have no problem if the cpu_scheduler were following the "switch between tasks every 60 minutes" setting, but with the PrimeGrid GPU tasks that run for more than an hour it isn't. Here is the log from a current PrimeGrid task that ran for 1:47:42 and was 99.917% complete with 14 seconds remaining:

7/14/2017 9:47:38 PM | PrimeGrid | [task] result genefer19_10979043_0 checkpointed
7/14/2017 9:47:49 PM | PrimeGrid | [coproc] ATI instance 0; 1.000000 pending for genefer19_10979043_0
7/14/2017 9:47:49 PM | PrimeGrid | [coproc] ATI instance 0: confirming 1.000000 instance for genefer19_10979043_0
7/14/2017 9:48:49 PM | PrimeGrid | [coproc] ATI instance 0; 1.000000 pending for genefer19_10979043_0
7/14/2017 9:48:49 PM | PrimeGrid | [coproc] ATI instance 0: confirming 1.000000 instance for genefer19_10979043_0
7/14/2017 9:49:49 PM | PrimeGrid | [coproc] ATI instance 0; 1.000000 pending for genefer19_10979043_0
7/14/2017 9:49:49 PM | PrimeGrid | [coproc] ATI instance 0: confirming 1.000000 instance for genefer19_10979043_0
7/14/2017 9:50:38 PM | PrimeGrid | [task] result genefer19_10979043_0 checkpointed
7/14/2017 9:50:50 PM | PrimeGrid | [coproc] ATI instance 0; 1.000000 pending for genefer19_10979043_0
7/14/2017 9:50:50 PM | PrimeGrid | [coproc] ATI instance 0: confirming 1.000000 instance for genefer19_10979043_0
7/14/2017 9:51:20 PM | PrimeGrid | [coproc] ATI instance 0; 1.000000 pending for genefer19_10979043_0
7/14/2017 9:51:20 PM | PrimeGrid | [coproc] ATI instance 0: confirming 1.000000 instance for genefer19_10979043_0
7/14/2017 9:51:26 PM | PrimeGrid | [coproc] ATI instance 0; 1.000000 pending for genefer19_10979043_0
7/14/2017 9:51:26 PM | PrimeGrid | [coproc] ATI instance 0: confirming 1.000000 instance for genefer19_10979043_0
7/14/2017 9:52:27 PM | PrimeGrid | [coproc] ATI instance 0; 1.000000 pending for genefer19_10979043_0
7/14/2017 9:52:27 PM | PrimeGrid | [coproc] ATI instance 0: confirming 1.000000 instance for genefer19_10979043_0
7/14/2017 9:53:27 PM | PrimeGrid | [coproc] ATI instance 0; 1.000000 pending for genefer19_10979043_0
7/14/2017 9:53:27 PM | PrimeGrid | [coproc] ATI instance 0: confirming 1.000000 instance for genefer19_10979043_0
7/14/2017 9:53:38 PM | PrimeGrid | [task] result genefer19_10979043_0 checkpointed
7/14/2017 9:54:28 PM | PrimeGrid | [coproc] ATI instance 0; 1.000000 pending for genefer19_10979043_0
7/14/2017 9:54:28 PM | PrimeGrid | [coproc] ATI instance 0: confirming 1.000000 instance for genefer19_10979043_0
7/14/2017 9:54:51 PM | PrimeGrid | [cpu_sched] Preempting genefer19_10979043_0 (removed from memory)
7/14/2017 9:54:51 PM | PrimeGrid | [task] task_state=QUIT_PENDING for genefer19_10979043_0 from request_exit()
7/14/2017 9:54:51 PM | | request_exit(): PID 7148 has 0 descendants
7/14/2017 9:54:52 PM | PrimeGrid | [task] Process for genefer19_10979043_0 exited, exit code 0, task state 8
7/14/2017 9:54:52 PM | PrimeGrid | [task] task_state=UNINITIALIZED for genefer19_10979043_0 from handle_exited_app

And why is it removing the task from memory in violation of the "leave suspended tasks in memory" setting? Is it to clear the GPU's memory?

The problem with letting tasks run for hours and then suspending them just before they complete is that it is usually hours before the task is restarted to finish those final few seconds, which could result in this computer being just a checker for a prime instead of the computer that found it. |
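The timestamps in the log above allow a rough estimate of how much unsaved work the preemption cost, assuming the app resumes from its last checkpoint once it has been removed from memory. The helper names below are hypothetical and only do the clock arithmetic; they are not part of BOINC.

```cpp
#include <cassert>

// Converts a 12-hour clock time, as printed in the log, to seconds since midnight.
int seconds_since_midnight(int hour12, int minute, int second, bool pm) {
    assert(hour12 >= 1 && hour12 <= 12);
    assert(minute >= 0 && minute < 60 && second >= 0 && second < 60);
    int hour24 = (hour12 % 12) + (pm ? 12 : 0);  // 12 AM -> 0, 12 PM -> 12
    return hour24 * 3600 + minute * 60 + second;
}

// Seconds of computation done after the last checkpoint and before preemption.
// Assumption: this work is redone when the task restarts from the checkpoint.
int unsaved_seconds_at_preemption(int last_checkpoint_secs, int preempt_secs) {
    assert(preempt_secs >= last_checkpoint_secs);
    return preempt_secs - last_checkpoint_secs;
}
```

For this log, the last checkpoint was at 9:53:38 PM and the preemption at 9:54:51 PM, so `unsaved_seconds_at_preemption` gives 73 seconds. The three checkpoint entries (9:47:38, 9:50:38, 9:53:38) are 180 seconds apart, so a task reported as 14 seconds from completion was stopped with more unsaved work behind it than estimated work ahead of it.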
Joined: 5 Oct 06 Posts: 5128 |
"And why is it removing the task from memory in violation of the leave suspended tasks in memory setting, is it to clear the gpu's memory?" Yes, GPU apps are always removed from memory when suspended. GPUs don't have the facility to swap stale memory images out to a paging file on disk, so a suspended task would always continue to occupy real, physical memory, which might be in short supply. |
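The rule described in this reply can be written as a single predicate. This is a sketch under the assumption that the client only needs two inputs for the decision; the function and parameter names are hypothetical and not taken from the BOINC client source.

```cpp
#include <cassert>

// Decides whether a suspended task's process should be removed from memory.
// gpu_usage is the number of GPU instances the task holds (the log prints
// this as e.g. 1.000000); leave_in_memory_pref is the user's
// "leave suspended tasks in memory" preference.
bool remove_from_memory_on_suspend(double gpu_usage, bool leave_in_memory_pref) {
    assert(gpu_usage >= 0.0);
    // GPU memory cannot be paged out to disk, so the preference is
    // overridden for any task that holds a GPU.
    return gpu_usage > 0.0 || !leave_in_memory_pref;
}
```

Under this rule the preference only ever applies to CPU-only tasks, which matches the "removed from memory" entry in the log for a task holding one ATI instance.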
Joined: 30 May 15 Posts: 265 |
"And why is it removing the task from memory in violation of the leave suspended tasks in memory setting, is it to clear the gpu's memory?" "Yes, GPU apps are always removed from memory when suspended. GPUs don't have the facility to swap stale memory images out to a paging file on disk, so a suspended task would always continue to occupy real, physical, memory, which might be in short supply." Am I right in saying that the GPU task actually falls back to the last checkpoint and restarts from that point? |
Joined: 5 Oct 06 Posts: 5128 |
"Am i right in saying, the GPU task actually falls back to last checkpoint and restarts from that point?" They should do, provided the developer has implemented checkpointing correctly. |
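For illustration, here is a minimal standalone sketch of what "falling back to the last checkpoint" means for an app. It does not use the BOINC API; a real BOINC application would, as far as I know, coordinate with the client through calls such as `boinc_time_to_checkpoint()` and `boinc_checkpoint_completed()`. The file format, the function names, and the `max_steps_this_run` parameter (which simulates preemption) are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>

// Reads the last saved step from the checkpoint file, or 0 if none exists.
long load_checkpoint(const std::string& path) {
    std::ifstream in(path);
    long step = 0;
    if (!(in >> step)) return 0;
    return step;
}

// Overwrites the checkpoint file with the current step. A production app
// would write to a temporary file and rename it to avoid a torn write.
void save_checkpoint(const std::string& path, long step) {
    std::ofstream out(path, std::ios::trunc);
    out << step << '\n';
}

// Deletes the checkpoint file, e.g. before starting a fresh task.
void clear_checkpoint(const std::string& path) {
    std::remove(path.c_str());
}

// Resumes from the last checkpoint and runs towards total_steps, saving
// every checkpoint_every steps. Returning after max_steps_this_run steps
// simulates preemption: progress since the last save is NOT written.
long run_task(const std::string& path, long total_steps,
              long checkpoint_every, long max_steps_this_run) {
    assert(total_steps >= 0 && checkpoint_every > 0 && max_steps_this_run >= 0);
    long step = load_checkpoint(path);
    long done_this_run = 0;
    while (step < total_steps && done_this_run < max_steps_this_run) {
        ++step;  // stand-in for one unit of real computation
        ++done_this_run;
        if (step % checkpoint_every == 0 || step == total_steps) {
            save_checkpoint(path, step);
        }
    }
    return step;
}
```

If a run of 100 steps with a checkpoint every 10 is stopped after 37 steps, the file holds 30, so the next run repeats steps 31 to 37 before making new progress. That repeated stretch is the cost of being removed from memory between checkpoints.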
Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.