Thread '"Backup" projects frequently returning work units late'

William Albert

Joined: 16 Mar 25
Posts: 4
Message 115630 - Posted: 16 Mar 2025, 19:48:32 UTC
Last modified: 16 Mar 2025, 19:49:39 UTC

I'm crunching for a number of different projects.

These projects are my "primary" projects, and all have resource share of 100:

  • DENIS@Home
  • GPUGrid
  • Rosetta@Home
  • World Community Grid


These projects are my "backup" projects, and all have a resource share of 0:


  • Asteroids@Home
  • Einstein@Home
  • LHC@Home
  • Milkyway@Home


My expectation with this setup is that my worker nodes will request work units from the "primary" projects when all is well, and will only request work units from the "backup" projects if ALL of the primary projects are collectively out of work or having an outage for some reason. This part seems to be working as expected.

However, if I do pull work units from my backup projects, and a primary project comes back online with work to do before the WU from the backup project finishes, the backup project WUs will sit paused until very close to their deadline, and will even frequently exceed their deadline.

This behavior is especially problematic for projects like Milkyway@Home and LHC@Home where the WU completion time can vary quite a bit. Indeed, as I write this, I'm looking at a Milkyway@Home WU that I received on March 4th, that has been sitting paused for nearly two weeks (and is now being crunched with high priority), is already late, and is projected to finish nine hours after its deadline has expired (assuming the server doesn't cancel it first).

My worker nodes are dedicated to BOINC, and are running 24/7 on a very lightweight Linux install that has essentially no background tasks that would take resources away from BOINC. I've also run the BOINC benchmark on all of my nodes right after letting them crunch work for an extended period of time, so each node was benchmarked while heat-soaked, rather than during a quiet period when the processors are capable of boosting much higher. Put simply, I've given BOINC as favorable an environment as I can in terms of being able to predict each node's computing capacity.

I also don't run large caches: BOINC is configured to store at most 0.5 days of work for each project (and it seems to only pull a few WUs at a time from the backup projects in any case).

Outside of adjusting the resource share and the cache size, and making sure that BOINC is allowed to crunch 24/7, I haven't customized any project preferences (at least not in a way that I would expect to cause WUs to be late).

Why does BOINC do this? Even though I want to focus on my primary projects, if I do request work from a backup project, I still want to return a result back to them within a timely manner. Did I do something wrong? If not, is there a way to direct BOINC to process WUs based on when they arrived, rather than however BOINC is currently prioritizing them?

ID: 115630
Grant (SSSF)

Joined: 7 Dec 24
Posts: 38
Message 115631 - Posted: 17 Mar 2025, 5:41:04 UTC - in response to Message 115630.  

In reply to William Albert's message of 16 Mar 2025:
I also don't run large caches: BOINC is configured to store at most 0.5 days of work for each project (and it seems to only pull a few WUs at a time from the backup projects in any case).
When running more than one project, no cache (or almost no cache) is best. If running projects purely as backup projects, then no cache is best.
ie Store at least 0.05 days (around 70 min) and Store up to an additional 0.01 days (that way it should only download a Task just before it finishes processing the current one). Even if you do run with some sort of cache, you need to set the cache size using the Store at least value.
The Additional days value is always best set at 0.01 days. eg if you set things to 1 day and 5 additional days, what will happen is you will get 6 days of work, then it will run down until it drops below 1 day, and then reload back up to 6 days' worth. If you want 6 days' worth, then set it to 6 days and 0.01 additional days.
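For reference, those two settings correspond to work_buf_min_days and work_buf_additional_days in the client's global_prefs_override.xml (the file the Manager writes when you change computing preferences locally). A minimal sketch, assuming the file sits in the BOINC data directory:

   <global_preferences>
      <!-- "Store at least X days of work": keep the queue very small -->
      <work_buf_min_days>0.05</work_buf_min_days>
      <!-- "Store up to an additional X days of work" -->
      <work_buf_additional_days>0.01</work_buf_additional_days>
   </global_preferences>

After editing the file, Options -> Read local prefs file in the Manager (or boinccmd --read_global_prefs_override) should make the client pick it up without a restart.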
And apparently not all projects honour the 0 Resource share value setting, so odd things can happen.
Grant
Darwin NT.
ID: 115631
William Albert

Joined: 16 Mar 25
Posts: 4
Message 115632 - Posted: 17 Mar 2025, 7:18:27 UTC

I can see about adjusting the cache value, but I so far haven't had a situation where I've requested so much work that it's too much to complete by the deadline. Rather, my issue is that BOINC is letting work from my backup projects sit in a waiting state for too long while it continuously fetches new work for a primary project.

Likewise, if a project weren't respecting my resource share setting, I'd expect to get work when I don't want it, or to have my workers' capacity weighted toward a particular project. While I have noticed this somewhat with LHC@Home, because it's my only project where a WU requests more than one core, I haven't seen it be a problem more generally, and BOINC sitting on WUs until they're late when there's plenty of computing capacity available seems like an entirely client-side issue.
ID: 115632
Dave
Help desk expert

Joined: 28 Jun 10
Posts: 2809
United Kingdom
Message 115633 - Posted: 17 Mar 2025, 8:00:20 UTC

I have only ever had this problem when mixing projects whose run times differ widely. CPDN tasks often take 3-6 days on my machine. (Still a vast improvement from the days when they could take six months or longer on a slow machine!) Because my preferred work is all either CPDN or ARP tasks from WCG, supply of which is erratic in both cases, I settle for being more interventionist: when work supply is good from my preferred projects, I turn the others off.
ID: 115633
Grant (SSSF)

Joined: 7 Dec 24
Posts: 38
Message 115634 - Posted: 17 Mar 2025, 8:05:00 UTC - in response to Message 115632.  

How many Tasks for the backup projects have been completed and Validated?

All projects seem to have issues with initial estimates (some are a bit out, others not even close), and it takes 10 completed & validated Tasks for the project server & BOINC manager to sort out their actual processing rates to then enable accurate estimates of processing time.
Significant hardware changes, new applications or major changes to the Tasks being processed for a given application can throw everything out of whack & take time to re-determine the processing rate again.

The smaller the cache, the more cores & threads available to BOINC, and the more time BOINC has to actually process work (ie Use at most 100% of CPU time, never suspend when non-BOINC CPU usage exceeds any level, never suspend BOINC processing when there is keyboard or mouse input), the sooner that can occur.
However, with backup projects, if they very rarely get called upon, and the more projects you have, then the processing rate estimates for the backup projects may take months to become accurate.


work_fetch_debug
cpu_sched_debug
priority_debug
rr_simulation
time_debug
are all options you can set in the Manager for the Event log to see what the Manager is doing, and why. But be warned- some of them will produce huge amounts of output, and the more projects there are, the more output...
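Those flags can also be set directly in cc_config.xml in the BOINC data directory, inside <log_flags>; a minimal sketch with only the flags mentioned above turned on:

   <cc_config>
      <log_flags>
         <work_fetch_debug>1</work_fetch_debug>
         <cpu_sched_debug>1</cpu_sched_debug>
         <priority_debug>1</priority_debug>
         <rr_simulation>1</rr_simulation>
         <time_debug>1</time_debug>
      </log_flags>
   </cc_config>

The client re-reads the file via Options -> Read config files in the Manager (or boinccmd --read_cc_config), so the noisier flags can be switched off again without restarting.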
Grant
Darwin NT.
ID: 115634
Jord
Volunteer tester
Help desk expert

Joined: 29 Aug 05
Posts: 15604
Netherlands
Message 115635 - Posted: 17 Mar 2025, 8:22:05 UTC - in response to Message 115632.  

In reply to William Albert's message of 17 Mar 2025:
While I have noticed this somewhat with LHC@Home because it's my only project where a WU requests more than one core
Both LHC and Milkyway have multithreaded applications (https://lhcathome.cern.ch/lhcathome/apps.php, https://milkyway.cs.rpi.edu/milkyway/apps.php), which afaik require the same number of cores to be free as were available when BOINC requested and received that work from those projects. If at a later time those cores aren't free, that work will wait until they are.
ID: 115635
Grant (SSSF)

Joined: 7 Dec 24
Posts: 38
Message 115636 - Posted: 17 Mar 2025, 9:15:14 UTC - in response to Message 115635.  

In reply to Jord's message of 17 Mar 2025:
In reply to William Albert's message of 17 Mar 2025:
While I have noticed this somewhat with LHC@Home because it's my only project where a WU requests more than one core
Both LHC and Milkyway have multithreaded applications (https://lhcathome.cern.ch/lhcathome/apps.php, https://milkyway.cs.rpi.edu/milkyway/apps.php), which afaik require the same number of cores to be free as were available when BOINC requested and received that work from those projects. If at a later time those cores aren't free, that work will wait until they are.
However, if the backup project Task(s) are in danger of missing their deadlines, then they should become High Priority, and as many Tasks from the other projects as necessary to supply the needed number of cores should be paused while the now High Priority Tasks are processed (I would have thought).
Grant
Darwin NT.
ID: 115636
William Albert

Joined: 16 Mar 25
Posts: 4
Message 115637 - Posted: 17 Mar 2025, 10:23:14 UTC - in response to Message 115633.  
Last modified: 17 Mar 2025, 10:24:32 UTC

In reply to Grant (SSSF)'s message of 17 Mar 2025:
How many Tasks for the backup projects have been completed and Validated?

All projects seem to have issues with initial estimates (some are a bit out, others not even close), and it takes 10 completed & validated Tasks for the project server & BOINC manager to sort out their actual processing rates to then enable accurate estimates of processing time.
Significant hardware changes, new applications or major changes to the Tasks being processed for a given application can throw everything out of whack & take time to re-determine the processing rate again.

The smaller the cache, the more cores & threads available to BOINC, and the more time BOINC has to actually process work (ie Use at most 100% of CPU time, never suspend when non-BOINC CPU usage exceeds any level, never suspend BOINC processing when there is keyboard or mouse input), the sooner that can occur.
However, with backup projects, if they very rarely get called upon, and the more projects you have, then the processing rate estimates for the backup projects may take months to become accurate.


work_fetch_debug
cpu_sched_debug
priority_debug
rr_simulation
time_debug
are all options you can set in the Manager for the Event log to see what the Manager is doing, and why. But be warned- some of them will produce huge amounts of output, and the more projects there are, the more output...


With the exception of DENIS@Home (which I haven't completed any work for yet due to the project being out of work for an extended period of time), I've completed hundreds of work units minimum for all of my active projects.

Additionally, while I can understand if a WU is late if it takes longer to process than estimated, the estimates for the late WUs that I've seen have been reasonably accurate -- BOINC just doesn't resume them in a timely manner. In fact, for the Milkyway@Home WU that I cited as an example in my original post, the WU was still "waiting to run" even though the estimated time to completion was longer than the remaining deadline. It was my understanding that it should have had its priority boosted by BOINC, but it didn't happen until the WU was nearly expired, and the WU ended up being returned hours late (thankfully, I still got credit for it).

Thanks for the debug tips. I'll see about turning on debugging the next time BOINC pulls work from a backup project.


In reply to Jord's message of 17 Mar 2025:
In reply to William Albert's message of 17 Mar 2025:
While I have noticed this somewhat with LHC@Home because it's my only project where a WU requests more than one core
Both LHC and Milkyway have multithreaded applications (https://lhcathome.cern.ch/lhcathome/apps.php, https://milkyway.cs.rpi.edu/milkyway/apps.php), which afaik require the same number of cores to be free as were available when BOINC requested and received that work from those projects. If at a later time those cores aren't free, that work will wait until they are.


Milkyway@Home has a multithreaded application, but the number of threads a WU can use can be controlled in the project preferences, and I have it set to 1 thread per WU.
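(For projects that don't expose that in their web preferences, a similar limit can usually be imposed client-side with an app_config.xml in the project's folder under the BOINC data directory. A rough sketch; the app name below is a placeholder and would need to match the project's actual application name, e.g. as shown in a task's properties in the Manager, and whether a command-line switch is needed at all is app-specific:

   <app_config>
      <app_version>
         <app_name>example_mt_app</app_name>    <!-- placeholder: replace with the project's real app name -->
         <plan_class>mt</plan_class>            <!-- applies to the multithreaded plan class -->
         <avg_ncpus>1</avg_ncpus>               <!-- tell BOINC to budget 1 CPU per Task -->
         <cmdline>--nthreads 1</cmdline>        <!-- app-specific; only if the app takes a thread-count switch -->
      </app_version>
   </app_config>

The client picks it up via Options -> Read config files.)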


In reply to Grant (SSSF)'s message of 17 Mar 2025:
In reply to Jord's message of 17 Mar 2025:
In reply to William Albert's message of 17 Mar 2025:
While I have noticed this somewhat with LHC@Home because it's my only project where a WU requests more than one core
Both LHC and Milkyway have multithreaded applications (https://lhcathome.cern.ch/lhcathome/apps.php, https://milkyway.cs.rpi.edu/milkyway/apps.php), which afaik require the same number of cores to be free as were available when BOINC requested and received that work from those projects. If at a later time those cores aren't free, that work will wait until they are.
However, if the backup project Task(s) are in danger of missing their deadlines, then they should become High Priority, and as many Tasks from the other projects as necessary to supply the needed number of cores should be paused while the now High Priority Tasks are processed (I would have thought).


That's what I thought as well, but either this isn't working properly, or something is disrupting the priority boost for some reason.


In reply to Dave's message of 17 Mar 2025:
I have only ever had this problem when mixing projects whose run times differ widely. CPDN tasks often take 3-6 days on my machine. (Still a vast improvement from the days when they could take six months or longer on a slow machine!) Because my preferred work is all either CPDN or ARP tasks from WCG, supply of which is erratic in both cases, I settle for being more interventionist: when work supply is good from my preferred projects, I turn the others off.


If there were a way to tell BOINC "resume work on these WUs", and have those WUs prioritized, then this wouldn't be such an issue, because I could temporarily intervene. However, I'm not aware of any such functionality, and I've resorted in the past to setting my backup projects to "No New Work" and suspending my primary projects to allow the backup project WUs to complete in a timely manner.
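(For what it's worth, that manual intervention can at least be scripted with boinccmd rather than clicked through in the Manager. A rough sketch with placeholder URLs; the real ones are whatever URLs the projects are attached under:

   # stop fetching new work for a backup project
   boinccmd --project http://backup.project.url/ nomorework
   # temporarily suspend a primary project so the backup Tasks get the cores
   boinccmd --project http://primary.project.url/ suspend
   # ...then undo both once the backup Tasks have been reported
   boinccmd --project http://backup.project.url/ allowmorework
   boinccmd --project http://primary.project.url/ resume
)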

I tried to let BOINC do its thing this time around because I've read advice elsewhere that BOINC's scheduler can be disrupted if one tries to micromanage it, but letting it be results in WUs being late.

If there were a way to tell BOINC to prioritize WUs in the order which they were downloaded (rather than whatever BOINC does by default), that would presumably resolve this issue and allow me to just set and forget it.
ID: 115637
Grant (SSSF)

Joined: 7 Dec 24
Posts: 38
Message 115638 - Posted: 17 Mar 2025, 11:07:21 UTC - in response to Message 115637.  

If there were a way to tell BOINC to prioritize WUs in the order which they were downloaded (rather than whatever BOINC does by default), that would presumably resolve this issue and allow me to just set and forget it.
Actually, that is the default way it processes the Tasks. First in, First out.

With multiple projects and equal Resource share values, when BOINC first starts processing work for them, it will download a group of Tasks from each and often start processing the first few from each group (all depending on the number of cores available). All things being equal, eventually it would get to the point where the Tasks are processed in the order they are downloaded. But since all things aren't equal, it doesn't necessarily happen that way all the time. It should, generally, do them in the order that they are downloaded, but if a Task is received that has a longer or shorter deadline or estimated runtime than that type of Task usually has, then it may get done sooner or later than it otherwise would be.

The change in processing order comes about from different deadlines, Resource share settings, and changing processing rate values (for some applications and data, the processing time for a given Task is pretty much stable; for others, Tasks can take more than twice as long, or finish in half the time, of the average running time for most Tasks). Some projects have much more efficient applications than others.
And while a Task for a project that is set as a backup project will have the lowest level of priority, even if it was downloaded earlier, it should still be completed before its deadline.
Grant
Darwin NT.
ID: 115638
Dave
Help desk expert

Joined: 28 Jun 10
Posts: 2809
United Kingdom
Message 115639 - Posted: 17 Mar 2025, 11:33:44 UTC

The only other thing to add is that if you want BOINC to work things out, don't constantly change settings, as each time you do that it has to start again. Or, if you spend lots of time on your computer, you can keep an eye on it and micromanage it.
ID: 115639
William Albert

Joined: 16 Mar 25
Posts: 4
Message 115640 - Posted: 17 Mar 2025, 12:00:21 UTC - in response to Message 115638.  

In reply to Grant (SSSF)'s message of 17 Mar 2025:
If there were a way to tell BOINC to prioritize WUs in the order which they were downloaded (rather than whatever BOINC does by default), that would presumably resolve this issue and allow me to just set and forget it.
Actually, that is the default way it processes the Tasks. First in, First out.


With respect, it objectively doesn't unless you're only doing work for a single project where all WUs have the same length deadline, and the rest of your post even details the algorithm BOINC uses to determine how to schedule WUs.

In reply to Grant (SSSF)'s message of 17 Mar 2025:
And while a Task for a project that is set as a backup project will have the lowest level of priority, even if it was downloaded earlier, it should still be completed before its deadline.


Unfortunately, this doesn't seem to work as expected.
ID: 115640
Grant (SSSF)

Joined: 7 Dec 24
Posts: 38
Message 115643 - Posted: 18 Mar 2025, 5:05:05 UTC - in response to Message 115640.  
Last modified: 18 Mar 2025, 5:05:29 UTC

With respect, it objectively doesn't unless you're only doing work for a single project where all WUs have the same length deadline, and the rest of your post even details the algorithm BOINC uses to determine how to schedule WUs.
That doesn't change the fact that the default is first in, first out, regardless of the number of projects you might have.
But Resource share settings, application runtimes and differing deadlines do result in non-first-in-first-out processing in order to meet those competing requirements.
Grant
Darwin NT.
ID: 115643
