Message boards : BOINC Manager : Someone explains how the new scheduler works!
Author | Message |
---|---|
Send message Joined: 6 Feb 07 Posts: 19 |
A few days ago I updated my BOINC Manager to the new version 5.8.8. I was trying to keep a bigger cache of SIMAP WUs (I was running only SIMAP), so I set my "Connect to network about every" to 7 days and the outcome was... no new WUs downloaded! :-( After reading a lot of threads, I reset my cache to 2 days and the scheduler allowed me to download 10-20 WUs. You can read about this adventure here. After returning a lot of SIMAP WUs, I set my cache to 1.5 days, unsuspended uFluids and allowed it to get more work, then I went out. Tonight I opened my BOINC Manager and saw 360 WUs downloaded by the scheduler. So: when I want a big cache and put 7.0 days, it doesn't get new work; when I want to avoid getting a lot of WUs and put 1.5 days, it downloads work for a month and more, even though the deadline is 15 days. I'm really upset! I hate this scheduler! Maybe it takes wrong decisions when only one project is running... I don't know! PLEASE, can someone explain how this new scheduler works and why it behaves this way? I've read a lot of threads and posts, and I know the rules and the formula that the scheduler uses to make a decision... but, at first glance, it seems to do the exact opposite!! Thanks in advance for your help! Luca B. |
Send message Joined: 30 Oct 05 Posts: 1239 |
I think the reason you got so many uFluids WUs is probably related to what your DCF was (didn't they have a lot of short WUs that came with extremely high estimated times to completion, which would drive your DCF down low?). If your DCF was really low and the project's estimates of how many operations the WU would take are off, then you'll download a boatload of work. The scheduler can only use the information that is given to it. Just remember the old adage GIGO... Garbage In, Garbage Out (it sounds harsher than what I really mean, but bad input -> bad output). Kathryn :o) |
Send message Joined: 6 Feb 07 Posts: 19 |
"I think the reason you got so many uFluids WUs is probably related to what your DCF was (didn't they have a lot of short WUs that came with extremely high estimated times to completion, which would drive your DCF down low?). If your DCF was really low and the project's estimates of how many operations the WU would take are off, then you'll download a boatload of work." I use 4 PCs with BOINC; this time I opened only my PC at home to uFluids, and I can't say what my DCF was before the huge download. But I can say that the other three PCs have a DCF on uFluids that ranges between 0.75 and 1.25... so I suppose that my PC at home was in the same range. And I can say that after only 2 completed uFluids WUs my DCF has gone to 37.802176 (even with 5.8.9, which fixes the problem related to DCF). Obviously the scheduler decided to download all these WUs without working on them because I still had some SIMAP WUs to crunch... D'oh! So you think that, once again, I've been unlucky... The problem is that I never had similar accidents with 5.4.11, and my way of working with BOINC has never changed! |
Send message Joined: 30 Oct 05 Posts: 1239 |
Let's say your DCF was at 1 (just because that's a) the default value and b) halfway between what it was on the other computers). Let's say Bob (the admin over at uFluids) gave an operations estimate so that, given your benchmarks and your DCF, the time to completion estimate was 10 minutes (I know this isn't realistic, but it makes my math easier). 1.5 days = 2160 minutes. If uFluids was the only project (which it wasn't, but it makes things easier), the scheduler should have asked for around 216 WUs. And given the deadline is 15 days, the scheduler shouldn't have a problem with this (if my math is ok). Now, the first WU runs and instead of taking 10 minutes, it takes 6 hours (360 minutes). That would now give you a DCF of roughly 36. BOINC will update the time to completion on the remaining 215 WUs to 6 hours and you'll be doing uFluids for roughly the next 1300 hours (or 50+ days). Obviously some are going to be late... Now all this said, uFluids WUs can vary in length. It's been a while since I last ran it, but they went from under 5 minutes to around 10 hours the last time. I don't know how variable this latest batch is. But Bob needs to know that his estimates are way off if you've got a DCF of 37. In the end, the long term debt for uFluids will get extraordinarily high (or is it low... I can never remember) and work won't be fetched for quite a while because the project will have gotten more than its fair share of time. All these numbers are made up of course. But I'm trying to illustrate a point... the scheduler can only take the information it's given to make its decisions. The DCF allows for some compensation for user error (that is, the project's low estimates) but until that first unit is run, it only has past history to go on. And if that past history isn't an accurate reflection of the current reality then strange things can happen. 
It all goes back to a direct mapping between the numbers that are given to the scheduler and how much work it decides to fetch. If those numbers are off, then work requests are going to be off. Is it the fault of the scheduler? Probably not... Yes, it's a complex set of code (and I won't even begin to pretend to understand it on more than a superficial level) and yes, there are flaws in it. But I do see the changes as an improvement, especially for those who run more than one project at a time. Hopefully JM7 will be able to pop by this thread and correct any mistakes that I've made (because I'm certain that I've made a few along the way) and clarify anything I've said that isn't clear (because I'm certain there is stuff here that is clear as mud). Kathryn :o) |
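The arithmetic in the example above can be replayed as a short Python sketch. All figures are the made-up ones from the post (1.5-day queue, 10-minute estimate, 6-hour actual run time). The function names are hypothetical, and treating the DCF as jumping straight to actual/estimated after a single task is the post's simplification, not a description of how the client really updates it.

```python
# Replays the made-up numbers from the post; not the real BOINC work-fetch code.
MINUTES_PER_DAY = 24 * 60


def tasks_requested(queue_days, estimated_minutes):
    """How many tasks fit in the queue if each one takes the estimated time."""
    return int((queue_days * MINUTES_PER_DAY) // estimated_minutes)


def simplified_dcf(actual_minutes, estimated_minutes):
    """Assumption: DCF becomes actual / estimated after the first task."""
    return actual_minutes / estimated_minutes


def backlog_days(remaining_tasks, actual_minutes):
    """Days of crunching left once every remaining task takes the actual time."""
    return remaining_tasks * actual_minutes / MINUTES_PER_DAY


if __name__ == "__main__":
    fetched = tasks_requested(queue_days=1.5, estimated_minutes=10)
    dcf = simplified_dcf(actual_minutes=6 * 60, estimated_minutes=10)
    backlog = backlog_days(remaining_tasks=fetched - 1, actual_minutes=6 * 60)
    print(f"tasks fetched: {fetched}")
    print(f"DCF after first task: {dcf}")
    print(f"backlog: {backlog} days against a 15-day deadline")
```

With these inputs the backlog works out to 1290 hours, which is the "roughly 1300 hours (or 50+ days)" quoted in the post.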
Send message Joined: 6 Feb 07 Posts: 19 |
Thanks for your help, Kathryn. The uFluids WUs were built to last 2-3 minutes, then Bob increased that time to 10 minutes... so your simulation is not so far from reality! ;-) The decision made by the scheduler is, in this situation, clearly correct! But now, someone at uFluids should tell me why these WUs are taking HOURS to finish! To recap... I've just been a bit unlucky lately... ;-) Do you think that a complete re-installation of BOINC from scratch (to erase even the client_state) and a minimal cache of 0.5 days can restore good order? My debts and DCF will reset and maybe the scheduler will have a bit of time (with a small cache) to sort things out! Thanks again! Luca B. |
Send message Joined: 29 Aug 05 Posts: 15561 |
Not if you are using BOINC 5.8.8 as that has a broken Duration Correction Factor. Go for the test version 5.8.9 if you want the DCF fixed. |
Send message Joined: 6 Feb 07 Posts: 19 |
Thanks Ageless, but, as I stated in a previous post, I've already put 5.8.9 to work and it has fixed the problem with DCF. My DCF has gone so high just because my last WUs were extremely short (10 min) while the new ones are quite long (4 h or more). Now the Manager is trying to set things right and my DCF has gone from 37 (after the first WU returned) to 34 (after three WUs)... thanks to 5.8.9! Luca B. |
Send message Joined: 29 Aug 05 Posts: 15561 |
OK, terminology lessons. BOINC Manager: A graphical user interface (GUI) that allows the user to control the BOINC daemon more easily. It doesn't do much of anything. It only allows you to get around in BOINC quicker, without needing to drop to a command line interface and 'program' BOINC from there. (boincmgr.exe) BOINC daemon: This is the actual program that does everything, from checking if an attached project needs more work and then downloading it, to keeping track of your credits as it downloads those statistics from the various web sites, to uploading/reporting your work when it's done. (boinc.exe) The daemon doesn't compute the work though. The (science) application: This program is downloaded by the BOINC daemon when you attach to a project. It's the program that does all the computing on the results that the BOINC daemon downloads and stores in a queue. (e.g. metropolis_4.56.exe) Everything that happens in BOINC happens in the BOINC daemon. It's also called, here and there, the core client. So the daemon will calculate the DCF. It will calculate which project is next in line to use the CPU, and which project is next to download work. None of this happens in the Manager as that's just a GUI. Anyway, the new DCF calculations will make your DCF numbers go down pretty quickly. Sure, you can go edit your DCF numbers back to 1 and start from there. But if you just leave things alone for a couple of days, the number will have gone down so far that it may as well do the same trick. If you want to go for editing the client_state.xml file, to reduce your DCF numbers per project there, that's possible. Make sure you completely exit BOINC (if you are running as a service, exiting the Manager won't kill the service; you have to stop the service by hand). Then navigate to your main BOINC directory. Open client_state.xml with Notepad. Use the Find option (F3) to find <duration_correction_factor> for each of your projects. 
The number behind this tag needs to be reset to exactly this number: 1.000000 ... (so a 1 with a dot and 6 zeroes. It won't work with just a 1 !!) Change your DCF this way for all your projects, then Save client_state.xml (Do not save as..) Restart BOINC. |
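The manual edit described above can also be sketched as a small Python script. This is only an illustration under stated assumptions: it assumes each value is written as `<duration_correction_factor>number</duration_correction_factor>` on one line, the function names and the backup file name are hypothetical, and BOINC (including any service) must be fully stopped before the file is touched.

```python
# Sketch only: resets every DCF value found in a copy of client_state.xml.
# Assumption: the tag appears as <duration_correction_factor>NUMBER</duration_correction_factor>.
import re
import shutil
from pathlib import Path

DCF_PATTERN = re.compile(
    r"(<duration_correction_factor>)[^<]*(</duration_correction_factor>)"
)


def reset_dcf(state_text, value="1.000000"):
    """Return the text with every DCF value replaced by `value`."""
    return DCF_PATTERN.sub(lambda m: m.group(1) + value + m.group(2), state_text)


def reset_dcf_in_file(path):
    """Back up the file, then rewrite it with all DCF values reset."""
    path = Path(path)
    backup = path.with_name(path.name + ".bak")  # hypothetical backup name
    shutil.copy2(path, backup)
    path.write_text(reset_dcf(path.read_text()))
```

Calling `reset_dcf_in_file` on the client_state.xml in your BOINC directory does the same thing as the Notepad steps, but keeps a `.bak` copy in case something goes wrong.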
Send message Joined: 6 Feb 07 Posts: 19 |
Thanks for this lesson, Ageless! But after one year on BOINC and five years on SETI Classic, I know the differences between the daemon, the Manager and a project's application well. I used the term "Manager" just to simplify... I usually work with the client_state to tweak my debts and to force some decisions of my scheduler... or rather, I used to force these decisions! I opened this thread just because, with the new scheduler policy, I can't force the scheduler to do what I want. With 5.4.11 I could reset the debts and set the cache to 7.0 days to have plenty of work for SIMAP, without any deadline problem. But this is no longer possible with the new version, because "the formula" (If the task will not complete by round robin simulation in < 90% of the time to the computation deadline it is in deadline trouble. The computation deadline is the report deadline - (min_queue + project switch interval + 1 day).) rules this procedure out and defeats my every effort to get more WUs. And there is nothing I can do about it. Then I had the accident reported in this thread. And, once again, there was nothing to be done with this scheduler. My DCF told it that I was able to do a lot of work on uFluids (because the last batch of WUs was extremely short) and the scheduler filled my cache with 360 results... even though every WU is now taking hours to complete. There is no error or bug in the scheduler; now I know that I was unlucky and that I fell into a "limit case"! These two things happened on two successive days and they were linked to two opposite situations. I had just changed my BOINC version and I pointed my finger at the new scheduler. Now I'm trying to lick my wounds and I'm looking for a good strategy to set things right. Luca B. |
Send message Joined: 29 Aug 05 Posts: 15561 |
"And there is nothing I can do about it." Sure there is. SIMAP has an 8 day deadline. Knowing that, and knowing that the formula will add one day to your queue time ("connect to"), you just set your "connect to" to less than those 8 days. So 6 days of "connect to" gets you slightly more than 6 days of work. |
Send message Joined: 29 Aug 05 Posts: 147 |
"Let's say your DCF was at 1 (just because that's a) the default value and b) halfway between what it was on the other computers)." The numbers work if it is .75 day and 5 minutes / WU estimated as well. Then when the first one runs for a few hours the DCF is going to go through the roof - and the daemon is going to figure that there is a problem meeting deadlines and start attempting to meet as many of the current deadlines as possible. BTW, higher debt means that the project is owed CPU time. Lower debt means that the project owes the other projects CPU time - and if it is low enough, the project will be blocked from downloading work until the other projects have had a chance at the CPU. BOINC WIKI |
Send message Joined: 6 Feb 07 Posts: 19 |
John, I'm curious about a fact: SIMAP has a deadline of 8 days from when you download the WU. Now suppose that my cache is empty, that I've set the cache to 7 days, that my DCF is 0.6 for SIMAP, that I have no problems with debts on the project and that SIMAP is the only active project (with the others set to Suspended & No New Work). I ask you: is it possible to fill 7 days of cache on SIMAP with the new scheduler? IMHO, using the formula you posted on the SETI forum, report deadline - (min_queue + project switch interval + 1 day), there is no chance! Because if report_deadline is 8 and min_queue is 7, the scheduler will always be in deadline trouble! In theory, I shouldn't have any problem doing 7 days of work within 8 days... With the old version there were no problems doing this. Two days ago, with the new scheduler and 10 WUs waiting to run, I was unable to get more work on SIMAP. Why? Is my idea correct? |
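The formula quoted in this post can be tried out with a few lines of Python. This is a sketch of the rule exactly as worded in the thread, not the client's actual code: the function names are hypothetical, all times are in days, and the one-hour project switch interval is an assumed value.

```python
# Sketch of the "deadline trouble" rule as quoted in this thread.
# All times are in days measured from now.

ASSUMED_SWITCH_INTERVAL_DAYS = 1.0 / 24.0  # assumption: 60-minute switch interval


def computation_deadline(report_deadline, min_queue,
                         switch_interval=ASSUMED_SWITCH_INTERVAL_DAYS):
    """report deadline - (min_queue + project switch interval + 1 day)."""
    return report_deadline - (min_queue + switch_interval + 1.0)


def in_deadline_trouble(simulated_finish, report_deadline, min_queue,
                        switch_interval=ASSUMED_SWITCH_INTERVAL_DAYS):
    """True unless the simulated finish is under 90% of the time to the computation deadline."""
    limit = 0.9 * computation_deadline(report_deadline, min_queue, switch_interval)
    return not (simulated_finish < limit)


if __name__ == "__main__":
    # SIMAP's 8-day report deadline with different queue settings.
    for queue_days in (7.0, 6.0, 2.0):
        print(queue_days, computation_deadline(8.0, queue_days))
```

With an 8-day report deadline and a 7-day queue the computation deadline comes out slightly negative, so under this reading every task is in deadline trouble the moment it arrives, which matches the suspicion raised in the post.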
Send message Joined: 30 Oct 05 Posts: 1239 |
"BTW, higher debt means that the project is owed CPU time. Lower debt means that the project owes the other projects CPU time - and if it is low enough, the project will be blocked from downloading work until the other projects have had a chance at the CPU." Thanks John! I've written it down, so hopefully I'll actually remember.... Kathryn :o) |
Send message Joined: 19 Jan 07 Posts: 1179 |
Then navigate to your main BOINC directory. Open client_state.xml with Notepad. I have always set it to 1.0 with a single zero, and I guess a 1 alone will work too. |
Send message Joined: 19 Mar 07 Posts: 7 |
"I'm really upset! I hate this scheduler! Maybe it takes wrong decisions when only one project is running... I don't know!" It definitely does what it wants. Running one project does not seem to generate optimal results. It winds down and finishes the last workunit before getting more. One computer is running its last workunit, which isn't due until 3/23. It still won't get more. It only downloads 10-20 once it is empty, depending on what mood it is in. It doesn't really fill the queue again. You have to wait for the first one to download before crunching starts again. This behavio(u)r happens now in 5.8.8, 5.8.11, and 5.8.15. One could always back out and reinstall 5.4.11 to fill the queue, but that seems like a pain. 5.4.11 was also very good about getting some more workunits after a few finished, to keep the queue as a queue. Unfortunately, the new version only wants new workunits when the queue is empty. The computer needs to be connected at exactly that point in time. 5.8.15 also works worse with an old Win98 computer. I always know when the workunit is finished and I need another workunit because the computer freezes. |
Send message Joined: 29 Aug 05 Posts: 304 |
Often the behavior of seeing the queue run down to nothing and refilling is because of too large a queue setting. The queue needs to be set to no more than a quarter of the shortest deadline to make things work smoothly. On the SETI project that means about 1.0 days is the max that will work smoothly. There is a chart somewhere with exact numbers posted. edit: Here is the link to the table http://boinc-wiki.ath.cx/index.php?title=Work_Buffer BOINC WIKI BOINCing since 2002/12/8 |
Send message Joined: 29 Aug 05 Posts: 147 |
John, I'm curious about a fact: You are correct up to a point. The contract is that the user will connect to the projects at least once every X days. Now assume that the host is not connected to the projects for 7 days straight. The host is clearly always in deadline trouble because all work must be finished by the connection before the deadline so that the result can be uploaded and reported on time. This is why there is a computation deadline that is earlier than the report deadline. The computation should be finished at least one connection interval before the report is due - otherwise, the host might be disconnected from the time that the task is completed until after the report is due. BOINC WIKI |
Send message Joined: 29 Aug 05 Posts: 147 |
Often the behavior of seeing the queue run down to nothing and refilling is because of too large a queue setting. The queue needs to be set to no more than a quarter of the shortest deadline to make things work smoothly. On the SETI project that means about 1.0 days is the max that will work smoothly. There is a chart somewhere with exact numbers posted. A better calculation for the max to set a queue length to typically avoid EDF (which is not an error condition per se, but the attempt to avoid an error condition) would be (minimum report deadline - 1 day) / 3. Settings larger than (report deadline - 1 day) / 2 are just about guaranteed to download a batch and run it to completion before downloading the next batch. If you are really disconnected for nearly that amount of time, then you wouldn't notice very much, would you. BOINC WIKI |
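The rules of thumb in this post and the one it quotes can be put side by side in a short Python sketch. The function name and dictionary keys are hypothetical; the three formulas are the ones stated in the thread (a quarter of the shortest deadline, (deadline - 1 day) / 3, and (deadline - 1 day) / 2), with all values in days.

```python
# Sketch: queue-size rules of thumb from this thread, all values in days.


def queue_guidelines(min_report_deadline):
    """Return the suggested queue limits for the shortest report deadline."""
    usable = min_report_deadline - 1.0
    return {
        "quarter_rule": min_report_deadline / 4.0,  # earlier, rougher rule
        "typically_avoids_edf": usable / 3.0,       # (deadline - 1 day) / 3
        "batch_mode_above": usable / 2.0,           # (deadline - 1 day) / 2
    }


if __name__ == "__main__":
    # SIMAP's 8-day deadline, as mentioned earlier in the thread.
    for name, days in queue_guidelines(8.0).items():
        print(f"{name}: {days:.2f} days")
```

For SIMAP's 8-day deadline this gives roughly 2.3 days as the comfortable maximum and 3.5 days as the point where batch-by-batch behaviour is almost certain, which fits the experience reported at the top of the thread: a 2-day cache fetched work, a 7-day cache did not.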
Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.