Message boards : BOINC Manager : Beta BOINC 5.7.x/5.8.x discussion/problem report
Author | Message |
---|---|
Joined: 6 May 06 Posts: 287 |
The "Resource limit exceeded" error sucks. It turns otherwise productive machines into cripples, even when all settings are at 100%. The max memory usage setting doesn't take the swap partition into account. Maybe it's something the individual projects have to address; the main victims seem to be WCG and Rosetta. Reverting back to 5.4.11. BTW, these settings don't limit usage (i.e. hold usage to a max level, which is what most crunchers would take them to mean); they abort the WU as soon as the limit is exceeded. Instead of acting as a cruise control (limiting max speed), the journey is aborted as soon as the set speed is exceeded. CIC1=CC=C(C2=N[C@@H](CC(OC(C)(C)C)=O)C3=NN=C(C)N3C4=C2C(C)=C(C)S4)C=C1 |
Joined: 6 May 06 Posts: 287 |
"Resource .... is aborted." The problem is not with HDC WUs; as you say, the amount of RAM is detected beforehand and the WUs are not issued. It seems to be only FAAH WUs (from WCG) that fall prey to this behaviour. As I said, with settings at 100% (manually changed to these values after the first errors appeared) the errors still continue. This occurs on machines with 512 MB or less and up to 1 GB of swap partition, which ran without problems before upgrading to 5.8.x. Anyway, reverted back to 5.4.11. CIC1=CC=C(C2=N[C@@H](CC(OC(C)(C)C)=O)C3=NN=C(C)N3C4=C2C(C)=C(C)S4)C=C1 |
Joined: 14 Jan 07 Posts: 2 |
OK, I've been running betas here and there to test. Most worked fine with no problems, but the last two betas, 5.8.1 and 5.8.2, both showed some interesting work unit problems. I ran both betas and my work unit list would slowly but surely dry up, to the point where last night (after running 5.8.2 for a day or so) I had only 4 work units (and I run 20 or so projects). 5.8.1 had this problem too, but to a lesser extent. So after watching my work unit list dry up to nothing, I reinstalled 5.4.11 (stable) and lo and behold, within 10 minutes of the install work units FLOODED in. I don't know if the work unit "part" of the betas is limited so that you don't receive too many projects and overload the beta, or if there is truly something borked in the betas that doesn't allow work units to come in. Either way something is wrong, and I'm posting so that someone can fix the rather large problem or explain that it's the beta program limiting things. Thanks |
Joined: 30 Oct 05 Posts: 1239 |
"ok ive been running betas here and there to test. [...]" The changes in the CPU scheduler between the 5.4.x series and the 5.8.x series are quite drastic. This page describes the client scheduling policies. I've always used a very small "connect to" interval (.01 days), so I haven't noticed a large change in the number of WUs in my cache. But your mileage may vary. Kathryn :o) |
Joined: 20 Nov 06 Posts: 34 |
I also saw some strange behaviour after viewing the Simple GUI. This Linux 5.8.1 "problem" seems to be fixed in the 5.8.2 build. Also this one: the Message button closed the BOINC Manager, but BOINC (and the running app) remained active. ;-) |
Joined: 11 Nov 06 Posts: 12 |
I'm also seeing a problem with 5.8.2 running on Windows XP on one of my machines. I've set the profile for this machine to connect to the internet every 2.5 days (it is only running WCG work units). It pulls down about 12 hours of work rather than the amount it requested (261000 seconds) and it is completing all the work units in the buffer before requesting more work units. On my other machines running BOINC 5.8.2 (using other profiles), the machines pull down the correct amount of work and pull down more work units as they complete units and report back to WCG. I've tried letting all the work units complete on this machine and resetting, and this does not fix the problem. The DCF for this machine is around 1 (currently .97). This machine is a laptop that is suspended at night while the other machines are desktops running 24x7. I was not seeing this problem with 5.7.x and with 5.8.0/5.8.1. |
Joined: 14 Jan 07 Posts: 2 |
"The changes in the CPU scheduler between the 5.4.x series and the 5.8.x series is quite drastic." Yeah, I read that, and I can't really see where or why it would complete almost every WU in my list before even attempting to contact the projects and ask for more work. I looked over the logs before I reinstalled 5.4.11 and saw that over the course of the day it had barely polled any projects for work, but instead plowed through the WUs I had already downloaded. (BTW, small update: since reinstalling 5.4.11 I've had approximately 50-100 WUs download in the one day since 5.8.2, so apparently something is wonky.) "I'm also seeing a problem with 5.8.2 running on Windows XP on one of my machines. [...]" Exactly. I've tested most if not all of the beta versions since the last stable, and I think a few before that, and none (until the 5.7.xx series and up) have had this serious WU dropoff issue. There were times with 5.4.11 (and a few beta versions after it) when I was deluged with WUs, with 75+ in my queue at a time, so the 4 from 5.8.2 was quite shocking. |
Joined: 11 Nov 06 Posts: 12 |
Yes, I understood that. My concern is that it is not pulling down more work units until it has completed all the work units vs. pulling down work units as it reports completion so that I have about 2.5 days of work units always in the queue. |
Joined: 11 Nov 06 Posts: 12 |
One of the folks on a World Community Grid bulletin board shared how to set up cc_config to see why it was not pulling more work. I ran BOINC with logging on for a few minutes and it reported back the following: "2007-01-15 13:51:26 [World Community Grid] [rr_sim] result faah1231_d105n643_x2BPZ_01_2 finishes after 168168.786883 (23573.056801/0.140175) 2007-01-15 13:51:26 [World Community Grid] [rr_sim] result faah1231_d105n643_x2BPZ_01_2 misses deadline by 118092.291263". If this means that BOINC thinks this work unit is going to miss its deadline, something is not correct. The work unit has a deadline of Jan 20, is about to start in a few hours, and is estimated to need 6:30 hours to complete. What diagnostic info can I provide to help isolate the problem and get it corrected? |
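For anyone trying to read those two [rr_sim] lines, here is a minimal Python sketch that pulls the numbers out and checks how they relate. It assumes the pair in parentheses is "remaining CPU seconds / CPU rate the simulation gives the task"; that reading is a guess from the numbers, not something confirmed in this thread. The helper name `summarize` and the dictionary keys are made up for illustration.

```python
import re

# The two [rr_sim] lines quoted in the post above, copied verbatim.
LOG = [
    "2007-01-15 13:51:26 [World Community Grid] [rr_sim] result "
    "faah1231_d105n643_x2BPZ_01_2 finishes after 168168.786883 "
    "(23573.056801/0.140175)",
    "2007-01-15 13:51:26 [World Community Grid] [rr_sim] result "
    "faah1231_d105n643_x2BPZ_01_2 misses deadline by 118092.291263",
]

FINISH = re.compile(r"result (\S+) finishes after ([\d.]+) \(([\d.]+)/([\d.]+)\)")
MISS = re.compile(r"result (\S+) misses deadline by ([\d.]+)")


def summarize(lines):
    """Collect per-result numbers from rr_sim log lines (hypothetical helper)."""
    results = {}
    for line in lines:
        m = FINISH.search(line)
        if m:
            entry = results.setdefault(m.group(1), {})
            entry["finish"] = float(m.group(2))    # seconds from now
            entry["cpu_left"] = float(m.group(3))  # assumed: remaining CPU seconds
            entry["rate"] = float(m.group(4))      # assumed: CPU rate for the task
            continue
        m = MISS.search(line)
        if m:
            results.setdefault(m.group(1), {})["missed_by"] = float(m.group(2))
    for entry in results.values():
        if "finish" in entry and "missed_by" in entry:
            # Deadline the simulation is working with, in seconds from now.
            entry["deadline_in"] = entry["finish"] - entry["missed_by"]
    return results


if __name__ == "__main__":
    for name, e in summarize(LOG).items():
        print(name)
        print("  cpu_left / rate =", e["cpu_left"] / e["rate"], "s")
        print("  logged finish   =", e["finish"], "s")
        print("  deadline in     =", e["deadline_in"] / 3600.0, "hours")
```

The division reproduces the logged finish time to within rounding, so the long finish estimate comes from the small rate (about 0.14) rather than from the size of the work unit itself.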
Joined: 29 Aug 05 Posts: 15561 |
WCG only allows so many results to be downloaded at a time, regardless of the "connect to" setting. If your setting asks for more than WCG's limit, you only get up to that limit. So you can't ask for 10 days of work. And if you ask for 12 hours of work, they may well give you 10 hours or less, unlike other projects, which do give you 12+ hours. I don't think you can blame BOINC in this case. Blame WCG. :-) |
Joined: 12 Jul 06 Posts: 35 |
Another problem with 5.8.2: It doesn't forget host-specific resource shares. 5.8 supports host-specific resource shares and so does BAM. BAM also lets you remove them, which removes the setting from the account manager reply. But 5.8.2 doesn't remove the ams_resource_share setting from the client_state.xml file. If you remove a host resource share in BAM, 5.8.2 goes back to using the project resource share instead, but it leaves the ams_resource_share setting in the client_state file. When the client is next restarted, it goes back to using the ams_resource_share instead. |
Joined: 11 Nov 06 Posts: 12 |
The problem turned out to be that cpu_efficiency had been computed as below .25 on this machine.....not sure what had caused this but it is steadily rising so hopefully the problem will correct itself. |
Joined: 11 Nov 06 Posts: 12 |
It turns out the problem was a beta version of another program that had a CPU utilization bug...it was consuming a fair amount of CPU in the background and I did not notice it. cpu_efficiency has risen to .38 and scheduling is starting to work more normally. |
Joined: 30 Aug 05 Posts: 58 |
Problem in 5.8.2-gnu: Boinc tries to download applications of suspended projects like the Spinup project (CPDN Spinup is now closed and the apps aren't on the server anymore, but I'm still attached to it and the project is suspended : Boinc wants to download the apps) Tue 16 Jan 2007 20:15:14 CET||file projects/climateapps1.oucs.ox.ac.uk_hadcm3spinup/hadcm3spinup_4.09_i686-pc-linux-gnu not found Tue 16 Jan 2007 20:15:14 CET||file projects/climateapps1.oucs.ox.ac.uk_hadcm3spinup/hadcm3spinup_4.09_i686-pc-linux-gnu.so not found Tue 16 Jan 2007 20:15:14 CET||file projects/climateapps1.oucs.ox.ac.uk_hadcm3spinup/hadcm3spinupse_4.09_i686-pc-linux-gnu.zip not found ............... Tue 16 Jan 2007 20:29:43 CET|CPDN HadCM3 Spinup|Backing off 41 minutes and 21 seconds on download of file hadcm3spinup_4.09_i686-pc-linux-gnu.so Tue 16 Jan 2007 20:30:04 CET||[http_debug] HTTP_OP::init_get(): http://climateapps1.oucs.ox.ac.uk/hadcm3spinup/download/hadcm3spinup_4.09_i686-pc-linux-gnu Tue 16 Jan 2007 20:30:04 CET|CPDN HadCM3 Spinup|[file_xfer] Started download of file hadcm3spinup_4.09_i686-pc-linux-gnu Tue 16 Jan 2007 20:30:14 CET|CPDN HadCM3 Spinup|[file_xfer] Temporarily failed download of hadcm3spinup_4.09_i686-pc-linux-gnu: file not found Tue 16 Jan 2007 20:30:14 CET|CPDN HadCM3 Spinup|Backing off 45 minutes and 6 seconds on download of file hadcm3spinup_4.09_i686-pc-linux-gnu Tue 16 Jan 2007 20:30:19 CET||[http_debug] HTTP_OP::init_get(): http://climateapps1.oucs.ox.ac.uk/hadcm3spinup/download/hadcm3spinupse_4.09_i686-pc-linux-gnu.zip Tue 16 Jan 2007 20:30:19 CET|CPDN HadCM3 Spinup|[file_xfer] Started download of file hadcm3spinupse_4.09_i686-pc-linux-gnu.zip Tue 16 Jan 2007 20:30:20 CET|CPDN HadCM3 Spinup|[file_xfer] Temporarily failed download of hadcm3spinupse_4.09_i686-pc-linux-gnu.zip: file not found Tue 16 Jan 2007 20:30:20 CET|CPDN HadCM3 Spinup|Backing off 42 minutes and 54 seconds on download of file hadcm3spinupse_4.09_i686-pc-linux-gnu.zip |
Joined: 12 Jul 06 Posts: 35 |
Answer from David Anderson: Hi, just tested this with 5.8.3, and although the problem with the Advanced View is fixed when connecting to a remote 5.4.11, the issue with the Simple View is not - it still gives an error and shows no tasks. |
Joined: 5 Mar 06 Posts: 16 |
JM7 fixed the bug that attempted to download 10 days of work for all 10 projects attached (100 days' worth of work with the typical deadline being about 2 weeks: not a good idea). I have no problem with the other fixes included with BOINC 5.8.x; the improved scheduler seems to work much better than 5.4.11, but the displayed estimated time to completion (TOC) needs to be fixed. Regardless of the changes in the scheduler algorithm, the displayed TOC (re: DCF) ought to adjust itself to match the actual runtimes even if the scheduler uses a different value. I do run multiple projects; typically three at a time are active now. Are you suggesting that the displayed TOC (re: DCF) will adjust itself (down) to the actual runtime if I were running only one project? The theory being that the scheduler is adjusting the TOC (re: DCF) to some arbitrary value so that I don't download too much work, assuming that I run two or more projects. Even if this is what is happening, the displayed TOC ought to converge to the actual runtime. Regarding the scheduler and the actual amount of time that BOINC is allowed to run: I often shut down BOINC for part of the day (4 to 12 hours) when I cannot have BOINC affecting other work that the computer is doing. This seems to have absolutely no effect on the displayed TOC (re: DCF) or the amount of work that BOINC is allowed to download. According to what you've said, I might expect the scheduler to download less work, because the scheduler knows (or thinks) that I'm only going to allow BOINC to run part-time. I have not seen this effect at all; the amount of work queued is more or less equal to my connect time as determined by adding up the displayed TOCs. Currently the actual runtime is ~0.56 of the displayed TOC. So, for two active projects at a connect time of 0.5 days, I can queue about 0.25 days of WUs for each project. 
I'm sure that is what the scheduler intends to do, and I think that's a nice feature, as it prevents a project from supplying too much work for my computer. In any case I always seem to have an amount of queued WUs equal to about half my connect time, and that amount is exactly equal to the amount that would be allowed if computed from the displayed TOC. However, I would still like the displayed TOC to converge on the actual runtime. I'll have to look carefully at the "Boinc allowed to run" time and see if that's what the problem is. I don't think that's the problem with the displayed TOC though. According to the theory (I don't know if it's true) that the TOC is a reflection of the number of projects allowed to run, if I have three projects running should I see the TOC increase to approximately 3 times the actual runtime? I don't think it will, and I don't think it does now. In any case the displayed TOC should converge to the actual runtime, regardless of what the scheduler does with the values. I'll have to try running just one project (suspend the others) and see what happens to the displayed TOC. Also, running BOINC for 24 hours continuously for a couple of days at a time has absolutely no effect on the displayed TOC, as you might expect it to if the DCF were adjusted with regard to the "Boinc allowed to run" value. I do see how "Boinc allowed to run" would have an effect on EDF and deadline times; I don't think that behavior has changed. Thanks for putting up with my observations. I just think that once this BOINC version is let loose on the general community, most users are going to see this TOC disparity and wonder what's going on, just as I've done. Then again there is blissful ignorance. |
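The arithmetic in the post above can be written out as a short Python sketch. It assumes the poster's own reading, namely that the client fills the queue until the displayed estimates add up to the connect interval, split evenly across active projects; that is not confirmed anywhere in this thread. The function name `queue_estimate` is hypothetical.

```python
def queue_estimate(connect_days, n_projects, runtime_ratio):
    """Return (displayed days per project, actual days per project, actual days total).

    runtime_ratio is actual runtime divided by the displayed estimate
    (about 0.56 in the post above).
    """
    displayed_per_project = connect_days / n_projects
    actual_per_project = displayed_per_project * runtime_ratio
    return displayed_per_project, actual_per_project, actual_per_project * n_projects


if __name__ == "__main__":
    displayed, actual, total = queue_estimate(0.5, 2, 0.56)
    print("displayed per project (days):", displayed)
    print("actual per project (days):   ", actual)
    print("actual total (days):         ", total)
```

With the numbers from the post (0.5 days, two projects, ratio 0.56), the real queued work comes to roughly 0.28 days, which matches the "about half my connect time" observation.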
Joined: 11 Nov 06 Posts: 12 |
I am running BOINC 5.8.3 for Windows on a Windows XP machine with "keep applications in memory when preempted" set to no in the profile. The running WCG work unit was preempted by another WCG work unit, and the preempted work unit is still loaded in memory according to the Windows task manager. Is this a bug in BOINC 5.8.3, or something that is being overridden by the WCG work unit setup and therefore needs to be reported on the WCG forum? |
Joined: 30 Oct 05 Posts: 1239 |
I think it is that the 5.8.x client will no longer swap out a result that hasn't checkpointed yet. I think I ran into that when I was testing memory preferences. If you want to see when apps checkpoint, you'll need to make a cc_config.xml file and drop it in the BOINC directory after shutting down BOINC. I believe the flag you need to set is task_debug. Then restart BOINC. You'll see in the startup messages... 1/17/2007 9:42:12 PM||Starting BOINC client version 5.8.3 for windows_intelx86 1/17/2007 9:42:12 PM||log flags: task, file_xfer, sched_ops, cpu_sched, task_debug 1/17/2007 9:42:12 PM||Libraries: libcurl/7.16.0 OpenSSL/0.9.8a zlib/1.2.3 1/17/2007 9:42:12 PM||Data directory: C:\\Program Files\\BOINC 1/17/2007 9:42:12 PM||Processor: 1 GenuineIntel Intel(R) Pentium(R) 4 CPU 2.80GHz 1/17/2007 9:42:12 PM||Memory: 446.98 MB physical, 1.78 GB virtual 1/17/2007 9:42:12 PM||Disk: 55.88 GB total, 39.94 GB free You'll see stuff like this in your logs... 1/17/2007 10:32:59 PM|QMC@HOME|[task_debug] result three_bench22a_jsch2005s22.518_0 checkpointed 1/17/2007 10:32:59 PM|DepSpid|[task_debug] result spider_24040_0 checkpointed 1/17/2007 10:33:38 PM|QMC@HOME|[task_debug] result three_bench22a_jsch2005s22.518_0 checkpointed 1/17/2007 10:33:39 PM|DepSpid|[task_debug] result spider_24039_0 checkpointed Edit: If you need help setting up that file, let us know. I have one that I can post. Kathryn :o) |
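In case it helps, here is a minimal Python sketch that generates a cc_config.xml with the task_debug flag turned on. The layout (a `cc_config` root with a `log_flags` section, each flag set to 1) is my understanding of the file format and should be checked against the documentation for your client version; the helper name `build_cc_config` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags):
    """Build cc_config.xml text with each named log flag set to 1.

    Assumed layout: <cc_config><log_flags><FLAG>1</FLAG>...</log_flags></cc_config>
    """
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for flag in flags:
        ET.SubElement(log_flags, flag).text = "1"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print the XML; save it as cc_config.xml in the BOINC directory to use it.
    print(build_cc_config(["task_debug"]))
```

Printing rather than writing the file keeps the sketch from touching a live BOINC directory; redirect the output to cc_config.xml once you are happy with it.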
Joined: 21 Jun 06 Posts: 156 |
@KSMarksPsych: If you want to use the cc_config.xml file you don't need to restart BOINC (>5.8.0). There is a good feature in BOINC, "read config file": you can edit the XML while BOINC is running and activate the feature again and again ;) |
Joined: 30 Oct 05 Posts: 1239 |
"@KSMarksPsych: If you want to use the cc_config.xml file you don't need to restart BOINC (>5.8.0). There is a good feature in BOINC, "read config file": you can edit the XML while BOINC is running and activate the feature again and again ;)" Well, that's a nifty little feature! I shall have to tuck that away into the depths of my brain. :) Kathryn :o) |
Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.