Message boards : The Lounge : The Seti is Slumbering Cafe
Author | Message |
---|---|
Joined: 1 Oct 15 Posts: 394 |
Yikes. This is to the point we may have to find a virgin and throw them into a volcano! I hear there is one - a volcano - available now. I had one newer cruncher that was virgin to Einstein@home, which has now been sacrificed. Given the near-volcanic heat produced by GPUs on Einstein, I'm hoping this will perhaps suffice. |
Joined: 27 Jun 08 Posts: 641 |
The extended outage allowed me to notice that a 4 core (8 thread) CPU cannot feed 9 GPUs running Einstein. I had to configure for 4 concurrent Einstein and 5 concurrent Milkyway, and in addition had to scrap the "64" spoofed GPUs as that fetched too many Einstein tasks. I had the resource share set to 0 but got way more than 64 work units. I should have gotten 1 for each GPU, but I am looking at 110 on one mining system and 241 on another. The resource share for Einstein was 0 on both, so something is not right. |
Joined: 10 May 07 Posts: 1443 |
13 plus hours of outrage makes things a DOUBLE OUTRAGE! Time to break out the heavy stuff in celebration of the DOUBLE OUTRAGE: Linie Aquavit from the old country. Anyone care for a shot or two? |
Joined: 1 Oct 15 Posts: 394 |
The extended outage allowed me to notice that a 4 core (8 thread) CPU cannot feed 9 GPUs running Einstein. I had to configure for 4 concurrent Einstein and 5 concurrent Milkyway, and in addition had to scrap the "64" spoofed GPUs as that fetched too many Einstein tasks. I had the resource share set to 0 but got way more than 64 work units. I should have gotten 1 for each GPU, but I am looking at 110 on one mining system and 241 on another. The resource share for Einstein was 0 on both, so something is not right. Agreed. Something got really broken in 7.16.3 on resource sharing and scheduling. I've been fighting this for a while. The best example is a case where I'm trying to clear out the Einstein queue after SETI resumes, using a resource share of 1, with NNT set, and max concurrent set to less than all physical GPUs. What now happens, and didn't on 7.14.2, is that when max_concurrent # of GPUs are engaged, the other GPUs will sit idle rather than process SETI, apparently because of a resource share debt. My contention is that GPUs should never sit idle, regardless of any perceived debt. Apparently, the software feels otherwise. I'd be interested to see if you experience anything like this. |
Joined: 23 Feb 08 Posts: 2493 |
Close to 13 hours downtime now. |
Joined: 2 Jan 18 Posts: 170 |
I don't know what method or program you use to spoof the GPU count, but I can tell you for sure that max concurrent and the scheduler work totally differently (not broken) in BOINC 7.16 than in previous versions. That is why we don't use it with the spoofed client we use. Instead, we manage the number of active cores/threads with CPU usage. BTW I will remain at the outrage pub for about half an hour, as I need to work early tomorrow; I hope that will be enough to satisfy the SETI gods and bring the servers back to life. I tried to find a virgin here to sacrifice at the volcano, and that was impossible. |
Joined: 18 Oct 14 Posts: 1487 |
If things are not fixed soon Einstein here I come. |
Joined: 27 Jun 08 Posts: 641 |
My contention is that GPUs should never sit idle, regardless of any perceived debt. Apparently, the software feels otherwise. Exactly what I have been looking at for the last 2 hours and trying to figure out. I had 4 GPUs idle that should have been running Einstein, and the other 5 GPUs are running Milkyway. This system normally runs SETI and GPUgrid at 100% and Einstein at 0%. I added Milkyway at 0 and after a while the Einstein GPUs went idle. The work count in excess of 64 seems to be "lost work units", and I am guessing that number is not used when checking the GPU count. Both mining systems had a lot of "lost work units". However, I cannot account for something like 300 lost units. I only run Einstein when SETI is offline. I clicked on Einstein's "www host schedule log", which duplicates the info shown in the event viewer: "...lost tasks..." However, I also saw a strange message "..[CRITICAL] … two instances of the scheduler running.." or something to that wording. I am not running two instances of BOINC. The so-called "schedule" is an Einstein app that (my understanding) arranges to download database items, not just project work units. There is no reason for the 4 GPUs to be idle. I aborted the Milkyway tasks as I didn't want them stopping Einstein from running. Einstein then started up and, !INCREDIBLY! I got 3 GPUgrid work units. It has probably been a week or more since any showed up. 7 of the 9 GPUs are at 100% utilization, but I have 2 idle due to the CPU not having enough threads. |
Joined: 21 Mar 09 Posts: 33 |
If things are not fixed soon Einstein here I come. I am running some Collatz for now. |
Joined: 1 Oct 15 Posts: 394 |
Exactly what I have been looking at in the last 2 hours and trying to figure out. I had 4 GPU idle that should have been running Einstein and the other 5 GPUs are running milkyway. ... There is no reason for the 4 GPUs to be idle. Agreed. Well, at least now you know it isn't just you. I would fall back to 7.14.2, but the "finish file present too long" error was becoming annoying, even ignoring other factors. Thanks for the confirmation. |
Joined: 27 Jun 08 Posts: 641 |
I don't know what method or program you use to spoof the GPU count, but I can tell you for sure that max concurrent and the scheduler work totally differently (not broken) in BOINC 7.16 than in previous versions. That is why we don't use it with the spoofed client we use. Instead, we manage the number of active cores/threads with CPU usage. I made a change to my program as I had been applying the 64 to all projects. I am now using the project app_config and setting the # of GPUs depending on the project. Since this system has 9 GPUs, the config below just limits the count to 4 instead of 9. Seti still has 64 to get through the off-line time. However, the 4000 limit I use did not get me through the 13+ hours. root@h110btc:/var/lib/boinc/projects/einstein.phys.uwm.edu# cat app_config.xml <app_config> <app> <name>einstein_O2MDF</name> <max_concurrent>4</max_concurrent> </app> <spoofedgpus>4</spoofedgpus> </app_config> I set the value in cs_scheduler: // update hardware info, and write host info // host_info.get_host_info(false); set_ncpus(); iGPU = (gstate.spoof_gpus == -1) ? 0 : gstate.spoof_gpus; if(p->app_configs.spoofedgpus > 0) iGPU = p->app_configs.spoofedgpus; host_info.write(mf, !cc_config.suppress_net_info, false, iGPU); |
Joined: 1 Oct 15 Posts: 394 |
...(not broken)... I would suggest that any situation where GPUs sit idle when there is work they could be performing, simply because it's not "the right work", indicates "broke". :) |
Joined: 1 Oct 15 Posts: 394 |
15 hours. Actually, it's closer to 21 hours now, if you consider the fact that it basically quit handing out work ~6 hrs before maintenance began ~0500 PST. |
Joined: 17 Nov 16 Posts: 890 |
However, I also saw a strange message "..[CRITICAL] … two instances of the scheduler running.." or something to that wording. I am not running two instances of Boinc. The so-called "schedule" is an Einstein app that (my understanding) arranges to download database items, not just project work units. That would be the output of the Einstein "locality scheduling". They run very old server software that uses very different (from current BOINC) schedulers. The log output from Einstein can be pages' worth of reporting on what is possible and not possible for various work units and the size of your cache. So it really messes up other projects' scheduling based on REC. |
Joined: 17 Nov 16 Posts: 890 |
I would suggest any situation where GPUs will sit idle when there is work they could be performing simply because it's not "the right work" indicates "broke". :) But that doesn't fit David Anderson's definition of what is idle. This is caused by the changes in 7.16.3 that fixed the issue with max_concurrent and exclude_gpu. |
Joined: 18 Oct 14 Posts: 1487 |
Collatz has always made me feel stupid. |
Joined: 21 Mar 09 Posts: 33 |
Collatz has always made me feel stupid. 41 valid tasks and an RAC of 114k or so. |
Joined: 27 Jun 08 Posts: 641 |
Collatz has always made me feel stupid. How about 320,000 credits every 5 and 1/2 seconds? http://www.ukboincteam.org.uk/newforum/viewtopic.php?t=6221 The project is good for credit points only and ranks up there with bitcoin utopia. No scientific value whatsoever, but that is just my honest opinion, worth about 2c. I did run up a lot of points on it and also on bitcoin utopia, but I could have been finding solutions for medical problems over at WCG or doing other more useful work. Again, just IMHO, but I didn't know better. |
Joined: 18 Oct 14 Posts: 1487 |
+1 |
Joined: 10 May 07 Posts: 1443 |
Anyone still sober and/or still awake? It is almost 2130 in the evening at Berkeley and SETI is still down! This Super Duper Massive Grand Mal Outrage is going to be one to remember for the recent history books. |
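
For anyone following the app_config discussion earlier in the thread, the per-project GPU-count override described there can be sketched as a small stand-alone function. This is only an illustration of the selection logic in the posted snippet, not BOINC source: `reported_gpu_count`, its parameter names, and `check_examples` are hypothetical, and `spoofedgpus` is a field of that poster's privately modified client rather than a stock BOINC option.

```cpp
#include <cassert>

// Hypothetical sketch of the selection logic from the posted cs_scheduler
// change. Names are illustrative only and do not exist in stock BOINC.
//
//   global_spoof_gpus:  client-wide spoofed GPU count, -1 meaning "not set"
//   project_spoof_gpus: per-project value read from app_config.xml,
//                       0 or less meaning "not set"
int reported_gpu_count(int global_spoof_gpus, int project_spoof_gpus) {
    // Start from the client-wide value; -1 maps to 0, as in the post.
    int count = (global_spoof_gpus == -1) ? 0 : global_spoof_gpus;

    // A positive per-project value takes precedence over the global one.
    if (project_spoof_gpus > 0) {
        count = project_spoof_gpus;
    }
    return count;
}

// Example checks using the numbers mentioned in the thread (64 and 4).
void check_examples() {
    assert(reported_gpu_count(64, 4) == 4);   // project override applies
    assert(reported_gpu_count(64, 0) == 64);  // no override: global value
    assert(reported_gpu_count(-1, 0) == 0);   // nothing set: 0 is passed on
}
```

With the values from the post, a project whose app_config carries a `spoofedgpus` of 4 would report 4, while a project without that entry would keep the client-wide 64.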
Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.