Thread 'The Seti is Slumbering Cafe'

Jimbocous
Joined: 1 Oct 15
Posts: 394
United States
Message 95069 - Posted: 15 Jan 2020, 1:02:06 UTC - in response to Message 95066.  

Yikes. This is to the point we may have to find a virgin and throw them into a volcano! I hear there is one - a volcano - available now.

I had one newer cruncher that was virgin to Einstein@home, which has now been sacrificed.
Given the near-volcanic heat produced by GPUs on Einstein, I'm hoping this will perhaps suffice.

Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 95072 - Posted: 15 Jan 2020, 2:11:58 UTC

The extended outage allowed me to notice that a 4-core (8-thread) CPU cannot feed 9 GPUs running Einstein. I had to configure for 4 concurrent Einstein and 5 concurrent Milkyway tasks, and in addition had to scrap the "64" spoofed GPUs, as that fetched too many Einstein tasks. I had the resource share set to 0 but got way more than 64 work units. I should have gotten 1 for each GPU, but I am looking at 110 on one mining system and 241 on another. The resource share for Einstein was 0 on both, so something is not right.
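
A rough way to reason about the CPU limit described above, as a minimal sketch only. The helper `feedable_gpu_tasks` and its parameters are hypothetical names, not anything in BOINC, and the model assumes every GPU task reserves the same fixed share of one CPU thread, which real applications may not do.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper, not part of BOINC: estimates how many GPU tasks
// a CPU can keep fed, ASSUMING each GPU task reserves a fixed share of
// one CPU thread (cpu_threads_per_task).
int feedable_gpu_tasks(int cpu_threads, double cpu_threads_per_task, int gpu_count) {
    if (cpu_threads_per_task <= 0.0) {
        return gpu_count;  // no CPU reservation: every GPU can run
    }
    // Number of tasks the CPU threads can support, rounded down.
    int by_cpu = static_cast<int>(std::floor(cpu_threads / cpu_threads_per_task));
    // Never report more tasks than there are GPUs.
    return std::min(by_cpu, gpu_count);
}
```

Under this model, 8 threads with one full thread reserved per task cap a 9-GPU host at 8 running tasks. The real figure depends on each application's CPU reservation and on whatever else the CPU is doing.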

Dr Who Fan
Joined: 10 May 07
Posts: 1443
United States
Message 95073 - Posted: 15 Jan 2020, 2:17:07 UTC

13 plus hours of outrage makes things a DOUBLE OUTRAGE!

Time to break out the heavy stuff in celebration of the DOUBLE OUTRAGE: Linie Aquavit from the old country.

Anyone care for a shot or two?

Jimbocous
Joined: 1 Oct 15
Posts: 394
United States
Message 95074 - Posted: 15 Jan 2020, 2:20:34 UTC - in response to Message 95072.  

The extended outage allowed me to notice that a 4-core (8-thread) CPU cannot feed 9 GPUs running Einstein. I had to configure for 4 concurrent Einstein and 5 concurrent Milkyway tasks, and in addition had to scrap the "64" spoofed GPUs, as that fetched too many Einstein tasks. I had the resource share set to 0 but got way more than 64 work units. I should have gotten 1 for each GPU, but I am looking at 110 on one mining system and 241 on another. The resource share for Einstein was 0 on both, so something is not right.

Agreed. Something got really broken in 7.16.3 in resource sharing and scheduling. I've been fighting this for a while. The best example is a case where I'm trying to clear out the Einstein queue after SETI resumes, using a resource share of 1, NNT (No New Tasks) set, and max concurrent set to less than the number of physical GPUs. What now happens, and didn't on 7.14.2, is that when max_concurrent GPUs are engaged, the other GPUs will sit idle rather than process SETI, apparently because of a resource share debt. My contention is that GPUs should never sit idle, regardless of any perceived debt. Apparently, the software feels otherwise.
I'd be interested to see if you experience anything like this.
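
The "GPUs should never sit idle" contention above can be illustrated with a toy model. This is a minimal sketch and not BOINC's scheduler: the `Project` struct, the `assign_gpus` function and the fixed priority order are all hypothetical, and resource-share debt is deliberately left out.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy model only: this is NOT BOINC's scheduler. All names are hypothetical.
struct Project {
    std::string name;
    int runnable_tasks;  // tasks ready to run
    int max_concurrent;  // cap on simultaneous tasks; <= 0 means no cap
};

// Assign one task per GPU, visiting projects in the given priority order.
// Work-conserving: a GPU is left "idle" only when no project has a
// runnable task left under its max_concurrent cap.
std::vector<std::string> assign_gpus(std::vector<Project> projects, int gpu_count) {
    std::vector<std::string> assignment;
    std::vector<int> running(projects.size(), 0);
    for (int gpu = 0; gpu < gpu_count; ++gpu) {
        std::string chosen = "idle";
        for (std::size_t i = 0; i < projects.size(); ++i) {
            Project& p = projects[i];
            bool under_cap = p.max_concurrent <= 0 || running[i] < p.max_concurrent;
            if (p.runnable_tasks > 0 && under_cap) {
                --p.runnable_tasks;
                ++running[i];
                chosen = p.name;
                break;
            }
        }
        assignment.push_back(chosen);
    }
    return assignment;
}
```

In this model, a host with 4 GPUs, Einstein capped at 2 concurrent tasks and SETI uncapped ends up with two GPUs on each project. A GPU only idles when every project is out of work or at its cap.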

Gary Charpentier
Joined: 23 Feb 08
Posts: 2493
United States
Message 95077 - Posted: 15 Jan 2020, 2:31:14 UTC - in response to Message 95071.  

Close to 13 hours downtime now.
Plenty of active volcanoes around, but virgins.....hmm....

juan BFP
Joined: 2 Jan 18
Posts: 170
Panama
Message 95078 - Posted: 15 Jan 2020, 2:32:19 UTC
Last modified: 15 Jan 2020, 2:35:38 UTC

I do not know what method or program you use to spoof the GPU count, but I can tell you for sure that max concurrent and the scheduler work totally differently (not broken) on BOINC 7.16 than on previous versions. That is why we do not use it with the spoofed client we use. Instead, we manage the number of active cores/threads with CPU usage.

BTW, I will remain at the outrage pub for about half an hour; I need to work early tomorrow. I hope that will be enough to satisfy the SETI gods and bring the servers back to life. I tried to find a virgin here to sacrifice at the volcano, and that was impossible.

betreger
Volunteer tester
Help desk expert
Joined: 18 Oct 14
Posts: 1487
United States
Message 95079 - Posted: 15 Jan 2020, 2:44:51 UTC - in response to Message 95077.  

If things are not fixed soon Einstein here I come.

Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 95080 - Posted: 15 Jan 2020, 3:00:27 UTC - in response to Message 95074.  
Last modified: 15 Jan 2020, 3:02:02 UTC

My contention is that GPUs should never sit idle, regardless of any perceived debt. Apparently, the software feels otherwise.
I'd be interested to see if you experience anything like this.


Exactly what I have been looking at for the last 2 hours and trying to figure out. I had 4 GPUs idle that should have been running Einstein, and the other 5 GPUs are running Milkyway. This system normally runs SETI and GPUgrid at 100% and Einstein at 0%. I added Milkyway at 0%, and after a while the Einstein GPUs went idle.

The work count in excess of 64 seems to be "lost work units", and I am guessing that number is not used when checking the GPU count. Both mining systems had a lot of "lost work units". However, I cannot account for something like 300 lost units. I only run Einstein when SETI is offline. I clicked on Einstein's "www host schedule log", which duplicates the info shown in the event viewer: "...lost tasks..." However, I also saw a strange message "..[CRITICAL] … two instances of the scheduler running.." or something to that wording. I am not running two instances of BOINC. The so-called "schedule" is an Einstein app that (my understanding) arranges to download database items, not just project work units.

There is no reason for the 4 GPUs to be idle. I aborted the Milkyway tasks, as I didn't want them stopping Einstein from running. Einstein then started up and, !INCREDIBLY!, I got 3 GPUgrid work units. It has probably been a week or more since any showed up. 7 of the 9 GPUs are at 100% utilization, but I have 2 idle due to the CPU not having enough threads.

arkayn
Joined: 21 Mar 09
Posts: 33
United States
Message 95081 - Posted: 15 Jan 2020, 3:02:57 UTC - in response to Message 95079.  

If things are not fixed soon Einstein here I come.


I am running some Collatz for now.

Jimbocous
Joined: 1 Oct 15
Posts: 394
United States
Message 95083 - Posted: 15 Jan 2020, 3:16:54 UTC - in response to Message 95080.  

Exactly what I have been looking at in the last 2 hours and trying to figure out. I had 4 GPUs idle that should have been running Einstein and the other 5 GPUs are running Milkyway. ... There is no reason for the 4 GPUs to be idle. . .
Agreed. Well, at least now you know it isn't just you.
I would fall back to 7.14.2, but the "finish file present too long" error was becoming annoying, even ignoring other factors.
Thanks for the confirmation.

Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 95084 - Posted: 15 Jan 2020, 3:18:53 UTC - in response to Message 95078.  
Last modified: 15 Jan 2020, 3:19:37 UTC

I do not know what method or program you use to spoof the GPU count, but I can tell you for sure that max concurrent and the scheduler work totally differently (not broken) on BOINC 7.16 than on previous versions. That is why we do not use it with the spoofed client we use. Instead, we manage the number of active cores/threads with CPU usage.

BTW, I will remain at the outrage pub for about half an hour; I need to work early tomorrow. I hope that will be enough to satisfy the SETI gods and bring the servers back to life. I tried to find a virgin here to sacrifice at the volcano, and that was impossible.


I made a change to my program, as I had been applying the 64 to all projects. I am now using the project app_config and setting the number of GPUs depending on the project. Since this system has 9 GPUs, the config below just limits the count to 4 instead of 9. SETI still has 64 to get through the offline time. However, the 4000 limit I use did not get me through the 13+ hours.
root@h110btc:/var/lib/boinc/projects/einstein.phys.uwm.edu# cat app_config.xml
<app_config>
 <app>
  <name>einstein_O2MDF</name>
  <max_concurrent>4</max_concurrent>
 </app>
 <spoofedgpus>4</spoofedgpus>
</app_config>

I set the value in cs_scheduler:
    // update hardware info, and write host info
    //
    host_info.get_host_info(false);
    set_ncpus();
    // start from the client-wide spoofed GPU count (-1 is treated as 0)
    iGPU = (gstate.spoof_gpus == -1) ? 0 : gstate.spoof_gpus;
    // a per-project <spoofedgpus> value from app_config.xml takes precedence
    if (p->app_configs.spoofedgpus > 0) iGPU = p->app_configs.spoofedgpus;
    host_info.write(mf, !cc_config.suppress_net_info, false, iGPU);
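
The precedence in the snippet above can be isolated into a small standalone function for testing. This is a minimal sketch and not the actual patch: `effective_spoofed_gpus` and its parameter names are hypothetical, standing in for `gstate.spoof_gpus` and `p->app_configs.spoofedgpus`.

```cpp
// Hypothetical helper mirroring the precedence in the snippet above:
// a per-project value (> 0) wins over the client-wide value, and
// -1 for the client-wide value is treated as 0.
int effective_spoofed_gpus(int global_spoof_gpus, int project_spoofed_gpus) {
    int gpus = (global_spoof_gpus == -1) ? 0 : global_spoof_gpus;
    if (project_spoofed_gpus > 0) {
        gpus = project_spoofed_gpus;
    }
    return gpus;
}
```

With a client-wide value of 64 and a project value of 4, the function returns 4. A project that leaves the tag unset (0) falls back to the client-wide 64.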

Jimbocous
Joined: 1 Oct 15
Posts: 394
United States
Message 95085 - Posted: 15 Jan 2020, 3:19:55 UTC - in response to Message 95078.  
Last modified: 15 Jan 2020, 3:20:19 UTC

...(not broken)...
I would suggest any situation where GPUs will sit idle when there is work they could be performing simply because it's not "the right work" indicates "broke". :)

Jimbocous
Joined: 1 Oct 15
Posts: 394
United States
Message 95087 - Posted: 15 Jan 2020, 3:25:48 UTC - in response to Message 95086.  

15 hours.
A fantastic outrage :-)

Actually, it's closer to 21 hours now, if you consider the fact that it basically quit handing out work ~6 hrs before maintenance began ~0500 PST.

Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 890
United States
Message 95088 - Posted: 15 Jan 2020, 3:29:21 UTC - in response to Message 95080.  

However, I also saw a strange message "..[CRITICAL] … two instances of the scheduler running.." or something to that wording. I am not running two instances of BOINC. The so-called "schedule" is an Einstein app that (my understanding) arranges to download database items, not just project work units.

That would be the output of the Einstein "locality scheduling". They run very old server software that uses very different (from current BOINC) schedulers. The log output from Einstein can run to pages, reporting what is and is not possible for various work units and the size of your cache. So it really messes up other projects' scheduling based on REC (recent estimated credit).

Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 890
United States
Message 95089 - Posted: 15 Jan 2020, 3:32:22 UTC - in response to Message 95085.  

I would suggest any situation where GPUs will sit idle when there is work they could be performing simply because it's not "the right work" indicates "broke". :)

But that doesn't fit David Anderson's definition of what is idle. This is caused by the changes in 7.16.3 that fixed the issue with max_concurrent and exclude_gpu.

betreger
Volunteer tester
Help desk expert
Joined: 18 Oct 14
Posts: 1487
United States
Message 95090 - Posted: 15 Jan 2020, 3:45:26 UTC - in response to Message 95081.  

Collatz has always made me feel stupid.

arkayn
Joined: 21 Mar 09
Posts: 33
United States
Message 95091 - Posted: 15 Jan 2020, 4:12:04 UTC - in response to Message 95090.  

Collatz has always made me feel stupid.


41 valid tasks and an RAC of 114k or so.

Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 95092 - Posted: 15 Jan 2020, 4:26:56 UTC - in response to Message 95091.  

Collatz has always made me feel stupid.


41 valid tasks and an RAC of 114k or so.


How about 320,000 credits every 5 and 1/2 seconds?

http://www.ukboincteam.org.uk/newforum/viewtopic.php?t=6221

The project is good for credit points only and ranks up there with Bitcoin Utopia. No scientific value whatsoever, but that is just my honest opinion, worth about 2c. I did run up a lot of points on it and also on Bitcoin Utopia, but could have been finding solutions for medical problems over at WCG or doing other more useful work. Again, just IMHO, but I didn't know better.

betreger
Volunteer tester
Help desk expert
Joined: 18 Oct 14
Posts: 1487
United States
Message 95093 - Posted: 15 Jan 2020, 5:11:07 UTC - in response to Message 95092.  

+1

Dr Who Fan
Joined: 10 May 07
Posts: 1443
United States
Message 95094 - Posted: 15 Jan 2020, 5:29:27 UTC

Anyone still sober and/or still awake?

It is almost 2130 in the evening at Berkeley and SETI is still down!

This Super Duper Massive Grand Mal Outrage is going to be one to remember for the recent history books.

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.