Thread 'Exiting with Data Left in Memory'

Message boards : BOINC Manager : Exiting with Data Left in Memory

Mike Sr.
Joined: 20 Dec 05
Posts: 10
United States
Message 2266 - Posted: 27 Dec 2005, 10:00:11 UTC

When BOINC is exited normally, i.e. File>Exit, is the
status of any suspended project work that was left in
memory lost? How about running projects? Are they
checkpointed, or is any crunching since the last
project-initiated checkpoint wasted?

Regards,
Mike


OffBeatMammal
Joined: 4 Dec 05
Posts: 35
United States
Message 2268 - Posted: 27 Dec 2005, 10:40:43 UTC - in response to Message 2266.  

As a small adjunct to this: when a machine goes into standby or hibernate, are the running tasks checkpointed, just in case they don't restart for whatever reason?

The wiki answer here seems to indicate that only the last data written at a checkpoint interval is saved:
"If the BOINC Client Software is halted for what ever reason, processing starts with the most recent Checkpoint"

So I'm guessing that the client doesn't do a full tidy-up on a File|Exit (or, for safety, when Windows alerts it to a standby or hibernate), but I may just be missing something about those conditions, as they're more managed than a forced abort etc.
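
To make the quoted wiki behaviour concrete, here is a minimal sketch of checkpoint-and-resume in Python. It is not BOINC code: `crunch`, `save_checkpoint`, `load_checkpoint` and the JSON state file are all hypothetical names made up for illustration, and `stop_at` simply simulates the client exiting between checkpoints.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically swap it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def crunch(path, n_steps, checkpoint_every, stop_at=None):
    """Run from the last checkpoint; stop_at simulates the client exiting."""
    state = load_checkpoint(path)
    start = state["step"]
    for step in range(start, n_steps):
        if stop_at is not None and step == stop_at:
            # Exit without saving: work held only in memory is lost.
            return start, step
        state["total"] += step  # stand-in for real science work
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return start, n_steps


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
    crunch(path, n_steps=100, checkpoint_every=10, stop_at=25)
    start, _ = crunch(path, n_steps=100, checkpoint_every=10)
    print("resumed at step", start)  # 20: steps 20 to 24 have to be redone
```

In this sketch the second run picks up at step 20, the last checkpoint on disk, so whatever was done after that point is repeated.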
Random Thoughts
Paul D. Buck

Joined: 29 Aug 05
Posts: 225
Message 2274 - Posted: 27 Dec 2005, 14:06:24 UTC

Nope, you got it in one.

If the application checkpoints every minute, you lose, on average, 30 seconds of work. Applications with longer intervals lose proportionally more, and those with no checkpointing lose all the work done up to that point.
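
The "half the interval" figure can be checked with a rough sketch, assuming the client is stopped at a uniformly random moment and checkpoints happen at a fixed interval. The function names are hypothetical and the simulation is only a sanity check, not a model of any real project.

```python
import random


def expected_loss(interval_s):
    """Average work lost when stopping at a random point in the interval."""
    return interval_s / 2.0


def simulated_loss(interval_s, trials=100_000, seed=0):
    """Estimate the same figure by stopping at random times."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        stop_time = rng.uniform(0, 1000 * interval_s)
        # Work done since the most recent checkpoint is what gets lost.
        total += stop_time % interval_s
    return total / trials


if __name__ == "__main__":
    print(expected_loss(60))   # 1 minute interval
    print(simulated_loss(60))  # should land close to the value above
    print(expected_loss(900))  # 15 minute interval
```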

Most projects have reasonable checkpoints, but there are gotchas, like the one where I got a "hung" Rosetta@Home work unit that "ate" 25 hours of computer time and got nowhere... being worked on by the project ...

Or the Predictor@Home work units that threw up a FORTRAN error dialog and halted the CPU till Ok was pressed ... as far as I know they are not seriously looking into this problem.

The BOINC Client cannot do what the Science Application does not allow it to do. So, the responsibility is back on the projects to ensure "safe" computing. Most do a pretty good job, but, some checkpoints are so large that they are not practical to do very often, like CPDN's, I forget what their interval is ... (15 min?), but, this is one of the reasons I run 24/7 :)
OffBeatMammal
Joined: 4 Dec 05
Posts: 35
United States
Message 2291 - Posted: 28 Dec 2005, 3:48:13 UTC - in response to Message 2274.  

The BOINC Client cannot do what the Science Application does not allow it to do. So, the responsibility is back on the projects to ensure "safe" computing. Most do a pretty good job, but, some checkpoints are so large that they are not practical to do very often, like CPDN's, I forget what their interval is ... (15 min?), but, this is one of the reasons I run 24/7 :)

I guess 15 mins is okay, as the loss is about 7.5 mins on average.... but it's frustrating that there may be science projects that don't appreciate the wasted work!
I imagine there is a balance in terms of efficiency between checkpointing too often and not often enough... the latter wastes time and effort while the former probably affects throughput on a stable machine.... maybe a 'smart' checkpoint system that adapts to how a machine is being used...
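
One way to put numbers on that balance is a simple first-order model (often attributed to Young): each checkpoint costs a fixed time, and each interruption loses half an interval on average. This is only a sketch under those assumptions; the function names and the 2 second / 1 hour figures are made up for illustration and do not describe any particular project.

```python
import math


def overhead_fraction(interval_s, checkpoint_cost_s, mean_time_between_stops_s):
    """Fraction of time spent writing checkpoints plus redoing lost work."""
    writing = checkpoint_cost_s / interval_s
    redoing = interval_s / (2.0 * mean_time_between_stops_s)
    return writing + redoing


def best_interval(checkpoint_cost_s, mean_time_between_stops_s):
    """Interval that minimises overhead_fraction in this model."""
    return math.sqrt(2.0 * checkpoint_cost_s * mean_time_between_stops_s)


if __name__ == "__main__":
    cost, between_stops = 2.0, 3600.0  # hypothetical: 2 s per checkpoint, stopped hourly
    best = best_interval(cost, between_stops)
    for interval in (best / 2, best, best * 2):
        print(round(interval), overhead_fraction(interval, cost, between_stops))
```

A machine that is interrupted often would get a shorter interval from this model and a 24/7 machine a longer one, which is roughly the "smart" adaptive behaviour suggested above.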

Random Thoughts
Bill Michael

Joined: 30 Aug 05
Posts: 297
Message 2292 - Posted: 28 Dec 2005, 5:26:02 UTC

Checkpointing is easy for some projects, and much more difficult for others, just because of the nature of the work being done. For example, SETI checkpoints "very often", because there you're running a fairly small loop of activities. Rosetta checkpoints very "rarely", because each WU is made up of only 10 "blocks", that are each started with a random seed. On a fast computer, the 10 checkpoints may each be 10 minutes apart; but on a very slow system, they may be over an hour apart. Short of writing the entire contents of memory to the file, there just isn't a "quick" way to checkpoint any more often than they do.

So some projects are much better suited to "intermittent use" machines than others. Rosetta _really_ does best when it runs 24/7, SETI is fine with ten minutes run-time here and there.

Paul D. Buck

Joined: 29 Aug 05
Posts: 225
Message 2295 - Posted: 28 Dec 2005, 6:24:06 UTC

One more point, there are programs where the simulation *HAS* to be run from end to end and there is no possibility that they can be checkpointed. I am not sure that is the case with Rosetta@Home yet, but, for example, CPDN can be very sensitive to stopping and restarting the models.

There was some work (Folding@Home?, which is in Alpha test) that did not or does not checkpoint at all, and the work runs for a day or so ...

So, yes, it can be inattention on the project's part, or, just not done yet, not practical to do, etc.

Like many problems in system design there may not be a simple answer, contrary to some. And I have not "met" a project staffer that wants to have any more waste than is unavoidable.

For all these reasons, *I* recommend

1) Don't run projects in testing
2) If the project is doing something you don't like, vote with your feet.


Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.