Message boards : BOINC client : Proposed Fix for *RESET* Problem
Author | Message |
---|---|
Send message Joined: 30 Aug 05 Posts: 4 |
Here's what happened: I was browsing something or other (not SETI related) and the browser got locked up, (!@#& Microsoft!), I had to reboot the non-responsive system. I think BOINC wasn’t able to close its files properly before the reboot, which is strange because it was a task manager shutdown. Here’s what the log showed:
StartServiceCtrlDispatcher being called.
This may take several seconds. Please wait.
2005-08-28 09:27:45 [---] Starting BOINC client version 4.45 for windows_intelx86
2005-08-28 09:27:45 [---] Executing as a daemon
2005-08-28 09:27:45 [---] Data directory: C:\Program Files\BOINC
2005-08-28 09:27:45 [---] BOINC is running as a service and as a non-system user.
2005-08-28 09:27:45 [---] No application graphics will be available.
2005-08-28 09:27:45 [---] Can't parse file info in state file
2005-08-28 09:27:45 [---] State file has different major version (0.00); resetting projects
2005-08-28 09:27:45 [SETI@home] Resetting project
2005-08-28 09:27:45 [---] request_reschedule_cpus: exit_tasks
2005-08-28 09:27:45 [SETI@home] PERS_FILE_XFER_SET::remove(): not found
2005-08-28 09:27:45 [SETI@home] PERS_FILE_XFER_SET::remove(): not found
2005-08-28 09:27:45 [SETI@home] PERS_FILE_XFER_SET::remove(): not found
2005-08-28 09:27:45 [SETI@home] PERS_FILE_XFER_SET::remove(): not found
2005-08-28 09:27:45 [SETI@home] PERS_FILE_XFER_SET::remove(): not found
2005-08-28 09:27:45 [SETI@home] PERS_FILE_XFER_SET::remove(): not found
2005-08-28 09:27:45 [SETI@home] PERS_FILE_XFER_SET::remove(): not found
...etc...
The work of a whole week (on a P4 HT going at 3.00 GHz, 2 CPUs), gone, irrevocably, in an instant! (Sob!) Boo hoo hooooooo! I'm not competing with anyone; it's just the waste of energy (effort) that I find appalling. Some of us are really interested in the science of it, and try our best to be as efficient as possible. This problem can easily be eliminated by coding into the software a backup procedure for whenever the software is going to reset. 
How can losing properly processed work units be good for the project? Isn't the idea of distributed computing all about doing more work in less time? It hurts when our good-hearted efforts vanish for *no good reason*. HEADS-UP, BOINC-GUYS! PROPOSAL TO ELIMINATE THIS DISASTROUS POSSIBILITY: It should not be too hard to program an extra feature into the configuration that backs-up all the critical files in a backup directory, (say, "C:/.../BOINC/BACKUP"), at a user configurable interval, (every 10 minutes, every 1 hour, whatever!), a "RESET" can then just return to that backup-state and at least all would not be lost. Two redundant backup file sets would eliminate the problem altogether since any reboots that happen during the backup-write process would leave the second set intact. If, for example, your interval is set to "every one hour" then the most you'd lose in the worst-case-scenario is two hours of work!!! (Oh, hey, I can dig that!) I suppose this is one update that would be greatly appreciated, I think, by all but the most hard-core, die-hard misanthropes out there. Wouldn't that help to reduce the "BOINC-flame-queues" considerably? I betcha that's bogging down the servers some. :D Mafú - "An opinion is only valid while it does not contradict a fact." |
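The two-redundant-backup-sets idea from the post above could be sketched roughly like this. It is only an illustration, not BOINC code: the function name `backup_to_alternating_slot`, the slot names `set0`/`set1` and the `generation` counter are all made up for the example, and it assumes a C++17 compiler with `<filesystem>`.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper (not part of BOINC): copies `src` into one of two
// alternating backup slots, "set0" or "set1", under `backup_dir`.
// `generation` is a counter the caller increments on every backup run.
fs::path backup_to_alternating_slot(const fs::path& src,
                                    const fs::path& backup_dir,
                                    unsigned generation) {
    assert(src.has_filename());
    const fs::path slot = backup_dir / ("set" + std::to_string(generation % 2));
    fs::create_directories(slot);

    const fs::path dest = slot / src.filename();
    const fs::path tmp = slot / (src.filename().string() + ".tmp");

    // Copy under a temporary name first and rename afterwards, so a reboot
    // in the middle of the copy does not damage the file already in this slot.
    fs::copy_file(src, tmp, fs::copy_options::overwrite_existing);
    fs::rename(tmp, dest);
    return dest;
}
```

Because consecutive runs write to different slots, a crash during one backup run leaves the other slot as it was after the previous run, which is the property the proposal relies on.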
Send message Joined: 29 Aug 05 Posts: 117 |
... This problem can easily be eliminated by coding into the software a backup procedure for whenever the software is going to reset. ... Actually the main culprit in your description looks like the client_state.xml file. It likely became corrupt from the hard reset. There is an inbuilt mechanism to backup this file each time BOINC updates it. Have a look for client_state_prev.xml in your BOINC folder. We may need to take a look at the conditions surrounding how and when the backup file is used when the primary file has become corrupt. It may be a little hard to repro a corrupt one though. Feel free to dig into the source and make any suggestions if you find a bug there. Kind Regards ralic's law of forums: Irrespective of any prior research done, you will find the solution to your question shortly after posting it to a public Internet forum, resulting in readers concluding that you have done no research on the matter whatsoever. |
Send message Joined: 29 Aug 05 Posts: 15563 |
There is an inbuilt mechanism to backup this file each time BOINC updates it. Have a look for client_state_prev.xml in your BOINC folder. That's what I said on the Seti Q&A forums. But the problem with the backup file is, if you are running Boinc as a service it makes a new client_state.xml file from scratch. At least starting up Windows and going into your Boinc folder without starting Boinc gives you a chance to copy the prev file back over the original, so you only lose the time since it was last backed up. But when you run it as a service, the Boinc service starts up before the Windows login comes on. So the new state file has already been made, and if it's been doing some recalculating, it has also overwritten the prev file. |
Send message Joined: 29 Aug 05 Posts: 117 |
But the problem with the backup file is, if you are running Boinc as a service it makes a new client_state.xml file from scratch. If this is truly the behaviour, then it kinda makes the point of making a backup file useless. Need to dig in the code, but I would hope to find some pseudologic like:
if (there is a client_state.xml) and (it can't be properly parsed) then
    if (there is a client_state_prev.xml file)
        copy client_state_prev.xml to client_state.xml
    else
        create a default client_state.xml file
    fi
fi
ralic's law of forums: Irrespective of any prior research done, you will find the solution to your question shortly after posting it to a public Internet forum, resulting in readers concluding that you have done no research on the matter whatsoever. |
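For illustration, the pseudologic above might look something like this in C++17. This is a sketch under assumptions, not the client's actual startup code: `looks_complete` is a stand-in for "can be properly parsed" that merely checks for an opening and a closing `<client_state>` tag, and the `Recovery` enum and `recover_state` function are hypothetical names. It also checks the prev file before copying it, which the pseudologic does not ask for.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>

namespace fs = std::filesystem;

// Stand-in for a real parse (an assumption, not BOINC's parser): the file
// counts as usable if it has an opening and a closing <client_state> tag.
bool looks_complete(const fs::path& p) {
    std::ifstream in(p, std::ios::binary);
    if (!in) return false;
    const std::string text((std::istreambuf_iterator<char>(in)),
                           std::istreambuf_iterator<char>());
    const auto open = text.find("<client_state>");
    const auto close = text.rfind("</client_state>");
    return open != std::string::npos && close != std::string::npos && open < close;
}

enum class Recovery { NoActionNeeded, RestoredFromPrev, NeedsDefault };

// Follows the pseudologic: only act when the main file exists but is bad;
// then restore from the prev file if possible, else ask for a default file.
Recovery recover_state(const fs::path& main_file, const fs::path& prev_file) {
    assert(main_file != prev_file);
    if (!fs::exists(main_file) || looks_complete(main_file)) {
        return Recovery::NoActionNeeded;
    }
    if (fs::exists(prev_file) && looks_complete(prev_file)) {
        fs::copy_file(prev_file, main_file, fs::copy_options::overwrite_existing);
        return Recovery::RestoredFromPrev;
    }
    return Recovery::NeedsDefault;  // caller creates a default client_state.xml
}
```

The important detail is ordering: a check like this would have to run before anything rewrites the prev file, otherwise there is nothing good left to restore from.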
Send message Joined: 29 Aug 05 Posts: 15563 |
Well no, Chris. After thinking about it, the problem is that no installation, of whatever type, knows whether the client_state.xml file is corrupted until you start Boinc. Service installs will be hit hardest by this, as the service will read a possibly corrupt file and continue with... what? The whole file cannot be corrupt, since most people only lose work units. They do not have to re-attach to all their projects. So the backup file doesn't help much either. You don't know if the original file is bad until you start Boinc.... Maybe a redundancy backup file, written 59 minutes after startup and updated every hour thereafter, would help. On dial-up, that is. On ADSL/cable with a 24/7 online connection, how would one disable this? I am not a programmer. The last thing I programmed was a C64 sequential database program. (I still have it ;)) All I can say is what may be needed. (Unless Boinc checks the contents of both the state file and its backup on startup, to see if there are any differences.) |
Send message Joined: 30 Aug 05 Posts: 4 |
If this is truly the behaviour, then it kinda makes the point of making a backup file useless. Not really. But this is a contingency for a very specific kind of problem. I guess the easiest way to illustrate this is by comparing it to the "Last known good configuration" option in the Windows startup. The backup configuration is saved as soon as a successful logon is achieved. It is good if you know you had a problem before you logon again, but if you encounter the problem *after* you log on, it is useless. What we are trying to do here is come up with a backup scheme that can be useful regardless of the nature of the event that caused the problem, (i.e. a BOINC reset.) Need to dig in the code... (sigh...) I guess I'm gonna have to, I'm just into some serious (money-making) coding projects right now and dangerously close to a sleep-debt-bankruptcy. It's a good thing Puerto Rico has such great, strong coffee!!! Mafú - "An opinion is only valid while it does not contradict a fact." |
Send message Joined: 30 Aug 05 Posts: 4 |
The last thing I programmed was a C64 sequential database program. (I still have it ;)) Ohhhhhh.... I'm wounded! I wish I still had my old C64 codes! (ENVY) I hope you have since transferred that treasure to CD; last I heard, magnetic media has a half-life of five years. Live Long And Prosper, Dude! Mafú - "An opinion is only valid while it does not contradict a fact." |
Send message Joined: 29 Aug 05 Posts: 15563 |
I have not bought the cables and PCI card to be able to write my own code to CD through the C64 diskdrive. But I do still use my C64 at times. And my code can be found on the internet as well, someone converted all of that Public Domain Site's data to PC capable data. ;) http://www.binaryzone.co.uk/ |
Send message Joined: 29 Aug 05 Posts: 117 |
Ok, so I pulled out my pick and shovel... Service installs will be hit hardest on this as it will read a possible corrupt file and continue with what? Both service and non-service startups are affected equally. In both cases, once the core client starts, there's no going back. The way it's coded at the moment, if client_state.xml is corrupt you're pretty much done for, except for a single condition, which I'll describe in a bit. the whole file cannot be corrupt since most people only lose work units. They do not have to re-attach to all their projects. The reason clients don't need to re-attach is because project detail is more written to, than read from the state file. At startup, the project info is gathered by iterating all the files in the BOINC dir and parsing out any that basically match account_*.xml. (actually a known neat hack to save bandwidth for dialup users, but I digress...) So the backup file doesn't help much either. You don't know if the original file is bad until you start Boinc.... True, which is why I called it useless. :) My feeling is that BOINC should try and use the previous file, if the current file is bad, and it does.... but only if it can't find the <client_state> tag at the top of the file. If the top half of your state file is corrupt, you get a second chance, however if it's the bottom half, you're done for. It does this on the fly, so only people with corruption lower down will actually notice. They get to see the following:
The first line basically tells you which subsection of the state file the corruption starts in because it doesn't have a closing tag. Now what to do, hmmm? While the function parse_state_file() returns a value that could be used to check for success or failure, it isn't. Immediately after parsing the state file, there is a call to write_state_file. This is the function that overwrites your backup (prev) file. The changes probably wouldn't be that intense, something along the lines of:
retval = parse_state_file(STATE_FILE_NAME);
if (retval) {
    // print a warning that the state file is corrupt and we're going to try the previous one
    retval = parse_state_file(STATE_FILE_PREV);
    if (retval) {
        // At this point we just drop through which will reset the projects.
        // Remember, there is no "prev" file on initial BOINC installation.
    }
}
This will require some changes to the parse_state_file() function to accept the file name as a parameter and also to use the passed file name within the function. There's probably more to it than this, but it's a starting point. ralic's law of forums: Irrespective of any prior research done, you will find the solution to your question shortly after posting it to a public Internet forum, resulting in readers concluding that you have done no research on the matter whatsoever. |
Send message Joined: 29 Aug 05 Posts: 117 |
What we are trying to do here is come up with a backup scheme that can be useful regardless of the nature of the event that caused the problem, (i.e. a BOINC reset.) I guess that the ideal solution would be to have a full backup of the BOINC folder kept somewhere, but for projects like CPDN it would be impractical. Active CPDN models can occupy gigabytes of storage in the projects folder. Just a simple copy/paste on one of these folders can take 10 or more minutes. If done on the fly, your BOINC client state could have changed umpteen times between the time the backup started and the time it finished. Alternatively, suspend->backup->resume, but die-hard crunchers (and there are many of those ;-) wouldn't be forgiving if we suspended BOINC for 15 minutes out of every hour, so we could make a backup. ralic's law of forums: Irrespective of any prior research done, you will find the solution to your question shortly after posting it to a public Internet forum, resulting in readers concluding that you have done no research on the matter whatsoever. |
Send message Joined: 29 Aug 05 Posts: 15563 |
Chris, does Boinc have a zip function included, or is that only included in the CPDN science application? I ask since the incoming data files and the outgoing results files of CPDN are zipped. If it were included in Boinc and you were to give the option of backing up, couldn't you use the zip function at its strongest compression to make the whole backup smaller? You'd still be looking at the time it took zipping everything. But giving that as an option, as well as an option to turn it off... in my opinion, if one of the die-hard crunchers were to complain about having lost all of his work units due to both his state files being corrupt, all you'd have to ask him is "did you turn off the backup function?" to shut him down. ;) Okay, jokes aside... A full backup of the complete Boinc folder is undoable. But a backup of the XML files? |
Send message Joined: 29 Aug 05 Posts: 117 |
Okay, jokes aside... A full backup of the complete Boinc folder is undoable. But a backup of the XML files? Jord, Well, it's not so much undoable as adding overhead and complexity. A couple of things to take into account are, a.) One of BOINC's long term goals is using spare HDD space. Less of that would be available if used for backups. Not much, but some. b.) WU redundancy means that losing work becomes an inconvenience for the user, but the system still works by getting results from elsewhere. There's nothing wrong with looking into it though. Backing up the XML files is basically where we are now. Let's break that down a bit: The scheduler files are generated on the fly, so unimportant. Pref and master files are retrieved from the server, so unimportant. Which basically leaves the client_state and account files. Account files are pretty static once they've been generated (attached to project), unless the project wants daily gui buttons or some such thing. The client_state file is probably the most important, and it's quite dynamic. It is getting backed up, but the recovery process leaves a bit to be desired. I tried playing with the code and implemented the functionality I mentioned in my previous post with mixed results. A couple of the XML parsing routines look like they need some changes. They lack some error scoping definitions and a final 'else' statement, and they return 0 when they should really return ERR_XML_PARSE (HOST_INFO::parse(MIOFILE& in), for example). I still need to send this info through to Dr. Anderson, so that he can evaluate it and make the necessary changes. I didn't search the entire code base for locations where these functions are called, so they may be purposely returning 0 on failure, although that's unlikely. These changes would at least allow the return value from parse_state_file() to be examined and acted on, so I played some more. 
The main problem that I faced was when parsing the state file on the second attempt, because the first attempt fills up project and result stacks. This results in duplication on the second run, so these structures need to be cleared before the second run of the state file. That's where I ran out of time and had to abandon my experiments. ralic's law of forums: Irrespective of any prior research done, you will find the solution to your question shortly after posting it to a public Internet forum, resulting in readers concluding that you have done no research on the matter whatsoever. |
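One way around the duplication problem described above is to parse into a scratch object and only commit it when the parse succeeds. The sketch below shows the idea on a toy format, not on BOINC's XML: `ClientState`, `parse_state` and `load_with_fallback` are hypothetical names, and each input line is assumed to be either `project NAME` or `result NAME`.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for the client's project and result lists.
struct ClientState {
    std::vector<std::string> projects;
    std::vector<std::string> results;
};

// Toy parser: appends entries to `out` as it goes, so a failed parse leaves
// partial data behind. Returns 0 on success, nonzero on a bad line.
int parse_state(std::istream& in, ClientState& out) {
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        std::istringstream fields(line);
        std::string kind, name;
        fields >> kind >> name;
        if (kind == "project" && !name.empty()) {
            out.projects.push_back(name);
        } else if (kind == "result" && !name.empty()) {
            out.results.push_back(name);
        } else {
            return -1;
        }
    }
    return 0;
}

// Parse into a scratch object and commit only on success, so leftovers from
// a failed first attempt never mix with the second attempt.
int load_with_fallback(std::istream& main_in, std::istream& prev_in,
                       ClientState& state) {
    ClientState scratch;
    if (parse_state(main_in, scratch) == 0) {
        state = std::move(scratch);
        return 0;
    }
    scratch = ClientState{};  // throw away whatever the failed parse added
    if (parse_state(prev_in, scratch) == 0) {
        state = std::move(scratch);
        return 0;
    }
    return -1;  // both unusable: caller falls through to a project reset
}
```

Whether the real client could parse into a temporary object this cheaply is an open question; the alternative is an explicit clear of the project and result structures between attempts.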
Send message Joined: 29 Aug 05 Posts: 15563 |
Okay, I see where you're going. Yet think about it if the main XML file of need was redundantly backed up extra: Boinc would start up; read the corrupt client_state.xml file; (the work scheduler LTD algorithm would most probably already write to the backup file prev;) Boinc finds CS to be corrupt; it finds the prev file has already been written to. Why not make Boinc shut down here, leave the human to put back the backup file? You can let Boinc make this backup file (one extra 90KB can't be that bad) on exit. Since Boinc's projects will only go this bad when they read their data from the SC.xml file, an extra backup on exiting Boinc should be possible. All we need to determine is if the SC.xml file goes bad prior to a Boinc exit. Well, we... you developers. ;) |
Send message Joined: 29 Aug 05 Posts: 117 |
Okay, I see where you're going. Yet think about it if the main XML file of need was redundantly backed up extra: Currently at startup the prev file is only written to after the main file has been parsed. While the parse routine returns a value, it isn't checked, so even after a bad parse the program logic continues on to overwrite the prev file. This is flaw number 1. The decision to switch to the prev file occurs within the parse routine itself, and even then it's only used if a specific condition is true (missing <client_state> tag at start of file). This is not ideal, since in my experience the majority of corruption occurs part way into the state file, although this is difficult to confirm. Why not make Boinc shut down here, leave the human to put back the backup file? I thought the same thing, but we can't just die quietly and we also can't burn CPU cycles waiting for user input... (Think service installs). I also don't think that it's terribly user-friendly to just quit, unless we've at least tried to correct the problem programmatically first. I'll go out on a limb here, but I'd be willing to guess that most users just don't want to be bothered with this kind of effort. (There would be some that could enjoy such an option, but I reckon it would be few.) Maybe feasible if both the main and prev state files turn out to be damaged, but that's got to be reasonably unlikely. You can let Boinc make this backup file (one extra 90KB can't be that bad) on exit. Since Boinc's projects will only go this bad when they read their data from the SC.xml file, an extra backup on exiting Boinc should be possible. Ok, so you're suggesting a 3rd (for human use only) backup CS file, but the question that immediately springs to mind is: "Where does it come from?". BOINC already writes the CS file on exit. Do we make another copy of this? Do we copy the prev CS file? Do we write one every 10 seconds? 
Consider that the state file contains "current" information, including which wu's are "in progress", which have finished, which ones have been uploaded/downloaded. Go back too far and you'll end up with a CS file that is inconsistent with server data. (e.g. a wu that is possibly finished/uploaded in the "current" file but in progress/finished in the backup file.) It gets complicated. Let's face it, the overwhelming majority of these corruption cases occur due to hard machine resets, or lockups, and in those cases BOINC doesn't get a chance to exit cleanly, and so wouldn't get to write its 3rd backup file either. All we need to determine is if the SC.xml file goes bad prior to a Boinc exit. Well, we... you developers. ;) I can't really think of any cases of that nature that I've seen or heard reported. I would guess that a bad disk controller or corrupt HDD could cause such a case, but then I'd reckon the output file generated from the wu would be equally corrupt, wouldn't it? In the scheme of things, I think the current implementation of maintaining a backup (prev) file is ok, but flawed, and the flaws need to be addressed. If I had to change anything, I'd suggest an md5 sum of both the CS files be maintained in a separate file(s). It doesn't take long, and provided it's not incorporated into the file itself (like the xml signatures) it would still allow us "hackers" the ability to change the file on the fly when necessary, which is sometimes quite convenient. Having an md5 sum of the file would also allow BOINC to immediately detect corruption, without having to parse the file first, thereby eliminating the duplication I mentioned in my previous post. Just more brainstorming. ralic's law of forums: Irrespective of any prior research done, you will find the solution to your question shortly after posting it to a public Internet forum, resulting in readers concluding that you have done no research on the matter whatsoever. |
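The separate-checksum-file idea could be sketched as below. Note the swap: the C++ standard library has no MD5, so this example uses a 64-bit FNV-1a hash purely as a stand-in to keep it self-contained; a real implementation would presumably use MD5 as suggested. The `.sum` sidecar suffix and the function names are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>

namespace fs = std::filesystem;

// 64-bit FNV-1a, used here only as a stand-in for MD5.
std::uint64_t fnv1a64(const std::string& data) {
    std::uint64_t h = 0xcbf29ce484222325ull;  // FNV offset basis
    for (unsigned char c : data) {
        h ^= c;
        h *= 0x100000001b3ull;                // FNV prime
    }
    return h;
}

std::string read_all(const fs::path& p) {
    std::ifstream in(p, std::ios::binary);
    return std::string((std::istreambuf_iterator<char>(in)),
                       std::istreambuf_iterator<char>());
}

// Hypothetical sidecar name: "client_state.xml" -> "client_state.xml.sum".
fs::path sidecar_for(const fs::path& state_file) {
    fs::path s = state_file;
    s += ".sum";
    return s;
}

// Call after every successful write of the state file.
void write_checksum(const fs::path& state_file) {
    std::ofstream out(sidecar_for(state_file));
    out << fnv1a64(read_all(state_file));
}

// Detects a damaged state file without parsing it. A missing or unreadable
// sidecar is reported as a mismatch in this sketch.
bool checksum_matches(const fs::path& state_file) {
    std::ifstream in(sidecar_for(state_file));
    std::uint64_t stored = 0;
    if (!(in >> stored)) return false;
    return stored == fnv1a64(read_all(state_file));
}
```

How to treat a missing sidecar is a policy choice. Treating it as "unknown, go ahead and parse" rather than as a mismatch would keep hand-editing of the state file convenient, since removing the sidecar would be enough.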
Send message Joined: 29 Aug 05 Posts: 15563 |
Okay, I see where you're going. Yet think about it if the main XML file of need was redundantly backed up extra: Is it possible to let Boinc scan the CS.xml file when it starts up and compare it against either the prev file or, still going with my redundancy backup, the one it made upon shutting down? Or would that take up too much time in startup terms? Why not make Boinc shut down here, leave the human to put back the backup file? Yes, sorry, I meant with a nice helpful message on screen why it will be closing down, with an OK button even. I am posting all this between heavy thunderstorms. The sky around here at this moment looks like a nuclear attack. ;) Plus I was thinking in the single/multiple user installs mostly. I forgot about the service install. But since you write (or dabble in ;)) the stuff, is it possible to close a Boinc service without user input of net stop? Else I'd be turning back to using the redundancy backup again, let Boinc 'reboot' itself if it comes up with a corrupt original CS.xml file and in the extreme the prev file. Let's face it, the overwhelming majority of these corruption cases occur due to hard machine resets, or lockups, and in those cases BOINC doesn't get a chance to exit cleanly, and so wouldn't get to write its 3rd backup file either. Maybe we're thinking too difficult. Follow me here: I start up Boinc; All the time my CS.xml and prev.xml file are being written to; My computer crashes after an up time of 3 days; I do a hardboot; - at that time my CS.xml file gets corrupt. - my prev.xml file may be corrupt as well. What happens when I start up Boinc, is that it now throws out all of the units it knew. It can't read them from the CS or prev files. But if I had that 3rd redundancy file written elsewhere, like in the projects folder, Boinc could check that file for which files were last being worked on, are they on the machine, if yes, what status do they have, etc. This file would only be written to 1 time per 24 hours. 
And upon normal Boinc shutdown, for people who don't crunch 24/7. Would you lose data then? Maybe 24 hours? Files that aren't on the drive anymore, must've been uploaded/reported. Else it's too bad. What needs to be fixed in the new Boinc is the auto report, as it only happens every 24 hours afaik. All we need to determine is if the SC.xml file goes bad prior to a Boinc exit. Well, we... you developers. ;) If you had a bad disk controller or bad HDD, you'd lose everything, so that's not something you can anticipate. Most of the times I lost an IDE controller, plugged the harddrive onto an external one, I still had to reformat the harddrive as either the MBR or bootsector were gone, or both! Just more brainstorming. Aren't you one of the people I see on the developer's email list, while you work for LHC, Einstein or CPDN? ;) |
Send message Joined: 30 Aug 05 Posts: 4 |
Hi guys! I was just thinking... Assumption: The information on completed work units is static once written to the client_state file. However many other things might get updated whenever the client_state file is written to, each completed work unit's information remains the same. If I’m right about that, (and I may not be), then: Why not just keep a separate log of the WU stats? Whenever a WU is completed, its information is updated in the CS file and also in the "WU-log" file. The WU-Log file would NOT be overwritten, just appended. That way if the evil freeze/crash event occurs, only the last WU’s information would be incomplete/corrupted. Could not then the CS file be reconstructed from the information in the WU-Log file if the system-crash-reset event arises? Allow the reset to happen as it normally would, then read the WU-Log file to retrieve the completed WU information and incorporate that to the Client_State file. At most, only the last WU’s work would be lost. I confess that I haven’t even looked at the code (I envy you Chris, wish I could join the search) and have no idea how feasible this idea may be so please forgive me if I’m way off course. Afterthought: Cleaning up the WU-Log file:
If results_upload was successful then
    If WU-Log_backup exists then
        Delete WU-Log
        Rename WU-Log_backup as WU-Log
    Endif
    Copy WU-Log to WU-Log_backup
    Delete all WUs from the WU-Log that are missing (results were uploaded)
    Delete backup
End if
You are more in-the-know about the specifics than I am so, what do you think? I’ve done some (manual) fooling around with the BOINC files and I suspect that BOINC needs more than just the information in the client_state file to pick-up-where-it-left-off. (Sigh!) I really wish I had time to pursue this properly. Mafú - "An opinion is only valid while it does not contradict a fact." |
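The append-only part of the WU-Log idea can be illustrated with a small sketch. Everything here is hypothetical: the one-name-per-line format, the file name and the functions `append_completed` and `read_completed` are invented for the example, and a real log would need to carry whatever per-WU information the client actually requires. The cleanup step from the afterthought is not covered.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Appends one record per completed WU. Existing lines are never rewritten,
// so a crash can damage at most the record being written at that moment.
void append_completed(const fs::path& log, const std::string& wu_name) {
    std::ofstream out(log, std::ios::app);
    out << wu_name << '\n';
}

// Reads the log back, ignoring a final line that lacks its trailing newline,
// on the assumption that it is a record cut short by a crash.
std::vector<std::string> read_completed(const fs::path& log) {
    std::ifstream in(log);
    std::vector<std::string> names;
    std::string line;
    while (std::getline(in, line)) {
        if (in.eof()) break;  // hit end of file before a newline: torn record
        names.push_back(line);
    }
    return names;
}
```

This only shows that a torn last record can be detected and skipped; whether the client_state file can really be rebuilt from such a log depends on the assumption stated in the post, that a completed WU's information never changes afterwards.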
Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.