Message boards : Server programs : Aborting Task elapsed time exceeded
Author | Message |
---|---|
Send message Joined: 23 Oct 17 Posts: 17 |
After some jobs, clients keep giving the error "aborting task elapsed time exceeded". I have set rsc_fpops_bound high enough, and I am thinking of setting <no_delay_bound> in config.xml. Will this help? What are the possible cases in which the client gives the error "aborting task elapsed time exceeded"? It gives this error immediately, as soon as it starts computation. |
Send message Joined: 5 Oct 06 Posts: 5130 |
Client and server clocks not properly synchronised? Look on the client machine(s) what is actually being received and stored in client_state.xml for <flops> and <rsc_fpops_bound>, and work out for yourself whether it makes sense. |
Send message Joined: 23 Oct 17 Posts: 17 |
Yes, it could be, thanks. But what will <no_delay_bound> do? |
Send message Joined: 5 Oct 06 Posts: 5130 |
"But what will <no_delay_bound> do?" It's not a documented value, so it will probably do nothing except log an error message. You need a limit, and the default value of a week should be enough to get started. |
Send message Joined: 23 Oct 17 Posts: 17 |
Thanks a lot. I think the same, that it is a time issue. I tried various deadlines. Is the deadline set by <delay_bound> in the workunit input template? I tried different values but the problem still persists. Everything else is working fine except this time limit. Is there any way to set the time limit differently? |
Send message Joined: 23 Oct 17 Posts: 17 |
Yes. Initially I did not set any delay_bound value, but in my BOINC database it is set to 604800. For new workunits I increased the value, but it did not help. I have set <rsc_fpops_bound>12e12</rsc_fpops_bound> and <rsc_fpops_est>14e14</rsc_fpops_est>; I set the fpops bound loosely in order to debug. All the jobs were launched using the above configuration. Secondly, my server time is in UTC, but the clients are being tested in the Asia/Karachi time zone. However, the report deadline should be something like report_deadline = create_time + delay_bound. In previous replies Richard Haselgrove pointed out that it could be a clock problem between client and server. But this can be handled by giving the clients a week's deadline, as our jobs are very compute-intensive and there are plenty of them. I have noticed that for 1-2 hours jobs work fine, but after that every job is aborted. |
Send message Joined: 5 Oct 06 Posts: 5130 |
All task deadlines are set, held, and tested as absolute Unix time values in UTC. Your Karachi clients should also be set to UTC, but with a time zone correction so that most times - including those used by BOINC - are displayed in a format that corresponds to local clocks. None of that should affect the decisions about whether a task has, or has not, exceeded the allowed time. So, you've seen delay_bound of 604800 - that's 7*24*3600, so a week. No problem there. But <rsc_fpops_bound>12e12</rsc_fpops_bound> <rsc_fpops_est>14e14</rsc_fpops_est> is bonkers. You're saying that you will kill jobs ('bound') two orders of magnitude before you expect them to finish ('est'). rsc_fpops_est is really a crucial value. You really need to run some test jobs on a machine with known speed (typically ~1 Gigaflop for tasks run on CPUs, much higher for GPUs). Then set rsc_fpops_est to match. And rsc_fpops_bound - the failsafe that kills tasks if they get into an endless loop - should always be set higher than you expect the tasks to run - usually by a factor of 10, perhaps 100 if you're finding runtime hard to estimate. |
Send message Joined: 23 Oct 17 Posts: 17 |
Yes, it worked, thanks. The problem was in how I set rsc_fpops_est and rsc_fpops_bound. I increased them and it worked. Thanks! |
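
The arithmetic behind the fix can be sketched as follows. This is a rough illustration, not BOINC code: it assumes the client's elapsed time limit is approximately rsc_fpops_bound divided by the <flops> value stored in client_state.xml (which is what the advice to inspect those two values suggests), and the function names `elapsed_time_limit` and `estimate_fpops`, the 1e9 host speed, and the one-hour test runtime are hypothetical values chosen for the example.

```python
def elapsed_time_limit(rsc_fpops_bound: float, flops: float) -> float:
    """Approximate seconds a task may run before being aborted.

    Assumption: limit ~= rsc_fpops_bound / flops, where flops is the
    speed the client has recorded for the application.
    """
    if flops <= 0:
        raise ValueError("flops must be positive")
    return rsc_fpops_bound / flops


def estimate_fpops(test_runtime_s: float, host_flops: float,
                   safety_factor: float = 10.0) -> tuple[float, float]:
    """Derive (rsc_fpops_est, rsc_fpops_bound) from a timed test job.

    rsc_fpops_est is the measured runtime times the known host speed;
    rsc_fpops_bound is that estimate times a safety factor (10 by
    default, perhaps 100 when runtimes are hard to predict).
    """
    est = test_runtime_s * host_flops
    return est, est * safety_factor


# Values from the thread, on a hypothetical 1 GFLOPS host:
print(elapsed_time_limit(12e12, 1e9))   # 12000.0 s allowed by the bound
print(elapsed_time_limit(14e14, 1e9))   # 1400000.0 s expected by the estimate

# A hypothetical test job that took one hour on the same host:
est, bound = estimate_fpops(3600, 1e9)
print(est, bound)
```

With the values posted in the thread, the bound allows far less runtime than the estimate expects, which is why the bound has to sit above the estimate rather than below it.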
Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.