Thread 'Aborting Task elapsed time exceede'

Message boards : Server programs : Aborting Task elapsed time exceede

Saad
Joined: 23 Oct 17
Posts: 17
Message 87616 - Posted: 13 Aug 2018, 21:45:13 UTC

After some jobs, clients keep giving the error "aborting task elapsed time exceeded". I have set rsc_fpops_bound high enough, and I am thinking of setting <no_delay_bound> in config.xml. Will this help me? What are the possible cases in which the client gives the error "aborting task elapsed time exceeded"? As soon as it starts computation, it immediately gives this error.
ID: 87616
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5130
United Kingdom
Message 87620 - Posted: 13 Aug 2018, 22:03:25 UTC - in response to Message 87616.  

Client and server clocks not properly synchronised?

Look on the client machine(s) at what is actually being received and stored in client_state.xml for <flops> and <rsc_fpops_bound>, and work out for yourself whether it makes sense.
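As a rough way to do that check, here is a minimal Python sketch. It assumes the client's elapsed-time limit is approximately rsc_fpops_bound divided by <flops>; that relationship, the function name, and the numbers are illustrative assumptions, not values taken from this project.

```python
# Sketch only: assumes the elapsed-time limit is roughly
# rsc_fpops_bound / flops, using the values found in client_state.xml.

def max_elapsed_seconds(rsc_fpops_bound: float, flops: float) -> float:
    """Approximate elapsed-time limit implied by the two values."""
    if flops <= 0:
        raise ValueError("flops must be positive")
    return rsc_fpops_bound / flops


if __name__ == "__main__":
    # Hypothetical values for illustration; substitute the ones from
    # your own client_state.xml.
    limit = max_elapsed_seconds(rsc_fpops_bound=1e13, flops=2e9)
    print(f"limit: {limit:.0f} s ({limit / 3600:.2f} h)")
```

If the limit that comes out is shorter than a normal run of the job, the task would be expected to hit "elapsed time exceeded".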
ID: 87620
Saad
Joined: 23 Oct 17
Posts: 17
Message 87621 - Posted: 13 Aug 2018, 23:55:19 UTC - in response to Message 87620.  

Yes, it could be, thanks. But what will <no_delay_bound> do?
ID: 87621
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5130
United Kingdom
Message 87622 - Posted: 14 Aug 2018, 7:06:45 UTC - in response to Message 87621.  

But what will <no_delay_bound> do?
It's not a documented value, so probably nothing except log an error message.

You need a limit, and the default value of a week should be enough to get started.
ID: 87622
Saad
Joined: 23 Oct 17
Posts: 17
Message 87625 - Posted: 14 Aug 2018, 13:20:43 UTC - in response to Message 87622.  

Thanks a lot. I think the same: it is a time issue. I tried various deadlines. The deadline is set by "<delay_bound>" in the workunit input template, right? I tried different values, but the problem still persists. Everything else is working fine except this time limit. Is there any way to set the time limit differently?
ID: 87625
Saad
Joined: 23 Oct 17
Posts: 17
Message 87627 - Posted: 14 Aug 2018, 15:33:54 UTC - in response to Message 87626.  

Yes. Initially I did not set any delay_bound value, but in my BOINC db it is set to 604800. For new work units I increased the value, but it did not help.
<rsc_fpops_bound>12e12</rsc_fpops_bound>
<rsc_fpops_est>14e14</rsc_fpops_est>


I have also set the fpops bound loosely for debugging. All the jobs were launched using the above configuration. Secondly, my server time is in UTC, but the clients are being tested in the Asia/Karachi time zone. However, the report deadline should be something like report_deadline = create_time + delay_bound. In a previous reply, Richard Haselgrove pointed out that it could be a clock problem between client and server. But that can be tackled by giving the clients a week's deadline, as our jobs are very compute-intensive and there are plenty of them.

I have noticed that jobs work fine for 1-2 hours, but after that every job is aborted.
ID: 87627
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5130
United Kingdom
Message 87645 - Posted: 14 Aug 2018, 21:33:48 UTC

All task deadlines are set, held, and tested as absolute Unix time values in UTC. Your Karachi clients should also be set to UTC, but with a time zone correction so that most times - including those used by BOINC - are displayed in a format that corresponds to local clocks. None of that should affect the decisions about whether a task has, or has not, exceeded the allowed time.
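To illustrate why the time zone should not matter, here is a small Python sketch. The start time is made up, Karachi is modelled as a fixed UTC+05:00 offset, and the deadline formula is the simple "start + delay_bound" form discussed in this thread rather than anything taken from the BOINC source.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
KARACHI = timezone(timedelta(hours=5))  # assumed fixed UTC+05:00 offset

DELAY_BOUND = 7 * 24 * 3600  # 604800 seconds, i.e. one week


def deadline(start: datetime, delay_bound: int) -> float:
    """Deadline as an absolute Unix timestamp (seconds since the epoch)."""
    return start.timestamp() + delay_bound


if __name__ == "__main__":
    # Hypothetical start time, expressed once in UTC and once in local time.
    start_utc = datetime(2018, 8, 14, 12, 0, 0, tzinfo=UTC)
    start_local = start_utc.astimezone(KARACHI)

    print(start_utc.isoformat(), "->", deadline(start_utc, DELAY_BOUND))
    print(start_local.isoformat(), "->", deadline(start_local, DELAY_BOUND))
```

Both lines print the same Unix timestamp, because the two datetime objects describe the same instant; only the displayed wall-clock time differs.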

So, you've seen delay_bound of 604800 - that's 7*24*3600, so a week. No problem there.

But
<rsc_fpops_bound>12e12</rsc_fpops_bound>
<rsc_fpops_est>14e14</rsc_fpops_est>
is bonkers. You're saying that you will kill jobs ('bound') two orders of magnitude before you expect them to finish ('est').
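The mismatch is easy to see with a little arithmetic. The Python sketch below uses the two values quoted above; the 1 GFLOPS host speed is a hypothetical figure used only to turn operation counts into seconds.

```python
import math

# Values as posted in the thread.
RSC_FPOPS_BOUND = 12e12
RSC_FPOPS_EST = 14e14

# Hypothetical host speed, only used to convert fpops into seconds.
HOST_FLOPS = 1e9


def seconds_at(fpops: float, flops: float) -> float:
    """Time needed to perform `fpops` operations at `flops` per second."""
    return fpops / flops


if __name__ == "__main__":
    ratio = RSC_FPOPS_EST / RSC_FPOPS_BOUND
    print(f"estimate is {ratio:.0f}x the bound "
          f"(about {math.log10(ratio):.1f} orders of magnitude)")
    print(f"bound reached after {seconds_at(RSC_FPOPS_BOUND, HOST_FLOPS) / 3600:.1f} h")
    print(f"estimated runtime   {seconds_at(RSC_FPOPS_EST, HOST_FLOPS) / 86400:.1f} days")
```

A sane configuration has the ratio the other way round: the bound should be well above the estimate.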

rsc_fpops_est is really a crucial value. You really need to run some test jobs on a machine with known speed (typically ~1 Gigaflop for tasks run on CPUs, much higher for GPUs). Then set rsc_fpops_est to match.

And rsc_fpops_bound - the failsafe that kills tasks if they get into an endless loop - should always be set higher than you expect the tasks to run - usually by a factor of 10, perhaps 100 if you're finding runtime hard to estimate.
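As a sketch of that procedure in Python: measure a test job's runtime on a host of known speed, multiply to get rsc_fpops_est, then scale by a safety factor for rsc_fpops_bound. The function names, the one-hour runtime, and the 1 GFLOPS host speed are all hypothetical.

```python
def estimate_fpops(runtime_seconds: float, host_flops: float) -> float:
    """rsc_fpops_est from a measured runtime on a host of known speed."""
    if runtime_seconds <= 0 or host_flops <= 0:
        raise ValueError("runtime and host speed must be positive")
    return runtime_seconds * host_flops


def fpops_bound(fpops_est: float, safety_factor: float = 10.0) -> float:
    """rsc_fpops_bound as a multiple of the estimate (10x by default)."""
    if safety_factor <= 1:
        raise ValueError("the bound must be larger than the estimate")
    return fpops_est * safety_factor


if __name__ == "__main__":
    # Hypothetical test run: one hour on a ~1 GFLOPS CPU.
    est = estimate_fpops(runtime_seconds=3600, host_flops=1e9)
    bound = fpops_bound(est)  # pass safety_factor=100 if runtimes vary a lot
    print(f"<rsc_fpops_est>{est:g}</rsc_fpops_est>")
    print(f"<rsc_fpops_bound>{bound:g}</rsc_fpops_bound>")
```

The check inside fpops_bound rejects any factor that would put the bound at or below the estimate, which is exactly the mistake in the values posted above.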
ID: 87645
Saad
Joined: 23 Oct 17
Posts: 17
Message 87846 - Posted: 28 Aug 2018, 17:08:10 UTC - in response to Message 87645.  

Yes, it worked, thanks. The problem was in how I had set rsc_fpops_est and rsc_fpops_bound. I increased them and it worked. Thanks!
ID: 87846

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.