Version 1 (modified by davea, 16 years ago)

Adaptive replication

BOINC's current replication policy replicates every job, even when one of the hosts is known to be highly reliable. The overhead of replication is high: since each job runs at least twice, at least 50% of total CPU time is spent checking validity.

Adaptive replication is an optional policy that avoids replicating a job if it has been sent to a highly reliable host. The goal of this policy is to provide a target level of confidence with minimal overhead - perhaps only 5% or 10% of total CPU time.

Policy

BOINC maintains an estimate E(H) of host H's recent error rate. This is maintained as follows:

  • It is initialized to 0.1
  • It is multiplied by 0.95 when H reports a correct (replicated) result.
  • It is incremented by 0.05 when H reports an incorrect (replicated) result.

Thus, it takes a long time to earn a good reputation and a short time to lose it.
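
The update rule above can be sketched in a few lines of Python. This is only an illustration, not BOINC's code: the function names are hypothetical, and the target of 0.05 in the usage note is an assumed value chosen for the example.

```python
INITIAL_ERROR_RATE = 0.1   # starting value of E(H)
SUCCESS_FACTOR = 0.95      # multiplier applied after a correct replicated result
ERROR_INCREMENT = 0.05     # amount added after an incorrect replicated result


def update_error_rate(error_rate, result_was_correct):
    """Return the new E(H) after one replicated result from host H."""
    if result_was_correct:
        return error_rate * SUCCESS_FACTOR
    return error_rate + ERROR_INCREMENT


def correct_results_needed(target, start=INITIAL_ERROR_RATE):
    """Count consecutive correct results needed to bring E(H) below target."""
    count = 0
    error_rate = start
    while error_rate >= target:
        error_rate = update_error_rate(error_rate, True)
        count += 1
    return count
```

With these constants, a new host needs `correct_results_needed(0.05)` consecutive correct results to halve its initial estimate, while a single incorrect result adds back 0.05 at once.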

The adaptive replication policy is as follows.

  • Each job is initially marked as unreplicated.
  • On each request, the scheduler decides whether to trust the host as follows:
    • If E(H) > A, where A is a constant error-rate threshold, don't trust the host.
    • Otherwise, trust the host with probability 1 - E(H)/A.
  • If we decide to trust the host, preferentially send it unreplicated jobs.
  • Otherwise, preferentially send it replicated jobs. If we have to send it an unreplicated job, mark it as replicated and create new instances accordingly.
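
The trust decision above might look like the following sketch. The function names and the injectable `rng` parameter are assumptions made for illustration, and the threshold (A in the text) is passed in because this page does not fix its value.

```python
import random


def trust_probability(error_rate, threshold):
    """Probability of trusting a host with estimate E(H) = error_rate.

    threshold is A in the text and must be positive.
    """
    if error_rate > threshold:
        return 0.0
    return 1.0 - error_rate / threshold


def should_trust(error_rate, threshold, rng=random.random):
    """Make one randomized trust decision for a scheduler request.

    rng returns a float in [0, 1); it is a parameter so that the
    decision can be made deterministic in tests.
    """
    return rng() < trust_probability(error_rate, threshold)
```

A host with E(H) = 0 is always trusted, and the probability falls linearly to zero as E(H) approaches A, so even hosts below the threshold still have some of their results cross-checked.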

Implementation

Database:

  • Add a "target_nresults" field to the app table. The default is zero, meaning the app doesn't use adaptive replication.

Scheduler:

  • Decide whether to trust host as described above.
  • If we send an unreplicated job (i.e., wu.target_nresults=1 and app.target_nresults>1) to an untrusted host, set wu.target_nresults = app.target_nresults and flag the WU for transitioning.

Validator:

  • Don't update host.error_rate for unreplicated results (i.e., wu.target_nresults=1 and app.target_nresults>1).
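
The scheduler and validator rules above can be sketched together as follows. The classes are simplified stand-ins for the database records, and `needs_transition`, `on_send` and `on_validate` are hypothetical names, not BOINC identifiers.

```python
from dataclasses import dataclass


@dataclass
class App:
    target_nresults: int = 0  # 0: app doesn't use adaptive replication


@dataclass
class Workunit:
    target_nresults: int = 1
    needs_transition: bool = False  # hypothetical flag for the transitioner


@dataclass
class Host:
    error_rate: float = 0.1


def is_unreplicated(wu, app):
    """True if the job is still running as a single, unreplicated instance."""
    return wu.target_nresults == 1 and app.target_nresults > 1


def on_send(wu, app, host_trusted):
    """Scheduler: replicate an unreplicated job sent to an untrusted host."""
    if is_unreplicated(wu, app) and not host_trusted:
        wu.target_nresults = app.target_nresults
        wu.needs_transition = True


def on_validate(host, wu, app, result_was_correct):
    """Validator: update E(H) only when the result was actually replicated."""
    if is_unreplicated(wu, app):
        return  # nothing to compare against, so the estimate is unchanged
    if result_was_correct:
        host.error_rate *= 0.95
    else:
        host.error_rate += 0.05
```

Skipping the update for unreplicated results matters because such a result has not been compared against another instance, so it says nothing about whether the host was correct.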