Anything and Everything to do with (WCG) World Community Grid
| Author | Message |
|---|---|
|
Joined: 5 Nov 11 Posts: 8
|
Some of mine from yesterday will not upload, 24 hours so far. |
|
Joined: 27 Aug 22 Posts: 40 |
Same here. |
|
Joined: 25 May 09 Posts: 1393
|
While plenty of tasks are available the validation queue is getting longer and longer - someone needs to give it a prod/kick/enema to get it moving properly. |
|
Joined: 19 Dec 05 Posts: 118
|
Not only that ...
Wed 05 Nov 2025 02:40:36 PM EST | World Community Grid | Sending scheduler request: To report completed tasks.
Wed 05 Nov 2025 02:40:36 PM EST | World Community Grid | Reporting 8 completed tasks
Wed 05 Nov 2025 02:40:36 PM EST | World Community Grid | Requesting new tasks for CPU
Wed 05 Nov 2025 02:40:41 PM EST | World Community Grid | Scheduler request failed: HTTP service unavailable
Wed 05 Nov 2025 02:57:58 PM EST | World Community Grid | Sending scheduler request: To report completed tasks.
Wed 05 Nov 2025 02:57:58 PM EST | World Community Grid | Reporting 8 completed tasks
Wed 05 Nov 2025 02:57:58 PM EST | World Community Grid | Requesting new tasks for CPU
Wed 05 Nov 2025 02:58:03 PM EST | World Community Grid | Scheduler request failed: HTTP service unavailable
|
|
Joined: 30 Mar 20 Posts: 613
|
The WCG website is as slow as molasses in winter.
Dylan writes: Database crashed, was able to clear the write lock/disk sleeps causing a crash loop and try restarting the container which was stuck, but it seems there is an IO issue with the volume that the BOINC database runs from or some further cleanup I still need to do before I can get the database up and running again. I can r/w to the volume manually, so hopefully this is something I am able to handle without reaching out to hosting about the volume. It is a Ceph RBD and we store backups to a separate NFS mount point, so data loss is not expected, but I don't quite know how long this is going to take yet. |
|
Joined: 30 Mar 20 Posts: 613
|
Changes to the Device Profiles are now propagating to the BOINC client. Example:
World Community Grid 2025-11-06 13:17:52 General prefs: from World Community Grid (last modified 06-Nov-2025 13:17:20)
Thank you, Dylan, for all your hard work (which, BTW, seems to go on at all hours, even in the middle of the Toronto night). |
|
Joined: 10 May 07 Posts: 1704
|
Seems to be working at the moment. I woke up to a few tasks running on my Android phone this morning and it recently uploaded one task with no issues. Big thanks to Dylan and everyone else at Jurisica for making it this far!!! |
|
Joined: 30 Mar 20 Posts: 613
|
In reply to robsmith's message of 5 Nov 2025: While plenty of tasks are available the validation queue is getting longer and longer - someone needs to give it a prod/kick/enema to get it moving properly.
Yeah, there's almost no validation going on at the moment. That's expected while Dylan is working on that part of the system. There are also tons of finished tasks to validate that were crunched before the migration (and cached tasks crunched during it), then uploaded and reported when the system came back. It will take a long time before everything is back to normal. |
|
Joined: 19 Dec 05 Posts: 118
|
Does not seem to be working now. Lots of these:
Fri 07 Nov 2025 12:55:40 PM EST | World Community Grid | Sending scheduler request: To fetch work.
Fri 07 Nov 2025 12:55:40 PM EST | World Community Grid | Requesting new tasks for CPU
Fri 07 Nov 2025 12:55:42 PM EST | World Community Grid | Scheduler request completed: got 0 new tasks
Fri 07 Nov 2025 12:55:42 PM EST | World Community Grid | Server error: feeder not running
|
|
Joined: 30 Mar 20 Posts: 613
|
The feeder is back, but no new work available yet. |
|
Joined: 30 Mar 20 Posts: 613
|
New update from Dylan on the WCG forum: https://www.worldcommunitygrid.org/forums/wcg/viewthread_thread,47541_offset,380#707377
It has been resends only over the weekend. The mcm1_create_work daemons lost their database connection during BOINC database maintenance, and I realized they needed some code changes to not skip batches when the database connection won't allow BOINC to receive the new workunits defined by the batch plan. A few other BOINC daemons like the batch assimilator (all the daemons live in the same container I've stuffed all our legacy code into) also needed some work to set up the validation fix.
There will be a more comprehensive update posted shortly on the lab website "Operational Status" tab (https://www.cs.toronto.edu/~juris/jlab/wcg.html), but the TLDR is that today I plan to restart MCM1 batch production after I push a new build of the BOINC daemons and transitioner, and set up a Kafka broker on the BOINC database node to backfill the assimilators with resends and scheduler-reported tasks that didn't have the details needed to calc credit when the assimilator first received the upload pair from the validator. If all that goes well, then I am pretty sure I can finally piece together the full validation backlog from over the break and set the assimilators upon it. |
|
Joined: 3 Nov 20 Posts: 8
|
Here is the whole update: https://www.cs.toronto.edu/~juris/jlab/wcg.html

November 11, 2025

Database maintenance over Friday/Saturday completed without issue. We have resolved an issue with the backup scripts, effectively increased memory used to service database queries and added some new indices. We expect better performance from the BOINC database going forward. However, the disk remains slower than initial benchmarking when we stood up the database. We will monitor and reach out to hosting to see if the Ceph placement group expansion (that caused the stuck blocks of that particular disk when the placement group the result table lives on) got stuck in a "peering" state. We were informed that we should expect temporary, possibly intermittent slow IO during this Ceph maintenance window. If we can get faster disks for the BOINC database (which would require restoring the database to a new volume as we did to migrate) we will consider a maintenance window. Right now, we are optimistic the issues revealed in the new system by hanging database queries and database crashes can all be resolved with patches to the new BOINC daemons, and current performance will be sufficient.

As mentioned, this event identified several issues with the new BOINC daemons. MCM1 workunit creation proceeds in the Kafka topic even though the database is down: the mcm1_create_work daemon for its Kafka partition on science01...science06 tries to commit its part of the batch, the database isn't there, so it doesn't do anything, but it does commit its offset/pointer into the batch plan topic and moves on to consume the next batch plan. That means every 10-15m while the database is down, a batch is effectively skipped. We were able to fix that, and have restarted MCM1 batch creation at roughly 5:00 p.m. EST, November 10th, 2025.

We believe we have finally architected a fix for the pending validation backlog issue. This requires some non-trivial plumbing in the MCM1 batch assimilator, a Kafka connector deployed on the BOINC database node, and transitioner code changes. Workunit supply may remain artificially lower while we roll out the new batch assimilator builds and monitor the transitioner -> Kafka event consumption and result table interaction.

We were able to resolve the issue with computing preferences not being updated from the website to BOINC client and vice versa. Generally, when the BOINC database goes down, so does the event listener that handles these messages on the webserver.

We are still working on resolving the validation backlog from over the break. With the result table bricked during the Ceph maintenance, we architected a "trust the filesystem" solution, and we are hopeful that this issue will be resolved this week.

MAM1 was initially planned to be resumed in beta30 last week, to see if 7.07 fairly schedules work and respects --nthreads, which is a blocking issue in promoting the beta application to production. Depending on the error rate and behaviour on BOINC clients, we would then consider the stable code paths for the first production batches. Given our increased control over batch parameters with the new Kafka topic that uses a protobuf schema to fill out the workunit and result table entries, we intend to run work in production on Linux as soon as the beta30 application is stable with an error rate lower than MCM1 excepting the GLIBC dependency, which is typically the only repeated error we see from clients on the current LibTorch code path. We will then rely on iterating the beta30 application to 7.08 and 7.09 to get GPU and Windows support, and Parquet IO for input and uploaded results.

Hans S. |
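The skipped-batch behaviour described in the update can be illustrated with a toy consumer loop. This is only a minimal sketch in plain Python, not the WCG code: `FlakyDatabase`, `consume`, the `commit_on_failure` flag, and the batch names are all hypothetical, and a list index stands in for a committed Kafka offset.

```python
class FlakyDatabase:
    """Hypothetical stand-in for the BOINC database; 'down' on chosen polls."""

    def __init__(self, down_polls):
        self.down_polls = set(down_polls)
        self.polls = 0
        self.rows = []

    def insert(self, batch):
        """Store a batch, or raise ConnectionError while the database is down."""
        self.polls += 1
        if self.polls in self.down_polls:
            raise ConnectionError("database unavailable")
        self.rows.append(batch)


def consume(plan, db, commit_on_failure, max_polls=20):
    """Walk a batch plan; `offset` plays the role of a committed offset."""
    offset = 0
    for _ in range(max_polls):
        if offset >= len(plan):
            break
        try:
            db.insert(plan[offset])
        except ConnectionError:
            if commit_on_failure:
                offset += 1  # old behaviour: pointer advances, batch is lost
            continue  # otherwise the same batch is retried on the next poll
        offset += 1  # commit only after the write succeeded
    return db.rows


plan = ["batch-1", "batch-2", "batch-3", "batch-4"]
skipping = consume(plan, FlakyDatabase({2, 3}), commit_on_failure=True)
retrying = consume(plan, FlakyDatabase({2, 3}), commit_on_failure=False)
print(skipping)  # ['batch-1', 'batch-4']
print(retrying)  # ['batch-1', 'batch-2', 'batch-3', 'batch-4']
```

With the database down for polls 2 and 3, the first run loses two batches, while the second holds its position and creates them once the database returns.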
|
Joined: 30 Mar 20 Posts: 613
|
New work is incoming, but there is a strange issue with quite a lot of my new tasks. Not all of them, but many. Check the OS types and versions you're paired with. My Windows 8.1 is paired with "T". That's a new OS and Version for me. Might be an AI reminder that I need to get myself another cup of Tea, maybe :-)
A few examples, of many:
https://www.worldcommunitygrid.org/contribution/workunit/772178042
https://www.worldcommunitygrid.org/contribution/workunit/772178045
https://www.worldcommunitygrid.org/contribution/workunit/772178049 |
Dave Joined: 28 Jun 10 Posts: 3060
|
Check the OS types, and versions you're paired with. My Windows 8.1 is paired with "T". That's a new OS and Version for me.
No coffee reminder for me. (I only have Linux and Android tasks.) What I did notice is that Android is picking up _0 and _1 tasks. All my Linux ones are _2 or, occasionally, _3. |
Dave Joined: 28 Jun 10 Posts: 3060
|
Now getting freshly generated tasks on Linux as well as Android. Even running only one task at a time on each platform, I am producing results a lot faster than they are getting validated. |
|
Joined: 30 Mar 20 Posts: 613
|
In reply to Dave's message of 13 Nov 2025: Now getting freshly generated tasks on Linux as well as Android. Even running only one task at a time on each platform, I am producing results a lot faster than they are getting validated.
Yeah, validation is more or less dead. Only a few tasks finished by both wingmen are validated per day now. Pending Validation tasks are building up fast. They still haven't solved that issue. |
Dave Joined: 28 Jun 10 Posts: 3060
|
Yeah, validation is more or less dead. Only a few tasks finished by both wingmen are validated per day now. Pending Validation are building up fast. They still haven't solved that issue.
Interestingly, three tasks completed this morning have validated almost right away. No sign of the older ones getting done though. |
|
Joined: 19 Dec 05 Posts: 118
|
In reply to Dave's message of 12 Nov 2025: Check the OS types, and versions you're paired with. My Windows 8.1 is paired with "T". That's a new OS and Version for me.
No coffee reminder for me. (I only have Linux and Android tasks.) What I did notice is that Android is picking up _0 and _1 tasks. All my Linux ones are _2 or, occasionally, _3.
My Linux work is picking up mostly _0 and _1, as are my partners. All seem to be valid. Here is a typical result:
MCM1_0242073_7755
Project name: Mapping Cancer Markers
Created: Nov. 3, 2025 - 03:40 UTC
Name: MCM1_0242073_7755
Minimum Quorum: 2
Replication: 2

Result name: MCM1_0242073_7755_0
OS type: Linux EndeavourOS
OS version: EndeavourOS Linux [6.17.7-arch1-1|libc 2.42]
Status: Valid
Sent time: 2025-11-14 03:17:36 UTC
Time due: 2025-11-14 09:23:46 UTC
Return time: 2025-11-14 03:21:33 UTC
Cpu time: 2.36
Elapsed time: 2.87
Claimed credit: 22.8
Granted credit: 63

Result name: MCM1_0242073_7755_1
OS type: Linux Red Hat Enterprise Linux
OS version: Red Hat Enterprise Linux 8.10 (Ootpa) [4.18.0-553.83.1.el8_10.x86_64|libc 2.28]
Status: Valid
Time (sent, due, or return; only one value pasted): 2025-11-14 07:10:20 UTC
Cpu time: 1.71
Elapsed time: 1.74
Claimed credit: 103.2
Granted credit: 63
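In this particular result, the granted credit of 63 equals the mean of the two claimed credits (22.8 and 103.2). A quick check in Python, using only the numbers pasted above; whether the server averages claims in general is an assumption here, not something the post states.

```python
# Claimed credit per result, copied from the workunit details above.
claimed = {
    "MCM1_0242073_7755_0": 22.8,
    "MCM1_0242073_7755_1": 103.2,
}

# Mean of the claims, for comparison with the granted credit of 63.
mean_claimed = sum(claimed.values()) / len(claimed)
print(round(mean_claimed, 1))
```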
|
|
Joined: 12 Jun 09 Posts: 2142
|
Since the changeover, the most I've ever got was 2-3 days' worth. This latest batch was for 6 days; the last WU completed at 23:54 this evening, and no other tasks have downloaded. What annoys me the most is the stats sites being unable to get stats. They can send "We miss you" e-mails but can't set up stats so we can see what we achieve as individuals/teams... |
|
Joined: 1 Jul 16 Posts: 177
|
WCG would need to update the stats. The last update was August 30th. https://download.worldcommunitygrid.org/boinc/stats/ |
Copyright © 2025 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.