Thread 'Anything and Everything to do with (WCG) World Community Grid'

Hadrian
Joined: 5 Nov 11
Posts: 8
United Kingdom
Message 117329 - Posted: 2 Nov 2025, 14:44:00 UTC - in response to Message 117325.  

Some of mine from yesterday will not upload, 24 hours so far.
ID: 117329
MyrCu
Joined: 27 Aug 22
Posts: 40
Message 117330 - Posted: 2 Nov 2025, 21:26:46 UTC - in response to Message 117329.  

Same here.
ID: 117330
robsmith
Volunteer tester
Help desk expert
Joined: 25 May 09
Posts: 1393
United Kingdom
Message 117358 - Posted: 5 Nov 2025, 17:54:49 UTC

While plenty of tasks are available the validation queue is getting longer and longer - someone needs to give it a prod/kick/enema to get it moving properly.
ID: 117358
Jean-David
Joined: 19 Dec 05
Posts: 118
United States
Message 117359 - Posted: 5 Nov 2025, 20:26:34 UTC - in response to Message 117358.  

Not only that ...

Wed 05 Nov 2025 02:40:36 PM EST | World Community Grid | Sending scheduler request: To report completed tasks.
Wed 05 Nov 2025 02:40:36 PM EST | World Community Grid | Reporting 8 completed tasks
Wed 05 Nov 2025 02:40:36 PM EST | World Community Grid | Requesting new tasks for CPU
Wed 05 Nov 2025 02:40:41 PM EST | World Community Grid | Scheduler request failed: HTTP service unavailable
Wed 05 Nov 2025 02:57:58 PM EST | World Community Grid | Sending scheduler request: To report completed tasks.
Wed 05 Nov 2025 02:57:58 PM EST | World Community Grid | Reporting 8 completed tasks
Wed 05 Nov 2025 02:57:58 PM EST | World Community Grid | Requesting new tasks for CPU
Wed 05 Nov 2025 02:58:03 PM EST | World Community Grid | Scheduler request failed: HTTP service unavailable

ID: 117359
Grumpy Swede
Joined: 30 Mar 20
Posts: 613
Sweden
Message 117360 - Posted: 5 Nov 2025, 20:55:35 UTC

WCG website slow as molasses in the winter.

Dylan writes:

The database crashed. I was able to clear the write lock/disk sleeps causing a crash loop and tried restarting the container, which was stuck, but it seems there is either an IO issue with the volume that the BOINC database runs from or some further cleanup I still need to do before I can get the database up and running again. I can r/w to the volume manually, so hopefully this is something I can handle without reaching out to hosting about the volume. It is a Ceph RBD and we store backups to a separate NFS mount point, so data loss is not expected, but I don't quite know how long this is going to take yet.
ID: 117360
Grumpy Swede
Joined: 30 Mar 20
Posts: 613
Sweden
Message 117366 - Posted: 6 Nov 2025, 12:32:06 UTC

Changes to the Device Profiles are now propagating to the BOINC client.
Example: World Community Grid 2025-11-06 13:17:52 General prefs: from World Community Grid (last modified 06-Nov-2025 13:17:20)

Thank you Dylan, for all your hard work (which BTW seems to go on during all hours, even in the middle of the Toronto nights.)
ID: 117366
Dr Who Fan
Joined: 10 May 07
Posts: 1704
United States
Message 117367 - Posted: 6 Nov 2025, 15:04:28 UTC - in response to Message 117366.  

Seems to be working at the moment.
I woke up to a few tasks running on my Android phone this morning and it recently uploaded one task with no issues.

Big thanks to Dylan and everyone else at Jurisica for making it this far!!!
ID: 117367
Grumpy Swede
Joined: 30 Mar 20
Posts: 613
Sweden
Message 117368 - Posted: 6 Nov 2025, 17:48:28 UTC - in response to Message 117358.  

In reply to robsmith's message of 5 Nov 2025:
While plenty of tasks are available the validation queue is getting longer and longer - someone needs to give it a prod/kick/enema to get it moving properly.
Yeah, there's almost no validation going on at the moment. That's expected while Dylan is working on that part of the system. There are also tons of finished tasks to validate that were crunched before the migration (and tasks cached during it), then uploaded and reported when the system came back. It will take a long time before everything is back to normal.
ID: 117368
Jean-David
Joined: 19 Dec 05
Posts: 118
United States
Message 117374 - Posted: 7 Nov 2025, 18:41:10 UTC - in response to Message 117367.  

Does not seem to be working now: Lots of these:

Fri 07 Nov 2025 12:55:40 PM EST | World Community Grid | Sending scheduler request: To fetch work.
Fri 07 Nov 2025 12:55:40 PM EST | World Community Grid | Requesting new tasks for CPU
Fri 07 Nov 2025 12:55:42 PM EST | World Community Grid | Scheduler request completed: got 0 new tasks
Fri 07 Nov 2025 12:55:42 PM EST | World Community Grid | Server error: feeder not running
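
The event-log lines quoted in this thread share a `timestamp | project | message` layout. As a minimal sketch, assuming that layout holds for the rest of the log, the failures could be tallied as below. The function name `tally_failures` and the two failure prefixes are illustrative choices taken from the lines quoted here, not part of any BOINC tool.

```python
from collections import Counter

# Sample lines copied from the log excerpts quoted in this thread.
SAMPLE_LOG = """\
Wed 05 Nov 2025 02:40:36 PM EST | World Community Grid | Sending scheduler request: To report completed tasks.
Wed 05 Nov 2025 02:40:41 PM EST | World Community Grid | Scheduler request failed: HTTP service unavailable
Fri 07 Nov 2025 12:55:42 PM EST | World Community Grid | Scheduler request completed: got 0 new tasks
Fri 07 Nov 2025 12:55:42 PM EST | World Community Grid | Server error: feeder not running
"""

# Message prefixes treated as failures (an assumption based on the lines above).
FAILURE_PREFIXES = ("Scheduler request failed:", "Server error:")


def tally_failures(log_text: str) -> Counter:
    """Count failure messages per (project, reason) in pipe-separated log text."""
    counts = Counter()
    for line in log_text.splitlines():
        parts = [p.strip() for p in line.split("|", 2)]
        if len(parts) != 3:
            continue  # skip lines that do not match the three-field layout
        _timestamp, project, message = parts
        for prefix in FAILURE_PREFIXES:
            if message.startswith(prefix):
                reason = message[len(prefix):].strip()
                counts[(project, reason)] += 1
    return counts


for (project, reason), n in tally_failures(SAMPLE_LOG).items():
    print(f"{project}: {reason} x{n}")
```

Pasting a longer stretch of the event log into `SAMPLE_LOG` would show whether the "HTTP service unavailable" and "feeder not running" errors overlap or follow one another.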

ID: 117374
Grumpy Swede
Joined: 30 Mar 20
Posts: 613
Sweden
Message 117382 - Posted: 8 Nov 2025, 17:25:27 UTC

The feeder is back, but no new work available yet.
ID: 117382
Grumpy Swede
Joined: 30 Mar 20
Posts: 613
Sweden
Message 117395 - Posted: 10 Nov 2025, 19:19:37 UTC
Last modified: 10 Nov 2025, 19:20:50 UTC

New update from Dylan on the WCG forum.
https://www.worldcommunitygrid.org/forums/wcg/viewthread_thread,47541_offset,380#707377

It has been resends only over the weekend. The mcm1_create_work daemons lost their database connection during BOINC database maintenance, and I realized they needed some code changes to not skip batches when the database connection won't allow BOINC to receive the new workunits defined by the batch plan. A few other BOINC daemons, like the batch assimilator (all the daemons live in the same container I've stuffed all our legacy code into), also needed some work to set up the validation fix. There will be a more comprehensive update posted shortly on the lab website "Operational Status" tab (https://www.cs.toronto.edu/~juris/jlab/wcg.html), but the TLDR is that today I plan to restart MCM1 batch production after I push a new build of the BOINC daemons and transitioner, and set up a Kafka broker on the BOINC database node to backfill the assimilators with resends and scheduler-reported tasks that didn't have the details needed to calc credit when the assimilator first received the upload pair from the validator. If all that goes well, then I am pretty sure I can finally piece together the full validation backlog from over the break and set the assimilators upon it.
ID: 117395
Hans Sveen
Joined: 3 Nov 20
Posts: 8
Norway
Message 117400 - Posted: 10 Nov 2025, 21:30:36 UTC
Last modified: 10 Nov 2025, 21:34:22 UTC

Here is the whole update:


https://www.cs.toronto.edu/~juris/jlab/wcg.html

November 11, 2025
Database maintenance over Friday/Saturday completed without issue. We have resolved an issue with the backup scripts, effectively increased memory used to service database queries and added some new indices. We expect better performance from the BOINC database going forward.
However, the disk remains slower than initial benchmarking when we stood up the database. We will monitor and reach out to hosting to see if the Ceph placement group expansion (that caused the stuck blocks of that particular disk when the placement group the result table lives on) got stuck in a "peering" state. We were informed that we should expect temporary, possibly intermittent slow IO during this Ceph maintenance window. If we can get faster disks for the BOINC database (which would require restoring the database to a new volume as we did to migrate) we will consider a maintenance window. Right now, we are optimistic the issues revealed in the new system by hanging database queries and database crashes can all be resolved with patches to the new BOINC daemons, and current performance will be sufficient.
As mentioned, this event identified several issues with the new BOINC daemons.
MCM1 workunit creation proceeds in the Kafka topic even though the database is down: the mcm1_create_work daemon for its Kafka partition on science01...science06 tries to commit its part of the batch, the database isn't there, so it doesn't do anything, but it does commit its offset/pointer into the batch plan topic and moves on to consume the next batch plan. That means every 10-15m while the database is down, a batch is effectively skipped. We were able to fix that, and have restarted MCM1 batch creation at roughly 5:00 p.m. EST, November 10th, 2025.
We believe we have finally architected a fix for the pending validation backlog issue. This requires some non-trivial plumbing in the MCM1 batch assimilator, a Kafka connector deployed on the BOINC database node, and transitioner code changes.
Workunit supply may remain artificially lower while we roll out the new batch assimilator builds and monitor the transitioner -> Kafka event consumption and result table interaction.
We were able to resolve the issue with computing preferences not being updated from the website to BOINC client and vice versa. Generally, when the BOINC database goes down, so does the event listener that handles these messages on the webserver.
We are still working on resolving the validation backlog from over the break. With the result table bricked during the Ceph maintenance, we architected a "trust the filesystem" solution, and we are hopeful that this issue will be resolved this week.
MAM1 was initially planned to be resumed in beta30 last week, to see if 7.07 fairly schedules work and respects --nthreads, which is a blocking issue in promoting the beta application to production. Depending on the error rate and behaviour on BOINC clients, we would then consider the stable code paths for the first production batches. Given our increased control over batch parameters with the new Kafka topic that uses a protobuf schema to fill out the workunit and result table entries, we intend to run work in production on Linux as soon as the beta30 application is stable with an error rate lower than MCM1, excepting the GLIBC dependency, which is typically the only repeated error we see from clients on the current LibTorch code path. We will then rely on iterating the beta30 application to 7.08 and 7.09 to get GPU and Windows support, and Parquet IO for input and uploaded results.
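
The skipped-batch behaviour described in the update above (the offset is committed even though the database write never happened) can be illustrated with a small simulation. This is only a sketch under stated assumptions: it uses no real Kafka client and none of the project's daemon code, and `FakeDatabase`, `consume`, the tick-based outage model and the batch numbers are all hypothetical names and values made up for illustration.

```python
from dataclasses import dataclass, field


class DatabaseDown(Exception):
    """Raised by the stand-in database while it is unavailable."""


@dataclass
class FakeDatabase:
    """Hypothetical stand-in for the BOINC database."""
    up: bool = True
    workunits: list = field(default_factory=list)

    def insert_batch(self, batch_id: int) -> None:
        if not self.up:
            raise DatabaseDown(f"cannot write batch {batch_id}")
        self.workunits.append(batch_id)


def consume(plans, db, outages, commit_on_failure):
    """Process batch plans in order; return (final offset, skipped batches).

    `outages` is the set of polling ticks during which the database is down.
    """
    offset, tick, skipped = 0, 0, []
    while offset < len(plans) and tick < 100:  # tick cap keeps the demo finite
        db.up = tick not in outages
        batch_id = plans[offset]
        try:
            db.insert_batch(batch_id)
            offset += 1  # commit only after the write succeeded
        except DatabaseDown:
            if commit_on_failure:
                skipped.append(batch_id)
                offset += 1  # old behaviour: offset advances, batch is lost
            # fixed behaviour: leave the offset alone and retry next tick
        tick += 1
    return offset, skipped


plans = [101, 102, 103, 104]  # made-up batch plan ids
for label, flag in (("old", True), ("fixed", False)):
    db = FakeDatabase()
    _, skipped = consume(plans, db, outages={1, 2}, commit_on_failure=flag)
    print(label, "written:", db.workunits, "skipped:", skipped)
```

With the database down for two ticks, the old behaviour drops the two batches polled during the outage, while the fixed variant holds its offset and writes every batch once the database returns.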


Hans S.
ID: 117400
Grumpy Swede
Joined: 30 Mar 20
Posts: 613
Sweden
Message 117409 - Posted: 12 Nov 2025, 15:27:06 UTC

New work is incoming, but there is a strange issue for quite a lot of my new tasks. Not all of them though, but many.

Check the OS types, and versions you're paired with. My Windows 8.1 is paired with "T". That's a new OS and Version for me.
Might be an AI reminder that I need to get myself another cup of Tea, maybe :-)

A few examples, of many:
https://www.worldcommunitygrid.org/contribution/workunit/772178042
https://www.worldcommunitygrid.org/contribution/workunit/772178045
https://www.worldcommunitygrid.org/contribution/workunit/772178049
ID: 117409
Dave
Help desk expert
Joined: 28 Jun 10
Posts: 3060
United Kingdom
Message 117411 - Posted: 12 Nov 2025, 15:49:51 UTC - in response to Message 117409.  

Check the OS types, and versions you're paired with. My Windows 8.1 is paired with "T". That's a new OS and Version for me.
Might be an AI reminder that I need to get myself another cup of Tea, maybe :-)
No coffee reminder for me. (I only have Linux and Android tasks.) What I did notice is that Android is picking up _0 and _1 tasks. All my Linux ones are _2 or, occasionally, _3.
ID: 117411
Dave
Help desk expert
Joined: 28 Jun 10
Posts: 3060
United Kingdom
Message 117413 - Posted: 13 Nov 2025, 10:01:44 UTC

Now getting freshly generated tasks on Linux as well as Android. Even running only one task at a time on each platform, I am producing results a lot faster than they are getting validated.
ID: 117413
Grumpy Swede
Joined: 30 Mar 20
Posts: 613
Sweden
Message 117414 - Posted: 13 Nov 2025, 10:47:16 UTC - in response to Message 117413.  

In reply to Dave's message of 13 Nov 2025:
Now getting freshly generated tasks on Linux as well as Android. Even running only one task at a time on each platform, I am producing results a lot faster than they are getting validated.
Yeah, validation is more or less dead. Only a few tasks finished by both wingmen are validated per day now. Pending Validation are building up fast. They still haven't solved that issue.
ID: 117414
Dave
Help desk expert
Joined: 28 Jun 10
Posts: 3060
United Kingdom
Message 117423 - Posted: 14 Nov 2025, 10:52:20 UTC - in response to Message 117414.  

Yeah, validation is more or less dead. Only a few tasks finished by both wingmen are validated per day now. Pending Validation are building up fast. They still haven't solved that issue.
Interestingly, three tasks completed this morning have validated almost right away. No sign of the older ones getting done though.
ID: 117423
Jean-David
Joined: 19 Dec 05
Posts: 118
United States
Message 117427 - Posted: 14 Nov 2025, 18:42:21 UTC - in response to Message 117411.  

In reply to Dave's message of 12 Nov 2025:
Check the OS types, and versions you're paired with. My Windows 8.1 is paired with "T". That's a new OS and Version for me.
Might be an AI reminder that I need to get myself another cup of Tea, maybe :-)
No coffee reminder for me. (I only have Linux and Android tasks.) What I did notice is that Android is picking up _0 and _1 tasks. All my Linux ones are _2 or, occasionally, _3.


My Linux work is picking up mostly _0 and _1, as are my partners. All seem to be valid. Here is a typical result:

MCM1_0242073_7755
Project name:     Mapping Cancer Markers
Created:          Nov. 3, 2025 - 03:40 UTC
Name:             MCM1_0242073_7755
Minimum Quorum:   2
Replication:      2

Result name:      MCM1_0242073_7755_0
OS type:          Linux EndeavourOS
OS version:       EndeavourOS Linux [6.17.7-arch1-1|libc 2.42]
Status:           Valid
Sent time:        2025-11-14 03:17:36 UTC
Time due:         2025-11-14 09:23:46 UTC
Return time:      2025-11-14 03:21:33 UTC
Cpu time:         2.36
Elapsed time:     2.87
Claimed credit:   22.8
Granted credit:   63

Result name:      MCM1_0242073_7755_1
OS type:          Linux Red Hat Enterprise Linux
OS version:       Red Hat Enterprise Linux 8.10 (Ootpa) [4.18.0-553.83.1.el8_10.x86_64|libc 2.28]
Status:           Valid
Time:             2025-11-14 07:10:20 UTC
Cpu time:         1.71
Elapsed time:     1.74
Claimed credit:   103.2
Granted credit:   63
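
In the listing above, the two results claim 22.8 and 103.2 credit and both are granted 63. For this one pair, 63 happens to be the arithmetic mean of the two claims. Whether WCG computes granted credit this way in general is an assumption, not something stated in the thread; the short check below only confirms the arithmetic for these numbers.

```python
import math

claimed = [22.8, 103.2]  # claimed credit for results _0 and _1, from the listing
granted = 63.0           # granted credit shown for both results

mean_claimed = sum(claimed) / len(claimed)

# Compare with a tolerance because the claims are decimal fractions in binary floats.
matches = math.isclose(mean_claimed, granted, rel_tol=1e-9)
print(f"mean of claims = {mean_claimed:.1f}, granted = {granted:.1f}, match = {matches}")
```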

ID: 117427
Sirius B
Joined: 12 Jun 09
Posts: 2142
Ireland
Message 117465 - Posted: 19 Nov 2025, 0:05:52 UTC

Since the changeover, the most I've ever got was 2/3 days' worth.
This latest batch was for 6 days; the last wu completed at 23:54 this evening and no other tasks have downloaded.
What annoys me the most is the stats sites being unable to get stats. They can send "We miss you" e-mails but can't set up stats so we can see what we achieve as individuals/teams...
ID: 117465
mmonnin
Joined: 1 Jul 16
Posts: 177
United States
Message 117466 - Posted: 19 Nov 2025, 0:13:47 UTC
Last modified: 19 Nov 2025, 0:13:56 UTC

WCG would need to update the stats. The last update was August 30th.

https://download.worldcommunitygrid.org/boinc/stats/
ID: 117466

Copyright © 2025 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.