Thread 'Anything and Everything to do with (WCG) World Community Grid'

Author	Message
Marco Besozzi Send message Joined: 3 Oct 25 Posts: 1	Message 116994 - Posted: 3 Oct 2025, 4:08:46 UTC - in response to Message 116986. Since IBM moved to the Krembil Research Institute, it's been one disaster after another. For two years, I've had 28 CPUs processing data for Folding@home, and I've never had a single outage. The research projects are very diverse (Alzheimer's, COVID-19, malaria, kidney cancer, epigenetics), unlike WCG's current, very limited focus. A huge, unbridgeable gap. I think I'll soon cease collaborating with WCG. ID: 116994 · Reply Quote

robsmith Volunteer tester Help desk expert Send message Joined: 25 May 09 Posts: 1460	Message 116995 - Posted: 3 Oct 2025, 7:00:15 UTC - in response to Message 116994. "Since IBM moved to the Krembil Research Institute" Actually, WCG moved from IBM to Krembil. This process was an absolute, but not unexpected, shambles, as many delays were encountered in moving the data to its new home and in basically rewriting a large proportion of the server software, both probably further hampered by less than ideal documentation. Hindsight on my part says they should have ported the data into a "proper" BOINC database structure running under native SQL, and at the same time used standard BOINC server applications for everything rather than the strange mongrel they have just now... Now they've moved that mongrel to another set of servers and are hitting similar issues - oh what a surprise! ID: 116995 · Reply Quote

Grumpy Swede Send message Joined: 30 Mar 20 Posts: 741	Message 116996 - Posted: 3 Oct 2025, 9:53:26 UTC Last modified: 3 Oct 2025, 10:24:09 UTC Some progress, although with an error message. I haven't tried with BOINC yet, but: https://scheduler.worldcommunitygrid.org/boinc/wcg_cgi/fcgi Now responds with:
<scheduler_reply>
<scheduler_version>701</scheduler_version>
<master_url>http://www.worldcommunitygrid.org/</master_url>
<request_delay>121.200000</request_delay>
<message priority="low">Error in request message: xp.get_tag() failed </message>
<project_name>World Community Grid</project_name>
</scheduler_reply>
Edit: Maybe the "xp.get_tag() failed" message is because the test request comes from my browser, and not from BOINC.
Edit: Uploading works, but reporting and asking for new work gives the following error message in BOINC:
World Community Grid 2025-10-03 11:50:07 Another scheduler instance is running for this host
We've seen that before, and I guess that is pretty easy to fix. ID: 116996 · Reply Quote
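For anyone who wants to check the scheduler endpoint from a script rather than a browser, here is a minimal sketch that pulls the useful fields out of the reply quoted above. It assumes the reply is well-formed XML, as this one happens to be; the BOINC client uses its own parser, so this is only an illustration. The name `summarize_reply` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Reply text copied from the post above.
REPLY = """<scheduler_reply>
<scheduler_version>701</scheduler_version>
<master_url>http://www.worldcommunitygrid.org/</master_url>
<request_delay>121.200000</request_delay>
<message priority="low">Error in request message: xp.get_tag() failed </message>
<project_name>World Community Grid</project_name>
</scheduler_reply>"""


def summarize_reply(xml_text):
    """Return the project name, back-off delay, and messages from a reply."""
    root = ET.fromstring(xml_text)
    return {
        "project": root.findtext("project_name"),
        # Seconds the server asks the client to wait before the next request.
        "delay_seconds": float(root.findtext("request_delay", default="0")),
        # Each message as (priority attribute, stripped text).
        "messages": [
            (m.get("priority"), (m.text or "").strip())
            for m in root.findall("message")
        ],
    }


summary = summarize_reply(REPLY)
print(summary)
```

A reply carrying only an error message and a delay, with no work, matches what the posts in this thread describe.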

Jean-David Send message Joined: 19 Dec 05 Posts: 124	Message 116999 - Posted: 3 Oct 2025, 12:50:41 UTC - in response to Message 116993. I am getting this now...
Fri 03 Oct 2025 08:45:36 AM EDT | World Community Grid | Sending scheduler request: To fetch work.
Fri 03 Oct 2025 08:45:36 AM EDT | World Community Grid | Requesting new tasks for CPU
Fri 03 Oct 2025 08:45:38 AM EDT | World Community Grid | Scheduler request completed: got 0 new tasks
Fri 03 Oct 2025 08:45:38 AM EDT | World Community Grid | Another scheduler instance is running for this host
Fri 03 Oct 2025 08:45:38 AM EDT | World Community Grid | Project requested delay of 121 seconds
ID: 116999 · Reply Quote
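The event log lines above follow a "timestamp | project | message" layout, so they are easy to pick apart when watching for this error. Here is a small sketch, assuming that three-field layout holds for every line of interest; `parse_line` and `requested_delay` are hypothetical helper names.

```python
import re

# Log lines copied from the post above.
LOG = """\
Fri 03 Oct 2025 08:45:36 AM EDT | World Community Grid | Sending scheduler request: To fetch work.
Fri 03 Oct 2025 08:45:36 AM EDT | World Community Grid | Requesting new tasks for CPU
Fri 03 Oct 2025 08:45:38 AM EDT | World Community Grid | Scheduler request completed: got 0 new tasks
Fri 03 Oct 2025 08:45:38 AM EDT | World Community Grid | Another scheduler instance is running for this host
Fri 03 Oct 2025 08:45:38 AM EDT | World Community Grid | Project requested delay of 121 seconds
"""


def parse_line(line):
    """Split one log line into (timestamp, project, message), or None."""
    parts = [p.strip() for p in line.split("|", 2)]
    return tuple(parts) if len(parts) == 3 else None


def requested_delay(log_text):
    """Return the first back-off delay (in seconds) the project asked for."""
    for line in log_text.splitlines():
        parsed = parse_line(line)
        if parsed is None:
            continue
        match = re.search(r"Project requested delay of (\d+) seconds", parsed[2])
        if match:
            return int(match.group(1))
    return None


blocked = [
    parsed for parsed in map(parse_line, LOG.splitlines())
    if parsed and "Another scheduler instance" in parsed[2]
]
print(len(blocked), requested_delay(LOG))
```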

robsmith Volunteer tester Help desk expert Send message Joined: 25 May 09 Posts: 1460	Message 117000 - Posted: 3 Oct 2025, 13:22:00 UTC - in response to Message 116999. "Fri 03 Oct 2025 08:45:38 AM EDT | World Community Grid | Another scheduler instance is running for this host" A lot of us have been getting this message for a good few hours.... Have they got a double entry somewhere in the highly convoluted mass that is the mongrel scripts on the WCG servers? In short, there's nothing we can do. (Unless of course one is highly skilled in the incantations required to de-mongrelise the servers and scripts....) ID: 117000 · Reply Quote

Grumpy Swede Send message Joined: 30 Mar 20 Posts: 741	Message 117001 - Posted: 3 Oct 2025, 17:21:33 UTC - in response to Message 117000. In reply to robsmith's message of 3 Oct 2025: "'Fri 03 Oct 2025 08:45:38 AM EDT | World Community Grid | Another scheduler instance is running for this host' A lot of us have been getting this message for a good few hours.... Have they got a double entry somewhere in the highly convoluted mass that is the mongrel scripts on the WCG servers? In short, there's nothing we can do. (Unless of course one is highly skilled in the incantations required to de-mongrelise the servers and scripts....)" That particular issue has been seen before, even when all the systems were up and running. See this post on the WCG forum: https://www.worldcommunitygrid.org/forums/wcg/viewthread_thread,47541_offset,60#706287 ID: 117001 · Reply Quote

Dr Who Fan Send message Joined: 10 May 07 Posts: 1860	Message 117002 - Posted: 3 Oct 2025, 23:29:21 UTC Looks like another weekend without any WCG crunching, and nothing new from Jurisica since this morning. On a side note, how soon until the snow starts flying in Toronto? ID: 117002 · Reply Quote

Grumpy Swede Send message Joined: 30 Mar 20 Posts: 741	Message 117003 - Posted: 4 Oct 2025, 2:43:10 UTC New update: October 3, 2025. We are aware of the issue with the scheduler returning "Another scheduler instance is running for this host" and have identified the cause in the config.xml template we adapted for the new containerized environment. We will fix it once we have confirmed that the new event-driven validation and assimilation pipelines are working correctly. Uploads are being processed normally; we've confirmed the new architecture for the containerized file_upload_handler pool behind Apache is correctly producing to the per-application Kafka (Redpanda) topics, storing the event and result data in separate queues on the local broker's partition. As a result, there will be at least one more weekend sprint. Tentatively, we expect to be producing new workunits next week for MCM1, ARP1, and MAM1 beta version 7.07. Validations should resume over the weekend; initial releases of batches will be intermittent. ID: 117003 · Reply Quote

Dave Help desk expert Send message Joined: 28 Jun 10 Posts: 3392	Message 117007 - Posted: 5 Oct 2025, 13:27:12 UTC - in response to Message 117003. "As a result, there will be at least one more weekend sprint. Tentatively, we expect to be producing new workunits next week for MCM1, ARP1, and MAM1 beta version 7.07, validations should resume over the weekend, initial releases of batches will be intermittent." Since I last checked, my event log does say "scheduler request completed"; however, I still get the "Another scheduler instance is running" message, and my completed and aborted-due-to-time tasks are still not reported. Not too bothered, as this is on my phone and I have work from my main project to keep the desktop going for a few weeks at current estimates. ID: 117007 · Reply Quote

Grumpy Swede Send message Joined: 30 Mar 20 Posts: 741	Message 117011 - Posted: 6 Oct 2025, 11:02:54 UTC Today is the big day. (Or not). Maybe tomorrow (Or not). Whenever, is OK with me. ID: 117011 · Reply Quote

Dave Help desk expert Send message Joined: 28 Jun 10 Posts: 3392	Message 117013 - Posted: 6 Oct 2025, 13:47:09 UTC - in response to Message 117011. In reply to Grumpy Swede's message of 6 Oct 2025: Today is the big day. (Or not). Maybe tomorrow (Or not). Whenever, is OK with me. Not the big day so far looking at my Android. ID: 117013 · Reply Quote

Dr Who Fan Send message Joined: 10 May 07 Posts: 1860	Message 117014 - Posted: 7 Oct 2025, 0:38:55 UTC Almost 8:40 PM in Toronto and still no BOINC connection @ WCG. No new news from Jurisica since Oct 3 (3 days ago). What's another day, week, month, year, decade or century? ID: 117014 · Reply Quote

Grumpy Swede Send message Joined: 30 Mar 20 Posts: 741	Message 117015 - Posted: 7 Oct 2025, 8:45:53 UTC Monday was not the big day. Let's see if Tuesday is. There's no hurry, we still have Christmas Eve as an option. ID: 117015 · Reply Quote

Bill Freauff Send message Joined: 26 Mar 11 Posts: 255	Message 117016 - Posted: 7 Oct 2025, 9:12:45 UTC - in response to Message 117015. In reply to Grumpy Swede's message of 7 Oct 2025: Monday was not the big day. Let's see if Tuesday is. There's no hurry, we still have Christmas Eve as an option. Christmas 2026 will be an even numbered year. ID: 117016 · Reply Quote

Grumpy Swede Send message Joined: 30 Mar 20 Posts: 741	Message 117018 - Posted: 7 Oct 2025, 17:02:28 UTC Another deadline extension for tasks not yet uploaded and reported has happened. ID: 117018 · Reply Quote

Grumpy Swede Send message Joined: 30 Mar 20 Posts: 741	Message 117024 - Posted: 8 Oct 2025, 1:26:59 UTC Last modified: 8 Oct 2025, 2:01:16 UTC The BOINC system is up. It started to come back with correct replies at 2025-10-08 02:26:08 (UTC+2). No new tasks yet, and I haven't tried to report my hundreds of tasks on another computer yet. But WCG is/was still having issues, especially the website, which does/did not answer in time at all. So, the website is/was basically dead in the water when I first tried. I'm not surprised, of course, since there are many thousands of computers banging on the servers at the same time now, all of them trying to upload, report, and request tasks. There are also permission issues when trying to post on the forum now, and in other places too. Example for https://www.worldcommunitygrid.org/forums/wcg/addpostprocess: "403 Forbidden. You don't have permission to access this resource." The "contact" link has the same permission issue. But there's light at the end of the tunnel. Edit, added: The initial download of the 43 .png files goes immediately to "permanent HTTP error" and leaves 43 0-byte .png files in the BOINC WCG projects folder. I have mailed Igor Jurisica about these problems. ID: 117024 · Reply Quote

Grumpy Swede Send message Joined: 30 Mar 20 Posts: 741	Message 117025 - Posted: 8 Oct 2025, 3:19:58 UTC New WCG update from Igor: October 7, 2025. We have resolved the issue with the BOINC scheduler configuration causing "Another scheduler instance is running for this host". Users should be able to report tasks. We will update as soon as we begin creating new workunits, as we are still working to stand up the rest of the BOINC backend architecture. The website went down briefly as we brought the scheduler online. We have adjusted the HAProxy configuration, and we will continue to adjust the Apache/HAProxy config if we see the website stop responding again. We are still debugging issues with the new Kafka-based validation workflow, which works together with HAProxy routing rules to partition BOINC downloads and uploads by assigning servers equal hex buckets using the directory hierarchy BOINC expects (https://github.com/BOINC/boinc/wiki/DirHierarchy), and emitting events from the new file_upload_handler we wrote to Kafka so we can batch and respond to them in parallel. This removes the need for multiple round trips to the database for row-wise operations and polling, which are now simply batch applications of state after consuming workunits ready for validation in the relevant Kafka topic for that application. This allows us to perform validation and assimilation in the same process, at least for the projects we run ourselves (MCM1, MAM1, ARP1), and while the Kafka/Redpanda learning curve was significant, we have successfully transitioned to an event-driven, in-memory, partitioned architecture that should let us keep pace with the upcoming GPU-enabled MAM1 application. ID: 117025 · Reply Quote
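To make the "equal hex buckets" idea in the update above a little more concrete, here is a rough sketch of how filenames could be hashed into hex-named buckets and the buckets split evenly across servers. Everything in it is an assumption for illustration: the fanout of 1024 (subdirectories named 0 to 3ff) follows the BOINC DirHierarchy wiki page, but the hash below is not guaranteed to match BOINC's own, and `bucket_for`, `server_for`, and `route` are hypothetical names, not WCG's or HAProxy's actual rules.

```python
import hashlib
from collections import Counter

FANOUT = 1024  # assumed fanout: buckets 0..1023, i.e. hex names "0".."3ff"


def bucket_for(filename, fanout=FANOUT):
    """Map a filename to a bucket index (illustrative hash, not BOINC's exact one)."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % fanout


def server_for(bucket, n_servers, fanout=FANOUT):
    """Give each server one contiguous, near-equal range of buckets."""
    return bucket * n_servers // fanout


def route(filename, n_servers):
    """Return (hex bucket name, server index) for a file."""
    bucket = bucket_for(filename)
    return format(bucket, "x"), server_for(bucket, n_servers)


# With 4 servers, each one owns exactly 256 of the 1024 buckets.
per_server = Counter(server_for(b, 4) for b in range(FANOUT))
print(per_server)
print(route("arp1_00_v02.png", 4))
```

Because the bucket depends only on the filename, every request for the same file lands on the same server, which is presumably the property a proxy needs in order to partition uploads and downloads this way.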

Dave Help desk expert Send message Joined: 28 Jun 10 Posts: 3392	Message 117032 - Posted: 8 Oct 2025, 6:31:57 UTC I can report that my one completed task from my Android phone has been reported. ID: 117032 · Reply Quote

robsmith Volunteer tester Help desk expert Send message Joined: 25 May 09 Posts: 1460	Message 117033 - Posted: 8 Oct 2025, 7:12:49 UTC Reported the last couple of outstanding tasks and got offered a whole pile of new ones from various projects including ARP, MCM1 & MAM1. All have failed to download with:
08/10/2025 08:02:24 | | [http_xfer] [ID#0] HTTP: wrote 16384 bytes
08/10/2025 08:02:24 | | [http_xfer] [ID#0] HTTP: wrote 13883 bytes
08/10/2025 08:02:25 | | Internet access OK - project servers may be temporarily down.
08/10/2025 08:02:44 | World Community Grid | Started download of arp1_00_v02.png
08/10/2025 08:02:45 | | [http_xfer] [ID#51] HTTP: wrote 294 bytes
08/10/2025 08:02:45 | World Community Grid | Giving up on download of arp1_00_v02.png: permanent HTTP error
Typical for ARP1. Now the servers are reporting no tasks available for a large number of the projects. One step forward, two steps back :-( ID: 117033 · Reply Quote

Grumpy Swede Send message Joined: 30 Mar 20 Posts: 741	Message 117034 - Posted: 8 Oct 2025, 7:30:29 UTC - in response to Message 117033. Last modified: 8 Oct 2025, 7:35:30 UTC In reply to robsmith's message of 8 Oct 2025: "Reported the last couple of outstanding tasks and got offered a whole pile of new ones from various projects including ARP, MCM1 & MAM1. All have failed to download with:
08/10/2025 08:02:24 | | [http_xfer] [ID#0] HTTP: wrote 16384 bytes
08/10/2025 08:02:24 | | [http_xfer] [ID#0] HTTP: wrote 13883 bytes
08/10/2025 08:02:25 | | Internet access OK - project servers may be temporarily down.
08/10/2025 08:02:44 | World Community Grid | Started download of arp1_00_v02.png
08/10/2025 08:02:45 | | [http_xfer] [ID#51] HTTP: wrote 294 bytes
08/10/2025 08:02:45 | World Community Grid | Giving up on download of arp1_00_v02.png: permanent HTTP error
Typical for ARP1. Now the servers are reporting no tasks available for a large number of the projects. One step forward, two steps back :-("
The only things you were served, Rob, were the 43 .png files that are downloaded after a big outage. They are not work tasks, but .PNG picture files. At the moment, they are failing to download for everyone. Every one of them fails with a permanent HTTP error, and if you restart BOINC, the same files will be sent again, and fail. The team will work on that issue tomorrow (today). If you look in your BOINC projects folder for WCG, you will find 43 of those 0-byte .PNG picture files that failed to download. The WCG team hasn't started to send out any new work yet, or even to validate the uploaded and reported tasks. ID: 117034 · Reply Quote
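To check for the leftover 0-byte .PNG files described above without clicking through the folder, something like the following should do. It is only a sketch: `zero_byte_pngs` is a hypothetical helper, and the demo runs against a temporary directory, since the location of the WCG projects folder depends on your BOINC installation.

```python
import tempfile
from pathlib import Path


def zero_byte_pngs(project_dir):
    """Return sorted names of empty .png/.PNG files directly inside project_dir."""
    return sorted(
        p.name
        for p in Path(project_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".png" and p.stat().st_size == 0
    )


# Demo on a throwaway folder: two failed (empty) downloads and one good file.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "arp1_00_v02.png").touch()
    (Path(demo_dir) / "empty_b.PNG").touch()
    (Path(demo_dir) / "complete.png").write_bytes(b"not empty")
    demo_result = zero_byte_pngs(demo_dir)

print(demo_result)
```

To use it for real, pass the path of your own WCG project folder in place of the temporary directory.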

Copyright © 2026 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.