Thread 'Anything and Everything to do with (WCG) World Community Grid'

Marco Besozzi
New member

Joined: 3 Oct 25
Posts: 1
Italy
Message 116994 - Posted: 3 Oct 2025, 4:08:46 UTC - in response to Message 116986.  

Since IBM moved to the Krembil Research Institute, it's been one disaster after another. For two years, I've had 28 CPUs processing data for Folding@home, and I've never had a single outage. The research projects there are very diverse (Alzheimer's, Covid-19, malaria, kidney cancer, epigenetics), unlike WCG's current, very limited focus. A huge, unbridgeable gap. I think I'll soon cease collaborating with WCG.
ID: 116994
robsmith
Volunteer tester
Help desk expert
Joined: 25 May 09
Posts: 1374
United Kingdom
Message 116995 - Posted: 3 Oct 2025, 7:00:15 UTC - in response to Message 116994.  

Since IBM moved to the Krembil Research Institute

Actually, WCG moved from IBM to Krembil. The process was an absolute, but not unexpected, shambles: many delays were encountered in moving the data to its new home and in rewriting a large proportion of the server software, both probably hampered further by less-than-ideal documentation. With hindsight, I think they should have ported the data into a "proper" BOINC database structure running under native SQL, and at the same time used standard BOINC server applications for everything, rather than the strange mongrel they have now...
Now they've moved that mongrel to another set of servers and are hitting similar issues. Oh, what a surprise!
ID: 116995
Grumpy Swede
Joined: 30 Mar 20
Posts: 566
Sweden
Message 116996 - Posted: 3 Oct 2025, 9:53:26 UTC
Last modified: 3 Oct 2025, 10:24:09 UTC

Some progress, although with an error message. I haven't tried with BOINC yet, but:

https://scheduler.worldcommunitygrid.org/boinc/wcg_cgi/fcgi

Now responds with:

<scheduler_reply>
<scheduler_version>701</scheduler_version>
<master_url>http://www.worldcommunitygrid.org/</master_url>
<request_delay>121.200000</request_delay>
<message priority="low">Error in request message: xp.get_tag() failed </message>
<project_name>World Community Grid</project_name>
</scheduler_reply>

Edit: Maybe the "xp.get_tag() failed" message is because the test request comes from my browser, and not from BOINC.
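For anyone who wants to check the reply without reading raw XML, here is a minimal sketch that pulls out the useful fields. It assumes the reply is well-formed XML like the one pasted above; the function name summarize_reply is made up for this example, and a real BOINC client does much more than this.

```python
import xml.etree.ElementTree as ET

# Reply text copied from the post above.
REPLY = """<scheduler_reply>
<scheduler_version>701</scheduler_version>
<master_url>http://www.worldcommunitygrid.org/</master_url>
<request_delay>121.200000</request_delay>
<message priority="low">Error in request message: xp.get_tag() failed </message>
<project_name>World Community Grid</project_name>
</scheduler_reply>"""


def summarize_reply(xml_text):
    """Return the fields of a scheduler reply that matter for troubleshooting."""
    root = ET.fromstring(xml_text)
    return {
        "project": root.findtext("project_name"),
        "scheduler_version": int(root.findtext("scheduler_version")),
        # Seconds the server asks the client to wait before the next request.
        "request_delay": float(root.findtext("request_delay")),
        # Each <message> carries a priority attribute and free text.
        "messages": [
            (m.get("priority"), (m.text or "").strip())
            for m in root.findall("message")
        ],
    }


summary = summarize_reply(REPLY)
print(summary)
```

The request_delay value here lines up with the "Project requested delay of 121 seconds" line that BOINC prints in the event log.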

Edit: Uploading works, but reporting and asking for new work gives the following error message in BOINC:
World Community Grid 2025-10-03 11:50:07 Another scheduler instance is running for this host

We've seen that before, and I guess that is pretty easy to fix.
ID: 116996
Jean-David
Joined: 19 Dec 05
Posts: 111
United States
Message 116999 - Posted: 3 Oct 2025, 12:50:41 UTC - in response to Message 116993.  

I am getting this now...

Fri 03 Oct 2025 08:45:36 AM EDT | World Community Grid | Sending scheduler request: To fetch work.
Fri 03 Oct 2025 08:45:36 AM EDT | World Community Grid | Requesting new tasks for CPU
Fri 03 Oct 2025 08:45:38 AM EDT | World Community Grid | Scheduler request completed: got 0 new tasks
Fri 03 Oct 2025 08:45:38 AM EDT | World Community Grid | Another scheduler instance is running for this host
Fri 03 Oct 2025 08:45:38 AM EDT | World Community Grid | Project requested delay of 121 seconds
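If you want to watch for this in a saved event log, a rough sketch follows. It assumes the log lines have the "timestamp | project | message" shape shown above; the helper names parse_line and requested_delay are invented for this example and are not part of BOINC.

```python
import re

# Event log lines copied from the post above.
LOG = """\
Fri 03 Oct 2025 08:45:36 AM EDT | World Community Grid | Sending scheduler request: To fetch work.
Fri 03 Oct 2025 08:45:36 AM EDT | World Community Grid | Requesting new tasks for CPU
Fri 03 Oct 2025 08:45:38 AM EDT | World Community Grid | Scheduler request completed: got 0 new tasks
Fri 03 Oct 2025 08:45:38 AM EDT | World Community Grid | Another scheduler instance is running for this host
Fri 03 Oct 2025 08:45:38 AM EDT | World Community Grid | Project requested delay of 121 seconds
"""

DELAY = re.compile(r"Project requested delay of (\d+(?:\.\d+)?) seconds")


def parse_line(line):
    """Split one log line into (timestamp, project, message), or None if malformed."""
    parts = [p.strip() for p in line.split("|", 2)]
    return tuple(parts) if len(parts) == 3 else None


def requested_delay(log_text):
    """Return the first delay (in seconds) the project asked for, or None."""
    for line in log_text.splitlines():
        parsed = parse_line(line)
        if parsed is None:
            continue
        match = DELAY.search(parsed[2])
        if match:
            return float(match.group(1))
    return None


entries = [e for e in map(parse_line, LOG.splitlines()) if e]
blocked = [e for e in entries if "Another scheduler instance" in e[2]]
print(requested_delay(LOG), len(blocked))
```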

ID: 116999
robsmith
Volunteer tester
Help desk expert
Joined: 25 May 09
Posts: 1374
United Kingdom
Message 117000 - Posted: 3 Oct 2025, 13:22:00 UTC - in response to Message 116999.  

Fri 03 Oct 2025 08:45:38 AM EDT | World Community Grid | Another scheduler instance is running for this host


A lot of us have been getting this message for a good few hours.... Have they got a double entry somewhere in the highly convoluted mass of mongrel scripts on the WCG servers?
In short, there's nothing we can do. (Unless of course one is highly skilled in the incantations required to de-mongrelise the servers and scripts....)
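For context, my understanding (an assumption, not something confirmed in this thread) is that a BOINC scheduler takes a per-host lock so that two requests from the same host cannot be handled at once, and returns this message when the lock is already held. The sketch below shows that general pattern on Linux using flock; the lock file naming and the function names are hypothetical and say nothing about how the WCG servers are actually set up.

```python
import fcntl
import os
import tempfile


def try_lock_host(lock_dir, host_id):
    """Try to take an exclusive, non-blocking lock for one host.

    Returns an open file descriptor on success, or None if another
    holder already has the lock.
    """
    path = os.path.join(lock_dir, f"host_{host_id}.lock")  # hypothetical naming
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def handle_request(lock_dir, host_id):
    """Handle one (pretend) scheduler request for a host."""
    fd = try_lock_host(lock_dir, host_id)
    if fd is None:
        return "Another scheduler instance is running for this host"
    try:
        return "request handled"
    finally:
        os.close(fd)  # closing the descriptor releases the lock


lock_dir = tempfile.mkdtemp()
first = try_lock_host(lock_dir, 12345)      # simulate a request still in progress
blocked = handle_request(lock_dir, 12345)   # second request while the lock is held
os.close(first)                             # first request finishes
ok = handle_request(lock_dir, 12345)        # now the lock is free
print(blocked)
print(ok)
```

If a lock like this is never released, or every request ends up contending for the same lock, every client would see the message on every request, which would match what we are all getting. That is speculation on the mechanism only.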
ID: 117000
Grumpy Swede
Joined: 30 Mar 20
Posts: 566
Sweden
Message 117001 - Posted: 3 Oct 2025, 17:21:33 UTC - in response to Message 117000.  

In reply to robsmith's message of 3 Oct 2025:
Fri 03 Oct 2025 08:45:38 AM EDT | World Community Grid | Another scheduler instance is running for this host


A lot of us have been getting this message for a good few hours.... Have they got a double entry somewhere in the highly convoluted mass of mongrel scripts on the WCG servers?
In short, there's nothing we can do. (Unless of course one is highly skilled in the incantations required to de-mongrelise the servers and scripts....)

That particular issue has been seen before, even when all the systems were up and running.

See this post on the WCG forum: https://www.worldcommunitygrid.org/forums/wcg/viewthread_thread,47541_offset,60#706287
ID: 117001
Dr Who Fan
Joined: 10 May 07
Posts: 1640
United States
Message 117002 - Posted: 3 Oct 2025, 23:29:21 UTC

Looks like another weekend without any WCG crunching and nothing new from Jurisica since this morning.

On a side note, how soon until the snow starts flying in Toronto?
ID: 117002
Grumpy Swede
Joined: 30 Mar 20
Posts: 566
Sweden
Message 117003 - Posted: 4 Oct 2025, 2:43:10 UTC

New update:

October 3, 2025
We are aware of the issue with the scheduler returning "Another scheduler instance is running for this host" and have identified the cause in the config.xml template we adapted for the new containerized environment. We will fix it once we have confirmed that the new event-driven validation and assimilation pipelines are working correctly.

Uploads are being processed normally. We've confirmed that the new architecture for the containerized file_upload_handler pool behind Apache is correctly producing to the per-application Kafka (Redpanda) topics, storing the event and result data in separate queues on the local brokers partition.
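To picture what "producing to per-application topics" with separate event and result queues might look like, here is a toy, in-memory sketch. It does not use Kafka or Redpanda at all; the InMemoryBroker class, the topic names such as "mcm1.events", and the result names are all hypothetical stand-ins, not WCG's actual design.

```python
from collections import defaultdict, deque


class InMemoryBroker:
    """Toy stand-in for a message broker: one FIFO queue per topic."""

    def __init__(self):
        self.topics = defaultdict(deque)

    def produce(self, topic, record):
        self.topics[topic].append(record)

    def consume_batch(self, topic, max_records):
        """Pop up to max_records records from a topic, oldest first."""
        queue = self.topics[topic]
        batch = []
        while queue and len(batch) < max_records:
            batch.append(queue.popleft())
        return batch


def on_upload(broker, app, result_name, payload):
    """What an upload handler might do once a file has landed:
    emit a small event and the result data to separate per-app topics."""
    broker.produce(f"{app}.events", {"result": result_name, "bytes": len(payload)})
    broker.produce(f"{app}.results", {"result": result_name, "data": payload})


broker = InMemoryBroker()
for i in range(3):
    on_upload(broker, "mcm1", f"MCM1_result_{i}", b"x" * (i + 1))
on_upload(broker, "arp1", "ARP1_result_0", b"abc")

# A validator for one application consumes its events in batches.
batch = broker.consume_batch("mcm1.events", max_records=2)
print([event["result"] for event in batch])
print(len(broker.topics["mcm1.events"]))
```

Keeping one topic per application means a backlog in one project's uploads does not hold up consumers for another.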

As a result, there will be at least one more weekend sprint. Tentatively, we expect to be producing new workunits next week for MCM1, ARP1, and MAM1 beta version 7.07. Validations should resume over the weekend, and initial releases of batches will be intermittent.
ID: 117003
Dave
Help desk expert
Joined: 28 Jun 10
Posts: 3010
United Kingdom
Message 117007 - Posted: 5 Oct 2025, 13:27:12 UTC - in response to Message 117003.  

As a result, there will be at least one more weekend sprint. Tentatively, we expect to be producing new workunits next week for MCM1, ARP1, and MAM1 beta version 7.07. Validations should resume over the weekend, and initial releases of batches will be intermittent.

Since I last checked, my event log does say "Scheduler request completed"; however, I still get the "Another scheduler instance is running" message, and my completed and aborted-due-to-time tasks are still not reported. Not too bothered, as this is on my phone and I have work from my main project to keep the desktop going for a few weeks at current estimates.
ID: 117007
Grumpy Swede
Joined: 30 Mar 20
Posts: 566
Sweden
Message 117011 - Posted: 6 Oct 2025, 11:02:54 UTC

Today is the big day. (Or not). Maybe tomorrow (Or not). Whenever, is OK with me.
ID: 117011
Dave
Help desk expert
Joined: 28 Jun 10
Posts: 3010
United Kingdom
Message 117013 - Posted: 6 Oct 2025, 13:47:09 UTC - in response to Message 117011.  

In reply to Grumpy Swede's message of 6 Oct 2025:
Today is the big day. (Or not). Maybe tomorrow (Or not). Whenever, is OK with me.
Not the big day so far, looking at my Android.
ID: 117013
Dr Who Fan
Joined: 10 May 07
Posts: 1640
United States
Message 117014 - Posted: 7 Oct 2025, 0:38:55 UTC

Almost 8:40 PM in Toronto and still no BOINC connection @ WCG. No new news from Jurisica since Oct 3 (3 days ago).

What's another day, week, month, year, decade or century?
ID: 117014
Grumpy Swede
Joined: 30 Mar 20
Posts: 566
Sweden
Message 117015 - Posted: 7 Oct 2025, 8:45:53 UTC

Monday was not the big day. Let's see if Tuesday is. There's no hurry, we still have Christmas Eve as an option.
ID: 117015
Bill Freauff
Joined: 26 Mar 11
Posts: 216
United States
Message 117016 - Posted: 7 Oct 2025, 9:12:45 UTC - in response to Message 117015.  

In reply to Grumpy Swede's message of 7 Oct 2025:
Monday was not the big day. Let's see if Tuesday is. There's no hurry, we still have Christmas Eve as an option.


Christmas 2026 will be an even numbered year.
ID: 117016
Grumpy Swede
Joined: 30 Mar 20
Posts: 566
Sweden
Message 117018 - Posted: 7 Oct 2025, 17:02:28 UTC

Another deadline extension for tasks not yet uploaded and reported has happened.
ID: 117018
Grumpy Swede
Joined: 30 Mar 20
Posts: 566
Sweden
Message 117024 - Posted: 8 Oct 2025, 1:26:59 UTC
Last modified: 8 Oct 2025, 2:01:16 UTC

The BOINC system is up. It started to come back with correct replies at 2025-10-08 02:26:08 (UTC+2). No new tasks yet, and I haven't tried to report my hundreds of tasks on another computer yet.

But WCG is/was still having issues. Especially the website, which does/did not answer in time at all. So, the website is/was basically dead in the water when I first tried. I'm not surprised of course, since there are many thousands of computers banging on the servers at the same time now. All of them trying to upload and report, and request tasks.

There are also permission issues when trying to post on the forum now, and in other places too.

Example for https://www.worldcommunitygrid.org/forums/wcg/addpostprocess

403
Forbidden
You don't have permission to access this resource.

Also, the same permission issue with the "contact" link.

But there's light at the end of the tunnel.

Edit, added:
The initial download of the 43 .png files goes immediately to "permanent HTTP error", and leaves 43 0-byte .png files in the BOINC WCG projects folder.

I have mailed Igor Jurisica about these problems.
ID: 117024
Grumpy Swede
Joined: 30 Mar 20
Posts: 566
Sweden
Message 117025 - Posted: 8 Oct 2025, 3:19:58 UTC

New WCG update from Igor:

October 7, 2025

We have resolved the issue with the BOINC scheduler configuration causing "Another scheduler instance is running for this host". Users should be able to report tasks. We will update as soon as we begin creating new workunits as we are still working to stand up the rest of the BOINC backend architecture.

The website went down briefly as we brought the scheduler online. We have adjusted the HAProxy configuration, and we will continue to adjust the Apache/HAProxy config if we see the website stop responding again.

We are still debugging issues with the new Kafka-based validation workflow. It works together with HAProxy routing rules that partition BOINC downloads and uploads by assigning servers equal hex buckets, using the directory hierarchy BOINC expects (https://github.com/BOINC/boinc/wiki/DirHierarchy), and the new file_upload_handler we wrote emits events to Kafka so we can batch and respond to them in parallel. This removes the need for multiple round trips to the database for row-wise operations and polling, which are now simply batch applications of state after consuming workunits ready for validation in the relevant Kafka topic for that application. This allows us to perform validation and assimilation in the same process, at least for the projects we run ourselves (MCM1, MAM1, ARP1). While the Kafka/Redpanda learning curve was significant, we have successfully transitioned to an event-driven, in-memory, partitioned architecture that should let us keep pace with the upcoming GPU-enabled MAM1 application.
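As a rough illustration of the "equal hex buckets" idea: BOINC's directory hierarchy hashes each filename into one of a fixed number of subdirectories named in hex, and those buckets can then be split evenly across servers. The sketch below is an assumption-laden reading of that scheme. The hash details (MD5, seven hex digits starting at offset 1, fan-out of 1024) are my recollection of BOINC's code and should be checked against the wiki page linked above; the server names and the server_for function are entirely hypothetical.

```python
import hashlib
from collections import Counter

FANOUT = 1024  # assumed fan-out; BOINC projects configure this as uldl_dir_fanout


def bucket_for(filename, fanout=FANOUT):
    """Map a filename to a bucket number (assumed BOINC-style hash)."""
    digest = hashlib.md5(filename.encode()).hexdigest()
    return int(digest[1:8], 16) % fanout


def hier_path(root, filename, fanout=FANOUT):
    """Path of a file inside the hierarchy; the subdirectory is the bucket in hex."""
    return f"{root}/{bucket_for(filename, fanout):x}/{filename}"


def server_for(bucket, servers, fanout=FANOUT):
    """Give each server an equal, contiguous range of buckets."""
    return servers[bucket * len(servers) // fanout]


servers = ["dl-a", "dl-b", "dl-c", "dl-d"]  # hypothetical server names
name = "arp1_00_v02.png"                    # filename taken from the logs in this thread
bucket = bucket_for(name)
print(hier_path("download", name), "->", server_for(bucket, servers))

# Every server ends up responsible for the same number of buckets.
share = Counter(server_for(b, servers) for b in range(FANOUT))
print(share)
```

Because the bucket comes from the filename alone, a proxy can route a request to the right server without a database lookup, which fits the routing-rule approach the update describes.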
ID: 117025
Dave
Help desk expert
Joined: 28 Jun 10
Posts: 3010
United Kingdom
Message 117032 - Posted: 8 Oct 2025, 6:31:57 UTC

I can report that my one completed task from my Android phone has now been reported.
ID: 117032
robsmith
Volunteer tester
Help desk expert
Joined: 25 May 09
Posts: 1374
United Kingdom
Message 117033 - Posted: 8 Oct 2025, 7:12:49 UTC

Reported the last couple of outstanding tasks and got offered a whole pile of new ones from various projects including ARP, MCM1 & MAM1. All have failed to download with:
08/10/2025 08:02:24 | | [http_xfer] [ID#0] HTTP: wrote 16384 bytes
08/10/2025 08:02:24 | | [http_xfer] [ID#0] HTTP: wrote 13883 bytes
08/10/2025 08:02:25 | | Internet access OK - project servers may be temporarily down.
08/10/2025 08:02:44 | World Community Grid | Started download of arp1_00_v02.png
08/10/2025 08:02:45 | | [http_xfer] [ID#51] HTTP: wrote 294 bytes
08/10/2025 08:02:45 | World Community Grid | Giving up on download of arp1_00_v02.png: permanent HTTP error

Typical for ARP1.
Now the servers are reporting no tasks available for a large number of the projects.

One step forward, two steps back :-(
ID: 117033
Grumpy Swede
Joined: 30 Mar 20
Posts: 566
Sweden
Message 117034 - Posted: 8 Oct 2025, 7:30:29 UTC - in response to Message 117033.  
Last modified: 8 Oct 2025, 7:35:30 UTC

In reply to robsmith's message of 8 Oct 2025:
Reported the last couple of outstanding tasks and got offered a whole pile of new ones from various projects including ARP, MCM1 & MAM1. All have failed to download with:
08/10/2025 08:02:24 | | [http_xfer] [ID#0] HTTP: wrote 16384 bytes
08/10/2025 08:02:24 | | [http_xfer] [ID#0] HTTP: wrote 13883 bytes
08/10/2025 08:02:25 | | Internet access OK - project servers may be temporarily down.
08/10/2025 08:02:44 | World Community Grid | Started download of arp1_00_v02.png
08/10/2025 08:02:45 | | [http_xfer] [ID#51] HTTP: wrote 294 bytes
08/10/2025 08:02:45 | World Community Grid | Giving up on download of arp1_00_v02.png: permanent HTTP error

Typical for ARP1.
Now the servers are reporting no tasks available for a large number of the projects.

One step forward, two steps back :-(
The only things you were served, Rob, were the 43 .png files that are downloaded after a big outage. They are not work tasks, but .png picture files. At the moment, they are failing to download for everyone. Every one of them fails with a permanent HTTP error, and if you restart BOINC, the same downloads will be attempted again, and fail. The team will work on that issue tomorrow (today). If you look in your BOINC projects folder for WCG, you will find 43 of those 0-byte .png picture files that failed to download.
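If you'd rather not count them by hand, here is a small sketch that lists 0-byte .png files in a folder. The function name empty_pngs is made up, and the demo runs against a temporary folder with stand-in files; point it at your own WCG projects folder, whose location depends on your OS and BOINC install.

```python
import tempfile
from pathlib import Path


def empty_pngs(project_dir):
    """Return the 0-byte .png files directly inside project_dir, sorted by name."""
    folder = Path(project_dir)
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".png" and p.stat().st_size == 0
    )


# Demo against a temporary folder so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "arp1_00_v02.png").touch()              # failed download: 0 bytes
    (Path(tmp) / "good.png").write_bytes(b"\x89PNG")     # stand-in for a real image
    (Path(tmp) / "notes.txt").touch()                    # empty, but not a .png
    found = [p.name for p in empty_pngs(tmp)]

print(found)
```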

The WCG team hasn't started sending out any new work yet, or even validating the uploaded and reported tasks.
ID: 117034

Copyright © 2025 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.