Message boards : Projects : Anything and Everything to do with (WCG) World Community Grid
Author | Message |
---|---|
New member Joined: 3 Oct 25 Posts: 1 |
Since IBM moved to the Krembil Research Institute, it's been one disaster after another. For two years, I've had 28 CPUs processing data for Folding@home, and I've never had a single outage. The research projects are very diverse (Alzheimer's, COVID-19, malaria, kidney cancer, epigenetics), unlike WCG's current, very limited focus. A huge, unbridgeable gap. I think I'll soon cease collaborating with WCG. |
Joined: 25 May 09 Posts: 1374 |
"Since IBM moved to the Krembil Research Institute" Actually, WCG moved from IBM to Krembil. The move was an absolute, but not unexpected, shambles: there were many delays in moving the data to its new home and in rewriting a large proportion of the server software, both probably hampered further by less-than-ideal documentation. With hindsight, I think they should have ported the data into a "proper" BOINC database structure running under native SQL and, at the same time, used standard BOINC server applications for everything, rather than the strange mongrel they have now... Now they've moved that mongrel to another set of servers and are hitting similar issues. Oh, what a surprise! |
Joined: 30 Mar 20 Posts: 566 |
Some progress, although with an error message. I haven't tried with BOINC yet, but https://scheduler.worldcommunitygrid.org/boinc/wcg_cgi/fcgi now responds with: <scheduler_reply> <scheduler_version>701</scheduler_version> <master_url>http://www.worldcommunitygrid.org/</master_url> <request_delay>121.200000</request_delay> <message priority="low">Error in request message: xp.get_tag() failed </message> <project_name>World Community Grid</project_name> </scheduler_reply> Edit: Maybe the "xp.get_tag() failed" message appears because the test request comes from my browser and not from BOINC. Edit: Uploading works, but reporting and asking for new work gives the following error message in BOINC: World Community Grid 2025-10-03 11:50:07 Another scheduler instance is running for this host. We've seen that before, and I guess it is pretty easy to fix. |
Joined: 19 Dec 05 Posts: 111 |
I am getting this now... Fri 03 Oct 2025 08:45:36 AM EDT | World Community Grid | Sending scheduler request: To fetch work. Fri 03 Oct 2025 08:45:36 AM EDT | World Community Grid | Requesting new tasks for CPU Fri 03 Oct 2025 08:45:38 AM EDT | World Community Grid | Scheduler request completed: got 0 new tasks Fri 03 Oct 2025 08:45:38 AM EDT | World Community Grid | Another scheduler instance is running for this host Fri 03 Oct 2025 08:45:38 AM EDT | World Community Grid | Project requested delay of 121 seconds |
Joined: 25 May 09 Posts: 1374 |
"Fri 03 Oct 2025 08:45:38 AM EDT | World Community Grid | Another scheduler instance is running for this host" A lot of us have been getting this message for a good few hours... Have they got a double entry somewhere in the highly convoluted mass of mongrel scripts on the WCG servers? In short, there's nothing we can do. (Unless, of course, one is highly skilled in the incantations required to de-mongrelise the servers and scripts...) |
Joined: 30 Mar 20 Posts: 566 |
In reply to robsmith's message of 3 Oct 2025: "Fri 03 Oct 2025 08:45:38 AM EDT | World Community Grid | Another scheduler instance is running for this host" That particular issue has been seen before, even when all the systems were up and running. See this post on the WCG forum: https://www.worldcommunitygrid.org/forums/wcg/viewthread_thread,47541_offset,60#706287 |
Joined: 10 May 07 Posts: 1640 |
Looks like another weekend without any WCG crunching, and nothing new from Jurisica since this morning. On a side note, how soon until the snow starts flying in Toronto? |
Joined: 30 Mar 20 Posts: 566 |
New update: October 3, 2025 We are aware of the issue with the scheduler returning "Another scheduler instance is running for this host" and have identified the cause in the config.xml template we adapted for the new containerized environment. We will fix it once we have confirmed that the new event-driven validation and assimilation pipelines are working correctly. Uploads are being processed normally; we've confirmed the new architecture for the containerized file_upload_handler pool behind Apache is correctly producing to the per-application Kafka (Redpanda) topics, storing the event and result data in separate queues on the local brokers partition. As a result, there will be at least one more weekend sprint. Tentatively, we expect to be producing new workunits next week for MCM1, ARP1, and MAM1 beta version 7.07; validations should resume over the weekend, and initial releases of batches will be intermittent. |
Joined: 28 Jun 10 Posts: 3010 |
"As a result, there will be at least one more weekend sprint. Tentatively, we expect to be producing new workunits next week for MCM1, ARP1, and MAM1 beta version 7.07; validations should resume over the weekend, and initial releases of batches will be intermittent." Since I last checked, my event log does say "scheduler request completed"; however, I still get the "Another scheduler instance is running" message, and my completed and aborted-due-to-time tasks are still not reported. Not too bothered, as this is on my phone and I have enough work from my main project to keep the desktop going for a few weeks at current estimates. |
Joined: 30 Mar 20 Posts: 566 |
Today is the big day. (Or not). Maybe tomorrow (Or not). Whenever, is OK with me. |
Joined: 28 Jun 10 Posts: 3010 |
In reply to Grumpy Swede's message of 6 Oct 2025: "Today is the big day. (Or not). Maybe tomorrow (Or not). Whenever, is OK with me." Not the big day so far, looking at my Android. |
Joined: 10 May 07 Posts: 1640 |
Almost 8:40 PM in Toronto and still no BOINC connection @ WCG. No new news from Jurisica since Oct 3 (3 days ago). What's another day, week, month, year, decade or century? |
Joined: 30 Mar 20 Posts: 566 |
Monday was not the big day. Let's see if Tuesday is. There's no hurry, we still have Christmas Eve as an option. |
Joined: 26 Mar 11 Posts: 216 |
In reply to Grumpy Swede's message of 7 Oct 2025: "Monday was not the big day. Let's see if Tuesday is. There's no hurry, we still have Christmas Eve as an option." Christmas 2026 will be an even-numbered year. |
Joined: 30 Mar 20 Posts: 566 |
Another deadline extension has happened for tasks that have not yet been uploaded and reported. |
Joined: 30 Mar 20 Posts: 566 |
The BOINC system is up. It started to come back with correct replies at 2025-10-08 02:26:08 (UTC+2). No new tasks yet, and I haven't yet tried to report my hundreds of tasks on another computer. But WCG is/was still having issues, especially the website, which does/did not answer in time at all. So the website is/was basically dead in the water when I first tried. I'm not surprised, of course, since there are many thousands of computers banging on the servers at the same time now, all of them trying to upload, report, and request tasks. There are also permission issues when trying to post on the forum now, and in other places too. Example for https://www.worldcommunitygrid.org/forums/wcg/addpostprocess: 403 Forbidden You don't have permission to access this resource. The "contact" link has the same permission issue. But there's light at the end of the tunnel. Edit, added: The initial download of the 43 .png files goes immediately to "permanent HTTP error" and leaves 43 0-byte .png files in the BOINC WCG projects folder. I have mailed Igor Jurisica about these problems. |
Joined: 30 Mar 20 Posts: 566 |
New WCG update from Igor: October 7, 2025 We have resolved the issue with the BOINC scheduler configuration causing "Another scheduler instance is running for this host". Users should be able to report tasks. We will update as soon as we begin creating new workunits, as we are still working to stand up the rest of the BOINC backend architecture. Website went down briefly as we brought the scheduler online. We have adjusted the HAProxy configuration, and we will continue to adjust Apache/HAProxy config if we see the website stops responding again. Still debugging issues with the new Kafka-based validation workflow that works together with HAProxy routing rules to partition BOINC downloads and uploads by assigning servers equal hex buckets using the directory hierarchy (https://github.com/BOINC/boinc/wiki/DirHierarchy) BOINC expects, and emitting events from the new file_upload_handler we wrote to Kafka so we can batch and respond to them in parallel. This removes the need for multiple round trips to the database for row-wise operations and polling, which are now simply batch applications of state after consuming workunits ready for validation in the relevant Kafka topic for that application. This allows us to perform validation and assimilation in the same process, at least for the projects we run ourselves (MCM1, MAM1, ARP1), and while the Kafka/Redpanda learning curve was significant, we have successfully transitioned to an event-driven in-memory partitioned architecture that should let us keep pace with the upcoming GPU-enabled MAM1 application. |
Joined: 28 Jun 10 Posts: 3010 |
I can report that my one completed task from my Android phone has been reported. |
Joined: 25 May 09 Posts: 1374 |
Reported the last couple of outstanding tasks and got offered a whole pile of new ones from various projects including ARP, MCM1 & MAM1. All have failed to download with: 08/10/2025 08:02:24 | | [http_xfer] [ID#0] HTTP: wrote 16384 bytes Typical for ARP1. Now the servers are reporting no tasks available for a large number of the projects. One step forward, two steps back :-( |
Joined: 30 Mar 20 Posts: 566 |
In reply to robsmith's message of 8 Oct 2025: "Reported the last couple of outstanding tasks and got offered a whole pile of new ones from various projects including ARP, MCM1 & MAM1. All have failed to download with:" The only things you were served, Rob, were the 43 .png files that are downloaded after a big outage. They are not work tasks, but .PNG picture files. At the moment, they are failing to download for everyone. Every one of them fails with a permanent HTTP error, and if you restart BOINC, the same files will be sent again and fail again. The team will work on that issue tomorrow (today). If you look in your BOINC projects folder for WCG, you will find 43 of those 0-byte .PNG picture files that failed to download. The WCG team hasn't started sending out any new work yet, or even validating the uploaded and reported tasks. |
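
For anyone who wants to check their own machine for the 0-byte .png files described in the last post, here is a minimal Python sketch. It is not a BOINC or WCG tool; the function name `find_empty_pngs` is made up for illustration, and you have to supply the path to your own WCG projects folder, since its location depends on your operating system and BOINC installation. It only lists files and does not delete anything.

```python
from pathlib import Path


def find_empty_pngs(project_dir):
    """Return the zero-byte .png files found directly inside project_dir.

    project_dir is whatever folder you pass in (an assumption: your own
    BOINC projects folder for WCG). Subfolders are not searched.
    """
    root = Path(project_dir)
    return sorted(
        p
        for p in root.iterdir()
        # Match .png and .PNG alike, and keep only files with no content.
        if p.is_file() and p.suffix.lower() == ".png" and p.stat().st_size == 0
    )
```

Calling `find_empty_pngs("<your WCG projects folder>")` and printing the result should show the failed downloads, if any are present. Whether removing them by hand is safe is a separate question that this sketch does not answer.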
Copyright © 2025 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.