Thread 'World Community Grid has announced an extended outage from Feb 14 to April 22, 2022'

Message boards : Projects : World Community Grid has announced an extended outage from Feb 14 to April 22, 2022
Grumpy Swede
Joined: 30 Mar 20
Posts: 416
Sweden
Message 110426 - Posted: 12 Nov 2022, 14:07:25 UTC - in response to Message 110424.  

Website down, scheduler still OK, no retries on recent downloads.

Paul.

And still down, 6 hours after your post :-(
ID: 110426 · Report as offensive
PMH_UK

Joined: 24 Dec 10
Posts: 36
United Kingdom
Message 110428 - Posted: 12 Nov 2022, 20:01:08 UTC - in response to Message 110426.  

Website up, scheduler responding OK.

Paul.
ID: 110428 · Report as offensive
Grumpy Swede
Joined: 30 Mar 20
Posts: 416
Sweden
Message 110436 - Posted: 13 Nov 2022, 18:42:48 UTC
Last modified: 13 Nov 2022, 19:20:01 UTC

Sigh.....

503 Service Unavailable
No server is available to handle this request


Edit, and shortly thereafter:

System error
World Community Grid is currently experiencing an unexpected error. Please check Facebook or Twitter for more information.


It's no longer an "unexpected error". I suggest they change the message to "World Community Grid is currently experiencing an expected error."

Incredible that they can't keep it working for longer than a day at most, and even then only if they restrict how much work they send out.
ID: 110436 · Report as offensive
Grumpy Swede
Joined: 30 Mar 20
Posts: 416
Sweden
Message 110472 - Posted: 16 Nov 2022, 20:00:59 UTC

And down again of course.....
ID: 110472 · Report as offensive
Dr Who Fan
Joined: 10 May 07
Posts: 1442
United States
Message 110473 - Posted: 16 Nov 2022, 20:57:06 UTC

More propaganda from Krembil:
WCG’s 18th Anniversary

Celebrating WCG’s achievements over the past 18 years.


Best part of propaganda [highlighted in bold by me] >>
The WCG is now managed by the Jurisica lab at UHN. As an academic group, we face a significant financial challenge in providing the same level of support to the global research community, still free of charge. One year may seem a long time for “WCG system migration”, but our small team had to transition from a research team that ran Mapping Cancer Markers on WCG to learning, building and configuring the WCG system on our servers, with our architecture and performance constraints (as the servers were not optimally configured for the WCG system).
ID: 110473 · Report as offensive
Grumpy Swede
Joined: 30 Mar 20
Posts: 416
Sweden
Message 110476 - Posted: 17 Nov 2022, 15:47:04 UTC

And again.....Rinse and repeat......
ID: 110476 · Report as offensive
Grumpy Swede
Joined: 30 Mar 20
Posts: 416
Sweden
Message 110483 - Posted: 18 Nov 2022, 3:30:33 UTC

And more rinse and repeat.
ID: 110483 · Report as offensive
Dr Who Fan
Joined: 10 May 07
Posts: 1442
United States
Message 110491 - Posted: 18 Nov 2022, 23:06:16 UTC

New post / WCG news from Krembil:
Thread: 2022-11-18 Update (Network and Storage)
Hi everyone, an update on network connection and storage.

We are working together with SHARCNET (an HPC site where WCG servers and storage reside) to resolve the network congestion events we have been experiencing. For volunteers, these events manifest as the arbitrary website/forums database downtime and constant interruptions to volunteers attempting to download workunits. At this time, we believe the root cause to be a limitation or bug in the OpenStack software through which our virtual environment is provisioned at SHARCNET.

To help ameliorate the worst effects of this issue, SHARCNET have re-routed all WCG traffic through a new network node. Effectively, this separates WCG traffic from that of other users and deployments unrelated to the WCG that are colocated at the SHARCNET HPC facility. We have already seen a benefit from this change, and it could help us to further diagnose and optimize additional performance issues.

We have also reduced the maximum concurrent connections permitted on the download servers at SHARCNET’s request, and reduced the maximum number of packages available at any one time for download. Although these adjustments suggest a lower throughput, they have been active since November 11 and are in fact helping the overall throughput of WCG by stabilizing the network to a degree. However, we are still seeing events inside our environment where the load balancer and servers behind it are sometimes unable to communicate with each other.
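The point that a lower connection cap can still help overall throughput can be illustrated with a small sketch. This is not WCG's or SHARCNET's actual configuration: the `ConnectionGate` class, the cap of 3, and the simulated transfer time are all hypothetical, and the example only shows the general technique of admitting a bounded number of simultaneous downloads while the rest wait their turn.

```python
import threading
import time


class ConnectionGate:
    """Admit at most `limit` simultaneous downloads; extra callers wait."""

    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self.active = 0  # downloads in progress right now
        self.peak = 0    # highest number seen in progress at once

    def download(self, name, seconds=0.01):
        # Block here until one of the `limit` slots is free.
        with self._slots:
            with self._lock:
                self.active += 1
                self.peak = max(self.peak, self.active)
            time.sleep(seconds)  # stand-in for the real file transfer
            with self._lock:
                self.active -= 1
        return name


def run_clients(gate, count):
    """Start `count` clients at once and return the peak concurrency observed."""
    threads = [
        threading.Thread(target=gate.download, args=(f"workunit_{i}",))
        for i in range(count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return gate.peak
```

Calling `run_clients(ConnectionGate(3), 10)` lets all ten clients finish, but never more than three hold a connection at the same moment, which is the stabilizing effect the update describes.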

Importantly, the bandwidth that the WCG environment is provided with at SHARCNET is nowhere near saturated during these events. It is not an issue of capacity. We are working to resolve this and will provide an update on our progress as soon as we have new information. Once resolved, we will be in a position to fully restart, and bring new projects to the Grid.

The new and faster storage server is physically installed at SHARCNET now and will be connected to the rest of the WCG servers next week. The primary benefit of the new storage array is the SSD storage that comes with it, which will increase performance of many key components that currently rely on NFS shares of logical volumes that are composed of HDD storage only.

If you have any comments or questions, please leave them in this thread for us to answer. Thank you for your support, patience and understanding.

WCG team at Krembil Research Institute
ID: 110491 · Report as offensive
Grumpy Swede
Joined: 30 Mar 20
Posts: 416
Sweden
Message 110523 - Posted: 22 Nov 2022, 15:24:43 UTC

And time again for another "System error"
ID: 110523 · Report as offensive
Grumpy Swede
Joined: 30 Mar 20
Posts: 416
Sweden
Message 110550 - Posted: 24 Nov 2022, 15:29:43 UTC

Geeze, this is really getting old.....
503 Service Unavailable

And as usual followed by:
System error
World Community Grid is currently experiencing an unexpected error. Please check Facebook or Twitter for more information.


Of course there is no information about this on Facebook.....
ID: 110550 · Report as offensive
PMH_UK

Joined: 24 Dec 10
Posts: 36
United Kingdom
Message 110586 - Posted: 29 Nov 2022, 15:46:30 UTC - in response to Message 110428.  

Down, 503 etc.

Paul.
ID: 110586 · Report as offensive
Grumpy Swede
Joined: 30 Mar 20
Posts: 416
Sweden
Message 110617 - Posted: 3 Dec 2022, 2:27:25 UTC
Last modified: 3 Dec 2022, 3:16:24 UTC

Web and the BOINC part of WCG, both down this time. So, this time the entire WCG project is DEAD.
The web part does not respond at all, and the BOINC part responds with
6372	World Community Grid	2022-12-03 03:27:02	update requested by user	
6373	World Community Grid	2022-12-03 03:27:04	Sending scheduler request: Requested by user.	
6374	World Community Grid	2022-12-03 03:27:04	Requesting new tasks for CPU and NVIDIA GPU	
6375	World Community Grid	2022-12-03 03:27:27	Scheduler request failed: Couldn't connect to server	
6376			        2022-12-03 03:27:28	Project communication failed: attempting access to reference site	
6377			        2022-12-03 03:27:29	Internet access OK - project servers may be temporarily down.	
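
The log excerpt above follows a recognizable pattern: the scheduler request fails, the client checks a reference site, and the reference site answers. A minimal sketch of reading that pattern automatically is shown below. It assumes the tab-separated layout of the lines as pasted here; the function names and the three result labels are made up for illustration and are not part of the BOINC client.

```python
def message_of(line):
    """Return the last non-empty tab-separated field, i.e. the message text."""
    fields = [field.strip() for field in line.split("\t")]
    fields = [field for field in fields if field]
    return fields[-1] if fields else ""


def classify(lines):
    """Guess where the fault lies from a run of event log lines."""
    messages = [message_of(line) for line in lines]
    failed = any(m.startswith("Scheduler request failed") for m in messages)
    internet_ok = any(m.startswith("Internet access OK") for m in messages)
    if failed and internet_ok:
        # The reference site was reachable, so the project side is suspect.
        return "project-side outage"
    if failed:
        # No successful reference check seen; the local link may be at fault.
        return "possibly local"
    return "no failure seen"
```

Fed the six lines above, `classify` reports a project-side outage, which matches the conclusion that the WCG servers, not the local connection, were unreachable.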

EDIT: Seems as if it's WCG's IT-provider sharcnet.ca that is totally down.

"SHARCNET is a consortium of universities in Ontario, Canada, that aggregate funding to purchase supercomputer systems,
which are shared among the members to perform research, rather than individually purchasing smaller systems at each university.
It was formed to allow members access to larger, faster, and more modern computer resources than they would otherwise be able
to afford, and to retain researchers at their organizations. SHARCNET is part of the larger Compute Canada umbrella."
ID: 110617 · Report as offensive
Grumpy Swede
Joined: 30 Mar 20
Posts: 416
Sweden
Message 110620 - Posted: 3 Dec 2022, 5:04:18 UTC

Geeze, WCG via sharcnet came back for a short while, but now the WCG site is gone again with the usual:
System error
World Community Grid is currently experiencing an unexpected error. Please check Facebook or Twitter for more information.


BOINC and Sharcnet are working though, so this time it seems to be the "normal" WCG problem.
ID: 110620 · Report as offensive
Dr Who Fan
Joined: 10 May 07
Posts: 1442
United States
Message 110677 - Posted: 9 Dec 2022, 7:11:44 UTC

News from WCG
2022-12-08 Update (OPNG workunits, storage update, and missing devices on Contributions page)
OPNG workunits
Over the past 2 weeks many volunteers noticed the lack of OPNG workunits. The shortage is not a technical issue, but rather it has a “science-based” reason. Here is a short message the OPN team shared with us:

“The scientists are analyzing the available data to choose a set of protein conformations for the next rounds of molecular docking in GPUs. The focus is on PLPro, one of the SARS-CoV-2 proteases. The dockings will include covalent and non-covalent compounds.”

When new OPNG tasks are prepared, we will make a small announcement informing everyone.

SSD delay
Unfortunately, the new storage upgrade that we mentioned in this previous update has been delayed by a late shipment of the SSDs, with no guaranteed ETA for implementation after arrival at this time. We are in constant communication with the SHARCNET team as to the status of the new drives, and will update you when significant progress has been made.

Missing devices in Contributions page
Several volunteers experienced missing devices and results in the Contributions page. This issue has been problematic for some time and we are now putting more effort into solving it. For the time being, we have scheduled a batch update from the BOINC database directly, bypassing the message passing via IBM MQ that notifies the website database when new devices have been inserted into the BOINC database. Insertion into the BOINC database occurs at time of first contact with our scheduler. The new record is then placed on a service-to-service queue, to which the webserver and website database server are subscribed. When processed, this message should result in visibility of the device in both databases and to users on the website immediately.

The fact that only some devices do not show up as a result of this process and remain “hidden” is what we must understand before we can fully resolve the issue. For now, re-running the batch update procedure regularly will help to recover visibility of most devices newly added to WCG. We highly encourage those having this issue to send an email to [email protected] with screenshots and extra details you can provide.
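
The batch update described above amounts to a reconciliation between two stores: anything present in the BOINC database but absent from the website database is copied across directly, without waiting for a queue message. A minimal sketch of that idea follows. The dictionaries stand in for the two databases, and the function names and record fields are hypothetical; nothing here reflects the real WCG schema or its IBM MQ setup.

```python
def find_hidden_devices(boinc_hosts, website_devices):
    """Return IDs known to the BOINC store but missing from the website store."""
    return sorted(set(boinc_hosts) - set(website_devices))


def batch_update(boinc_hosts, website_devices):
    """Copy every hidden device into the website store; return the IDs copied."""
    hidden = find_hidden_devices(boinc_hosts, website_devices)
    for host_id in hidden:
        # Copy the record so later edits to one store do not leak into the other.
        website_devices[host_id] = dict(boinc_hosts[host_id])
    return hidden


# Hypothetical data: host 3 never arrived via the message queue.
boinc = {1: {"name": "desktop"}, 2: {"name": "laptop"}, 3: {"name": "server"}}
website = {1: {"name": "desktop"}, 2: {"name": "laptop"}}
recovered = batch_update(boinc, website)
```

Because the comparison is recomputed on every run, repeating the batch is harmless: a second call finds nothing left to copy, which is why re-running it regularly is a workable stopgap until the queue problem is understood.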

If you have any comments or questions, please leave them in this thread for us to answer. Thank you for your support, patience and understanding.

WCG team at Krembil Research Institute
ID: 110677 · Report as offensive
Dave
Help desk expert

Joined: 28 Jun 10
Posts: 2691
United Kingdom
Message 110681 - Posted: 10 Dec 2022, 14:26:27 UTC

And ARP tasks are just back after an absence. Downloads are so dodgy that they make my broadband upload limit of 100 KB/s seem fast!
ID: 110681 · Report as offensive
Dr Who Fan
Joined: 10 May 07
Posts: 1442
United States
Message 110682 - Posted: 10 Dec 2022, 15:00:21 UTC - in response to Message 110681.  

My experience is virtually the same today.

One ARP data file has been in constant backoff as a stuck download for almost 9 hours so far. It doesn't matter how many times I mash retry, it ends in backoff.
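
For readers unfamiliar with the term, "backoff" means the client waits longer and longer between attempts after each failure. The sketch below shows a generic randomized exponential backoff. It is an illustration of the general technique only, not BOINC's actual transfer schedule; the base delay of 60 seconds, the 4-hour cap, and the function name are assumptions.

```python
import random


def backoff_delays(attempts, base=60.0, cap=4 * 3600.0, rng=None):
    """Return one wait time in seconds for each consecutive failed attempt."""
    rng = rng or random.Random()
    delays = []
    for n in range(attempts):
        # Double the ceiling after every failure, but never exceed the cap.
        ceiling = min(cap, base * (2 ** n))
        # Pick a random point in the upper half so clients do not retry in lockstep.
        delays.append(rng.uniform(ceiling / 2, ceiling))
    return delays
```

Under these assumed numbers the wait after the first failure is under a minute, while after ten failures in a row it is between two and four hours, so a file that keeps failing on a manual retry can sit idle for most of a day.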
ID: 110682 · Report as offensive
Dave
Help desk expert

Joined: 28 Jun 10
Posts: 2691
United Kingdom
Message 110685 - Posted: 10 Dec 2022, 17:41:55 UTC - in response to Message 110682.  

My experience is virtually the same today.

One ARP data file has been in constant backoff as a stuck download for almost 9 hours so far. It doesn't matter how many times I mash retry, it ends in backoff.
7 running, 7 in queue and one uploading at the moment. I am trying to manage it so that only one task is downloading at a time. That way, simultaneous downloads for lots of tasks don't delay the point at which all the files for the first task have finished downloading.

Anyway, a few more thousand CPDN tasks coming next week so ARP will go to back burner.
ID: 110685 · Report as offensive
Bryn Mawr
Help desk expert

Joined: 31 Dec 18
Posts: 296
United Kingdom
Message 110687 - Posted: 10 Dec 2022, 17:47:47 UTC

The sad thing is that I don’t take ARP but my WCG is still screwed because of the download problem - one held file overnight and the whole of WCG stops until the blockage is cleared.
ID: 110687 · Report as offensive
Grumpy Swede
Joined: 30 Mar 20
Posts: 416
Sweden
Message 110693 - Posted: 11 Dec 2022, 0:51:03 UTC
Last modified: 11 Dec 2022, 0:52:35 UTC

It's almost 10 months now since the migration from IBM to Krembil/Jurisica began, and we're not even close to anything that even resembles how WCG worked before the migration. Nothing has gone according to plan....

WCG Data Transfer Underway, Stress Test of New Infrastructure Scheduled For Feb 28th
ID: 110693 · Report as offensive
Dr Who Fan
Joined: 10 May 07
Posts: 1442
United States
Message 110707 - Posted: 12 Dec 2022, 7:46:57 UTC

lol.... Another day, more "stuck" downloads for ARP tasks
ID: 110707 · Report as offensive

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.