Service incident

Incident Report for WorkBoard

Postmortem

System access has been restored. At this time, the facts point not to an issue with our production code or any change WorkBoard introduced to the environment, but to a failure in the Microsoft API layer between Azure services. As you know, we rely on Microsoft for cloud infrastructure. The second failure point was Microsoft's support SLA, which commits to a response in under 1 hour for critical situations. Today, it took 90 minutes to get support on a call and 2 hours to get a capable support person on a call. Even with those resources, they needed to escalate to engineering and initiate a reboot of their API layer to mitigate the issue. That reboot in turn required other services in our environment to be restarted, which relieved the problem.

We expect an immediate explanation from Microsoft for why the API layer failed (expected today) and a full root cause analysis (expected in a few days). All parties were focused on resolving the issue and are now turning to diagnosing it and preventing further occurrences.

While this issue does not appear to have been caused by WorkBoard, we understand that its full brunt fell on you as a customer who relies on the platform, and that it comes on the heels of several recent frustrating issues that were our errors. We are actively exploring what additional level of SLA we can purchase (our current level is Premier) as well as alternative providers that can diagnose issues more rapidly, be more helpful in resolving a wider spectrum of problems, and be more productive in preventing issues in the first place.

Posted May 03, 2023 - 20:24 UTC

Resolved

Bad Gateway errors have ceased and we are seeing most pages return to standard responsiveness. We are calling the all clear on this service incident and we will continue to monitor the system and follow through on page optimizations in the immediate term.

Posted May 03, 2023 - 19:11 UTC

Identified

We are still working directly with Microsoft Support to restore full functionality to the application. We remain all hands on deck and are continuing to monitor the situation.

Posted May 03, 2023 - 19:03 UTC

Monitoring

We are beginning to see signals that the application is being restored and we are continuing to monitor the situation.

Posted May 03, 2023 - 18:37 UTC

Update

We rely on Microsoft Azure for cloud and server infrastructure and we are pursuing parallel steps that are slowly restoring some components and services. Bad Gateway errors have receded but pages are still loading slowly. We will continue to work with Microsoft and share progress as it is made.

Posted May 03, 2023 - 18:19 UTC

Update

Microsoft Support and Engineering are treating this as an urgent issue and we are troubleshooting live to restore service as soon as possible.

Posted May 03, 2023 - 18:02 UTC

Update

We're continuing to investigate the issue with our downstream clusters' components live with Microsoft Support. We will share progress as it is made. Thank you for your patience.

Posted May 03, 2023 - 17:47 UTC

Update

We're actively investigating the issue with our downstream clusters' components and working closely with Microsoft Support to resolve it. All hands remain on deck, and we will continue to update as we progress.

Posted May 03, 2023 - 17:32 UTC

Update

A thorough investigation of the downstream clusters is ongoing. In the interim, we have taken the mitigation steps Microsoft recommended and are working with Microsoft Support to investigate the factors our cluster depends on. We will provide the next update shortly.

Posted May 03, 2023 - 17:16 UTC

Identified

Our engineers have identified an issue related to the downstream Azure cluster and are working with Microsoft Product Support.

Posted May 03, 2023 - 16:50 UTC

Update

We are still investigating this issue. All hands remain on deck.

Posted May 03, 2023 - 16:32 UTC

Update

We are still investigating this issue. All hands remain on deck.

Posted May 03, 2023 - 16:04 UTC

Update

Some US users continue to receive a Bad Gateway error when attempting to access WorkBoard. We are still investigating the issue and all hands remain on deck.

Posted May 03, 2023 - 15:37 UTC

Investigating

We’re investigating an issue where some users are receiving a Bad Gateway error when attempting to access WorkBoard.

Posted May 03, 2023 - 15:10 UTC

This incident affected: Web App and API.