Forms not loading on JSD portal
Incident Report for ThinkTilt
Postmortem

On 8 April 2020 (UTC), a partial outage occurred involving ProForma’s cloud servers. During the outage, forms submitted through the customer portal and changes to the contents of forms may not have been saved.

The partial outage occurred because ProForma failed to handle a large volume of requests coming from multiple Jira instances. These requests were triggered by multiple customers performing bulk updates on thousands of issues at once.

This post mortem gives those affected further information about the timing of the incident, its potential impact on you and how you can respond to the situation. We appreciate how important form data is and we want to do our utmost to ensure your processes aren’t unduly affected.

Impact

ProForma suffered a partial outage and failed to respond to some requests. This meant that users may not have been able to:

  • Access a form attached to an issue;
  • Save changes to the contents of a form; or
  • Submit a form via the customer portal.

As a result, requests or issues may have been created without forms attached, or changes to the contents of a form may not have been saved. Most users will have seen an error message on their web page indicating that the server was not available, which should have alerted them that their work was not saved.

Timing

Our logs show that the partial outage affected customers between 13:00 UTC and approximately 17:00 UTC on Wednesday 8 April 2020. In local time, the incident occurred at:

  • Sydney: 11:00pm, 8 April to 3:00am, 9 April
  • London: 2:00pm to 6:00pm, 8 April
  • San Francisco: 6:00am to 10:00am, 8 April

Possible repair pathways

Depending on the importance of the data being received, you will need to consider the following:

For internal Jira Users (Software & Business Projects)

Consider whether to advise your users to check that any changes they made to forms during the outage were saved. Unfortunately, this will need to be a manual process.

For Customers using the Jira Service Desk Portal

We recommend checking the requests created during the outage to ensure that they have forms attached. For requests that do not have forms attached, you could advise customers that their request was not submitted properly due to a processing error.

To find the issues that may have been affected, you can use the following JQL queries (adjusting the project key as appropriate). Note: in Cloud, the created date is relative to the time zone set for your instance:

  • For Sydney: project = KEY AND created >= "2020-04-08 23:00" AND created <= "2020-04-09 03:00" order by created DESC
  • For London: project = KEY AND created >= "2020-04-08 14:00" AND created <= "2020-04-08 18:00" order by created DESC
  • For San Francisco: project = KEY AND created >= "2020-04-08 06:00" AND created <= "2020-04-08 10:00" order by created DESC
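If your instance is set to a different time zone, the same window can be derived from the UTC times given under Timing. The following is a minimal Python sketch, not something provided by ProForma or Jira: the function name `outage_jql` and the constants are illustrative assumptions, and it needs Python 3.9+ for `zoneinfo` (with time zone data available on the system).

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Outage window as stated in this report, in UTC.
OUTAGE_START_UTC = datetime(2020, 4, 8, 13, 0, tzinfo=timezone.utc)
OUTAGE_END_UTC = datetime(2020, 4, 8, 17, 0, tzinfo=timezone.utc)


def outage_jql(project_key: str, tz_name: str) -> str:
    """Build the JQL query for the outage window in an IANA time zone."""
    tz = ZoneInfo(tz_name)
    fmt = "%Y-%m-%d %H:%M"  # same date-time format as the queries above
    start = OUTAGE_START_UTC.astimezone(tz).strftime(fmt)
    end = OUTAGE_END_UTC.astimezone(tz).strftime(fmt)
    return (
        f'project = {project_key} AND created >= "{start}" '
        f'AND created <= "{end}" order by created DESC'
    )


if __name__ == "__main__":
    # Replace KEY and the zone name with your own project key and time zone.
    print(outage_jql("KEY", "Australia/Sydney"))
```

Passing "Australia/Sydney", "Europe/London" or "America/Los_Angeles" reproduces the three queries listed above; any other IANA zone name can be used in the same way.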

Once you have identified the issues, you can either:

  1. Ask the requestor to create a new request; or
  2. Manually attach the required form to the existing issue and ask the requestor to resubmit it.

To attach a form to the existing issue:

  1. View the affected requests where a ticket was created but there was no form.
  2. Add the relevant form to the affected request.
  3. Mark the form as “External” so that it will be visible on the portal to the user. This will allow the user to resubmit their information with the least impact.
  4. Send a comment to the user asking them to complete the form again.
  5. The requestor can log into the portal, view the request and fill out the form.
  6. You can then process the request as normal.

It may also be useful to label all the affected tickets so that you can monitor them.
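If there are many affected tickets, labelling could be scripted rather than done by hand. The sketch below only builds the request. It assumes the Jira Cloud REST API "edit issue" endpoint (`PUT /rest/api/3/issue/{issueIdOrKey}`), and the site URL, issue key and label name are hypothetical placeholders. Authentication and the HTTP call itself are left out, so please check the details against Atlassian's documentation before relying on it.

```python
import json


def build_label_request(base_url: str, issue_key: str, label: str):
    """Return (method, url, body) for adding one label to an existing issue.

    The "update" / "labels" / "add" structure appends the label without
    replacing labels already on the issue.
    """
    url = f"{base_url.rstrip('/')}/rest/api/3/issue/{issue_key}"
    body = {"update": {"labels": [{"add": label}]}}
    return "PUT", url, json.dumps(body)


if __name__ == "__main__":
    # Hypothetical values: substitute your own site, issue key and label.
    method, url, body = build_label_request(
        "https://your-domain.atlassian.net", "KEY-123", "proforma-outage-2020-04-08"
    )
    print(method, url)
    print(body)
```

The issue keys would come from the JQL results above, with one request sent per affected issue.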

Next steps by ProForma

We know this problem occurs when multiple customers perform bulk updates on thousands of issues. We have now changed how our servers operate so that bulk update requests are handled by a cluster of servers separate from the rest of ProForma, which ensures that normal ProForma functions can continue to operate. We have also doubled our server capacity by splitting our ProForma full and ProForma Lite customers onto their own server clusters.

We sincerely apologize for this incident and we will do our utmost to avoid an incident of this nature happening again. We are also examining how we can improve our response time to incidents like this, as we believe it took too long for us to restore normal operations.

We appreciate how important form data is, and we are genuinely sorry for the inconvenience this will cause you and your customers. We remain committed to ensuring this issue doesn’t impact you or your customers again.

Posted Apr 09, 2020 - 13:26 AEST

Resolved
This issue was resolved by 4am Australian Eastern Time. We will provide a full post mortem later today.
Posted Apr 09, 2020 - 09:56 AEST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Apr 09, 2020 - 03:39 AEST
Identified
Our servers were hit by an unprecedented number of requests from Jira relating to updates to issues. We don’t know what caused these updates, as it was not ProForma; however, ProForma had to process the requests and determine whether each update related to our forms. Unfortunately, given the huge number of requests, this overloaded our cluster of servers.

We have now implemented a filter to ignore the specific type of request. The only impact on ProForma of this filter is that the automation rule to add a form on status change will not work.

We will investigate what caused such a spike in update requests to come from Jira, and we will restore the add form automation rule as soon as we can.
Posted Apr 09, 2020 - 03:39 AEST
Update
We are continuing to investigate the cause of this issue. We have activated/woken every member of the team who can help to resolve this issue as quickly as possible.
Posted Apr 09, 2020 - 02:20 AEST
Investigating
We are investigating reports that forms are failing to load on the JSD portal. We will provide an update within the hour.
Posted Apr 09, 2020 - 00:10 AEST
This incident affected: ProForma Cloud - Primary AWS servers.