Helping IT Operations and DevOps deliver on the promise of agility, performance, and uptime.
Homepage: https://www.pagerduty.com
Status Page: https://status.pagerduty.com
Action | Date | Description |
---|---|---|
Resolved | 2017-01-19 20:27:00 | This issue has been resolved, and incidents can now be created via the web UI. Some users may also have experienced problems saving changes to Account Settings; this has been resolved as well. If you receive a 404 error while manually creating incidents via the web UI or saving changes to Account Settings, please refresh the page, and clear your browser's cache if problems persist. |
Investigating | 2017-01-19 20:11:00 | PagerDuty is investigating an issue affecting the ability to manually trigger incidents via the web UI. Events received via email and our API are not affected, and notification delivery is working as expected. |
Action | Date | Description |
---|---|---|
Resolved | 2017-01-07 00:10:00 | We resolved the issues with our incident log entries component and the web/mobile apps no longer display errors when viewing log entries. |
Investigating | 2017-01-06 23:29:00 | There is currently an issue affecting the incident log entries component of our web application causing the application to display errors. We are actively investigating. |
Action | Date | Description |
---|---|---|
Resolved | 2017-01-04 02:21:00 | We have resolved the issues with generic webhooks and all webhooks are being delivered properly. |
Investigating | 2017-01-04 01:01:00 | We are experiencing issues with generic webhook delivery. Check status.pagerduty.com for updates as they occur. |
Action | Date | Description |
---|---|---|
Resolved | 2016-11-30 11:14:00 | PagerDuty is again processing alerts at full speed. |
Identified | 2016-11-30 11:10:00 | We are now in recovery. A small number of notifications are still delayed. |
Investigating | 2016-11-30 11:01:00 | PagerDuty is experiencing issues affecting the deliverability of alerts for a small number of accounts. Check @pagerdutyhelp for updates as they happen. |
Action | Date | Description |
---|---|---|
Resolved | 2016-11-05 15:10:00 | These issues have been resolved. |
Investigating | 2016-11-05 14:43:00 | We are currently experiencing delays in delivering notifications and some calls have blank messages. |
Action | Date | Description |
---|---|---|
Postmortem | 2016-10-27 00:26:00 | (Cross-posted from the postmortem published by Tim Armandpour, SVP, Product Development, on [our blog](https://www.pagerduty.com/blog/service-disruption-root-cause-analysis-follow-actions-october-21st-2016/)):

Following up on our [blog post](https://www.pagerduty.com/blog/service-disruption-timeline-october-21st-2016/) from Monday, we wanted to share the actions we will be taking based on our initial root cause analysis.

# Primary and Secondary Root Causes

As we looked through our timeline of events during the outage, we discovered that there were two issues:

1. Our failover approach to DNS problems
2. The quality of monitoring used to assess the end-to-end customer experience

As we have talked about in the past, we prefer to design our systems in a multi-master architecture, as opposed to a failover architecture, to achieve continuous availability. While this approach requires significant systems-design investment, it has the benefits of predictable capacity in degraded scenarios, forcing increased automation, and making incremental changes easier and safer. However, we did not have a multi-master architecture in place for our DNS systems. Instead, we required a manual failover to a secondary provider during the outage.

Measuring the end-to-end customer experience is always a challenge in the midst of DNS problems. After all, if a customer cannot talk to your systems, how can you tell what their experience is? We rely heavily on monitoring and alerting on every part of PagerDuty's services. We have teams of engineers dedicated to making sure that each part of the customer experience is what our customers expect. During this outage, we were unable to properly diagnose customer-facing problems because customers were not able to reach our systems. This led to an increased resolution time for our customers.

# Follow-up Actions

In the coming weeks, we are looking at making several enhancements to our infrastructure, processes, and automation. These enhancements will help decrease the chance of a system-wide outage from the same root causes.

#### Taking a Multi-Master Approach towards DNS

Our top priority, already underway, is redesigning and implementing a new DNS architecture that allows multiple DNS providers to be leveraged in a multi-master approach. We are updating our internal tooling and automation to make sure that our external customer-facing DNS records leverage multiple DNS providers, and that our internal servers leverage a similar system.

#### Auditing All DNS TTLs

Our customers interact with PagerDuty through multiple endpoints: our website, our APIs, and our mobile applications. To ensure a consistent experience across all of these, we will be auditing the DNS TTLs for our zones, including the NS and SOA records for each zone.

#### Runbook for DNS Cache Flushing

Many public DNS providers offer the ability to proactively flush caches when records have changed; Google, for example, provides this functionality via a web interface. We will determine which DNS providers our customers use most, and document the steps for each provider to proactively flush caches so that up-to-date records are served as quickly as possible.

#### Improve Real User Monitoring

We leverage a combination of internal monitoring systems and external providers. During this outage, we used these monitoring systems to assess the customer impact and determine how best to prioritize resolution steps. Unfortunately, most of the internal systems are designed as a view from within our infrastructure, and did not adequately describe the end-to-end user experience, especially for our customers on the east and west coasts of the US. We will invest additional resources in global monitoring that takes an external, customer-experience view of our systems and overall service offering. This includes our website, API, and mobile experiences, as well as our notification experience.

#### Improve Prioritization of Resolution Steps

At PagerDuty, we use a service-oriented architecture to support the many features our customers rely on. For the majority of our customer-facing incidents, only one part of our service is affected when a disruption occurs. With a central component like DNS unavailable, multiple components of our service were impacted. When bringing our services back up in the future, we need the ability to prioritize the most critical services that matter most to our customers.

#### Improve Multi-Team Response Process

As called out in the previous section, we have multiple teams continuously on call to keep PagerDuty working properly. While we use our own product to assist with our people-orchestration efforts, we did not have all of the supporting tooling in place for certain teams involved. We plan to implement processes and improve upon our best practices so that each team is able to address problems in their own services effectively.

# Conclusion

This past Friday was a difficult day for nearly every on-call engineer. At PagerDuty, we take great pride in providing a service that we know thousands of customers rely on. We did not meet the high expectations that we set for ourselves, and we are taking critical steps to continuously enhance the reliability and availability of our systems. From this experience, I am confident we will provide an even more reliable service that will be there when our customers need us the most. As always, if you have any questions or concerns, please do not hesitate to follow up with our Support team at [support@pagerduty.com](mailto:support@pagerduty.com). |
Resolved | 2016-10-24 19:34:00 | We will be issuing a full post mortem and next steps here over the next few days. |
Monitoring | 2016-10-21 23:38:00 | The issue with duplicate notifications has been resolved. If you are experiencing this problem, please reach out to us at support@pagerduty.com. |
Identified | 2016-10-21 22:20:00 | We are aware of some customers receiving duplicate phone and SMS notifications. We are actively working on resolving this issue. |
Monitoring | 2016-10-21 20:24:00 | Acknowledgements and resolutions should now be working correctly. If you are experiencing any issues, please reach out to us at support@pagerduty.com. |
Update | 2016-10-21 20:00:00 | At this time, notifications are no longer delayed. We are working on correcting the inability to acknowledge or resolve incidents via phone and SMS. |
Update | 2016-10-21 19:32:00 | We are still investigating issues with customers being unable to acknowledge or resolve incidents properly. If you are having issues reaching any pagerduty.com address, please flush your DNS cache to resolve the issue. |
Update | 2016-10-21 18:32:00 | We're still investigating issues related to acknowledging and resolving incidents by phone and SMS. |
Update | 2016-10-21 17:37:00 | Notifications are delayed at this time. We are working to resolve the issue. |
Identified | 2016-10-21 17:10:00 | We are still investigating issues related to accessibility of our services due to DNS failures. Some customers may not be able to connect with pagerduty.com. We are working on alternative measures. |
Investigating | 2016-10-21 16:25:00 | We are investigating an issue with the accessibility of PagerDuty. |
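The multi-master DNS approach described in the postmortem above boils down to an auditable invariant: a zone's NS set should span at least two independent providers. A minimal sketch of such an audit check, assuming hypothetical provider hostnames (this is illustrative, not PagerDuty's actual tooling; in practice the NS set would come from a live query such as `dig NS example.com`):

```python
def ns_providers(ns_records):
    """Return the set of distinct provider domains serving a zone,
    given its NS hostnames. The provider is approximated here as the
    last two DNS labels of each nameserver name."""
    providers = set()
    for ns in ns_records:
        labels = ns.rstrip(".").lower().split(".")
        providers.add(".".join(labels[-2:]))
    return providers

def is_multi_provider(ns_records):
    """A zone survives a single provider outage only if its NS set
    spans at least two independent providers."""
    return len(ns_providers(ns_records)) >= 2
```

A zone served only by `ns1.dnsimple.com` and `ns2.dnsimple.com` would fail this check, while one that also delegates to a second provider's nameservers would pass. (The two-label heuristic is a simplification; real tooling would need a public-suffix list.)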
Action | Date | Description |
---|---|---|
Resolved | 2016-10-21 13:42:00 | We are no longer experiencing issues due to our DNS provider outage and web access has recovered. We are continuing to monitor the situation. |
Update | 2016-10-21 13:13:00 | PagerDuty's DNS provider is currently experiencing problems. IP addresses to use for subdomain.pagerduty.com and api.pagerduty.com traffic: 50.112.113.204, 50.112.113.201, 54.203.252.221 |
Monitoring | 2016-10-21 13:12:00 | PagerDuty's DNS provider is currently experiencing problems. IP addresses to use for events.pagerduty.com: 104.45.235.10, 104.42.125.229, 54.241.36.40, 54.241.36.66, 54.244.255.45 |
Identified | 2016-10-21 12:40:00 | Customers may experience issues accessing the website due to an issue with our DNS provider. We are currently working to resolve the issue. |
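For updates like the two above that publish fallback IP addresses, the standard client-side workaround is a temporary `/etc/hosts` override so that traffic bypasses the failing resolver. A sketch using two of the published addresses (pick one address per hostname; remove the entries once DNS recovers):

```
# /etc/hosts — temporary override while the DNS provider is down.
# Delete these lines once normal resolution recovers.
104.45.235.10    events.pagerduty.com
50.112.113.204   api.pagerduty.com
```

After editing the file, flush any local DNS cache (for example, `dscacheutil -flushcache` on macOS) so the override takes effect immediately.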
Action | Date | Description |
---|---|---|
Resolved | 2016-10-17 02:51:00 | All systems are operational. |
Monitoring | 2016-10-17 02:21:00 | We are still actively monitoring the affected systems. All systems have been fully operational since approximately 5:50 p.m. Pacific time. |
Identified | 2016-10-17 01:07:00 | The issue has been identified and we are working towards a resolution. |
Investigating | 2016-10-17 00:35:00 | We are currently experiencing degraded event processing, causing delays with inbound events. |
Action | Date | Description |
---|---|---|
Postmortem | 2016-11-07 23:24:00 | # Summary

On October 16, 2016 at 7:24 pm EST, PagerDuty had a service degradation lasting approximately 106 minutes related to our alerting pipeline. Customers would have experienced difficulties sending events to PagerDuty's integration endpoint, as well as slightly increased time between events arriving and the corresponding incidents being created. During the service degradation, approximately 6.5% of our notifications were delayed, and the integration API failed to accept events (HTTP 500) for 0.04% of requests.

## What Happened?

Over the course of the weekend, a time when PagerDuty typically sees lower aggregate traffic, we were downscaling over-provisioned nodes in the Cassandra database cluster used by the integration API for storing events. The new, smaller nodes were properly sized to handle the regular amount of in-flight data. Unfortunately, they were insufficiently sized to handle the temporary files created by a Cassandra operation called compaction. The act of downscaling the cluster caused an increase in writes to the cluster, which triggered a much higher rate of compactions than is typically seen. This pushed the nodes toward running out of disk space.

PagerDuty engineers began reverting the cluster to the known good state with the original, larger nodes. This triggered even more compactions. Some of these new compactions consumed all of the CPU on the nodes, which degraded the cluster to the point that event ingestion was affected.

## What Are We Doing About This?

The cluster has been resized back to its original size and has been functioning smoothly since. Part of the issue in this incident was that the entire cluster was downscaled over the course of one weekend. Future Cassandra operations will be performed on one node at a time, with sufficient time for each node to "bake", allowing the cluster to go through its regular cycle of repairs and compactions. This should isolate any misbehaviour to a single node, as opposed to the entire cluster.

We sincerely apologize if this degradation negatively impacted your team's visibility or response. We take service degradations very seriously, and the steps outlined above should prevent this type of routine maintenance from escalating to a degradation in the future. If you have questions or concerns, please contact us at [support@pagerduty.com](mailto:support@pagerduty.com). |
Resolved | 2016-10-16 23:57:00 | The issue has been resolved and events are processing normally. |
Investigating | 2016-10-16 23:38:00 | We are currently experiencing degraded event processing, causing delays with inbound events. |
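The failure mode in the postmortem above comes down to disk headroom: a Cassandra size-tiered compaction writes a merged copy of its input SSTables before deleting them, so a node temporarily needs free space roughly equal to the inputs' combined size. A minimal sketch of the kind of capacity check that would have flagged the undersized nodes (the function name and safety factor are illustrative assumptions, not PagerDuty's actual tooling):

```python
def has_compaction_headroom(disk_free_bytes, sstable_sizes_bytes, safety_factor=1.2):
    """Return True if a node can absorb a compaction of the given
    SSTables. Size-tiered compaction writes a merged copy of its
    inputs before removing them, so worst-case it temporarily needs
    free space roughly equal to their combined size (padded here by
    an assumed 20% margin)."""
    required = sum(sstable_sizes_bytes) * safety_factor
    return disk_free_bytes >= required
```

For example, a node with 100 GB free can safely compact two 40 GB SSTables at this margin, but a node with only 90 GB free cannot, even though the steady-state data fits comfortably.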
Action | Date | Description |
---|---|---|
Resolved | 2016-10-07 22:51:00 | The issue has been fixed. Events are processing normally. |
Investigating | 2016-10-07 22:50:00 | PagerDuty is currently experiencing a delay in processing some incoming events. We are investigating the issue. |