StatusBacon

Server Density


Homepage: https://www.serverdensity.com/
Status Page: http://status.serverdensity.com/


Open Incidents

No open incidents

Scheduled Incidents

No scheduled incidents

Previous Incidents

Slowdown on displaying metrics graphs

2016-12-30 08:46:12
Action Date Description
Resolved 2016-12-30 12:27:00 This incident is now resolved. We are considering this a new occurrence of the December 23rd incident http://status.serverdensity.com/incidents/2wt5tffd150t, on which we are still actively working with our provider; we will provide a detailed postmortem when this work is completed.
Monitoring 2016-12-30 09:10:00 We have now restored capacity on our metrics processing cluster. Some devices will show a metrics gap between 08:20 and 08:51, which will even out. We'll continue to monitor this closely.
Identified 2016-12-30 08:46:00 We're experiencing a slowdown in displaying metrics graphs on dashboards. The issue has been identified and we expect to recover soon. Alerting is not affected.

PagerDuty Notification Improvements

2016-12-23 15:25:52
Action Date Description
Completed 2017-01-09 15:35:00 This maintenance was completed without issue.
Scheduled 2017-01-03 12:00:00 On January 3rd 2017 between 12:00-16:00 UTC we will be deploying an improvement to our alert notification code that may prevent open group alerts from closing automatically at PagerDuty. This update changes the way that alerts are identified to allow for more granularity of group alert notifications; however, this means that any group alerts open at the time of the deployment will not have the same id passed in the fixed message, which will prevent those alerts from being closed automatically. Affected users will need to manually close/resolve affected alerts at PagerDuty. Alerts triggered after the deployment of this update will close automatically, as expected, once the alert closes at Server Density. Change: improvement of group alert notification identification. Affected Users: PagerDuty notification users. Expected Impact: once the deployment is completed, any group alerts that were open at PagerDuty at the time of the deployment will not be automatically closed once the alert has been closed at Server Density. Affected users will need to manually close/resolve the alerts at PagerDuty.
Scheduled 2016-12-23 15:25:00 On January 3rd 2017 between 12:00-16:00 UTC we will be deploying an improvement to our alert notification code that may prevent open group alerts from closing automatically at PagerDuty. This update changes the way that alerts are identified to allow for more granularity of group alert notifications; however, this means that any group alerts open at the time of the deployment will not have the same id passed in the fixed message, which will prevent those alerts from being closed automatically. Affected users will need to manually close/resolve affected alerts at PagerDuty. Alerts triggered after the deployment of this update will close automatically, as expected, once the alert closes at Server Density. Change: improvement of group alert notification identification. Affected Users: PagerDuty notification users. Expected Impact: once the deployment is completed, any group alerts that were open at PagerDuty at the time of the deployment will not be automatically closed once the alert has been closed at Server Density. Affected users will need to manually close/resolve the alerts at PagerDuty.
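
The automatic closing described in this maintenance notice depends on the resolve event carrying the same identifier as the original trigger. As a rough, hedged illustration only (this is not Server Density's integration code; the routing key and dedup key values are placeholders), a trigger/resolve pair against PagerDuty's Events API v2 looks roughly like this in Python:

    # Illustrative sketch of PagerDuty's Events API v2 (not Server Density's
    # actual integration). ROUTING_KEY and the dedup_key values are placeholders.
    import requests

    EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
    ROUTING_KEY = "YOUR_INTEGRATION_ROUTING_KEY"  # hypothetical value

    def trigger(dedup_key, summary):
        # Open (or re-trigger) an alert identified by dedup_key.
        return requests.post(EVENTS_URL, json={
            "routing_key": ROUTING_KEY,
            "event_action": "trigger",
            "dedup_key": dedup_key,
            "payload": {
                "summary": summary,
                "source": "serverdensity",
                "severity": "critical",
            },
        })

    def resolve(dedup_key):
        # The resolve event only closes the PagerDuty alert if dedup_key matches
        # the key sent with the original trigger.
        return requests.post(EVENTS_URL, json={
            "routing_key": ROUTING_KEY,
            "event_action": "resolve",
            "dedup_key": dedup_key,
        })

If the identification scheme changes between the trigger and the resolve, PagerDuty treats the resolve as referring to a different alert, which is why group alerts open across this deployment had to be closed manually.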

Missing graph data

2016-12-22 18:19:41
Action Date Description
Resolved 2016-12-23 11:48:00 Our provider has confirmed the cause of the outage and does not expect a recurrence. As such, we are considering this resolved and will publish our provider's root cause analysis together with our own postmortem.
Monitoring 2016-12-22 19:37:00 We have now restored redundancy and have contacted our provider about the initial failure. We'll continue to monitor this closely.
Identified 2016-12-22 19:04:00 We have identified the cause of the second cluster failure and recovered from it - graphs and metrics collection are functional again. We will now determine the cause of the initial cluster failure.
Update 2016-12-22 18:45:00 Another metrics cluster is down, so metrics writes are still affected.
Investigating 2016-12-22 18:19:00 We suffered downtime in one of our metrics clusters, which is causing gaps in graphs starting at 17:41 UTC. Alert processing was not affected.

USA, New York service monitoring location down

2016-12-16 09:55:08
Action Date Description
Resolved 2016-12-23 11:45:00 This incident has been resolved.
Identified 2016-12-16 09:55:00 Both of our redundant nodes at the USA, New York monitoring location are unreachable. We have received the following notification from our provider: "Our efforts to revive node5 have not been successful. A spare server is currently being shipped to replace the faulty one." Until we are able to access one of the nodes again, checks from this location will not take place. We recommend setting up 3 monitoring locations per service check to avoid relying on a single location; if that is not an option, North Virginia is the closest location to New York and may be used instead. https://support.serverdensity.com/hc/en-us/articles/201091476-Monitoring-node-locations-and-IP-addresses

Manchester monitoring location node IP address change

2016-12-15 11:04:11
Action Date Description
Completed 2016-12-15 11:30:00 The scheduled maintenance has been completed.
Scheduled 2016-12-15 11:15:00 At the request of our hosting provider, we are changing the IP address of one of our Manchester monitoring location nodes from 185.120.34.52 to 185.157.232.52. Customers that use this location and have whitelisted the previous address will need to update their rules. The tests issued from that location will continue to work, as only one node of the redundant pair has been changed.
Scheduled 2016-12-15 11:04:00 At the request of our hosting provider, we are changing the IP address of one of our Manchester monitoring location nodes from 185.120.34.52 to 185.157.232.52. Customers that use this location and have whitelisted the previous address will need to update their rules. The tests issued from that location will continue to work, as only one node of the redundant pair has been changed.

Payload processing slowdown

2016-12-03 22:46:17
Action Date Description
Resolved 2016-12-03 22:46:00 The incident has been resolved.
Monitoring 2016-12-03 22:19:00 We have re-enabled no data alerts and are monitoring the recovery.
Identified 2016-12-03 22:10:00 We've identified the cause as a failed DB node, which has been taken out of rotation; payload processing is normalizing. We will be turning no data alerts back on in a few minutes.
Update 2016-12-03 21:20:00 We're continuing to work on the issue with our payload processing clusters. Alerts are being processed at a slower rate, and we are still keeping no data alerts off.
Investigating 2016-12-03 20:56:00 We're looking into an issue with processing payloads. Alerts may trigger more slowly than expected or not at all while this incident is ongoing, and we have disabled no data alerts as a precaution.

Virtual Server Maintenance

2016-11-19 13:14:50
Action Date Description
Completed 2016-11-19 13:14:00 This maintenance is now complete. The impact was: graphs did not receive updates between 08:04 and 08:08 UTC and will show that gap until it evens out in 3 days. The UI was inaccessible for a total of 8 minutes, between 09:32 and 09:35 UTC and between 11:05 and 11:10 UTC.
Verifying 2016-11-19 11:33:00 The third and last batch of this maintenance is complete. UI was inaccessible between 11:05 and 11:10 UTC.
Update 2016-11-19 09:49:00 The second batch of this maintenance is complete. UI was inaccessible between 09:32 and 09:35 UTC. The next batch will start at 11:00 UTC.
Update 2016-11-19 08:26:00 The first batch of this maintenance is complete. Graphs did not receive updates between 08:04 and 08:08 UTC. The next batch will start at 09:30 UTC.
Scheduled 2016-11-19 08:00:00 Our infrastructure provider has been notified of a potential vulnerability affecting our Virtual Servers. The remediation will require maintenance to the hypervisor nodes and a reboot of all the virtual server instances on those nodes. Given our redundancy, historically these reboots have caused just a few minutes of downtime while the rebooted nodes recover. We're scheduling a wider maintenance window to account for all the reboots across the entire fleet; however, we don't expect more than a few minutes of downtime during this window.

Decrease in processed payloads

2016-11-13 00:22:44
Action Date Description
Identified 2016-11-13 00:22:00 A network issue degraded payload processing; no data protection caused some false positives.
Resolved 2016-11-13 00:22:00 The network issue is now resolved and everything is back to normal.
Investigating 2016-11-12 23:38:00 We are seeing a decrease in processed payloads and have disabled no data alerts while we investigate.

Slowdown in payload processing

2016-11-29 18:34:41
Action Date Description
Postmortem 2016-11-29 18:34:00 On November 8, between 16:18 and 17:15 UTC, we experienced elevated latency to and from the primary server of one of our database clusters, which caused elevated response times in payload processing and alerting. We run our MongoDB clusters using a replica set configuration. This allows us to quickly fail over to a secondary database server when the primary fails. Unfortunately, this wasn't a complete failure: the elevated latency caused the secondaries to fall behind on operations replicating from the primary. Our monitoring picked up the secondaries falling behind, which allowed us to quickly identify the root cause of the issue, but by that time, with all secondaries behind, we no longer had the option of a fast failover. We then decided to pause replication on the secondaries to alleviate the traffic, and shortly after we saw improved latency measurements. Once the network latency returned to normal, we proceeded to resynchronize the secondaries and resume replication. We have been working with our infrastructure provider SoftLayer for the last 2 weeks, but they have yet to pinpoint the actual cause. Another update will be made once we have a root cause analysis from SoftLayer.
Resolved 2016-11-09 16:04:00 We have not observed high latency in the affected cluster in the last 16 hours, and all systems have been running normally. We're still working with our provider to get a root cause for this event and will share it once we have it.
Monitoring 2016-11-08 22:56:00 We have now completed the synchronization of the affected database cluster's secondaries and will continue to monitor the network latency that caused this issue over the next few hours.
Update 2016-11-08 18:22:00 Although the measured latency is still showing high variance, we are starting to see mostly normal values. As such, device no data alerts have been re-enabled and Server Density is operating normally again. We will continue to work with our provider until full recovery, as well as work on catching up the affected database cluster's secondaries.
Update 2016-11-08 17:16:00 We are measuring improved latency now, but it is still higher than normal. Service is operating normally except for device no data alerts, which we are keeping disabled until we measure normal latency again.
Update 2016-11-08 17:02:00 The database cluster slowdown has been identified as caused by high network latency. We are now working with our provider to restore normal service. We have identified the following service impact: device payload processing is slower and gaps may show on graphs; alerts may be delayed by up to 5 minutes; device no data alerts are disabled; API calls may occasionally return 500 errors.
Identified 2016-11-08 16:33:00 We're experiencing a slowdown on one of our database clusters. We're working to restore service responsiveness.
Investigating 2016-11-08 16:00:00 We're currently investigating a slowdown in device payload processing.
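
The failover constraint described in the postmortem above comes down to replication lag: a secondary that has fallen behind cannot be promoted quickly. As an illustrative sketch only (placeholder connection string; not Server Density's monitoring code), replication lag can be read from MongoDB's replSetGetStatus command, for example with pymongo:

    # Illustrative replication-lag check (placeholder host; not Server
    # Density's actual monitoring code). Run against a replica set member.
    from pymongo import MongoClient

    client = MongoClient("mongodb://db.example.com:27017/")  # hypothetical host
    status = client.admin.command("replSetGetStatus")

    # optimeDate is the timestamp of the last operation applied on each member.
    primary_optime = next(
        m["optimeDate"] for m in status["members"] if m["stateStr"] == "PRIMARY"
    )

    for member in status["members"]:
        if member["stateStr"] == "SECONDARY":
            lag = (primary_optime - member["optimeDate"]).total_seconds()
            # A secondary lagging far behind cannot be promoted quickly,
            # which is what ruled out a fast failover in this incident.
            print(f"{member['name']} is {lag:.0f}s behind the primary")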

Dashboard graphs not loading

2016-11-29 11:59:46
Action Date Description
Postmortem 2016-11-29 11:59:00 On November 3, between 12:25 and 12:32 UTC, during a planned maintenance operation on one of our MongoDB clusters, an election took place earlier than expected, causing a server still under maintenance to assume the primary role. As the service clusters had not yet been configured to allow access to this server, the UI dashboards didn't load. The election was manually reverted a couple of minutes later and service was restored. We have reviewed and updated this particular maintenance procedure to prevent this from happening again.
Resolved 2016-11-03 15:52:00 We have now forced a new election on this cluster to confirm it is working as expected. A postmortem for this event will be published in the coming days.
Update 2016-11-03 12:58:00 Only those devices that received a provisioning update and a new agent key (using Puppet, Chef, or Ansible) during the election showed missing data.
Update 2016-11-03 12:47:00 Some devices are showing missing data. We're continuing to investigate.
Identified 2016-11-03 12:32:00 One of our MongoDB clusters was mid-election and it took longer than usual. Graphs on dashboards are loading now, and we'll soon confirm whether there's any other underlying issue.
Investigating 2016-11-03 12:27:00 We're investigating a slowdown in loading dashboard graphs.
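
A generic way to avoid the failure mode in the postmortem above is to make the member that is still under maintenance ineligible for election, for example with MongoDB's replSetFreeze command. The sketch below is only an illustration under assumed host names; it is not the revised procedure Server Density adopted:

    # Generic illustration of keeping a replica set member out of elections
    # during maintenance (hypothetical host; not the procedure from the
    # postmortem above).
    from pymongo import MongoClient

    MAINTENANCE_HOST = "mongodb://db-2.example.com:27017/"  # member under maintenance

    # Connect directly to the member being worked on and freeze it so it
    # cannot seek election for the next 10 minutes.
    member = MongoClient(MAINTENANCE_HOST, directConnection=True)
    member.admin.command("replSetFreeze", 600)

    # ... perform maintenance on this member ...

    # Unfreeze once maintenance is done (0 makes the member eligible again).
    member.admin.command("replSetFreeze", 0)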