StatusBacon

Datadog

Monitoring that tracks your dynamic infrastructure.

Homepage: https://datadoghq.com
Status Page: https://status.datadoghq.com/


Open Incidents

No open incidents

Scheduled Incidents

No scheduled incidents

Previous Incidents

Delayed AWS metrics.

2017-01-30 21:36:05
Action Date Description
Resolved 2017-01-30 21:36:00 EC2 metrics have been backfilled for impacted customers.
Monitoring 2017-01-30 21:09:00 We have identified and fixed the issue. Some aws.ec2 metrics may still show gaps for a subset of customers; we are working on backfilling them.
Update 2017-01-30 20:45:00 We are continuing to investigate transient issues for EC2 metrics. Metrics are being collected and alerting has been enabled for AWS integrations. If you identify gaps in `aws.ec2` metrics, please email help@datadog.com.
Investigating 2017-01-30 20:22:00 We're seeing an issue with some AWS API endpoints used to retrieve metrics. As a result of this latency, AWS metrics on graphs may be delayed. These delays may result in "no data" alert conditions for Metric Monitors. To avoid spurious alerts, we've temporarily disabled these alert types. It's important to note that metrics are delayed but will be backfilled once all services are operational.

False positives on a subset of monitors

2017-01-27 00:19:57
Action Date Description
Resolved 2017-01-27 00:19:00 The issue has been resolved.
Identified 2017-01-27 00:02:00 We've identified an issue causing false positives on a subset of monitors. A fix has been put in place.

Delayed Metrics

2017-01-26 16:22:08
Action Date Description
Resolved 2017-01-26 16:22:00 This incident has been resolved.
Investigating 2017-01-26 16:02:00 We're actively investigating increased metric intake latencies. As a result of this latency, metrics on graphs may be delayed. These delays may result in "no data" alert conditions for Metric Monitors. To avoid spurious alerts, we've temporarily disabled these alert types. It's important to note that metrics are delayed but will be backfilled once all services are operational.

Delayed Alert Evaluation

2017-01-19 13:17:49
Action Date Description
Resolved 2017-01-19 13:42:00 We are all caught up and the issue is resolved.
Monitoring 2017-01-19 13:35:00 The fix is in place and we are monitoring the recovery from the delays.
Identified 2017-01-19 13:17:00 Some customers may be experiencing delays in their alerts. We have identified the issue and we are putting a fix in place.

Delayed Metrics

2017-01-13 21:36:15
Action Date Description
Resolved 2017-01-13 22:30:00 This incident has been resolved.
Monitoring 2017-01-13 22:19:00 Metrics are caught up and "no data" alerts are on. Customers may observe missing data points on a subset of metrics between 21:28 and 21:37 UTC.
Identified 2017-01-13 21:48:00 The issue has been identified and a fix has been put in place. "No data" alerts are disabled while we catch up.
Investigating 2017-01-13 21:36:00 There is an issue with delayed metrics.

False positives from event monitors

2017-01-11 22:23:27
Action Date Description
Resolved 2017-01-11 22:37:00 We've fixed the issue that caused some false positive alerts to be sent from event monitors between 17:02 and 17:16 EST. Event monitors have been re-enabled.
Monitoring 2017-01-11 22:34:00 We've fixed the issue that caused some false positive alerts to be sent from event monitors between 17:02 and 17:16 EST. Event monitors have been re-enabled.
Investigating 2017-01-11 22:23:00 We are investigating reports of false positives from event monitors. Event monitors are disabled while we investigate.

Agent metric processing is currently delayed

2017-01-10 21:37:05
Action Date Description
Resolved 2017-01-10 22:48:00 The metrics, event, and service check pipeline is fully operational. All "no data" alerts are re-enabled. This issue is resolved.
Monitoring 2017-01-10 22:37:00 Metrics, event, and service check processing has caught up. "No data" alerts have been re-enabled. We are monitoring progress.
Update 2017-01-10 22:10:00 A fix has been implemented and we are recovering from delays in metric, event, and service check processing. As we recover, metrics, events, and service checks may be delayed for some customers. These delays may result in "no data" alert conditions for monitors. To avoid spurious alerts, "no data" alerts remain disabled. It's important to note that metrics, events, and service checks are delayed but will be backfilled once all services are operational.
Identified 2017-01-10 21:42:00 The issue has been identified and a fix is being implemented. "No data" alerts are disabled.
Update 2017-01-10 21:38:00 We are investigating delays in metric processing. "No data" alerts are disabled while we investigate.
Investigating 2017-01-10 21:37:00 We are investigating delays in metric processing. Alerts are disabled while we investigate.

AWS metrics are delayed

2017-01-10 21:30:11
Action Date Description
Resolved 2017-01-10 22:22:00 Errors from AWS EC2 endpoints have subsided and delays in aws.* metrics have recovered. This issue is resolved.
Monitoring 2017-01-10 22:08:00 AWS EC2 API endpoints are recovering. We are progressively resuming collection of aws.* metrics. "No data" alerts are still disabled for aws.* metrics.
Identified 2017-01-10 21:52:00 Metrics in the aws.* namespace are currently delayed due to increased error rates from AWS EC2 endpoints. We've identified the issue and are working with AWS to resolve it. "No data" alerts are disabled for aws.* metrics.
Investigating 2017-01-10 21:30:00 Metrics in the aws.* namespace are currently delayed. We're investigating the issue.

Delays in metrics processing

2017-01-04 06:32:54
Action Date Description
Resolved 2017-01-04 07:14:00 This incident has been resolved.
Monitoring 2017-01-04 07:13:00 The issue has been solved and we're monitoring the situation.
Identified 2017-01-04 06:49:00 The issue has been identified and a fix is being implemented.
Investigating 2017-01-04 06:32:00 We are investigating delays in metric processing that affect a small fraction of our customers.

Some screenboards responding with 500 errors

2017-01-03 19:15:53
Action Date Description
Resolved 2017-01-03 19:21:00 This incident has been resolved.
Monitoring 2017-01-03 19:15:00 A fix has been deployed. We're monitoring the situation.
Identified 2017-01-03 19:13:00 Some screenboards are returning 500 errors. We've identified the culprit and we're working on fixing it.