2023-07-31 Incident Postmortem

Julien Danjou

In the dynamic landscape of software development, unforeseen challenges can occasionally arise, bringing with them valuable lessons and reinforcing the importance of collaboration. We'd like to provide an in-depth account of a recent incident that put these principles to the test.

The Genesis of the Incident

On July 31, 2023, our day started with an alert. At 11:53 UTC, a support case flagged an issue with the merge queue: a pull request was unexpectedly dequeued with the message Base does not exist. Swiftly, we mobilized our team to investigate.

As the hours progressed, our internal monitoring tools sounded an alarm: there was an uptick in unexpected GitHub API status codes, specifically while our system, Mergify, was creating or deleting draft pull requests.

Pinpointing the Culprit

By 15:12 UTC, we uncovered a peculiar behavior. The Git branches and commits we created through GitHub's Git Database API were not being recognized immediately by GitHub's Repository and Pulls APIs. It was as though we were putting together pieces of a puzzle, but some pieces, although correctly placed, would momentarily vanish.
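
To illustrate the read-after-write pattern involved, here is a minimal sketch, assuming a personal access token, an example repository, and an existing commit SHA; the owner, repository, branch name, and SHA below are placeholders for illustration, not our production code:

```python
import os
import time

import requests

# Placeholder values for illustration only.
OWNER, REPO = "example-org", "example-repo"
TOKEN = os.environ["GITHUB_TOKEN"]
NEW_SHA = "0123456789abcdef0123456789abcdef01234567"  # an existing commit SHA

session = requests.Session()
session.headers.update({
    "Authorization": f"Bearer {TOKEN}",
    "Accept": "application/vnd.github+json",
})

# 1. Create a branch through the Git Database API.
resp = session.post(
    f"https://api.github.com/repos/{OWNER}/{REPO}/git/refs",
    json={"ref": "refs/heads/mergify-example-branch", "sha": NEW_SHA},
)
resp.raise_for_status()

# 2. Immediately read it back through the Repository API.
# During the incident, this read could transiently return 404 even though
# the ref had just been created successfully.
for attempt in range(5):
    check = session.get(
        f"https://api.github.com/repos/{OWNER}/{REPO}/branches/mergify-example-branch"
    )
    if check.status_code == 200:
        print("branch visible")
        break
    print(f"attempt {attempt}: got {check.status_code}, retrying")
    time.sleep(2 ** attempt)
```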

This behavior manifested in two primary ways for our customers:

  • Pull requests would be incorrectly dequeued, leading to error messages such as No commits between XXXX and YYYY or Base does not exist.
  • Some experienced a merge queue lag, displaying: This queue is waiting for a batch to fill up.

In the face of this challenge, we took decisive steps to mitigate the situation.

Our Immediate Actions

Understanding the gravity of the issue, we implemented a retry mechanism in various code paths to ensure our services remained as seamless as possible. Our dedication to transparency prompted us to make the incident public, ensuring our customers remained informed.
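The retry logic lives inside Mergify's own code paths, but the general idea can be sketched as follows; this is an illustrative helper under assumed parameters (which status codes to retry, how many attempts, what backoff), not our actual implementation:

```python
import time
from typing import Callable

import requests


def retry_on_replication_lag(
    call: Callable[[], requests.Response],
    retryable_statuses: frozenset[int] = frozenset({404, 422}),
    max_attempts: int = 5,
    base_delay: float = 1.0,
) -> requests.Response:
    """Retry a GitHub API call that may fail transiently while replicas catch up."""
    for attempt in range(max_attempts):
        response = call()
        if response.status_code not in retryable_statuses:
            return response
        # Exponential backoff before retrying the read.
        time.sleep(base_delay * (2 ** attempt))
    return response


# Example usage (session, OWNER, REPO, and the PR number are placeholders):
# response = retry_on_replication_lag(
#     lambda: session.get(f"https://api.github.com/repos/{OWNER}/{REPO}/pulls/1234")
# )
```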

In parallel, our team was in constant communication with GitHub support. Armed with comprehensive logs and data, we highlighted the API anomalies. Through diligent monitoring and continuous refinement of our strategies, we kept the situation under control.

The Resolution

By August 1st, GitHub's dedicated team acknowledged the unexpected change in API behavior, attributing it to a caching-related feature flag that had been introduced earlier. This change inadvertently induced replication lag, causing newly created refs to return 404 errors when read back, exactly the behavior we had observed. Thankfully, once identified, the change was promptly reversed.

On August 2nd, GitHub confirmed that this incident would be documented in their upcoming availability report, reflecting their commitment to transparency.

In Retrospect

This incident, while challenging, highlighted the strength of our partnership with GitHub and our shared commitment to the developer community. We remain dedicated to providing an unparalleled service experience and are continuously refining our processes to adapt to the ever-evolving software environment.

Thank you for your patience, trust, and understanding throughout this incident. Together, we learn, adapt, and grow.