Incident documentation/2021-05-12 LocalisationCache

From translatewiki.net

Summary

The site was down for everyone for 6 minutes, and some interface messages were not displayed correctly for 30 minutes after that.

Primary issue: MWException: Error: invalid magic word 'template_params'. LocalisationCache is shared, so as soon as it was updated from the working directory, the site went down. The magic word was introduced recently, but it is unclear why it would go missing from the cache unless we downgraded the PageForms extension. PageForms is pinned to a specific hash in our composer.local.json.

Secondary issue: Some interface messages were missing. This was noticed on Special:Translate. The root cause is that our tags did not contain the Translate/i18n/core directory, so when we rebuilt LocalisationCache from a tag, those messages were not present (though grepping did seem to indicate matches in the cdb files). We already had a fix for this, but it was only deployed during today's window, so it did not help: we used a previously deployed tag that still had this issue.
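The secondary issue suggests a simple pre-flight check before rebuilding LocalisationCache from a tag: verify that the checkout actually contains the directories the rebuild depends on. The following is a minimal sketch, not part of our tooling. The function name `missing_paths` and the list `REQUIRED_DIRS` are hypothetical, and the assumption that Translate/i18n/core sits directly under the checkout root may not match the real layout.

```python
from pathlib import Path
from typing import Iterable, List

# Assumption: paths are relative to the root of the checked-out tag.
# Only Translate/i18n/core is named in this report; adjust to the real layout.
REQUIRED_DIRS = ["Translate/i18n/core"]


def missing_paths(checkout_root: Path, required: Iterable[str]) -> List[str]:
    """Return the required relative paths that are not directories under checkout_root."""
    return [rel for rel in required if not (checkout_root / rel).is_dir()]


# Example use (hypothetical path):
#   missing = missing_paths(Path("/path/to/tag"), REQUIRED_DIRS)
#   if missing:
#       raise SystemExit(f"Refusing to rebuild, missing: {missing}")
```

Running such a check before the rebuild would turn silently missing messages into an explicit failure.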

Timeline

  • We ran twn-update-all
  • 14.29Z outage started as twn-update-all finished
  • 14.35Z outage ended, as we rebuilt LocalisationCache
  • We tried to debug and fix the missing interface messages in various ways
  • 15.05Z all issues should have been fixed, as we deployed a new tag
  • 15.43Z deployment concluded after QA
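The durations quoted in the summary can be cross-checked against the timeline. The sketch below does that arithmetic; the helper name `minutes_between` is hypothetical, and it assumes all timestamps fall on the same UTC day and use the "HH.MMZ" notation of this page.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Minutes between two same-day UTC times written as 'HH.MMZ'."""
    fmt = "%H.%MZ"  # the trailing Z is matched as a literal character
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


outage = minutes_between("14.29Z", "14.35Z")    # site down for everybody
degraded = minutes_between("14.35Z", "15.05Z")  # missing interface messages
print(outage, degraded)  # prints "6 30"
```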

Conclusions

What went well?

  • We had three people on a video call for the duration of the deployment, so we could work together effectively to figure out solutions.
  • We noticed quickly that the site was down, thanks to both monitoring and watching the error logs.

What went poorly?

  • Things were confusing because we were working outside normal conditions. We were running on 1.36 instead of master due to Phab:T281688. That caused extra issues, and it was not clear which of them were harmless. We were able to fix the primary issue, but then we hit a secondary issue because we did something we do not normally do (rebuild the localisation cache from a deployed tag).

Where did we get lucky?

  • We already had a patch that would have prevented the secondary problem. We were unlucky with the timing, though: had this incident happened a week later, we would have avoided the secondary problem thanks to the patch.

How many people were involved?

  • 3 deployers

Action items

  • (done) Phab:T282994 LocalisationCache should be part of the tags.
  • (open) Phab:T282995 Consider whether we should simplify our code update system so that it does not use a mix of composer and git checkouts.