Transcoder service: Jobs stalling

Incident Report for Xvid MediaHub

Resolved

This incident has been resolved.
Posted Oct 01, 2021 - 06:00 EEST

Update

We are continuing to monitor for any further issues.
Posted Oct 01, 2021 - 04:39 EEST

Monitoring

The problem was not CentOS 7 after all: Python also brings its own CA bundle via the certifi package, and that bundle did not trust the new Let's Encrypt root. We have implemented a fix, but it will take some time until it is deployed to all running workers. In the meantime, many jobs have queued up and the backlog is large, so transcoding is still slower than usual. We are monitoring the situation and hope throughput will improve soon, once the backlog has been processed.
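A Python worker can verify TLS against two different trust stores: the operating-system bundle that the standard-library ssl module locates through OpenSSL, and the bundle shipped inside the certifi package, which libraries such as requests use by default. The following is a minimal sketch, assuming only the standard library plus an optionally installed certifi, that reports which bundle files a worker would consult. The helper name ca_bundle_sources is hypothetical and not part of our codebase.

```python
import ssl
from typing import Dict, Optional


def ca_bundle_sources() -> Dict[str, Optional[str]]:
    """Report CA bundle locations a Python worker may use (hypothetical helper)."""
    paths = ssl.get_default_verify_paths()
    sources: Dict[str, Optional[str]] = {
        # Bundle file/dir OpenSSL loads for ssl.create_default_context();
        # None if the path does not exist on this system
        "system_cafile": paths.cafile,
        "system_capath": paths.capath,
        # Bundle shipped with certifi; requests uses it unless overridden
        "certifi_cafile": None,
    }
    try:
        import certifi  # third-party package; may not be installed
        sources["certifi_cafile"] = certifi.where()
    except ImportError:
        pass
    return sources


if __name__ == "__main__":
    for name, location in ca_bundle_sources().items():
        print(f"{name}: {location}")
```

If the two locations differ, the system bundle and the certifi bundle can disagree about which roots are trusted. Upgrading certifi, or pointing requests at the system bundle through the REQUESTS_CA_BUNDLE environment variable, are two common ways to change this; the report does not say which approach the fix above used.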
Posted Oct 01, 2021 - 03:17 EEST

Identified

The IdentTrust DST Root CA X3 certificate, the older root in Let's Encrypt's certificate chain, expired today. The new root that our upstream cloud provider's certificates now rely on is not trusted on CentOS 7, which caused all of our workers to disconnect. We are trying to manually add the new CA and re-establish a valid trust chain.
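Whether a root has expired comes down to comparing its notAfter timestamp with the current time, and manually adding a CA amounts to loading an extra PEM file into the client's trust store. The sketch below illustrates both with the standard-library ssl module. The helper names and the extra_cafile parameter are hypothetical; the notAfter value is the published expiry of DST Root CA X3 (Sep 30 14:01:15 2021 GMT).

```python
import ssl
import time
from typing import Optional

# Published notAfter of the IdentTrust DST Root CA X3 certificate
DST_ROOT_CA_X3_NOT_AFTER = "Sep 30 14:01:15 2021 GMT"


def is_expired(not_after: str, now: Optional[float] = None) -> bool:
    """Return True if the given notAfter time (OpenSSL format, GMT) has passed."""
    if now is None:
        now = time.time()
    # cert_time_to_seconds parses "Mon DD HH:MM:SS YYYY GMT" as UTC
    return ssl.cert_time_to_seconds(not_after) < now


def make_client_context(extra_cafile: Optional[str] = None) -> ssl.SSLContext:
    """Build a verifying client context, optionally trusting an extra CA file."""
    # System CAs, certificate verification required, hostname checking on
    ctx = ssl.create_default_context()
    if extra_cafile is not None:
        # Adds every CA certificate in this PEM file to the trust store
        ctx.load_verify_locations(cafile=extra_cafile)
    return ctx


if __name__ == "__main__":
    # 00:53 EEST on Oct 01, 2021 is 21:53 UTC on Sep 30, 2021
    incident_start = ssl.cert_time_to_seconds("Sep 30 21:53:00 2021 GMT")
    print(is_expired(DST_ROOT_CA_X3_NOT_AFTER, incident_start))  # True
    ctx = make_client_context()
    print(ctx.verify_mode == ssl.CERT_REQUIRED)  # True
```

Adding a CA this way only helps clients that actually use this context; code that verifies against the certifi bundle instead will not see the extra certificate.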
Posted Oct 01, 2021 - 01:13 EEST

Investigating

Many jobs are currently hanging and not completing. Our upstream cloud provider has reported an incident related to today's expiry of the Let's Encrypt root CA certificate, and the upstream API is unavailable. We are investigating.
Posted Oct 01, 2021 - 00:53 EEST
This incident affected: Transcoder Service.