Transcoder Service: Increased error rate

Incident Report for Xvid MediaHub

Resolved

This issue is now resolved. Our cloud provider has confirmed that a faulty upgrade to their load balancers caused the random corruption of files retrieved from storage, and that they have since rolled the change back everywhere. After a prolonged monitoring phase, we have confirmed that the issue is indeed resolved.
Posted Sep 08, 2023 - 17:27 EEST

Update

The situation has largely normalized, and we have not seen any data corruption issues in the past few hours. However, we continue to monitor all systems closely.
Posted Sep 06, 2023 - 21:54 EEST

Update

Sporadic data corruption issues have now also appeared in the failover region we migrated to, which indicates that our upstream provider is continuing to roll out the faulty patch to further regions instead of rolling it back. Unfortunately, the situation is therefore not improving. The overall success rate of transcoder jobs is currently still OK, but AutoGraph-enabled jobs appear to be more affected: they store and retrieve many more files internally, which increases the probability that a corruption occurs and the job fails.
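
As a rough illustration of why jobs with more file operations are hit harder: if each store/retrieve operation is corrupted independently with probability p, a job performing n such operations fails with probability 1 - (1 - p)^n. The sketch below computes this. The independence assumption, the function name, and the example numbers are all hypothetical and are not measurements from this incident.

```python
def job_failure_probability(p_corrupt: float, n_file_ops: int) -> float:
    """Chance that at least one of n independent file operations is corrupted.

    Assumes (hypothetically) that corruptions are independent and that a
    single corrupted file is enough to fail the whole job.
    """
    if not 0.0 <= p_corrupt <= 1.0:
        raise ValueError("p_corrupt must be between 0 and 1")
    if n_file_ops < 0:
        raise ValueError("n_file_ops must be non-negative")
    # Probability that every operation succeeds is (1 - p)^n.
    return 1.0 - (1.0 - p_corrupt) ** n_file_ops


if __name__ == "__main__":
    # Illustrative values only: not real corruption rates or file counts.
    for label, n_ops in (("plain job", 10), ("AutoGraph-enabled job", 200)):
        print(f"{label}: {job_failure_probability(0.001, n_ops):.2%}")
```

Under these assumptions, even a small per-file corruption rate leads to a noticeably higher failure rate for jobs that touch many files.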
Posted Sep 06, 2023 - 13:15 EEST

Monitoring

The backlog has now been processed. We continue monitoring the situation.
Posted Sep 06, 2023 - 03:25 EEST

Identified

Our upstream provider has still not been able to solve the issue, so we reconfigured and redeployed our cluster to use storage in another region. Things look a lot better now. However, there is still a backlog of pending jobs to process, so performance is not yet back to normal levels.
Posted Sep 06, 2023 - 01:38 EEST

Update

We are still investigating the issue. We suspect a problem with our internal storage and are in contact with our upstream provider.
Posted Sep 05, 2023 - 17:45 EEST

Investigating

We are seeing an increased error rate with a higher number of jobs failing than usual. We are investigating.
Posted Sep 05, 2023 - 16:05 EEST
This incident affected: Transcoder Service.