Transcoder Service: Increased error rate
Incident Report for Xvid MediaHub
Resolved
This issue is now resolved. Our cloud provider has confirmed that a faulty upgrade rolled out to their load balancers caused the random corruption of files retrieved from storage, and that the change has since been rolled back everywhere. After a prolonged monitoring phase, we have confirmed that the issue is indeed resolved.
Posted Sep 08, 2023 - 17:27 EEST
Update
The situation has largely normalized, and we have not seen any data corruption issues in the past few hours. However, we continue to monitor all systems closely.
Posted Sep 06, 2023 - 21:54 EEST
Update
We now see sporadic data corruption issues in the failover region we migrated to as well, which indicates that our upstream provider is continuing to roll out the faulty patch to further regions instead of rolling it back. Unfortunately, the situation is therefore not improving. The overall success rate of transcoder jobs is currently still OK, but AutoGraph-enabled jobs appear to be more affected: they need to store and retrieve many more files internally, which increases the probability that a corruption occurs and the job fails.
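As a rough illustration of why jobs with more file transfers are hit harder, the sketch below assumes that each store/retrieve operation is corrupted independently with the same probability and that a single corruption fails the whole job. The function name and the numbers are hypothetical and chosen only to show the trend; they are not measurements from this incident.

```python
def job_failure_probability(p_corrupt: float, n_transfers: int) -> float:
    """Probability that at least one of n independent transfers is corrupted.

    Assumes every transfer has the same corruption probability and that
    one corrupted transfer is enough to fail the job.
    """
    if not 0.0 <= p_corrupt <= 1.0:
        raise ValueError("p_corrupt must be between 0 and 1")
    if n_transfers < 0:
        raise ValueError("n_transfers must not be negative")
    # The job succeeds only if all n transfers are clean: (1 - p) ** n.
    return 1.0 - (1.0 - p_corrupt) ** n_transfers


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    p = 0.001
    for label, n in (("plain job", 20), ("AutoGraph-enabled job", 500)):
        print(f"{label} ({n} transfers): {job_failure_probability(p, n):.1%}")
```

Under these assumptions the failure probability grows quickly with the number of transfers, which matches the observation that AutoGraph-enabled jobs failed more often than plain transcoder jobs.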
Posted Sep 06, 2023 - 13:15 EEST
Monitoring
The backlog has now been processed. We continue monitoring the situation.
Posted Sep 06, 2023 - 03:25 EEST
Identified
Our upstream provider has still not been able to solve the issue, so we reconfigured and redeployed our cluster to use storage in another region. Things look a lot better now. However, there is still a backlog of pending jobs to process, so performance is not yet back to normal levels.
Posted Sep 06, 2023 - 01:38 EEST
Update
We are still investigating the issue. We suspect a problem with our internal storage and are in contact with our upstream provider.
Posted Sep 05, 2023 - 17:45 EEST
Investigating
We are seeing an increased error rate with a higher number of jobs failing than usual. We are investigating.
Posted Sep 05, 2023 - 16:05 EEST
This incident affected: Transcoder Service.