<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Bitmovin Status - Incident history</title>
    <link>https://status.bitmovin.com</link>
    <description>Bitmovin</description>
    <pubDate>Wed, 28 Jan 2026 17:15:00 +0000</pubDate>
    
<item>
  <title>Encoding Job Processing in AWS eu-west-1 (Ireland)</title>
  <description>
    Type: Incident
    Duration: 2 hours and 9 minutes

    
    Jan 28, 17:15:00 GMT+0 - Identified - Jobs started in AWS eu-west-1 (Ireland) may be affected by an upstream networking issue on AWS EC2.
    Jan 28, 17:48:00 GMT+0 - Monitoring - We implemented a fix and are currently monitoring the result.
    Jan 28, 19:56:54 GMT+0 - Resolved - This incident has been resolved.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 2 hours and 9 minutes</p>
    
    &lt;p&gt;&lt;small&gt;Jan &lt;var data-var=&#039;date&#039;&gt; 28&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;17:15:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Jobs started in AWS eu-west-1 (Ireland) may be affected by an upstream networking issue on AWS EC2.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Jan &lt;var data-var=&#039;date&#039;&gt; 28&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;17:48:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Monitoring&lt;/strong&gt; -
  We implemented a fix and are currently monitoring the result.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Jan &lt;var data-var=&#039;date&#039;&gt; 28&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;19:56:54&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
  This incident has been resolved.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Wed, 28 Jan 2026 17:15:00 +0000</pubDate>
  <link>https://status.bitmovin.com/incident/cmkybiou403osah4d8a4gzskw</link>
  <guid>https://status.bitmovin.com/incident/cmkybiou403osah4d8a4gzskw</guid>
</item>

<item>
  <title>Scheduled Database Maintenance</title>
  <description>
    Type: Maintenance
    Duration: 5 minutes

    Affected Components: Dashboard, Query Service, Export Service, Alerting Service, Encoding API, Amazon Web Services, Oracle Cloud Infrastructure, Microsoft Azure, Akamai Cloud, Google Cloud, Encoding Scheduler, Statistics Service, Account Service, Billing Service, Notification Service, Encoding, Cloud Provisioning, Bitmovin Dashboard
    Dec 9, 09:00:00 GMT+0 - Identified - Our database will undergo scheduled maintenance on Tuesday, 2025-12-09, between 09:00 AM and 10:00 AM (UTC) for routine security updates.  
  
**Expected Impact: None**. Thanks to our improved infrastructure, we don&#039;t anticipate any service disruption. However, we&#039;re notifying you as a precaution.  
  
Your encodings and workflows should continue without interruption.  
Thank you for your understanding.

Dec 9, 09:00:01 GMT+0 - Identified - Maintenance is now in progress.
Dec 9, 09:05:00 GMT+0 - Completed - Maintenance has completed successfully.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 5 minutes</p>
    <p><strong>Affected Components:</strong> Dashboard, Query Service, Export Service, Alerting Service, Encoding API, Amazon Web Services, Oracle Cloud Infrastructure, Microsoft Azure, Akamai Cloud, Google Cloud, Encoding Scheduler, Statistics Service, Account Service, Billing Service, Notification Service, Encoding, Cloud Provisioning, Bitmovin Dashboard</p>
    &lt;p&gt;&lt;small&gt;Dec &lt;var data-var=&#039;date&#039;&gt; 9&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;09:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Our database will undergo scheduled maintenance on Tuesday, 2025-12-09, between 09:00 AM and 10:00 AM (UTC) for routine security updates.  
  
**Expected Impact: None**. Thanks to our improved infrastructure, we don&#039;t anticipate any service disruption. However, we&#039;re notifying you as a precaution.  
  
Your encodings and workflows should continue without interruption.  
Thank you for your understanding.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Dec &lt;var data-var=&#039;date&#039;&gt; 9&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;09:00:01&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Maintenance is now in progress.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Dec &lt;var data-var=&#039;date&#039;&gt; 9&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;09:05:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Completed&lt;/strong&gt; -
  Maintenance has completed successfully.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Tue, 9 Dec 2025 09:00:00 +0000</pubDate>
  <link>https://status.bitmovin.com/maintenance/cmiov56zv001a1m4njhuuupno</link>
  <guid>https://status.bitmovin.com/maintenance/cmiov56zv001a1m4njhuuupno</guid>
</item>

<item>
  <title>Increased Error Rates on Encoding API</title>
  <description>
    Type: Incident
    Duration: 5 hours and 25 minutes

    Affected Components: Encoding API, Amazon Web Services, Oracle Cloud Infrastructure, Microsoft Azure, Akamai Cloud, Google Cloud, Encoding Scheduler, Notification Service, Encoding, Cloud Provisioning
    Nov 23, 21:30:18 GMT+0 - Identified - We have identified the root cause of the issue and implemented a fix; encoding backlog processing has sped up but will take more time to fully recover the overall system state.

Nov 23, 22:02:10 GMT+0 - Identified - Our internal system state has recovered and we are slowly increasing parallel processing throughput for the queued encoding backlog.

Nov 23, 22:24:37 GMT+0 - Monitoring - We are increasing parallel encoding slots every 10 minutes, and the queued encoding backlog built up during the incident will be finished soon. We encourage our customers to restart potentially stopped workflows again.

Nov 23, 22:31:53 GMT+0 - Resolved - Our systems are stable, and we are no longer observing any errors or delays.

A Root Cause Analysis (RCA) will follow once our internal review is completed. Thank you for your patience throughout this disruption.

Nov 23, 17:07:07 GMT+0 - Investigating - We are currently investigating this incident.

Nov 23, 17:39:54 GMT+0 - Identified - We are currently experiencing a **partial outage** of the Encoding API. Error rates remain elevated and queued encodings continue to accumulate.

To stabilise the platform, we have temporarily reduced maximum processing concurrency. While this measure helps improve system stability, it also impacts throughput and leads to longer wait times for queued jobs.

Our engineering team is actively working on restoring normal performance. We will provide further updates as we make progress.

Nov 23, 18:18:50 GMT+0 - Identified - We have successfully reduced the Encoding API error rate; however, our services are still struggling to process the backlog of queued service messages.

To help stabilise the system, we have further lowered the VOD processing limits and continue to actively work on clearing the queue and restoring full service performance.

Webhook notifications are also lagging behind.

Further updates will follow as we make progress.

Nov 25, 10:45:28 GMT+0 - Postmortem - ## **Root Cause Analysis – Encoding Service Outage 23 Nov 2025**

**Summary** 
On 23 Nov 2025, the Bitmovin Encoding Platform experienced a service outage affecting our VoD and Live encoding pipeline. For the duration of the outage, encoding operations such as starting and stopping encodings were temporarily unavailable, and status updates for running encodings were unavailable as well.

## **Root Cause**

A **bug in one of our encoding services** caused memory usage and database (DB) data transfer volumes to increase slowly but steadily over the course of approximately one month.  

This resulted in:

* Gradually **escalating DB read traffic** from the encoding services due to increasing object size with time
* Overwhelming the service with **growing message queues** for this specific service due to decreased throughput
* Data transfer reaching **\~900 MB/s outbound** from the database to the service instances

As memory usage kept rising, multiple encoding service instances ultimately **crashed simultaneously**, leading to an abrupt stop in processing and further service degradation due to accumulation of **\~300,000 unprocessed service messages during the outage from 18:00 CET**.

## **Impact**

* Encoding jobs failed to start after queueing
* Running encoding jobs failed to update their status to “In Progress” or “Finished” during the incident

## **Detection**

* 18:06 CET – Investigation initiated based on internal monitoring and alerts.
* 18:07 CET – Outage confirmed and posted on status page.

Internally, teams observed service pods **crashing due to memory limit exhaustion**, causing message processing interruptions and queue growth.

## **Investigation, Mitigation &amp; Recovery**

After the service crashes, we initiated a controlled process:

1. Increased service memory limits on our Kubernetes cluster 18:25 CET
2. Reduced parallel message queue worker logic of the affected service to reduce DB read load by about 30% 19:02 CET
3. Implemented further monitoring and tracing in the service to increase visibility into the massive DB data streaming causes 19:38 CET
4. Re-routed several messages on our messaging system to further decrease DB read load 19:51 CET
5. Sped up processing of messages causing excessive read operations through service based skipping with a new service version 20:23 CET
6. Deployed the new service version, which decreased DB read load to normal levels 20:56 CET
7. Increased message throughput for services to restore the system state by processing the message backlog of about 350k messages in RabbitMQ 21:30 CET
8. Restored all service configuration and gradually increased encoding throughput 23:00 CET
9. Full service restored 23:31 CET
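
As an illustration of the queue-side throttling used in steps 2 and 7 above (a minimal sketch, not our production code; the queue name and message handler are placeholders), a RabbitMQ consumer's pull rate can be capped via its prefetch limit:

```python
# Minimal sketch of consumer-side throttling with RabbitMQ (pika). The queue name
# and message handler are placeholders; only the prefetch mechanism is the point.
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# A low prefetch_count limits how many unacknowledged messages a worker holds,
# which throttles its processing rate (and the DB reads it triggers). Raising it
# later restores throughput for working off the backlog.
channel.basic_qos(prefetch_count=10)

def handle_message(ch, method, properties, body):
    # ... process the service message (placeholder) ...
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="encoding-service-messages", on_message_callback=handle_message)
channel.start_consuming()
```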

## **Next Steps and Preventive Actions**

To prevent recurrence, we are implementing the following:

* Implement a holistic fix for the bug causing excessive DB read load
* Expand message queue and memory usage monitoring with stricter alerts for read amplification
* Comprehensive review of encoding service code paths related to heavy DB reads
* Message queue pressure safeguards including throttling for struggling services

Nov 23, 20:33:10 GMT+0 - Investigating - Our team has likely identified the root cause of high DB load causing delays in encoding status updates in our API and dashboard. We are deploying a new version of the affected service and will have an update in a few minutes.

Nov 23, 19:24:41 GMT+0 - Investigating - Our team is continuing to investigate the issue impacting slow API requests, delayed encoding status updates and delayed manifest generations (manifests generated with the encoding start calls). Encoding output is not affected.

At this time, we have not yet identified the root cause. Diagnostic efforts are ongoing, and all necessary teams are engaged.  
  
We sincerely apologize for the inconvenience and will provide updates as soon as we have more information.

Thank you for your patience and understanding. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 5 hours and 25 minutes</p>
    <p><strong>Affected Components:</strong> Encoding API, Amazon Web Services, Oracle Cloud Infrastructure, Microsoft Azure, Akamai Cloud, Google Cloud, Encoding Scheduler, Notification Service, Encoding, Cloud Provisioning</p>
    &lt;p&gt;&lt;small&gt;Nov &lt;var data-var=&#039;date&#039;&gt; 23&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;21:30:18&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  We have identified the root cause of the issue and implemented a fix; encoding backlog processing has sped up but will take more time to fully recover the overall system state.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Nov &lt;var data-var=&#039;date&#039;&gt; 23&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;22:02:10&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Our internal system state has recovered and we are slowly increasing parallel processing throughput for the queued encoding backlog.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Nov &lt;var data-var=&#039;date&#039;&gt; 23&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;22:24:37&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Monitoring&lt;/strong&gt; -
  We are increasing parallel encoding slots every 10 minutes, and the queued encoding backlog built up during the incident will be finished soon. We encourage our customers to restart potentially stopped workflows again.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Nov &lt;var data-var=&#039;date&#039;&gt; 23&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;22:31:53&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
  Our systems are stable, and we are no longer observing any errors or delays.

A Root Cause Analysis (RCA) will follow once our internal review is completed. Thank you for your patience throughout this disruption.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Nov &lt;var data-var=&#039;date&#039;&gt; 23&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;17:07:07&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Investigating&lt;/strong&gt; -
  We are currently investigating this incident.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Nov &lt;var data-var=&#039;date&#039;&gt; 23&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;17:39:54&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  We are currently experiencing a **partial outage** of the Encoding API. Error rates remain elevated and queued encodings continue to accumulate.

To stabilise the platform, we have temporarily reduced maximum processing concurrency. While this measure helps improve system stability, it also impacts throughput and leads to longer wait times for queued jobs.

Our engineering team is actively working on restoring normal performance. We will provide further updates as we make progress.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Nov &lt;var data-var=&#039;date&#039;&gt; 23&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;18:18:50&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  We have successfully reduced the Encoding API error rate; however, our services are still struggling to process the backlog of queued service messages.

To help stabilise the system, we have further lowered the VOD processing limits and continue to actively work on clearing the queue and restoring full service performance.

Webhook notifications are also lagging behind.

Further updates will follow as we make progress.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Nov &lt;var data-var=&#039;date&#039;&gt; 25&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;10:45:28&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Postmortem&lt;/strong&gt; -
  ## **Root Cause Analysis – Encoding Service Outage 23 Nov 2025**

**Summary** 
On 23 Nov 2025, the Bitmovin Encoding Platform experienced a service outage affecting our VoD and Live encoding pipeline. For the duration of the outage, encoding operations such as starting and stopping encodings were temporarily unavailable, and status updates for running encodings were unavailable as well.

## **Root Cause**

A **bug in one of our encoding services** caused memory usage and database (DB) data transfer volumes to increase slowly but steadily over the course of approximately one month.  

This resulted in:

* Gradually **escalating DB read traffic** from the encoding services due to increasing object size with time
* Overwhelming the service with **growing message queues** for this specific service due to decreased throughput
* Data transfer reaching **\~900 MB/s outbound** from the database to the service instances

As memory usage kept rising, multiple encoding service instances ultimately **crashed simultaneously**, leading to an abrupt stop in processing and further service degradation due to accumulation of **\~300,000 unprocessed service messages during the outage from 18:00 CET**.

## **Impact**

* Encoding jobs failed to start after queueing
* Running encoding jobs failed to update their status to “In Progress” or “Finished” during the incident

## **Detection**

* 18:06 CET – Investigation initiated based on internal monitoring and alerts.
* 18:07 CET – Outage confirmed and posted on status page.

Internally, teams observed service pods **crashing due to memory limit exhaustion**, causing message processing interruptions and queue growth.

## **Investigation, Mitigation &amp; Recovery**

After the service crashes, we initiated a controlled process:

1. Increased service memory limits on our Kubernetes cluster 18:25 CET
2. Reduced parallel message queue worker logic of the affected service to reduce DB read load by about 30% 19:02 CET
3. Implemented further monitoring and tracing in the service to increase visibility into the massive DB data streaming causes 19:38 CET
4. Re-routed several messages on our messaging system to further decrease DB read load 19:51 CET
5. Sped up processing of messages causing excessive read operations through service based skipping with a new service version 20:23 CET
6. Deployed the new service version, which decreased DB read load to normal levels 20:56 CET
7. Increased message throughput for services to restore the system state by processing the message backlog of about 350k messages in RabbitMQ 21:30 CET
8. Restored all service configuration and gradually increased encoding throughput 23:00 CET
9. Full service restored 23:31 CET

## **Next Steps and Preventive Actions**

To prevent recurrence, we are implementing the following:

* Implement a holistic fix for the bug causing excessive DB read load
* Expand message queue and memory usage monitoring with stricter alerts for read amplification
* Comprehensive review of encoding service code paths related to heavy DB reads
* Message queue pressure safeguards including throttling for struggling services.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Nov &lt;var data-var=&#039;date&#039;&gt; 23&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;20:33:10&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Investigating&lt;/strong&gt; -
  Our team has likely identified the root cause of high DB load causing delays in encoding status updates in our API and dashboard. We are deploying a new version of the affected service and will have an update in a few minutes.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Nov &lt;var data-var=&#039;date&#039;&gt; 23&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;19:24:41&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Investigating&lt;/strong&gt; -
  Our team is continuing to investigate the issue impacting slow API requests, delayed encoding status updates and delayed manifest generations (manifests generated with the encoding start calls). Encoding output is not affected.

At this time, we have not yet identified the root cause. Diagnostic efforts are ongoing, and all necessary teams are engaged.  
  
We sincerely apologize for the inconvenience and will provide updates as soon as we have more information.

Thank you for your patience and understanding.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Sun, 23 Nov 2025 17:07:07 +0000</pubDate>
  <link>https://status.bitmovin.com/incident/cmibz0g6g01pz2zk9q7sp00oj</link>
  <guid>https://status.bitmovin.com/incident/cmibz0g6g01pz2zk9q7sp00oj</guid>
</item>

<item>
  <title>Encoding Failures in us-east-1 Due to EC2 Instance Creation Issues</title>
  <description>
    Type: Incident
    Duration: 9 hours and 18 minutes

    Affected Components: Amazon Web Services
    Oct 20, 10:30:00 GMT+0 - Monitoring - Since 12:30 UTC on October 20, our encoding service in the us-east-1 region has again been impacted by an AWS issue preventing the creation of new EC2 instances.  
  
As a result, new encoding jobs in this region may fail to start. Running encodings on existing capacity are not affected.  
  
Workaround:  
We recommend all customers switch to another cloud region that is not affected or make use of the fallbackRegion setting when creating encodings, as described in our Documentation: &lt;https://developer.bitmovin.com/encoding/docs/encoding-incident-operational-playbook#1-add-fallback-regions-when-creating-encodings&gt;.  
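
For illustration only, a request that sets a fallback region when creating an encoding could look roughly like the sketch below; please confirm the exact endpoint and field names (for example, whether the setting is a single fallbackRegion or a list of fallback regions) against the playbook linked above.

```python
# Rough sketch only -- verify the exact request schema in the Encoding Incident
# Operational Playbook linked above. The field names below are assumptions.
import os
import requests

API_BASE = "https://api.bitmovin.com/v1"
headers = {"X-Api-Key": os.environ["BITMOVIN_API_KEY"]}

payload = {
    "name": "vod-job-with-fallback",
    "cloudRegion": "AWS_US_EAST_1",       # primary region (affected in this incident)
    "fallbackRegion": "AWS_EU_WEST_1",    # assumed field name; see the playbook for the exact schema
}

resp = requests.post(f"{API_BASE}/encoding/encodings", json=payload, headers=headers, timeout=30)
resp.raise_for_status()
print(resp.json())
```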
  
We are monitoring AWS’s recovery efforts closely and will provide further updates as more information becomes available.

Oct 20, 13:33:00 GMT+0 - Monitoring - The AWS incident affecting EC2 instance creation in us-east-1 is still ongoing. As a result, encoding jobs in this region may continue to fail to start.  
  
There are currently no actions we can take on our side until AWS resolves the underlying issue. Therefore, we will pause posting further updates until AWS has marked the incident as resolved.  
  
For the latest information, we recommend monitoring the AWS Health Dashboard: &lt;https://health.aws.amazon.com/health/status&gt;  
  
In the meantime, we continue to recommend switching to another unaffected cloud region or using the fallbackRegion setting as described in our Encoding Incident Operational Playbook: &lt;https://developer.bitmovin.com/encoding/docs/encoding-incident-operational-playbook#1-add-fallback-regions-when-creating-encodings&gt;

Oct 20, 12:58:00 GMT+0 - Monitoring - The incident is still ongoing. Our encoding service in us-east-1 continues to be impacted by AWS EC2 instance creation issues.  
  
We are waiting for further updates from AWS and will continue to monitor their recovery efforts. For more detailed and up-to-date information, we recommend customers check the AWS Health Dashboard: &lt;https://health.aws.amazon.com/health/status&gt;  
  
In the meantime, we continue to recommend switching to another unaffected cloud region or using the fallbackRegion setting as described in our Encoding Incident Operational Playbook: &lt;https://developer.bitmovin.com/encoding/docs/encoding-incident-operational-playbook#1-add-fallback-regions-when-creating-encodings&gt;

Oct 20, 19:48:00 GMT+0 - Resolved - The AWS incident affecting EC2 instance creation in us-east-1 has been resolved by AWS as of Oct 20 21:48 UTC.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 9 hours and 18 minutes</p>
    <p><strong>Affected Components:</strong> Amazon Web Services</p>
    &lt;p&gt;&lt;small&gt;Oct &lt;var data-var=&#039;date&#039;&gt; 20&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;10:30:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Monitoring&lt;/strong&gt; -
  Since 12:30 UTC on October 20, our encoding service in the us-east-1 region has again been impacted by an AWS issue preventing the creation of new EC2 instances.  
  
As a result, new encoding jobs in this region may fail to start. Running encodings on existing capacity are not affected.  
  
Workaround:  
We recommend all customers switch to another cloud region that is not affected or make use of the fallbackRegion setting when creating encodings, as described in our Documentation: &lt;https://developer.bitmovin.com/encoding/docs/encoding-incident-operational-playbook#1-add-fallback-regions-when-creating-encodings&gt;.  
  
We are monitoring AWS’s recovery efforts closely and will provide further updates as more information becomes available.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Oct &lt;var data-var=&#039;date&#039;&gt; 20&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;13:33:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Monitoring&lt;/strong&gt; -
  The AWS incident affecting EC2 instance creation in us-east-1 is still ongoing. As a result, encoding jobs in this region may continue to fail to start.  
  
There are currently no actions we can take on our side until AWS resolves the underlying issue. Therefore, we will pause posting further updates until AWS has marked the incident as resolved.  
  
For the latest information, we recommend monitoring the AWS Health Dashboard: &lt;https://health.aws.amazon.com/health/status&gt;  
  
In the meantime, we continue to recommend switching to another unaffected cloud region or using the fallbackRegion setting as described in our Encoding Incident Operational Playbook: &lt;https://developer.bitmovin.com/encoding/docs/encoding-incident-operational-playbook#1-add-fallback-regions-when-creating-encodings&gt;.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Oct &lt;var data-var=&#039;date&#039;&gt; 20&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;12:58:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Monitoring&lt;/strong&gt; -
  The incident is still ongoing. Our encoding service in us-east-1 continues to be impacted by AWS EC2 instance creation issues.  
  
We are waiting for further updates from AWS and will continue to monitor their recovery efforts. For more detailed and up-to-date information, we recommend customers check the AWS Health Dashboard: &lt;https://health.aws.amazon.com/health/status&gt;  
  
In the meantime, we continue to recommend switching to another unaffected cloud region or using the fallbackRegion setting as described in our Encoding Incident Operational Playbook: &lt;https://developer.bitmovin.com/encoding/docs/encoding-incident-operational-playbook#1-add-fallback-regions-when-creating-encodings&gt;.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Oct &lt;var data-var=&#039;date&#039;&gt; 20&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;19:48:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
  The AWS incident affecting EC2 instance creation in us-east-1 has been resolved by AWS as of Oct 20 21:48 UTC.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Mon, 20 Oct 2025 10:30:00 +0000</pubDate>
  <link>https://status.bitmovin.com/incident/cmhd90qjf00vnvapz2tvu5vb9</link>
  <guid>https://status.bitmovin.com/incident/cmhd90qjf00vnvapz2tvu5vb9</guid>
</item>

<item>
  <title>Elevated API Error Rate on Muxing creation</title>
  <description>
    Type: Incident
    Duration: 40 minutes

    Affected Components: Encoding API
    Oct 20, 22:04:00 GMT+0 - Postmortem - # Incident Post-Mortem – Encoding Cleanup Gap-Lock

**Date:** 2025-10-16  
**Duration:** 07:49–08:30 UTC (41 minutes)  
**Customer Impact:** Calls to the `https://api.bitmovin.com/v1/encoding/encodings/{encoding_id}/muxings/` endpoints failed, leading to workflow stoppages. The failure rate remained below monitoring thresholds, so the issue was not immediately detected.

## Summary

At 07:49 UTC, our encoding cleanup process (responsible for deleting old encodings outside of retention) triggered a **gap-lock** on one of our database tables. While the long-running delete transaction was active, the API was unable to insert new records into the table.

This directly impacted the **muxings endpoints**, which failed during this period and caused encoding workflows to stop. Because the error rate was below our monitoring alert thresholds, the incident went unnoticed until the team was alerted at 08:20 UTC.

## Root Cause

The issue was caused by **outdated MySQL table statistics**, which led the **MySQL query optimizer** to select an inefficient execution plan:

* Instead of using an index scan on the large table, the query optimizer chose a **full table scan**.
* Under **REPEATABLE\_READ isolation**, this resulted in **gap-locks** across wide portions of the table.
* In practice, this behaved like a **table-level lock**, blocking all inserts until the cleanup query was stopped.
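
One way to catch this failure mode ahead of time (an illustrative sketch, not our actual tooling; the table name and retention clause are placeholders) is to refresh the table statistics and check the plan the optimizer would choose for the cleanup delete:

```python
# Sketch: refresh optimizer statistics and inspect the cleanup query's plan.
# "encodings_archive" and the WHERE clause are placeholders for illustration.
import os
import mysql.connector

conn = mysql.connector.connect(host="db.internal", user="ops",
                               password=os.environ["DB_PASSWORD"], database="encoding")
cur = conn.cursor()

# ANALYZE TABLE rebuilds the statistics the optimizer uses to choose index vs. full scan.
cur.execute("ANALYZE TABLE encodings_archive")
cur.fetchall()

# EXPLAIN shows the chosen plan: a "range" access type using a key means an index
# scan; "ALL" means a full table scan, which under REPEATABLE READ gap-locks widely.
cur.execute("EXPLAIN DELETE FROM encodings_archive "
            "WHERE created_at &lt; NOW() - INTERVAL 90 DAY")
for row in cur.fetchall():
    print(row)
```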

## Timeline

* **07:49 UTC** – Encoding cleanup process starts. Optimizer chooses full table scan → gap-locks prevent inserts.
* **07:49–08:20 UTC** – Calls to `/encoding/encodings/{encoding_id}/muxings/` fail, workflows stop. Monitoring does not alert as errors are below configured threshold.
* **08:20 UTC** – Engineering team is alerted.
* **08:20–08:30 UTC** – Cleanup process identified as root cause. Process is stopped, lock released.
* **08:30 UTC** – Incident resolved, API resumes normal operation.

## Immediate Actions

After resolving the incident, we immediately stopped the cleanup process and investigated why it suddenly caused issues despite having run successfully for years.

Since then, we have:

* **Tightened timeouts** for deletion commands in the cleanup process.
* **Changed to a more forgiving isolation level** to prevent broad blocking locks.
* **Adjusted the cleanup process to use smaller batch sizes and higher concurrency**, which reduced database impact and increased throughput.
* **Added significantly more monitoring** to the cleanup process to better track query performance and catch anomalies early.
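
A minimal sketch of what such a batched, lower-isolation cleanup can look like (the table name, batch size, and timeout are illustrative values, not our production settings):

```python
# Sketch of a cleanup loop: small delete batches, READ COMMITTED isolation, and a
# short lock wait timeout, so a bad plan fails fast instead of blocking inserts.
import os
import mysql.connector

conn = mysql.connector.connect(host="db.internal", user="cleanup",
                               password=os.environ["DB_PASSWORD"], database="encoding")
cur = conn.cursor()
cur.execute("SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED")
cur.execute("SET SESSION innodb_lock_wait_timeout = 5")   # seconds

while True:
    # Small batches keep each transaction's lock footprint narrow.
    cur.execute("DELETE FROM encodings_archive "
                "WHERE created_at &lt; NOW() - INTERVAL 90 DAY LIMIT 1000")
    conn.commit()
    if cur.rowcount == 0:
        break
```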

## Next Steps

* **Update MySQL statistics** so the query optimizer selects the correct execution plan.  
   * _Expected completion:_ **by end of this week (2025-10-24)**
* **Rework alerting strategy** to detect low-rate but workflow-blocking errors earlier.  
   * _Expected completion:_ **within 3 weeks (by 2025-11-06)**

Oct 16, 05:50:00 GMT+0 - Investigating - We are currently investigating this incident.

Oct 16, 06:30:00 GMT+0 - Resolved - We experienced elevated API error rates on the encoding endpoints between 07:50 and 08:30 UTC. The issue, caused by the encoding service during muxing creation, was identified and promptly resolved. The engineering team is preparing a post-mortem and will share it here once available.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 40 minutes</p>
    <p><strong>Affected Components:</strong> Encoding API</p>
    &lt;p&gt;&lt;small&gt;Oct &lt;var data-var=&#039;date&#039;&gt; 20&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;22:04:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Postmortem&lt;/strong&gt; -
  # Incident Post-Mortem – Encoding Cleanup Gap-Lock

**Date:** 2025-10-16  
**Duration:** 07:49–08:30 UTC (41 minutes)  
**Customer Impact:** Calls to the `https://api.bitmovin.com/v1/encoding/encodings/{encoding_id}/muxings/` endpoints failed, leading to workflow stoppages. The failure rate remained below monitoring thresholds, so the issue was not immediately detected.

## Summary

At 07:49 UTC, our encoding cleanup process (responsible for deleting old encodings outside of retention) triggered a **gap-lock** on one of our database tables. While the long-running delete transaction was active, the API was unable to insert new records into the table.

This directly impacted the **muxings endpoints**, which failed during this period and caused encoding workflows to stop. Because the error rate was below our monitoring alert thresholds, the incident went unnoticed until the team was alerted at 08:20 UTC.

## Root Cause

The issue was caused by **outdated MySQL table statistics**, which led the **MySQL query optimizer** to select an inefficient execution plan:

* Instead of using an index scan on the large table, the query optimizer chose a **full table scan**.
* Under **REPEATABLE\_READ isolation**, this resulted in **gap-locks** across wide portions of the table.
* In practice, this behaved like a **table-level lock**, blocking all inserts until the cleanup query was stopped.

## Timeline

* **07:49 UTC** – Encoding cleanup process starts. Optimizer chooses full table scan → gap-locks prevent inserts.
* **07:49–08:20 UTC** – Calls to `/encoding/encodings/{encoding_id}/muxings/` fail, workflows stop. Monitoring does not alert as errors are below configured threshold.
* **08:20 UTC** – Engineering team is alerted.
* **08:20–08:30 UTC** – Cleanup process identified as root cause. Process is stopped, lock released.
* **08:30 UTC** – Incident resolved, API resumes normal operation.

## Immediate Actions

After resolving the incident, we immediately stopped the cleanup process and investigated why it suddenly caused issues despite having run successfully for years.

Since then, we have:

* **Tightened timeouts** for deletion commands in the cleanup process.
* **Changed to a more forgiving isolation level** to prevent broad blocking locks.
* **Adjusted the cleanup process to use smaller batch sizes and higher concurrency**, which reduced database impact and increased throughput.
* **Added significantly more monitoring** to the cleanup process to better track query performance and catch anomalies early.

## Next Steps

* **Update MySQL statistics** so the query optimizer selects the correct execution plan.  
   * _Expected completion:_ **by end of this week (2025-10-24)**
* **Rework alerting strategy** to detect low-rate but workflow-blocking errors earlier.  
   * _Expected completion:_ **within 3 weeks (by 2025-11-06)**.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Oct &lt;var data-var=&#039;date&#039;&gt; 16&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;05:50:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Investigating&lt;/strong&gt; -
  We are currently investigating this incident.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Oct &lt;var data-var=&#039;date&#039;&gt; 16&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;06:30:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
  We experienced elevated API error rates on the encoding endpoints between 07:50 and 08:30 UTC. The issue, caused by the encoding service during muxing creation, was identified and promptly resolved. The engineering team is preparing a post-mortem and will share it here once available.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Thu, 16 Oct 2025 05:50:00 +0000</pubDate>
  <link>https://status.bitmovin.com/incident/cmhd9c92501za2n0mmpo4us8c</link>
  <guid>https://status.bitmovin.com/incident/cmhd9c92501za2n0mmpo4us8c</guid>
</item>

<item>
  <title>Planned Database Downtime</title>
  <description>
    Type: Maintenance
    Duration: 2 hours

    Affected Components: Encoding API, Amazon Web Services, Oracle Cloud Infrastructure, Microsoft Azure, Akamai Cloud, Google Cloud, Encoding Scheduler, Statistics Service, Account Service, Billing Service, Notification Service, Cloud Provisioning
    Sep 27, 07:00:00 GMT+0 - Identified - Bitmovin’s Encoding services will undergo a database upgrade on Saturday, September 27, 2025, from 09:00 to 11:00 UTC to improve performance and stability. Please do not start any new encoding jobs during the maintenance window, as they will not be processed. API status updates will be delayed, and the dashboard and Analytics will be unavailable. Please plan accordingly.
    Sep 27, 09:00:00 GMT+0 - Completed - Maintenance has completed successfully.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 2 hours</p>
    <p><strong>Affected Components:</strong> Encoding API, Amazon Web Services, Oracle Cloud Infrastructure, Microsoft Azure, Akamai Cloud, Google Cloud, Encoding Scheduler, Statistics Service, Account Service, Billing Service, Notification Service, Cloud Provisioning</p>
    &lt;p&gt;&lt;small&gt;Sep &lt;var data-var=&#039;date&#039;&gt; 27&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;07:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Bitmovin’s Encoding services will undergo a database upgrade on Saturday, September 27, 2025, from 09:00 to 11:00 UTC to improve performance and stability. Please do not start any new encoding jobs during the maintenance window, as they will not be processed. API status updates will be delayed, and the dashboard and Analytics will be unavailable. Please plan accordingly.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Sep &lt;var data-var=&#039;date&#039;&gt; 27&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;09:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Completed&lt;/strong&gt; -
  Maintenance has completed successfully.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Sat, 27 Sep 2025 07:00:00 +0000</pubDate>
  <link>https://status.bitmovin.com/maintenance/cmhdaczks00wsivprjfglc5ft</link>
  <guid>https://status.bitmovin.com/maintenance/cmhdaczks00wsivprjfglc5ft</guid>
</item>

<item>
  <title>Encoding failures on AWS</title>
  <description>
    Type: Incident
    Duration: 1 hour and 8 minutes

    Affected Components: Amazon Web Services
    Sep 4, 03:45:00 GMT+0 - Investigating - We are currently experiencing failures with encodings on AWS.  
Our engineering team is currently investigating the issue.

Sep 4, 04:53:00 GMT+0 - Resolved - We have identified the root cause of the encoding failures: a misconfiguration in our S3 bucket.  
The S3 configuration has been corrected, and encoding jobs on AWS are now recovering. We are seeing encoding tasks completing successfully again.  
We are actively monitoring the system to confirm full recovery.

Sep 4, 04:53:00 GMT+0 - Resolved - # Root Cause Analysis: Encoding Failures on AWS

### Summary

On **September 4, 2025**, between **06:31 AM and 08:10 AM CEST**, encoding jobs running on AWS failed due to an issue accessing our S3 storage.

### Root Cause

The incident was caused by an error during a routine S3 key rotation. The old access key was deleted before the new key was in use, which temporarily prevented our encoding service from accessing storage.

### Impact

* Only encoding jobs running on **AWS** were affected.
* Encodings on other cloud providers and all other Bitmovin services were **not impacted**.

### Resolution

The configuration was corrected at **08:10 AM CEST**, restoring access to S3\. Encoding operations on AWS recovered immediately and have been stable since.

### Preventive Measures

To prevent this from happening again, we are:

* Updating our key rotation procedure to ensure keys are not deleted prematurely.
* Automating the key rotation process to reduce the chance of operator error. 
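
As a sketch of the safer ordering (illustrative only; it uses boto3 and a placeholder IAM user name, not our internal tooling), the new key is created and verified before the old one is removed:

```python
# Sketch: rotate an S3 access key in create -> verify -> delete order, so the old
# credential is only removed once the new one is confirmed to work.
import boto3

USER = "encoding-storage-service"   # placeholder IAM user name

iam = boto3.client("iam")
old_keys = iam.list_access_keys(UserName=USER)["AccessKeyMetadata"]
new_key = iam.create_access_key(UserName=USER)["AccessKey"]

# Verify the new key against S3 before touching the old one (new keys can take a
# short time to propagate, so a retry here would be sensible in practice).
s3 = boto3.client(
    "s3",
    aws_access_key_id=new_key["AccessKeyId"],
    aws_secret_access_key=new_key["SecretAccessKey"],
)
s3.list_buckets()   # raises if the new credentials are not usable

for key in old_keys:
    iam.delete_access_key(UserName=USER, AccessKeyId=key["AccessKeyId"])
```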
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 1 hour and 8 minutes</p>
    <p><strong>Affected Components:</strong> Amazon Web Services</p>
    &lt;p&gt;&lt;small&gt;Sep &lt;var data-var=&#039;date&#039;&gt; 4&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;03:45:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Investigating&lt;/strong&gt; -
  We are currently experiencing failures with encodings on AWS.  
Our engineering team is currently investigating the issue.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Sep &lt;var data-var=&#039;date&#039;&gt; 4&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;04:53:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
  We have identified the root cause of the encoding failures: a misconfiguration in our S3 bucket.  
The S3 configuration has been corrected, and encoding jobs on AWS are now recovering. We are seeing encoding tasks completing successfully again.  
We are actively monitoring the system to confirm full recovery.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Sep &lt;var data-var=&#039;date&#039;&gt; 4&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;04:53:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
  # Root Cause Analysis: Encoding Failures on AWS

### Summary

On **September 4, 2025**, between **06:31 AM and 08:10 AM CEST**, encoding jobs running on AWS failed due to an issue accessing our S3 storage.

### Root Cause

The incident was caused by an error during a routine S3 key rotation. The old access key was deleted before the new key was in use, which temporarily prevented our encoding service from accessing storage.

### Impact

* Only encoding jobs running on **AWS** were affected.
* Encodings on other cloud providers and all other Bitmovin services were **not impacted**.

### Resolution

The configuration was corrected at **08:10 AM CEST**, restoring access to S3\. Encoding operations on AWS recovered immediately and have been stable since.

### Preventive Measures

To prevent this from happening again, we are:

* Updating our key rotation procedure to ensure keys are not deleted prematurely.
* Automating the key rotation process to reduce the chance of operator error.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Thu, 4 Sep 2025 03:45:00 +0000</pubDate>
  <link>https://status.bitmovin.com/incident/cmhda393z0116g3728rzazrfs</link>
  <guid>https://status.bitmovin.com/incident/cmhda393z0116g3728rzazrfs</guid>
</item>

  </channel>
  </rss>