Part-5: Failure Modes and Operational Considerations
While the DynamoDB and Kinesis-based architecture significantly improved scalability and throughput, it still has failure modes that need to be understood and planned for. Two worst-case scenarios matter most.
1. Kinesis Ingestion Capacity Exhaustion
If data is generated faster than the provisioned write throughput of the Kinesis Data Stream, Kinesis throttles the excess writes and producers cannot publish records at the required rate.
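As a rough sizing aid, the sketch below estimates the minimum shard count for a provisioned-mode stream from the per-shard write limits of 1 MB/s and 1,000 records/s. The function name min_shards and the headroom parameter are illustrative assumptions, not part of the pipeline described here, and the calculation treats 1 MB as 1,024 KB.

```python
import math

# Per-shard write limits for a provisioned-mode Kinesis data stream.
SHARD_WRITE_MB_PER_S = 1.0
SHARD_WRITE_RECORDS_PER_S = 1000


def min_shards(records_per_s, avg_record_kb, headroom=0.25):
    """Estimate the shards needed for a write load (hypothetical helper).

    headroom is an assumed safety margin: 0.25 means 25% spare capacity.
    """
    if records_per_s < 0 or avg_record_kb < 0 or headroom < 0:
        raise ValueError("inputs must be non-negative")

    # A shard is saturated by whichever limit is reached first:
    # bytes per second or records per second.
    mb_per_s = records_per_s * avg_record_kb / 1024
    by_bytes = mb_per_s / SHARD_WRITE_MB_PER_S
    by_count = records_per_s / SHARD_WRITE_RECORDS_PER_S

    # A stream always has at least one shard.
    return max(1, math.ceil(max(by_bytes, by_count) * (1 + headroom)))
```

Small records tend to hit the record-count limit first, while large records hit the byte limit first, which is why the estimate takes the larger of the two.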
In our architecture, the analytics platform temporarily buffers data before publishing it to Kinesis. During a sustained overload condition, this buffer can continue to grow until the application exhausts available memory or storage resources, potentially resulting in job failures.
Mitigation:
Monitor Kinesis write throughput and shard utilization continuously.
Scale shards automatically based on traffic patterns.
Apply backpressure to slow data generation when required.
Provision enough local buffering capacity to absorb short-term spikes.
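One way to combine bounded buffering with backpressure is sketched below, using Python's standard queue.Queue. The class name BackpressureBuffer and its methods are hypothetical and are not taken from the analytics platform. The call that publishes a drained batch to Kinesis is deliberately omitted.

```python
import queue


class BackpressureBuffer:
    """Bounded in-memory buffer that pushes back on the producer when full."""

    def __init__(self, max_records):
        # A bounded queue caps memory use instead of growing without limit.
        self._q = queue.Queue(maxsize=max_records)

    def offer(self, record, timeout_s=0.0):
        """Try to enqueue a record.

        Returns False if the buffer is still full after timeout_s seconds,
        which is the signal for the caller to slow data generation.
        """
        try:
            if timeout_s > 0:
                self._q.put(record, block=True, timeout=timeout_s)
            else:
                self._q.put_nowait(record)
            return True
        except queue.Full:
            return False

    def drain(self, max_batch=500):
        """Remove and return up to max_batch records for publishing.

        The default of 500 matches the per-request record limit of the
        Kinesis PutRecords API.
        """
        batch = []
        while len(batch) < max_batch:
            try:
                batch.append(self._q.get_nowait())
            except queue.Empty:
                break
        return batch
```

With a positive timeout_s, the generating code blocks briefly when the buffer is full, so short spikes are absorbed and sustained overload slows the producer rather than exhausting memory.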
2. DynamoDB Throughput Saturation
A second failure scenario occurs when consumers read from Kinesis and attempt to write to DynamoDB faster than the table's available write capacity allows. DynamoDB then throttles the excess writes, and the consumers fall behind the stream.
In this situation, records begin to accumulate within the Kinesis stream. While Kinesis acts as a durable buffer, it is not an infinite one. If the backlog continues to grow and records remain unprocessed beyond the configured retention period, they will eventually expire before being written to DynamoDB.
For example, with the default 24-hour retention period, a sustained DynamoDB bottleneck results in data loss once unprocessed records become older than 24 hours.
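To get a feel for how much time a sustained bottleneck leaves, the sketch below estimates when the first record would expire. It assumes constant rates, in-order consumption, and an empty backlog when the overload starts. Under those assumptions, at time T the consumer is reading records that arrived at T × (drain / inflow), so their age is T × (1 − drain / inflow). Setting that age equal to the retention period R gives T = R × inflow / (inflow − drain). The function name is hypothetical.

```python
def hours_until_first_expiry(inflow_per_s, drain_per_s, retention_hours=24.0):
    """Hours from the start of a sustained overload until the consumer
    first reaches a record that has aged past the retention period.

    Simplified model: constant rates, FIFO consumption, empty backlog at t=0.
    Returns None when the consumer keeps up and no backlog builds.
    """
    if inflow_per_s <= 0 or drain_per_s < 0 or retention_hours <= 0:
        raise ValueError("inflow and retention must be positive, drain non-negative")
    if drain_per_s >= inflow_per_s:
        return None
    return retention_hours * inflow_per_s / (inflow_per_s - drain_per_s)
```

Under this model, a stream receiving 1,200 records/s while DynamoDB absorbs only 1,000 records/s would start losing data roughly 144 hours after the overload begins with 24-hour retention. Real traffic is rarely this uniform, so the figure is an order-of-magnitude guide rather than a guarantee.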
Mitigation:
Provision sufficient DynamoDB write capacity for peak workloads.
Monitor Kinesis iterator age and consumer lag.
Configure extended Kinesis retention (up to 365 days) where required.
Implement automated alarms and scaling policies for the DynamoDB table and the consumers that write to it.
Design operational runbooks to address prolonged backlogs before retention limits are reached.
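The iterator-age check above can be expressed as a simple comparison against the retention period. The sketch below classifies a reading such as the CloudWatch GetRecords.IteratorAgeMilliseconds metric. The function name and the 25% and 50% thresholds are assumptions chosen for illustration and would need tuning to how quickly a backlog can realistically be cleared.

```python
def backlog_status(iterator_age_ms, retention_hours=24.0,
                   warn_fraction=0.25, critical_fraction=0.5):
    """Classify consumer lag relative to the stream retention period.

    Thresholds are illustrative assumptions: "warn" once the oldest
    unread record has used 25% of the retention window, "critical"
    at 50%.
    """
    if iterator_age_ms < 0 or retention_hours <= 0:
        raise ValueError("iterator age must be >= 0 and retention > 0")

    retention_ms = retention_hours * 3600 * 1000
    fraction_used = iterator_age_ms / retention_ms

    if fraction_used >= critical_fraction:
        return "critical"
    if fraction_used >= warn_fraction:
        return "warn"
    return "ok"
```

Expressing the thresholds as fractions of the retention period keeps the alarm meaningful if retention is later extended, since the remaining time to act scales with it.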
Key Lesson
Kinesis provides an effective buffer between data producers and DynamoDB, but it should not be viewed as an unlimited queue. The overall system remains constrained by the slowest component in the pipeline. Effective capacity planning, monitoring, and automated scaling are essential to ensure that transient traffic spikes do not evolve into sustained bottlenecks.
