Part-5: Failure Modes and Operational Considerations

While the DynamoDB and Kinesis-based architecture significantly improved scalability and throughput, it also introduced failure modes of its own. This part covers the two worst-case scenarios we had to understand and plan for.

1. Kinesis Ingestion Capacity Exhaustion

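If the incoming data generation rate exceeds the provisioned throughput of the Kinesis Data Stream, Kinesis throttles the excess writes (they are rejected with ProvisionedThroughputExceededException) and producers can no longer publish records at the required rate.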
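
As a rough capacity check, the shard count needed for a given peak can be estimated from the per-shard write limits of a provisioned-mode stream (1 MB per second or 1,000 records per second, whichever is reached first). The sketch below is only an illustration: the function name and the 25% headroom default are assumptions, not values from this case study.

```python
import math

# Per-shard write limits for a provisioned-mode Kinesis data stream.
SHARD_WRITE_MB_PER_SEC = 1.0
SHARD_WRITE_RECORDS_PER_SEC = 1000


def required_shards(peak_mb_per_sec, peak_records_per_sec, headroom=0.25):
    """Estimate the shards needed to absorb a peak write load.

    `headroom` is a hypothetical safety margin, not a figure from the article.
    """
    scale = 1 + headroom
    by_bytes = peak_mb_per_sec * scale / SHARD_WRITE_MB_PER_SEC
    by_records = peak_records_per_sec * scale / SHARD_WRITE_RECORDS_PER_SEC
    # The stricter of the two limits decides; a stream has at least one shard.
    return max(1, math.ceil(max(by_bytes, by_records)))


if __name__ == "__main__":
    # Example load (made up): 8 MB/s and 3,000 records/s at peak.
    print(required_shards(8, 3000))
```

With these example numbers the byte limit dominates, so the record-rate limit only matters for workloads made of many small records.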

In our architecture, the analytics platform temporarily buffers data before publishing it to Kinesis. During a sustained overload condition, this buffer can continue to grow until the application exhausts available memory or storage resources, potentially resulting in job failures.

Mitigation:

  • Continuously monitor Kinesis write throughput and shard utilization.

  • Automate shard scaling based on traffic patterns.

  • Apply backpressure to slow data generation when required.

  • Keep sufficient local buffering capacity to absorb short-term spikes.
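
The backpressure and bounded-buffer ideas above can be sketched with a fixed-size in-memory queue: when the buffer is full, the data generator blocks instead of letting memory grow without limit. This is a minimal illustration, not the platform's actual implementation. `BoundedPublisher` and the `publish` callable (standing in for whatever call writes a record to Kinesis) are hypothetical names, and retries and error handling are omitted.

```python
import queue
import threading

_STOP = object()  # sentinel that tells the worker thread to exit


class BoundedPublisher:
    """Buffers records in memory and blocks the generator when the buffer is full."""

    def __init__(self, publish, max_buffered=1000):
        self._publish = publish  # hypothetical callable that sends one record
        self._buffer = queue.Queue(maxsize=max_buffered)
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def submit(self, record, timeout=None):
        # Blocks while the buffer is full (this is the backpressure).
        # With a timeout, raises queue.Full instead of waiting forever.
        self._buffer.put(record, timeout=timeout)

    def _drain(self):
        # Publish buffered records in arrival order until the sentinel arrives.
        while True:
            record = self._buffer.get()
            if record is _STOP:
                return
            self._publish(record)

    def close(self):
        # Flush everything already buffered, then stop the worker.
        self._buffer.put(_STOP)
        self._worker.join()
```

Blocking in `submit` trades throughput for stability: a slow stream slows the job down rather than crashing it once memory or storage runs out.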

2. DynamoDB Throughput Saturation

The second failure scenario occurs when consumers read data from Kinesis faster than the DynamoDB table can accept it, that is, when the required write rate exceeds the table's maximum write capacity.
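
To make "write capacity" concrete: in provisioned mode, one write capacity unit (WCU) covers one standard write per second for an item of up to 1 KB, and larger items are rounded up to the next 1 KB. The helper below is a back-of-the-envelope sketch under that rule. The function name and the example numbers are assumptions, and transactional writes (which cost more) are not covered.

```python
import math


def required_wcu(items_per_sec, item_size_bytes):
    """Estimate the WCUs needed for standard (non-transactional) writes."""
    # Each item consumes one WCU per started KB, with a minimum of one.
    units_per_item = max(1, math.ceil(item_size_bytes / 1024))
    return math.ceil(items_per_sec * units_per_item)


if __name__ == "__main__":
    # Example load (made up): 500 items/s of about 2.5 KB each.
    print(required_wcu(500, 2500))
```

If the consumers need more than the table can provide, the difference accumulates as backlog in the stream.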

In this situation, records begin to accumulate within the Kinesis stream. While Kinesis acts as a durable buffer, it is not an infinite one. If the backlog continues to grow and records remain unprocessed beyond the configured retention period, they will eventually expire before being written to DynamoDB.

For example, with a 24-hour retention period, a sustained DynamoDB bottleneck could result in data loss once records become older than one day.
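
How long a bottleneck can last before records start expiring depends on how far the consumer falls behind. Assuming constant rates and no backlog at the start, the age of the record being read grows by (1 - consume rate / produce rate) hours for every hour of overload, so loss begins once that age reaches the retention period. The sketch below works through this arithmetic. The function name and the example rates are illustrative assumptions.

```python
def hours_until_data_loss(produce_rate, consume_rate, retention_hours=24.0):
    """Hours of sustained overload before the oldest unread record expires.

    Rates can be in any unit as long as both use the same one.
    Returns None when the consumer keeps up with the producer.
    """
    if produce_rate <= 0:
        raise ValueError("produce_rate must be positive")
    if consume_rate >= produce_rate:
        return None  # no backlog builds up, so nothing expires
    # Hours of consumer lag gained per wall-clock hour of overload.
    lag_growth_per_hour = 1.0 - consume_rate / produce_rate
    return retention_hours / lag_growth_per_hour


if __name__ == "__main__":
    # Example (made up): DynamoDB absorbs 750 records/s of 1,000 produced.
    print(hours_until_data_loss(1000, 750))
```

In this example the consumer runs at 75% of the producer rate, so with 24-hour retention the first records expire after 96 hours. A consumer that is completely stalled hits the limit after exactly 24 hours.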

Mitigation:

  • Provision sufficient DynamoDB write capacity for peak workloads.

  • Monitor Kinesis iterator age and consumer lag.

  • Configure extended Kinesis retention where required (the default is 24 hours, and it can be raised to as much as 365 days).

  • Implement automated alarms and scaling policies for the consumers that write to DynamoDB.

  • Design operational runbooks to address prolonged backlogs before retention limits are reached.
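
One way to act on iterator age is a CloudWatch alarm on the stream's `GetRecords.IteratorAgeMilliseconds` metric that fires well before the retention limit. The sketch below only builds the alarm parameters. The function name, the stream name, the 50% threshold, and the evaluation settings are assumptions to adapt, not settings from this case study.

```python
def iterator_age_alarm(stream_name, retention_hours=24, alert_fraction=0.5):
    """Build CloudWatch alarm parameters for consumer lag on a Kinesis stream.

    The alarm fires when the oldest unread record is older than
    `alert_fraction` of the retention period (a hypothetical default).
    """
    threshold_ms = retention_hours * 3600 * 1000 * alert_fraction
    return {
        "AlarmName": f"{stream_name}-iterator-age",
        "Namespace": "AWS/Kinesis",
        "MetricName": "GetRecords.IteratorAgeMilliseconds",
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        "Statistic": "Maximum",
        "Period": 300,            # evaluate in 5-minute windows
        "EvaluationPeriods": 3,   # require 15 minutes above the threshold
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
    }


if __name__ == "__main__":
    params = iterator_age_alarm("analytics-stream")  # hypothetical stream name
    print(params["Threshold"])
    # To create the alarm (requires boto3 and AWS credentials):
    #   import boto3
    #   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Alerting at a fraction of the retention period leaves the remaining time for the runbook steps, such as adding consumer capacity or extending retention, before any record expires.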

Key Lesson

Kinesis provides an effective buffer between data producers and DynamoDB, but it should not be viewed as an unlimited queue. The overall system remains constrained by the slowest component in the pipeline. Effective capacity planning, monitoring, and automated scaling are essential to ensure that transient traffic spikes do not evolve into sustained bottlenecks.

Scaling a Predictive Analytics Application on AWS - A Case Study

Part 5 of 6

Over the last few years, our Predictive Analytics application experienced significant growth in both data volume and workload complexity. What started as a traditional analytics solution running on an on-premises relational database eventually reached its scalability limits as business demand increased. To meet these challenges, we re-architected the application using Amazon DynamoDB and Amazon Kinesis Data Streams, transforming it into a highly scalable, event-driven system capable of processing over 500 jobs per day, including 30% more complex workloads, while maintaining a maximum data availability time of less than five minutes.

This series documents the architectural journey, key design decisions, trade-offs, operational challenges, and lessons learned along the way.

What You'll Learn

  • How to identify when a relational database is no longer the right fit.

  • When DynamoDB is a better choice than a traditional RDBMS.

  • Designing single-table data models for large-scale immutable datasets.

  • Handling burst traffic using Kinesis Data Streams.

  • Cost optimization techniques for DynamoDB.

  • Failure modes, capacity planning, and operational considerations.

  • Measuring scalability through business outcomes rather than technical benchmarks.
