How We Scaled a Predictive Analytics Application to Handle 5x Volume Growth Using DynamoDB and Kinesis Data Streams
In this post, I’ll share how we transformed our Predictive Analytics Platform from an on-premises relational data infrastructure to a cloud-native NoSQL architecture, enabling us to process significantly larger data volumes with higher throughput while optimizing operational costs.
Our platform executes proprietary predictive algorithms against user-defined inputs to generate large-scale analytical datasets. In its original design, workloads ran on a fixed-capacity, on-premises infrastructure, which limited our ability to scale with growing demand. Depending on the input parameters and job type, a single execution could generate anywhere between 500 GB and 1 TB of domain-specific event data.
The platform operated synchronously, with job execution times ranging from a few minutes to several hours. As demand increased, this architecture began to expose scalability and throughput bottlenecks. The generated datasets serve as critical inputs for downstream applications, which perform additional calculations and analyses to support important business decisions.
A key characteristic of this workload is that the generated data is immutable: once written, it is never updated. This write-once access pattern made the workload an ideal candidate for a NoSQL datastore optimized for high write throughput, horizontal scalability, and cost-efficient storage.
Lean QFD to Determine the Right Technology Choice
We performed a Quality Function Deployment (QFD) assessment across multiple datastore options. While Aurora PostgreSQL provided strong relational capabilities and S3/Athena offered the lowest storage costs, DynamoDB combined with Kinesis Data Streams achieved the highest overall score due to its ability to handle high-volume concurrent reads and writes, support aggregate-oriented event data, provide near real-time data availability, and scale predictably with growing workload demands. As a result, DynamoDB + Kinesis emerged as the most suitable architecture for our predictive analytics platform.
| Requirement | Weight | Aurora PostgreSQL | DynamoDB + Kinesis | S3 + Lambda + Athena |
|---|---|---|---|---|
| High Read/Write Throughput | 9 | 3 | 9 | 1 |
| Data Model Supports Aggregate-Oriented Events | 9 | 3 | 9 | 3 |
| Cost-Efficient Storage | 6 | 3 | 6 | 9 |
| Time-to-Availability | 9 | 3 | 9 | 1 |
| Scalability for 500+ Jobs/Day | 9 | 3 | 9 | 3 |
| Concurrent Reads & Writes | 9 | 3 | 9 | 1 |
| Immutable Event Data Pattern | 6 | 3 | 9 | 9 |
| Operational Simplicity | 3 | 6 | 9 | 3 |
| Query Performance for Known Access Patterns | 9 | 6 | 9 | 1 |
| Total Weighted Score | | 243 | 603 | 207 |
Evolving Our NoSQL Approach Using DynamoDB
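
For readers unfamiliar with QFD scoring: each option's total is the sum of its ratings multiplied by the corresponding requirement weights. The following Python sketch shows that arithmetic on the matrix above. The names `RATINGS`, `OPTIONS`, and `weighted_totals` are illustrative and not part of our tooling.

```python
# Each row: (requirement, weight, Aurora PostgreSQL, DynamoDB + Kinesis, S3 + Lambda + Athena)
RATINGS = [
    ("High Read/Write Throughput", 9, 3, 9, 1),
    ("Data Model Supports Aggregate-Oriented Events", 9, 3, 9, 3),
    ("Cost-Efficient Storage", 6, 3, 6, 9),
    ("Time-to-Availability", 9, 3, 9, 1),
    ("Scalability for 500+ Jobs/Day", 9, 3, 9, 3),
    ("Concurrent Reads & Writes", 9, 3, 9, 1),
    ("Immutable Event Data Pattern", 6, 3, 9, 9),
    ("Operational Simplicity", 3, 6, 9, 3),
    ("Query Performance for Known Access Patterns", 9, 6, 9, 1),
]
OPTIONS = ("Aurora PostgreSQL", "DynamoDB + Kinesis", "S3 + Lambda + Athena")


def weighted_totals(rows):
    """Return {option: sum(weight * rating)} over all requirement rows."""
    totals = [0] * len(OPTIONS)
    for _name, weight, *scores in rows:
        for i, score in enumerate(scores):
            totals[i] += weight * score
    return dict(zip(OPTIONS, totals))


if __name__ == "__main__":
    for option, total in weighted_totals(RATINGS).items():
        print(f"{option}: {total}")
```

Keeping the matrix in code makes it easy to re-run the comparison when a weight or rating is revisited.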
Our initial DynamoDB implementation followed a familiar relational mindset: each generated event was stored as an individual DynamoDB item, similar to storing records as rows in a relational database. While this approach simplified the migration, it quickly exposed scalability and cost challenges as workload volumes increased.
A single predictive analytics job could generate millions of events, resulting in a correspondingly large number of write operations. Although DynamoDB's VPC Gateway Endpoint reduced network overhead by keeping traffic within the AWS network, the overall ingestion latency remained significant due to the sheer volume of items being written for each job.
More importantly, this design introduced three key operational challenges:
1. High Write Capacity Requirements
To support peak workloads, we had to provision a substantial number of Write Capacity Units (WCUs). As job complexity increased, write throughput requirements became increasingly difficult to predict, resulting in significant DynamoDB operating costs.
2. Auto Scaling Lag During Sudden Bursts
DynamoDB Auto Scaling works well for gradual workload growth but is not designed to react instantaneously to sudden traffic spikes. In our platform, a user could trigger a complex analytics job that generated a massive burst of write traffic within seconds. Since Auto Scaling typically requires a few minutes to adjust capacity, the workload was susceptible to temporary throttling during these bursts.
3. Cost and Scaling Limitations of On-Demand Capacity
We also evaluated DynamoDB On-Demand capacity mode to eliminate the need for capacity planning. While it simplified operations, it did not fully address our workload characteristics. The platform experienced highly unpredictable spikes driven by user-triggered analytics jobs, and On-Demand capacity still requires DynamoDB to scale underlying resources and partitions. Extremely large bursts could therefore experience initial throttling while the service adapted to the new traffic pattern.
As a result, both Provisioned and On-Demand modes presented trade-offs. Provisioned capacity led to significant costs due to over-provisioning for peak demand, while On-Demand capacity could not always guarantee immediate absorption of sudden, large-scale traffic spikes.
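To give a sense of the provisioning arithmetic: a standard DynamoDB write consumes one WCU per 1 KB of item size, rounded up, and one WCU sustains one such write per second. The sketch below estimates the WCUs needed to absorb a burst of per-event writes. The job size, event size, and time window in the example are hypothetical numbers chosen for illustration, not measurements from our platform.

```python
import math


def wcus_per_write(item_size_bytes):
    """WCUs consumed by one standard (non-transactional) write: 1 per KB, rounded up."""
    return max(1, math.ceil(item_size_bytes / 1024))


def required_wcus(items, item_size_bytes, window_seconds):
    """Provisioned WCUs needed to write `items` evenly over `window_seconds`."""
    writes_per_second = items / window_seconds
    return math.ceil(writes_per_second * wcus_per_write(item_size_bytes))


if __name__ == "__main__":
    # Hypothetical burst: 10 million 600-byte events written within 5 minutes.
    print(required_wcus(10_000_000, 600, 300))  # 33334
```

Because every small event is rounded up to a full WCU, writing one item per event pays for capacity that is mostly unused, which is part of what makes peak provisioning expensive.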
These observations led us to rethink the ingestion architecture itself. Rather than writing millions of events directly into DynamoDB, we needed a buffering and aggregation mechanism that could absorb burst traffic, smooth write patterns, reduce WCU consumption, and provide predictable scalability. This ultimately led us to introduce Amazon Kinesis Data Streams as a decoupling layer between the analytics workloads and DynamoDB.
Handling 5× Data Growth
To address the scalability, performance, and cost challenges of our initial DynamoDB implementation, we introduced optimizations at both the application and architecture layers.
Code-Level Optimizations
Event Aggregation
Our initial design stored each generated event as an individual DynamoDB item. While straightforward, this approach resulted in millions of write operations for a single analytics job, driving up both network overhead and DynamoDB write costs.
To optimize this, we leveraged DynamoDB's maximum item size of 400 KB by aggregating multiple related events into a single item before persisting them. This significantly reduced the number of write operations, network round trips, and Write Capacity Unit (WCU) consumption.
An additional benefit was improved read efficiency. Since related events were co-located within the same item, downstream applications could retrieve larger logical datasets with fewer database requests.
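A minimal sketch of the aggregation step, assuming events are JSON-serializable dictionaries and that each aggregated payload is stored as a single attribute of one item. The 350 KB budget is an assumed safety margin below the 400 KB item limit to leave room for key attributes, and `aggregate_events` is an illustrative name rather than our production code.

```python
import json

MAX_ITEM_BYTES = 400 * 1024   # DynamoDB maximum item size
PAYLOAD_BUDGET = 350 * 1024   # assumed headroom for keys and attribute names


def aggregate_events(events, budget=PAYLOAD_BUDGET):
    """Greedily pack events into JSON array payloads of at most `budget` bytes."""
    batch = []
    size = 2  # bytes for the enclosing "[" and "]"
    for event in events:
        encoded = json.dumps(event, separators=(",", ":")).encode("utf-8")
        extra = len(encoded) + (1 if batch else 0)  # comma between elements
        if batch and size + extra > budget:
            # Current payload is full: emit it and start a new one.
            yield b"[" + b",".join(batch) + b"]"
            batch, size = [], 2
            extra = len(encoded)
        if size + extra > budget:
            raise ValueError("single event exceeds the payload budget")
        batch.append(encoded)
        size += extra
    if batch:
        yield b"[" + b",".join(batch) + b"]"


if __name__ == "__main__":
    sample = [{"seq": i, "value": i * 0.5} for i in range(100_000)]
    payloads = list(aggregate_events(sample))
    print(f"{len(sample)} events packed into {len(payloads)} payloads")
```

Each payload would then be written as one item, for example keyed by a job identifier and a chunk sequence number (a hypothetical key design), so a reader can fetch a whole job with a small number of requests.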
Data Compression
We further optimized storage by compressing event payloads before persisting them to DynamoDB.
Although decompression introduces a small overhead during reads, the benefits far outweighed the cost. Compression reduced storage consumption, lowered write throughput requirements, and decreased overall DynamoDB operating costs. In addition, the smaller payload sizes reduced network transfer volumes, resulting in an observed performance improvement of approximately 15%.
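The idea can be sketched with Python's standard `zlib` module. The choice of zlib, the compression level, and the sample event shape are assumptions for illustration; any lossless codec follows the same compress-on-write, decompress-on-read pattern.

```python
import json
import zlib


def compress_payload(payload, level=6):
    """Compress an aggregated payload before it is written to the datastore."""
    return zlib.compress(payload, level)


def decompress_payload(blob):
    """Restore the original payload on the read path."""
    return zlib.decompress(blob)


if __name__ == "__main__":
    # Hypothetical event shape; repetitive JSON keys tend to compress well.
    events = [{"job": "j-1", "seq": i, "value": 0.5} for i in range(5000)]
    raw = json.dumps(events).encode("utf-8")
    blob = compress_payload(raw)
    assert decompress_payload(blob) == raw
    print(f"raw={len(raw)} bytes, compressed={len(blob)} bytes")
```

Since write capacity is charged by item size, a smaller payload reduces both storage and WCU consumption, and it allows more events to fit within one item.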
Architecture-Level Optimizations
Introducing Kinesis Data Streams as a Buffer Layer
While the code-level optimizations reduced the volume of data written to DynamoDB, they did not fully address the challenge of sudden traffic spikes caused by user-triggered analytics jobs.
To solve this, we introduced Amazon Kinesis Data Streams as a flow-control layer between the analytics workloads and DynamoDB.
Instead of writing directly to DynamoDB, producers publish events to Kinesis Data Streams. Consumer applications then process records from the stream and persist them to DynamoDB at a controlled rate.
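One common way to enforce a controlled write rate in a consumer is a token bucket. The sketch below illustrates that technique and is not our production consumer: `write_item` stands in for whatever call persists a record to DynamoDB, and the rate would typically be set somewhat below the table's available write capacity.

```python
import time


class TokenBucket:
    """Allow up to `rate` units per second, with bursts of up to `capacity` units."""

    def __init__(self, rate, capacity, clock=time.monotonic, sleep=time.sleep):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.sleep = sleep
        self.last = clock()

    def acquire(self, units=1.0):
        """Block until `units` tokens are available, then consume them."""
        if units > self.capacity:
            raise ValueError("request exceeds bucket capacity")
        while True:
            now = self.clock()
            # Refill in proportion to elapsed time, capped at capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= units:
                self.tokens -= units
                return
            self.sleep((units - self.tokens) / self.rate)


def drain(records, write_item, bucket, cost=lambda record: 1.0):
    """Write each record via `write_item`, paced by the token bucket."""
    written = 0
    for record in records:
        bucket.acquire(cost(record))
        write_item(record)
        written += 1
    return written
```

The `cost` callable lets the pacing follow WCU consumption rather than record count, for example by charging larger items more tokens.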
This architecture provided several benefits:
Absorbed sudden bursts of write traffic generated by complex analytics jobs.
Protected DynamoDB from short-lived throughput spikes and throttling.
Reduced dependence on aggressive DynamoDB auto-scaling.
Improved workload isolation between data producers and consumers.
Enabled predictable throughput and operational costs.
By combining event aggregation, compression, and Kinesis-based flow control, we transformed the platform from a write-intensive architecture into a scalable event-processing pipeline. It now supports more than 500 jobs per day, with 30% more complex jobs in the mix, while keeping data availability time under five minutes.
Overall, the platform evolved from a legacy RDBMS to DynamoDB, and then to DynamoDB with Kinesis. The table below compares the legacy solution with the final architecture:
| Metric | Legacy RDBMS Solution | DynamoDB + Kinesis Solution |
|---|---|---|
| Jobs Processed Per Day | ~200 | >500 |
| Complex Job Mix | Baseline | +30% |
| Scalability | Fixed Capacity | Horizontally Scalable |
| Data Availability | Batch-Oriented | < 5 Minutes |
| Concurrent Processing | Limited | Significantly Higher |
Failure Modes and Operational Considerations
While the DynamoDB and Kinesis-based architecture significantly improved scalability and throughput, it was important to understand and plan for worst-case scenarios.
1. Kinesis Ingestion Capacity Exhaustion
If the incoming data generation rate exceeds the provisioned throughput of the Kinesis Data Stream, producers may be unable to publish records at the required rate.
In our architecture, the analytics platform temporarily buffers data before publishing it to Kinesis. During a sustained overload condition, this buffer can continue to grow until the application exhausts available memory or storage resources, potentially resulting in job failures.
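A rough way to size a provisioned stream against an expected ingest rate is shown below. The per-shard defaults (about 1 MB/s and 1,000 records/s for writes) reflect the provisioned-mode limits as we understand them and should be checked against current AWS quotas. The 25% headroom and the example rates are arbitrary assumptions.

```python
import math


def shards_needed(bytes_per_sec, records_per_sec,
                  shard_bytes_per_sec=1_000_000,
                  shard_records_per_sec=1_000,
                  headroom=0.25):
    """Estimate shard count from the tighter of the byte and record limits."""
    by_bytes = bytes_per_sec / shard_bytes_per_sec
    by_records = records_per_sec / shard_records_per_sec
    return max(1, math.ceil(max(by_bytes, by_records) * (1 + headroom)))


if __name__ == "__main__":
    # Hypothetical peak: 50 MB/s arriving as 2,000 aggregated records per second.
    print(shards_needed(50_000_000, 2_000))
```

With aggregated records the byte limit usually dominates, while many small records would make the records-per-second limit the binding constraint.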
Mitigation:
Continuous monitoring of Kinesis write throughput and shard utilization.
Automated shard scaling based on traffic patterns.
Backpressure mechanisms to slow data generation when required.
Sufficient local buffering capacity to absorb short-term spikes.
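Backpressure can be as simple as a bounded in-memory buffer: when the publisher falls behind, the generator blocks instead of letting the buffer grow without limit. A minimal sketch, where `generate` and `publish` are hypothetical callables standing in for the analytics job and the Kinesis producer. Error handling in the worker is omitted for brevity.

```python
import queue
import threading


def run_pipeline(generate, publish, max_buffered=1000):
    """Feed items from `generate()` to `publish` through a bounded buffer.

    Returns the highest buffer depth observed by the producer.
    """
    buffer = queue.Queue(maxsize=max_buffered)
    sentinel = object()

    def worker():
        while True:
            item = buffer.get()
            if item is sentinel:
                return
            publish(item)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()

    high_water = 0
    for item in generate():
        buffer.put(item)  # blocks while the buffer is full: this is the backpressure
        high_water = max(high_water, buffer.qsize())
    buffer.put(sentinel)
    thread.join()
    return high_water
```

The buffer size bounds memory use directly, so a sustained overload slows the job down rather than exhausting the host.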
2. DynamoDB Throughput Saturation
A second failure scenario occurs when the rate at which data is consumed from Kinesis and written to DynamoDB exceeds the maximum write capacity available to the DynamoDB table.
In this situation, records begin to accumulate within the Kinesis stream. While Kinesis acts as a durable buffer, it is not an infinite one. If the backlog continues to grow and records remain unprocessed beyond the configured retention period, they will eventually expire before being written to DynamoDB.
For example, with a 24-hour retention period, a sustained DynamoDB bottleneck could result in data loss once records become older than one day.
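The time available before records start expiring can be estimated from the retention period, the current age of the oldest unprocessed record, and the producer and consumer rates. The sketch below assumes both rates stay constant, which real workloads rarely do, so the result is only a rough guide. The function name and the example rates are illustrative.

```python
def hours_until_data_loss(retention_hours, oldest_record_age_hours,
                          ingest_rate, drain_rate):
    """Hours until the oldest unprocessed record reaches the retention limit.

    Returns None when the consumer keeps up (the backlog is not growing).
    Rates can be in any unit, as long as both use the same one.
    """
    if drain_rate >= ingest_rate:
        return None
    # Each wall-clock hour, the consumer falls behind by this many hours of stream data.
    lag_growth_per_hour = 1 - drain_rate / ingest_rate
    remaining = max(0.0, retention_hours - oldest_record_age_hours)
    return remaining / lag_growth_per_hour


if __name__ == "__main__":
    # Hypothetical: 24-hour retention, consumer draining at 75% of the ingest rate.
    print(hours_until_data_loss(24, 0, ingest_rate=10_000, drain_rate=7_500))  # 96.0
```

In practice the age input would come from the stream's iterator age metric, and an alarm threshold set well below the retention period leaves time to react.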
Mitigation:
Provision sufficient DynamoDB write capacity for peak workloads.
Monitor Kinesis iterator age and consumer lag.
Configure extended Kinesis retention where required.
Implement automated alarms and scaling policies for DynamoDB consumers.
Design operational runbooks to address prolonged backlogs before retention limits are reached.
Key Lesson
Kinesis provides an effective buffer between data producers and DynamoDB, but it should not be viewed as an unlimited queue. The overall system remains constrained by the slowest component in the pipeline. Effective capacity planning, monitoring, and automated scaling are essential to ensure that transient traffic spikes do not evolve into sustained bottlenecks.
AWS Event Presentation
The architecture and scalability journey described in this article was also presented at an AWS community event by members of our engineering team.
As the Principal Architect for the platform, I had the opportunity to contribute to the overall architecture, design decisions, and technical direction of the solution. However, I consciously chose not to participate as a presenter, instead providing an opportunity for members of the team to share their work, present the challenges we faced, and showcase the engineering solutions we implemented together.
I strongly believe that creating visibility and growth opportunities for engineers is an important aspect of technical leadership. Seeing the team confidently present the architecture, lessons learned, and business outcomes to a broader audience was particularly rewarding.
You can watch the session recording here:
Session Recording: Video Presentation on AWS Events
The presentation provides additional insights into the architecture, scalability challenges, implementation details, and key lessons learned during the migration to DynamoDB and Kinesis Data Streams.
