As described in the article on siloed architecture at Proglove, we went with a 100% SILO’d architecture, aka a multi-account setup. It’s probably easier to read that article before you start with this one.
However, this architecture comes with quite a few challenges to overcome:
- Costs & Architecture
- Observability: Monitoring & Logging
Costs & Architecture
As mentioned in the introduction, we’re running our architecture more than 2000 times. If we were to spin up even a minimal Kubernetes cluster in each tenant account, we would be bankrupt in a month. Just running the smallest possible EC2 instance for a month in each customer account would cost around $6,000. And let’s not talk about the management effort involved in maintaining 2000 EC2 instances across 2000 accounts.
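To make the arithmetic behind that figure explicit, here is a small sketch. The per-instance price is not stated in the text; it is simply what the quoted total implies, and `monthly_fixed_cost` is a hypothetical helper for illustration.

```python
ACCOUNTS = 2000
MONTHLY_EC2_BILL_USD = 6000  # total quoted in the text

# Price per always-on instance implied by the quoted total (an inference,
# not a published AWS price).
per_instance_usd = MONTHLY_EC2_BILL_USD / ACCOUNTS


def monthly_fixed_cost(accounts: int, per_account_usd: float) -> float:
    """Cost of anything billed per account whether or not it is used."""
    return accounts * per_account_usd


print(per_instance_usd)                                # 3.0
print(monthly_fixed_cost(ACCOUNTS, per_instance_usd))  # 6000.0
```

The same multiplication applies to every resource with an hourly price: a few dollars per account turns into thousands once it is repeated in every tenant account.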
Therefore, this kind of architecture is only possible with a) a 100% serverless application and b) EVERYTHING as configuration-as-code.
For us this means that everything we use and have running should have NO upfront and “hourly” costs. Some common AWS services that this precludes are Aurora, Kinesis (at least parts) and Redis.
On the cost front, we manage three KPIs: costs for active (paid) accounts, costs for inactive accounts and costs for a known benchmark. To keep the costs of inactive accounts down, we have to make sure that if nothing is used, nothing incurs costs. While that’s not quite possible, our architecture allows us to keep the costs per (unused) account at the minimum. And once the scans are rolling in, the accounts scale up instantly thanks to the magic of serverless ❤️.
Observability: Monitoring & Logging
Getting observability into this system also requires a different approach. We don’t have 3 endpoints to monitor, we have 6000. But as mentioned above, we don’t have to worry about cluster uptimes… because we can’t afford any clusters.
We’re currently running ~150 lambdas per account times 2000, meaning roughly 300,000 lambdas in the complete system.
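The fan-out is plain multiplication. In the sketch below, the figure of three endpoints per account is inferred from the "3 versus 6000" comparison above and is an assumption.

```python
ACCOUNTS = 2000
ENDPOINTS_PER_ACCOUNT = 3   # inferred from "3 endpoints" vs. "6000"
LAMBDAS_PER_ACCOUNT = 150   # approximate figure from the text

total_endpoints = ACCOUNTS * ENDPOINTS_PER_ACCOUNT
total_lambdas = ACCOUNTS * LAMBDAS_PER_ACCOUNT

print(total_endpoints)  # 6000
print(total_lambdas)    # 300000
```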
Just collecting/streaming the logs and extracting any sort of meaning from this sheer number of “things” is a serious challenge. You have to find a system that makes it easy to analyze logs and metrics for that number of systems, gives you a decent overview, and ideally is easy to integrate into each account. CloudWatch gets very pricey and difficult to use at this scale of log groups, lambdas and metrics.
We’ve chosen Datadog as our central operations tool, covering log ingestion, security monitoring, metric gathering and alerting. Datadog gives a great overview on both the metrics and the logs side. While we could use the metric ingestion Datadog provides, we had to build our own tooling to forward the logs.
Still, it’s a constant effort to keep this up and running across all of the accounts, and to fight the effects of scale that creep in. Having a single Datadog metric per account, for example, would already cost us a couple of hundred $ per month.
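The text does not describe how the log-forwarding tooling works. As a rough sketch of one common pattern, assume a Lambda subscribed to CloudWatch Logs: the subscription event arrives as base64-encoded, gzip-compressed JSON, which has to be decoded and tagged with the originating account before it is shipped on. The function names and the output record shape are hypothetical, and the actual send to the log backend is deliberately left out.

```python
import base64
import gzip
import json


def decode_subscription_event(event: dict) -> dict:
    """Unpack the base64 + gzip payload of a CloudWatch Logs subscription event."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def to_forwardable_records(payload: dict) -> list[dict]:
    """Flatten log events and tag each one with its account and log group."""
    return [
        {
            "account": payload["owner"],
            "log_group": payload["logGroup"],
            "log_stream": payload["logStream"],
            "timestamp": log_event["timestamp"],
            "message": log_event["message"],
        }
        for log_event in payload.get("logEvents", [])
    ]


def handler(event, context):
    payload = decode_subscription_event(event)
    # Control messages only check that the destination is reachable.
    if payload.get("messageType") != "DATA_MESSAGE":
        return []
    records = to_forwardable_records(payload)
    # Shipping the records to the log backend is omitted in this sketch.
    return records
```

Tagging every record with the account ID is what makes it possible to slice logs per tenant later, without keeping a separate pipeline per account.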
To be honest, deployment probably came as the biggest surprise. While we love Terraform, we quickly realized that AWS CloudFormation StackSets is the go-to tool in this setup. The basic setup is simple:
You define a stack as you normally would in AWS CloudFormation. We generally have a root stack for each service and then one to x nested stacks inside this root stack, described in several .yaml files - as SAM requires.
During development, we use normal stacks, deploying with CloudFormation directly. As soon as a change hits our production pipelines, we use StackSets. Here, we have separate “management” accounts that “hold” the stack set, and each customer account is registered with these. Once a change is deployed into a stack set, it’s applied to each so-called stack instance.
This works really well for a lower number of accounts, which is our understanding of the target use-case for stacksets. It’s great for deploying a small, rather static number of stack instances - for example for multi-region deployments.
We are, however, running over 2000 stack instances in different regions, and we struggle. Deployment times can be around 1 hour, even when configured to be fully parallel. Occasionally, we’ll get AWS errors on one or two stack instances, which just ruins an SRE’s day. And service quotas make our life pretty difficult - we can only run so many stack set operations at a time, etc.
On the other hand, we are successfully deploying to production multiple times a day with this setup. So while it feels like stretching the tool to its limit, it’s still working okay.
By the numbers, the benefits of our architecture (as outlined in the 100% silo article) may pale in comparison with the challenges we’ve shown here, but we still believe it’s the right choice for our (and many other) use-cases. Assuring our customers of the highest possible data-security & data-separation is worth a bit of effort in other areas. And not having to worry about tenant isolation does wonders for your development speed!
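For illustration, "fully parallel" in StackSets terms is controlled by the operation preferences passed with each stack set operation. The sketch below only builds the request for boto3's `update_stack_set`. The helper name and the default values are assumptions, not the configuration actually used here, and a real request would carry further fields such as capabilities and parameters.

```python
def build_update_request(
    stack_set_name: str,
    template_body: str,
    max_concurrent_pct: int = 100,
    failure_tolerance_pct: int = 100,
) -> dict:
    """Build keyword arguments for cloudformation.update_stack_set (sketch)."""
    return {
        "StackSetName": stack_set_name,
        "TemplateBody": template_body,
        "OperationPreferences": {
            # Deploy to all regions at once instead of one after another.
            "RegionConcurrencyType": "PARALLEL",
            # Share of accounts per region that may be updated concurrently.
            "MaxConcurrentPercentage": max_concurrent_pct,
            # In the default strict mode, a low failure tolerance also limits
            # the effective concurrency, so both values are raised together
            # here (assumed values).
            "FailureTolerancePercentage": failure_tolerance_pct,
        },
    }


request = build_update_request("example-service", "{}")  # hypothetical name
# With AWS credentials available, the request could be sent like this:
# import boto3
# boto3.client("cloudformation").update_stack_set(**request)
```

Even with these settings, the number of accounts updated at the same time can end up lower than requested because of service throttling, which is consistent with the long deployment times described above.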