
In the modern world, AWS accounts are often spawned by the hundreds, as it is best practice to limit the attack surface by using a multi-account strategy to segregate workloads. Managing and maintaining governance over infrastructure becomes more complicated as it reaches large scale. Control Tower (AWS's current version of a landing zone) goes a good way towards improving that. But as with most things AWS have done, they introduce the building blocks and then hand them to developers and architects to take further.

I thought I would, as a Cloud Architect, take the opportunity to dive into one of my favourite ways of thinking in Cloud terms now, an EVENT DRIVEN ARCHITECTURE.

I will explain the key things you need to know about an event driven architecture and then explain how they apply to building a centralised governance model (which is a major part of the landing zone architecture we have at Datacom).

 

What is an event driven architecture?

I won't go into a huge amount of detail on the definition, as there are plenty of good reads on that already. I recommend starting with AWS references such as this, which gives an insight into how microservices are a key element of event driven architectures.

In a nutshell, an event driven architecture is a system that responds to a material change in the state of something. The response can be a notification (e.g. an alert when a server becomes unresponsive) or an action (e.g. rebooting that server). Many people think of event driven approaches as being about triage and remediation, and that is certainly true in some cases. But an event driven architecture can also be used to deliver modern applications without the overhead of running EC2 instances.

In the case of a landing zone architecture, we have deployed an event driven approach to handle how accounts are baselined when they are created and to respond to events triggered by end users. By doing this we can scale the delivery of service with the workload as it grows, in an efficient and optimal way for customers.

Key events in a Landing Zone Deployment

When using an AWS Control Tower deployment, one of the key things to look out for is the landing zone's native event notifications. In the case of AWS Control Tower, the ones we look for the most are Control Tower lifecycle events. If you are wondering what a Control Tower lifecycle event is, the pattern diagram below should help. It is in essence a notification that a Control Tower action has completed. It includes the status of the action (i.e. success or failure), and these events are recorded in CloudTrail in the Control Tower account.

The pattern above is very high level, but it shows generally what happens in AWS Control Tower's Account Factory when an AWS account is vended. In the context of event driven architectures, the completion of the account creation is the event that matters. This is because once AWS have finished provisioning the account through Control Tower under the hood, there are still a number of things that can be done beyond that.

The Control Tower lifecycle event looks like this in CloudTrail:

{
  "eventVersion": "1.08",
  "userIdentity": {
    "accountId": "123456789012",
    "invokedBy": "AWS Internal"
  },
  "eventTime": "2022-03-11T00:09:47Z",
  "eventSource": "controltower.amazonaws.com",
  "eventName": "CreateManagedAccount",
  "awsRegion": "ap-southeast-2",
  "sourceIPAddress": "AWS Internal",
  "userAgent": "AWS Internal",
  "requestParameters": null,
  "responseElements": null,
  "eventID": "112243-3937-40c1-8fe1-3234012349",
  "readOnly": false,
  "eventType": "AwsServiceEvent",
  "managementEvent": true,
  "recipientAccountId": "123456789012",
  "serviceEventDetails": {
    "createManagedAccountStatus": {
      "organizationalUnit": {
        "organizationalUnitName": "Production Workload",
        "organizationalUnitId": "ou-1u2k-gz45ae8e"
      },
      "account": {
        "accountName": "dev-prod6",
        "accountId": "987654321012"
      },
      "state": "SUCCEEDED",
      "message": "AWS Control Tower successfully created an enrolled account.",
      "requestedTimestamp": "2022-03-10T23:56:08+0000",
      "completedTimestamp": "2022-03-11T00:09:47+0000"
    }
  },
  "eventCategory": "Management"
}
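
The fields that matter for automation can be pulled out of a record like the one above with a few lines of Python. This is a minimal sketch rather than the code we run: the `ManagedAccountEvent` class and `parse_create_managed_account` function are hypothetical names, and the field paths are taken directly from the sample event.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ManagedAccountEvent:
    """The parts of a CreateManagedAccount record a baseline function needs."""
    account_id: str
    account_name: str
    ou_name: str
    state: str


def parse_create_managed_account(record: dict) -> Optional[ManagedAccountEvent]:
    """Return the new account's details, or None for any other event name."""
    if record.get("eventName") != "CreateManagedAccount":
        return None
    status = record["serviceEventDetails"]["createManagedAccountStatus"]
    return ManagedAccountEvent(
        account_id=status["account"]["accountId"],
        account_name=status["account"]["accountName"],
        ou_name=status["organizationalUnit"]["organizationalUnitName"],
        state=status["state"],
    )
```

When the same record arrives through EventBridge it is wrapped in an envelope, so a Lambda function would pass `event["detail"]` to this parser rather than the whole event.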

From this event you can respond using an EventBridge rule within the Control Tower account. If you configure a rule in EventBridge to look for this event name, you can then trigger a Lambda function to perform actions after the fact. Below the rule I have listed a few of the things we do in response to this event. The event rule would appear as follows and would be targeted at an appropriate Lambda function to perform the changes:

{
  "source": ["aws.controltower"],
  "detail-type": ["AWS Service Event via CloudTrail"],
  "detail": {
    "eventName": ["CreateManagedAccount"]
  }
}
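
For anyone who prefers to create the rule programmatically, the pattern above can be registered with the EventBridge `put_rule` and `put_targets` calls. The sketch below assumes `events_client` is a `boto3.client("events")` created in the Control Tower account; the rule name, target ID and Lambda ARN are placeholders, not values from our deployment.

```python
import json

# Same pattern as shown above, expressed as a Python dict.
LIFECYCLE_PATTERN = {
    "source": ["aws.controltower"],
    "detail-type": ["AWS Service Event via CloudTrail"],
    "detail": {"eventName": ["CreateManagedAccount"]},
}


def register_rule(events_client, rule_name: str, lambda_arn: str) -> None:
    """Create (or update) the rule and point it at a Lambda function.

    events_client is expected to behave like boto3.client("events"); it is
    passed in so the function can be exercised without AWS credentials.
    """
    events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(LIFECYCLE_PATTERN),
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        # "baseline-function" is a hypothetical target ID.
        Targets=[{"Id": "baseline-function", "Arn": lambda_arn}],
    )
```

The Lambda function also needs a resource-based permission that lets `events.amazonaws.com` invoke it; that step is left out here to keep the sketch short.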

 

| Action | Description |
| --- | --- |
| Enable EBS volume default encryption | When a new account is created, we trigger a function from the lifecycle event to enable the account-level EBS setting for default encryption. |
| Create Azure Active Directory groups for AWS Single Sign-On | As part of our integration with third-party providers like Azure AD, we have functions that, when enabled, automatically create Azure AD groups that can be provisioned back to AWS Single Sign-On for access. |
| Enable S3 bucket default encryption | Similar to EBS default encryption, we automatically enable appropriate default encryption for S3 buckets using either SSE-KMS or SSE-S3. |
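
As an illustration of the first action in the table, here is a hedged sketch of a Lambda handler that turns on EBS default encryption in the newly vended account. It assumes the function runs in the Control Tower account and is allowed to assume a role in the new account. `AWSControlTowerExecution` is used here only as a convenient example; a dedicated least-privilege role would be the better choice. The `session_factory` parameter is a hypothetical hook that makes the handler testable without AWS access.

```python
ROLE_NAME = "AWSControlTowerExecution"  # assumption: assumable in the new account


def session_for_account(account_id: str, role_name: str = ROLE_NAME):
    """Assume a role in the target account and return a boto3 session."""
    import boto3  # imported lazily so the handler logic can be tested offline

    creds = boto3.client("sts").assume_role(
        RoleArn=f"arn:aws:iam::{account_id}:role/{role_name}",
        RoleSessionName="account-baseline",
    )["Credentials"]
    return boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )


def handler(event, context, session_factory=session_for_account):
    """Lambda entry point for the CreateManagedAccount lifecycle event."""
    status = event["detail"]["serviceEventDetails"]["createManagedAccountStatus"]
    if status["state"] != "SUCCEEDED":
        # Nothing to baseline if account creation did not succeed.
        return {"skipped": status["state"]}

    account_id = status["account"]["accountId"]
    region = event["detail"]["awsRegion"]
    ec2 = session_factory(account_id).client("ec2", region_name=region)
    # EBS default encryption is a per-region setting, so a fuller baseline
    # would repeat this call for every governed region.
    ec2.enable_ebs_encryption_by_default()
    return {"account": account_id, "region": region, "ebs_default_encryption": True}
```

Lambda only ever passes `event` and `context`, so the default `session_factory` is what runs in AWS; the third argument exists purely so a stub session can be injected in tests.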

Events occurring after Account Setup

The above is a nice way to get started with managing AWS accounts at scale, but the real power of event driven architectural thinking is responding to events as they happen, in near real time. The response can be a remediation action, or it can add governance and baselines to resources created by end users after provisioning.

This post is not going to focus on fixing what's broken (I plan to cover that shortly), but there are two key elements of an event driven approach I want to highlight:

1. Responding to a resource creation

2. Remediating an unauthorised Configuration Change

Resource Creation Events

These two items are design patterns in themselves. For the resource creation item I have two architectural patterns to show. The first is the creation of a new VPC in any account under management, where we respond by adding VPC Flow Logs to that VPC (leveraging a dedicated S3 bucket). The second is the creation of a new S3 bucket, where we respond by enabling access logs to a dedicated S3 bucket.

 

 

Both of these are cross-account architectures, which is a key element of being able to automate and govern at scale with as little overhead as possible.

 

 

When a new VPC is created it triggers an event in CloudTrail, and EventBridge can be configured to respond to that and trigger the workflow demonstrated above. We have this deployed to create VPC Flow Logs in CloudWatch and S3.
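
The Lambda side of that workflow might look roughly like the sketch below. It assumes the function receives the CloudTrail `CreateVpc` record via EventBridge, that `ec2_client` is a boto3 EC2 client created from a role assumed in the account that owns the VPC, and that `bucket_arn` points at the dedicated log bucket. The function names are hypothetical.

```python
def vpc_id_from_event(event: dict) -> str:
    """Extract the new VPC's ID from a CreateVpc event delivered by EventBridge."""
    return event["detail"]["responseElements"]["vpc"]["vpcId"]


def enable_flow_logs(ec2_client, vpc_id: str, bucket_arn: str) -> None:
    """Send all traffic metadata for the VPC to an S3 bucket.

    ec2_client is expected to behave like boto3.client("ec2") in the
    account and region where the VPC was created.
    """
    ec2_client.create_flow_logs(
        ResourceIds=[vpc_id],
        ResourceType="VPC",
        TrafficType="ALL",  # capture both accepted and rejected traffic
        LogDestinationType="s3",
        LogDestination=bucket_arn,
    )
```

A handler would call `vpc_id_from_event(event)` and pass the result to `enable_flow_logs`. Delivering to CloudWatch Logs instead uses the same call with a different destination type plus a log group and delivery role.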

 

In a similar workflow to the VPC creation event, the creation of an S3 bucket triggers an event rule that forwards to the centralised account, where the Lambda function assumes an appropriate role to perform the updates.
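
A minimal sketch of the bucket-logging step follows, assuming the function receives the CloudTrail `CreateBucket` record via EventBridge and that `s3_client` is a boto3 S3 client built from the assumed role. The target bucket name is a placeholder, and the per-bucket prefix is just one possible layout. Note that S3 server access logging expects the target bucket to be in the same region and account as the source bucket, so the "dedicated bucket" here is assumed to be one per account and region.

```python
def bucket_name_from_event(event: dict) -> str:
    """Extract the new bucket's name from a CreateBucket event."""
    return event["detail"]["requestParameters"]["bucketName"]


def enable_access_logging(s3_client, bucket: str, target_bucket: str) -> bool:
    """Turn on server access logging; return False if the bucket was skipped."""
    if bucket == target_bucket:
        # Do not point the log bucket at itself.
        return False
    s3_client.put_bucket_logging(
        Bucket=bucket,
        BucketLoggingStatus={
            "LoggingEnabled": {
                "TargetBucket": target_bucket,
                # One prefix per source bucket keeps the logs separable.
                "TargetPrefix": f"{bucket}/",
            }
        },
    )
    return True
```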

A key requirement is that you also have appropriate roles for each function to operate. For our purposes we use AWS CloudFormation StackSets to deploy specific roles for the functions we use. To satisfy the principle of least privilege, I recommend one role per function that needs cross-account access. That way each role can be given exactly the granular access it needs.

 

Resource Change Remediation Events

Finally, I will touch a little on how we can use a similar approach to handle AWS Config changes (such as a particular Config rule becoming non-compliant). AWS Config records and aggregates all resources within accounts and regions (when enabled). In Control Tower, Config is set up to operate in governed regions. This is helpful for remediating issues because Config records the state change (the event) and allows notification or action.

AWS Security Hub relies heavily on AWS Config for its compliance rules, and AWS have a good solution for remediating compliance failures inside Security Hub (check it out here). For the purpose of this article, though, I will just touch at a high level on how that sort of solution works in an event driven approach.

The event in this case is a Config rule going from compliant to non-compliant. For this example I am using the rule that flags an AWS security group with port 3389 open to 0.0.0.0/0 (a big no-no from a security point of view).

At a high level, AWS Config uses rules and remediation actions to handle the response to configuration changes. These can be created manually or by using an AWS Config conformance pack, which creates benchmark rules for the appropriate standard. For a custom rule, remediation actions can be created, and these can use either an AWS managed action or a custom action. For the example above we would use a custom action with Lambda to revoke the rule change. The pattern would look something like this:
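
To make the custom action concrete, here is a hedged sketch of the logic such a Lambda function could run once it knows which security group is non-compliant. It is not the AWS-provided remediation. The function names are hypothetical, `ec2_client` is assumed to be a boto3 EC2 client for the account that owns the group, and only TCP and "all traffic" rules are considered.

```python
RDP_PORT = 3389
OPEN_CIDR = "0.0.0.0/0"


def open_rdp_permissions(ip_permissions: list) -> list:
    """Return revoke-ready entries for ingress rules exposing RDP to the world."""
    offending = []
    for perm in ip_permissions:
        protocol = perm.get("IpProtocol")
        if protocol == "-1":
            covers_rdp = True  # "all traffic" rules include port 3389
        elif protocol == "tcp":
            covers_rdp = perm.get("FromPort", 0) <= RDP_PORT <= perm.get("ToPort", 65535)
        else:
            covers_rdp = False
        is_open = any(r.get("CidrIp") == OPEN_CIDR for r in perm.get("IpRanges", []))
        if covers_rdp and is_open:
            # Keep only the 0.0.0.0/0 range so other CIDRs on the rule survive.
            entry = {"IpProtocol": protocol, "IpRanges": [{"CidrIp": OPEN_CIDR}]}
            if "FromPort" in perm:
                entry["FromPort"] = perm["FromPort"]
                entry["ToPort"] = perm["ToPort"]
            offending.append(entry)
    return offending


def remediate(ec2_client, group_id: str) -> int:
    """Revoke the offending rules and return how many were removed."""
    group = ec2_client.describe_security_groups(GroupIds=[group_id])["SecurityGroups"][0]
    offending = open_rdp_permissions(group["IpPermissions"])
    if offending:
        # A rule covering a port range is revoked as a whole, not just 3389.
        ec2_client.revoke_security_group_ingress(GroupId=group_id, IpPermissions=offending)
    return len(offending)
```

Separating the pure filter from the API calls makes it easy to unit test which rules would be removed before the function is allowed to change anything.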

 

These Config Rule actions can be targeted to specific resources or any resource monitored by AWS Config. 

 

Conclusion

In wrapping up what I hope was an interesting read, and the first of several discussions on event driven infrastructure management, I want to stress that an event driven approach is key to managing and governing hundreds or thousands of AWS accounts, each with hundreds or thousands of resources.

In a landing zone architecture you don't want to be an ogre of a controller who also serves as a bottleneck for new resources being created. I have seen that in the past, where things that take 5-10 minutes to do are delayed by days due to slow-moving processes. Ideally, a self-service architecture would let users create and deploy resources themselves. Combining that with an event driven approach allows you to quickly apply governance, standards and requirements to a resource when it is provisioned, completely transparently to the end user.

It's not just about application delivery (although that is crucial, and I will start discussing how that factors in in the future). It is also about building a landing zone that can respond in near real time to changes whilst giving users the most flexibility to operate within it.