Strategies to prevent Terraform state drift

In a perfect IaC world, all changes would be made by updating code in a Git repo and pushing it through a CI/CD pipeline. However, customers sometimes ask me about combining manual operations with IaC. While this may sound like a bad idea at first, a workable model can be achieved with some good guidelines.

Why a hybrid approach may be desired

But before we figure out the how, let's discuss some scenarios in which a hybrid approach is preferable to a pure IaC approach.

The first scenario you can run into is that some changes may be too dynamic to push through a pipeline. Having to change code and go through a full deployment may be too heavy a process for changes that are minor and occur often. Whether this is the case is in the eye of the beholder. In a well-oiled IaC environment, this should rarely be the case in my opinion.

The second scenario is one where operations engineers should be allowed to make changes to certain aspects of the environment, but lack the skills to use IaC to accomplish this.

However, if you do not have such requirements, I consider a pure IaC approach the most desirable. Don't go "hybrid" just for the sake of it. A pure IaC approach has its advantages (versioning, validation, testing, etc.) and giving these up should not be taken lightly.

Prevent multiple control mechanisms

Probably the worst mistake to make in this area is to have multiple mechanisms control a single resource (or a single attribute of a resource). For each resource, you should make a deliberate choice whether you want Terraform to control its state, or whether it is managed outside of Terraform.

You do not want to end up having to "sync" changes between manual and IaC operations all the time. This is time-consuming, cumbersome and error-prone.

Limit manual operations through privileges

Once you have made a deliberate choice as to which parts to manage through Terraform and which through manual operations, it is best practice to solidify this in the privileges users have. Ideally, you would make sure that any attribute managed through Terraform cannot be updated manually by a user.
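As a rough illustration of what this could look like on AWS, the sketch below defines an IAM policy for an operations role. It assumes, purely for the sake of the example, that route entries are the part handled manually, while VPCs, subnets and route tables stay under Terraform control. The policy name and the exact list of actions are my own assumptions and would need to be adapted to your environment.

```hcl
# Hypothetical policy for an operations role; name and scope are illustrative only.
resource "aws_iam_policy" "ops_manual_routes" {
  name        = "ops-manual-route-management"
  description = "Allow manual route changes, deny changes to Terraform-managed network resources"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        # The part we deliberately leave to manual operations.
        Sid      = "AllowManualRouteEntries"
        Effect   = "Allow"
        Action   = ["ec2:CreateRoute", "ec2:ReplaceRoute", "ec2:DeleteRoute", "ec2:Describe*"]
        Resource = "*"
      },
      {
        # The part Terraform owns; an explicit deny overrides any allow granted elsewhere.
        Sid    = "DenyTerraformManagedNetworking"
        Effect = "Deny"
        Action = [
          "ec2:CreateVpc", "ec2:DeleteVpc", "ec2:ModifyVpcAttribute",
          "ec2:CreateSubnet", "ec2:DeleteSubnet", "ec2:ModifySubnetAttribute",
          "ec2:CreateRouteTable", "ec2:DeleteRouteTable"
        ]
        Resource = "*"
      }
    ]
  })
}
```

The identity your pipeline uses to run Terraform should of course not be subject to the deny statement.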

Exclude from Terraform entirely

The simplest approach, if possible, is to completely exclude the resource from our Terraform code. This way, creation, manipulation and deletion are handled entirely outside of our code. Separation of control is very clear, as each resource is either managed 100% through code, or 100% outside of code. A simple example would be to create a VPC, its subnets and its route tables through Terraform, but manually add or remove the actual route entries in the route tables.
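A minimal sketch of that example, assuming the AWS provider, could look like the following. The resource names and CIDR ranges are placeholders. The important part is what is missing: there are no inline route blocks and no aws_route resources, so the route entries themselves never enter our code.

```hcl
# Placeholder names and CIDR ranges, for illustration only.
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "private" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.1.0/24"
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  # Deliberately no "route" blocks here and no aws_route resources elsewhere:
  # route entries are added and removed manually, outside of Terraform.
}

resource "aws_route_table_association" "private" {
  subnet_id      = aws_subnet.private.id
  route_table_id = aws_route_table.private.id
}
```

As far as I understand the AWS provider documentation, omitting the route argument entirely means existing routes are ignored, whereas setting it to an empty list would make Terraform remove them. It is worth verifying this against the provider version you use.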

Exclude attributes of a resource

If we want to handle the creation and deletion of a resource through Terraform, but manage some of its attributes outside of Terraform, we have a more complex scenario. This could be the case if the part we want to manage outside of Terraform belongs to a more complex resource.

A common example is management of tags outside of Terraform. Maybe you're using an external solution to handle tag management of your cloud resources. Every time your tags change, this leads to state drift, which you need to resolve.

Another good example would be the instance size of an Aviatrix spoke gateway. Let's say you want your operations team to be able to resize these gateways through the Aviatrix controller UI, instead of having to modify the Terraform code. Doing so will result in state drift, as the following steps show.

Before resizing through the UI, the spoke gateway is a t3.medium, as described in our Terraform code.

Now, using the Aviatrix controller UI, I resize the spoke gateway from t3.medium to t3.large.

When we then execute a terraform plan, Terraform detects that the gateway size differs from the desired configuration and proposes to change it back.

In order to prevent Terraform from changing our gateway back to a t3.medium, we could alter our code to reflect the change to t3.large. This becomes a never-ending game of catch-up between manual operations on one side and our code and state file on the other. I strongly recommend against this, as it will inevitably lead to errors.

Instead, we can tell Terraform which parts of our spoke gateway to ignore. We can do this by leveraging the lifecycle meta-argument in our resource.

The lifecycle block describes how Terraform should deal with certain lifecycle aspects of our resource. ignore_changes is one of the lifecycle aspects we can influence here. By adding attributes to the ignore_changes list, we tell Terraform not to plan an update when those attributes deviate from what's recorded in the state file and the Terraform code. They are still applied when the resource is first created.
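A minimal sketch of such a resource is shown below. It assumes the Aviatrix Terraform provider and its aviatrix_spoke_gateway resource. The account name, gateway name, VPC ID, region and subnet are placeholders, and the exact set of required arguments may differ between provider versions.

```hcl
# All values are placeholders; check the argument list against your provider version.
resource "aviatrix_spoke_gateway" "spoke" {
  cloud_type   = 1 # assumed here to denote AWS
  account_name = "my-aws-account"
  gw_name      = "spoke-gw-1"
  vpc_id       = "vpc-0123456789abcdef0"
  vpc_reg      = "eu-west-1"
  gw_size      = "t3.medium" # used at creation time only
  subnet       = "10.0.1.0/24"

  lifecycle {
    # Size changes made through the controller UI no longer trigger an update.
    ignore_changes = [gw_size]
  }
}
```

Keep the list as short as possible. Every attribute you add to ignore_changes is an attribute Terraform will no longer correct for you.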

Now that we have added this statement, let's see what Terraform does.

Terraform still spots the difference between the state file and the real world. Upon apply, Terraform will update the state file to reflect the new status. However, it no longer tries to change the gateway size back to the value described in our Terraform code.
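The same technique can be applied to the tag management example mentioned earlier. The sketch below assumes the AWS provider and uses a placeholder VPC. The initial tags are set by Terraform at creation, after which an external tagging solution can change them without Terraform trying to revert them.

```hcl
# Placeholder resource, for illustration only.
resource "aws_vpc" "externally_tagged" {
  cidr_block = "10.1.0.0/16"

  # Applied when the VPC is created; later changes are left to the external tool.
  tags = {
    Name = "externally-tagged-vpc"
  }

  lifecycle {
    # Ignore all tag changes. A single key can be ignored instead,
    # for example: ignore_changes = [tags["CostCenter"]]
    ignore_changes = [tags]
  }
}
```

Depending on your provider version and whether you use provider-level default tags, you may also need to add tags_all to the list. I would treat that as something to test rather than assume.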