The concept of IaC (Infrastructure as Code) has actually been around for a long time. This article briefly reviews the past, present, and future of IaC.

IaC’s Past

IaC has a fairly long history. First, let's look at its core features.

  1. The end product is machine readable. It could be a piece of code, or it could be a provisioning file.
  2. Because the product is machine readable, it can rely on existing VCS systems (SVN, Git) for versioning.
  3. Because the product is machine readable, it can rely on existing CI/CD systems (Jenkins, Travis CI, etc.) for continuous integration/continuous delivery.
  4. State consistency, i.e. idempotency. In theory, the final result of building the same code with the same set of parameters should always be the same.
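The fourth point, idempotency, can be illustrated with a minimal sketch in Python. This is a toy model, not any real tool's implementation: an `apply` function converges a machine's package set toward a desired state, so running it a second time with the same inputs performs no work.

```python
def apply(current_state: set, desired_state: set) -> tuple[set, list]:
    """Converge current_state toward desired_state, returning the new
    state and the list of actions actually performed."""
    actions = []
    for pkg in sorted(desired_state - current_state):
        actions.append(f"install {pkg}")
    for pkg in sorted(current_state - desired_state):
        actions.append(f"remove {pkg}")
    return set(desired_state), actions

# The first run performs real work...
state, actions = apply({"vim"}, {"nginx", "vim"})
# ...while a second run with the same desired state is a no-op.
state, actions2 = apply(state, {"nginx", "vim"})
```

This "describe the target state, let the tool compute the delta" shape is exactly what distinguishes IaC from a one-shot shell script.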

Through these core features, we can now understand why IaC arose: it emerged amid the rapid iteration of the Internet world after the millennium, when traditional manual maintenance faced several problems.

  1. Interactive changes introduce too much human error, making changes uncontrollable.
  2. Manual changes cannot keep up with the rapid iteration of Infra.
  3. Interactive changes are difficult to track, making version control and similar tools empty talk.

Against this backdrop, everyone pursued a more technical and elegant means of solving these problems. Thus, the concept of IaC emerged.

If I were to divide the history of IaC into stages, I would divide it as follows.

  1. The slash-and-burn stage.
  2. Modern IaC.

As mentioned earlier, IaC grew out of a spontaneous drive: faced with uncertainty, we choose to use code to eliminate as much of that uncertainty as possible (a principle that continues to this day).

In the earliest days, people used the most basic forms of code to do the work of IaC. This stage was characterized by precise, procedural descriptions of the interactive steps that came before it. One might do all of this directly in bash, or write the required procedural description as a thin wrapper on top of a framework like Python's Fabric.
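A sketch of what this era looked like, in the Fabric spirit but kept self-contained: every step is an explicit imperative command, and the hypothetical `run()` helper here merely records commands instead of executing them over SSH.

```python
# A procedural provisioning "script of record", typical of the
# bash/Fabric-era of IaC: the *how* is spelled out step by step.
def provision_web_server(host: str) -> list[str]:
    executed = []

    def run(cmd: str):
        # In a real Fabric script this would execute over SSH;
        # here we only record the command for illustration.
        executed.append(f"{host}: {cmd}")

    run("apt-get update")
    run("apt-get install -y nginx")
    run("systemctl enable --now nginx")
    return executed
```

Note the contrast with the idempotent model: if a step fails halfway, re-running the script repeats every command from the top, whether needed or not.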

But looking back at this phase, we can intuitively see some defects.

  1. Code reusability is poor.
  2. Each team has its own homegrown IaC infrastructure, with no unified industry standard, resulting in a high barrier to entry for newcomers.

Faced with this set of problems, more modern IaC tools came into being. Some typical products include:

  1. Ansible
  2. Chef
  3. Puppet

These tools each make design trade-offs (such as the choice between Pull and Push models), but their core features are the same:

  1. The framework provides common internal features such as SSH connection management, multi-machine parallel execution, automatic retry, etc.
  2. On top of these basic features, it provides a DSL, letting developers focus on the logic of IaC rather than low-level details.
  3. It is open source and has a mature plugin mechanism, allowing the community to build a richer ecosystem on top of it. For example, the SDN community has provided various playbooks for switches based on Ansible.
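The first feature, multi-machine parallel execution with automatic retry, can be sketched in a few lines of Python. This is a toy illustration of the pattern, not any framework's actual internals:

```python
import concurrent.futures

def run_with_retry(task, host, retries=3):
    """Run task(host), retrying on failure, as frameworks like
    Ansible/Chef do internally for flaky connections."""
    for attempt in range(1, retries + 1):
        try:
            return task(host)
        except Exception:
            if attempt == retries:
                raise

def run_on_all(task, hosts, max_workers=8):
    """Execute the same task across many hosts in parallel."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(hosts, pool.map(lambda h: run_with_retry(task, h), hosts)))

# A stand-in task; a real one would SSH in and apply configuration.
results = run_on_all(lambda h: f"ok:{h}", ["web1", "web2", "web3"])
```

Hiding this plumbing behind a DSL is exactly what lets developers describe *what* each host should look like rather than *how* to reach it.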

By this point, IaC had reached a relatively mature level of development, and many of these tools are still in use today.

The New Generation of IaC

On August 25, 2006, Amazon officially announced the availability of EC2, and the entire infrastructure world fast-forwarded into the Cloud era. Since then, cloud vendors have provided all kinds of services, and through more than a decade of evolution, service models such as IaaS, PaaS, DaaS, and FaaS have been created. These service models make building our infrastructure easier and faster, but they also bring some new problems.

Perhaps some readers have already spotted the problem: as we obtain computing power and resources ever more quickly, how do we manage those resources?

To solve this problem, we need to consider how to manage these resources with code or declarative configuration.

Initially, as described above, each of us would wrap our own IaC tools around the APIs and SDKs provided by each cloud vendor. This posed some familiar problems:

  1. Code reusability is poor.
  2. Each company has its own homegrown IaC infrastructure, with no unified industry standard, resulting in a high barrier to entry for newcomers.

The need for new IaC tools for cloud resource management thus became more and more urgent, and a new tool, Terraform, was born.

In Terraform, launching an EC2 instance can be defined in a short snippet like this:

resource "aws_vpc" "my_vpc" {
  cidr_block = "172.16.0.0/16"

  tags = {
    Name = "tf-example"
  }
}

resource "aws_subnet" "my_subnet" {
  vpc_id            = aws_vpc.my_vpc.id
  cidr_block        = "172.16.10.0/24"
  availability_zone = "us-west-2a"

  tags = {
    Name = "tf-example"
  }
}

resource "aws_network_interface" "foo" {
  subnet_id   = aws_subnet.my_subnet.id
  private_ips = ["172.16.10.100"]

  tags = {
    Name = "primary_network_interface"
  }
}

resource "aws_instance" "foo" {
  ami           = "ami-005e54dee72cc1d00" # us-west-2
  instance_type = "t2.micro"

  network_interface {
    network_interface_id = aws_network_interface.foo.id
    device_index         = 0
  }

  credit_specification {
    cpu_credits = "unlimited"
  }
}

From this foundation, we can continue to describe the rest of our infrastructure, such as databases, Redis, and MQ, in code to improve the efficiency of our resource maintenance.

Also, as SaaS products evolve, developers are trying to configure these SaaS services in code as well. Taking Terraform as an example, we can use Terraform Providers, such as the Provider from New Relic or the one from Bytebase.

Moreover, once an IaC tool helps us standardize the infrastructure description, we can do more interesting things on top of it. For example, we can calculate the change in resource spend for each resource change with Infracost, or do advanced work like centralized resource changes with Atlantis.

So far, we have enough IaC products to meet most of our needs. Does that mean IaC development has reached a point of relative completeness? The answer is clearly no.

The Future of IaC

So let’s talk about some of the problems facing IaC products today, and some of my thoughts on the future.

Deficiency 1: Limitations of the existing DSL-based syntax

Let me show you an example.

locals {
  dns_records = {
    # "demo0" : 0,
    "demo1" : 1,
    "demo2" : 2
    "demo3" : 3,
  }
  lb_listener_port  = 80
  instance_rpc_port = 9545

  default_target_group_attr = {
    backend_protocol     = "HTTP"
    backend_port         = 9545
    target_type          = "instance"
    deregistration_delay = 10
    protocol_version     = "HTTP1"
    health_check = {
      enabled             = true
      interval            = 15
      path                = "/status"
      port                = 9545
      healthy_threshold   = 3
      unhealthy_threshold = 3
      timeout             = 5
      protocol            = "HTTP"
      matcher             = "200-499"
    }
  }
}

module "alb" {
  source  = "terraform-aws-modules/alb/aws"
  version = "~> 6.0"

  name                       = "alb-demo-internal-rpc"
  load_balancer_type         = "application"
  internal                   = true
  enable_deletion_protection = true


  http_tcp_listeners = [
    {
      protocol           = "HTTP"
      port               = local.lb_listener_port
      target_group_index = 0
      action_type        = "forward"
    }
  ]

  http_tcp_listener_rules = concat([
    for rec, pos in local.dns_records : {
      http_tcp_listener_index = 0
      priority                = 105 + tonumber(pos)
      actions = [
        {
          type               = "forward"
          target_group_index = tonumber(pos)
        }
      ]
      conditions = [
        {
          host_headers = ["${rec}.manjusaka.me"]
        }
      ]

    }
    ], [{
      http_tcp_listener_index = 0
      priority                = 120
      actions = [
        {
          type = "weighted-forward"
          target_groups = [
            {
              target_group_index = 0
              weight             = 95
            },
            {
              target_group_index = 5
              weight             = 4
            },
          ]
        }
      ]
      conditions = [
        {
          host_headers = ["demo0.manjusaka.me"]
        }
      ]
  }])

  target_groups = [
    merge(
      {
        name_prefix = "demo0"
        targets = {
          "demo0-${module.ec2_instance_demo[0].tags_all["Name"]}" = {
            target_id = module.ec2_instance_demo[0].id
            port      = local.instance_rpc_port
          }
        }
      },
      local.default_target_group_attr,
    ),
    merge(
      {
        name_prefix = "demo1"
        targets = {
          "demo1-${module.ec2_instance_demo[0].tags_all["Name"]}" = {
            target_id = module.ec2_instance_demo[0].id
            port      = local.instance_rpc_port
          }
        }
      },
      local.default_target_group_attr,
    ),
    merge(
      {
        name_prefix = "demo2"
        targets = {
          "demo2-${module.ec2_family_c[0].tags_all["Name"]}" = {
            target_id = module.ec2_family_c[0].id
            port      = local.instance_rpc_port
          },
        }
      },
      local.default_target_group_attr,
    ),

    merge(
      {
        name_prefix = "demo3"
        targets = {
          "demo3-${module.ec2_family_d[0].tags_all["Name"]}" = {
            target_id = module.ec2_family_d[0].id
            port      = local.instance_rpc_port
          },
        }
      },
      local.default_target_group_attr,
    ), # target_group_index_3
    merge(
      {
        name_prefix = "demonew"
        targets = {
          "demo0-${module.ec2_instance_reader[0].tags_all["Name"]}" = {
            target_id = module.ec2_instance_reader[0].id
            port      = local.instance_rpc_port
          }
        }
      },
      local.default_target_group_attr,
    ),
  ]
}

This TF configuration may look long, but what it does is very simple: it forwards traffic to different instances based on the different *.manjusaka.me domains, and for the domain demo0.manjusaka.me it additionally splits off a small weighted share of traffic as a canary release.

The problem we can see with a DSL solution like Terraform is that its expressiveness is significantly limited in such dynamic and flexible scenarios.

The community is fully aware of this problem. That is why IaC products like Pulumi, built on general-purpose programming languages such as Python/Lua/Go/TS, were created. For example, let's rewrite the above example in Pulumi + Python (powered by ChatGPT here):

from pulumi_aws import alb

dns_records = {
    # "demo0" : 0,
    "demo1": 1,
    "demo2": 2,
    "demo3": 3,
}
lb_listener_port = 80
instance_rpc_port = 9545

default_target_group_attr = {
    "backend_protocol": "HTTP",
    "backend_port": 9545,
    "target_type": "instance",
    "deregistration_delay": 10,
    "protocol_version": "HTTP1",
    "health_check": {
        "enabled": True,
        "interval": 15,
        "path": "/status",
        "port": 9545,
        "healthy_threshold": 3,
        "unhealthy_threshold": 3,
        "timeout": 5,
        "protocol": "HTTP",
        "matcher": "200-499",
    },
}

alb_module = alb.ApplicationLoadBalancer(
    "alb",
    name="alb-demo-internal-rpc",
    load_balancer_type="application",
    internal=True,
    enable_deletion_protection=True,
    http_tcp_listeners=[
        {
            "protocol": "HTTP",
            "port": lb_listener_port,
            "target_group_index": 0,
            "action_type": "forward",
        }
    ],
    http_tcp_listener_rules=[
        {
            "http_tcp_listener_index": 0,
            "priority": 105 + pos,
            "actions": [
                {
                    "type": "forward",
                    "target_group_index": pos,
                }
            ],
            "conditions": [
                {
                    "host_headers": [f"{rec}.manjusaka.me"],
                }
            ],
        }
        for rec, pos in dns_records.items()
    ]
    + [
        {
            "http_tcp_listener_index": 0,
            "priority": 120,
            "actions": [
                {
                    "type": "weighted-forward",
                    "target_groups": [
                        {"target_group_index": 0, "weight": 95},
                        {"target_group_index": 5, "weight": 4},
                    ],
                }
            ],
            "conditions": [{"host_headers": ["demo0.manjusaka.me"]}],
        }
    ],
    target_groups=[
        alb.TargetGroup(
            f"demo0-{module.ec2_instance_demo[0].tags_all['Name'].apply(lambda x: x)}",
            name_prefix="demo0",
            targets=[
                {
                    "target_id": module.ec2_instance_demo[0].id,
                    "port": instance_rpc_port,
                }
            ],
            **default_target_group_attr,
        ),
        alb.TargetGroup(
            f"demo1-{module.ec2_instance_demo[0].tags_all['Name'].apply(lambda x: x)}",
            name_prefix="demo1",
            targets=[
                {
                    "target_id": module.ec2_instance_demo[0].id,
                    "port": instance_rpc_port,
                }
            ],
            **default_target_group_attr,
        ),
        alb.TargetGroup(
            f"demo2-{module.ec2_family_c[0].tags_all['Name'].apply(lambda x: x)}",
            name_prefix="demo2",
            targets=[
                {
                    "target_id": module.ec2_family_c[0].id,
                    "port": instance_rpc_port,
                }
            ],
            **default_target_group_attr,
        ),
        alb.TargetGroup(
            f"demo3-{module.ec2_family_d[0].tags_all['Name'].apply(lambda x: x)}",
            name_prefix="demo3",
            targets=[
                {
                    "target_id": module.ec2_family_d[0].id,
                    "port": instance_rpc_port,
                }
            ],
            **default_target_group_attr,
        ),
        alb.TargetGroup(
            f"demo0-{module.ec2_instance_reader[0].tags_all['Name'].apply(lambda x: x)}",
            name_prefix="demonew",
            targets=[
                {
                    "target_id": module.ec2_instance_reader[0].id,
                    "port": instance_rpc_port,
                }
            ],
            **default_target_group_attr,
        ),
    ],
)

As you can see, the overall usage is much closer to our everyday programming habits, and its expressive power is better.

Deficiency 2: The gap between IaC and business requirements

IaC tools in the cloud era are mostly concerned with the existence of infrastructure. There is still a relatively large gap when it comes to orchestrating and making better use of that infrastructure: how do we deploy applications onto these resources, and how do we schedule them? These are actually very interesting problems.

Perhaps surprisingly, Kubernetes/Nomad are actually trying to solve exactly this kind of problem. Some people may be thinking: “What? Are these IaC tools?” If you don't believe me, check the core features of IaC we listed earlier:

  1. The end product is machine readable. It may be a piece of code, or it may be a provisioning file.
  2. The machine-readable product can rely on existing VCS systems (SVN, Git) for versioning (manifests live alongside the repository).
  3. The machine-readable product can rely on existing CI/CD systems (Jenkins, Travis CI) for continuous integration/continuous delivery (platforms such as Argo CD provide further support).

At the same time, we can declare in the corresponding configuration file how much CPU/memory we need, which local or remote disks, which gateways, and so on. This framework abstracts computational Infra in a relatively generic way, so that 80% of business scenarios do not need to consider the details of the underlying Infra.
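To make that concrete, here is roughly what such a declaration looks like in Kubernetes terms, expressed as a Python dict instead of YAML. The field names follow the real Deployment container schema; the values and names are illustrative:

```python
# Declaring compute requirements in configuration: the scheduler
# reads "requests" to place the pod, and the kubelet enforces "limits".
container_spec = {
    "name": "web",
    "image": "nginx:1.25",
    "resources": {
        "requests": {"cpu": "500m", "memory": "256Mi"},  # used for scheduling
        "limits":   {"cpu": "1",    "memory": "512Mi"},  # hard caps at runtime
    },
    # Disk needs are declared the same way, via mounted volumes.
    "volumeMounts": [{"name": "data", "mountPath": "/var/lib/app"}],
}
```

The business only states what it needs; where the pod lands and which physical machine backs the volume are the platform's concern.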

But this set of solutions also comes with some problems, such as a spike in complexity, the cost of self-hosted operations and maintenance, and some abstraction leaks.

Deficiency 3: Uneven quality

The scope of the new cloud-era IaC is much larger and more ambitious than that of traditional IaC tools such as Ansible. A side effect of this is uneven quality. This topic can be divided into two aspects.

First, IaC tools such as Terraform support the AWS/Azure/GCP platforms through official Providers. But even with official support, some of the logic designed into a Provider is inconsistent with the logic of the platform's interactive interface. For example, I have complained before that delete protection for an Aurora DB instance is turned on by default when created from the Console, while it is turned off by default in TF. This puts an extra mental burden on developers.

Second, IaC tools rely heavily on the community (which here includes both open source communities and commercial companies of all kinds). Unlike old-timers such as Ansible, where the quality of the surrounding ecosystem is relatively stable, the quality of the new generation of IaC peripherals such as Terraform is hard to guarantee. For example, the Providers of Chinese vendors such as Alibaba Cloud, Huawei Cloud, and Tencent Cloud have long been criticized, and many large developer-facing SaaS platforms do not have an officially maintained Provider (e.g. New Relic).

At the same time, some features provided by cloud vendors actually conflict with general-purpose IaC tools. For example, AWS WAF can intercept traffic based on IPSets; if an IPSet is very large, describing it with a generic IaC tool would be disastrous. For such scenarios, we can only wrap the cloud vendor's own SDK, and then we are at the mercy of that SDK's quality. If the SDK is as poorly designed as Alibaba Cloud's, then you are on your own.
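A thin wrapper around the SDK for a large IPSet typically boils down to computing a diff locally and pushing the full desired address list in one call. A hedged sketch: the diff logic below is plain Python, while the actual boto3 `wafv2` call is shown only as a comment, since it needs real AWS credentials and a fresh LockToken.

```python
def plan_ipset_update(current: set, desired: set) -> dict:
    """Compute what an IPSet update would change before pushing it,
    so a huge list never has to be spelled out in an IaC DSL."""
    return {
        "to_add": sorted(desired - current),
        "to_remove": sorted(current - desired),
        "final": sorted(desired),
    }

plan = plan_ipset_update(
    current={"203.0.113.1/32", "203.0.113.2/32"},
    desired={"203.0.113.2/32", "203.0.113.3/32"},
)

# With boto3, the push itself would look roughly like:
#   client = boto3.client("wafv2")
#   client.update_ip_set(Name=..., Scope="REGIONAL", Id=...,
#                        Addresses=plan["final"], LockToken=...)
```

The point is not the code but the shape: for this workload, the source of truth lives in the SDK call, not in a declarative manifest.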

Deficiency 4: Facing the lack of developer experience

Developer experience is a relatively hot topic right now; after all, no one wants to spend their precious life doing repetitive work. So far, the mainstream IaC tools are built For Production Server rather than For Developer Experience, resulting in a mediocre experience when we use them for development workflows.

For example, suppose we need to launch a batch of EC2 instances on AWS as development machines for our R&D colleagues. Ensuring that those colleagues can use the machines out of the box is a big problem.

Although we can provide a relatively unified environment through pre-built images and the like, once we need to fine-tune the environment further, it becomes much more painful.

For similar scenarios, older tools like Nix and newer ones like envd solve some of these problems. But for now, there is still a gap between them and existing IaC products, and how to bridge that gap may be an interesting topic to follow.

Deficiency 5: Shortcomings when facing new technology stacks

The most typical case is the Serverless scenario. For example, suppose I have a simple requirement: implement simple SSR rendering using Lambda.

export default function BlogPosts({ posts }) {
  return posts.map(post => <BlogPost key={post.id} post={post} />)
}

export async function getServerSideProps() {
  const posts = await getBlogPosts();
  return {
    props: { posts }
  }
}

The function itself is very simple, but deploying it to a production environment is a bit of a pain. Let's think about what kind of infra we need to prepare for this simple function:

  1. A Lambda instance
  2. An S3 bucket
  3. An API Gateway and routing rules
  4. CDN access (optional)
  5. DNS records

With the IaC manifest and the business code separated from each other, change and resource management becomes a big problem, as Vercel described in their recent blog post Framework-defined infrastructure. How we can further evolve toward Domain Code as Infrastructure will be a challenge for the future.
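The gap can be made concrete: the five resources above form a small dependency graph that lives entirely outside the function's own code. A hypothetical sketch of that manifest as plain data (all resource names are made up for illustration), with a topological sort to derive a safe deploy order:

```python
# The infra manifest for the one SSR function above, kept separate
# from the business code -- which is exactly the gap being described.
infra = {
    "s3:assets":        {"depends_on": []},
    "lambda:ssr-render": {"depends_on": ["s3:assets"]},
    "apigw:routes":     {"depends_on": ["lambda:ssr-render"]},
    "cdn:distribution": {"depends_on": ["apigw:routes"]},  # optional
    "dns:record":       {"depends_on": ["cdn:distribution"]},
}

def deploy_order(graph: dict) -> list:
    """Topologically sort the manifest: dependencies deploy first."""
    order, seen = [], set()
    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for dep in graph[node]["depends_on"]:
            visit(dep)
        order.append(node)
    for node in graph:
        visit(node)
    return order
```

Framework-defined infrastructure argues that a manifest like this should be derived from the application code itself, rather than maintained by hand alongside it.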

Ref

  • https://www.manjusaka.blog/posts/2023/03/12/a-simple-introduction-about-iac/