By Jordan Bar and Martin Iliev
Infrastructure as Code
Previously, data center resource management relied on manual interactions through user interfaces and command-line interfaces. However, the introduction of the Infrastructure as Code (IaC) approach has completely transformed this process. It now enables the provisioning and management of compute, network, and storage resources programmatically, leveraging various frameworks and automation tools.
Although some teams might opt for manual resource management during the proof-of-concept phase or when dealing with smaller teams or a limited number of services, this approach can lead to several issues. These problems include a lack of historical information about configuration changes and insufficient code documentation.
In contrast, the adoption of IaC brings several advantages. It supports both machine-oriented and human-readable workflows, facilitating analysis and enabling a more rigorous approach to managing numerous resources. Additionally, it allows for easier review and version control of configuration settings and facilitates a controlled, phased approach to implementing changes across different environments.
Many IaC frameworks are available, including Puppet, Chef, Ansible, Terraform, Pulumi, and VMware’s iDEM, among others. Each framework has its own pros, cons, user community, and use cases, so development teams can choose the toolset that best fits their environments and services.
In this article, we’ll discuss VMware’s approach to supporting any IaC framework chosen by a service team while still managing resources at enterprise scale.
Managing resources at scale
What does it mean to manage resources?
Efficient resource life cycle management is crucial in data center environments. While manual management is suitable for the proof-of-concept stage or when dealing with a small number of services, it can lead to problems in the long run, particularly within enterprise settings. IaC and automation may seem overwhelming initially, but they contribute to more efficient resource management over time. By preserving code history and documentation, new team members can better understand the context, and the risk of configuration change history becoming unavailable is mitigated.
The problem becomes far more complicated when scale is a factor, that is, when we have to deal with a large number of accounts across different tenants.
Managing a service at scale entails complex tasks involving hundreds or thousands of accounts, services, servers, and multiple environments like development, staging, and production, each with distinct service level objectives (SLOs). Additionally, there might be tenancy-based requirements that necessitate security and compliance teams to enforce specific controls across the entire cloud resource fleet. Each tenant may have multiple services with different needs. Further, there may be requirements to roll back or roll forward specific change sets to different services, servers, accounts, or tenants. To tackle this complexity, teams can leverage IaC as a valuable framework.
One of the most crucial advantages of IaC is the ability to review and test changes before deployment. At scale, manually managing and updating configurations is not feasible, as even minor errors can have catastrophic consequences for the entire fleet. Therefore, the capability to review and test changes before deployment plays a critical role in ensuring infrastructure reliability.
Another benefit of IaC is the ability to define the desired state of the infrastructure, which the framework uses to address configuration drift. Instead of relying on manual updates and hoping for the best, teams can define the desired state and use IaC to maintain the desired configuration across the entire fleet of cloud resources.
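To make the desired-state idea concrete, here is a minimal sketch of drift detection. It is not how any particular IaC framework implements it; the function name `compute_drift` and the setting names are hypothetical, and a missing setting is treated the same as `None` for simplicity.

```python
from typing import Any


def compute_drift(
    desired: dict[str, Any], actual: dict[str, Any]
) -> dict[str, tuple[Any, Any]]:
    """Return {setting: (desired_value, actual_value)} for every mismatch.

    Assumption: a setting absent from one side is reported as None.
    """
    drift: dict[str, tuple[Any, Any]] = {}
    for key in desired.keys() | actual.keys():
        want, have = desired.get(key), actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical baseline for one account, and what was actually observed.
desired_state = {"logging_enabled": True, "public_access": False}
actual_state = {"logging_enabled": False, "public_access": False}

for setting, (want, have) in compute_drift(desired_state, actual_state).items():
    print(f"{setting}: expected {want!r}, found {have!r}")
```

A real framework would go one step further and reconcile each reported difference back to the desired value, rather than only listing it.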
Lastly, IaC allows teams to track the artifacts used in source control systems. These artifacts are versioned and consumed by automation, enabling teams to release new infrastructure versions in a manner similar to managing software releases. This ensures consistency and reliability throughout the entire infrastructure.
Operating a service at scale is a complex endeavor that necessitates careful management of multiple accounts, services, servers, various environments, and tenancy-based requirements. IaC serves as a valuable construct that helps teams navigate this complexity by enabling them to review and test changes before deployment, define the desired infrastructure state, and track artifacts used in source control systems. These benefits contribute to the reliability and consistency of the infrastructure across the entire fleet of cloud resources.
In the case of VMware, operating at scale involves rolling out configurations to thousands of public cloud accounts across clouds, hardening its infrastructure, and ensuring compliance with security policies. To achieve this, a security operations center monitors account activity 24×7, tracking and investigating any unusual activity. Additionally, all accounts across clouds have a baseline configuration applied to maintain a consistent security posture. This is enabled by a service called CloudGate that is used within VMware.
CloudGate specializes in managing public cloud accounts at large scale across clouds and tenants by providing access control, account lifecycle management, and basic governance. The service offers several policy packages, namely security basic, security advanced, and IAM management, which are applied to individual accounts. CloudGate also allows account owners to bring their own policy bundles to be applied on top. This post goes deeper into the policy side of CloudGate.
The security basic and security advanced policy packages in CloudGate are designed to enhance the security of public cloud accounts. They enable continuous monitoring of the account to identify and mitigate potential security threats by enabling specific logging and aggregating them to be consumed by a central team. Additionally, these policies provide mechanisms to restrict or grant access to specific public cloud resources based on defined rules and requirements. Furthermore, they facilitate the configuration of additional logging settings to enhance visibility and auditability within the account.
The IAM management policy bundle focuses specifically on managing identity and access within the public cloud environment. It ensures that access to public cloud resources is granted according to the principle of least privilege, meaning that users are given only the permissions they need, based on the requirements of the team or entity that owns the account. By adhering to this principle, the IAM management policy bundle helps to minimize the potential attack surface and reduce the risk of unauthorized access or misuse of resources.
Service teams that use these accounts must provision and manage their discrete environments and services, which may require them to use their own IaC artifacts on top of the basic configuration applied. However, not all teams use the same IaC technology, which can result in inconsistencies across the infrastructure. Despite this challenge, CloudGate ensures that all teams are able to provision and manage their environments securely and efficiently.
The CloudGate service enables account owners to incorporate their own policy packages, which are then applied, in addition to the existing packages, to a specific account. The package lifecycle can be managed through CloudGate APIs.
Requirements for deploying custom policy packages for a cloud account
We collected feedback from many teams on their preferences and policy needs and identified some key characteristics for the desired solution:
- Support arbitrary IaC binary and dependencies.
- It is also important to manage the supply chain security of software updates/releases to securely load the artifacts from a trusted repository.
- Support common operations and evaluate how those translate to each binary invocation.
- Different IaC technologies implement the same functionality but may execute it differently.
- Support running against arbitrary IaC code artifacts; enable service teams to reuse and consume a managed framework for change management while being able to continue to focus on their specific product/core.
- Maintain the basic security and compliance posture of the public cloud resources.
- Prevent and block unauthorized access to cloud resources.
Custom packages with different IaC frameworks
Teams must be able to apply their own policies to the accounts in which their services are run. This may include filtering specific events (for example, monitoring an EC2 host being decommissioned in an account in AWS) and taking remedial action or capturing billing info to be consumed or operated in different ways.
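As an illustration of the EC2 example, the sketch below filters for an instance termination event and calls a remediation hook. It assumes the event arrives in the JSON shape that Amazon EventBridge uses for EC2 state-change notifications; the function names and the `remediate` callback are hypothetical and are not part of CloudGate.

```python
from typing import Callable


def is_ec2_termination(event: dict) -> bool:
    """True when the event reports an EC2 instance reaching 'terminated'."""
    return (
        event.get("source") == "aws.ec2"
        and event.get("detail-type") == "EC2 Instance State-change Notification"
        and event.get("detail", {}).get("state") == "terminated"
    )


def handle_event(event: dict, remediate: Callable[[str], None]) -> bool:
    """Invoke the (hypothetical) remediation hook for matching events."""
    if is_ec2_termination(event):
        remediate(event["detail"]["instance-id"])
        return True
    return False


# Example event with placeholder identifiers.
sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "terminated"},
}

handle_event(sample_event, lambda instance_id: print(f"follow up on {instance_id}"))
```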
We provide an interface that translates to the specific implementation based on the IaC tool or framework. The service also performs image signing, scanning, and provenance verification, and ensures these operations have been properly completed with signed software bills of materials (SBOMs). The service can deploy the packages to designated accounts for which the account owner has permission.
This functionality is provided using APIs, and under the hood, an adapter pattern is used to support different IaC technologies/stacks. The API implementation is done in each module, allowing implementation flexibility in the underlying IaC tool/binary.
In CloudGate, the implementation works as follows. A factory pattern is used to retrieve a typed adapter instance that can process the provided payload using the standard APIs. The concrete adapter instance contains the implementation details for the predefined methods of the interface.
For example, a Pulumi IaC artifact will get processed by an instance of the Pulumi adapter. A Terraform IaC artifact will get processed by an instance of the Terraform Adapter. This is illustrated in the figure below.
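The adapter and factory arrangement described above can be sketched roughly as follows. This is an assumption-laden illustration, not CloudGate's actual code: the class names, method names, and registry are hypothetical, and the adapters only build command lines (using standard Terraform and Pulumi CLI subcommands) rather than executing them.

```python
from abc import ABC, abstractmethod


class IaCAdapter(ABC):
    """Common operations every adapter must translate to its own tool."""

    @abstractmethod
    def preview_command(self, workdir: str) -> list[str]: ...

    @abstractmethod
    def apply_command(self, workdir: str) -> list[str]: ...


class TerraformAdapter(IaCAdapter):
    def preview_command(self, workdir: str) -> list[str]:
        return ["terraform", f"-chdir={workdir}", "plan"]

    def apply_command(self, workdir: str) -> list[str]:
        return ["terraform", f"-chdir={workdir}", "apply", "-auto-approve"]


class PulumiAdapter(IaCAdapter):
    def preview_command(self, workdir: str) -> list[str]:
        return ["pulumi", "preview", "--cwd", workdir]

    def apply_command(self, workdir: str) -> list[str]:
        return ["pulumi", "up", "--yes", "--cwd", workdir]


# Hypothetical registry mapping a framework name to its adapter class.
_ADAPTERS: dict[str, type[IaCAdapter]] = {
    "terraform": TerraformAdapter,
    "pulumi": PulumiAdapter,
}


def get_adapter(framework: str) -> IaCAdapter:
    """Factory: return the adapter instance for the named framework."""
    try:
        return _ADAPTERS[framework.lower()]()
    except KeyError:
        raise ValueError(f"unsupported IaC framework: {framework!r}") from None


# The caller uses the same interface regardless of the underlying tool.
adapter = get_adapter("pulumi")
print(adapter.preview_command("/srv/packages/team-a"))
```

Because callers depend only on `IaCAdapter`, supporting another framework in this sketch means adding one class and one registry entry, without touching the calling code.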
As with any software system, this managed service consumes input and generates output. These are structured using an event-driven architecture, a cloud-native design pattern for massively distributed SaaS services. The primary standard is CloudEvents, a cross-cloud specification for event data that some cloud vendors support out of the box; for vendors that do not, wrapper code can be used to encode and decode messages.
Using a unified yet flexible data structure enables SaaS services to integrate with a growing number of other systems, route events to the correct processing pipelines, and produce structured, machine-friendly, and human-readable output events, while all components consume and produce interoperable payloads.
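The kind of wrapper code mentioned above might look like the following sketch, which encodes and decodes a CloudEvents 1.0 envelope as structured JSON using only the standard library. The function names, the event type, and the source URI are hypothetical; a production system would more likely rely on an official CloudEvents SDK.

```python
import json
import uuid
from datetime import datetime, timezone

# Attributes the CloudEvents 1.0 specification marks as required.
REQUIRED_ATTRIBUTES = ("specversion", "id", "source", "type")


def encode_event(event_type: str, source: str, data: dict) -> str:
    """Wrap a payload in a CloudEvents-style JSON envelope."""
    envelope = {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": source,
        "type": event_type,
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        "data": data,
    }
    return json.dumps(envelope)


def decode_event(raw: str) -> dict:
    """Parse an envelope and check that the required attributes are present."""
    envelope = json.loads(raw)
    missing = [name for name in REQUIRED_ATTRIBUTES if name not in envelope]
    if missing:
        raise ValueError(f"missing CloudEvents attributes: {missing}")
    return envelope


# Hypothetical event type and source, chosen only for illustration.
raw = encode_event(
    "com.example.policy.applied",
    "/accounts/example-account",
    {"package": "security-basic", "status": "succeeded"},
)
event = decode_event(raw)
print(event["type"], event["data"]["status"])
```

A consumer can then route on the `type` attribute alone, without needing to understand the payload of every producer.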
Users of such frameworks may choose to consume the raw data events right in the Producer stream or leverage the available APIs to process aggregated results and parsed data.
Putting it all together
In this article, we discussed how massively distributed SaaS services can give teams the freedom to use the IaC of their choice and support discrete operations in the public cloud accounts across the fleet. Interoperability is achieved using industry standards like common data structures (CloudEvents) and a proven event-driven distributed architecture.
There are foundational pieces that are provided as part of a service offering. The service allows teams to bring their own policy bundles that can be deployed in addition to the basic packages provided by the service. The policy bundles can be applied at scale to the entire fleet of accounts owned by a team.
This is implemented internally within VMware in the CloudGate Service. Allowing the flexibility of bringing one’s own IaC for specific actions while managing the entire fleet across clouds using a common service with basic governance brings significant benefits in a large enterprise consisting of different services and teams.