"Sisu"
Strength, perseverance and resilience to commit oneself fully to a task and to bring that task to an end.
From the very start, Pivotal's engagement with federal agencies in the US was a conscious effort to collaborate around common, shared values. We did this, however, with the opinionated intent not to sacrifice the outcomes Pivotal is known for, like the speed at which we enable customers to deliver innovation.
[Image: Efficient vs Effective Matrix, by Andrew Cooke]
This careful balance required not just that federal customers adopt lean methods, but that Pivotal clearly understand the values and measures of success for a federal agency and its clients—or what "speed to market" really means when it's not just products on the line, but the unpredictable, capricious dynamics of a threat or mission.
Such is the case with US national security, where how fast an idea gets to production is more specifically defined as how fast one delivers a compliant solution or app to a critical mission need, for some of the most sophisticated and coordinated operations on the planet. In these conditions, velocity and low margins of error determine not just the difference between success and failure, but the line between life and death.
Stakes like that require accountability in every piece of the process, which is why compliance is essential throughout delivery. In the world of national security, the best Key Performance Indicator (KPI) for measuring software delivery is "mean time to fielding".
Fielding is the multi-dimensional web of procedures, timelines, and authorizations needed to field a system or system increment. It requires close and frequent coordination among the acquisition, sustainment, and operational communities in order to field materiel that meets users' needs.
Continuously monitoring this fielding KPI is how we measure the ability to deliver at a velocity that lets our customer maintain the advantage over an adversary—or what we call effective operational agility. The lower the mean time to fielding (delivering) a solution, the greater the advantage (or effective operational agility), and the higher the likelihood of mission success.
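As a minimal illustration of that KPI (the dates below are hypothetical), mean time to fielding can be computed as the average number of days from an identified mission need to the date the capability is in the operator's hands:

```python
from datetime import date
from statistics import mean

# Hypothetical (need identified, capability fielded) date pairs.
fieldings = [
    (date(2017, 1, 10), date(2017, 7, 3)),
    (date(2017, 3, 22), date(2017, 9, 14)),
    (date(2017, 6, 5),  date(2017, 11, 1)),
]

# Mean time to fielding: average days from need to fielded capability.
mean_time_to_fielding = mean((fielded - need).days for need, fielded in fieldings)
print(f"Mean time to fielding: {mean_time_to_fielding:.0f} days")
```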
Shortly after the engagement commenced, the discovery process yielded a big insight: there was a significant lag in the time it took to get an Initial Operating Capability (IOC) to an operator, and then to mature that IOC into a Full Operating Capability (FOC—or in commercial terms, getting into full production, deployed in the ways necessary for operational success). These breaks in fielding momentum had in the past led several promising technologies to stall before they could mature into solid operational capabilities. That maturation process became our focus as we overlaid the way Pivotal delivers software for enterprise customers onto the acquisition wickets needed to deliver to the operational environment.
It would take extensive collaboration with national security teams to overcome hardened assumptions about the relatively low value assigned to fielding times, and to elevate their status in the delivery cycle. But in the end, we helped them develop and implement a breakthrough solution, or what came to be known as the continuous fielding model.
Continuous Fielding is a laser focus on maturing a minimally useful, but deployable, software product into a full operational capability. It's a comprehensive approach to rapid information assurance and accreditation through inherited controls, leveraging a unified platform approach, and enabling operator teams to employ and maintain it. A key aspect was operator-centric design, which helped shorten the path to consensus on a completed FOC.
The outcomes have brought the US national security apparatus much closer to its vision: collectively overcoming the delivery challenges of fielding to the last mile, out to the edge, on any mission, for anyone in the field, at any time—continuously, and at velocity.
The Long and Winding Road to the Field/User
An early hurdle was coming to terms with processes far removed from test-driven thinking. Before it reached the field, every application passed through a host of organizations, each owning a piece of the delivery. Each piece had its own leads, reviews, and timelines. With no overarching cross-functional lead (think product manager), team, or platform, each silo was accountable for its own checklist, not the overall success of the product. Like many waterfall models, progress across silos lacked transparency and governance.
This resulted in complex, high-risk deployments in each vertical. Resolving issues required specialized skill sets and a high-touch approach. Delivery times slipped, and the acquisition community could not respond to changing threats without accepting significant security risks and potential system impacts.
Moreover, budgets varied between delivery teams, while developing code in pieces led to disparate codebases and misaligned development teams.
All of these tollgates pulled like gravity on velocity. Delivery or "fielding time" suffered inevitable delays, and it was cost-prohibitive to field even a single capability without increasing the likelihood of failure in production.
Product requirements did not sufficiently accommodate the non-functional testing that needed to happen early and often. It's hard to overstate the increased risk of failure that resulted from this inadequate attention to non-functional concerns. Non-functional requirements are often the first areas cut when budgets are reduced, and each cut translated into even less interest in addressing them, creating a cycle of diminishing returns.
We had two options here.
The first was to continue responding to failure reactively. The second was to offload these important process concerns to a platform technology that makes the non-functional testing aspects of development a first-class citizen.
We needed to combine technical development and enablement with a kind of 'cultural enablement' to move from reactive to proactive operations, the very hallmark of a successful continuous fielding model.
How Operators Deploy and Run Software
Having formulated a path around a big organizational hurdle to the test-driven mindset, we unpacked the engineering culture inside the Federal government. With organizational transformation in mind, we soon ran into fresh challenges.
The first was the federal acquisition process, which incentivizes schedules over product performance. This was underscored by testing processes that often met only minimum requirements, with little incentive to deliver exceptional products.
We needed to convince engineering that "functionally acceptable" was not a high enough bar. Using the warfighters, analysts, and operators as the benchmark, we reminded engineering teams that they already expected excellence from themselves, within and between their teams.
We empathized with how difficult it was for delivery teams, under conventional processes, to deploy safe software for mission-critical national security concerns. Separate teams and splintered delivery accountability meant less shared understanding of the whole. Low deployment frequency meant higher rates of error in production.
That complexity expands exponentially when applied across the broad spectrum of national security concerns. To simplify that complexity, we put a lens on the following:
- Creating a unified and common platform strategy and governance control point for delivery teams to rally around, and to build a consistent, repeatable knowledge base. To address this first factor, we needed repeatable deliveries with validated, and therefore defensible, results. This included version control for all components in the platform inventory. Automation here would significantly shorten the traditional System Operational Verification Test (SOVT), and eliminate the far more time-consuming, manual System of System Enterprise (SOSE) tests.
- Establishing and strengthening communication flows that ran perpendicular to the parallel delivery flow initiated by developers. In simpler terms, what the end user knows and needs is as important to the development of a product as the developer writing its code. We reorganized around operational, mission-driven value streams, and helped recruit end users—in this case, direct warfighter involvement—to drive product development. This gave delivery teams a far better understanding of operational relevance; they could see the reasoning behind decisions while maintaining a high level of application knowledge.
These changes resulted in higher and wider levels of accountability and ownership throughout the lifecycle of an application, for both the delivery team and the end users receiving the application. We now had a solution for this perpendicular communication problem that mirrored the one for our parallel delivery flow.
The Platform's Answer to Security and Compliance in Continuous Fielding
One of the pillars of national security work is security and accreditation. It's also a natural challenge to continuous fielding. From an operations perspective, managing security breaks down into a number of distinct, smaller challenges that play to a signature platform strength: the ability to automate critical operations uniformly and securely, at scale.
Patches
Maintaining patches and upgrading software without scheduled downtime, and without end users ever noticing, is fundamental to supporting mission-critical workloads.
Traditionally, operators scheduled and coordinated patch changes, and developers tested them. This involved lots of back-and-forth between groups on the optimal timing for patches.
Offloading and automating all of that to the platform frees developers from implementation duties. They can now write more code and release it faster. It similarly allows operators to stand up and test patches across the application portfolio with one click, and report findings before operations ever see a line of code.
Once a team agrees on the confidence of a CVE patch, operators can move it across all operations in a canary-style deployment, maintaining 24/7 uptime in support of mission-critical workloads. The platform further enforces a unified governance model through API-level promises to each application within the foundation.
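The sketch below illustrates that canary-style rollout in simplified form; it is not the platform's actual tooling, and the `apply_patch` and `healthy` helpers, foundation names, and thresholds are hypothetical stand-ins for platform automation and monitoring.

```python
import time

# Hypothetical stand-ins for platform automation; a real rollout would call
# the platform's own patching and monitoring APIs.
def apply_patch(foundation, patch_id):
    print(f"Applying {patch_id} to {foundation}")

def healthy(foundation):
    return True  # e.g., query error rates and app health after the patch lands

def canary_rollout(foundations, patch_id, canary_count=1, soak_seconds=300):
    """Apply a CVE patch to a canary slice first, then the remaining foundations."""
    canaries, remainder = foundations[:canary_count], foundations[canary_count:]
    for f in canaries:
        apply_patch(f, patch_id)
    time.sleep(soak_seconds)              # let the canaries soak under real traffic
    if not all(healthy(f) for f in canaries):
        raise RuntimeError("Canary health check failed; halting rollout")
    for f in remainder:                   # confidence established: roll out everywhere
        apply_patch(f, patch_id)

canary_rollout(["foundation-east", "foundation-west", "foundation-edge"],
               "cve-2018-1234-patch", soak_seconds=5)
```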
Authentication and Accreditation
Authentication is offloaded to an Enterprise Identity Provider, while the platform is authenticated using Single Sign-On. Applications can secure protected resources at whatever granularity makes sense within a subject-matter domain. This allows developers to test both inside the platform and in isolation prior to pushing. It also offloads the heavyweight portions of security to a solution that has already been approved by security teams.
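The sketch below shows what resource-level protection can look like inside an application, assuming the identity provider issues RS256-signed JWTs and a scope guards each protected resource; the audience and scope names are illustrative, not the program's actual configuration.

```python
import jwt  # PyJWT; assumes the identity provider issues RS256-signed JWTs

def authorize(token, idp_public_key, required_scope):
    """Validate an SSO-issued token and check the scope guarding a protected resource."""
    claims = jwt.decode(
        token,
        idp_public_key,               # public key published by the identity provider
        algorithms=["RS256"],
        audience="mission-app",       # hypothetical audience registered for this app
    )
    # Scope names (e.g. "imagery.read") are illustrative; real names are domain-specific.
    if required_scope not in claims.get("scope", "").split():
        raise PermissionError(f"missing scope: {required_scope}")
    return claims

# Usage inside a request handler, at whatever granularity the domain needs:
# claims = authorize(bearer_token, IDP_PUBLIC_KEY_PEM, required_scope="imagery.read")
```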
Until very recently, security and information assurance (IA) teams conducted extensive reviews to understand how each delivery affected the overall security posture of the enterprise. This slowed down fielding and often required an entire infrastructure support team just to deliver the application code.
Understanding and solving for runtime security implementation cleared several hurdles to continuous fielding. However, we still had to address updates to the platform components themselves, and the ever-expanding roster of new application features.
For the application features, we understood that by reframing delivery around the application bits as the key currency throughout, we could transform the constraint of Information Assurance.
To do this, we needed a defensible way to prove that everything below the application was validated and had not changed with each release. We began by incorporating validation of the distributed components that form Cloud Foundry, via hash check—verifiable by the security team—into our continuous delivery pipelines. That gave the team full transparency into what was running in operations. A secondary benefit was environment parity, minimizing configuration drift across platforms. That constant transparency, combined with the necessary security controls verified all the way down to the container, meant that only the application bits needed to be continuously accredited.
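A simplified sketch of that hash check as a pipeline gate, assuming the security team publishes an approved manifest of component names and SHA-256 digests (the manifest format and file names are illustrative):

```python
import hashlib
import json

def sha256_of(path):
    """Compute the SHA-256 digest of a release artifact."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_platform_components(manifest_path):
    """Fail the pipeline if any platform component differs from the approved manifest."""
    with open(manifest_path) as f:
        approved = json.load(f)      # e.g. {"cf-release-1.12.tgz": "ab34...", ...}
    for component, expected in approved.items():
        actual = sha256_of(component)
        if actual != expected:
            raise SystemExit(f"{component}: hash mismatch ({actual} != {expected})")
    print("All platform components match the security-approved manifest")

# Run as a gating step in the continuous delivery pipeline:
# verify_platform_components("approved-components.json")
```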
Additionally, distributed platform components were designed around separate concerns, coordinated through clearly defined but loosely coupled APIs. Our underlying release engineering and deployment toolchain was defined via statically typed, version-controlled artifacts. This allowed us to verify and validate the runtime environment against configuration management in a repeatable, consistent fashion.
Organizing for Platform Security
Taken together, these changes fundamentally altered how accreditation could be done. To take advantage of that, several vectors in application delivery also had to be addressed.
We first needed a way to verify that only approved developers contributed to the official source code. A secure network transport between the source code repository and the build servers covered part of that. Build credentials that accessed environment resources were restricted to test access. Further, before any build executed, source code commit signatures were validated.
Next, every build underwent a minimum set of security scans before any tests, with failure findings relayed to developers. A security chain of trust was initiated at the source code and carried through build and release by auditing and signing the resulting binary artifact.
From here, an automated, independent, and reproducible build would be initiated, with the resulting hash verified against the signed binary. A successful verification resulted in a push, over a secure transport, to a release repository, where metadata and the appropriate egress rules could be defined and then applied later during the push into operations.
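A condensed sketch of that chain of trust expressed as pipeline steps; the scanner, build script, and signing commands are illustrative stand-ins, not the program's actual toolchain:

```python
import hashlib
import subprocess

def run(*cmd):
    subprocess.run(cmd, check=True)

def sha256_of(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# 1. Only approved, signed commits may enter a build.
run("git", "verify-commit", "HEAD")

# 2. Minimum security scans run before any tests ("security-scanner" is a placeholder).
run("security-scanner", "--source", ".", "--fail-on", "high")

# 3. Build, then audit and sign the resulting binary artifact.
run("./build.sh", "--output", "app.jar")
run("gpg", "--detach-sign", "--armor", "app.jar")

# 4. Independent, reproducible rebuild; its hash must match the signed binary.
run("./build.sh", "--output", "rebuild/app.jar")
if sha256_of("app.jar") != sha256_of("rebuild/app.jar"):
    raise SystemExit("Reproducible build hash mismatch; refusing to release")

# 5. Push the verified, signed artifact to the release repository over a secure
#    transport, along with its metadata and egress rules (not shown here).
```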
Finally, the operations pipeline would pick up the binary, egress rules, and metadata. The pipeline would provision an ephemeral key promise, push the binary to a space within Cloud Foundry, and apply the appropriate egress rules. The instance would ask for the authorization token and keys.
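A simplified sketch of that operations stage, assuming the cf CLI; the API endpoint, org, space, security-group, and credential names are hypothetical, and exact flags vary by cf CLI version:

```python
import subprocess

def cf(*args):
    subprocess.run(("cf",) + args, check=True)

# Authenticate the pipeline with short-lived credentials issued for this run
# (the credential service that issues them is environment-specific and not shown).
cf("api", "https://api.system.example.mil")
cf("auth", "pipeline-client", "EPHEMERAL_SECRET", "--client-credentials")
cf("target", "-o", "mission-org", "-s", "mission-space")

# Apply the egress rules carried as release metadata (an application security group),
# then push the verified binary pulled from the release repository.
cf("bind-security-group", "mission-app-egress", "mission-org", "--space", "mission-space")
cf("push", "mission-app", "-p", "release/app.jar")
```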
This guaranteed that only authorized instances made it to operations, and eliminated runtime configuration drift. Reworking this intricate security system produced the most significant velocity gains and reductions in fielding time, thanks to the smaller scope of application accreditation versus stack/type accreditation, the automated security-scanning pipeline, runtime authorization, and verification of the binary artifact in an independent build process.
Warfighter Outcomes
Once we got to actual operations, the big question became, "How do we improve the runtime experience for the warfighter?" Several platform features aided us. For one, all logging and metrics on Cloud Foundry are treated equally. That gave operators enterprise-wide monitoring capabilities for a host of different aspects of the environment. All metrics and logs were event streams that could be funneled to a host of 3rd-party tools, allowing operators to monitor aggregates of app/component health across the platform. They could also monitor for security or container events, and resource access. Since all distributed components and application containers adhered to the API contracts, enterprise alerts were now activated with minimal effort. It further ensured that future workloads would not require any understanding of how event streams functioned in order to be added to the watch list.
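As a rough illustration of that pattern (the event shapes and thresholds below are invented, not the platform's actual stream format), an operator-side consumer can watch the aggregated stream and raise enterprise alerts without any app-specific logic:

```python
from collections import Counter

# Hypothetical, simplified event records; on the platform these arrive as a
# continuous stream of log and metric envelopes tagged with app/component identity.
events = [
    {"app": "imagery-svc", "type": "crash"},
    {"app": "imagery-svc", "type": "http", "status": 500},
    {"app": "c2-gateway",  "type": "container", "action": "unauthorized-access"},
]

CRASH_ALERT_THRESHOLD = 1  # illustrative threshold

def scan(stream):
    """Aggregate health and security events across all apps and emit alerts."""
    crashes = Counter()
    for event in stream:
        if event["type"] == "crash":
            crashes[event["app"]] += 1
            if crashes[event["app"]] >= CRASH_ALERT_THRESHOLD:
                print(f"ALERT: {event['app']} crash count {crashes[event['app']]}")
        elif event["type"] == "container" and event.get("action") == "unauthorized-access":
            print(f"SECURITY ALERT: {event['app']} container access violation")

scan(events)
```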
These features, combined with built-in resiliency and fault tolerance, created comprehensive self-healing capabilities. Out-of-the-box recovery capabilities included restarting failed system processes; recreating missing or unresponsive VMs; redeploying application instances that crashed or became unresponsive; striping applications across availability zones; and dynamic routing and load balancing. In short, all of this removed most of the recovery burden from the operator.
How Developers Now Design Mission-Critical Apps
Developers leveraged the highly distributed platform and their proximity to provisioned resources by designing mission-critical resiliency into application architectures. This improved transfer speeds for imagery and command-and-control applications. By being able to field to the edge, they significantly reduced end-user latency and transfer times.
The distributed platform has several built-in components designed to work across geographic boundaries. Applications deployed to the platform can now evolve to take advantage of platform-level high availability and resiliency. This, combined with WAN-level caching technology, supports end users who operate both connected and disconnected.
Additional platform capabilities, including smart routing, service discovery, circuit breakers, data integration, and real-time processing at scale, support multiple approaches to mission functions. These include an app's data synchronization strategy, its availability and consistency tolerances, and more. All are key elements in designing how applications behave during reductions in capability while operating in Anti-Access/Area Denial-like conditions.
Partial data center outages can be mitigated with the service discovery and circuit breaker technologies native to the platform. Developers calculate the latency penalties of switching between instances, and build that intelligence into the applications, combined with Layer 2/Layer 3 quality-of-service optimization, to prioritize data in degraded environments. Finally, we're beginning to explore with our clients the expected behaviors in failover scenarios, such as a full data center outage. Running application resiliency exercises on a monthly, weekly, or even daily basis is especially important once an application is fielded. When an application switches over to an available data center, all of the elements within the application, such as routing, storage, and the data itself, must account for the change.
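The sketch below illustrates that latency-aware failover logic in simplified form; the instance names, latencies, failure threshold, and failure rate are invented, and the platform's actual service discovery and circuit-breaker components are not shown.

```python
import random

# Hypothetical service instances discovered via the platform, with the latency
# penalty (in ms) a client pays for reaching each one from its current location.
INSTANCES = [
    {"name": "app.dc-east", "latency_ms": 12},
    {"name": "app.dc-west", "latency_ms": 85},   # cross-data-center penalty
]

FAILURE_THRESHOLD = 3
failures = {i["name"]: 0 for i in INSTANCES}

def call(instance):
    """Placeholder for a real request; fails randomly to exercise the breaker."""
    if random.random() < 0.3:
        raise ConnectionError(instance["name"])
    return f"ok from {instance['name']}"

def request():
    """Prefer the lowest-latency instance whose circuit is still closed."""
    candidates = sorted(
        (i for i in INSTANCES if failures[i["name"]] < FAILURE_THRESHOLD),
        key=lambda i: i["latency_ms"],
    )
    for instance in candidates:
        try:
            result = call(instance)
            failures[instance["name"]] = 0        # success closes the circuit again
            return result
        except ConnectionError:
            failures[instance["name"]] += 1       # repeated failures open the circuit
    raise RuntimeError("All instances unavailable (degraded/A2AD-like conditions)")

for _ in range(5):
    print(request())
```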
If you found this post illuminating, watch this story about our work with the US Air Force.