Security

How VMware IT Scales Security Workflows with Automation 

by: VMware Information Security Strategy Lead Priam Fernandes and VMware Information Security Strategy Program Manager Bhupendra Singh  

The ever-increasing volume and sophistication of cyber threats continue to endanger the enterprise, and security operations centers (SOCs) stand on the front line. However, keeping up with the high alert volumes, parsing through logs for relevant information, correlating multiple data sources, and filtering through false positives have put an undue burden on human analysts. Individual and team stakeholder collaboration is vital, as is the need for solid (and scalable) automation strategies for every phase of security workflows.

SIEM and SOAR

Security information and event management (SIEM) and security orchestration, automation and response (SOAR) platforms are two critical components of the SOC.  

  • The SIEM solution aggregates and correlates alerts from various sources such as logs, network devices and endpoint protection. 
  • SOAR is designed to create and execute automated playbooks that can perform a range of tasks from enrichment to remediation actions. SOAR can complement the SIEM by providing a centralized incident lifecycle management dashboard and acting as an orchestration engine that integrates a wide range of security tools to enrich the alerts with relevant threat intelligence context.  
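
As a rough illustration of how the two roles divide, the sketch below groups alerts from several sources (the SIEM-style correlation step) and then attaches threat intelligence context to each group (the SOAR-style enrichment step). The alert fields, the sample data, and the in-memory lookup table are all assumptions made for this example; they do not represent any particular product's schema or API.

```python
from collections import defaultdict

# Hypothetical alert records; field names are illustrative only.
ALERTS = [
    {"source": "firewall", "host": "web-01", "indicator": "203.0.113.7"},
    {"source": "endpoint", "host": "web-01", "indicator": "203.0.113.7"},
    {"source": "auth-log", "host": "db-02", "indicator": "198.51.100.4"},
]

# Hypothetical lookup table standing in for a real threat intelligence feed.
THREAT_INTEL = {"203.0.113.7": "known command-and-control address"}


def correlate(alerts):
    """SIEM-style step: group alerts that share a host and an indicator."""
    grouped = defaultdict(list)
    for alert in alerts:
        grouped[(alert["host"], alert["indicator"])].append(alert["source"])
    return grouped


def enrich(grouped, intel):
    """SOAR-style step: attach threat intelligence context to each group."""
    incidents = []
    for (host, indicator), sources in grouped.items():
        incidents.append({
            "host": host,
            "indicator": indicator,
            "sources": sorted(sources),
            "context": intel.get(indicator, "no intelligence match"),
        })
    return incidents


incidents = enrich(correlate(ALERTS), THREAT_INTEL)
for incident in incidents:
    print(incident)
```

In this sketch, three raw alerts collapse into two incidents, and only the incident with a matching indicator carries extra context for the analyst.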

More on SOAR

Security vendors present SOAR as a no-/low-code platform where automation playbooks are easily created without the need for customization. The reality is that triaging and responding to alerts is much more involved, making SOAR more of an integrated development environment (IDE) than a low-code platform.  

The following two sections are a primer for deciding how much should be automated and how much should be accomplished manually. 

Scope of automation

  • Before delving into automation, it is important to determine who owns the risk in the detection and response lifecycle. This determination gives teams their risk appetite for false positives and false negatives, as well as the confidence level required of the automation playbooks. At VMware, the SOC is accountable for responding to alerts and therefore plays a pivotal role in deciding which of the processes should be automated. The response to each alert is made up of phases, with earlier phases comprising repeatable/tedious tasks and later ones involving analysis, decisions and context-specific tasks. Not surprisingly, SOC analysts and security automation engineers automate early-phase tasks while requiring human intervention for more critical decision making.
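
One way to picture this split is a playbook that runs its early phases automatically and pauses before any phase flagged as needing an analyst. The following is a minimal sketch under assumptions: the phase names, the `run_playbook` function, and the `approve` callback are hypothetical and are not drawn from any SOAR product.

```python
def run_playbook(alert, phases, approve):
    """Run phases in order; stop before a manual phase the analyst has not approved."""
    log = []
    for name, action, needs_human in phases:
        if needs_human and not approve(name, alert):
            log.append(f"{name}: held for analyst")
            break
        action(alert)
        log.append(f"{name}: done")
    return log


# Hypothetical phase actions; each one just annotates the alert dictionary.
def enrich_alert(alert):
    alert["enriched"] = True


def triage(alert):
    alert["severity"] = "high" if alert.get("enriched") else "unknown"


def isolate_host(alert):
    alert["isolated"] = True


# (name, action, needs_human): early phases are automated, containment is gated.
PHASES = [
    ("enrichment", enrich_alert, False),
    ("triage", triage, False),
    ("containment", isolate_host, True),
]

alert = {"id": "A-1"}
# Stand-in for an analyst who has not yet approved anything.
log = run_playbook(alert, PHASES, approve=lambda name, a: False)
print(log)
```

Which phases carry the `needs_human` flag is exactly the decision that depends on who owns the risk and how much confidence the team has in each playbook.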

Prioritizing use cases

  • An early VMware IT team decision was to develop automation playbooks based on a single use case rather than automate the same phase for all use cases. That was predicated on the fact that a minority of use cases took up the majority of analyst time, requiring manual intervention. To make the work efficient and effective, the repeatable tasks were identified and automation playbooks were developed to avoid repetition. However, some playbooks included pauses for human intervention or analysis as needed. 
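
The prioritization idea can be sketched as a simple ranking: sort use cases by the analyst time they consume and take the smallest set that accounts for most of it. The hours below are invented for illustration, and the 80 percent threshold is an assumption for the example, not a figure from VMware IT.

```python
def top_use_cases(hours_by_use_case, share=0.8):
    """Return the fewest use cases, largest first, that cover `share` of total hours."""
    total = sum(hours_by_use_case.values())
    selected, covered = [], 0.0
    ranked = sorted(hours_by_use_case.items(), key=lambda kv: kv[1], reverse=True)
    for name, hours in ranked:
        if covered >= share * total:
            break
        selected.append(name)
        covered += hours
    return selected


# Hypothetical analyst hours per use case over some reporting period.
HOURS = {
    "phishing": 120,
    "malware": 60,
    "impossible travel": 10,
    "dlp": 6,
    "port scan": 4,
}

priority = top_use_cases(HOURS)
print(priority)
```

With these made-up numbers, two of the five use cases account for 180 of 200 hours, so they would be the first candidates for dedicated playbooks.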

Creating a new workflow 

In tandem with the use cases being defined, VMware IT developed a workflow that covered clearing the existing backlog, implementing new use cases, and resolving bugs. The workflow involved various connected phases, including: 

  1. Evaluation of automation requested by various teams: The traditional approach often resulted in gaps between implementation and deployment or usage of the implementation. The new workflow ensures a thorough collaborative evaluation before implementation begins and aligns the work quantitatively with business value. Work is considered complete only after deployment and use in the production environment. The process involves sub-tasks for tracking dependencies and reviews by other teams, with automated alerts/notifications sent to a common Slack channel.  
  2. Adding an automation request to the backlog: Thanks to automated workflow restrictions, a request is broken down into smaller requests that are categorized under appropriate technical classifications rather than treated as independent requests. Periodic collaboration meetings between stakeholder teams are scheduled to avoid a scenario where the implementation is rejected for not matching the requested requirements and cannot be deployed.  
  3. User acceptance testing (UAT): A requester team can undertake a UAT after development and code/peer reviews are complete—and before deployment into the staging/production environment. This allows the necessary time to indicate, mitigate and resolve any dependencies and/or blockers, or simply put the work item on hold if higher priority items require attention.  
  4. Agile-metrics-based dynamic dashboard: This tracks progress and identifies blockers, with live Gantt charts created in Jira for executives. The plan automatically assigns the sprint-end date as the target end date to indicate the type of progress (not started on time, in progress, completed).
  5. Capturing daily standup details from globally dispersed groups: We created a bot to capture/assess issues that affect the team (versus individual work details) via answers to questions like:  
    • What did you accomplish since your last update? How did you help the team meet any of the sprint goals? 
    • What will you do today for both your assignments and to meet the team’s sprint goals? 
    • Are there any blockers or dependencies you or your team are experiencing? If so, please elaborate. 
    • Is there any item that can’t be completed by sprint-end? If so, please elaborate. 
  6. Collaboration meetings (formal process for linking dependencies): Because time zone differences limit when teams can meet, a workflow was defined for formally logging dependency requests for testing, troubleshooting assistance, and other answers. This accelerated timelines while ensuring clarity of assignments and accountability. The status of the parent work item was updated accordingly and linked to the newly created items.  
  7. Production environment testing, completion and deployment became part of the definition of done: This increased accountability and efficiency by preventing good phase-by-phase metrics from masking an incomplete final objective, and it shifted the focus to end-to-end completion and the relevant metrics for objectives and key results (OKRs). 
  8. Bugs are estimated for troubleshooting and resolution if the high-level root cause is known: Previously, a ‘best-case effort’ approach only led to more uncertainty about completion of the planned sprint items.  
  9. Alerts for timely notifications: These are now immediately sent for reviews, backlog additions, ready for production, and release phases. Team members no longer need to constantly check pending actions in tools or dashboards.  
  10. Completion summary: After every sprint, a document is created using specific metrics and their color coding, action items, and next-up goals and their stakeholders.  
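
The progress labels in step 4 can be derived from a handful of dates. The sketch below is one possible reading, not the Jira configuration VMware IT uses: the `WorkItem` fields are hypothetical, and the extra "scheduled" label for items whose planned start has not yet arrived is an assumption added so that every item gets a status.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class WorkItem:
    key: str
    planned_start: date
    sprint_end: date
    started: Optional[date] = None
    finished: Optional[date] = None

    @property
    def target_end(self) -> date:
        # The sprint-end date doubles as the target end date, as in step 4.
        return self.sprint_end


def progress_status(item: WorkItem, today: date) -> str:
    """Classify an item using the labels named in step 4."""
    if item.finished is not None:
        return "completed"
    if item.started is not None:
        return "in progress"
    if today > item.planned_start:
        return "not started on time"
    return "scheduled"  # assumed label; not part of the original list


# Hypothetical sprint data for illustration.
sprint_end = date(2024, 3, 15)
items = [
    WorkItem("SEC-1", date(2024, 3, 4), sprint_end,
             started=date(2024, 3, 4), finished=date(2024, 3, 8)),
    WorkItem("SEC-2", date(2024, 3, 4), sprint_end, started=date(2024, 3, 5)),
    WorkItem("SEC-3", date(2024, 3, 4), sprint_end),
]

today = date(2024, 3, 11)
statuses = {item.key: progress_status(item, today) for item in items}
print(statuses)
```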

Conclusion

As we scale security operations, automating repeated tasks becomes mission critical for eliminating the expanding SOC burden. However, any such automation strategy must also consider people, processes, and technology to be successful. Individual and team stakeholder collaboration is critical, as is the need for solid (and scalable) automation strategies for every phase of security workflows.  

There’s a lot more to this topic than is presented here. That’s why we encourage you to contact your account team to schedule a briefing with us. No sales pitch, no marketing. Just straightforward peer conversations revolving around your company’s unique requirements. 

VMware on VMware blogs are written by IT subject matter experts sharing stories about our digital transformation using VMware products and services in a global production environment. To learn more about how VMware IT uses VMware products and technology to solve critical challenges, visit our microsite, read our blogs and IT Performance Annual Report and follow us on SoundCloud, Twitter and YouTube. All VMware trademarks and registered marks (including logos and icons) referenced in the document remain the property of VMware.