In my previous blog, I wrote about how uShip scaled their cloud service with Wavefront analytics-driven insights. In this blog, I’ll recap OKTA’s story from our VMworld breakout session and how Michael Huang, principal engineer at OKTA, and his team were able to rapidly scale both their service and the underlying cloud infrastructure. If you’d rather see the full session, check it out here.
Wavefront has also been an OKTA customer for years, and we have always found the service to be very reliable. No wonder, as I’ve learned that the OKTA engineering team’s motto is “Always On”. For those of you who are new to OKTA, it’s the rapidly growing San Francisco company behind the leading Identity as a Service (IDaaS) platform. OKTA provides its service to over 4,700 customers across all industries, and as of August 2018, it has over 5,500 application integrations and more than 1,200 employees worldwide.
Wavefront at OKTA Today
Michael’s core engineering team is responsible for the development and architecture of all OKTA backend infrastructure and services, ensuring “Always On” service. Michael opened his part of our session with an overview of how Wavefront is currently used at OKTA. The Wavefront deployment began within the backend infrastructure team, then spread to other engineering and DevOps teams, as well as to the marketing and sales teams. Now, Wavefront is used to monitor usage of all OKTA cloud resources: hosts, network traffic, AWS services like Kinesis, software component configuration, and software versions, to name a few.
Growth Challenges for Core Engineering and DevOps Teams
Being always on isn’t easy when you’re rapidly growing. During those early years of skyrocketing cloud service growth (a nice problem to have!), the core engineering and DevOps teams at OKTA accumulated loads of raw text data in the form of logs. Challenges emerged when they only used logs for their diagnostics. Engineers needed to acquire thorough domain product knowledge in order to understand various data formats and data implementations, which prolonged their incident resolution time. They started to build lots of internal tools, and they had to collect disparate pieces of business logic and siloed data, just to try to obtain meaningful metrics. Tools they used in those early days forced them to rely on non-unified data sources, making troubleshooting lengthy and laborious.
Also, their internal monitoring tools generated many false alarms. Over time, in addition to collecting raw text data, they started to add Remote Method Invocation (RMI). RMI helped them get the real-time state of their service health. However, some fundamental issues reemerged: it was hard to build time series metrics and hard to establish baselines, among other things. Engineers were also worried about security and access controls for the various methods they used. All of this resulted in a proliferation of policies governing who could invoke which methods, and from where. It was simply overwhelming, and engineers were exhausted. With all the different data sources, service providers and data languages, their work became error-prone. They needed to change their approach.
Enter Wavefront
The Wavefront cloud-native metrics monitoring platform resolved these inherited issues. It enabled Michael’s team to promptly understand the performance of their backend infrastructure as well as the different components of the OKTA service. The Wavefront platform provided unified metric observability, enabling developers to be productive without prior training. No more domain-specific knowledge was required. Wavefront introduced easy-to-customize dashboards and historical data with baselines. Michael’s team also gained an improved user experience, with more transparent metrics and many different reference points. They started receiving alerts without false alarms, and the engineering team could now trust their custom metrics displayed on NOC-style dashboards for all to observe.
Versioning Monitoring Using Custom Metrics
OKTA’s DevOps team deploys code into production multiple times per day across hundreds of servers. Moreover, they have to carefully manage their hosts to make sure that hosts are running the most recent software update, security patch or configuration change. All the data on different components collected daily are non-numeric. Engineers create hash functions to convert alphanumeric characters representing software versions into unique numbers to submit to Wavefront. Then within Wavefront, they use the variance function during a rolling restart deployment to analytically identify when some of their servers are running one software version, and the rest are running a different version. When the variance across servers is zero, they know that all servers are running the same desired update.
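As a rough illustration of the idea, here is a minimal Python sketch. It is not OKTA’s actual code: the function names, the choice of SHA-256, and the use of only the first four bytes of the digest are all assumptions made for this example. It maps each version string to a stable number and treats zero variance across hosts as the signal that the rollout has converged.

```python
import hashlib
from statistics import pvariance
from typing import Mapping


def version_to_metric(version: str) -> int:
    """Map a version string to a stable integer.

    Assumption for this sketch: the first 4 bytes of a SHA-256 digest,
    which keeps the value small enough to report as an ordinary metric.
    """
    digest = hashlib.sha256(version.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def rollout_converged(versions_by_host: Mapping[str, str]) -> bool:
    """Return True when every host reports the same version.

    Identical versions hash to identical numbers, so the variance
    across hosts is zero exactly when all reported values agree.
    """
    values = [version_to_metric(v) for v in versions_by_host.values()]
    return pvariance(values) == 0


if __name__ == "__main__":
    # Hypothetical host names and version strings.
    mid_rollout = {"web-1": "2018.35", "web-2": "2018.36", "web-3": "2018.36"}
    finished = {"web-1": "2018.36", "web-2": "2018.36", "web-3": "2018.36"}
    print(rollout_converged(mid_rollout))
    print(rollout_converged(finished))
```

One caveat with any hashing scheme like this: two different version strings could in principle hash to the same number, so zero variance is a very strong signal rather than a formal proof that the versions match.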
Monitoring AWS Kinesis Queue
The OKTA team uses the AWS Kinesis distributed queue. Monitoring Kinesis helps them oversee latency, delays, unexpected errors, and the rates at which records are put on and consumed from the queue. If the difference between the inbound and outbound curves (see the figure below) stays small, so does the lag. OKTA’s teams observe the status of the Kinesis queue daily, and they aim for zero latency between the inbound and outbound sides of the queue. That means that everything they put on Kinesis is consumed from Kinesis error-free and with almost no delay. Moreover, monitoring Kinesis helps them eliminate potential queuing delays, ultimately preventing service degradation.
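To make the inbound-versus-outbound comparison concrete, here is a small Python sketch. The function names, the sample numbers, and the backlog threshold are hypothetical and not taken from OKTA’s setup. It assumes you already have per-interval counts of records put on the queue and records consumed from it, and it computes the running backlog between the two curves.

```python
from itertools import accumulate
from typing import List, Sequence


def backlog(inbound: Sequence[int], outbound: Sequence[int]) -> List[int]:
    """Running backlog per interval: records put on the queue so far
    minus records consumed so far."""
    if len(inbound) != len(outbound):
        raise ValueError("inbound and outbound must cover the same intervals")
    return [i - o for i, o in zip(accumulate(inbound), accumulate(outbound))]


def keeping_up(inbound: Sequence[int], outbound: Sequence[int],
               max_backlog: int = 0) -> bool:
    """True when the backlog never exceeds max_backlog (a threshold
    chosen for this example)."""
    return all(b <= max_backlog for b in backlog(inbound, outbound))


if __name__ == "__main__":
    # Hypothetical per-minute record counts.
    put_counts = [100, 120, 90]
    get_counts = [100, 110, 100]
    print(backlog(put_counts, get_counts))
    print(keeping_up(put_counts, get_counts, max_backlog=10))
```

A backlog that stays near zero corresponds to the two curves tracking each other closely; a backlog that keeps growing means consumers are falling behind.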
At the close of our session, Michael shared some of OKTA’s plans for expanding its use of Wavefront, including applying Wavefront histograms, adopting containers, monitoring their AWS serverless functions and using more Wavefront AWS integrations.
It seems appropriate to close with Michael’s own words: “We are humans and we all make mistakes, but we use Wavefront to quickly spot incidents and help teams learn from past incidents to avoid future ones.” See what potential issues you may uncover – check out our free trial.
Stela Udovicic
The post How OKTA and uShip Scale Their Cloud Services with Wavefront Monitoring – Part 2 of 2 appeared first on Wavefront by VMware.