Pro Tips For Storage Performance Testing

Let’s face it — enterprise storage is a big investment.

And there are big, meaningful differences in how different storage products perform when you put them to the test.  Higher performing solutions can handle more workloads, more easily accommodate growth, and generate fewer unpleasant performance problems to deal with.

Great performing solutions can save both money and time.


Unfortunately, using publicly available information to compare different alternatives is a frustrating exercise at best.  Although we at VMware publish VSAN results frequently, that’s not the norm.  If you’d like a quick list of our published results to date, please scroll to the bottom of this post.

The lack of directly comparable performance testing information is not helpful if you have an important decision to make.

The solution?  Do your own head-to-head testing.  Investing in your own storage performance testing can help you figure out what’s the best product for you — and also avoid nasty surprises later on down the road.

And in this post, we’ll give you the basic do’s and don’ts you’ll need to be successful in doing your own storage performance testing.


It’s Not Magic

It may seem that storage performance testing is a black art but, in reality, it’s not hard to come up with useful results that can guide your decision-making process.  It can be a very valuable skill to have in any IT infrastructure professional’s portfolio.

Yes, you’ll inevitably make a few storage vendors a bit nervous, but that’s to be expected — just share your test plan with them, and press on.

Our goal here isn’t to make you into a certified expert — but to share enough that you can get the job done to your requirements.  Keep in mind, even among storage performance testing professionals, there are healthy differences of opinion, so be prepared for a few discussions along the way.

We’d like to encourage more people to do their own testing, especially when evaluating hyperconverged products.   But testing a hyperconverged storage product isn’t the same as testing a traditional external array.   Hyperconverged clusters are usually busy places: hundreds of VMs all with their own storage workloads.  The problem is that popular performance testing tools need a lot of scripting and automation to simulate that kind of environment.  Precious time gets spent making things run vs. getting results.

To make matters easier, we’ve just introduced a new free tool, HCIbench, that makes testing hyperconverged clusters far simpler.  Our tool, in turn, is based on the popular Vdbench open source storage testing tool that many of us prefer.  We’ll be working to make HCIbench open source as well before long.

Getting HCIbench up and running is a straightforward exercise.  If you’re interested in laying down a solid foundation of synthetic tests, it’s a good choice.  Many people start with a battery of synthetic tests to get a good 360 perspective, and then pick out one or two relevant application level tests to complement the picture.

Popular application testing choices include Jetstress for Exchange environments, and Sysbench for database environments.  If you’ve got money to spend, the excellent Dell Benchmark Factory For Databases is used extensively in the vendor community.

In this article, we’ll share our suggestions on how to get the most out of HCIbench — or any other testing tool you choose!

When it comes to storage performance testing, I remind people of the “Three R’s”: relevance, repeatability and results.


Relevance Matters

Relevance is all about testing in such a way that your testing results will somehow correlate with observations in production.  After all, it’s why you’re doing storage performance testing in the first place!

Configuration: you should ideally test what you plan to deploy, but that might not always be possible.  In the case of VSAN, performance will vary widely depending on the underlying hardware resources configured.

If you plan on deploying small, 3-node clusters, that’s what you should test.  But if you’re considering a decently-configured VSAN cluster, you should consider testing four nodes or more.

All VSAN testing to date has shown good linear scaling beyond 4 nodes, so there’s no burning need to test a ginormous cluster unless that’s what you really want to do.

More important: try to select the exact controllers, flash devices, network and disk devices you plan to use in production.  Don’t use lab leftovers; the results won’t be relevant.

Best practice: test the components you plan to deploy.

Choose Your Tool: while it’s true that many popular tools can be used if you know what you’re doing, certain tools make good performance testing easier than others.

However, testing at cluster scale can be problematic with traditional storage performance testing tools.  In a hyperconverged cluster, each server is expected to support many dozens of VMs, as well as contribute storage resources to the overall pool.   Ideally, you’d want to simulate a busy cluster, as that is what most people tend to run in production.

But using a traditional storage testing tool to do this correctly on a hyperconverged cluster can take a lot of work, not to mention being error-prone.  Time that could be spent doing actual testing is instead spent fiddling around with scripts and files.

To assist with this situation, we’ve just announced HCIbench, which automates the configuration, testing and data gathering of cluster-wide storage performance testing.  It in turn uses the standard open source Vdbench benchmark for individual workloads.

More about HCIbench here — give it a try, you’ll find it makes your life a lot easier.

Best practice: use a tool that makes the job easier, not harder.  Additionally, if you choose Vdbench (with or without HCIbench for automation), we’ll be able to easily compare your results with ours.


Model your workloads: transactional, VDI, decision support, mixed — or something else?

If you have an existing production cluster to model, great.  VMware offers tools that will give you a good feel for your existing workload profile.   For example, the new storage features in the VMware VIP tool can collect performance data from your existing environment and make specific recommendations on cache sizing, etc.  Yes, it takes some extra work to set it up, but it’s usually worth it as the recommendations tend to be very

However, if you’re building a greenfield cluster, you’ll have to take some educated guesses on what your IO profiles will look like: block size mix, read/write mix and random/sequential mix.

When considering hybrid storage configurations (e.g. mixed flash and disk), the most important factor will be to estimate the size of your “working set”, i.e. the proportion of your entire data set that will be actively accessed.  Most observed working sets are less than 5% of the total dataset size, but there are exceptions.  If your tests size your working set too large, you’ll get a less-than-ideal picture of hybrid performance that won’t begin to correspond with reality.

By comparison, all-flash configurations don’t need to be concerned with working set sizes.  At least with respect to VSAN, the purpose of cache there is simply to extend the life of the lower-endurance capacity flash.

If you have the time and inclination, you can run through a variety of tests with different profiles to get a more complete picture.  It’s an interesting exercise, but certainly not a quick task.
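
To make the sizing concrete, here is a rough sketch of the arithmetic (ours for illustration, not part of HCIbench).  It assumes the test workload touches the whole of every test VMDK, so the combined VMDK size is the working set being simulated.  The function name, its parameters and the example cluster figures are all hypothetical.

```python
def working_set_plan(usable_capacity_gb, num_vms, vmdks_per_vm, working_set_pct=5):
    """Size test VMDKs so their combined footprint matches a target working set.

    Assumption: every block of every test VMDK is actively accessed, so
    total VMDK capacity == simulated working set.
    """
    if not 0 < working_set_pct <= 100:
        raise ValueError("working_set_pct must be in (0, 100]")
    total_vmdks = num_vms * vmdks_per_vm
    working_set_gb = usable_capacity_gb * working_set_pct / 100
    return {
        "working_set_gb": working_set_gb,
        "total_vmdks": total_vmdks,
        "vmdk_size_gb": working_set_gb / total_vmdks,
    }


# Hypothetical cluster: 20 TB usable, 40 test VMs with 10 VMDKs each.
plan = working_set_plan(usable_capacity_gb=20_000, num_vms=40, vmdks_per_vm=10)
print(plan)
```

Raising `working_set_pct` in later runs is a simple way to walk the working set upwards and see where hybrid performance starts to fall off.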

Best practice: have a rough idea of your target IO profile: read/write mix, sequential/random mix and working set size.  If you don’t have an existing setup to measure, make some educated guesses.  Testing *everything* is very time consuming and results in a morass of data that’s hard to make sense of.


Choose your goal: IOPS, bandwidth, latency, or a mix?  When you set up your testing, you’ll be trading off between different priorities.

Small blocks and large OIO (outstanding IO) queues will deliver the best IOPS, but bandwidth and latency will not be optimized.  Large blocks and large OIO will show great bandwidth, but IOPS and latency will not be ideal.  Small blocks and low OIO will show great latency, but both bandwidth and IOPS will go in the other direction.

You get the idea — tradeoffs abound.
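
The three metrics are tied together by simple arithmetic: Little’s Law says throughput equals outstanding IOs divided by the time each IO takes, and bandwidth is just IOPS times block size.  The sketch below shows how that plays out.  The latency figures are hypothetical inputs picked only to illustrate the shape of the tradeoff; they are not measurements from any product.

```python
def iops_from_littles_law(outstanding_ios, latency_ms):
    """Little's Law: IOs per second = outstanding IOs / seconds per IO."""
    return outstanding_ios * 1000.0 / latency_ms


def bandwidth_mb_s(iops, block_kb):
    """Bandwidth in MB/s is IOPS multiplied by the block size."""
    return iops * block_kb / 1024.0


# (label, block size in KB, outstanding IOs, assumed latency in ms)
# The latencies are made-up illustrative values, not test results.
scenarios = [
    ("small blocks, high OIO", 4, 64, 2.0),
    ("large blocks, high OIO", 128, 64, 8.0),
    ("small blocks, low OIO", 4, 1, 0.5),
]

for label, block_kb, oio, latency_ms in scenarios:
    rate = iops_from_littles_law(oio, latency_ms)
    mb_s = bandwidth_mb_s(rate, block_kb)
    print(f"{label}: {rate:.0f} IOPS, {mb_s:.1f} MB/s at {latency_ms} ms")
```

With these assumed inputs, the first case wins on IOPS, the second on bandwidth and the third on latency, which is the same pattern described above.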

Transactional and VDI environments usually optimize around latency and IOPS.  Decision support and file access environments usually optimize around bandwidth.  Test and dev is usually a mix of both.

Best practice: consider testing at three different block sizes (e.g. 4K, 32K, 128K) so you’ll have nicely spaced data points to interpolate: random profiles for the smaller blocks, and sequential profiles for the larger blocks.  If you’re not sure of your working set size, start small (~5% of usable capacity) and then walk upwards as time and interests allow.

If you’re testing a busy cluster with many VMs and VMDKs, keep OIOs moderate (e.g. 1-2 per VMDK) so you’ll get a nice balance of latency and throughput.
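
One way to lay out the runs suggested above is a small test matrix.  This is only a sketch: the 32K cutoff between “random” and “sequential” is our assumption (the guidance only says smaller vs. larger blocks), and the function name and cluster size are hypothetical.

```python
from itertools import product

BLOCK_SIZES_KB = (4, 32, 128)   # the three suggested data points
OIO_PER_VMDK = (1, 2)           # "moderate" outstanding IO per VMDK


def build_matrix(num_vms, vmdks_per_vm, sequential_from_kb=32):
    """List every block size / OIO combination with its cluster-wide OIO.

    Assumption: blocks of `sequential_from_kb` and larger run sequentially,
    smaller blocks run randomly.
    """
    rows = []
    for block_kb, oio in product(BLOCK_SIZES_KB, OIO_PER_VMDK):
        rows.append({
            "block_kb": block_kb,
            "pattern": "sequential" if block_kb >= sequential_from_kb else "random",
            "oio_per_vmdk": oio,
            # Every VMDK contributes its own queue, so the cluster total adds up fast.
            "cluster_oio": num_vms * vmdks_per_vm * oio,
        })
    return rows


matrix = build_matrix(num_vms=40, vmdks_per_vm=10)
for row in matrix:
    print(row)
```

The `cluster_oio` column is the reason to keep per-VMDK queues small: with hundreds of VMDKs, even 1 or 2 outstanding IOs each puts hundreds of IOs in flight across the cluster.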

Some of my favorite testing mixes:

– “Classic” OLTP — 70% read, 30% write, 50% random, 4K blocks, small working set size
– Decision Support — 95% read, 5% write, 10% random, 32K blocks, moderate working set size
– Data load — 100% write, 100% sequential, 32K blocks, large working set size
– Data unload — 100% read, 100% sequential, 32K blocks, large working set size
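
If you keep your mixes as data, it’s easy to reuse them across tools.  The sketch below records the four mixes above and renders each one as a Vdbench-style workload definition line.  The parameter names (`wd`, `sd`, `xfersize`, `rdpct`, `seekpct`) are as we understand them from the Vdbench documentation, where `seekpct` is the random percentage; verify them against your Vdbench version.  The `Mix` class and `vdbench_wd` helper are hypothetical names for illustration, and the storage definitions (`sd=`) and run definitions (`rd=`) are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mix:
    name: str
    read_pct: int      # remainder is writes
    random_pct: int    # remainder is sequential
    block_kb: int
    working_set: str   # qualitative, as in the list above


MIXES = [
    Mix("oltp", read_pct=70, random_pct=50, block_kb=4, working_set="small"),
    Mix("decision_support", read_pct=95, random_pct=10, block_kb=32, working_set="moderate"),
    Mix("data_load", read_pct=0, random_pct=0, block_kb=32, working_set="large"),
    Mix("data_unload", read_pct=100, random_pct=0, block_kb=32, working_set="large"),
]


def vdbench_wd(mix):
    """Render one mix as a Vdbench-style workload definition line (assumed syntax)."""
    return (
        f"wd={mix.name},sd=sd*,xfersize={mix.block_kb}k,"
        f"rdpct={mix.read_pct},seekpct={mix.random_pct}"
    )


for mix in MIXES:
    print(vdbench_wd(mix))
```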



Multiple workloads: Make it real! You’ll notice in almost all of our testing, we’re firing up many VMs, each with multiple VMDKs.  Why?  Most clusters are busy and doing a lot of different work, so many VMs makes sense.  And using lots of VMDKs is a great way to generate multiple, parallel IO streams as well as fully utilize storage capacity.

Best practice: lots of VMs, lots of VMDKs. A small number of VMs and/or VMDKs won’t really push the environment as much as you’d like.

Initialization: test steady-state, not start up.  If you’re using newly provisioned VMDKs, it makes sense to initially write through all the data sets first to avoid the “first write” allocation penalty.  This better reflects how VMDKs are actually used in production.

The other initialization that makes sense is cache warming.  When you first start up a workload, it takes a short while before steady-state performance is achieved.  Put differently, a 5-minute testing run won’t get you steady state performance.

Best practice: write through all data sets on newly created VMDKs, and allow time for cache warming before taking measurements (60 minutes minimum is suggested).
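
If you’d rather check for steady state than trust the clock, one simple approach is to look for the first window of interval samples whose variation is small, and only average from there on.  This is a minimal sketch of that idea, not how HCIbench or Vdbench decide anything.  The window size, the 1% threshold and the sample IOPS series are all hypothetical.

```python
from statistics import mean, pstdev


def steady_state_start(samples, window=4, max_cv=0.01):
    """Index of the first window whose coefficient of variation
    (standard deviation / mean) is at most `max_cv`, or None if there is none."""
    for start in range(len(samples) - window + 1):
        chunk = samples[start:start + window]
        avg = mean(chunk)
        if avg > 0 and pstdev(chunk) / avg <= max_cv:
            return start
    return None


def steady_state_mean(samples, window=4, max_cv=0.01):
    """Average only the samples from the steady-state point onwards."""
    start = steady_state_start(samples, window, max_cv)
    if start is None:
        raise ValueError("no steady state found; run the test for longer")
    return mean(samples[start:])


# Hypothetical per-interval IOPS readings: the cache warms up, then levels off.
iops_samples = [12000, 18000, 24000, 29000, 30000, 30100, 29900, 30050, 29950, 30000]

print(steady_state_start(iops_samples))
print(steady_state_mean(iops_samples))
```

Averaging the whole series would drag the result down with warm-up samples, which is exactly the start-up effect you’re trying to keep out of your numbers.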


Be Realistic:  there’s this very strong temptation to shoot for the Best Possible Numbers when doing performance testing.  Resist the temptation.  Yes, you can show millions of IOPS, or sub-millisecond latencies — but is that what you really need?

A good storage product should make it easy to get to good numbers without exotic steps being needed to get there.

Best Practice: decide before you begin when enough is enough — either results observed, or time invested.

Test Your Scenarios: it’s always interesting to see how performance changes when (for example) a drive or server fails and has to be reprotected.  If you plan to use deduplication and/or compression, our suggested HCIbench / Vdbench tool can let you see how different data streams impact performance.  If your environment has a few really busy VMs, you should test that as well.

Best Practice: plan on failing a disk (or a server) and see what happens to performance when a rebuild kicks in.  It’s a useful piece of information.  Keep in mind that VSAN implements a standard 60 minute delay if you simply pull out a drive (vs. a drive actually failing).  Powering down a server has the same behavior.


Repeatability Matters, Too

If you ever took a lab class in school, it was important to write down everything you did and saw.  The same thing applies here as well.  Ideally, you’d capture enough information that someone could repeat your exact results if they wanted to.


Environment Configuration: document all the hardware bits that your test is running on: server configs, IO controller, flash devices, disk devices, firmware, drivers, network configuration — the works.  Next, document the supporting software you’re using: version of Linux, scheduler algorithm used, direct access or through a file system, version of testing software and so on.

Yes, it’s a lot of information, but it can be incredibly useful if you encounter a performance anomaly, or you later see a problem in production.

Best practice: capture the specifics of your entire environment in such a way that someone else could rebuild it *exactly* if needed.  Practice good version control as well; introducing new hardware or software components mid-testing can invalidate previous testing.


Testing procedure: document precisely what you did.  How many VMs and VMDKs?  What size?  What IO profile? Number of OIOs?  Did you pre-condition the environment?  Did you fire up all the VMs at once, or slowly bring them on line? Did you allow enough time for performance to reach steady-state?  Did you wipe out previous VMDKs, or re-use existing ones?  Details matter.

Best practice: clearly document the exact steps you used before, during and after your test runs.  Again, very useful if you encounter an anomaly and want to work with support.
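
A lightweight way to keep yourself honest is to write the answers to those questions into a small machine-readable record for every run, and store it next to the results.  Here is one possible sketch.  The `RunRecord` class, its field names and the example values are all hypothetical; adapt them to whatever your tool and environment actually expose.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class RunRecord:
    run_id: str
    num_vms: int
    vmdks_per_vm: int
    vmdk_size_gb: float
    block_kb: int
    read_pct: int
    random_pct: int
    oio_per_vmdk: int
    preconditioned: bool     # were the VMDKs written through first?
    reused_vmdks: bool       # or wiped and recreated?
    vm_start: str            # e.g. "all_at_once" or "staggered"
    warmup_minutes: int
    measured_minutes: int
    notes: str = ""


def to_manifest(run):
    """Serialize a run record as stable, diff-friendly JSON."""
    return json.dumps(asdict(run), indent=2, sort_keys=True)


# Example values are illustrative only.
run = RunRecord(
    run_id="oltp-4k-run01",
    num_vms=40, vmdks_per_vm=10, vmdk_size_gb=2.5,
    block_kb=4, read_pct=70, random_pct=50, oio_per_vmdk=2,
    preconditioned=True, reused_vmdks=False, vm_start="all_at_once",
    warmup_minutes=60, measured_minutes=60,
)
print(to_manifest(run))
```

Because the keys are sorted, two manifests can be compared with an ordinary diff, which makes it quick to spot what changed between a good run and an anomalous one.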


Capture Data: in addition to capturing storage-specific results (IOPS, latencies, bandwidth), also capture relevant environmentals, in particular CPU, memory and network utilization.  In addition to what is captured by the test tool, also capture what is being seen at both the hypervisor and storage layer.   Also, make sure you have a plan to keep your results tightly associated with the test parameters you used — it’s easy to forget where you put things!

Best practice: do a trial run or two to make sure you know how to capture all the relevant storage statistics, as well as environmentals: CPU, memory and network.


Results Really Matter

I can’t tell you how often I’ve been in a meeting where someone goes through reams of performance results and completely loses everyone in the room.  It takes a lot of experience and brainpower to interpret what all that data might mean, so do your audience a favor: tell them what you think it means in plain and very simple words.

If people can’t draw obvious conclusions from your work, the testing effort likely didn’t achieve its goal, which is to make an informed decision.

A few tips?

Headlines First:  don’t save the punchline until slide 132, put it up front!  If there were clear and obvious conclusions from your testing, lead with that and then show how you arrived at that conclusion.

Assumptions Made: early on, you had to make some assumptions about the workloads you modeled, what kinds of performance were important (e.g. response time vs aggregate throughput), and other items.  Get this out of the way early — if people don’t agree with your assumptions, they’re not going to agree with your conclusions.

Relative Price Metrics:  yes, costs matter.  If solution A delivers 90% of the performance of solution B at a fraction of the cost, that’s relevant.  Getting to precise pricing can be cumbersome, but you can easily ballpark to illustrate the point.  Perhaps think in terms of a restaurant rating system, e.g. $, $$, $$$ and $$$$.

Show Sample Data: you probably ran a lot of tests, but there are probably a handful that illustrate your point and support your conclusion.  I’m not endorsing cherry-picking, but time is money, and showing a handful of representative tests vs. your entire catalog saves time.

Everything Else In An Appendix: all the gory detail: your complete data sets, your precise configurations, your testing methodology, etc. — all of that goes to an appendix of some sort in case someone is interested.


Final Thoughts

You’ll see all sorts of debates about storage performance testing.  No shortage of opinions.

One claim is that the only relevant tests are with your exact workloads, under your exact conditions.  Of course, that’d be nice, but also very difficult (and expensive) to do.  Not to mention, workload profiles change all the time.

Those of us who have been doing storage performance testing for a long while always start with a baseline of synthetics, and then move to application-specific tests as time and interests allow.  There are many good reasons for this synthetics-first approach: faster time-to-results, the ability to spot potential problem areas quickly, the ease of dialing up worst-case scenarios, and more.

You can capture a very wide range of application behaviors with synthetic tests, just as a synthesizer can capture an incredibly wide range of “real” musical instruments.   Don’t let people try and talk you out of them — they exist for a very good reason.

Another debate is “does storage performance really matter?”  In other words, is good enough good enough?  That, of course, depends on your specific situation, but in many situations, yes, it does matter a great deal.  If you can use a smaller, more cost-effective configuration to achieve your results, that’s a win.  And no one likes unpleasant performance problems.

The bottom line?

A lot of money gets spent on storage and, more recently, hyperconverged systems.  There are significant and meaningful variations in how different products perform.  Being able to get good comparative published information would be nice, but that’s not the case today.

Storage performance testing is a valuable skill to have in your IT portfolio.  Practically speaking, not many people are comfortable with it, which means they have to rely on vendors to provide information.  Sometimes that’s OK, often it’s not.  So many vendors, so many issues …

In reality, it’s not all that hard to do decently well, if you know a few do’s and don’ts.  With a little practice and effort, you’ll be able to provide a valuable service to your team, as well as have much more informed discussions with your vendors.

And that’s a win.

