From 0d2be9eba7abec75d8fb3115fd2ab748ce6ffbc9 Mon Sep 17 00:00:00 2001
From: mbeierl
Date: Fri, 12 Oct 2018 13:42:57 -0400
Subject: Additional documentation

Change-Id: I9b176794206e39db436d9597d976c42b7e9d22cf
Signed-off-by: mbeierl
---
 docs/testing/user/introduction.rst | 196 +++++++++++++++++++++++++++++++++++--
 1 file changed, 190 insertions(+), 6 deletions(-)

diff --git a/docs/testing/user/introduction.rst b/docs/testing/user/introduction.rst
index 49e3220..0099c39 100644
--- a/docs/testing/user/introduction.rst
+++ b/docs/testing/user/introduction.rst
@@ -25,13 +25,13 @@ performance metrics in the shortest reasonable time.

 How Does StorPerf Work?
 =======================
-Once launched, StorPerf presents you with a ReST interface, along with a
+Once launched, StorPerf presents a ReST interface, along with a
 `Swagger UI `_ that makes it easier to form HTTP ReST
 requests. Issuing an HTTP POST to the configurations API
-causes StorPerf to talk to your OpenStack's heat service to create a new stack
-with as many agent VMs and attached Cinder volumes as you specify.
+causes StorPerf to talk to OpenStack's heat service to create a new stack
+with as many agent VMs and attached Cinder volumes as specified.

-After the stack is created, you can issue one or more jobs by issuing a POST
+After the stack is created, we can start one or more jobs by issuing a POST
 to the jobs ReST API. The job is the smallest unit of work that StorPerf
 can use to measure the disk's performance.

@@ -45,8 +45,187 @@
 measured start to "flat line" and stay within that range for the specified
 amount of time, then the metrics are considered to be indicative of a
 repeatable level of performance.

-What Data Can I Get?
-====================
+StorPerf Testing Guidelines
+===========================
+
+First of all, StorPerf is not able to give pointers on how to tune a Cinder
+implementation, as there are far too many backends (Ceph, NFS, LVM, etc.),
+each with its own methods of tuning. StorPerf is here to assist in getting
+a reliable performance measurement by encoding the test specification from
+SNIA and presenting the results in a way that makes sense.
+
+Having said that, there are some general guidelines that we can offer to
+assist with planning a performance test.
+
+Workload Modelling
+------------------
+
+This is an important item to address, as there are many parameters that
+describe how data is accessed. Databases typically use a fixed block size
+and tend to manage their data so that sequential access is more likely.
+GPS image tiles can be around 20-60 KB and are accessed by reading the file
+in full, with no easy way to predict which tiles will be needed next. Some
+programs are able to submit I/O asynchronously, while others rely on
+multiple threads and submit I/O synchronously. There is no
+one-size-fits-all here, so knowing what type of I/O pattern we need to
+model is critical to getting realistic measurements.
+
+System Under Test
+-----------------
+
+The unfortunate part is that StorPerf does not have any knowledge of the
+underlying OpenStack deployment itself – it can only see what is exposed
+through the OpenStack APIs, and none of them provide details about the
+underlying storage implementation. As the test executor, we need to know
+information such as: the number of disks or storage nodes; the amount of
+RAM available for caching; and the type and bandwidth of the connection to
+the storage.
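+
+As a rough illustration of the ReST workflow described under "How Does
+StorPerf Work?", the following sketch creates a small stack through the
+configurations API. The address, port, and field names shown here are
+assumptions for illustration only; the Swagger UI served by the running
+StorPerf container is the authoritative reference for the actual schema.
+
+.. code-block:: python
+
+    # Minimal sketch, assuming StorPerf answers on 127.0.0.1:5000 with an
+    # /api/v1.0 prefix; the field names are illustrative and should be
+    # checked against the Swagger UI of the StorPerf release in use.
+    import requests
+
+    STORPERF = "http://127.0.0.1:5000/api/v1.0"  # assumed address and API root
+
+    stack = {
+        "agent_count": 2,            # number of agent VMs to create
+        "volume_size": 10,           # size of each Cinder volume, in GB
+        "public_network": "ext-net"  # hypothetical external network name
+    }
+
+    response = requests.post(STORPERF + "/configurations", json=stack)
+    response.raise_for_status()
+    print(response.json())           # StorPerf asks Heat to create the stack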
+
+Measure Storage, not Cache
+--------------------------
+
+When sizing the test data set, we need to ensure that caching does not
+interfere with the measurements. The total size of the data set in the
+test must exceed the total size of all the disk cache memory available by a
+certain amount in order to ensure we are forcing non-cached I/O. There is
+no exact science here, but if we balance test duration against cache hit
+ratio, it can be argued that a 20% cache hit rate is good enough and that
+increasing the data set size further would yield diminishing returns.
+Let's break this number down a bit. Given a cache size of 10GB, we could
+write, then read, the following data set sizes:
+
+* 10GB gives 100% cache hit
+* 20GB gives 50% cache hit
+* 50GB gives 20% cache hit
+* 100GB gives 10% cache hit
+
+This means that for the first test, 100% of the results are unreliable due
+to cache. At 50GB, the true performance without cache has only a 20%
+margin of error. Given that 100GB would take twice as long while reducing
+the margin of error by only a further 10%, we recommend a data set of five
+times the cache size (50GB in this example) as the best tradeoff.
+
+How much cache do we actually have? This depends on the storage device
+being used. For hardware NAS or other arrays, it should be fairly easy to
+get the number from the manufacturer, but for software defined storage it
+can be harder to determine. Let's take Ceph as an example. Ceph runs as
+software on the bare metal server and therefore has access to all the RAM
+available on the server to use as its cache. Well, not exactly all the
+memory: we have to take into account the memory consumed by the operating
+system, by the Ceph processes, and by any other processes running on the
+same system. In the case of hyper-converged Ceph, where workload VMs and
+Ceph run on the same systems, it can become quite difficult to predict.
+Ultimately, the amount of memory that is left over is the cache for that
+single Ceph node. We then need to add the memory available from all the
+other Ceph storage nodes in the environment. Time for another example:
+given 3 Ceph storage nodes with 256GB of RAM each, and reserving a portion
+of that for the operating system and the other processes on each node, we
+are left with approximately 240GB per node. This gives us 3 x 240, or
+720GB, of total RAM available for cache. The total amount of data we want
+to write in order to initialize our Cinder volumes would then be 5 x 720,
+or 3,600 GB. The following illustrates some ways to allocate the data (a
+short sketch of this arithmetic follows the list):
+
+* 1 VM with a single 3,600 GB volume
+* 10 VMs, each with one 360 GB volume
+* 2 VMs, each with five 360 GB volumes
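+
+As referenced above, here is a minimal sketch of that sizing arithmetic.
+The node count, RAM size, and per-node reservation are example inputs taken
+from this section (the 16 GB reservation is simply the assumed difference
+between 256GB and 240GB), and the 5x multiplier corresponds to the roughly
+20% cache hit target discussed earlier.
+
+.. code-block:: python
+
+    # Minimal sketch of the cache and data set sizing described above.
+    storage_nodes = 3
+    ram_per_node_gb = 256
+    reserved_per_node_gb = 16   # assumed OS and other-process reservation
+    dataset_multiplier = 5      # data set = 5 x cache -> roughly 20% cache hit
+
+    cache_per_node_gb = ram_per_node_gb - reserved_per_node_gb  # 240 GB
+    total_cache_gb = storage_nodes * cache_per_node_gb          # 720 GB
+    dataset_gb = dataset_multiplier * total_cache_gb            # 3,600 GB
+
+    vm_count = 10
+    volume_size_gb = dataset_gb // vm_count                     # 360 GB each
+    print(total_cache_gb, dataset_gb, volume_size_gb)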
+
+Back to Modelling
+-----------------
+
+Now that we know there is 3.6 TB of data to be written, we need to go back
+to the workload model to determine how we are going to write it. Factors
+to consider:
+
+* Number of Volumes. We might be simulating a single database of 3.6 TB,
+  so only 1 Cinder volume is needed to represent this. Or, we might be
+  simulating a web server farm where there are hundreds of processes
+  accessing many different volumes. In this case, we divide the 3.6 TB by
+  the number of volumes, making each volume smaller.
+* Number of Virtual Machines. We might have one monster VM that will drive
+  all our I/O in the system, or maybe there are hundreds of VMs, each with
+  its own volume. Using Ceph as an example again, we know that it allows a
+  single VM to consume all the Ceph resources, which can be perceived as a
+  problem in terms of multi-tenancy and scaling. A common practice to
+  mitigate this is to use Cinder to throttle IOPS at the VM level. If this
+  technique is being used in the environment under test, we must adjust the
+  number of VMs used in the test accordingly.
+* Block Size. We need to know if the application is managing the volume as
+  a raw device (i.e. /dev/vdb) or as a filesystem mounted over the device.
+  Different filesystems have their own block sizes: ext4 only allows 1024,
+  2048 or 4096 bytes as the block size. Typically, the larger the block,
+  the better the throughput; however, as blocks must be written as an
+  atomic unit, larger block sizes can also reduce effective throughput by
+  having to pad the block if the content is smaller than the actual block
+  size.
+* I/O Depth. This represents the amount of I/O that the application can
+  issue simultaneously. In a multi-threaded application, or one that uses
+  asynchronous I/O, it is possible to have multiple read or write requests
+  outstanding at the same time. For example, with software defined
+  storage, where there is an Ethernet network between the client and the
+  storage, the storage has a higher latency for each I/O but is capable of
+  accepting many requests in parallel. With an I/O depth of 1, we spend
+  time waiting for the network latency before a response comes back. With
+  a higher I/O depth, we can get more throughput despite each I/O having
+  higher latency. Typically, we do not see applications that go beyond a
+  queue depth of 8; however, this is not a firm rule.
+* Data Access Pattern. We need to know if the application typically reads
+  data sequentially or randomly, as well as what the mixture of read vs.
+  write is. It is possible to measure read by itself, or write by itself,
+  but this is not typical application behavior. It is, however, useful for
+  determining the potential maximum throughput of a given type of
+  operation.
+
+Fastest Path to Results
+-----------------------
+
+Once we have this information gathered, we can start executing some tests.
+Let's take some of the points discussed above and describe our system:
+
+* OpenStack deployment with 3 Control nodes, 5 Compute nodes and 3
+  dedicated Ceph storage nodes.
+* The Ceph nodes each have 240 GB RAM available to be used as cache.
+* Our application writes directly to the raw device (/dev/vdb).
+* There will be 10 instances of the application running, each with its own
+  volume.
+* Our application can use block sizes of 4k or 64k.
+* Our application is capable of maintaining up to 6 I/O operations
+  simultaneously.
+
+The first thing we know is that we want to keep our cache hit ratio around
+20%, so we will be moving 3,600 GB of data. We also know this will take a
+significant amount of time, and this is where StorPerf helps.
+
+First, we use the configurations API to launch our 10 virtual machines,
+each with a 360 GB volume. Next comes the most time-consuming part: we
+call the initializations API to fill each one of these volumes with random
+data. By preloading the data, we ensure a number of things (a sketch of
+the corresponding API calls follows this list):
+
+* The storage device has had to fully allocate all of the space for our
+  volumes. This is especially important for software defined storage like
+  Ceph, which is smart enough to know if data is being read from a block
+  that has never been written. No data on disk means no disk read is
+  needed and the response is immediate.
+* The RAM cache has been overrun multiple times. Only 20% of what was
+  written can possibly remain in cache.
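+
+The sketch below illustrates the preload step just described, together with
+the job submission covered in the next paragraphs. As before, the address,
+endpoint paths, and field names are assumptions based on the Swagger UI
+conventions and must be verified against the running StorPerf instance; the
+workload values simply mirror the 70%/30% read/write example used in this
+section.
+
+.. code-block:: python
+
+    # Rough sketch only: paths and parameter names are assumptions, not a
+    # definitive description of the StorPerf initializations or jobs APIs.
+    import requests
+
+    STORPERF = "http://127.0.0.1:5000/api/v1.0"  # assumed address and API root
+
+    # Preload: fill every attached volume with random data.
+    requests.post(STORPERF + "/initializations").raise_for_status()
+
+    # Submit a mixed read/write job: 4k blocks, queue depth 6, 70% reads.
+    job = {
+        "workload": "rw",       # mixed read/write; names depend on the release
+        "block_sizes": "4096",  # 4k block size
+        "queue_depths": "6",    # I/O depth of 6
+        "rwmixread": "70",      # standard FIO option: 70% reads / 30% writes
+    }
+    response = requests.post(STORPERF + "/jobs", json=job)
+    response.raise_for_status()
+    print(response.json())      # typically returns a job id that can be polled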
+
+This last part is important, as we can now use StorPerf's implementation of
+SNIA's steady state algorithm to ensure our follow-up tests execute as
+quickly as possible. Given that 80% of the data in any given test results
+in a cache miss, we can run multiple tests in a row without having to
+re-initialize the volumes or invalidate the cache between test runs. We
+can also mix and match the types of workloads to be run in a single
+performance job submission.
+
+Now we can submit a job to the jobs API to execute a 70%/30% mix of
+read/write, with a block size of 4k and an I/O queue depth of 6. This job
+will run until either the maximum time has expired or StorPerf detects that
+steady state has been reached, at which point it completes immediately and
+reports the results of the measurements.
+
+StorPerf uses FIO as its workload engine, so whatever workload parameters
+we would like to use with FIO can be passed directly through StorPerf's
+jobs API.
+
+What Data Can We Get?
+=====================

 StorPerf provides the following metrics:

@@ -57,4 +236,9 @@
 These metrics are available for every job, and for the specific workloads,
 I/O loads and I/O types (read, write) associated with the job.

+For each metric, StorPerf also provides the set of samples that were
+collected along with the slope, min and max values that can be used for
+plotting or comparison.
+
 As of this time, StorPerf only provides textual reports of the metrics.
+
--
cgit 1.2.3-korg