Introduction

Write speed is often one of the main bottlenecks in a database. Techniques like write-back caching can absorb spikes in traffic, but under sustained high throughput a single node will eventually hit its limit. Many databases call the fsync syscall as part of committing a transaction in order to guarantee durability, and that durability comes at the cost of added latency. Because this syscall can take up the lion’s share of the total latency for a database transaction, the rate at which fsync can be called is a major factor in how many transactions can be processed per second [1].

The focus of this post isn’t the details of how databases use fsync, but rather the performance limits of this syscall on commodity servers. To explore that, we’re going to measure I/O write performance on gp2 EBS volumes attached to AWS EC2 instances. We’ll look at how quickly fsync operations can be run and whether there is any evidence of parallelization or aggregation. The goal is to answer the question: Can IOPS be estimated from fsync alone?

Commodity Disks in the Cloud

Amazon Web Services (AWS) provides Elastic Block Store (EBS) as an option for persistent storage in the cloud. EBS volumes are attached to EC2 instances to provide the file system that applications interact with. We’re going to look at gp2 volumes, which are EBS’s standard general-purpose SSDs. The write throughput of a gp2 volume is measured in I/O operations per second, or IOPS. Every gp2 volume starts with a minimum of 100 IOPS and can go up to a maximum of 16000 IOPS. Between the minimum and maximum, 3 IOPS are added to the baseline for every GB of provisioned disk space. The equation for the baseline IOPS is as follows:

  • Baseline IOPS = min(max(100, 3 * SIZE_IN_GB), 16000)

Note that gp2 volumes under 1000GB can burst temporarily up to 3000 IOPS [2]. To keep things simple, we’ll look at two volume sizes: (1) 1600GB and (2) 5500GB, which give us 4800 IOPS and 16000 IOPS, respectively. This lets us explore the range of fsync latencies for different IOPS limits while keeping the baseline above 3000 IOPS, avoiding the complexities of transient burst capacity. These volumes were attached to t2.2xlarge EC2 instances (8 vCPUs / 32GB RAM) running Ubuntu 24.04 LTS (x86). A larger instance size was chosen to reduce the likelihood of bottlenecks beyond disk writes.
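As a quick sanity check on the sizes chosen here, the baseline formula can be written as a small helper (my own sketch, not anything AWS provides):

def gp2_baseline_iops(size_in_gb):
    # 3 IOPS per provisioned GB, floored at 100 and capped at 16000
    return min(max(100, 3 * size_in_gb), 16000)

print(gp2_baseline_iops(1600))   # 4800
print(gp2_baseline_iops(5500))   # 16000 (3 * 5500 = 16500, clipped at the cap)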

[Figure: gp2 EBS volumes attached to t2.2xlarge EC2 instances]

Latency with Fsync

It’s useful to consider the latency of a single fsync to get an idea of the disk’s speed. Percona used a Python script in their article, which I have modified slightly to report the average fsync latency over 3000 samples [4].

# For original see: https://www.percona.com/blog/fsync-performance-storage-devices/

import os, mmap
import time

# Number of fsync samples to take
N = 3000

# Open a file with O_DIRECT to bypass the page cache
fd = os.open("testfile", os.O_RDWR | os.O_CREAT | os.O_DIRECT)

# O_DIRECT needs an aligned buffer; an anonymous mmap is page-aligned
m = mmap.mmap(-1, 512)

s = time.perf_counter()
for i in range(N):
    os.lseek(fd, 0, os.SEEK_SET)  # rewind to the start of the file
    m[1] = 1                      # touch the buffer
    os.write(fd, m)               # write one 512-byte block
    os.fsync(fd)                  # flush it to stable storage
e = time.perf_counter()

elapsed = e - s
dt_fsync = 1e6 * elapsed / N
iops = N / elapsed

print(f"Time elapsed for {N} fsyncs: {elapsed:.3f} s")
print(f"Avg fsync dt: {dt_fsync:.1f} [usec]")
print(f"IOPS: {iops:.1f}")

# Close opened file
os.close(fd)

Running the script above on a 1600GB gp2 volume gave an average fsync latency of 4.25 ms, which corresponds to about 235 IOPS. Recall from the equation in the previous section that a gp2 volume gets 3 IOPS per GB of storage, so a 1600GB volume should be able to deliver up to 4800 IOPS. This means the volume must be able to service multiple fsync operations together, perhaps through batching or parallelization, because running fsync serially will never get us to that level of IOPS.
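Just to make that gap concrete, the back-of-the-envelope arithmetic looks like this:

# Serial fsync ceiling: each fsync must finish before the next one starts
avg_fsync_latency_s = 0.00425            # 4.25 ms measured above
serial_iops = 1 / avg_fsync_latency_s
print(serial_iops)                       # ~235 IOPS
print(4800 / serial_iops)                # ~20x gap that has to come from overlapping I/O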

Benchmarking Latency and IOPS with FIO

To get more detailed results on the fsync latency and IOPS performance of gp2 volumes, I used fio (Flexible I/O Tester), an open-source benchmarking tool that can generate a range of different I/O workloads. Starting with a root volume size of 1600GB, the first test was run with the following command to generate random 4k writes:

fio \
    --name=random-write \
    --bs=4k \
    --end_fsync=1 \
    --fsync=1 \
    --iodepth=1 \
    --ioengine=posixaio \
    --numjobs=1 \
    --runtime=60 \
    --rw=randwrite \
    --size=100M \
    --time_based

This I/O workload simulates a worst-case scenario for a database installation, as random writes tend to be slower than sequential ones. One job is configured to run for 60s, writing to a 100MB file repeatedly until the time limit is reached. It also performs an fsync after every I/O operation as well as when the job ends. In this particular case, I set iodepth to 1, which means that only 1 unit of I/O is kept in flight against the file; this is essentially the desired queue size from the application side. The results are shown below.

random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.36
Starting 1 process
random-write: Laying out IO file (1 file / 100MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=996KiB/s][w=249 IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=1949: Thu Mar  6 13:29:09 2025
  write: IOPS=256, BW=1024KiB/s (1049kB/s)(60.0MiB/60003msec); 0 zone resets
    slat (usec): min=2, max=927, avg= 8.00, stdev=11.55
    clat (nsec): min=1178, max=4494.4k, avg=94785.53, stdev=160874.28
     lat (usec): min=28, max=4502, avg=102.79, stdev=161.66
    clat percentiles (usec):
     |  1.00th=[   35],  5.00th=[   38], 10.00th=[   39], 20.00th=[   41],
     | 30.00th=[   43], 40.00th=[   44], 50.00th=[   46], 60.00th=[   48],
     | 70.00th=[   55], 80.00th=[   67], 90.00th=[  109], 95.00th=[  498],
     | 99.00th=[  766], 99.50th=[  930], 99.90th=[ 1237], 99.95th=[ 1401],
     | 99.99th=[ 2933]
   bw (  KiB/s): min=  968, max= 1112, per=100.00%, avg=1025.03, stdev=30.00, samples=119
   iops        : min=  242, max=  278, avg=256.25, stdev= 7.50, samples=119
  lat (usec)   : 2=0.01%, 4=0.01%, 50=65.26%, 100=24.40%, 250=1.68%
  lat (usec)   : 500=3.64%, 750=3.92%, 1000=0.74%
  lat (msec)   : 2=0.33%, 4=0.01%, 10=0.01%
  fsync/fdatasync/sync_file_range:
    sync (usec): min=2371, max=15410, avg=3788.90, stdev=652.71
    sync percentiles (usec):
     |  1.00th=[ 2868],  5.00th=[ 2999], 10.00th=[ 3064], 20.00th=[ 3261],
     | 30.00th=[ 3425], 40.00th=[ 3556], 50.00th=[ 3687], 60.00th=[ 3851],
     | 70.00th=[ 4015], 80.00th=[ 4228], 90.00th=[ 4555], 95.00th=[ 4883],
     | 99.00th=[ 5800], 99.50th=[ 6521], 99.90th=[ 8356], 99.95th=[ 8586],
     | 99.99th=[10290]
  cpu          : usr=0.56%, sys=0.77%, ctx=30737, majf=0, minf=27
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,15362,0,15362 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1024KiB/s (1049kB/s), 1024KiB/s-1024KiB/s (1049kB/s-1049kB/s), io=60.0MiB (62.9MB), run=60003-60003msec

Disk stats (read/write):
  xvda: ios=0/49108, sectors=0/1388608, merge=0/31216, ticks=0/55327, in_queue=55327, util=83.08%

There is a lot of detail in this output, but the parts we’re going to focus on are the averages for lat (usec), iops, and sync (usec), which represent the application I/O latency, IOPS, and fsync latency, respectively [5]. We’ll make the simplifying assumption that the total latency is the sum of lat (usec) and sync (usec). One interesting observation is that the average fsync latency was measured at 3788.9 us (3.8 ms), which isn’t far from the 4.25 ms we measured using the Python script earlier. The average IOPS were also comparable at 256. Next, let’s see what happened as iodepth was swept from 1 up to 80 for the 1600GB gp2 volume.
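A sweep like this can be driven with a small wrapper around fio. The sketch below is illustrative rather than the exact harness used for these charts: the JSON field names (jobs[0]["write"]["iops"], the lat_ns averages, and the sync block) are based on my reading of fio 3.x output and may vary between versions, and the list of queue depths is just an example spread between 1 and 80.

import json
import subprocess

def run_fio(iodepth):
    # Same workload as above, but with JSON output so the stats can be parsed
    cmd = [
        "fio",
        "--name=random-write",
        "--bs=4k",
        "--end_fsync=1",
        "--fsync=1",
        f"--iodepth={iodepth}",
        "--ioengine=posixaio",
        "--numjobs=1",
        "--runtime=60",
        "--rw=randwrite",
        "--size=100M",
        "--time_based",
        "--output-format=json",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    job = json.loads(result.stdout)["jobs"][0]
    lat_ms = job["write"]["lat_ns"]["mean"] / 1e6
    sync_ms = job.get("sync", {}).get("lat_ns", {}).get("mean", 0.0) / 1e6
    # Total latency = lat (application I/O) + sync (fsync), per the assumption above
    return iodepth, job["write"]["iops"], lat_ms + sync_ms

for depth in (1, 2, 4, 8, 16, 20, 40, 80):
    print(run_fio(depth))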

[Figure: 1600GB gp2 volume, average total latency and IOPS vs. queue depth]

The first thing to observe is that latency stayed relatively constant as iodepth went from 1 up to 20, while IOPS increased linearly as a function of queue depth. The relationship is captured by Little’s Law, which states that the queue size (L) equals the average arrival rate (λ) multiplied by the average time an item spends in the system (W).

  • Little's Law: L = λ * W

The average total latency was taken as the sum of the average lat (usec) and sync (usec) values. IOPS increased linearly with iodepth between 1 and 20 precisely because latency was roughly constant in that range. For a 1600GB gp2 volume we expect a capacity of 4800 IOPS, which is shown as a horizontal asymptote on the IOPS vs. queue depth chart. Another interesting observation is that total latency began to increase rapidly as the IOPS limit was reached; the added latency offset the added iodepth, so no further increase in IOPS was achieved.
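Little's Law also predicts roughly how much queue depth is needed to reach the 4800 IOPS baseline at the ~4 ms total latency seen in the flat region (a rough estimate, under the total-latency assumption above):

# L = lambda * W: in-flight I/Os needed to sustain the target rate
target_iops = 4800        # 1600GB gp2 baseline
avg_latency_s = 0.004     # ~4 ms total latency while the queue is unsaturated
required_depth = target_iops * avg_latency_s
print(required_depth)     # ~19, consistent with IOPS flattening out near iodepth 20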

Next, let’s look at the results from the 5500GB gp2 volume. In this case iodepth was increased from 1 up to 400. This volume size has a 16000 IOPS capacity, which shows up as the horizontal asymptote once iodepth increases past 160. The total latency increased from 3.89 ms up to 22.7 ms. Unlike the 1600GB gp2 volume, latency increased steadily across the whole range of iodepth values, suggesting that at higher IOPS levels the system cannot service each I/O unit immediately.

[Figure: 5500GB gp2 volume, average total latency and IOPS vs. queue depth]
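Little's Law gives a rough account of that steady latency growth: once the volume is pinned at its 16000 IOPS ceiling, each additional unit of queue depth has to show up as added wait time (W = L / λ). A quick check under that assumption:

# At saturation, W = L / lambda: latency is set by queue depth over the IOPS ceiling
max_iops = 16000                              # 5500GB gp2 baseline
for depth in (160, 400):
    print(depth, 1000 * depth / max_iops)     # 160 -> 10 ms, 400 -> 25 ms
# The 25 ms predicted at iodepth 400 is in the same ballpark as the 22.7 ms measured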

Finally, we can look at the 1600GB and 5500GB gp2 volume results together in terms of average total latency vs. IOPS. At roughly 4500 IOPS and below, both volumes showed stable latency of around 4 ms. The 1600GB volume quickly hit a wall at its 4800 IOPS limit, where latency increased without any further gain in write throughput. The 5500GB volume showed a latency curve whose slope increased as it approached the 16000 IOPS limit. This volume hit a wall as well: IOPS could not move much past 16000 without latency skyrocketing.

[Figure: average total latency vs. IOPS for the 1600GB and 5500GB gp2 volumes]

Conclusions

When evaluating the random write capabilities of a particular disk, the latency of a single fsync operation provides some insight into its speed; however, it is not sufficient to determine the overall IOPS capacity. We looked at test results from two sizes of AWS EBS gp2 volumes (1600GB and 5500GB) to better understand the relationship between IOPS and fsync latency. These results showed that while the fsync latency for a single I/O unit without a queue is around 4 ms, we could achieve up to 16000 IOPS on a 5500GB drive. This means the system is capable of some form of aggregation or parallelization of fsync operations, which if performed strictly serially would limit us to approximately 250 IOPS. So if you have a disk and you’re unsure of its IOPS capacity, take the time to benchmark it: a back-of-the-envelope calculation based on the inverse of the fsync latency can be an order of magnitude off.

Sources

  1. MySQL transactions per second vs fsyncs per second
  2. Amazon EBS volume types #1
  3. Amazon EBS volume types #2
  4. Fsync Performance on Storage Devices
  5. Fio Output Explained