The Linux file I/O path. Writing data to a SAN hosted filesystem from a user
The following describes the I/O path from the moment a user executes :w in vim on a Linux box to the
point the data is stored on a storage array in a storage area network (SAN). User space operations as well
as kernel and module/driver activity will be discussed.
The write command in vim (:w) writes the text buffer of the vim process to the file that was specified
when the vim session was started. User space applications accomplish this task with the Linux system
call ‘write’, which takes three arguments: a file descriptor, a pointer to the data buffer, and a byte count.
Linux ‘write’ system call:
ssize_t write(int fd, const void *buf, size_t count);
User applications access files as linear streams of bytes, which are read and written through a ‘file
descriptor’. A file descriptor is an integer that is returned by the Linux kernel when an application
executes the ‘open’ system call. The user application has no knowledge of the structure of the
underlying filesystem, the location of the file on the disk storage device, or the geometry of that device.
The file descriptor is simply an index into a per-process table; the entries in that table reference the
kernel data structures which map to the actual data on disk.
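The open/write/close sequence described above can be sketched in C as follows. This is only a minimal illustration, not what vim itself does internally; the helper name save_buffer and the 0644 file mode are assumptions made for the example.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write `count` bytes from `buf` to the file at `path`.
 * Returns 0 on success, -1 on error. */
int save_buffer(const char *path, const void *buf, size_t count)
{
    /* open() returns a small non-negative integer: the file descriptor. */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    size_t left = count;
    while (left > 0) {
        /* write() may accept fewer bytes than requested, so loop. */
        ssize_t n = write(fd, p, left);
        if (n < 0 && errno == EINTR)
            continue;               /* interrupted by a signal: retry */
        if (n <= 0) {
            close(fd);
            return -1;
        }
        p += n;
        left -= (size_t)n;
    }
    return close(fd);
}
```

A caller would use it as save_buffer("notes.txt", text, length). Nothing in the call refers to disks, blocks or filesystems; the descriptor is the application's only handle on the file.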
After the application invokes the ‘write’ system call with the file descriptor, a pointer to the data
buffer and the byte count, the kernel takes over. For an ordinary buffered write, the kernel copies the
data from the user space buffer into its own cache memory, so the application is free to reuse the
buffer as soon as the call returns.
At this point the kernel must map the file descriptor to the logical file location within the filesystem,
and then pass the data through the proper modules: filesystem, block, SCSI, hardware specific (in this
case fibre channel) and PCIe.
The first step in this path is to find the inode that holds information about this file and the filesystem
specific information. The kernel starts with the process table.
The process table entry for the vim session has entries for all of its open file descriptors. The file
descriptor entry points to an entry in the kernel’s open file table. The open file table is a list of all open
files within the system.
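The split between the per-process descriptor entries and the open file table can be observed from user space. In the sketch below (the function name is hypothetical), dup() creates a second descriptor that points at the same open file table entry, so a write through one descriptor moves the file offset seen through the other.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo: returns the file offset seen through a dup()'d
 * descriptor after 5 bytes were written through the original one,
 * or -1 on error. */
off_t offset_seen_by_duplicate(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* New slot in the process's descriptor table, same open file entry. */
    int fd2 = dup(fd);
    if (fd2 < 0) {
        close(fd);
        return -1;
    }

    off_t pos = -1;
    if (write(fd, "hello", 5) == 5)
        pos = lseek(fd2, 0, SEEK_CUR);  /* query the offset without moving it */

    close(fd2);
    close(fd);
    return pos;
}
```

The function returns 5 even though nothing was written through the second descriptor, because the offset lives in the shared open file table entry rather than in the descriptor itself.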
The kernel uses the entries in this table, along with the virtual filesystem (VFS) abstraction layer, to
find and read the dentries, which map out the path to the file and its parent directories, and the
in-memory cached inodes, which correspond to the inodes stored on the filesystem.
Filesystem specific, on-disk inodes contain metadata about the actual file on disk, including
user/group permissions, logical location on the filesystem, last access time, hard link count, etc.
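Much of this inode metadata is visible to user space through the stat() system call. The following sketch prints a few of the fields; the helper name and output layout are assumptions for illustration. Note that stat() does not expose the logical block locations, which stay internal to the filesystem module.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: print a few inode fields for `path`.
 * Returns 0 on success, -1 if stat() fails. */
int print_inode_info(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;

    printf("inode number : %llu\n", (unsigned long long)st.st_ino);
    printf("permissions  : %o\n", (unsigned)(st.st_mode & 07777));
    printf("owner uid/gid: %u/%u\n", (unsigned)st.st_uid, (unsigned)st.st_gid);
    printf("hard links   : %lu\n", (unsigned long)st.st_nlink);
    printf("size (bytes) : %lld\n", (long long)st.st_size);
    printf("last access  : %lld\n", (long long)st.st_atime);
    return 0;
}
```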
The inode for the file contains the ‘logical’ location of the file on the on-disk volume. Logical file
locations do not map directly to physical locations on the disk device. The filesystem module uses the
on-disk inode for the file and generates a write operation that is passed to the buffer cache.
At this point the write operation has passed into the write buffer cache. Once the write operation is
cached, the kernel can return to the user process, so that the user process does not block while waiting
for the data to physically get out to disk. These cached write operations are flushed to disk at a later
time, on a schedule that is controlled by the kernel.
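An application that cannot wait for the kernel's flush schedule can request the flush itself with fsync(). The sketch below uses a hypothetical helper name and, to stay short, treats a partial write as a failure.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write `count` bytes to `path`, then block until the
 * kernel reports that the file's cached data has been flushed to the device.
 * Returns 0 on success, -1 on error. */
int save_and_flush(const char *path, const void *buf, size_t count)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    int rc = -1;
    /* write() normally returns once the data is cached; fsync() then
     * forces the deferred flush to happen now. A short write is treated
     * as a failure to keep the sketch brief. */
    if (write(fd, buf, count) == (ssize_t)count && fsync(fd) == 0)
        rc = 0;

    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

Without the fsync() call the function would usually return sooner, but the data could still be sitting in the cache when it does.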
Once the logical location has been found within the filesystem layer, and the write operation has been
cached, the SCSI disk/block layer takes over. This layer maps the logical file location found in the
file’s inode to block addresses on the disk device. It also creates the SCSI commands that will be
interpreted by the target disk device. The disk write operation now has the SCSI commands and disk
block addresses to pass to the device specific (in this case fibre channel) driver.
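As a rough illustration of this step, the sketch below converts a filesystem block number into a device block address and packs it into a SCSI WRITE(10) command descriptor block (CDB). The helper names, the partition offset and the block sizes are assumptions chosen for the example. This is not the kernel's code, and the real sd driver chooses between several WRITE command variants depending on the device and the request.

```c
#include <stdint.h>
#include <string.h>

#define WRITE_10_OPCODE 0x2A   /* SCSI WRITE(10) operation code */

/* Hypothetical helper: convert a filesystem block number into a device
 * logical block address, given where the partition starts on the device. */
uint64_t fs_block_to_lba(uint64_t part_start_lba, uint64_t fs_block,
                         uint32_t fs_block_size, uint32_t sector_size)
{
    return part_start_lba + fs_block * (fs_block_size / sector_size);
}

/* Hypothetical helper: fill a 10-byte WRITE(10) CDB for `nblocks` logical
 * blocks starting at logical block address `lba`. */
void build_write10_cdb(uint8_t cdb[10], uint32_t lba, uint16_t nblocks)
{
    memset(cdb, 0, 10);
    cdb[0] = WRITE_10_OPCODE;
    cdb[2] = (uint8_t)(lba >> 24);     /* LBA, big-endian, bytes 2..5 */
    cdb[3] = (uint8_t)(lba >> 16);
    cdb[4] = (uint8_t)(lba >> 8);
    cdb[5] = (uint8_t)lba;
    cdb[7] = (uint8_t)(nblocks >> 8);  /* transfer length in blocks, bytes 7..8 */
    cdb[8] = (uint8_t)nblocks;
}
```

For example, with an assumed partition starting at sector 2048, 4096-byte filesystem blocks and 512-byte sectors, filesystem block 100 lands at device block 2848, and one filesystem block becomes a WRITE(10) of 8 device blocks.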
The fibre channel device driver accepts the write requests from the sd/block drivers and communicates
with the FC HBA via the PCIe driver.
The PCIe driver sends commands to the FC HBA and instructs it to perform the DMA (Direct Memory
Access) to read the data directly from RAM, bypassing the CPU.
At this point the FC HBA sends the data via FC frames, routing through any FC switches, to the target
storage device.
The target storage device accepts the FC frames containing the SCSI commands and data and, after
working through any device specific cache and RAID schemes, sends the data to the individual disk
drives. Each individual disk drive translates the block addresses to its internal LBA mapping and
commits the data to disk.
Summary of data flow:
System call API (write)
Linux VFS layer (process table, open file table, dentry/generic inode)
Filesystem specific module
Write buffer cache
Hardware device drivers
Physical HBA/PCIe path
Storage device, including vendor specific caching/RAID/Infinidat magic