XFS is Silicon Graphics' next-generation filesystem and volume manager for IRIX systems. XFS represents a leap into the future of filesystem management, providing a 64-bit journalled filesystem that can handle large files quickly and reliably. In addition, XFS provides the unique feature of guaranteed rate I/O that leverages Silicon Graphics' intimate knowledge of media serving environments. As a next generation product, XFS was designed as a new fully integrated product, rather than a modification of an existing product. Thus, important filesystem features such as journalling were seamlessly integrated into XFS, instead of being awkwardly added to an existing filesystem.
This paper describes the superior features of the XFS filesystem and its
underlying architecture. Installation and administration of XFS are described
only briefly; for more detail, please see the XFS Administration Guide,
"Getting Started With XFS Filesystem".
XFS provides a full 64-bit filesystem capable of scaling easily to handle extremely large files and filesystems that can grow to 1 terabyte (up to 9 million terabytes in successor releases). XFS's major features include:
Additionally, systems can use XFS filesystems exclusively or have a mixture of
XFS and EFS filesystems.
The XFS filesystem is a "journalled" filesystem. This means that updates to
filesystem metadata (inodes, directories, bitmaps, etc.) are written to a serial
log area on disk before the original disk blocks are updated in place. In the
event of a crash, such operations can be redone using data present in the log,
to restore the filesystem to a consistent state.
XFS includes a fully integrated volume manager, XLV, which creates an abstraction
of a sequence of logical disk blocks, called a volume, to be used by the XFS
filesystem. The sequence of disk blocks (a volume) can be assembled by
concatenating, plexing (mirroring), and striping across a number of physical
disk drives or RAID devices. XLV ensures that plexed data is kept
consistent across all plexes; XLV is discussed in detail later.
XFS is a true next generation filesystem, not simply a rewrite or port of
existing technology. Most filesystems on the market are based upon old
filesystem architectures such as System V or BSD. XFS was "built from
scratch". This allowed SGI to integrate into XFS key features such as
journalling for high reliability and redesign important areas such as the
allocation algorithms for increased performance with large filesystems.
Thus, XFS is the first new filesystem built for the demanding large filesystem
needs of the 1990's:
As indicated above, most filesystems on the market evolved in an earlier age
of smaller filesystems, lower processing power, and limited storage capacity.
Many current filesystems are based on architectures that emphasize conserving
storage space at the expense of performance. The revolutionary growth of CPU and
storage technology has fueled a dramatic increase in the size and complexity of
filesystems, outstripping the capabilities of many filesystems. SGI designed
XFS to meet this increasing need to manage large complex filesystems with high
performance and reliability. Thus, completely different design targets were
established for XFS from previous or current filesystems:
The cutting-edge filesystem technology of XFS opens up new opportunities in
the marketplace by enabling users to manage large amounts of data with
unsurpassed speed. This is particularly important for file, database,
and compute server applications. Additionally, new programmatic
features and interfaces such as guaranteed rate I/O provide unique advantages to
users. In particular, the fast, guaranteed response time of XFS is excellent for
digital media servers and other real-time applications. Finally, the
growing necessity for business-critical reliability is driving corporations
toward robust journalled filesystems such as XFS. Yet even with all this power
and these added features, XFS remains fully upward compatible with EFS and
UNIX standards.
XFS runs on all supported SGI machines which run IRIX 5.3 or higher, except
IP4 and IP6. The filesystem is implemented under the Virtual Filesystem Switch
which is extended from prior releases. This allows XFS to support the mixing of
EFS and XFS filesystems on the same system. Hence, the administrator could
easily move the more frequently used local filesystems to XFS and leave others
as EFS for later conversion as needed (see section 6.2 for converting EFS
filesystems to XFS filesystems).
XFS provides all the functions available in the current EFS filesystem at a
superior performance level. This includes performance features such as
asynchronous I/O, direct I/O, and synchronous I/O, in addition to normal
(buffered) I/O.
The XFS volume manager, XLV, performs all the functions of the older lv volume
manager and can be run on the same system as lv, but not on the same volume.
This allows a gradual changeover to the new filesystem and volume manager.
Converting from lv logical volumes to XLV logical volumes is easy: the programs
lv_to_xlv and xlv_make convert lv logical volumes to XLV without having to dump
and restore data. Additionally, EFS and non-filesystem applications can run on
top of the new volume manager; ordinary driver interfaces are presented to
these clients.
XFS is available now in the following configurations:
XFS may be added to systems which are running IRIX 5.3 and 6.0.1 by upgrading
to IRIX 5.3 with XFS or IRIX 6.1 (XFS included). XFS is a standard integrated
part of IRIX 6.1 and beyond. Note: As discussed in section 7.1.1, IRIX 6.2 with
XFS included is scheduled for release soon.
XFS is marketed in the following packages:
3.1.1 Room to Grow
The XFS filesystem supports files of up to 2^63-1 bytes
(9,223,372,036,854,775,807 bytes, roughly 9 million terabytes) and filesystems
of up to 2^64 bytes. In IRIX 5.3, files and filesystems are restricted to
2^40-1 bytes (1,099,511,627,775 bytes, or 1 terabyte). IRIX 6.1 supports full
64-bit files, with filesystems still subject to the 1 terabyte restriction.
Thus with XFS, filesystems are limited more by the user's ability to attach
disks to the hardware than by the filesystem's capacity. Disk drive capacities
are growing approximately 1.7 times per year, so this added filesystem capacity
will prove invaluable now and in the future.
As one of the only filesystems on the market designed specifically to scale to
the full 64-bit address space, XFS is leading the way into the future of large
filesystem management.
3.1.2 64-bit Compatibility
Both 32-bit and 64-bit interfaces are provided where the underlying OS and
hardware support them. NFS version 3, which is a 64-bit protocol, allows the
export of large files and filesystems. NFS version 3 is an available option
for IRIX 5.3 and higher. Systems that support only NFS version 2 may access
files on XFS filesystems subject to the 32-bit file size limit. XFS
filesystems accessed over NFS version 2 may be greater than 2 gigabytes; all
file accesses will operate correctly, but disk free space reports will be
incorrect. Additionally, the caching file system (CFS) uses NFS protocols, and
allows efficient remote access to XFS filesystems.
32-bit applications may access 64-bit files through two methods. First,
applications may use the extended system calls lseek64(), stat64(), etc. These
system calls enable the user to write 32-bit programs that can track 64-bit
file position and file size. Second, the new compilers from SGI allow building
applications against the N32 interfaces without changing any source code.
These compilers are available for IRIX 6.1 and above.
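For illustration, here is a minimal sketch of the first method, using the
64-bit variants of the standard calls (the file name and build environment are
assumptions, not taken from this paper):

    /* A 32-bit program inspecting and seeking within a file that may exceed
     * 2 GB, using the 64-bit interfaces named above. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        struct stat64 sb;
        off64_t end;
        int fd;

        if (argc != 2) {
            fprintf(stderr, "usage: %s file\n", argv[0]);
            return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0 || fstat64(fd, &sb) < 0) {
            perror(argv[1]);
            return 1;
        }

        /* 64-bit file size, even though the program itself is 32-bit. */
        printf("size: %lld bytes\n", (long long)sb.st_size);

        /* Seek to the last byte; this offset would overflow a 32-bit off_t. */
        end = lseek64(fd, (off64_t)(sb.st_size - 1), SEEK_SET);
        printf("now at offset %lld\n", (long long)end);

        close(fd);
        return 0;
    }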
Note: Many UNIX programs will work without modification because sequential
reads and writes will operate correctly even on files greater than 2 GB. The
semantics of 32-bit applications operating naively on files longer than 2^32
bytes are defined in the paper "64 Bit File Access".
3.2 Filesystem Management
3.2.1 Basics
The XFS space manager efficiently allocates disk space within a filesystem. It
is responsible for mapping a file (a sequence of bytes) onto sequences of disk
blocks. The internal structures of the filesystem -- allocation groups, inodes,
and free space management -- are the fundamental items controlled by the space
manager. The namespace manager handles allocation of directory files, normally
placing them close to the files in the directory for increased seek performance.
The XFS space manager and namespace manager use sophisticated B-Tree indexing
technology to represent file location information contained inside directory
files and to represent the structure of the files themselves (location of
information in a file). This significantly increases the speed of accessing
information in files, especially with large files and filesystems. In the case
of large filesystems, traditional filesystems linearly search the file location
information in directory files; this information often spans multiple blocks
and/or extents (a collection of blocks). XFS's B-Tree technology enables it to
go directly to the blocks and/or extents containing a file's location using
sophisticated indices. With large files, the B-Tree indices efficiently map the
location of the extents containing the file's data. This avoids the slower
multi-level indirect schemes (of other filesystems) which often require
multiple block reads to find the desired information.
3.2.2 Structure
The space manager divides each filesystem data sub-volume (or partition) into a
number of allocation groups at mkfs time.
Each allocation group has a collection of inodes and data blocks, and data
structures to control their allocation. These allocation groups help divide
the space management problem into easy-to-manage pieces, speeding up file
creation.
The inode size is specified by the administrator at mkfs time and defaults to
256 bytes. However, the location and number of inodes are allocated as needed
by XFS, unlike most filesystems, which require static creation of all inodes
at mkfs time. This dynamic inode allocation permits greater performance (fewer
seeks) since inodes can be positioned closer to the files and directories that
use them.
Free blocks in an allocation group are tracked using a pair of B-Trees instead
of a bitmap. This provides better scalability to large systems. Block
allocation and free-block operations can be performed in parallel when they
occur in different allocation groups.
The real-time sub-volume is divided into a number of fixed-size extents
(a collection of blocks) to facilitate uniform read/writes during
guaranteed-rate I/O operations. The size is chosen at mkfs time and is
expected to be large (on the order of 1MByte). The real-time subvolume
allocation uses a bitmap to assure predictable performance; the access time for
B-tree indices varies.
3.2.3 Handles Contiguous Data
The space manager optimizes the layout of blocks in a file to avoid seeking
during sequential processing. It also keeps related files (those in the same
directory) close to each other on disk. More importantly, XFS uses large extents
(groups of blocks) to form contiguous regions in a single file. Current
block-based filesystems map a file to blocks of fixed length. Extent-based
filesystems map a file to extents, not just single blocks. Extents are easy
to represent, minimize fragmentation, and reduce the number of pointers
required to represent a large file. Thus, data is kept in large contiguous
extents that
can be extracted from a disk in a single I/O operation - significantly reducing
I/O time.
Filesystem extents are variable in size and configured by XFS as needed. XFS
supports 512 bytes to 1 gigabyte per extent and optimizes the size for higher
read performance. This is in contrast to other filesystems which provide only
fixed extent sizes. Note: The XFS allowable extent size is considerably larger
than the previous limit of 128K bytes for EFS.
The filesystem block size is set at filesystem creation time using mkfs.
XFS supports multiple block sizes, ranging from 512 bytes (disk sector size) up
to 64KB and up to 1GB for real-time data. The filesystem block size is the
minimum unit of allocation for user data in the filesystem. As a file is
written by an application, space is reserved but blocks are not immediately
allocated. This ensures that space is available, but gives XFS more
flexibility in allocating blocks.
XFS delays allocation of user data blocks when possible to make blocks more
contiguous, holding the data in the buffer cache in the meantime. This allows
XFS to make extents
large without requiring the user to specify extent size, and without requiring
a filesystem reorganizer to fix the extent sizes after the fact. This also
reduces the number of writes to disk and extents used for a file.
3.2.4 Efficient Filesystem
As implied earlier, XFS uses sophisticated filesystem management techniques
such as extents and B-Tree indices to efficiently support:
Very large files (64-bit size)
There is little or no performance penalty for accessing blocks in different areas of a large file. XFS creates large extents to keep file data close together for faster reading and writing. Using B-Tree indexing technology, XFS sets aside small areas on the disk for indices to the location of data in large files; generally these indices point to the extents containing the data. The use of B-Tree indices increases performance by avoiding the slower multi-level indirect searches of other filesystems' data structures, especially when accessing blocks at the end of large files.
Very small files
Most symbolic links and directory files are small files. XFS allows these files to be stored in inodes for increased performance. XFS also uses delayed writes to gather an entire small file in the buffer cache before writing it to disk. This reduces the number of writes to disk and the number of extents used for a file.
Files with few extents
These files are represented by extent lists to increase performance. Extent
lists are simple linear lists of the files' structure which avoid the
sophisticated B-Tree operations necessary for large files.
Sparse files
Sparse files are files that contain arbitrary "holes": areas of the file that
have never been written and that read back as zeroes. XFS supports holes
(and avoids wasting space on them) by using indexing (B-Trees) and extents.
A short sketch of creating and memory-mapping a sparse file follows this list.
Mapped files
Memory mapping of files is supported by XFS. It allows a program to "attach"
a file so that the file appears to be part of the program's address space,
and lets XFS worry about managing the disk I/O.
Large directories
Large directories are indexed in directory files using B-Trees to expedite
searches, insertions, and deletions. Operations on directories containing
millions of files are almost as fast as on directories containing only hundreds
of files.
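For illustration, here is a short sketch combining two of the cases above:
creating a sparse file by writing past end-of-file, then memory-mapping it
and reading zeroes back from the hole. The file name is arbitrary and the code
is a generic UNIX sketch rather than anything XFS-specific:

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/mman.h>

    int main(void)
    {
        const off_t hole = 1024 * 1024;   /* 1 MB hole before the data */
        char *p;
        int fd = open("sparse.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
            return 1;

        /* Writing after a long seek leaves a hole; no blocks are allocated
         * for the unwritten region. */
        lseek(fd, hole, SEEK_SET);
        write(fd, "end", 3);

        /* Map the whole file; the hole reads back as zeroes. */
        p = mmap(NULL, hole + 3, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            return 1;
        printf("byte 0 = %d, last bytes = %.3s\n", p[0], p + hole);

        munmap(p, hole + 3);
        close(fd);
        return 0;
    }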
3.2.5 Attribute Management
XFS supports user defined Arbitrary Attributes which allow information about a
file to be stored outside of the file. For example, a graphic image could have
an associated description or a text document could have an attribute showing
the language in which the document is written. An unlimited number of
attributes can be associated with a file or directory. They can have any name,
type or size of value. These Attributes represent the first substantial
enhancement to the UNIX file interface in 10 years.
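As an illustration of how such attributes might be used from a program, the
sketch below assumes the attr_set()/attr_get() library interfaces associated
with XFS attributes; the header and exact signatures are an assumption here,
not something this paper specifies:

    /* Attach a "language" attribute to a document and read it back.
     * attr_set()/attr_get() and <sys/attributes.h> are assumed interfaces. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/attributes.h>

    int main(void)
    {
        const char *path = "report.doc";      /* hypothetical file */
        char value[64];
        int  length = sizeof(value);

        if (attr_set(path, "language", "English", strlen("English"), 0) != 0) {
            perror("attr_set");
            return 1;
        }

        if (attr_get(path, "language", value, &length, 0) != 0) {
            perror("attr_get");
            return 1;
        }
        printf("language = %.*s\n", length, value);
        return 0;
    }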
3.2.6 Online Filesystem Reconfiguration
An active filesystem may be extended (enlarged) by adding more space to the
underlying volume. This operation is supported on-line by the space manager,
which receives a request to expand the filesystem and updates on-disk and
in-memory structures to implement it.
3.3 Fast, Scalable Filesystem Performance
XFS is a very high-performance filesystem with support for contiguous data
(reduced I/Os) and for both direct (non-buffered) and asynchronous I/O. SGI
has demonstrated I/O through the filesystem in excess of 500 MBytes/second.
Customers have demonstrated and documented high I/O bandwidth as well.
For example, Tom Ruwart, Director of the Army High Performance Computing Lab at
the University of Minnesota, has successfully tested an early "alpha" version
of XFS on a large 377 gigabyte filesystem. Using 24 SCSI channels and 24
Ciprico RAIDs, Tom measured direct I/O throughput of:
Furthermore, XFS is designed to scale in performance to match the new
CHALLENGE MP architecture and beyond. In traditional filesystems, files,
directories, and filesystems have reduced performance as they grow in size;
with XFS there is no performance penalty.
3.4 Journalled Filesystem
3.4.1 Journalling
As mentioned earlier, XFS utilizes database journalling (logging) technology.
Updates to filesystem metadata are written to a separate serial log area before
the original disk blocks are updated. Journalling promotes high availability by
allowing systems to quickly recover from failures while maintaining the
disk-based file structure in a consistent state at all times.
XFS is the first filesystem to integrate journal technology into a new
filesystem rather than add a section of journal code to an existing filesystem.
This integration allows XFS to be a more robust and faster filesystem.
XFS uses a circular log which typically occupies 1000-2000 filesystem blocks.
During normal system operation the log is only written, never read. Old
information is dropped from the log as its usefulness ends. These log
buffer writes are mostly performed asynchronously so as not to force user
applications to wait for them.
Approximately one transaction occurs per filesystem update operation. Batched
transactions enable XFS to make metadata updates faster than EFS, and the
atomic, multi-block updates enabled by transactions allow XFS to cleanly
update its complex metadata structures.
3.4.2 Reliable Recovery
In the event of a crash, operations performed prior to the crash can be redone
using data present in the log to restore the filesystem structure to a
consistent state. This is done in the kernel at filesystem mount time. XFS
performs a binary search of the log for transactions to replay. This eliminates
the need to perform a slow total UNIX filesystem check (fsck) after a
system crash. Also, when fsck finds inconsistent data structures it must
throw away anything suspicious. XFS knows what was happening at the time of
failure, so it never needs to throw anything away; it simply finishes what it
started. Thus, XFS's journalled recovery provides higher filesystem integrity
than does standard UNIX.
To ensure consistency, some log writes must be synchronous; ordinary
(buffered) user data writes are not. Applications that require synchronous
user data writes can use direct I/O, synchronous I/O, or fsync().
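A minimal sketch of the fsync() path (the file name is arbitrary):

    /* Write a record, then force the user data to stable storage before
     * continuing.  Opening with O_SYNC instead would make every write
     * synchronous. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char *msg = "committed record\n";
        int fd = open("records.log", O_WRONLY | O_CREAT | O_APPEND, 0644);

        if (fd < 0)
            return 1;

        write(fd, msg, strlen(msg));

        if (fsync(fd) != 0)          /* data is on disk once this returns */
            return 1;

        close(fd);
        return 0;
    }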
3.4.3 Fast Recovery
The XFS log-based recovery mechanism will recover a filesystem within a few
seconds. The XFS recovery mechanism does not need to scan all inodes or
directories to ensure consistency. Also, journalling makes the XFS recovery
time independent of the filesystem size. Thus, the recovery time depends only
upon the level of activity in the filesystem at the time of the failure.
3.5 Volume Management
3.5.1 Basics
The xlv volume manager (XLV) is an integral part of the XFS
filesystem(1). The volume manager
provides an operational interface to the system's disks and isolates the higher
layers of the filesystem and applications from the details of the hardware.
Essentially, higher-level software "sees" the logical volumes created
by XLV exactly like disks. Yet, a logical volume is a faster, more reliable
"disk" made from many physical disks providing important features such as the
following (discussed in detail later):
The use of volumes enables XFS to create filesystems or raw devices that
span more than one disk partition. These volumes behave like regular disk
partitions and appear as block and character devices in the /dev
directory. Filesystems, databases, and other applications access the volumes
rather than the partitions. Each volume can be used as a single filesystem or
as a raw partition. A logical volume might include partitions from several
physical disks and, thus, be larger than any of the physical disks. Filesystems
built on these volumes can be created, mounted, and used in the normal way.
The volume manager stores all configuration data in the disk's labels. These
labels are stored on each disk and will be replicated so that a logical volume
can be assembled even if some pieces are missing. There is a negligible
performance penalty for using XLV compared to accessing the disk directly,
although plexing (mirroring data) will mildly degrade write performance.
3.5.2 Sub-volumes
Within each logical volume, the volume manager implements sub-volumes, which
are separate linear address spaces of disk blocks in which the filesystem
stores its data. For EFS filesystems, a volume consists of just one sub-volume.
For XFS filesystems, a volume consists of a data sub-volume, an optional log
sub-volume, and an optional real-time sub-volume:
Data sub-volume
The data sub-volume contains user files and filesystem metadata (inodes,
directories, and free space blocks). It is required in all logical volumes
containing XFS filesystems and is the only sub-volume present in the EFS
filesystems.
Log sub-volume
The log sub-volume contains XFS journalling information. It is a log of
filesystem transactions and is used to expedite system recovery after a crash.
Real-time sub-volume
Real-time sub-volumes are generally used for data applications such as video
where guaranteed response time is paramount.
Sub-volumes facilitate separation of different data types. For example, user
data could be prevented from overwriting filesystem log data. Sub-volumes also
enable filesystem data and user data to be configured to meet goals for
performance and reliability by putting sub-volumes on
different disk drives - particularly useful for separating out real-time data
for guaranteed rate I/O operations. Each sub-volume can also be optimally sized
and organized independently. For example, the log sub-volume can be plexed
(mirrored) for fault tolerance and the real-time sub-volume can be striped
across a large number of disks to give maximum throughput for video playback.
Each sub-volume is made of partitions (real, physical regions of disk
blocks) composed by concatenation, plexing (mirroring), and striping. The
volume manager is responsible for translating logical addresses in the linear
address spaces into real disk addresses from the partitions. Where there are
multiple copies of a logical block (plexing), the volume manager writes
simultaneously to all copies, and reads from any copy (since all
copies are identical). The volume manager maintains the equality of plexes
across crashes and both temporary and permanent disk failures. Single block
failures in the plexed volumes are masked by the volume manager performing
retries and rewrites.
3.5.3 Plex and Volume Elements
A volume used for filesystem operations is usually composed of at least two
sub-volumes, one for the log and one for data. Each sub-volume can consist of a
number of plexes (mirrored data). Plexes are individually organized, but are
mapped to the same portion of the sub-volume's address space. Plexes may be
added or detached while the volume is active. The root filesystem may be plexed.
A plex consists of 1 to 128 volume elements, each of which maps a portion of
the plex's address space (a physical data location on disk); the volume
elements are concatenated together. Each volume element can be striped across
a number of disk partitions. XFS allows online growth of a filesystem
using xfs_growfs.
3.5.4 Volume Manipulation
XLV provides the following services transparent to the filesystems and
applications that access the volumes:
General volume manipulation
You can create volumes, delete them, and move them to another system.
Auto-assembly of logical volumes
The volume manager will assemble logical volumes by scanning the hardware on
the system and reading all the disk labels at system boot time.
Plexing for higher system and data reliability
A sub-volume can contain one to four plexes (mirrors) of the data. Each plex
contains a portion or all of the sub-volume's data. Creating a volume with
multiple plexes increases system reliability.
Disk striping for higher I/O performance
Striped volume elements consist of two or more disk partitions, organized so
that an amount of data called the stripe unit is written to each disk partition
before writing the next stripe unit-worth of data to the next partition. This
provides a performance advantage on large systems by allowing parallel I/O
activity from the disk drives for large I/O operations.
Concatenation to build large filesystems
You can build arbitrarily large volumes, up to the 64-bit limit, by
concatenating (joining) volume elements together (a maximum of 128 volume
elements). This is useful for creating a filesystem that is larger than the
size of a single disk.
XLV volumes may store XFS filesystems, EFS filesystems, or be used as raw
partitions (databases).
3.5.5 I/O to a Volume
Each volume has a pair of device nodes (just like disks). I/O to XLV devices
uses a kernel driver which dispatches the I/O to the right set of disks.
Plexing is more complicated:
XLV supports RAID devices. Thus, XLV is capable of handling large data
transfers generated from the filesystem code, through the volume manager, down
to the RAID driver.
3.5.6 Online Administration
XLV allows online volume reconfigurations, such as increasing the size of a
volume or adding/removing a piece of a volume. This reduces system downtime.
3.6 Guaranteed Rate I/O
3.6.1 A Unique Feature
The XFS guaranteed rate I/O system (GRIO) is a unique feature of XFS not found
in other filesystems. GRIO allows applications to reserve specific bandwidth to
or from the filesystem. XFS will calculate the performance available and
guarantee that the requested level of performance is met for a specified time.
This frees the programmer from having to predict the performance, which can be
complex and variable on flexible systems such as the CHALLENGE systems. This
functionality is critical for full rate, high resolution media delivery systems
such as video-on-demand or satellite systems that need to process information
at a certain rate as the satellite passes overhead.
While it is possible to obtain proprietary guaranteed rate devices, no other
system integrates this feature into the filesystem. This integration yields
significant performance benefits. For example, in XFS the real-time data can be
separated from the regular data by XLV. This allows the administrator to
physically locate the real-time data separate from the metadata and regular
data for faster processing. Thus, dedicated storage devices may be reconfigured
for higher performance at the expense of reliability. Moreover, the GRIO
subsystem supports all storage devices (with some configuration), instead of
locking the user to proprietary storage devices.
3.6.2 Hard vs. Soft Guarantee
Guarantees can be hard or soft, depending upon the trade-off between
reliability and performance. Hard guarantees place greater restrictions on the
system hardware configuration. They guarantee to deliver the requested
performance, but with some possibility of error in the data (due to the need to
turn off disk drive self-diagnostics and error-correction firmware). Hard
guarantees are only possible if all the drives are on one SCSI bus and XFS
knows and "trusts" all the devices on that bus (such as using all
disk drives instead of unpredictable tape drives). Otherwise, XFS will allow a
soft guarantee which allows the disk drive to retry operations in the event of
an error, but this can possibly result in missing the rate guarantee.
3.6.3 Guarantee Mechanism
Applications request guarantees by providing a file descriptor, data rate,
duration, and start time. The filesystem calculates the performance available
and, if the bandwidth is available, guarantees that the requested level of
performance can be met for the given period of time. To make a bandwidth
reservation, a user issues a grio_request call on a file. All real-time
data accesses are made with standard read and write system calls.
Guaranteed rate I/O does not impact the buffer cache, because programs which
utilize this mechanism are required to use direct I/O - avoiding the buffer
cache. Real-time data may also be accessed in a non-realtime way using
only direct I/O calls without GRIO.
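The sketch below shows only the overall flow described above. The argument
list for grio_request is hypothetical (this paper names the call but not its
prototype), and the file path, rate, and buffer sizes are made up for
illustration:

    /* Flow sketch: reserve bandwidth on a real-time file, then stream it
     * with ordinary read() calls using direct I/O.  The grio_request()
     * arguments shown are hypothetical. */
    #include <fcntl.h>
    #include <malloc.h>
    #include <unistd.h>

    #define RATE  (4 * 1024 * 1024)          /* 4 MB/s, for illustration */
    #define CHUNK (512 * 1024)               /* read size per iteration  */

    int main(void)
    {
        char *buf;
        int fd = open("/rt/video.clip", O_RDONLY | O_DIRECT);

        if (fd < 0)
            return 1;

        /* Hypothetical reservation: file descriptor, rate, duration in
         * seconds, start time (0 = now).
         *
         *     grio_request(fd, RATE, 60, 0);
         */

        /* Accesses are standard read() calls; buffers must follow the
         * direct I/O alignment rules (see the buffer cache discussion). */
        buf = memalign(4096, CHUNK);
        while (read(fd, buf, CHUNK) > 0) {
            /* deliver one chunk of the media stream here */
        }

        free(buf);
        close(fd);
        return 0;
    }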
The knowledge of the available bandwidth for reservation resides in a
user-level reservation scheduling daemon, ggd. The daemon has knowledge of the
characteristics and configuration of the disks and volumes on the system
(including backplane and SCSI bus throughput), and it tracks both current and
future bandwidth reservations.
By default, IRIX supports four GRIO streams (concurrent uses of GRIO). The
number of streams can be increased to 40 by purchasing the High Performance
Guaranteed-Rate I/O-5-40 option, or beyond 40 with the Unlimited Streams
option.
Note: Disk drivers have been modified to recognize guaranteed rate requests and
to schedule them in a real time manner. The disk and volume drivers also export
an interface for acquiring their response time and bandwidth characteristics
for use by the reservation scheduling module.
3.7 Data Management Interface
3.7.1 New Standard
In 1993, a group of computer system and data storage system vendors established
the Data Management Interface Group (DMIG) to establish a standard filesystem
interface for Hierarchical Storage Management (HSM) systems. This interface is
referred to as the Data Management Application Programming Interface (DMAPI).
Silicon Graphics
is committed to the group's goal of simplifying and standardizing an HSM
interface to filesystems. The DMIG also includes other companies such as IBM,
Sun, Epoch, EMASS, Hitachi, Veritas, etc.
The DMIG has produced a draft of the standard (Version 2.1 dated March 1995)
whose changes will be tracked as the DMAPI is implemented and used in the
marketplace. XFS has already implemented much of the DMAPI interface while most
other major vendors have lagged on their implementation. Two companies, EMASS
and Hitachi, have released HSM products that use the DMAPI. Many other HSM
vendors have shown interest in the DMAPI.
3.7.2 Modified XFS Dump
Silicon Graphics has modified the XFS backup interface "xfsdump" to
work more efficiently with the HSM interfaces. XFS's xfsdump uses DMAPI
interfaces to understand the location and structure of the files in the
filesystem associated with any HSM products. This provides increased efficiency
and the ability to work with an HSM instead of fighting it.
3.8 Expanded Dump/Restore Capabilities
3.8.1 Basics
XFS's dump and restore programs have been written to support the
next-generation features of XFS. Thus, the new dump and restore efficiently
support large filesystems with up to 4 billion files, as well as all other
features discussed in this paper, such as sparse files.
SGI modelled XFS's dump/restore after the BSD UNIX dump/restore as indicated
below:
XFS's new dump/restore is bundled with XFS and works well with other advanced
3rd party solutions such as Networker. Also, as mentioned above, XFS's dump
uses the DMAPI to rapidly read the filesystem without understanding details of
XFS's internal structure, resulting in a far less complex program.
3.8.2 Features
Unlike traditional filesystems, which must be made inactive and then
dismounted to guarantee a consistent dump image, an XFS filesystem can be
dumped while it is being used. Furthermore, XFS dumps/restores are resumable.
This means that dump/restore can resume where it left off after a temporary
termination, instead of going back to the beginning of the backup process.
XFS uses a log mechanism to record where the dump/restore temporarily stopped
and proceeds from there.
XFS supports two types of dumps: level and subtree. A level dump captures
either the entire filesystem or, for incremental levels, just the changes to
the filesystem since a previous dump. XFS can also perform "subtree" dumps by
file name. With both types of dumps, the user does not have to perform a
complete filesystem dump to back up data.
The online inventory of dump/restore actions contains much more information
than the old /etc/dumpdates file. The new online inventory
directory /var/xfsdump provides an extensive review of the dump history
displayable on screen. This information will help administrators to quickly
restore filesystems.
3.8.3 Media Handling
XFS dump is designed to use the "end of tape" indication to determine
when a new tape needs to be started. Thus, administrators with large
filesystems do not need to worry about catastrophic "end-of-tape" conditions
while dumping data to tapes. In XFS, a dump may span multiple tapes with
multiple dumps per tape. This frees the operator from struggling to guess the
amount of tape required.
In addition, media error handling has been improved:
XFS dump provides an "on-demand" progress report of all media
handling operations. Additionally, XFS restore can restore from tapes in any
order, independent of how the filesystem was dumped.
3.8.4 High Performance
XFS dump/restore provides high performance by:
3.8.5 Administration
For backup and restore of files less than 2 GB in size, the standard IRIX
utilities Backup, bru, cpio, Restore, and tar may be used. To dump XFS filesystems, the new utility xfsdump must be used instead of dump. Restoring from these dumps is done using xfsrestore. Xfsrestore also allows the backup media to be on a remote host.
3.8.6 Use EFS or XFS to Dump/Restore?
Which dump to use is simple:
-Use dump to dump EFS filesystems
-Use xfsdump to dump XFS filesystems
Which restore to use is also simple:
-Use restore to restore dumps made with dump
-Use xfsrestore to restore dumps made with xfsdump
Note: Restore or xfsrestore may be done to either type of
filesystem.
Although a powerful filesystem, XFS does possess some limitations worth
mentioning:
The following sections provide an overview of each component in the XFS
architecture: System Call and Vnode Interfaces, Lock Manager, NameSpace
Manager, Attribute Manager, Space Manager, Log Manager, Buffer Cache,
Guaranteed Rate I/O, Volume Manager, and Disk Drivers.
The filesystem related calls are implemented at the system call and vnode
interface: read, write, open, ioctl, etc., for all filesystem types. The
operations are then vectored out to different routines for each filesystem type
through the vnode interfaces.
The vnode interfaces also allow interoperation with remote clients such as NFS.
As indicated earlier, NFS Version 3.0 provides 64-bit file sharing capabilities
for XFS.
System call and vnode operations support Hierarchical Storage Management (HSM)
and backup applications. These were designed by an industry-wide working group
(DMIG, Data Management Interface Group).
5.2 Lock Manager
The XFS lock manager implements locking on user files, supporting standard UNIX
file locking calls such as fcntl and flock. The XFS lock manager
is similar to the EFS lock manager with comparable performance.
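A minimal sketch of standard record locking through this interface, using
fcntl() (the file name is arbitrary):

    /* Lock the first 4 KB of a file for exclusive writing, update it, then
     * release the lock. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        struct flock lk;
        int fd = open("shared.db", O_RDWR | O_CREAT, 0644);

        if (fd < 0)
            return 1;

        memset(&lk, 0, sizeof(lk));
        lk.l_type   = F_WRLCK;       /* exclusive (write) lock */
        lk.l_whence = SEEK_SET;
        lk.l_start  = 0;
        lk.l_len    = 4096;          /* just the first 4 KB    */

        fcntl(fd, F_SETLKW, &lk);    /* block until the lock is granted */

        write(fd, "update", 6);      /* critical section */

        lk.l_type = F_UNLCK;         /* release the lock */
        fcntl(fd, F_SETLK, &lk);

        close(fd);
        return 0;
    }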
5.3 NameSpace Manager
The XFS namespace manager implements filesystem naming operations, translating
path names into file references (i.e., finding files). A file is identified
internally to the filesystem by its inode number. The inode is the on-disk
structure which holds all the information about a file. The inode number is the
label (or index) of the inode within the particular filesystem.
Files are also identified internally by a numeric, human-readable value unique to the file, called the file unique id. Filesystems may be identified either by a "magic cookie", typically a memory address of the root inode, or by a filesystem unique id. Filesystem unique ids are assigned when the filesystem is created and are associated uniquely with that filesystem until the filesystem is destroyed. In both cases, the unique ids help administrators troubleshoot systems by clearly identifying different files and filesystems.
The namespace manager manages the directory structures and the contents of the inode that are unrelated to space management (such as file permissions and time stamps). The namespace manager uses a cache to speed up naming operations. The details of the name translation are hidden from the callers.
5.4 Attribute Manager
The attribute manager implements filesystem attribute operations: storing and
retrieving arbitrary user-defined attributes associated with objects in the
namespace. An attribute is stored internally by attaching it to the inode of
the referenced object. No storage for arbitrary attributes is allocated when an
object is created, and any attributes that exist when an object is destroyed are
destroyed as well.
The system backup utility will back up and restore the attributes of an object
when that object is backed up or restored. Standard NFS does not support
attributes beyond the traditional UNIX set, so these attributes are not visible
in any way to a client that is accessing an XFS filesystem via standard NFS.
NFS mounted filesystems continue to operate as if this feature did not exist.
5.5 Space Manager
As described earlier, the XFS space manager efficiently allocates the disk
space within a filesystem using extents and B-Tree indices. It is responsible
for mapping files onto disk blocks and for managing allocation groups, inodes,
and free space.
5.6 Log Manager
Also as indicated earlier, all changes to filesystem metadata (inodes,
directories, bitmaps, etc.) are serially logged (journalled) to a separate area
of disk space by the log manager. The log allows fast reconstruction of a
filesystem (recovery) to a consistent state if a crash intervenes before the
metadata blocks are written to disk. There is a separate log space for each
filesystem for safety; this separation is managed by the underlying volume
manager. The log manager utilizes information provided by the space manager to
control the sequencing of write operations from the buffer cache, since
specific log writes must be sequenced before and after data operations for
correctness if there is a crash.
The space and name manager subsystems send logging requests to the log manager.
Each request may fill a partial log block or multiple blocks of the log. The
log is implemented as a circular sequential list which wraps when writes reach
the end. Each log entry contains a log sequence number, so that the end of the
log may be found by looking for the highest sequence number.
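The following sketch illustrates only the idea of locating the head of a
wrapped, circular log by its sequence numbers; it is not XFS's on-disk log
format:

    /* Illustration: with a circular log whose records carry increasing
     * sequence numbers, the last record written before a crash is the one
     * just before the first place the sequence number drops (the wrap
     * point), i.e. the record with the highest sequence number. */
    #include <stdio.h>

    #define LOG_BLOCKS 8

    static int find_log_head(const unsigned long lsn[], int n)
    {
        int i;

        for (i = 0; i < n - 1; i++)
            if (lsn[i + 1] < lsn[i])     /* sequence drops: wrap point */
                return i;
        return n - 1;                    /* no wrap: last block is head */
    }

    int main(void)
    {
        /* Blocks 0-2 were overwritten most recently (sequence 9-11); the
         * log wraps between blocks 2 and 3. */
        unsigned long lsn[LOG_BLOCKS] = { 9, 10, 11, 4, 5, 6, 7, 8 };

        printf("log head at block %d\n", find_log_head(lsn, LOG_BLOCKS));
        return 0;
    }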
On plexed volumes, the buffer cache is also responsible for inserting log
records for non-metadata blocks, so that the volume manager's write-change log
does not need to be used by the filesystem. This allows the system to keep the
plexes of a volume synchronized with each other in the event of a crash between
writes.
5.7 Buffer Cache
The buffer cache is a cache of disk blocks for the various filesystems local to
a machine. Reads and writes may be performed from the buffer cache. Cache
entries are flushed when new entries are needed, in an order which takes into
account frequency (or recency) of use and filesystem semantics. Filesystem
metadata as well as file data is stored in the buffer cache. User requests may
bypass the cache by setting flags (O_DIRECT); otherwise all filesystem I/O goes
through the cache.
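A short sketch of bypassing the buffer cache with direct I/O follows. The
F_DIOINFO fcntl used here to query the alignment rules is an IRIX-specific
interface assumed for illustration, not described in this paper:

    /* Open a file for direct (unbuffered) I/O, ask the filesystem for its
     * direct I/O alignment rules, and read one properly aligned chunk.
     * F_DIOINFO and struct dioattr are assumed IRIX interfaces. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <malloc.h>
    #include <unistd.h>

    int main(void)
    {
        struct dioattr da;
        char *buf;
        int fd = open("bigfile.dat", O_RDONLY | O_DIRECT); /* arbitrary file */

        if (fd < 0)
            return 1;

        /* Memory alignment and minimum/maximum I/O sizes for direct I/O. */
        if (fcntl(fd, F_DIOINFO, &da) != 0)
            return 1;

        buf = memalign(da.d_mem, da.d_miniosz);
        if (read(fd, buf, da.d_miniosz) < 0)
            perror("direct read");

        free(buf);
        close(fd);
        return 0;
    }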
The current buffer cache interfaces are extended from the EFS filesystem in two
ways. First, 64-bit versions of the interfaces are added to support XFS's
64-bit file sizes. Second, a transaction mechanism is provided. This allows
buffer cache clients to collect and modify buffers during an operation, send
the changed buffers to the log manager, and then release all the buffers after
successful logging.
5.8 Guaranteed Rate I/O
As detailed earlier, XFS supports digital media applications by providing a
guaranteed rate I/O (GRIO) mechanism. This allows applications to specify
"real-time" guarantees for the rate at which they can read or write a file.
5.9 Volume Manager
Also detailed earlier, XLV interposes a layer between the filesystem and the
disk drivers by building logical volumes (also known simply as volumes) on top
of the partition devices. A volume is a faster, more reliable "disk"
made from many physical disks which allows concatenation, striping, and plexing
of data.
5.10 Disk Drivers
Disk drivers are the same as in traditional and current IRIX systems, except
for 64-bit compatibility and error management handling for guaranteed rate I/O.
The following section outlines the installation and commands of the XFS
filesystem. A more thorough discussion may be found in the XFS Administration
Guide called "Getting Started With XFS Filesystem".
The following provides a synopsis of the steps necessary to install XFS on
systems running IRIX 5.3 (XFS is included in IRIX 6.1):
6.2 EFS to XFS Conversion
EFS filesystems may be converted to XFS filesystems later by:
6.3 Commands
XFS uses numerous commands such as the following to provide flexible
filesystem management:
6.3.1 XFS Commands
xfs_estimate (1M) Estimates the space that an XFS filesystem will need
to store the contents of an existing EFS filesystem
mkfs, mkfs_xfs (1M) Constructs an XFS filesystem
xfs_check (1M) Checks whether an XFS filesystem is consistent
xfs_growfs (1M) Expands an existing XFS filesystem
xfsdump (1M) Filesystem dump utility
xfsrestore (1M) Filesystem restore utility
6.3.2 XLV Commands
xlv_make (1M) Creates new logical volume objects by writing logical
volume labels to the devices that are to constitute the volume objects
lv_to_xlv (1M) Parses the file describing the logical volumes used by
the local machines and generates the required xlv_make (1M) commands to create
an equivalent XLV volume
xlv_assemble (1M) Scans all the disks attached to the local system
for logical volume labels and assembles all the logical volumes to generate a
new configuration data structure
xlv_labd (1M) User process that writes logical volume disk labels
xlv_plexed (1M) User process responsible for making all plexes within
a subvolume consistent
xlvd (1M) A kernel process that handles I/O to plexes and performs
plex error recovery
xlv_admin (1M) A menu-driven command that is used to modify existing
XLV objects (volumes, plexes, volume elements, and XLV disk labels)
6.3.3 GRIO Commands
cfg (1M) Scans the hardware available on the system and creates a
file, /etc/grio_config, that describes the rates that can be guaranteed
on each I/O device
ggd (1M) Manages the I/O-rate guarantees that have been granted to
processes on the system
7.1.1 IRIX 6.2
IRIX 6.2, soon to be released, will possess many new exciting enhancements:
7.1.2 Future Directions
The ever evolving future of XFS shall include items such as:
7.2 Grow With XFS
As the power and reliability of processors and storage devices increase, users
will increasingly need a robust, fast, reliable 64-bit filesystem such as XFS
to handle voluminous amounts of data. XFS is designed from "scratch" to scale
with these needs and with subsequent SGI hardware architectures such as the
new CHALLENGE MP. Thus, XFS is the true next-generation filesystem for the
1990's and beyond.
Copyright © 1994, 1995 Silicon Graphics, Inc.
1.1 Product Feature Summary
-large files
-large filesystems
-large numbers of files
-sparse files (files with "holes")
1.2 Background Information
2.0 Opening New Opportunities
2.1 A New Technology
2.2 New Opportunities
2.3 Upward Compatibility
2.4 Availability
-Supported on all SGI platforms except IP4, IP6, and R8000-based machines
-Power Challenge, Power Onyx, Power Indigo II only (R8000-based machines)
-The root and /usr filesystems are factory installed as an XFS filesystem on
Power Challenge
2.5 Product Packaging
Description: IRIX 5.3 with XFS support. Includes volume management with
striping and concatenation, guaranteed rate I/O for 4 streams, DMAPI support,
64-bit files and filesystems.
Description: Right to use for Volume management mirroring (plexing) support.
Description: Right to use for up to 40 Streams of Guaranteed Rate I/O
Description: Right to use guaranteed rate I/O for greater than 40 I/O streams
3.1 64-Bit Filesystem
4.0 Known Limitations
5.0 Design Architecture Overview
5.1 System Call and Vnode Interfaces
6.0 Implementation
6.1 Installation
7.0 Road Map to the Future
7.1 Future Features
XFS Data Sheet is available online
Footnotes