XFS is Silicon Graphics' next-generation filesystem and volume manager for IRIX systems. XFS represents a leap into the future of filesystem management, providing a 64-bit journalled filesystem that can handle large files quickly and reliably. In addition, XFS provides the unique feature of guaranteed rate I/O that leverages Silicon Graphics' intimate knowledge of media serving environments. As a next generation product, XFS was designed as a new fully integrated product, rather than a modification of an existing product. Thus, important filesystem features such as journalling were seamlessly integrated into XFS, instead of being awkwardly added to an existing filesystem.
This paper describes the superior features of the XFS filesystem and its
underlying architecture. Installation and administration of XFS are described
only briefly; for more detail, please see the XFS Administration Guide,
"Getting Started With XFS Filesystem".
XFS provides a full 64-bit filesystem capable of scaling easily to handle extremely large files and filesystems that can grow to 1 terabyte (up to 9 million terabytes in successor releases). XFS's major features include:
Additionally, systems can use XFS filesystems exclusively or have a mixture of
XFS and EFS filesystems.
The XFS filesystem is a "journalled" filesystem. This means that updates to
filesystem metadata (inodes, directories, bitmaps, etc.) are written to a serial
log area on disk before the original disk blocks are updated in place. In the
event of a crash, such operations can be redone using data present in the log,
to restore the filesystem to a consistent state.
XFS includes a fully integrated volume manager, XLV, which creates an abstraction
of a sequence of logical disk blocks, called a volume, to be used by the XFS
filesystem. The sequence of disk blocks (a volume) can be assembled by
concatenating, plexing (mirroring), and striping across a number of physical
disk drives or RAID devices. XLV ensures that plexed data is kept
consistent across all plexes; XLV is discussed in detail later.
XFS is a true next generation filesystem, not simply a rewrite or port of
existing technology. Most filesystems on the market are based upon old
filesystem architectures such as System V or BSD. XFS was "built from
scratch". This allowed SGI to integrate into XFS key features such as
journalling for high reliability and redesign important areas such as the
allocation algorithms for increased performance with large filesystems.
Thus, XFS is the first new filesystem built for the demanding large filesystem
needs of the 1990's:
As indicated above, most filesystems on the market evolved in an earlier age
of smaller filesystems, lower processing power, and limited storage capacity.
Many current filesystems are based on architectures that emphasize conserving
storage space at the expense of performance. The revolutionary growth of CPU and
storage technology has fueled a dramatic increase in the size and complexity of
filesystems, outstripping the capabilities of many filesystems. SGI designed
XFS to meet this increasing need to manage large complex filesystems with high
performance and reliability. Thus, completely different design targets were
established for XFS from previous or current filesystems:
The cutting-edge filesystem technology of XFS opens up new opportunities in
the marketplace by enabling users to manage large amounts of data with
unsurpassed speed. This is particularly important for file, database,
and compute server applications. Additionally, new programmatic
features and interfaces such as guaranteed rate I/O provide unique advantages to
users. In particular, the fast, guaranteed response time of XFS is excellent for
digital media servers and other real-time applications. Finally, the
growing necessity for business-critical reliability is driving corporations
toward robust journalled filesystems such as XFS. Yet even with all this power
and these added features, XFS remains fully upward compatible with EFS and
UNIX standards.
XFS runs on all supported SGI machines which run IRIX 5.3 or higher, except
IP4 and IP6. The filesystem is implemented under the Virtual Filesystem Switch
which is extended from prior releases. This allows XFS to support the mixing of
EFS and XFS filesystems on the same system. Hence, the administrator could
easily move the more frequently used local filesystems to XFS and leave others
as EFS for later conversion as needed (see section 6.2 for converting EFS
filesystems to XFS filesystems).
XFS provides all the functions available in the current EFS filesystem at a
superior performance level. This includes performance features such as
asynchronous I/O, direct I/O, and synchronous I/O, in addition to normal
(buffered) I/O.
The XFS volume manager, XLV, performs all the functions of the older lv volume
manager and can be run on the same system as lv, but not on the same volume.
This allows a gradual changeover to the new filesystem and volume manager.
Converting from lv logical volumes to XLV logical volumes is easy: the programs
lv_to_xlv and xlv_make convert lv logical volumes to XLV without having to dump
and restore data. Additionally, EFS and non-filesystem applications can run on
top of the new volume manager; ordinary driver interfaces are presented to
these clients.
XFS is available now in the following configurations:
XFS may be added to systems which are running IRIX 5.3 and 6.0.1 by upgrading
to IRIX 5.3 with XFS or IRIX 6.1 (XFS included). XFS is a standard integrated
part of IRIX 6.1 and beyond. Note: As discussed in section 7.1.1, IRIX 6.2 with
XFS included is scheduled for release soon.
XFS is marketed in the following packages:
3.1.1 Room to Grow
The XFS filesystem supports files of up to 2^63-1 bytes
(9,223,372,036,854,775,807 bytes, roughly 9 million terabytes) and filesystems
of up to 2^64 bytes. In IRIX 5.3, files and filesystems are restricted to
2^40-1 bytes (1,099,511,627,775 bytes, or 1 terabyte). IRIX 6.1 supports full
64-bit files, with filesystems still subject to the 1 terabyte restriction.
Thus with XFS, filesystems are limited more by the user's ability to attach
disks to the hardware than by the filesystem's capacity. Disk drive capacities
are growing approximately 1.7 times per year, so this added filesystem capacity
will prove invaluable now and in the future.
As one of the only filesystems on the market designed specifically to scale to
the full 64-bit address space, XFS is leading the way into the future of large
filesystem management.
3.1.2 64-bit Compatibility
Both 32-bit and 64-bit interfaces are provided where the underlying OS and
hardware support them. NFS version 3, which is a 64-bit protocol, allows the
export of large files and filesystems. NFS version 3 is an available option
for IRIX 5.3 and higher. Systems that support only NFS version 2 may access
files on XFS filesystems subject to the 32-bit file size limit. XFS
filesystems accessed over NFS version 2 may be greater than 2 gigabytes; all
file accesses will operate correctly, but disk free space reports will be
incorrect. Additionally, the caching file system (CFS) uses NFS protocols, and
allows efficient remote access to XFS filesystems.
32-bit applications may access 64-bit files through two methods. First,
applications may use the extended system calls lseek64(), stat64(), etc. These
system calls enable the user to write 32-bit programs that can track 64-bit
file position and file size. Second, the new compilers from SGI allow building
applications against the N32 interfaces without changing any source code.
These compilers are available for IRIX 6.1 and above.
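For illustration, here is a minimal sketch of the first method, using the
64-bit variants of the standard calls (the file name and build environment are
assumptions, not taken from this paper):

    /* A 32-bit program inspecting and seeking within a file that may exceed
     * 2 GB, using the 64-bit interfaces named above. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        struct stat64 sb;
        off64_t end;
        int fd;

        if (argc != 2) {
            fprintf(stderr, "usage: %s file\n", argv[0]);
            return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0 || fstat64(fd, &sb) < 0) {
            perror(argv[1]);
            return 1;
        }

        /* 64-bit file size, even though the program itself is 32-bit. */
        printf("size: %lld bytes\n", (long long)sb.st_size);

        /* Seek to the last byte; this offset would overflow a 32-bit off_t. */
        end = lseek64(fd, (off64_t)(sb.st_size - 1), SEEK_SET);
        printf("now at offset %lld\n", (long long)end);

        close(fd);
        return 0;
    }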
Note: Many UNIX programs will work without modification because sequential
reads and writes will operate correctly even on files greater than 2 GB. The
semantics of 32-bit applications operating naively on files longer than 2^32
bytes are defined in the paper "64 Bit File Access".
3.2 Filesystem Management
3.2.1 Basics
The XFS space manager efficiently allocates disk space within a filesystem. It
is responsible for mapping a file (a sequence of bytes) onto sequences of disk
blocks. The internal structures of the filesystem -- allocation groups, inodes,
and free space management -- are the fundamental items controlled by the space
manager. The namespace manager handles allocation of directory files, normally
placing them close to the files in the directory for increased seek performance.
The XFS space manager and namespace manager use sophisticated B-Tree indexing
technology to represent file location information contained inside directory
files and to represent the structure of the files themselves (location of
information in a file). This significantly increases the speed of accessing
information in files, especially with large files and filesystems. In the case
of large filesystems, traditional filesystems linearly search the file location
information in directory files; this information often spans multiple blocks
and/or extents (a collection of blocks). XFS's B-Tree technology enables it to
go directly to the blocks and/or extents containing a file's location using
sophisticated indices. With large files, the B-Tree indices efficiently map the
location of the extents containing the file's data. This avoids the slower
multi-level indirect schemes (of other filesystems) which often require
multiple block reads to find the desired information.
3.2.2 Structure
The space manager divides each filesystem data sub-volume (or partition) into a
number of allocation groups at mkfs time.
Each allocation group has a collection of inodes and data blocks, and data
structures to control their allocation. These allocation groups help divide
the space management problem into easy-to-manage pieces, speeding up file
creation.
The inode size is specified by the administrator at mkfs time and defaults to
256 bytes. However, the location and number of inodes are allocated as needed
by XFS, unlike most filesystems, which require static creation of all inodes
at mkfs time. This dynamic inode allocation permits greater performance (fewer
seeks) since inodes can be positioned closer to the files and directories that
use them.
Free blocks in an allocation group are tracked using a pair of B-Trees instead
of a bitmap. This provides better scalability to large systems. Block
allocation and free-block operations can be performed in parallel when they
occur in different allocation groups.
The real-time sub-volume is divided into a number of fixed-size extents
(a collection of blocks) to facilitate uniform read/writes during
guaranteed-rate I/O operations. The size is chosen at mkfs time and is
expected to be large (on the order of 1MByte). The real-time subvolume
allocation uses a bitmap to assure predictable performance; the access time for
B-tree indices varies.
3.2.3 Handles Contiguous Data
The space manager optimizes the layout of blocks in a file to avoid seeking
during sequential processing. It also keeps related files (those in the same
directory) close to each other on disk. More importantly, XFS uses large extents
(groups of blocks) to form contiguous regions in a single file. Current
block-based filesystems map a file to blocks of fixed length. Extent-based
filesystems map a file to extents, not just single blocks. Extents are easy
to represent, minimize fragmentation, and reduce the number of pointers
required to represent a large file. Thus, data is kept in large contiguous
extents that
can be extracted from a disk in a single I/O operation - significantly reducing
I/O time.
Filesystem extents are variable in size and configured by XFS as needed. XFS
supports 512 bytes to 1 gigabyte per extent and optimizes the size for higher
read performance. This is in contrast to other filesystems which provide only
fixed extent sizes. Note: The XFS allowable extent size is considerably larger
than the previous limit of 128K bytes for EFS.
The filesystem block size is set at filesystem creation time using mkfs.
XFS supports multiple block sizes, ranging from 512 bytes (disk sector size) up
to 64KB and up to 1GB for real-time data. The filesystem block size is the
minimum unit of allocation for user data in the filesystem. As a file is
written by an application, space is reserved but blocks are not immediately
allocated. This ensures that space is available, but gives XFS more
flexibility in allocating blocks.
XFS delays allocation of user data blocks when possible to make blocks more
contiguous, holding the data in the buffer cache in the meantime. This allows
XFS to make extents
large without requiring the user to specify extent size, and without requiring
a filesystem reorganizer to fix the extent sizes after the fact. This also
reduces the number of writes to disk and extents used for a file.
3.2.4 Efficient Filesystem
As implied earlier, XFS uses sophisticated filesystem management techniques
such as extents and B-Tree indices to efficiently support:
Very large files (64-bit size)
There is little or no performance penalty for accessing blocks in different areas of a large file. XFS creates large extents to keep file data close together for faster reading and writing. Using B-Tree indexing technology, XFS sets aside small areas on the disk for indices to the location of data in large files; generally these indices point to the extents containing the data. The use of B-Tree indices increases performance by avoiding the slower multi-level indirect searches of other filesystems' data structures, especially when accessing blocks at the end of large files.
Very small files
Most symbolic links and directory files are small files. XFS allows these files to be stored in inodes for increased performance. XFS also uses delayed writes to gather an entire small file in the buffer cache before writing it to disk. This reduces the number of writes to disk and the number of extents used for a file.
Files with few extents
These files are represented by extent lists to increase performance. Extent
lists are simple linear lists of the files' structure which avoid the
sophisticated B-Tree operations necessary for large files.
Sparse files
Sparse files are files that contain arbitrary "holes": areas of the file that
have never been written and that read back as zeroes. XFS supports holes
(and avoids wasting space on them) by using indexing (B-Trees) and extents.
A short sketch of creating and memory-mapping a sparse file follows this list.
Mapped files
Memory mapping of files is supported by XFS. It allows a program to "attach"
a file so that the file appears to be part of the program's address space,
and lets XFS worry about managing the disk I/O.
Large directories
Large directories are indexed in directory files using B-Trees to expedite
searches, insertions, and deletions. Operations on directories containing
millions of files are almost as fast as on directories containing only hundreds
of files.
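For illustration, here is a short sketch combining two of the cases above:
creating a sparse file by writing past end-of-file, then memory-mapping it
and reading zeroes back from the hole. The file name is arbitrary and the code
is a generic UNIX sketch rather than anything XFS-specific:

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/mman.h>

    int main(void)
    {
        const off_t hole = 1024 * 1024;   /* 1 MB hole before the data */
        char *p;
        int fd = open("sparse.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
            return 1;

        /* Writing after a long seek leaves a hole; no blocks are allocated
         * for the unwritten region. */
        lseek(fd, hole, SEEK_SET);
        write(fd, "end", 3);

        /* Map the whole file; the hole reads back as zeroes. */
        p = mmap(NULL, hole + 3, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            return 1;
        printf("byte 0 = %d, last bytes = %.3s\n", p[0], p + hole);

        munmap(p, hole + 3);
        close(fd);
        return 0;
    }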
3.2.5 Attribute Management
XFS supports user defined Arbitrary Attributes which allow information about a
file to be stored outside of the file. For example, a graphic image could have
an associated description or a text document could have an attribute showing
the language in which the document is written. An unlimited number of
attributes can be associated with a file or directory. They can have any name,
type or size of value. These Attributes represent the first substantial
enhancement to the UNIX file interface in 10 years.
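As an illustration of how such attributes might be used from a program, the
sketch below assumes the attr_set()/attr_get() library interfaces associated
with XFS attributes; the header and exact signatures are an assumption here,
not something this paper specifies:

    /* Attach a "language" attribute to a document and read it back.
     * attr_set()/attr_get() and <sys/attributes.h> are assumed interfaces. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/attributes.h>

    int main(void)
    {
        const char *path = "report.doc";      /* hypothetical file */
        char value[64];
        int  length = sizeof(value);

        if (attr_set(path, "language", "English", strlen("English"), 0) != 0) {
            perror("attr_set");
            return 1;
        }

        if (attr_get(path, "language", value, &length, 0) != 0) {
            perror("attr_get");
            return 1;
        }
        printf("language = %.*s\n", length, value);
        return 0;
    }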
3.2.6 Online Filesystem Reconfiguration
An active filesystem may be extended (enlarged) by adding more space to the
underlying volume. This operation is supported on-line by the space manager,
which receives a request to expand the filesystem and updates on-disk and
in-memory structures to implement it.
3.3 Fast, Scalable Filesystem Performance
XFS is a very high-performance filesystem with support for contiguous data
(reduced I/Os) and for both direct (non-buffered) and asynchronous I/O. SGI
has demonstrated I/O through the filesystem in excess of 500 MBytes/second.
Customers have demonstrated and documented high I/O bandwidth as well.
For example, Tom Ruwart, Director of the Army High Performance Computing Lab at
the University of Minnesota, has successfully tested an early "alpha" version
of XFS on a large 377 gigabyte filesystem. Using 24 SCSI channels and 24
Ciprico RAIDs, Tom measured direct I/O throughput of:
Furthermore, XFS is designed to scale in performance to match the new
CHALLENGE MP architecture and beyond. In traditional filesystems, files,
directories, and filesystems have reduced performance as they grow in size;
with XFS there is no performance penalty.
3.4 Journalled Filesystem
3.4.1 Journalling
As mentioned earlier, XFS utilizes database journalling (logging) technology.
Updates to filesystem metadata are written to a separate serial log area before
the original disk blocks are updated. Journalling promotes high availability by
allowing systems to quickly recover from failures while maintaining the
disk-based file structure in a consistent state at all times.
XFS is the first filesystem to integrate journal technology into a new
filesystem rather than add a section of journal code to an existing filesystem.
This integration allows XFS to be a more robust and faster filesystem.
XFS uses a circular log which typically occupies 1000-2000 filesystem blocks.
During normal system operation the log is only written, never read. Old
information is dropped from the log as its usefulness ends. These log
buffer writes are mostly performed asynchronously so as not to force user
applications to wait for them.
Approximately one transaction occurs per filesystem update operation. Batched
transactions enable XFS to make metadata updates faster than EFS, and the
atomic, multi-block updates enabled by transactions allow XFS to cleanly
update its complex metadata structures.
3.4.2 Reliable Recovery
In the event of a crash, operations performed prior to the crash can be redone
using data present in the log to restore the filesystem structure to a
consistent state. This is done in the kernel at filesystem mount time. XFS
performs a binary search of the log for transactions to replay. This eliminates
the need to perform a slow total UNIX filesystem check (fsck) after a
system crash. Also, when fsck finds inconsistent data structures it must
throw away anything suspicious. XFS knows what was happening at the time of
failure, so it never needs to throw anything away; it simply finishes what it
started. Thus, XFS's journalled recovery provides higher filesystem integrity
than does standard UNIX.
To ensure consistency, some log writes must be synchronous; ordinary
(buffered) user data writes are not. Applications that require synchronous
user data writes can use direct I/O, synchronous I/O, or fsync().
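A minimal sketch of the fsync() path (the file name is arbitrary):

    /* Write a record, then force the user data to stable storage before
     * continuing.  Opening with O_SYNC instead would make every write
     * synchronous. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char *msg = "committed record\n";
        int fd = open("records.log", O_WRONLY | O_CREAT | O_APPEND, 0644);

        if (fd < 0)
            return 1;

        write(fd, msg, strlen(msg));

        if (fsync(fd) != 0)          /* data is on disk once this returns */
            return 1;

        close(fd);
        return 0;
    }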
3.4.3 Fast Recovery
The XFS log-based recovery mechanism will recover a filesystem within a few
seconds. The XFS recovery mechanism does not need to scan all inodes or
directories to ensure consistency. Also, journalling makes the XFS recovery
time independent of the filesystem size. Thus, the recovery time depends only
upon the level of activity in the filesystem at the time of the failure.
3.5 Volume Management
3.5.1 Basics
The xlv volume manager (XLV) is an integral part of the XFS
filesystem(1). The volume manager
provides an operational interface to the system's disks and isolates the higher
layers of the filesystem and applications from the details of the hardware.
Essentially, higher-level software "sees" the logical volumes created
by XLV exactly like disks. Yet, a logical volume is a faster, more reliable
"disk" made from many physical disks providing important features such as the
following (discussed in detail later):
The use of volumes enables XFS to create filesystems or raw devices that
span more than one disk partition. These volumes behave like regular disk
partitions and appear as block and character devices in the /dev
directory. Filesystems, databases, and other applications access the volumes
rather than the partitions. Each volume can be used as a single filesystem or
as a raw partition. A logical volume might include partitions from several
physical disks and, thus, be larger than any of the physical disks. Filesystems
built on these volumes can be created, mounted, and used in the normal way.
The volume manager stores all configuration data in the disk's labels. These
labels are stored on each disk and will be replicated so that a logical volume
can be assembled even if some pieces are missing. There is a negligible
performance penalty for using XLV compared to accessing the disk directly,
although plexing (mirroring data) will mildly degrade write performance.
3.5.2 Sub-volumes
Within each logical volume, the volume manager implements sub-volumes, which
are separate linear address spaces of disk blocks in which the filesystem
stores its data. For EFS filesystems, a volume consists of just one sub-volume.
For XFS filesystems, a volume consists of a data sub-volume, an optional log
sub-volume, and an optional real-time sub-volume:
Data sub-volume
The data sub-volume contains user files and filesystem metadata (inodes,
directories, and free space blocks). It is required in all logical volumes
containing XFS filesystems and is the only sub-volume present in the EFS
filesystems.
Log sub-volume
The log sub-volume contains XFS journalling information. It is a log of
filesystem transactions and is used to expedite system recovery after a crash.
Real-time sub-volume
Real-time sub-volumes are generally used for data applications such as video
where guaranteed response time is paramount.
Sub-volumes facilitate separation of different data types. For example, user
data could be prevented from overwriting filesystem log data. Sub-volumes also
enable filesystem data and user data to be configured to meet goals for
performance and reliability by putting sub-volumes on
different disk drives - particularly useful for separating out real-time data
for guaranteed rate I/O operations. Each sub-volume can also be optimally sized
and organized independently. For example, the log sub-volume can be plexed
(mirrored) for fault tolerance and the real-time sub-volume can be striped
across a large number of disks to give maximum throughput for video playback.
Each sub-volume is made of partitions (real, physical regions of disk
blocks) composed by concatenation, plexing (mirroring), and striping. The
volume manager is responsible for translating logical addresses in the linear
address spaces into real disk addresses from the partitions. Where there are
multiple copies of a logical block (plexing), the volume manager writes
simultaneously to all copies, and reads from any copy (since all
copies are identical). The volume manager maintains the equality of plexes
across crashes and both temporary and permanent disk failures. Single block
failures in the plexed volumes are masked by the volume manager performing
retries and rewrites.
3.5.3 Plex and Volume Elements
A volume used for filesystem operations is usually composed of at least two
sub-volumes, one for the log and one for data. Each sub-volume can consist of a
number of plexes (mirrored data). Plexes are individually organized, but are
mapped to the same portion of the sub-volume's address space. Plexes may be
added or detached while the volume is active. The root filesystem may be plexed.
A plex consists of 1 to 128 volume elements, each of which maps a portion of
the plex's address space (a physical data location on disk); the volume
elements are concatenated together. Each volume element can be striped across
a number of disk partitions. XFS allows online growth of a filesystem
using xfs_growfs.
3.5.4 Volume Manipulation
XLV provides the following services transparent to the filesystems and
applications that access the volumes:
General volume manipulation
You can create volumes, delete them, and move them to another system.
Auto-assembly of logical volumes
The volume manager will assemble logical volumes by scanning the hardware on
the system and reading all the disk labels at system boot time.
Plexing for higher system and data reliability
A sub-volume can contain one to four plexes (mirrors) of the data. Each plex
contains a portion or all of the sub-volume's data. Creating a volume with
multiple plexes increases system reliability.
Disk striping for higher I/O performance
Striped volume elements consist of two or more disk partitions, organized so
that an amount of data called the stripe unit is written to each disk partition
before writing the next stripe unit-worth of data to the next partition. This
provides a performance advantage on large systems by allowing parallel I/O
activity from the disk drives for large I/O operations.
Concatenation to build large filesystems
You can build arbitrarily large volumes, up to the 64-bit limit, by
concatenating (joining) volume elements together (a maximum of 128 volume
elements). This is useful for creating a filesystem that is larger than the
size of a single disk.
XLV volumes may store XFS filesystems, EFS filesystems, or be used as raw
partitions (databases).
3.5.5 I/O to a Volume
Each volume has a pair of device nodes (just like disks). I/O to XLV devices
uses a kernel driver which dispatches the I/O to the right set of disks.
Plexing is more complicated:
XLV supports RAID devices. Thus, XLV is capable of handling large data
transfers generated from the filesystem code, through the volume manager, down
to the RAID driver.
3.5.6 Online Administration
XLV allows online volume reconfigurations, such as increasing the size of a
volume or adding/removing a piece of a volume. This reduces system downtime.
3.6 Guaranteed Rate I/O
3.6.1 A Unique Feature
The XFS guaranteed rate I/O system (GRIO) is a unique feature of XFS not found
in other filesystems. GRIO allows applications to reserve specific bandwidth to
or from the filesystem. XFS will calculate the performance available and
guarantee that the requested level of performance is met for a specified time.
This frees the programmer from having to predict the performance, which can be
complex and variable on flexible systems such as the CHALLENGE systems. This
functionality is critical for full rate, high resolution media delivery systems
such as video-on-demand or satellite systems that need to process information
at a certain rate as the satellite passes overhead.
While it is possible to obtain proprietary guaranteed rate devices, no other
system integrates this feature into the filesystem. This integration yields
significant performance benefits. For example, in XFS the real-time data can be
separated from the regular data by XLV. This allows the administrator to
physically locate the real-time data separate from the metadata and regular
data for faster processing. Thus, dedicated storage devices may be reconfigured
for higher performance at the expense of reliability. Moreover, the GRIO
subsystem supports all storage devices (with some configuration), instead of
locking the user to proprietary storage devices.
3.6.2 Hard vs. Soft Guarantee
Guarantees can be hard or soft, depending upon the trade-off between
reliability and performance. Hard guarantees place greater restrictions on the
system hardware configuration. They guarantee to deliver the requested
performance, but with some possibility of error in the data (due to the need to
turn off disk drive self-diagnostics and error-correction firmware). Hard
guarantees are only possible if all the drives are on one SCSI bus and XFS
knows and "trusts" all the devices on that bus (such as using all
disk drives instead of unpredictable tape drives). Otherwise, XFS will allow a
soft guarantee which allows the disk drive to retry operations in the event of
an error, but this can possibly result in missing the rate guarantee.
3.6.3 Guarantee Mechanism
Applications request guarantees by providing a file descriptor, data rate,
duration, and start time. The filesystem calculates the performance available
and, if the bandwidth is available, guarantees that the requested level of
performance can be met for the given period of time. To make a bandwidth
reservation, a user issues a grio_request call on a file. All real-time
data accesses are made with standard read and write system calls.
Guaranteed rate I/O does not impact the buffer cache, because programs which
utilize this mechanism are required to use direct I/O - avoiding the buffer
cache. Real-time data may also be accessed in a non-realtime way using
only direct I/O calls without GRIO.
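The sketch below shows only the overall flow described above. The argument
list for grio_request is hypothetical (this paper names the call but not its
prototype), and the file path, rate, and buffer sizes are made up for
illustration:

    /* Flow sketch: reserve bandwidth on a real-time file, then stream it
     * with ordinary read() calls using direct I/O.  The grio_request()
     * arguments shown are hypothetical. */
    #include <fcntl.h>
    #include <malloc.h>
    #include <unistd.h>

    #define RATE  (4 * 1024 * 1024)          /* 4 MB/s, for illustration */
    #define CHUNK (512 * 1024)               /* read size per iteration  */

    int main(void)
    {
        char *buf;
        int fd = open("/rt/video.clip", O_RDONLY | O_DIRECT);

        if (fd < 0)
            return 1;

        /* Hypothetical reservation: file descriptor, rate, duration in
         * seconds, start time (0 = now).
         *
         *     grio_request(fd, RATE, 60, 0);
         */

        /* Accesses are standard read() calls; buffers must follow the
         * direct I/O alignment rules (see the buffer cache discussion). */
        buf = memalign(4096, CHUNK);
        while (read(fd, buf, CHUNK) > 0) {
            /* deliver one chunk of the media stream here */
        }

        free(buf);
        close(fd);
        return 0;
    }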
The knowledge of the available bandwidth for reservation resides in a
user-level reservation scheduling daemon, ggd. The daemon has knowledge of the
characteristics and configuration of the disks and volumes on the system
(including backplane and SCSI bus throughput), and it tracks both current and
future bandwidth reservations.
By default, IRIX supports four GRIO streams (concurrent uses of GRIO). The
number of streams can be increased to 40 by purchasing the High Performance
Guaranteed-Rate I/O-5-40 option, or beyond 40 with the Unlimited Streams
option.
Note: Disk drivers have been modified to recognize guaranteed rate requests and
to schedule them in a real time manner. The disk and volume drivers also export
an interface for acquiring their response time and bandwidth characteristics
for use by the reservation scheduling module.
3.7 Data Management Interface
3.7.1 New Standard
In 1993, a group of computer system and data storage system vendors established
the Data Management Interface Group (DMIG) to establish a standard filesystem
interface for Hierarchical Storage Management (HSM) systems. This interface is
referred to as the Data Management Application Programming Interface (DMAPI).
Silicon Graphics
is committed to the group's goal of simplifying and standardizing an HSM
interface to filesystems. The DMIG also includes other companies such as IBM,
Sun, Epoch, EMASS, Hitachi, Veritas, etc.
The DMIG has produced a draft of the standard (Version 2.1 dated March 1995)
whose changes will be tracked as the DMAPI is implemented and used in the
marketplace. XFS has already implemented much of the DMAPI interface while most
other major vendors have lagged on their implementation. Two companies, EMASS
and Hitachi, have released HSM products that use the DMAPI. Many other HSM
vendors have shown interest in the DMAPI.
3.7.2 Modified XFS Dump
Silicon Graphics has modified the XFS backup interface "xfsdump" to
work more efficiently with the HSM interfaces. XFS's xfsdump uses DMAPI
interfaces to understand the location and structure of the files in the
filesystem associated with any HSM products. This provides increased efficiency
and the ability to work with an HSM instead of fighting it.
3.8 Expanded Dump/Restore Capabilities
3.8.1 Basics
XFS's dump and restore programs have been written to support the
next-generation features of XFS. Thus, the new dump and restore efficiently
support large filesystems with up to 4 billion files, as well as all other
features discussed in this paper, such as sparse files.
SGI modelled XFS's dump/restore after the BSD UNIX dump/restore as indicated
below:
XFS's new dump/restore is bundled with XFS and works well with other advanced
3rd party solutions such as Networker. Also, as mentioned above, XFS's dump
uses the DMAPI to rapidly read the filesystem without understanding details of
XFS's internal structure, resulting in a far less complex program.
3.8.2 Features
Unlike traditional filesystems, which must be made inactive and then
dismounted to guarantee a consistent dump image, an XFS filesystem can be
dumped while it is being used. Furthermore, XFS dumps/restores are resumable.
This means that dump/restore can resume where it left off after a temporary
termination, instead of going back to the beginning of the backup process.
XFS uses a log mechanism to record where the dump/restore temporarily stopped
and proceeds from there.
XFS supports two types of dumps: level and subtree. A level dump captures
either the entire filesystem or, for incremental levels, just the changes to
the filesystem since a previous dump. XFS can also perform "subtree" dumps by
file name. With both types of dumps, the user does not have to perform a
complete filesystem dump to back up data.
The online inventory of dump/restore actions contains much more information
than the old /etc/dumpdates file. The new online inventory
directory /var/xfsdump provides an extensive review of the dump history
displayable on screen. This information will help administrators to quickly
restore filesystems.
3.8.3 Media Handling
XFS dump is designed to use the "end of tape" indication to determine
when a new tape needs to be started. Thus, administrators with large
filesystems do not need to worry about catastrophic "end-of-tape" conditions
while dumping data to tapes. In XFS, a dump may span multiple tapes with
multiple dumps per tape. This frees the operator from struggling to guess the
amount of tape required.
In addition, media error handling has been improved:
XFS dump provides an "on-demand" progress report of all media
handling operations. Additionally, XFS restore can restore from tapes in any
order, independent of how the filesystem was dumped.
3.8.4 High Performance
XFS dump/restore provides high performance by:
3.8.5 Administration
For backup and restore of files less than 2 GB in size, the standard IRIX
utilities Backup, bru, cpio, Restore, and tar may be used. To dump XFS filesystems, the new utility xfsdump must be used instead of dump. Restoring from these dumps is done using xfsrestore. Xfsrestore also allows the backup media to be on a remote host.
3.8.6 Use EFS or XFS to Dump/Restore?
Which dump to use is simple:
-Use dump to dump EFS filesystems
-Use xfsdump to dump XFS filesystems
Which restore to use is also simple:
-Use restore to restore dumps made with dump
-Use xfsrestore to restore dumps made with xfsdump
Note: Restore or xfsrestore may be done to either type of
filesystem.
Although a powerful filesystem, XFS does possess some limitations worth
mentioning:
The following sections provide an overview of each component in the XFS
architecture: System Call and Vnode Interfaces, Lock Manager, NameSpace
Manager, Attribute Manager, Space Manager, Log Manager, Buffer Cache,
Guaranteed Rate I/O, Volume Manager, and Disk Drivers.
The filesystem related calls are implemented at the system call and vnode
interface: read, write, open, ioctl, etc., for all filesystem types. The
operations are then vectored out to different routines for each filesystem type
through the vnode interfaces.
The vnode interfaces also allow interoperation with remote clients such as NFS.
As indicated earlier, NFS Version 3.0 provides 64-bit file sharing capabilities
for XFS.
System call and vnode operations support Hierarchical Storage Management (HSM)
and backup applications. These were designed by an industry-wide working group
(DMIG, Data Management Interface Group).
5.2 Lock Manager
The XFS lock manager implements locking on user files, supporting standard UNIX
file locking calls such as fcntl and flock. The XFS lock manager
is similar to the EFS lock manager with comparable performance.
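A minimal sketch of standard record locking through this interface, using
fcntl() (the file name is arbitrary):

    /* Lock the first 4 KB of a file for exclusive writing, update it, then
     * release the lock. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        struct flock lk;
        int fd = open("shared.db", O_RDWR | O_CREAT, 0644);

        if (fd < 0)
            return 1;

        memset(&lk, 0, sizeof(lk));
        lk.l_type   = F_WRLCK;       /* exclusive (write) lock */
        lk.l_whence = SEEK_SET;
        lk.l_start  = 0;
        lk.l_len    = 4096;          /* just the first 4 KB    */

        fcntl(fd, F_SETLKW, &lk);    /* block until the lock is granted */

        write(fd, "update", 6);      /* critical section */

        lk.l_type = F_UNLCK;         /* release the lock */
        fcntl(fd, F_SETLK, &lk);

        close(fd);
        return 0;
    }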
5.3 NameSpace Manager
The XFS namespace manager implements filesystem naming operations, translating
path names into file references (i.e., finding files). A file is identified
internally to the filesystem by its inode number. The inode is the on-disk
structure which holds all the information about a file. The inode number is the
label (or index) of the inode within the particular filesystem.
Files are also identified internally by a numeric, human-readable value unique to the file, called the file unique id. Filesystems may be identified either by a "magic cookie", typically a memory address of the root inode, or by a filesystem unique id. Filesystem unique ids are assigned when the filesystem is created and are associated uniquely with that filesystem until the filesystem is destroyed. In both cases, the unique ids help administrators troubleshoot systems by clearly identifying different files and filesystems.
The namespace manager manages the directory structures and the contents of the inode that are unrelated to space management (such as file permissions and time stamps). The namespace manager uses a cache to speed up naming operations. The details of the name translation are hidden from the callers.
5.4 Attribute Manager
The attribute manager implements filesystem attribute operations: storing and
retrieving arbitrary user-defined attributes associated with objects in the
namespace. An attribute is stored internally by attaching it to the inode of
the referenced object. No storage for arbitrary attributes is allocated when an
object is created, and any attributes that exist when an object is destroyed are
destroyed as well.
The system backup utility will back up and restore the attributes of an object
when that object is backed up or restored. Standard NFS does not support
attributes beyond the traditional UNIX set, so these attributes are not visible
in any way to a client that is accessing an XFS filesystem via standard NFS.
NFS mounted filesystems continue to operate as if this feature did not exist.
5.5 Space Manager
As described earlier, the XFS space manager efficiently allocates the disk
space within a filesystem using extents and B-Tree indices. It is responsible
for mapping files onto disk blocks and for managing allocation groups, inodes,
and free space.
5.6 Log Manager
Also as indicated earlier, all changes to filesystem metadata (inodes,
directories, bitmaps, etc.) are serially logged (journalled) to a separate area
of disk space by the log manager. The log allows fast reconstruction of a
filesystem (recovery) to a consistent state if a crash intervenes before the
metadata blocks are written to disk. There is a separate log space for each
filesystem for safety; this separation is managed by the underlying volume
manager. The log manager utilizes information provided by the space manager to
control the sequencing of write operations from the buffer cache, since
specific log writes must be sequenced before and after data operations for
correctness if there is a crash.
The space and name manager subsystems send logging requests to the log manager.
Each request may fill a partial log block or multiple blocks of the log. The
log is implemented as a circular sequential list which wraps when writes reach
the end. Each log entry contains a log sequence number, so that the end of the
log may be found by looking for the highest sequence number.
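The following sketch illustrates only the idea of locating the head of a
wrapped, circular log by its sequence numbers; it is not XFS's on-disk log
format:

    /* Illustration: with a circular log whose records carry increasing
     * sequence numbers, the last record written before a crash is the one
     * just before the first place the sequence number drops (the wrap
     * point), i.e. the record with the highest sequence number. */
    #include <stdio.h>

    #define LOG_BLOCKS 8

    static int find_log_head(const unsigned long lsn[], int n)
    {
        int i;

        for (i = 0; i < n - 1; i++)
            if (lsn[i + 1] < lsn[i])     /* sequence drops: wrap point */
                return i;
        return n - 1;                    /* no wrap: last block is head */
    }

    int main(void)
    {
        /* Blocks 0-2 were overwritten most recently (sequence 9-11); the
         * log wraps between blocks 2 and 3. */
        unsigned long lsn[LOG_BLOCKS] = { 9, 10, 11, 4, 5, 6, 7, 8 };

        printf("log head at block %d\n", find_log_head(lsn, LOG_BLOCKS));
        return 0;
    }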
On plexed volumes, the buffer cache is also responsible for inserting log
records for non-metadata blocks, so that the volume manager's write-change log
does not need to be used by the filesystem. This allows the system to keep the
plexes of a volume synchronized with each other in the event of a crash between
writes.
5.7 Buffer Cache
The buffer cache is a cache of disk blocks for the various filesystems local to
a machine. Reads and writes may be performed from the buffer cache. Cache
entries are flushed when new entries are needed, in an order which takes into
account frequency (or recency) of use and filesystem semantics. Filesystem
metadata as well as file data is stored in the buffer cache. User requests may
bypass the cache by setting flags (O_DIRECT); otherwise all filesystem I/O goes
through the cache.
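A short sketch of bypassing the buffer cache with direct I/O follows. The
F_DIOINFO fcntl used here to query the alignment rules is an IRIX-specific
interface assumed for illustration, not described in this paper:

    /* Open a file for direct (unbuffered) I/O, ask the filesystem for its
     * direct I/O alignment rules, and read one properly aligned chunk.
     * F_DIOINFO and struct dioattr are assumed IRIX interfaces. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <malloc.h>
    #include <unistd.h>

    int main(void)
    {
        struct dioattr da;
        char *buf;
        int fd = open("bigfile.dat", O_RDONLY | O_DIRECT); /* arbitrary file */

        if (fd < 0)
            return 1;

        /* Memory alignment and minimum/maximum I/O sizes for direct I/O. */
        if (fcntl(fd, F_DIOINFO, &da) != 0)
            return 1;

        buf = memalign(da.d_mem, da.d_miniosz);
        if (read(fd, buf, da.d_miniosz) < 0)
            perror("direct read");

        free(buf);
        close(fd);
        return 0;
    }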
The current buffer cache interfaces are extended from the EFS filesystem in two
ways. First, 64-bit versions of the interfaces are added to support XFS's
64-bit file sizes. Second, a transaction mechanism is provided. This allows
buffer cache clients to collect and modify buffers during an operation, send
the changed buffers to the log manager, and then release all the buffers after
successful logging.
5.8 Guaranteed Rate I/O
As detailed earlier, XFS supports digital media applications by providing a
guaranteed rate I/O (GRIO) mechanism. This allows applications to specify
"real-time" guarantees for the rate at which they can read or write a file.
5.9 Volume Manager
Also detailed earlier, XLV interposes a layer between the filesystem and the
disk drivers by building logical volumes (also known simply as volumes) on top
of the partition devices. A volume is a faster, more reliable "disk"
made from many physical disks which allows concatenation, striping, and plexing
of data.
5.10 Disk Drivers
Disk drivers are the same as in traditional and current IRIX systems, except
for 64-bit compatibility and error management handling for guaranteed rate I/O.
The following section outlines the installation and commands of the XFS
filesystem. A more thorough discussion may be found in the XFS Administration
Guide called "Getting Started With XFS Filesystem".
The following provides a synopsis of the steps necessary to install XFS on
systems running IRIX 5.3 (XFS is included in IRIX 6.1):
6.2 EFS to XFS Conversion
EFS filesystems may be converted to XFS filesystems later by:
6.3 Commands
XFS uses numerous commands such as the following to provide flexible
filesystem management:
6.3.1 XFS Commands
xfs_estimate (1M) Estimates the space that an XFS filesystem will need
to store the contents of an existing EFS filesystem
mkfs, mkfs_xfs (1M) Constructs an XFS filesystem
xfs_check (1M) Checks whether an XFS filesystem is consistent
xfs_growfs (1M) Expands an existing XFS filesystem
xfsdump (1M) Filesystem dump utility
xfsrestore (1M) Filesystem restore utility
6.3.2 XLV Commands
xlv_make (1M) Creates new logical volume objects by writing logical
volume labels to the devices that are to constitute the volume objects
lv_to_xlv (1M) Parses the file describing the logical volumes used by
the local machines and generates the required xlv_make (1M) commands to create
an equivalent XLV volume
xlv_assemble (1M) Scans all the disks attached to the local system
for logical volume labels and assembles all the logical volumes to generate a
new configuration data structure
xlv_labd (1M) User process that writes logical volume disk labels
xlv_plexed (1M) User process responsible for making all plexes within
a subvolume consistent
xlvd (1M) A kernel process that handles I/O to plexes and performs
plex error recovery
xlv_admin (1M) A menu-driven command that is used to modify existing
XLV objects (volumes, plexes, volume elements, and XLV disk labels)
6.3.3 GRIO Commands
cfg (1M) Scans the hardware available on the system and creates a
file, /etc/grio_config, that describes the rates that can be guaranteed
on each I/O device
ggd (1M) Manages the I/O-rate guarantees that have been granted to
processes on the system
7.1.1 IRIX 6.2
IRIX 6.2, soon to be released, will possess many new exciting enhancements:
7.1.2 Future Directions
The ever evolving future of XFS shall include items such as:
7.2 Grow With XFS
As the power and reliability of processors and storage devices increase, users
will increasingly need a robust, fast, reliable 64-bit filesystem such as XFS
to handle voluminous amounts of data. XFS is designed from "scratch" to scale
with these needs and with subsequent SGI hardware architectures such as the
new CHALLENGE MP. Thus, XFS is the true next-generation filesystem for the
1990's and beyond.
Copyright © 1994, 1995 Silicon Graphics, Inc.
1.1 Product Feature Summary
-large files
-large filesystems
-large numbers of files
-sparse files (files with "holes")
1.2 Background Information
2.0 Opening New Opportunities
2.1 A New Technology
2.2 New Opportunities
2.3 Upward Compatibility
2.4 Availability
-Supported on all SGI platforms except IP4, IP6, and R8000-based machines
-Power Challenge, Power Onyx, Power Indigo II only (R8000-based machines)
-The root and /usr filesystems are factory installed as an XFS filesystem on
Power Challenge
2.5 Product Packaging
Description: IRIX 5.3 with XFS support. Includes volume management with
striping and concatenation, guaranteed rate I/O for 4 streams, DMAPI support,
64-bit files and filesystems.
Description: Right to use for Volume management mirroring (plexing) support.
Description: Right to use for up to 40 Streams of Guaranteed Rate I/O
Description: Right to use guaranteed rate I/O for greater than 40 I/O streams
3.1 64-Bit Filesystem
4.0 Known Limitations
5.0 Design Architecture Overview
5.1 System Call and Vnode Interfaces
6.0 Implementation
6.1 Installation
7.0 Road Map to the Future
7.1 Future Features
XFS Data Sheet is available online
Footnotes