HP-UX VxFS tuning and performance - filibeto.org
HP-UX VxFS tuning and performance
Technical white paper

Table of contents

Executive summary
Introduction
Understanding VxFS
    Software versions vs. disk layout versions
    Variable sized extent based file system
    Extent allocation
    Fragmentation
    Transaction journaling
Understanding your application
Data access methods
    Buffered/cached I/O
    Direct I/O
    Concurrent I/O
    Oracle Disk Manager
Creating your file system
    Block size
    Intent log size
    Disk layout version
Mount options
Dynamic file system tunables
System wide tunables
    Buffer cache on HP-UX 11i v2 and earlier
    Unified File Cache on HP-UX 11i v3
    VxFS metadata buffer cache
    VxFS inode cache
    Directory Name Lookup Cache
VxFS ioctl() options
    Cache advisories
    Allocation policies
Patches
Summary
For additional information
Executive summary

File system performance is critical to overall system performance. While memory latencies are
measured in nanoseconds, I/O latencies are measured in milliseconds. To maximize the
performance of your systems, the file system must be as fast as possible: performing efficient I/O,
eliminating unnecessary I/O, and reducing file system overhead.

In the past several years many changes have been made to the Veritas File System (VxFS) as well as
HP-UX. This paper is based on VxFS 3.5 and includes information on VxFS through version 5.0.1,
currently available on HP-UX 11i v3 (11.31). Recent key changes mentioned in this paper include:

- page-based Unified File Cache introduced in HP-UX 11i v3
- improved performance for large directories in VxFS 5.0
- patches needed for read ahead in HP-UX 11i v3
- changes to the write flush behind policies in HP-UX 11i v3
- new read flush behind feature in HP-UX 11i v3
- changes in direct I/O alignment in VxFS 5.0 on HP-UX 11i v3
- concurrent I/O included with the OnlineJFS license with VxFS 5.0.1
- the Oracle Disk Manager (ODM) feature

Target audience: This white paper is intended for HP-UX administrators who are familiar with
configuring file systems on HP-UX.
Introduction

Beginning with HP-UX 10.0, Hewlett-Packard began providing a Journaled File System (JFS) from
Veritas Software Corporation (now part of Symantec) known as the Veritas File System (VxFS). You no
longer need to be concerned with performance attributes such as cylinders, tracks, and rotational
delays. The speed of storage devices has increased dramatically, and computer systems continue to
have more memory available to them. Applications are becoming more complex, accessing large
amounts of data in many different ways.

To maximize performance with VxFS file systems, you need to understand the various
attributes of the file system, and which attributes can be changed to best suit how the application
accesses the data.

Many factors need to be considered when tuning your VxFS file systems for optimal performance. The
answer to the question "How should I tune my file system for maximum performance?" is "It depends."
Understanding some of the key features of VxFS and knowing how the applications access data are
key to deciding how best to tune the file systems.

The following topics are covered in this paper:

- Understanding VxFS
- Understanding your application
- Data access methods
- Creating your file system
- Mount options
- Dynamic file system tunables
- System wide tunables
- VxFS ioctl() options
Understanding VxFS

To fully understand some of the various file system creation options, mount options, and
tunables, a brief overview of VxFS is provided.

Software versions vs. disk layout versions

VxFS supports several different disk layout versions (DLV). The default disk layout version can be
overridden when the file system is created using mkfs(1M). Also, the disk layout can be upgraded
online using the vxupgrade(1M) command.

Table 1. VxFS software versions and disk layout versions (* denotes default disk layout)

  OS version   SW version    Disk layout versions
  11.23        VxFS 3.5      3, 4, 5*
  11.23        VxFS 4.1      4, 5, 6*
  11.23        VxFS 5.0      4, 5, 6, 7*
  11.31        VxFS 4.1      4, 5, 6*
  11.31        VxFS 5.0      4, 5, 6, 7*
  11.31        VxFS 5.0.1    4, 5, 6, 7*
Several improvements have been made in the disk layouts which can help increase performance. For
example, the version 5 disk layout is needed to create file systems larger than 2 TB. The version 7 disk
layout, available on VxFS 5.0 and above, improves the performance of large directories.

Please note that you cannot port a file system to a previous release unless the previous release
supports the same disk layout version for the file system being ported. Before porting a file system
from one operating system to another, be sure the file system is properly unmounted. Differences in
the Intent Log will likely cause a replay of the log to fail, and a full fsck will be required before
mounting the file system.
Variable sized extent based file system

VxFS is a variable sized extent based file system. Each file is made up of one or more extents that
vary in size. Each extent is made up of one or more file system blocks. A file system block is the
smallest allocation unit of a file. The block size of the file system is determined when the file system is
created and cannot be changed without rebuilding the file system. The default block size varies
depending on the size of the file system.

The advantages of variable sized extents include:

- Faster allocation of extents
- Larger and fewer extents to manage
- Ability to issue large physical I/O for an extent

However, variable sized extents also have disadvantages:

- Free space fragmentation
- Files with many small extents
3
4<br />
Extent allocation

When a file is initially opened for writing, VxFS is unaware of how much data the application will
write before the file is closed. The application may write 1 KB of data or 500 MB of data. The size of
the initial extent is the smallest power of 2 greater than or equal to the size of the write, with a minimum
extent size of 8 KB. Fragmentation will limit the extent size as well.

If the current extent fills up, it will be extended if neighboring free space is available.
Otherwise, a new extent will be allocated that is progressively larger, as long as there is available
contiguous free space.

When the file is closed by the last process that had the file open, the last extent is trimmed to the
minimum amount needed.
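As a rough illustration, the initial-extent rule above can be modeled in a few lines of Python (a simplified sketch; the real allocator also caps the extent at the largest contiguous free region available):

```python
def initial_extent_size(write_size, min_extent=8 * 1024):
    """Model of the VxFS initial-extent rule described above: the
    smallest power of 2 that covers the first write, with an 8 KB floor.
    The real allocator may return less when free space is fragmented."""
    size = min_extent
    while size < write_size:
        size *= 2
    return size

# A 1 KB first write gets an 8 KB extent; a 100 KB first write gets 128 KB.
print(initial_extent_size(1024))        # 8192
print(initial_extent_size(100 * 1024))  # 131072
```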
Fragmentation

Two types of fragmentation can occur with VxFS: the available free space can be fragmented, and
individual files may be fragmented. As files are created and removed, free space can become
fragmented due to the variable sized extent nature of the file system. For volatile file systems with a
small block size, file system performance can degrade significantly if the fragmentation is severe.
Examples of applications that use many small files include mail servers. As free space becomes
fragmented, file allocation takes longer as smaller extents are allocated. Then, more and smaller I/Os
are necessary when the file is read or updated.

However, files can be fragmented even when there are large extents available in the free space map.
This fragmentation occurs when a file is repeatedly opened, extended, and closed. When the file is
closed, the last extent is trimmed to the minimum size needed. Later, when the file grows but there is
no contiguous free space to increase the size of the last extent, a new extent will be allocated and the
file could be closed again. This process of opening, extending, and closing a file can repeat, resulting
in a large file with many small extents. When a file is fragmented, a sequential read through the file
will be seen as small random I/O by the disk drive or disk array, since the fragmented extents may be
small and reside anywhere on disk.

Static file systems that use large files built shortly after the file system is created are less prone to
fragmentation, especially if the file system block size is 8 KB. Examples are file systems that hold
large database files. The large database files are often updated but rarely change in size.

File system fragmentation can be reported using the "-E" option of the fsadm(1M) utility. File systems
can be defragmented or reorganized with the HP OnLineJFS product using the "-e" option of the
fsadm utility. Remember, performing one 8 KB I/O will be faster than performing eight 1 KB I/Os.
File systems with small block sizes are more susceptible to performance degradation due to
fragmentation than file systems with large block sizes.

File system reorganization should be done on a regular, periodic basis, where the interval
between reorganizations depends on the volatility, size, and number of the files. Large file systems
with a large number of files and significant fragmentation can take an extended amount of time to
defragment. The extent reorganization is run online; however, increased disk I/O will occur while
fsadm copies data from smaller extents to larger ones.

File system reorganization attempts to collect large areas of free space by moving various file extents
and attempts to defragment individual files by copying the data in the small extents to larger extents.
The reorganization is not a compaction utility and does not try to move all the data to the front of the
file system.

Note
fsadm extent reorganization may fail if there are not sufficient large free
areas to perform the reorganization. fsadm -e should be used to
defragment file systems on a regular basis.
Transaction journaling

By definition, a journaled file system logs file system changes to a "journal" in order to provide file
system integrity in the event of a system crash. VxFS logs structural changes to the file system in a
circular transaction log called the Intent Log. A transaction is a group of related changes that
represent a single operation. For example, adding a new file in a directory would require multiple
changes to the file system structures, such as adding the directory entry, writing the inode information,
and updating the inode bitmap. Once the Intent Log has been updated with the pending transaction,
VxFS can begin committing the changes to disk. When all the changes have been made on disk,
a completion record is written to indicate that the transaction is complete.

With the datainlog mount option, small synchronous writes are also logged in the Intent Log. The
datainlog mount option will be discussed in more detail later in this paper.

Since the Intent Log is a circular log, it is possible for it to "fill up". This occurs when the oldest
transaction in the log has not been completed and its space is needed for a new transaction. When
the Intent Log is full, threads must wait for the oldest transaction to be flushed and the space
returned to the Intent Log. A full Intent Log occurs very infrequently, but may occur more often if many
threads are making structural changes or small synchronous writes (datainlog) to the file system
simultaneously.
Understanding your application

As you plan the various attributes of your file system, you must also know your application and how
it will access the data. Ask yourself the following questions:

- Will the application be doing mostly file reads, file writes, or both?
- Will files be accessed sequentially or randomly?
- What are the sizes of the file reads and writes?
- Does the application use many small files, or a few large files?
- How is the data organized on the volume or disk? Is the file system fragmented? Is the volume striped?
- Are the files written or read by multiple processes or threads simultaneously?
- Which is more important, data integrity or performance?

How you tune your system and file systems depends on how you answer the above questions. No
single set of tunables exists that can be universally applied such that all workloads run at peak
performance.
5
6<br />
Data access methods

Buffered/cached I/O

By default, most access to files in a VxFS file system is through the cache. In HP-UX 11i v2 and
earlier, the HP-UX buffer cache provided the cache resources for file access. In HP-UX 11i v3 and
later, the Unified File Cache is used. Cached I/O allows for many features, including asynchronous
prefetching known as read ahead, and asynchronous or delayed writes known as flush behind.

Read ahead

When reads are performed sequentially, VxFS detects the pattern and reads ahead, or prefetches,
data into the buffer/file cache. VxFS attempts to maintain 4 read ahead "segments" of data in the
buffer/file cache, using the read ahead size as the size of each segment.

[Figure 1. VxFS read ahead: a sequential read followed by four 256 KB read ahead segments]

The segments act as a "pipeline". When the data in the first of the 4 segments is read, asynchronous
I/O for a new segment is initiated, so that 4 segments are maintained in the read ahead pipeline.
The initial size of a read ahead segment is specified by the VxFS tunable read_pref_io. As the process
continues to read a file sequentially, the read ahead size of the segment approximately doubles, to a
maximum of read_pref_io * read_nstream.

For example, consider a file system with read_pref_io set to 64 KB and read_nstream set to 4. When
sequential reads are detected, VxFS will initially read ahead 256 KB (4 * read_pref_io). As the
process continues to read ahead, the read ahead amount will grow to 1 MB (4 segments *
read_pref_io * read_nstream). As the process consumes a 256 KB read ahead segment from the front
of the pipeline, asynchronous I/O is started for the 256 KB read ahead segment at the end of the
pipeline.

By reading the data ahead, the disk latency can be reduced. The ideal configuration is the minimum
amount of read ahead that will reduce the disk latency. If the read ahead size is configured too large,
the disk I/O queue may spike when the reads are initiated, causing delays for other, potentially more
critical I/Os. Also, the data read into the cache may no longer be in the cache when the data is
needed, causing the data to be re-read into the cache. These cache misses will cause the read ahead
size to be reduced automatically.

Note that if another thread is reading the file randomly or sequentially, VxFS has trouble with the
sequential pattern detection. The sequential reads may be treated as random reads and no read
ahead is performed. This problem can be resolved with enhanced read ahead, discussed later. The
read ahead policy can be set on a per file system basis by setting the read_ahead tunable using
vxtunefs(1M). The default read_ahead value is 1 (normal read ahead).
Read ahead with VxVM stripes

The default value for read_pref_io is 64 KB, and the default value for read_nstream is 1, except when
the file system is mounted on a VxVM striped volume, where the tunables default to match the
striping attributes. For most applications, the 64 KB read ahead size is good (remember, VxFS
attempts to maintain 4 segments of the read ahead size). Depending on the stripe size (column width)
and number of stripes (columns), the default tunables on a VxVM volume may result in excessive read
ahead and lead to performance problems.

Consider a VxVM volume with a stripe size of 1 MB and 20 stripes. By default, read_pref_io is set to
1 MB and read_nstream is set to 20. When VxFS initiates the read ahead, it tries to prefetch 4
segments of the initial read ahead size (4 segments * read_pref_io). But as the sequential reads
continue, the read ahead size can grow to 20 MB (read_pref_io * read_nstream), and VxFS attempts
to maintain 4 segments of read ahead data, or 80 MB. This large amount of read ahead can flood
the SCSI I/O queues or internal disk array queues, causing severe performance degradation.
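To see why the striped-volume defaults matter, the pipeline arithmetic above can be worked through in Python (an illustrative calculation using the formulas in the text, not a VxFS interface):

```python
def readahead_window(read_pref_io, read_nstream, segments=4):
    """Bytes of read ahead VxFS tries to keep in flight: `segments`
    segments of read_pref_io initially, growing to `segments` segments
    of read_pref_io * read_nstream for a steady sequential reader."""
    initial = segments * read_pref_io
    maximum = segments * read_pref_io * read_nstream
    return initial, maximum

KB, MB = 1024, 1024 * 1024

# Plain file system defaults: 64 KB pref, 1 stream -> 256 KB in flight.
print(readahead_window(64 * KB, 1))   # (262144, 262144)
# Tuned example from the text: 64 KB pref, 4 streams -> grows to 1 MB.
print(readahead_window(64 * KB, 4))   # (262144, 1048576)
# VxVM defaults for 1 MB stripes x 20 columns -> grows to 80 MB in flight.
print(readahead_window(1 * MB, 20))   # (4194304, 83886080)
```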
Note
If you are using VxVM striped volumes, be sure to review the
read_pref_io and read_nstream values to verify that excessive read
ahead is not being done.
False sequential I/O patterns and read ahead

Some applications trigger read ahead when it is not needed. This behavior occurs when the
application is doing random I/O by positioning the file pointer using the lseek() system call, then
performing the read() system call, or by simply using the pread() system call.

[Figure 2. False sequential I/O patterns and read ahead: two adjacent 8 KB lseek()/read() pairs trigger the sequential-read heuristic]

If the application reads two adjacent blocks of data, the read ahead algorithm is triggered. Although
only 2 reads are needed, VxFS begins prefetching data from the file, up to 4 segments of the read
ahead size. The larger the read ahead size, the more data is read into the buffer/file cache. Not only is
excessive I/O performed, but useful data that may have been in the buffer/file cache is invalidated so
the new, unwanted data can be stored.

Note
If the false sequential I/O patterns cause performance problems, read
ahead can be disabled by setting read_ahead to 0 using vxtunefs(1M), or
direct I/O could be used.

Enhanced read ahead

Enhanced read ahead can detect non-sequential patterns, such as reading every third record or
reading a file backwards. Enhanced read ahead can also handle patterns from multiple threads. For
example, two threads can be reading from the same file sequentially and both threads can benefit
from the configured read ahead size.

[Figure 3. Enhanced read ahead: a patterned read of alternating 64 KB and 128 KB records]

Enhanced read ahead can be set on a per file system basis by setting the read_ahead tunable to 2
with vxtunefs(1M).
Read ahead on 11i v3

With the introduction of the Unified File Cache (UFC) on HP-UX 11i v3, significant changes were
needed in VxFS. As a result, a defect was introduced which caused poor VxFS read ahead
performance. The problems are fixed in the following patches, according to the VxFS version:

- PHKL_40018 - 11.31 VxFS 4.1 cumulative patch
- PHKL_41071 - 11.31 VRTS 5.0 MP1 VRTSvxfs kernel patch
- PHKL_41074 - 11.31 VRTS 5.0.1 GARP1 VRTSvxfs kernel patch

These patches are critical to VxFS read ahead performance on 11i v3.

Another caveat with VxFS 4.1 and later occurs when the size of the read() is larger than the
read_pref_io value. A large cached read can result in a read ahead size that is insufficient for the
read. To avoid this situation, the read_pref_io value should be configured to be greater than or equal
to the application read size.

For example, if the application is doing 128 KB read() system calls, the application can
experience performance issues with the default read_pref_io value of 64 KB, as VxFS would only
read ahead 64 KB, leaving the remaining 64 KB to be read in synchronously.
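A trivial check captures this sizing rule (hypothetical helper name, not part of any VxFS interface):

```python
def suggested_read_pref_io(app_read_size, current=64 * 1024):
    """On 11i v3, read_pref_io should be at least as large as the
    application's largest cached sequential read; otherwise part of
    each read() falls outside the read ahead and is read synchronously."""
    return max(current, app_read_size)

# 128 KB application reads with the default 64 KB setting: raise to 128 KB.
print(suggested_read_pref_io(128 * 1024))  # 131072
```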
Note
On HP-UX 11i v3, the read_pref_io value should be set to the size of the
largest logical read size for a process doing sequential I/O through cache.

The write_pref_io value should be configured to be greater than or equal to the read_pref_io value
due to a new feature on 11i v3 called read flush behind, which will be discussed later.
Flush behind

During normal asynchronous write operations, the data is written to buffers in the buffer cache, or
memory pages in the file cache. The buffers/pages are marked "dirty" and control returns to the
application. Later, the dirty buffers/pages must be flushed from the cache to disk.

There are 2 ways for dirty buffers/pages to be flushed to disk. One way is through a "sync", either
by a sync() or fsync() system call, by the syncer daemon, or by one of the vxfsd daemon threads.
The second method is known as "flush behind" and is initiated by the application performing the
writes.

[Figure 4. Flush behind: asynchronous flushes trail the sequential write stream by the flush behind amount]

As data is written to a VxFS file, VxFS will perform "flush behind" on the file. In other words, it will
issue asynchronous I/O to flush the buffers from the buffer cache to disk. The flush behind amount is
calculated by multiplying the write_pref_io and write_nstream file system tunables. The default flush
behind amount is 64 KB.
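The flush behind amount is just the product of the two tunables, as a quick sketch shows (defaults are those stated above):

```python
def flush_behind_amount(write_pref_io=64 * 1024, write_nstream=1):
    """Trailing window of dirty data flushed asynchronously behind a
    sequential writer: write_pref_io * write_nstream (64 KB by default)."""
    return write_pref_io * write_nstream

print(flush_behind_amount())               # 65536  (64 KB default)
print(flush_behind_amount(256 * 1024, 4))  # 1048576 (1 MB)
```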
Flush behind on HP-UX 11i v3

By default, flush behind is disabled on HP-UX 11i v3. The advantage of disabling flush behind is
improved performance for applications that rewrite the same data blocks. If flush behind
is enabled, the initial write may be in progress, causing the 2nd write to stall as the page being
modified has an I/O already in progress.

However, the disadvantage of disabling flush behind is that more dirty pages accumulate in the
Unified File Cache before getting flushed by vhand, ksyncher, or vxfsd.

Flush behind can be enabled on HP-UX 11i v3 by changing the fcache_fb_policy(5) value using
kctune(1M). The default value of fcache_fb_policy is 0, which disables flush behind. If
fcache_fb_policy is set to 1, a special kernel daemon, fb_daemon, is responsible for flushing the dirty
pages from the file cache. If fcache_fb_policy is set to 2, the process that is writing the data
initiates the flush behind. Setting fcache_fb_policy to 2 is similar to the 11i v2 behavior.

Note that the write_pref_io and write_nstream tunables have no effect on flush behind if fcache_fb_policy
is set to 0. However, these tunables may still impact read flush behind, discussed later in this paper.

Note
Flush behind is disabled by default on HP-UX 11i v3. To enable flush
behind, use kctune(1M) to set the fcache_fb_policy tunable to 1 or 2.
I/O throttling

Often, applications may write to the buffer/file cache faster than VxFS and the I/O subsystem can
flush the buffers. Flushing too many buffers/pages can cause huge disk I/O queues, which could
impact other critical I/O to the devices. To prevent too many buffers/pages from being flushed
simultaneously for a single file, 2 types of I/O throttling are provided: flush throttling and write
throttling.

[Figure 5. I/O throttling]
Flush throttling (max_diskq)

The amount of dirty data being flushed per file cannot exceed the max_diskq tunable. The process
performing a write() system call will skip the "flush behind" if the amount of outstanding I/O exceeds
max_diskq. However, when many buffers/pages of a file need to be flushed to disk, if the amount
of outstanding I/O exceeds the max_diskq value, the process flushing the data will wait 20
milliseconds before checking whether the amount of outstanding I/O has dropped below the max_diskq
value. Flush throttling can severely degrade asynchronous write performance. For example, when
using the default max_diskq value of 1 MB, the 1 MB of I/O may complete in 5 milliseconds, leaving
the device idle for 15 milliseconds before the next check of the outstanding I/O against
max_diskq. For most file systems, max_diskq should be increased to a large value (such as 1 GB),
especially for file systems mounted on cached disk arrays.

While the default max_diskq value is 1 MB, it cannot be set lower than write_pref_io * 4.
Note<br />
For best asynchronous write <strong>performance</strong>, tune max_diskq to a large value<br />
such as 1 GB.<br />
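As an illustrative fragment, max_diskq can be changed online with vxtunefs(1M); the mount point and device below are placeholders:<br />

```shell
# Raise the per-file flush limit to 1 GB on a mounted VxFS file system.
vxtunefs -o max_diskq=1073741824 /oracledb

# To make the setting persistent across mounts, an entry can be added to
# /etc/vx/tunefstab, for example:
#   /dev/vgora/lvol1  max_diskq=1073741824
```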
Write throttling (write_throttle)<br />
The amount of dirty data (unflushed) per file cannot exceed write_throttle. If a process tries to perform<br />
a write() operation <strong>and</strong> the amount of dirty data exceeds the write throttle amount, then the process<br />
will wait until some of the dirty data has been flushed. The default value for write_throttle is 0 (no<br />
throttling).<br />
Read flush behind<br />
<strong>VxFS</strong> introduced a new feature on 11i v3 called read flush behind. When the number of free pages<br />
in the file cache is low (which is a common occurrence) <strong>and</strong> a process is reading a file sequentially,<br />
pages which have been previously read are freed from the file cache <strong>and</strong> reused.<br />
Figure 6. Read flush behind<br />
[Figure: a sequential read with read ahead, where 64 KB pages already read are freed behind the reader (read flush behind), shown alongside a sequential write with flush behind]<br />
The read flush behind feature has the advantage of preventing a read of a large file (such as a<br />
backup, file copy, gzip, etc) from consuming large amounts of file cache. However, the disadvantage<br />
of read flush behind is that a file may not be able to reside entirely in cache. For example, if the file<br />
cache is 6 GB in size <strong>and</strong> a 100 MB file is read sequentially, the file will likely not reside entirely in<br />
cache. If the file is re-read multiple times, the data would need to be read from disk instead of the<br />
reads being satisfied from the file cache.<br />
The read flush behind feature cannot be disabled directly. However, it can be tuned indirectly by<br />
setting the <strong>VxFS</strong> flush behind size using write_pref_io <strong>and</strong> write_nstream. For example, increasing<br />
write_pref_io to 200 MB would allow the 100 MB file to be read into cache in its entirety. Flush<br />
behind is not impacted if fcache_fb_policy is set to 0 (the default); however, flush behind can be<br />
impacted if fcache_fb_policy is set to 1 or 2.<br />
The read flush behind can also impact <strong>VxFS</strong> read ahead if the <strong>VxFS</strong> read ahead size is greater than<br />
the write flush behind size. For best read ahead <strong>performance</strong>, the write_pref_io <strong>and</strong> write_nstream<br />
values should be greater than or equal to read_pref_io <strong>and</strong> read_nstream values.<br />
Note<br />
For best read ahead <strong>performance</strong>, the write_pref_io <strong>and</strong> write_nstream<br />
values should be greater than or equal to read_pref_io <strong>and</strong> read_nstream<br />
values.<br />
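A sketch of applying this sizing rule with vxtunefs(1M); the mount point and values are illustrative:<br />

```shell
# Show the current I/O sizing tunables for the file system.
vxtunefs -p /data

# Keep the flush behind window (write_pref_io * write_nstream) at least as
# large as the read ahead window (read_pref_io * read_nstream).
vxtunefs -o read_pref_io=262144 /data
vxtunefs -o read_nstream=4 /data
vxtunefs -o write_pref_io=262144 /data
vxtunefs -o write_nstream=4 /data
```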
Buffer sizes on <strong>HP</strong>-<strong>UX</strong> 11i v2 <strong>and</strong> earlier<br />
On <strong>HP</strong>-<strong>UX</strong> 11i v2 <strong>and</strong> earlier, cached I/O is performed through the <strong>HP</strong>-<strong>UX</strong> buffer cache. The<br />
maximum size of each buffer in the cache is determined by the file system tunable max_buf_data_size<br />
(default 8k). The only other value possible is 64k. For reads <strong>and</strong> writes larger than<br />
max_buf_data_size <strong>and</strong> when performing read ahead, <strong>VxFS</strong> will chain the buffers together when<br />
sending them to the Volume Management subsystem. The Volume Management subsystem holds the<br />
buffers until <strong>VxFS</strong> sends the last buffer in the chain, then it attempts to merge the buffers together in<br />
order to perform larger <strong>and</strong> fewer physical I/Os.<br />
Figure 7. Buffer merging<br />
[Figure: a chain of 8 KB buffers, each linked to the next with a "more" flag, merged by the volume management subsystem into a single 64 KB physical I/O]<br />
The maximum chain size that the operating system can use is 64 KB. Merged buffers cannot cross a<br />
“chain boundary”, which is 64 KB. This means that if the series of buffers to be merged crosses a 64<br />
KB boundary, then the merging will be split into a maximum of 2 buffers, each less than 64 KB in<br />
size.<br />
If your application performs writes to a file or performs r<strong>and</strong>om reads, do not increase<br />
max_buf_data_size to 64k unless the size of the read or write is 64k or greater. Smaller writes cause<br />
the data to be read into the buffer synchronously first, then the modified portion of the data will be<br />
updated in the buffer cache, then the buffer will be flushed asynchronously. The synchronous read of<br />
the data will drastically reduce <strong>performance</strong> if the buffer is not already in the cache. For r<strong>and</strong>om<br />
reads, <strong>VxFS</strong> will attempt to read an entire buffer, causing more data to be read than necessary.<br />
Note that buffer merging was not implemented initially on IA-64 systems with VxFS 3.5 on 11i v2, so<br />
using the default max_buf_data_size of 8 KB would result in a maximum physical I/O size of 8 KB.<br />
Buffer merging was added in 11i v2 0409, released in the fall of 2004.<br />
Page sizes on <strong>HP</strong>-<strong>UX</strong> 11i v3 <strong>and</strong> later<br />
On HP-UX 11i v3 and later, VxFS performs cached I/O through the Unified File Cache (UFC). The UFC is<br />
page based, and the max_buf_data_size tunable has no effect on 11i v3. The default page size is 4<br />
KB. However, when larger reads or read ahead is performed, the UFC will attempt to use contiguous<br />
4 KB pages. For example, when performing sequential reads, the read_pref_io value will be used to<br />
allocate the consecutive 4K pages. Note that the “buffer” size is no longer limited to 8 KB or 64 KB<br />
with the UFC, eliminating the buffer merging overhead caused by chaining a number of 8 KB buffers.<br />
Also, when doing small r<strong>and</strong>om I/O, <strong>VxFS</strong> can perform 4 KB page requests instead of 8 KB buffer<br />
requests. The page request has the advantage of returning less data, thus reducing the transfer size.<br />
However, if the application still needs to read the other 4 KB, then a second I/O request would be<br />
needed which would decrease <strong>performance</strong>. So small r<strong>and</strong>om I/O <strong>performance</strong> may be worse on<br />
11i v3 if cache hits would have occurred had 8 KB of data been read instead of 4 KB. The only<br />
solution is to increase the base_pagesize(5) tunable from the default value of 4 to 8 or 16. Careful testing<br />
should be performed when changing base_pagesize from the default value, as some applications<br />
assume the page size will be the default 4 KB.<br />
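As a hedged sketch, the base page size can be inspected and changed with kctune(1M); base_pagesize is not dynamic, so the new value takes effect only after a reboot:<br />

```shell
# Show the current base page size in KB (default 4).
kctune base_pagesize

# Raise it to 16 KB; the new value applies at the next boot.
kctune base_pagesize=16
```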
Direct I/O<br />
If OnlineJFS is installed and licensed, direct I/O can be enabled in several ways:<br />
Mount option “-o mincache=direct”<br />
Mount option “-o convosync=direct”<br />
Use VX_DIRECT cache advisory of the VX_SETCACHE ioctl() system call<br />
Discovered Direct I/O<br />
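For example, the mount options above might be combined as follows; the device and mount point are placeholders:<br />

```shell
# Enable direct I/O for both asynchronous I/O (mincache) and O_SYNC/O_DSYNC
# I/O (convosync) on a VxFS file system.
mount -F vxfs -o delaylog,mincache=direct,convosync=direct /dev/vg01/lvol1 /data
```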
Direct I/O has several advantages:<br />
Data accessed only once does not benefit from being cached<br />
Direct I/O data does not disturb other data in the cache<br />
Data integrity is enhanced since disk I/O must complete before the read or write system call<br />
completes (I/O is synchronous)<br />
Direct I/O can perform larger physical I/O than buffered or cached I/O, which reduces the total<br />
number of I/Os for large read or write operations<br />
However, direct I/O also has its disadvantages:<br />
Direct I/O reads cannot benefit from <strong>VxFS</strong> read ahead algorithms<br />
All physical I/O is synchronous, thus each write must complete the physical I/O before the system<br />
call returns<br />
Direct I/O <strong>performance</strong> degrades when the I/O request is not properly aligned<br />
Mixing buffered/cached I/O <strong>and</strong> direct I/O can degrade <strong>performance</strong><br />
Direct I/O works best for large I/Os that are accessed once <strong>and</strong> when data integrity for writes is<br />
more important than <strong>performance</strong>.
Note<br />
The OnlineJFS license is required to perform direct I/O when using <strong>VxFS</strong><br />
5.0 or earlier. Beginning with <strong>VxFS</strong> 5.0.1, direct I/O is available with the<br />
Base JFS product.<br />
Discovered direct I/O<br />
With <strong>HP</strong> OnLineJFS, direct I/O will be enabled if the read or write size is greater than or equal to the<br />
file system tunable discovered_direct_iosz. The default discovered_direct_iosz is 256 KB. As with<br />
direct I/O, all discovered direct I/O will be synchronous. Read ahead <strong>and</strong> flush behind will not be<br />
performed with discovered direct I/O.<br />
If read ahead and flush behind are desired for large reads and writes, or the data needs to be reused<br />
from the buffer/file cache, then discovered_direct_iosz can be increased as needed.<br />
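An illustrative vxtunefs(1M) fragment; the mount point and value are placeholders:<br />

```shell
# Raise the discovered direct I/O threshold from the default 256 KB to 1 MB so
# that reads and writes smaller than 1 MB still use the cache and read ahead.
vxtunefs -o discovered_direct_iosz=1048576 /data
```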
Direct I/O <strong>and</strong> unaligned data<br />
When performing Direct I/O, alignment is very important. Direct I/O requests that are not properly<br />
aligned have the unaligned portions of the request managed through the buffer or file cache.<br />
For writes, the unaligned portions must be read from disk into the cache first, <strong>and</strong> then the data in the<br />
cache is modified <strong>and</strong> written out to disk. This read-before-write behavior can dramatically increase<br />
the times for the write requests.<br />
Note<br />
The best direct I/O <strong>performance</strong> occurs when the logical requests are<br />
properly aligned. Prior to <strong>VxFS</strong> 5.0 on 11i v3, the logical requests need to<br />
be aligned on file system block boundaries. Beginning with <strong>VxFS</strong> 5.0 on<br />
11i v3, logical requests need to be aligned on a device block boundary (1<br />
KB).<br />
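The alignment rule can be illustrated with a small, hypothetical shell helper (not part of VxFS); it simply checks whether a request's file offset and length are multiples of a boundary in bytes:<br />

```shell
# Hypothetical helper: succeeds (exit 0) only when both the file offset and
# the request length are multiples of the given boundary in bytes.
is_aligned() {
  offset=$1; length=$2; boundary=$3
  [ $((offset % boundary)) -eq 0 ] && [ $((length % boundary)) -eq 0 ]
}

# VxFS 5.0 on 11i v3: device block (1 KB) alignment is sufficient.
is_aligned 8192 16384 1024 && echo "fully direct I/O"

# Prior to VxFS 5.0 on 11i v3, the request must align on the file system
# block size (8 KB here), so this request is partly handled through the cache.
is_aligned 4096 16384 8192 || echo "unaligned portions are buffered"
```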
Direct I/O on <strong>HP</strong>-<strong>UX</strong> 11.23<br />
Unaligned data is still buffered using 8 KB buffers (default size), although the I/O is still done<br />
synchronously. For all VxFS versions on HP-UX 11i v2 and earlier, direct I/O performs well when the<br />
request is aligned on a file system block boundary and the entire request can bypass the buffer<br />
cache. However, requests smaller than the file system block size or requests that do not align on a file<br />
system block boundary will use the buffer cache for the unaligned portions of the request.<br />
For large transfers greater than the block size, part of the data can still be transferred using direct<br />
I/O. For example, consider a 16 KB write request on a file system using an 8 KB block where the<br />
request does not start on a file system block boundary (the data starts on a 4 KB boundary in this<br />
case). Since the first 4 KB is unaligned, the data must be buffered. Thus an 8 KB buffer is allocated,<br />
<strong>and</strong> the entire 8 KB of data is read in, thus 4 KB of unnecessary data is read in. This behavior is<br />
repeated for the last 4 KB of data, since it is also not properly aligned. The two 8 KB buffers are<br />
modified <strong>and</strong> written to the disk along with the 8 KB direct I/O for the aligned portion of the request.<br />
In all, the operation took 5 synchronous physical I/Os, 2 reads <strong>and</strong> 3 writes. If the file system is recreated<br />
to use a 4 KB block size or smaller, the I/O would be aligned <strong>and</strong> could be performed in a<br />
single 16 KB physical I/O.<br />
Figure 8. Example of unaligned direct I/O<br />
[Figure: a 16 KB direct I/O write starting on a 4 KB boundary. The first and last 4 KB fall within 8 KB file system blocks and are handled as buffered I/O; only the middle 8 KB is transferred with direct I/O]<br />
Using a smaller block size, such as 1 KB, will improve chances of doing more optimal direct I/O.<br />
However, even with a 1 KB file system block size, unaligned I/Os can occur.<br />
Also, when doing direct I/O writes, a file's buffers must be searched to locate any buffers that<br />
overlap with the direct I/O request. If any overlapping buffers are found, those buffers are<br />
invalidated. Invalidating overlapping buffers is required to maintain consistency of the data in the<br />
cache and on disk: a direct I/O write puts newer data on the disk, so the data in the<br />
cache would be stale unless it is invalidated. If the number of buffers for a file is large, then the<br />
process may spend a lot of time searching the file's list of buffers before each direct I/O is performed.<br />
The invalidating of buffers when doing direct I/O writes can degrade performance so severely that<br />
mixing buffered and direct I/O in the same file is highly discouraged.<br />
Mixing buffered I/O <strong>and</strong> direct I/O can occur due to discovered direct I/O, unaligned direct I/O, or<br />
if there is a mix of synchronous <strong>and</strong> asynchronous operations <strong>and</strong> the mincache <strong>and</strong> convosync<br />
options are not set to the same value.<br />
Note<br />
Mixing buffered <strong>and</strong> direct I/O on the same file can cause severe<br />
<strong>performance</strong> degradation on <strong>HP</strong>-<strong>UX</strong> 11i v2 or earlier systems.<br />
Direct I/O on <strong>HP</strong>-<strong>UX</strong> 11.31 with <strong>VxFS</strong> 4.1<br />
<strong>HP</strong>-<strong>UX</strong> 11.31 introduced the Unified File Cache (UFC). The UFC is similar to the <strong>HP</strong>-<strong>UX</strong> buffer cache in<br />
function, but is 4 KB page based as opposed to 8 KB buffer based. The alignment issues are similar<br />
to those on HP-UX 11.23 and earlier, although it may only be necessary to read in a single 4 KB page, as<br />
opposed to an 8 KB buffer.<br />
Also, the mixing of cached I/O and direct I/O is no longer a performance issue, as the UFC<br />
implements a more efficient algorithm for locating pages in the UFC that overlap with the direct I/O<br />
request.<br />
Direct I/O on <strong>HP</strong>-<strong>UX</strong> 11.31 with <strong>VxFS</strong> 5.0 <strong>and</strong> later<br />
<strong>VxFS</strong> 5.0 on <strong>HP</strong>-<strong>UX</strong> 11i v3 introduced an enhancement that no longer requires direct I/O to be<br />
aligned on a file system block boundary. Instead, the direct I/O only needs to be aligned on a device<br />
block boundary, which is a 1 KB boundary. File systems with a 1 KB file system block size are not<br />
impacted, but it is no longer necessary to recreate file systems with a 1 KB file system block size if the<br />
application is doing direct I/O requests aligned on a 1 KB boundary.
Concurrent I/O<br />
The main problem addressed by concurrent I/O is <strong>VxFS</strong> inode lock contention. During a read() system<br />
call, <strong>VxFS</strong> will acquire the inode lock in shared mode, allowing many processes to read a single file<br />
concurrently without lock contention. However, when a write() system call is made, <strong>VxFS</strong> will attempt<br />
to acquire the lock in exclusive mode. The exclusive lock allows only one write per file to be in<br />
progress at a time, <strong>and</strong> also blocks other processes reading the file. These locks on the <strong>VxFS</strong> inode<br />
can cause serious <strong>performance</strong> problems when there are one or more file writers <strong>and</strong> multiple file<br />
readers.<br />
Figure 9. Normal shared read locks <strong>and</strong> exclusive write locks<br />
In the example above, Process A <strong>and</strong> Process B both have shared locks on the file when Process C<br />
requests an exclusive lock. Once the request for an exclusive lock is in place, other shared read lock<br />
requests must also block, otherwise the writer may be starved for the lock.<br />
With concurrent I/O, both read() <strong>and</strong> write() system calls will request a shared lock on the inode,<br />
allowing all read() <strong>and</strong> write() system calls to take place concurrently. Therefore, concurrent I/O<br />
will eliminate most of the <strong>VxFS</strong> inode lock contention.<br />
Figure 10. Locking with concurrent I/O (cio)<br />
To enable a file system for concurrent I/O, the file system simply needs to be mounted with the cio<br />
mount option, for example:<br />
# mount -F vxfs -o cio,delaylog /dev/vgora5/lvol1 /oracledb/s05<br />
Concurrent I/O was introduced with <strong>VxFS</strong> 3.5. However, a separate license was needed to enable<br />
concurrent I/O. With the introduction of <strong>VxFS</strong> 5.0.1 on <strong>HP</strong>-<strong>UX</strong> 11i v3, the concurrent I/O feature of<br />
<strong>VxFS</strong> is now available with the OnlineJFS license.<br />
While concurrent I/O is indeed a great solution for alleviating VxFS inode lock contention,<br />
some major caveats exist that anyone planning to use concurrent I/O should be aware of.<br />
Concurrent I/O converts read <strong>and</strong> write operations to direct I/O<br />
Some applications benefit greatly from the use of cached I/O, relying on cache hits to reduce the<br />
amount of physical I/O or <strong>VxFS</strong> read ahead to prefetch data <strong>and</strong> reduce the read time. Concurrent<br />
I/O converts most read <strong>and</strong> write operations to direct I/O, which bypasses the buffer/file cache.<br />
If a file system is mounted with the cio mount option, then mount options mincache=direct <strong>and</strong><br />
convosync=direct are implied. If the mincache=direct <strong>and</strong> convosync=direct options are used, they<br />
should have no impact if the cio option is used as well.<br />
Since concurrent I/O converts read <strong>and</strong> write operations to direct I/O, all alignment constraints for<br />
direct I/O also apply to concurrent I/O.<br />
Certain operations still take exclusive locks<br />
Some operations still need to take an exclusive lock when performing I/O <strong>and</strong> can negate the impact<br />
of concurrent I/O. The following is a list of operations that still need to obtain an exclusive lock.<br />
1. fsync()<br />
2. writes that span multiple file extents<br />
3. writes when request is not aligned on a 1 KB boundary<br />
4. extending writes<br />
5. allocating writes to “holes” in a sparse file<br />
6. writes to Zero Filled On Dem<strong>and</strong> (ZFOD) extents
Zero Filled On Dem<strong>and</strong> (ZFOD) extents are new with <strong>VxFS</strong> 5.0 <strong>and</strong> are created by the VX_SETEXT<br />
ioctl() with the VX_GROWFILE allocation flag, or with setext(1M) growfile option.<br />
Not all applications support concurrent I/O<br />
By using a shared lock for write operations, concurrent I/O breaks some POSIX standards. If two<br />
processes are writing to the same block, there is no coordination between the processes, and data written<br />
by one process can be overwritten by the other. Before implementing concurrent<br />
I/O, the application vendor should be consulted to be sure concurrent I/O is supported with the<br />
application.<br />
Performance improvements with concurrent I/O vary<br />
Concurrent I/O benefits <strong>performance</strong> the most when there are multiple processes reading a single file<br />
<strong>and</strong> one or more processes writing to the same file. The more writes performed to a file with multiple<br />
readers, the greater the <strong>VxFS</strong> inode lock contention <strong>and</strong> the more likely that concurrent I/O will<br />
benefit.<br />
A file system must be un-mounted to remove the cio mount option<br />
A file system may be remounted with the cio mount option ("mount -o remount,cio /fs"); however to<br />
remove the cio option, the file system must be completely unmounted using umount(1M).<br />
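For example; the device and mount point are placeholders:<br />

```shell
# Adding cio can be done online with a remount.
mount -F vxfs -o remount,cio,delaylog /dev/vg01/lvol1 /data

# Removing cio requires a full unmount and mount.
umount /data
mount -F vxfs -o delaylog /dev/vg01/lvol1 /data
```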
Oracle Disk Manager<br />
Oracle Disk Manager (ODM) is an additionally licensed feature specifically for use with Oracle<br />
database environments. If the oracle binary is linked with the ODM library, Oracle will make an<br />
ioctl() call to the ODM driver to initiate multiple I/O requests. The I/O requests will be issued in<br />
parallel <strong>and</strong> will be asynchronous provided the disk_asynch_io value is set to 1 in the init.ora file.<br />
Later, Oracle will make another ioctl() call to process the I/O completions.<br />
Figure 11. Oracle Disk Manager (ODM)<br />
ODM provides near raw device asynchronous I/O access. It also eliminates the <strong>VxFS</strong> inode lock<br />
contention (similar to concurrent I/O).<br />
Creating your file system<br />
When you create a file system using newfs(1M) or mkfs(1M), you need to be aware of several options<br />
that could affect <strong>performance</strong>.<br />
Block size<br />
The block size (bsize) is the smallest amount of space that can be allocated to a file extent. Most<br />
applications will perform better with an 8 KB block size. Extent allocations are easier <strong>and</strong> file systems<br />
with an 8 KB block size are less likely to be impacted by fragmentation since each extent would have<br />
a minimum size of 8 KB. However, using an 8 KB block size could waste space, as a file with 1 byte<br />
of data would take up 8 KB of disk space.<br />
The default block size for a file system scales depending on the size of the file system when the file<br />
system is created. The fstyp(1M) command can be used to verify the block size of a specific file system<br />
using the f_frsize field. The f_bsize can be ignored for <strong>VxFS</strong> file systems as the value will always be<br />
8192.<br />
# fstyp -v /dev/vg00/lvol3<br />
vxfs<br />
version: 6<br />
f_bsize: 8192<br />
f_frsize: 8192<br />
The following table documents the default block sizes for the various <strong>VxFS</strong> versions beginning with<br />
<strong>VxFS</strong> 3.5 on 11.23:<br />
Table 2. Default FS block sizes<br />
FS Size DLV 4 DLV 5 >= DLV6<br />
Table 3. Default intent log size<br />
FS Size <strong>VxFS</strong> 3.5 or 4.1 or DLV = 6<br />
Note<br />
To improve your large directory <strong>performance</strong>, upgrade to <strong>VxFS</strong> 5.0 or later<br />
<strong>and</strong> run vxupgrade to upgrade the disk layout version to version 7.<br />
Mount options<br />
blkclear<br />
When new extents are allocated to a file, extents will contain whatever was last written on disk until<br />
new data overwrites the old uninitialized data. Accessing the uninitialized data could cause a security<br />
problem as sensitive data may remain in the extent. The newly allocated extents could be initialized to<br />
zeros if the blkclear option is used. In this case, <strong>performance</strong> is traded for data security.<br />
When writing to a “hole” in a sparse file, the newly allocated extent must be cleared first, before<br />
writing the data. The blkclear option does not need to be used to clear the newly allocated extents in<br />
a sparse file. This clearing is done automatically to prevent stale data from showing up in a sparse<br />
file.<br />
Note<br />
For best <strong>performance</strong>, do not use the blkclear mount option.<br />
datainlog, nodatainlog<br />
Figure 12. Datainlog mount option<br />
[Figure: with datainlog, the inode update (I) and the small synchronous write's data (D) are first written together to the intent log in memory; later, the data is written to its actual location in the file system]<br />
When using HP OnlineJFS, small (8 KB or less) synchronous writes can have both the inode update and the write data recorded in the intent log. The write is considered complete once the intent log is flushed, and the data is written to its actual location in the file system later.<br />
Using datainlog has no effect on normal asynchronous writes or synchronous writes performed with<br />
direct I/O. The option nodatainlog is the default for systems without <strong>HP</strong> OnlineJFS, while datainlog is<br />
the default for systems that do have <strong>HP</strong> OnlineJFS.<br />
mincache<br />
By default, I/O operations are cached in memory using the <strong>HP</strong>-<strong>UX</strong> buffer cache or Unified File Cache<br />
(UFC), which allows for asynchronous operations such as read ahead <strong>and</strong> flush behind. The mincache<br />
options convert cached asynchronous operations so they behave differently. Normal synchronous<br />
operations are not affected.<br />
The mincache option closesync flushes a file's dirty buffers/pages when the file is closed. The close()<br />
system call may experience some delays while the dirty data is flushed, but the integrity of closed files<br />
is not compromised by a system crash.<br />
The mincache option dsync converts asynchronous I/O to synchronous I/O. For best <strong>performance</strong> the<br />
mincache=dsync option should not be used. This option should be used only if the data must be on disk<br />
when the write() system call returns.<br />
The mincache options direct <strong>and</strong> unbuffered are similar to the dsync option, but all I/Os are<br />
converted to direct I/O. For best <strong>performance</strong>, mincache=direct should not be used unless I/O is<br />
large <strong>and</strong> expected to be accessed only once, or synchronous I/O is desired. All direct I/O is<br />
synchronous <strong>and</strong> read ahead <strong>and</strong> flush behind are not performed.<br />
The mincache option tmpcache provides the best performance, but less data integrity in the event of a<br />
system crash. The tmpcache option does not flush a file's buffers when it is closed, so a risk exists of<br />
losing data or getting garbage data in a file if the data has not been flushed to disk before a system<br />
crash.<br />
All the mincache options except for closesync require the <strong>HP</strong> OnlineJFS product when using <strong>VxFS</strong> 5.0<br />
or earlier. The closesync <strong>and</strong> direct options are available with the Base JFS product on <strong>VxFS</strong> 5.0.1<br />
<strong>and</strong> above.<br />
convosync<br />
Applications can implement synchronous I/O by specifying the O_SYNC or O_DSYNC flags during<br />
the open() system call. The convosync options convert synchronous operations so they behave<br />
differently. Normal asynchronous operations are not affected when using the convosync mount option.<br />
The convosync option closesync converts O_SYNC operations to asynchronous operations. The<br />
file's dirty buffers/pages are then flushed when the file is closed. While this option speeds up<br />
applications that use O_SYNC, it may be harmful for applications that rely on data being written to<br />
disk before the write() system call completes.<br />
The convosync option delay converts all O_SYNC operations to asynchronous operations. This option is<br />
similar to convosync=closesync except that the file's buffers/pages are not flushed when the file is<br />
closed. While this option will improve performance, applications that rely on the O_SYNC behavior<br />
may fail after a system crash.<br />
The convosync option dsync converts O_SYNC operations to O_DSYNC operations. Data writes will<br />
continue to be synchronous, but the associated inode time updates will be performed asynchronously.<br />
The convosync options direct <strong>and</strong> unbuffered cause O_SYNC <strong>and</strong> O_DSYNC operations to bypass<br />
buffer/file cache <strong>and</strong> perform direct I/O. These options are similar to the dsync option as the inode<br />
time update will be delayed.<br />
All of the convosync options require the <strong>HP</strong> OnlineJFS product when using <strong>VxFS</strong> 5.0 or earlier. The<br />
direct option is available with the Base JFS product on <strong>VxFS</strong> 5.0.1.<br />
21
22<br />
cio<br />
The cio option enables the file system for concurrent I/O.<br />
Prior to <strong>VxFS</strong> 5.0.1, a separate license was needed to use the concurrent I/O feature. Beginning with<br />
<strong>VxFS</strong> 5.0.1, concurrent I/O is available with the OnlineJFS license. Concurrent I/O is recommended<br />
with applications which support its use.<br />
remount<br />
The remount option allows for a file system to be remounted online with different mount options. The<br />
mount -o remount can be done online without taking down the application accessing the file system.<br />
During the remount, all file I/O is flushed to disk <strong>and</strong> subsequent I/O operations are frozen until the<br />
remount is complete. The remount option can result in a stall for some operations. Applications that<br />
are sensitive to timeouts, such as Oracle Clusterware, should avoid having the file systems remounted<br />
online.<br />
Intent log options<br />
There are three levels of transaction logging:<br />
tmplog – Most transaction flushes to the Intent Log are delayed, so some recent changes to the file<br />
system will be lost in the event of a system crash.<br />
delaylog – Some transaction flushes are delayed.<br />
log – Almost all structural changes are logged before the system call returns to the application.<br />
The tmplog, delaylog, and log options all guarantee the structural integrity of the file system in the event of a<br />
system crash by replaying the Intent Log during fsck. However, depending on the level of logging,<br />
some recent file system changes may be lost in the event of a system crash.<br />
There are some common misunderstandings regarding the log levels. For read() and write() system<br />
calls, the log levels have no effect. For asynchronous write() system calls, the log flush is always<br />
delayed until after the system call is complete. The log flush will be performed when the actual user<br />
data is written to disk. For synchronous write() system calls, the log flush is always performed prior to<br />
the completion of the write() system call. Outside of changing the file's access timestamp in the inode,<br />
a read() system call makes no structural changes and thus does not log any information.<br />
The following table identifies which operations cause the Intent Log to be written to disk synchronously<br />
(flushed), or if the flushing of the Intent Log is delayed until sometime after the system call is complete<br />
(delayed).
Table 4. Intent log flush behavior with <strong>VxFS</strong> 3.5 or above<br />
Operation log delaylog tmplog<br />
Async Write Delayed Delayed Delayed<br />
Sync Write Flushed Flushed Flushed<br />
Read n/a n/a n/a<br />
Sync (fsync()) Flushed Flushed Flushed<br />
File Creation Flushed Delayed Delayed<br />
File Removal Flushed Delayed Delayed<br />
File Timestamp changes Flushed Delayed Delayed<br />
Directory Creation Flushed Delayed Delayed<br />
Directory Removal Flushed Delayed Delayed<br />
Symbolic/Hard Link Creation Flushed Delayed Delayed<br />
File/Directory Renaming Flushed Flushed Delayed<br />
Most transactions that are delayed with the delaylog or tmplog options include file and directory creation, creation of hard links or symbolic links, and inode time changes (for example, via touch(1) or the utime() system call). The major difference between delaylog and tmplog is how file renaming is performed. With the log option, most transactions are flushed to the Intent Log before the operation is performed.
Note that using the log mount option has little to no impact on file system performance unless there are large numbers of file create/delete/rename operations. For example, removing a directory containing thousands of files will likely run faster with the delaylog mount option than with the log mount option, since the flushing of the Intent Log is delayed until after each file removal.
Also, using the delaylog mount option has little or no impact on data integrity, since the log level does not affect the read() or write() system calls. If data integrity is required, use synchronous writes, as they are guaranteed to survive a system crash regardless of the log level.
The logiosize option can be used to increase the throughput of the transaction log flush by writing up to 4 KB of the Intent Log at a time.
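As an illustrative sketch (the device and mount-point names are hypothetical, and the exact option syntax should be confirmed against mount_vxfs(1M)), the log level and logiosize are selected at mount time, and an already-mounted file system can be changed online with the remount option:

```shell
# Mount with full logging for maximum metadata safety
mount -F vxfs -o log /dev/vg01/lvol1 /data

# Switch the mounted file system to delaylog, flushing the log in 4 KB chunks
mount -F vxfs -o remount,delaylog,logiosize=4096 /dev/vg01/lvol1 /data
```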
The tranflush option causes all metadata updates (such as inodes, bitmaps, and so on) to be flushed before returning from a system call. Using this option will negatively affect performance, as all metadata updates are done synchronously.
noatime
Each time a file is accessed, its access time (atime) is updated. This timestamp is modified only in memory and is periodically flushed to disk by vxfsd. As a result, using noatime has virtually no impact on performance.
nomtime<br />
The nomtime option is used only in cluster file systems to delay updating the inode modification time. This option should be used only in a cluster environment where the inode modification timestamp does not have to be up to date.
qio<br />
The qio option enables Veritas Quick I/O for Oracle database.<br />
Dynamic file system tunables<br />
File system <strong>performance</strong> can be impacted by a number of dynamic file system tunables. These values<br />
can be changed online using the vxtunefs(1M) comm<strong>and</strong>, or they can be set when the file system is<br />
mounted by placing the values in the /etc/vx/tunefstab file (see tunefstab(4)).<br />
Dynamic tunables that are not mentioned below should be left at the default.<br />
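For illustration (the device path, mount point, and value below are hypothetical), a tunable can be changed online with vxtunefs(1M), or set at mount time through /etc/vx/tunefstab:

```shell
# Change a tunable online for a mounted file system
vxtunefs -s -o max_diskq=1073741824 /data

# Display the current tunables for a file system
vxtunefs /data
```

A persistent equivalent in /etc/vx/tunefstab (see tunefstab(4)) might look like:

```
/dev/vg01/lvol1  max_diskq=1073741824
```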
read_pref_io <strong>and</strong> read_nstream<br />
The read_pref_io <strong>and</strong> read_nstream tunables are used to configure read ahead through the buffer/file<br />
cache. These tunables have no impact with direct I/O or concurrent I/O. The default values are good<br />
for most implementations. However, read_pref_io should be configured to be greater than or equal to<br />
the most common application read size for sequential reads (excluding reads that are greater than or<br />
equal to the discovered_direct_iosz).<br />
Increasing the read_pref_io and read_nstream values can be helpful in environments that can handle the increased I/O load, such as those with multiple Fibre Channel paths to the SAN providing increased I/O bandwidth. However, use caution when increasing these values, as performance can degrade if too much read ahead is done.
write_pref_io <strong>and</strong> write_nstream<br />
The write_pref_io <strong>and</strong> write_nstream tunables are used to configure flush behind. These tunables have<br />
no impact with direct I/O or concurrent I/O. On <strong>HP</strong>-<strong>UX</strong> 11i v3, flush behind is disabled by default,<br />
but can be enabled with the system wide tunable fcache_fb_policy. On 11i v3, write_pref_io <strong>and</strong><br />
write_nstream should be set to be greater than or equal to read_pref_io <strong>and</strong> read_nstream,<br />
respectively.<br />
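As a hedged example (the values and mount point are illustrative; the right preferred I/O size depends on the application's dominant sequential I/O size), read ahead and flush behind can be aligned on 11i v3 with the write values kept at or above the read values:

```shell
# 256 KB preferred I/O size and 4 streams for both read ahead and flush behind
vxtunefs -s -o read_pref_io=262144,read_nstream=4 /data
vxtunefs -s -o write_pref_io=262144,write_nstream=4 /data
```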
max_buf_data_size<br />
The max_buf_data_size is only valid for <strong>HP</strong>-<strong>UX</strong> 11i v2 or earlier. This tunable does not impact direct<br />
I/O or concurrent I/O. The only possible values are 8 (default) <strong>and</strong> 64. The default is good for most<br />
file systems, although file systems that are accessed with only sequential I/O will perform better if<br />
max_buf_data_size is set to 64. File systems should avoid setting max_buf_data_size to 64 if r<strong>and</strong>om<br />
I/Os are performed.<br />
Note<br />
Only tune max_buf_data_size to 64 if all file access in a file system will be<br />
sequential.<br />
max_direct_iosz<br />
The max_direct_iosz tunable controls the maximum physical I/O size that can be issued with direct I/O or concurrent I/O. The default value is 1 MB and is good for most file systems. If a direct I/O read or write request is greater than max_direct_iosz, the request is split into smaller requests of at most max_direct_iosz bytes and issued serially, which can negatively affect the performance of the direct I/O request. Read or write requests greater than 1 MB may work better if max_direct_iosz is configured to be larger than the largest read or write request.
read_ahead
The default read_ahead value is 1 and is sufficient for most file systems. Some file systems with non-sequential access patterns may work best if enhanced read ahead is enabled by setting read_ahead to 2. Setting read_ahead to 0 disables read ahead. This tunable does not affect direct I/O or concurrent I/O.
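For example (the mount point is illustrative), enhanced read ahead can be enabled online:

```shell
# Enable enhanced read ahead for non-sequential but patterned access
vxtunefs -s -o read_ahead=2 /data
```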
discovered_direct_iosz<br />
Read <strong>and</strong> write requests greater than or equal to discovered_direct_iosz are performed as direct I/O.<br />
This tunable has no impact if the file system is already mounted for direct I/O or concurrent I/O. The<br />
discovered_direct_iosz value should be increased if data needs to be cached for large read <strong>and</strong> write<br />
requests, or if read ahead or flush behind is needed for large read <strong>and</strong> write requests.<br />
max_diskq<br />
The max_diskq tunable controls how much data from a file can be flushed to disk at one time. The default value is 1 MB, and the value must be greater than or equal to 4 * write_pref_io. To avoid delays when flushing dirty data from the cache, setting max_diskq to 1 GB is recommended, especially for cached storage arrays.
write_throttle<br />
The write_throttle tunable controls how much data from a file can be dirty at one time. The default<br />
value is 0, which means that write_throttle feature is disabled. The default value is recommended for<br />
most file systems.<br />
Extent allocation policies<br />
Since most applications use a write size of 8 KB or less, the first extent is often the smallest. If the file system holds mostly large files, increasing initial_extent_size can reduce file fragmentation by allowing the first extent allocation to be larger.
However, increasing the initial_extent_size may actually increase fragmentation if many small files<br />
(
The buffer cache grows quickly as new buffers are needed, but is slow to shrink: memory pressure must be present for the buffer cache to shrink.
Figure 13. <strong>HP</strong>-<strong>UX</strong> 11iv2 buffer cache tunables<br />
The buffer cache should be configured large enough to contain the most frequently accessed data.<br />
However, processes often read large files once (for example, during a file copy) causing more<br />
frequently accessed pages to be flushed or invalidated from the buffer cache.<br />
The advantage of the buffer cache is that frequently referenced data can be accessed through memory without requiring disk I/O. In addition, reads from disk and writes to disk can be performed asynchronously.
The disadvantage of buffer cache is that data in the buffer cache may be lost during a system failure.<br />
Also, all the buffers associated with the file system must be flushed to disk or invalidated when the file<br />
system is synchronized or unmounted. A large buffer cache can cause delays during these operations.<br />
If data needs to be accessed once without keeping it in the cache, options such as direct I/O, discovered direct I/O, or the VX_SETCACHE ioctl with the VX_NOREUSE option may be used. For example, rather than using cp(1) to copy a large file, try using dd(1) with a large block size as follows:
# dd if=srcfile of=destfile bs=1024k<br />
On <strong>HP</strong>-<strong>UX</strong> 11i v2 <strong>and</strong> earlier, cp(1) reads <strong>and</strong> writes data using 64 KB logical I/O. Using dd, data<br />
can be read <strong>and</strong> written using 256 KB I/Os or larger. The large I/O size will cause dd to engage the<br />
Discovered Direct I/O feature of <strong>HP</strong> OnlineJFS <strong>and</strong> the data will be transferred using large direct I/O.<br />
There are several advantages of using dd(1) over cp(1):
The transfer can be done using a larger I/O transfer size.
The buffer cache is bypassed, leaving other more important data in the cache.
Data is written synchronously instead of asynchronously, avoiding a large buildup of dirty buffers, which can cause long I/O queues or cause processes that call sync()/fsync() to hang temporarily.
Note<br />
On <strong>HP</strong>-<strong>UX</strong> 11i v3, the cp(1) comm<strong>and</strong> was changed to perform 1024 KB<br />
logical reads instead of 64 KB, which triggers discovered direct I/O similar<br />
to the dd(1) comm<strong>and</strong> mentioned above.
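The dd(1) approach above can be sketched with a small, self-contained demonstration using temporary files (on HP-UX the 1024k block size is what engages discovered direct I/O; on other systems the copy simply proceeds with large I/Os):

```shell
# Create a 4 MB source file, copy it with a large block size, and verify the copy
src=$(mktemp) && dst=$(mktemp)
dd if=/dev/zero of="$src" bs=1024k count=4 2>/dev/null
dd if="$src" of="$dst" bs=1024k 2>/dev/null
cmp -s "$src" "$dst" && echo "copy OK"
rm -f "$src" "$dst"
```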
Unified File Cache on <strong>HP</strong>-<strong>UX</strong> 11i v3<br />
filecache_min / filecache_max
The Unified File Cache provides a similar file caching function to the <strong>HP</strong>-<strong>UX</strong> buffer cache, but it is<br />
managed much differently. As mentioned earlier, the UFC is page-based versus buffer-based. <strong>HP</strong>-<strong>UX</strong><br />
11i v2 maintained separate caches, a buffer cache for st<strong>and</strong>ard file access, <strong>and</strong> a page cache for<br />
memory mapped file access. The UFC caches both normal file data as well as memory mapped file<br />
data, resulting in a unified file <strong>and</strong> page cache.<br />
The size of the UFC is managed through the kernel tunables filecache_min and filecache_max. Similar to dbc_max_pct, the default filecache_max value is very large (49% of physical memory), which can result in unnecessary memory pressure and paging of non-cache data. The filecache_min and filecache_max values should be evaluated based on how the system is used. For example, a system may have 128 GB of memory while the primary application is Oracle with the database files mounted for direct I/O; the UFC should then be configured to handle only the non-Oracle load, for example by setting filecache_max to 4 GB.
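As an illustrative sketch (the sizes are examples only; values are given to kctune(1M) in bytes), the UFC bounds can be set as follows:

```shell
# Cap the Unified File Cache at roughly 4 GB, with a 1 GB floor
kctune filecache_min=1073741824
kctune filecache_max=4294967296
```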
fcache_fb_policy<br />
The fcache_fb_policy tunable controls the behavior of the file system flush behind. The default value of<br />
0 disables file system flush behind. The value of 1 enables file system flush behind via the fb_daemon<br />
threads, <strong>and</strong> a value of 2 enables “inline” flush behind where the process that dirties the pages also<br />
initiates the asynchronous writes (similar to the behavior in 11i v2).<br />
<strong>VxFS</strong> metadata buffer cache<br />
vxfs_bc_bufhwm<br />
File system metadata is the structural information in the file system and includes the superblock, inodes, directory blocks, bitmaps, and the Intent Log. Beginning with VxFS 3.5, a separate metadata buffer cache is used to cache the VxFS metadata. The separate metadata cache allows VxFS to do some special processing on metadata buffers, such as shared read locks, which can improve performance by allowing multiple readers of a single buffer.
By default, the maximum size of the VxFS metadata cache scales with the amount of physical memory in the system, according to the table below:
Table 5. Size of the VxFS metadata buffer cache

Memory Size (MB)   VxFS Metadata Buffer Cache (KB)   Percent of memory
256                32000                             12.2%
512                64000                             12.2%
1024               128000                            12.2%
2048               256000                            12.2%
8192               512000                            6.1%
32768              1024000                           3.0%
131072             2048000                           1.5%
The kernel tunable vxfs_bc_bufhwm specifies the maximum amount of memory in kilobytes (the high water mark) to allow for the buffer pages. By default, vxfs_bc_bufhwm is set to zero, which means the maximum size is based on the physical memory size (see Table 5).
The vxfsstat(1M) comm<strong>and</strong> with the -b option can be used to verify the size of the <strong>VxFS</strong> metadata<br />
buffer cache. For example:<br />
# vxfsstat -b /<br />
: :<br />
buffer cache statistics<br />
120320 Kbyte current 544768 maximum<br />
98674531 lookups 99.96% hit rate<br />
3576 sec recycle age [not limited by maximum]<br />
Note<br />
The <strong>VxFS</strong> metadata buffer cache is memory allocated in addition to the <strong>HP</strong>-<br />
<strong>UX</strong> Buffer Cache or Unified File Cache.<br />
<strong>VxFS</strong> inode cache<br />
vx_ninode<br />
VxFS maintains a cache of the most recently accessed inodes in memory. The VxFS inode cache is separate from the HFS inode cache, and it is dynamically sized: the cache grows as new inodes are accessed and contracts when old inodes are no longer referenced. At least one inode entry must exist in the JFS inode cache for each file that is open at a given time.
While the inode cache is dynamically sized, there is a maximum size for the VxFS inode cache. The default maximum size is based on the amount of memory present; for example, a system with 2 to 8 GB of memory will have a maximum of 128,000 inodes. The maximum number of inodes can be tuned using the system-wide tunable vx_ninode. Most systems do not need such a large VxFS inode cache, and a value of 128000 is recommended.
Note also that the HFS inode cache tunable ninode has no effect on the size of the VxFS inode cache. If /stand is the only HFS file system in use, ninode can be tuned lower (400, for example).
The vxfsstat(1M) comm<strong>and</strong> with the -i option can be used to verify the size of the <strong>VxFS</strong> inode cache:<br />
# vxfsstat -i /<br />
: :<br />
inode cache statistics<br />
58727 inodes current 128007 peak 128000 maximum<br />
9726503 lookups 86.68% hit rate<br />
6066203 inodes alloced 5938196 freed<br />
927 sec recycle age<br />
1800 sec free age<br />
While there can be many thousands of inodes in the VxFS inode cache, not all of them are actually in use. The vxfsstat(1M) command with the -v option can be used to verify the number of inodes in use:
# vxfsstat -v / | grep inuse<br />
vxi_icache_inuseino 1165 vxi_icache_maxino 128000
Note<br />
When setting vx_ninode to reduce the JFS inode cache, use the -h option with kctune(1M) to hold the change until the next reboot. This prevents temporary hangs or Serviceguard TOC events caused by the vxfsd daemon becoming very active while shrinking the JFS inode cache.
vxfs_ifree_timelag<br />
By default, the VxFS inode cache is dynamically sized: it typically expands very rapidly, then shrinks over time. The constant resizing of the cache results in additional memory consumption and memory fragmentation, as well as additional CPU time spent managing the cache.
The recommended value for vxfs_ifree_timelag is -1, which allows the inode cache to expand to its maximum size based on vx_ninode but prevents it from dynamically shrinking. This behavior is similar to the former HFS inode cache behavior.
Note<br />
To reduce memory usage <strong>and</strong> memory fragmentation, set<br />
vxfs_ifree_timelag to -1 <strong>and</strong> vx_ninode to 128000 on most systems.<br />
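Following the note above, a hedged example of applying both recommendations with kctune(1M) (the -h option holds the vx_ninode reduction until the next reboot, as discussed earlier):

```shell
# Hold the inode cache reduction until reboot to avoid heavy vxfsd shrink activity
kctune -h vx_ninode=128000

# Stop the dynamic shrinking of the VxFS inode cache
kctune vxfs_ifree_timelag=-1
```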
Directory Name Lookup Cache<br />
The Directory Name Lookup Cache (DNLC) is used to improve directory lookup performance. By default, the DNLC holds a number of directory and file name entries determined by the default size of the JFS inode cache. Prior to VxFS 5.0 on 11i v3, the size of the DNLC could be increased by raising the vx_ninode value, but it could not be decreased. With VxFS 5.0 on 11i v3, the DNLC can be increased or decreased by tuning vx_ninode. However, the size of the DNLC is not dynamic, so the change takes effect only after the next reboot.
The DNLC is searched first, before the actual directories. Only file names of fewer than 32 characters can be cached. The DNLC may not help if an entire large directory cannot fit into the cache, and an ll(1) or find(1) of a large directory could push more useful entries out of the cache. Also, the DNLC does not help when adding a new file to a directory or when searching for a non-existent directory entry.
When a file system is unmounted, all of the DNLC entries associated with that file system must be purged. If a file system has several thousand files and the DNLC is configured to be very large, a delay could occur when the file system is unmounted.
The vxfsstat(1M) comm<strong>and</strong> with the -i option can be used to verify the size of the <strong>VxFS</strong> DNLC:<br />
# vxfsstat -i /<br />
: :<br />
Lookup, DNLC & Directory Cache Statistics<br />
337920 maximum entries in dnlc<br />
1979050 total lookups 90.63% fast lookup<br />
1995410 total dnlc lookup 98.57% dnlc hit rate<br />
25892 total enter 54.08 hit per enter<br />
0 total dircache setup 0.00 calls per setup<br />
46272 total directory scan 15.29% fast directory scan<br />
While the size of the DNLC can be tuned using the vx_ninode tunable, the primary purpose of vx_ninode is to tune the JFS inode cache, so follow the same recommendations for tuning the size of the JFS inode cache discussed earlier.
<strong>VxFS</strong> ioctl() options<br />
Cache advisories<br />
While the mount options allow you to change the cache advisories on a per-file system basis, the<br />
VX_SETCACHE ioctl() allows an application to change the cache advisory on a per-file basis. The<br />
following options are available with the VX_SETCACHE ioctl:<br />
VX_RANDOM - Treat read as r<strong>and</strong>om I/O <strong>and</strong> do not perform read ahead<br />
VX_SEQ - Treat read as sequential <strong>and</strong> perform maximum amount of read ahead<br />
VX_DIRECT - Bypass the buffer cache. All I/O to the file is synchronous. The application buffer must be aligned on a word (4-byte) address, and the I/O must begin on a block (1024-byte) boundary.
VX_NOREUSE - Invalidate buffer immediately after use. This option is useful for data that is not<br />
likely to be reused.<br />
VX_DSYNC - Data is written synchronously, but the file's timestamps in the inode are not flushed to disk. This option is similar to using the O_DSYNC flag when opening the file.
VX_UNBUFFERED - Same behavior as VX_DIRECT, but updating the file size in the inode is done<br />
asynchronously.<br />
VX_CONCURRENT - Enables concurrent I/O on the specified file.<br />
See the vxfsio(7) manpage for more information. The HP OnlineJFS product is required to use the VX_SETCACHE ioctl().
Allocation policies<br />
The VX_SETEXT ioctl() passes three parameters: a fixed extent size to use, the amount of space to reserve for the file, and an allocation flag as defined below.
VX_NOEXTEND - write will fail if an attempt is made to extend the file past the current reservation<br />
VX_TRIM - trim the reservation when the last close of the file is performed<br />
VX_CONTIGUOUS - the reservation must be allocated in a single extent<br />
VX_ALIGN - all extents must be aligned on an extent-sized boundary<br />
VX_NORESERVE - reserve space, but do not record reservation space in the inode. If the file is<br />
closed or the system fails, the reservation is lost.<br />
VX_CHGSIZE - update the file size in the inode to reflect the reservation amount. This allocation does not initialize the data, and reads from the uninitialized portion of the file will return unknown contents. Root permission is required.
VX_GROWFILE - update the file size in the inode to reflect the reservation amount. Subsequent<br />
reads from the uninitialized portion of the file will return zeros. This option creates Zero Fill On<br />
Dem<strong>and</strong> (ZFOD) extents. Writing to ZFOD extents negates the benefit of concurrent I/O.<br />
Reserving file space ensures that you do not run out of space before you finish writing the file, and the overhead of allocating extents is paid up front. Using a fixed extent size can also reduce fragmentation.
The setext(1M) comm<strong>and</strong> can also be used to set the extent allocation <strong>and</strong> reservation policies for a<br />
file.<br />
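As an illustrative sketch (the file name and sizes are hypothetical; extent size and reservation are given in file system blocks, so confirm units against setext(1M)), a fixed extent size and a contiguous reservation could be applied to a file as follows:

```shell
# 1 MB fixed extents and a 100 MB contiguous reservation on a 1 KB-block file system
setext -e 1024 -r 102400 -f contig /data/dbfile
```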
See the vxfsio(7) and setext(1M) manpages for more information. The HP OnlineJFS product is required to use the VX_SETEXT ioctl().
Patches<br />
Performance problems are often resolved by patches to the <strong>VxFS</strong> subsystem, such as the patches for<br />
the <strong>VxFS</strong> read ahead issues on <strong>HP</strong>-<strong>UX</strong> 11i v3. Be sure to check the latest patches for fixes to various<br />
<strong>performance</strong> related problems.<br />
Summary<br />
There is no single set of tunable values that applies to every system. You must understand how your application accesses data in the file system to decide which options and tunables can be changed to maximize the performance of your file system. The VxFS tunables and options discussed in this paper are summarized below.
Table 6. List of VxFS tunables and options

newfs/mkfs options:      bsize, logsize, version
Mount options:           blkclear, datainlog, nodatainlog, mincache, convosync, cio,
                         remount, tmplog, delaylog, log, logiosize, tranflush,
                         noatime, nomtime, qio
File system tunables:    read_pref_io, read_nstream, read_ahead, write_pref_io,
                         write_nstream, max_buf_data_size, max_direct_iosz,
                         discovered_direct_iosz, max_diskq, write_throttle,
                         initial_extent_size, max_seqio_extent_size, qio_cache_enable
System wide tunables:    nbuf, bufpages, dbc_max_pct, dbc_min_pct, filecache_min,
                         filecache_max, fcache_fb_policy, vxfs_bc_bufhwm, vx_ninode,
                         vxfs_ifree_timelag
Per-file attributes:     VX_SETCACHE, VX_SETEXT
For additional information<br />
For additional reading, please refer to the following documents:<br />
<strong>HP</strong>-<strong>UX</strong> <strong>VxFS</strong> mount options for Oracle Database environments,<br />
http://h20195.www2.hp.com/V2/GetDocument.aspx?docname=4AA1-9839ENW&cc=us&lc=en<br />
Common Misconfigured <strong>HP</strong>-<strong>UX</strong> Resources,<br />
http://bizsupport1.austin.hp.com/bc/docs/support/SupportManual/c01920394/c01920394.pdf<br />
Veritas File System 5.0.1 Administrator’s Guide (<strong>HP</strong>-<strong>UX</strong> 11i v3),<br />
http://h20000.www2.hp.com/bc/docs/support/SupportManual/c02220689/c02220689.pdf<br />
Performance Improvements using Concurrent I/O on <strong>HP</strong>-<strong>UX</strong> 11i v3 with OnlineJFS 5.0.1 <strong>and</strong> the <strong>HP</strong>-<br />
<strong>UX</strong> Logical Volume Manager,<br />
http://h20195.www2.hp.com/V2/GetDocument.aspx?docname=4AA1-5719ENW&cc=us&lc=en<br />
To help us improve our documents, please provide feedback at<br />
http://h20219.www2.hp.com/ActiveAnswers/us/en/solutions/technical_tools_feedback.html.<br />
© Copyright 2004, 2011 Hewlett-Packard Development Company, L.P. The information contained herein is<br />
subject to change without notice. The only warranties for <strong>HP</strong> products <strong>and</strong> services are set forth in the express<br />
warranty statements accompanying such products <strong>and</strong> services. Nothing herein should be construed as<br />
constituting an additional warranty. <strong>HP</strong> shall not be liable for technical or editorial errors or omissions<br />
contained herein.<br />
Oracle is a registered trademark of Oracle and/or its affiliates.
c01919408, Created August 2004; Updated January 2011, Rev. 3