
HP-UX VxFS tuning and performance

Technical white paper

Table of contents

Executive summary
Introduction
Understanding VxFS
   Software versions vs. disk layout versions
   Variable sized extent based file system
      Extent allocation
      Fragmentation
   Transaction journaling
Understanding your application
Data access methods
   Buffered/cached I/O
   Direct I/O
   Concurrent I/O
   Oracle Disk Manager
Creating your file system
   Block size
   Intent log size
   Disk layout version
Mount options
Dynamic file system tunables
System wide tunables
   Buffer cache on HP-UX 11i v2 and earlier
   Unified File Cache on HP-UX 11i v3
   VxFS metadata buffer cache
   VxFS inode cache
   Directory Name Lookup Cache
VxFS ioctl() options
   Cache advisories
   Allocation policies
Patches
Summary
For additional information


Executive summary

File system performance is critical to overall system performance. While memory latencies are measured in nanoseconds, I/O latencies are measured in milliseconds. In order to maximize the performance of your systems, the file system must be as fast as possible by performing efficient I/O, eliminating unnecessary I/O, and reducing file system overhead.

In the past several years many changes have been made to the Veritas File System (VxFS) as well as HP-UX. This paper is based on VxFS 3.5 and includes information on VxFS through version 5.0.1, currently available on HP-UX 11i v3 (11.31). Recent key changes mentioned in this paper include:

• page-based Unified File Cache introduced in HP-UX 11i v3
• improved performance for large directories in VxFS 5.0
• patches needed for read ahead in HP-UX 11i v3
• changes to the write flush behind policies in HP-UX 11i v3
• new read flush behind feature in HP-UX 11i v3
• changes in direct I/O alignment in VxFS 5.0 on HP-UX 11i v3
• concurrent I/O included with the OnlineJFS license with VxFS 5.0.1
• Oracle Disk Manager (ODM) feature

Target audience: This white paper is intended for HP-UX administrators who are familiar with configuring file systems on HP-UX.

Introduction

Beginning with HP-UX 10.0, Hewlett-Packard began providing a Journaled File System (JFS) from Veritas Software Corporation (now part of Symantec) known as the Veritas File System (VxFS). You no longer need to be concerned with performance attributes such as cylinders, tracks, and rotational delays. The speed of storage devices has increased dramatically, and computer systems continue to have more memory available to them. Applications are becoming more complex, accessing large amounts of data in many different ways.

In order to maximize performance with VxFS file systems, you need to understand the various attributes of the file system, and which attributes can be changed to best suit how the application accesses the data.

Many factors need to be considered when tuning your VxFS file systems for optimal performance. The answer to the question "How should I tune my file system for maximum performance?" is "It depends". Understanding some of the key features of VxFS and knowing how the applications access data are keys to deciding how to best tune the file systems.

The following topics are covered in this paper:

• Understanding VxFS
• Understanding your application
• Data access methods
• Creating your file system
• Mount options
• Dynamic file system tunables
• System wide tunables
• VxFS ioctl() options


Understanding VxFS

In order to fully understand some of the various file system creation options, mount options, and tunables, a brief overview of VxFS is provided.

Software versions vs. disk layout versions

VxFS supports several different disk layout versions (DLV). The default disk layout version can be overridden when the file system is created using mkfs(1M). Also, the disk layout can be upgraded online using the vxupgrade(1M) command.

Table 1. VxFS software versions and disk layout versions (* denotes default disk layout)

OS Version   SW Version   Disk layout version
11.23        VxFS 3.5     3, 4, 5*
             VxFS 4.1     4, 5, 6*
             VxFS 5.0     4, 5, 6, 7*
11.31        VxFS 4.1     4, 5, 6*
             VxFS 5.0     4, 5, 6, 7*
             VxFS 5.0.1   4, 5, 6, 7*

Several improvements have been made in the disk layouts which can help increase performance. For example, the version 5 disk layout is needed to create file systems larger than 2 TB. The version 7 disk layout, available on VxFS 5.0 and above, improves the performance of large directories.

Please note that you cannot port a file system to a previous release unless the previous release supports the same disk layout version for the file system being ported. Before porting a file system from one operating system to another, be sure the file system is properly unmounted. Differences in the Intent Log will likely cause a replay of the log to fail, and a full fsck will be required before mounting the file system.
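For illustration only (the device file and mount point names are placeholders), the disk layout version of an existing file system can be checked and upgraded online:

# fstyp -v /dev/vg01/lvol1 | grep version   # report the disk layout version of the file system
# vxupgrade /data                           # with only a mount point, report the current disk layout version
# vxupgrade -n 7 /data                      # upgrade the mounted file system to disk layout version 7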

Variable sized extent based file system

VxFS is a variable sized extent based file system. Each file is made up of one or more extents that vary in size. Each extent is made up of one or more file system blocks. A file system block is the smallest allocation unit of a file. The block size of the file system is determined when the file system is created and cannot be changed without rebuilding the file system. The default block size varies depending on the size of the file system.

The advantages of variable sized extents include:

• Faster allocation of extents
• Larger and fewer extents to manage
• Ability to issue large physical I/O for an extent

However, variable sized extents also have disadvantages:

• Free space fragmentation
• Files with many small extents


Extent allocation

When a file is initially opened for writing, VxFS is unaware of how much data the application will write before the file is closed. The application may write 1 KB of data or 500 MB of data. The size of the initial extent is the largest power of 2 greater than the size of the write, with a minimum extent size of 8 KB. Fragmentation will limit the extent size as well.

If the current extent fills up, the extent will be extended if neighboring free space is available. Otherwise, a new extent will be allocated that is progressively larger as long as there is available contiguous free space.

When the file is closed by the last process that had the file opened, the last extent is trimmed to the minimum amount needed.

Fragmentation

Two types of fragmentation can occur with VxFS. The available free space can be fragmented, and individual files may be fragmented. As files are created and removed, free space can become fragmented due to the variable sized extent nature of the file system. For volatile file systems with a small block size, the file system performance can degrade significantly if the fragmentation is severe. Examples of applications that use many small files include mail servers. As free space becomes fragmented, file allocation takes longer as smaller extents are allocated. Then, more and smaller I/Os are necessary when the file is read or updated.

However, files can be fragmented even when there are large extents available in the free space map. This fragmentation occurs when a file is repeatedly opened, extended, and closed. When the file is closed, the last extent is trimmed to the minimum size needed. Later, when the file grows but there is no contiguous free space to increase the size of the last extent, a new extent will be allocated and the file could be closed again. This process of opening, extending, and closing a file can repeat, resulting in a large file with many small extents. When a file is fragmented, a sequential read through the file will be seen as small random I/O to the disk drive or disk array, since the fragmented extents may be small and reside anywhere on disk.

Static file systems that use large files built shortly after the file system is created are less prone to fragmentation, especially if the file system block size is 8 KB. Examples are file systems that have large database files. The large database files are often updated but rarely change in size.

File system fragmentation can be reported using the "-E" option of the fsadm(1M) utility. File systems can be defragmented or reorganized with the HP OnLineJFS product using the "-e" option of the fsadm utility. Remember, performing one 8 KB I/O will be faster than performing eight 1 KB I/Os. File systems with small block sizes are more susceptible to performance degradation due to fragmentation than file systems with large block sizes.

File system reorganization should be done on a regular and periodic basis, where the interval between reorganizations depends on the volatility, size, and number of the files. Large file systems with a large number of files and significant fragmentation can take an extended amount of time to defragment. The extent reorganization is run online; however, increased disk I/O will occur when fsadm copies data from smaller extents to larger ones.


File system reorganization attempts to collect large areas of free space by moving various file extents and attempts to defragment individual files by copying the data in the small extents to larger extents. The reorganization is not a compaction utility and does not try to move all the data to the front of the file system.

Note
fsadm extent reorganization may fail if there are not sufficient large free areas to perform the reorganization. Use fsadm -e to defragment file systems on a regular basis.
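A minimal sketch of reporting and reorganizing fragmentation follows; /home is only a placeholder mount point:

# fsadm -F vxfs -E /home    # report extent fragmentation for the file system
# fsadm -F vxfs -e /home    # reorganize (defragment) file extents online (requires HP OnLineJFS)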

Transaction journaling

By definition, a journaled file system logs file system changes to a "journal" in order to provide file system integrity in the event of a system crash. VxFS logs structural changes to the file system in a circular transaction log called the Intent Log. A transaction is a group of related changes that represent a single operation. For example, adding a new file in a directory would require multiple changes to the file system structures, such as adding the directory entry, writing the inode information, updating the inode bitmap, etc. Once the Intent Log has been updated with the pending transaction, VxFS can begin committing the changes to disk. When all the changes have been made on disk, a completion record is written to indicate that the transaction is complete.

With the datainlog mount option, small synchronous writes are also logged in the Intent Log. The datainlog mount option will be discussed in more detail later in this paper.

Since the Intent Log is a circular log, it is possible for it to "fill up". This occurs when the oldest transaction in the log has not been completed, and the space is needed for a new transaction. When the Intent Log is full, threads must wait for the oldest transaction to be flushed and the space returned to the Intent Log. A full Intent Log occurs very infrequently, but may occur more often if many threads are making structural changes or small synchronous writes (datainlog) to the file system simultaneously.

Understanding your application

As you plan the various attributes of your file system, you must also know your application and how your application will be accessing the data. Ask yourself the following questions:

• Will the application be doing mostly file reads, file writes, or both?
• Will files be accessed sequentially or randomly?
• What are the sizes of the file reads and writes?
• Does the application use many small files, or a few large files?
• How is the data organized on the volume or disk? Is the file system fragmented? Is the volume striped?
• Are the files written or read by multiple processes or threads simultaneously?
• Which is more important, data integrity or performance?

How you tune your system and file systems depends on how you answer the above questions. No single set of tunables exists that can be universally applied such that all workloads run at peak performance.


Data access methods

Buffered/cached I/O

By default, most access to files in a VxFS file system is through the cache. In HP-UX 11i v2 and earlier, the HP-UX Buffer Cache provided the cache resources for file access. In HP-UX 11i v3 and later, the Unified File Cache is used. Cached I/O allows for many features, including asynchronous prefetching known as read ahead, and asynchronous or delayed writes known as flush behind.

Read ahead

When reads are performed sequentially, VxFS detects the pattern and reads ahead or prefetches data into the buffer/file cache. VxFS attempts to maintain 4 read ahead "segments" of data in the buffer/file cache, using the read ahead size as the size of each segment.

Figure 1. VxFS read ahead (256 KB read ahead segments maintained in front of a sequential read)

The segments act as a "pipeline". When the data in the first of the 4 segments is read, asynchronous I/O for a new segment is initiated, so that 4 segments are maintained in the read ahead pipeline.

The initial size of a read ahead segment is specified by the VxFS tunable read_pref_io. As the process continues to read a file sequentially, the read ahead size of the segment approximately doubles, to a maximum of read_pref_io * read_nstream.

For example, consider a file system with read_pref_io set to 64 KB and read_nstream set to 4. When sequential reads are detected, VxFS will initially read ahead 256 KB (4 * read_pref_io). As the process continues to read ahead, the read ahead amount will grow to 1 MB (4 segments * read_pref_io * read_nstream). As the process consumes a 256 KB read ahead segment from the front of the pipeline, asynchronous I/O is started for the 256 KB read ahead segment at the end of the pipeline.

By reading the data ahead, the disk latency can be reduced. The ideal configuration is the minimum amount of read ahead that will reduce the disk latency. If the read ahead size is configured too large, the disk I/O queue may spike when the reads are initiated, causing delays for other potentially more critical I/Os. Also, the data read into the cache may no longer be in the cache when the data is needed, causing the data to be re-read into the cache. These cache misses will cause the read ahead size to be reduced automatically.

Note that if another thread is reading the file randomly or sequentially, VxFS has trouble with the sequential pattern detection. The sequential reads may be treated as random reads and no read ahead is performed. This problem can be resolved with enhanced read ahead, discussed later. The read ahead policy can be set on a per file system basis by setting the read_ahead tunable using vxtunefs(1M). The default read_ahead value is 1 (normal read ahead).
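As a hedged illustration (the /db01 mount point is only a placeholder), the read ahead tunables can be displayed and changed online with vxtunefs(1M):

# vxtunefs /db01 | grep read              # display the current read_pref_io, read_nstream, and read_ahead values
# vxtunefs -o read_ahead=2 /db01          # select enhanced read ahead for this file system
# vxtunefs -o read_pref_io=262144 /db01   # set the initial read ahead segment size to 256 KB

Values set this way apply only while the file system is mounted; persistent settings are typically placed in /etc/vx/tunefstab.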


Read ahead with VxVM stripes

The default value for read_pref_io is 64 KB, and the default value for read_nstream is 1, except when the file system is mounted on a VxVM striped volume, where the tunables default to match the striping attributes. For most applications, the 64 KB read ahead size is good (remember, VxFS attempts to maintain 4 segments of the read ahead size). Depending on the stripe size (column width) and number of stripes (columns), the default tunables on a VxVM volume may result in excessive read ahead and lead to performance problems.

Consider a VxVM volume with a stripe size of 1 MB and 20 stripes. By default, read_pref_io is set to 1 MB and read_nstream is set to 20. When VxFS initiates the read ahead, it tries to prefetch 4 segments of the initial read ahead size (4 segments * read_pref_io). But as the sequential reads continue, the read ahead size can grow to 20 MB (read_pref_io * read_nstream), and VxFS attempts to maintain 4 segments of read ahead data, or 80 MB. This large amount of read ahead can flood the SCSI I/O queues or internal disk array queues, causing severe performance degradation.

Note
If you are using VxVM striped volumes, be sure to review the read_pref_io and read_nstream values to be sure excessive read ahead is not being done.

False sequential I/O patterns and read ahead

Some applications trigger read ahead when it is not needed. This behavior occurs when the application is doing random I/O by positioning the file pointer using the lseek() system call and then performing the read() system call, or by simply using the pread() system call.

Figure 2. False sequential I/O patterns and read ahead (two adjacent 8 KB lseek()/read() requests trigger read ahead)

If the application reads two adjacent blocks of data, the read ahead algorithm is triggered. Although only 2 reads are needed, VxFS begins prefetching data from the file, up to 4 segments of the read ahead size. The larger the read ahead size, the more data is read into the buffer/file cache. Not only is excessive I/O performed, but useful data that may have been in the buffer/file cache is invalidated so the new unwanted data can be stored.

Note
If the false sequential I/O patterns cause performance problems, read ahead can be disabled by setting read_ahead to 0 using vxtunefs(1M), or direct I/O could be used.

Enhanced read ahead

Enhanced read ahead can detect non-sequential patterns, such as reading every third record or reading a file backwards. Enhanced read ahead can also handle patterns from multiple threads. For example, two threads can be reading from the same file sequentially and both threads can benefit from the configured read ahead size.

Figure 3. Enhanced read ahead (patterned reads of alternating 64 KB and 128 KB records)

Enhanced read ahead can be set on a per file system basis by setting the read_ahead tunable to 2 with vxtunefs(1M).

Read ahead on 11i v3

With the introduction of the Unified File Cache (UFC) on HP-UX 11i v3, significant changes were needed for VxFS. As a result, a defect was introduced which caused poor VxFS read ahead performance. The problems are fixed in the following patches, according to the VxFS version:

• PHKL_40018 - 11.31 vxfs4.1 cumulative patch
• PHKL_41071 - 11.31 VRTS 5.0 MP1 VRTSvxfs Kernel Patch
• PHKL_41074 - 11.31 VRTS 5.0.1 GARP1 VRTSvxfs Kernel Patch

These patches are critical to VxFS read ahead performance on 11i v3.
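To check whether these specific patch IDs are installed, swlist(1M) can be used; this simple search would not match superseding patches:

# swlist -l patch | grep -e PHKL_40018 -e PHKL_41071 -e PHKL_41074   # list any of these patches that are installed
# swlist -l product | grep -i vxfs                                   # confirm the installed VxFS version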

Another caveat with VxFS 4.1 and later occurs when the size of the read() is larger than the read_pref_io value. A large cached read can result in a read ahead size that is insufficient for the read. To avoid this situation, the read_pref_io value should be configured to be greater than or equal to the application read size.

For example, if the application is doing 128 KB read() system calls, then the application can experience performance issues with the default read_pref_io value of 64 KB, as VxFS would only read ahead 64 KB, leaving the remaining 64 KB to be read synchronously.

Note
On HP-UX 11i v3, the read_pref_io value should be set to the size of the largest logical read size for a process doing sequential I/O through cache.

The write_pref_io value should be configured to be greater than or equal to the read_pref_io value due to a new feature on 11i v3 called read flush behind, which will be discussed later.

Flush behind

During normal asynchronous write operations, the data is written to buffers in the buffer cache, or memory pages in the file cache. The buffers/pages are marked "dirty" and control returns to the application. Later, the dirty buffers/pages must be flushed from the cache to disk.

There are 2 ways for dirty buffers/pages to be flushed to disk. One way is through a "sync", either by a sync() or fsync() system call, by the syncer daemon, or by one of the vxfsd daemon threads. The second method is known as "flush behind" and is initiated by the application performing the writes.


Figure 4. Flush behind (asynchronous 64 KB flushes issued behind a sequential write)

As data is written to a VxFS file, VxFS will perform "flush behind" on the file. In other words, it will issue asynchronous I/O to flush the buffers from the buffer cache to disk. The flush behind amount is calculated by multiplying the write_pref_io by the write_nstream file system tunables. The default flush behind amount is 64 KB.

Flush behind on HP-UX 11i v3

By default, flush behind is disabled on HP-UX 11i v3. The advantage of disabling flush behind is improved performance for applications that perform rewrites to the same data blocks. If flush behind is enabled, the initial write may be in progress, causing the second write to stall as the page being modified has an I/O already in progress.

However, the disadvantage of disabling flush behind is that more dirty pages in the Unified File Cache accumulate before getting flushed by vhand, ksyncher, or vxfsd.

Flush behind can be enabled on HP-UX 11i v3 by changing the fcache_fb_policy(5) value using kctune(1M). The default value of fcache_fb_policy is 0, which disables flush behind. If fcache_fb_policy is set to 1, a special kernel daemon, fb_daemon, is responsible for flushing the dirty pages from the file cache. If fcache_fb_policy is set to 2, then the process that is writing the data will initiate the flush behind. Setting fcache_fb_policy to 2 is similar to the 11i v2 behavior.

Note that the write_pref_io and write_nstream tunables have no effect on flush behind if fcache_fb_policy is set to 0. However, these tunables may still impact read flush behind, discussed later in this paper.

Note
Flush behind is disabled by default on HP-UX 11i v3. To enable flush behind, use kctune(1M) to set the fcache_fb_policy tunable to 1 or 2.
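As a sketch, flush behind could be enabled so that the writing process initiates its own flushes (similar to the 11i v2 behavior):

# kctune fcache_fb_policy      # display the current setting
# kctune fcache_fb_policy=2    # have the writing process initiate flush behind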

I/O throttling

Often, applications may write to the buffer/file cache faster than VxFS and the I/O subsystem can flush the buffers. Flushing too many buffers/pages can cause huge disk I/O queues, which could impact other critical I/O to the devices. To prevent too many buffers/pages from being flushed simultaneously for a single file, 2 types of I/O throttling are provided: flush throttling and write throttling.


Figure 5. I/O throttling

Flush throttling (max_diskq)

The amount of dirty data being flushed per file cannot exceed the max_diskq tunable. The process performing a write() system call will skip the "flush behind" if the amount of outstanding I/O exceeds the max_diskq. However, when many buffers/pages of a file need to be flushed to disk, if the amount of outstanding I/O exceeds the max_diskq value, the process flushing the data will wait 20 milliseconds before checking to see if the amount of outstanding I/O drops below the max_diskq value. Flush throttling can severely degrade asynchronous write performance. For example, when using the default max_diskq value of 1 MB, the 1 MB of I/O may complete in 5 milliseconds, leaving the device idle for 15 milliseconds before checking to see if the outstanding I/O drops below max_diskq. For most file systems, max_diskq should be increased to a large value (such as 1 GB), especially for file systems mounted on cached disk arrays.

While the default max_diskq value is 1 MB, it cannot be set lower than write_pref_io * 4.

Note
For best asynchronous write performance, tune max_diskq to a large value such as 1 GB.
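For example, a sketch of raising max_diskq to 1 GB on a file system (the mount point is a placeholder; the value is in bytes):

# vxtunefs -o max_diskq=1073741824 /db01   # allow up to 1 GB of outstanding flush I/O per file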

Write throttling (write_throttle)

The amount of dirty data (unflushed) per file cannot exceed write_throttle. If a process tries to perform a write() operation and the amount of dirty data exceeds the write throttle amount, then the process will wait until some of the dirty data has been flushed. The default value for write_throttle is 0 (no throttling).

Read flush behind

VxFS introduced a new feature on 11i v3 called read flush behind. When the number of free pages in the file cache is low (which is a common occurrence) and a process is reading a file sequentially, pages which have been previously read are freed from the file cache and reused.

Figure 6. Read flush behind (64 KB segments already consumed are freed behind a sequential read, while read ahead continues in front of it)


The read flush behind feature has the advantage of preventing a read of a large file (such as a backup, file copy, or gzip) from consuming large amounts of file cache. However, the disadvantage of read flush behind is that a file may not be able to reside entirely in cache. For example, if the file cache is 6 GB in size and a 100 MB file is read sequentially, the file will likely not reside entirely in cache. If the file is re-read multiple times, the data would need to be read from disk instead of the reads being satisfied from the file cache.

The read flush behind feature cannot be disabled directly. However, it can be tuned indirectly by setting the VxFS flush behind size using write_pref_io and write_nstream. For example, increasing write_pref_io to 200 MB would allow the 100 MB file to be read into cache in its entirety. Flush behind would not be impacted if fcache_fb_policy is set to 0 (the default); however, flush behind can be impacted if fcache_fb_policy is set to 1 or 2.

The read flush behind can also impact VxFS read ahead if the VxFS read ahead size is greater than the write flush behind size. For best read ahead performance, the write_pref_io and write_nstream values should be greater than or equal to the read_pref_io and read_nstream values.

Note
For best read ahead performance, the write_pref_io and write_nstream values should be greater than or equal to the read_pref_io and read_nstream values.
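As an example sketch, on a file system where read_pref_io is 256 KB and read_nstream is 4, the write flush behind tunables could be raised to match (placeholder mount point):

# vxtunefs -o write_pref_io=262144 /db01   # match write_pref_io to read_pref_io
# vxtunefs -o write_nstream=4 /db01        # match write_nstream to read_nstream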

Buffer sizes on HP-UX 11i v2 and earlier

On HP-UX 11i v2 and earlier, cached I/O is performed through the HP-UX buffer cache. The maximum size of each buffer in the cache is determined by the file system tunable max_buf_data_size (default 8 KB). The only other possible value is 64 KB. For reads and writes larger than max_buf_data_size, and when performing read ahead, VxFS will chain the buffers together when sending them to the Volume Management subsystem. The Volume Management subsystem holds the buffers until VxFS sends the last buffer in the chain, then it attempts to merge the buffers together in order to perform larger and fewer physical I/Os.

Figure 7. Buffer merging (a chain of 8 KB buffers is merged into a single 64 KB physical I/O)

The maximum chain size that the operating system can use is 64 KB. Merged buffers cannot cross a "chain boundary", which is 64 KB. This means that if the series of buffers to be merged crosses a 64 KB boundary, then the merging will be split into a maximum of 2 buffers, each less than 64 KB in size.

If your application performs writes to a file or performs random reads, do not increase max_buf_data_size to 64 KB unless the size of the read or write is 64 KB or greater. Smaller writes cause the data to be read into the buffer synchronously first, then the modified portion of the data is updated in the buffer cache, and then the buffer is flushed asynchronously. The synchronous read of the data will drastically reduce performance if the buffer is not already in the cache. For random reads, VxFS will attempt to read an entire buffer, causing more data to be read than necessary.



Note that buffer merging was not initially implemented on IA-64 systems with VxFS 3.5 on 11i v2, so using the default max_buf_data_size of 8 KB would result in a maximum physical I/O size of 8 KB. Buffer merging is implemented in 11i v2 0409, released in the fall of 2004.

Page sizes on HP-UX 11i v3 and later

On HP-UX 11i v3 and later, VxFS performs cached I/O through the Unified File Cache. The UFC is page based, and the max_buf_data_size tunable has no effect on 11i v3. The default page size is 4 KB. However, when larger reads or read ahead are performed, the UFC will attempt to use contiguous 4 KB pages. For example, when performing sequential reads, the read_pref_io value will be used to allocate the consecutive 4 KB pages. Note that the "buffer" size is no longer limited to 8 KB or 64 KB with the UFC, eliminating the buffer merging overhead caused by chaining a number of 8 KB buffers.

Also, when doing small random I/O, VxFS can perform 4 KB page requests instead of 8 KB buffer requests. The page request has the advantage of returning less data, thus reducing the transfer size. However, if the application still needs to read the other 4 KB, then a second I/O request would be needed, which would decrease performance. So small random I/O performance may be worse on 11i v3 if cache hits would have occurred had 8 KB of data been read instead of 4 KB. The only solution is to increase the base_pagesize(5) kernel tunable from the default value of 4 to 8 or 16. Careful testing should be done when changing the base_pagesize from the default value, as some applications assume the page size will be the default 4 KB.
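A sketch of examining and changing the base page size with kctune(1M) follows; the change is system wide, typically requires a reboot to take effect, and should only be made after careful testing:

# kctune base_pagesize     # display the current base page size (in KB)
# kctune base_pagesize=8   # request an 8 KB base page size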

Direct I/O

If OnlineJFS is installed and licensed, direct I/O can be enabled in several ways (an example mount command follows the list below):

• Mount option "-o mincache=direct"
• Mount option "-o convosync=direct"
• Use the VX_DIRECT cache advisory of the VX_SETCACHE ioctl() system call
• Discovered direct I/O
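For example, a sketch of mounting a file system for direct I/O (the device and mount point names are placeholders):

# mount -F vxfs -o delaylog,mincache=direct,convosync=direct /dev/vg01/lvol3 /data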

Direct I/O has several advantages:

• Data accessed only once does not benefit from being cached
• Direct I/O data does not disturb other data in the cache
• Data integrity is enhanced since disk I/O must complete before the read or write system call completes (I/O is synchronous)
• Direct I/O can perform larger physical I/O than buffered or cached I/O, which reduces the total number of I/Os for large read or write operations

However, direct I/O also has its disadvantages:

• Direct I/O reads cannot benefit from VxFS read ahead algorithms
• All physical I/O is synchronous, thus each write must complete the physical I/O before the system call returns
• Direct I/O performance degrades when the I/O request is not properly aligned
• Mixing buffered/cached I/O and direct I/O can degrade performance

Direct I/O works best for large I/Os that are accessed once and when data integrity for writes is more important than performance.


Note
The OnlineJFS license is required to perform direct I/O when using VxFS 5.0 or earlier. Beginning with VxFS 5.0.1, direct I/O is available with the Base JFS product.

Discovered direct I/O

With HP OnLineJFS, direct I/O will be enabled if the read or write size is greater than or equal to the file system tunable discovered_direct_iosz. The default discovered_direct_iosz is 256 KB. As with direct I/O, all discovered direct I/O will be synchronous. Read ahead and flush behind will not be performed with discovered direct I/O.

If read ahead and flush behind are desired for large reads and writes, or the data needs to be reused from the buffer/file cache, then discovered_direct_iosz can be increased as needed.

Direct I/O and unaligned data

When performing direct I/O, alignment is very important. Direct I/O requests that are not properly aligned have the unaligned portions of the request managed through the buffer or file cache.

For writes, the unaligned portions must be read from disk into the cache first, and then the data in the cache is modified and written out to disk. This read-before-write behavior can dramatically increase the times for the write requests.

Note
The best direct I/O performance occurs when the logical requests are properly aligned. Prior to VxFS 5.0 on 11i v3, the logical requests need to be aligned on file system block boundaries. Beginning with VxFS 5.0 on 11i v3, logical requests need to be aligned on a device block boundary (1 KB).

Direct I/O on HP-UX 11.23

Unaligned data is still buffered using 8 KB buffers (the default size), although the I/O is still done synchronously. For all VxFS versions on HP-UX 11i v2 and earlier, direct I/O performs well when the request is aligned on a file system block boundary and the entire request can bypass the buffer cache. However, requests smaller than the file system block size, or requests that do not align on a file system block boundary, will use the buffer cache for the unaligned portions of the request.

For large transfers greater than the block size, part of the data can still be transferred using direct I/O. For example, consider a 16 KB write request on a file system using an 8 KB block size where the request does not start on a file system block boundary (the data starts on a 4 KB boundary in this case). Since the first 4 KB is unaligned, the data must be buffered. Thus an 8 KB buffer is allocated and the entire 8 KB of data is read in, so 4 KB of unnecessary data is read. This behavior is repeated for the last 4 KB of data, since it is also not properly aligned. The two 8 KB buffers are modified and written to the disk along with the 8 KB direct I/O for the aligned portion of the request. In all, the operation took 5 synchronous physical I/Os: 2 reads and 3 writes. If the file system is recreated to use a 4 KB block size or smaller, the I/O would be aligned and could be performed in a single 16 KB physical I/O.

Figure 8. Example of unaligned direct I/O (a 16 KB random write starting on a 4 KB boundary: the first and last 4 KB are handled as buffered I/O through 8 KB file system blocks, while only the middle 8 KB is written with direct I/O)

Using a smaller block size, such as 1 KB, will improve the chances of doing more optimal direct I/O. However, even with a 1 KB file system block size, unaligned I/Os can occur.

Also, when doing direct I/O writes, a file's buffers must be searched to locate any buffers that overlap with the direct I/O request. If any overlapping buffers are found, those buffers are invalidated. Invalidating overlapping buffers is required to maintain consistency of the data in the cache and on disk. A direct I/O write would put newer data on the disk, and thus the data in the cache would be stale unless it is invalidated. If the number of buffers for a file is large, then the process may spend a lot of time searching the file's list of buffers before each direct I/O is performed. The invalidating of buffers when doing direct I/O writes can degrade performance so severely that mixing buffered and direct I/O in the same file is highly discouraged.

Mixing buffered I/O and direct I/O can occur due to discovered direct I/O, unaligned direct I/O, or if there is a mix of synchronous and asynchronous operations and the mincache and convosync options are not set to the same value.

Note
Mixing buffered and direct I/O on the same file can cause severe performance degradation on HP-UX 11i v2 or earlier systems.

Direct I/O on HP-UX 11.31 with VxFS 4.1

HP-UX 11.31 introduced the Unified File Cache (UFC). The UFC is similar to the HP-UX buffer cache in function, but is 4 KB page based as opposed to 8 KB buffer based. The alignment issues are similar to those on HP-UX 11.23 and earlier, although it may only be necessary to read in a single 4 KB page, as opposed to an 8 KB buffer.

Also, the mixing of cached I/O and direct I/O is no longer a performance issue, as the UFC implements a more efficient algorithm for locating pages in the UFC that overlap with the direct I/O request.

Direct I/O on HP-UX 11.31 with VxFS 5.0 and later

VxFS 5.0 on HP-UX 11i v3 introduced an enhancement that no longer requires direct I/O to be aligned on a file system block boundary. Instead, the direct I/O only needs to be aligned on a device block boundary, which is a 1 KB boundary. File systems with a 1 KB file system block size are not impacted, but it is no longer necessary to recreate file systems with a 1 KB file system block size if the application is doing direct I/O requests aligned on a 1 KB boundary.


Concurrent I/O

The main problem addressed by concurrent I/O is VxFS inode lock contention. During a read() system call, VxFS will acquire the inode lock in shared mode, allowing many processes to read a single file concurrently without lock contention. However, when a write() system call is made, VxFS will attempt to acquire the lock in exclusive mode. The exclusive lock allows only one write per file to be in progress at a time, and also blocks other processes reading the file. These locks on the VxFS inode can cause serious performance problems when there are one or more file writers and multiple file readers.

Figure 9. Normal shared read locks and exclusive write locks

In the example above, Process A and Process B both have shared locks on the file when Process C requests an exclusive lock. Once the request for an exclusive lock is in place, other shared read lock requests must also block; otherwise the writer may be starved for the lock.

With concurrent I/O, both read() and write() system calls will request a shared lock on the inode, allowing all read() and write() system calls to take place concurrently. Therefore, concurrent I/O will eliminate most of the VxFS inode lock contention.


Figure 10. Locking with concurrent I/O (cio)

To enable a file system for concurrent I/O, the file system simply needs to be mounted with the cio mount option, for example:

# mount -F vxfs -o cio,delaylog /dev/vgora5/lvol1 /oracledb/s05

Concurrent I/O was introduced with VxFS 3.5. However, a separate license was needed to enable concurrent I/O. With the introduction of VxFS 5.0.1 on HP-UX 11i v3, the concurrent I/O feature of VxFS is now available with the OnlineJFS license.

While concurrent I/O sounds like a great solution to alleviate VxFS inode lock contention, and it is, some major caveats exist that anyone who plans to use concurrent I/O should be aware of.

Concurrent I/O converts read and write operations to direct I/O

Some applications benefit greatly from the use of cached I/O, relying on cache hits to reduce the amount of physical I/O, or on VxFS read ahead to prefetch data and reduce the read time. Concurrent I/O converts most read and write operations to direct I/O, which bypasses the buffer/file cache.

If a file system is mounted with the cio mount option, then the mount options mincache=direct and convosync=direct are implied. If the mincache=direct and convosync=direct options are used, they should have no impact if the cio option is used as well.

Since concurrent I/O converts read and write operations to direct I/O, all alignment constraints for direct I/O also apply to concurrent I/O.

Certain operations still take exclusive locks

Some operations still need to take an exclusive lock when performing I/O and can negate the impact of concurrent I/O. The following is a list of operations that still need to obtain an exclusive lock:

1. fsync()
2. writes that span multiple file extents
3. writes when the request is not aligned on a 1 KB boundary
4. extending writes
5. allocating writes to "holes" in a sparse file
6. writes to Zero Filled On Demand (ZFOD) extents


Zero Filled On Demand (ZFOD) extents are new with VxFS 5.0 and are created by the VX_SETEXT ioctl() with the VX_GROWFILE allocation flag, or with the setext(1M) growfile option.

Not all applications support concurrent I/O

By using a shared lock for write operations, concurrent I/O breaks some POSIX standards. If two processes are writing to the same block, there is no coordination between the writes, and data modified by one process can be lost by data modified by another process. Before implementing concurrent I/O, the application vendors should be consulted to be sure concurrent I/O is supported with the application.

Performance improvements with concurrent I/O vary

Concurrent I/O benefits performance the most when there are multiple processes reading a single file and one or more processes writing to the same file. The more writes performed to a file with multiple readers, the greater the VxFS inode lock contention and the more likely that concurrent I/O will benefit.

A file system must be unmounted to remove the cio mount option

A file system may be remounted with the cio mount option ("mount -o remount,cio /fs"); however, to remove the cio option, the file system must be completely unmounted using umount(1M).
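A sketch of adding and later removing the cio option, reusing the device and mount point from the example above:

# mount -F vxfs -o remount,cio,delaylog /dev/vgora5/lvol1 /oracledb/s05   # add cio to a mounted file system
# umount /oracledb/s05                                                    # cio can only be removed with a full unmount
# mount -F vxfs -o delaylog /dev/vgora5/lvol1 /oracledb/s05               # remount without cio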

Oracle Disk Manager

Oracle Disk Manager (ODM) is an additionally licensed feature specifically for use with Oracle database environments. If the oracle binary is linked with the ODM library, Oracle will make an ioctl() call to the ODM driver to initiate multiple I/O requests. The I/O requests will be issued in parallel and will be asynchronous, provided the disk_asynch_io value is set to 1 in the init.ora file. Later, Oracle will make another ioctl() call to process the I/O completions.

Figure 11. I/O request processing with Oracle Disk Manager (ODM)

ODM provides near raw device asynchronous I/O access. It also eliminates the VxFS inode lock contention (similar to concurrent I/O).


Creating your file system

When you create a file system using newfs(1M) or mkfs(1M), you need to be aware of several options that could affect performance.

Block size

The block size (bsize) is the smallest amount of space that can be allocated to a file extent. Most applications will perform better with an 8 KB block size. Extent allocations are easier, and file systems with an 8 KB block size are less likely to be impacted by fragmentation, since each extent would have a minimum size of 8 KB. However, using an 8 KB block size could waste space, as a file with 1 byte of data would take up 8 KB of disk space.

The default block size for a file system scales depending on the size of the file system when the file system is created. The fstyp(1M) command can be used to verify the block size of a specific file system using the f_frsize field. The f_bsize field can be ignored for VxFS file systems, as the value will always be 8192.

# fstyp -v /dev/vg00/lvol3
vxfs
version: 6
f_bsize: 8192
f_frsize: 8192

The following table documents the default block sizes for the various VxFS versions beginning with VxFS 3.5 on 11.23:

Table 2. Default FS block sizes

FS Size     DLV 4     DLV 5     >= DLV 6

Table 3. Default intent log size

FS Size     VxFS 3.5 or 4.1 or DLV = 6


Note
To improve large directory performance, upgrade to VxFS 5.0 or later and run vxupgrade(1M) to upgrade the disk layout version to version 7.

Mount options

blkclear

When new extents are allocated to a file, the extents will contain whatever was last written on disk until new data overwrites the old uninitialized data. Accessing the uninitialized data could cause a security problem, as sensitive data may remain in the extents. The newly allocated extents can be initialized to zeros if the blkclear mount option is used. In this case, performance is traded for data security.

When writing to a "hole" in a sparse file, the newly allocated extent must be cleared first, before writing the data. The blkclear option does not need to be used to clear the newly allocated extents in a sparse file. This clearing is done automatically to prevent stale data from showing up in a sparse file.

Note
For best performance, do not use the blkclear mount option.

datainlog, nodatainlog

Figure 12. Datainlog mount option

When using HP OnLineJFS, small (8 KB or less) synchronous writes can have the user data written into the intent log along with the inode update, as illustrated in Figure 12. The actual data is written to its real location on disk later, so the synchronous write completes after a single sequential write to the intent log rather than separate data and inode writes, which generally improves small synchronous write performance. If the system crashes before the delayed write completes, the data is recovered when the intent log is replayed.

Using datainlog has no effect on normal asynchronous writes or on synchronous writes performed with direct I/O. The option nodatainlog is the default for systems without HP OnlineJFS, while datainlog is the default for systems that do have HP OnlineJFS.

mincache

By default, I/O operations are cached in memory using the HP-UX buffer cache or Unified File Cache (UFC), which allows for asynchronous operations such as read ahead and flush behind. The mincache options convert cached asynchronous operations so they behave differently. Normal synchronous operations are not affected.

The mincache option closesync flushes a file's dirty buffers/pages when the file is closed. The close() system call may experience some delays while the dirty data is flushed, but the integrity of closed files would not be compromised by a system crash.

The mincache option dsync converts asynchronous I/O to synchronous I/O. For best performance, the mincache=dsync option should not be used; it should be used only if the data must be on disk when the write() system call returns.

The mincache options direct and unbuffered are similar to the dsync option, but all I/Os are converted to direct I/O. For best performance, mincache=direct should not be used unless the I/O is large and expected to be accessed only once, or synchronous I/O is desired. All direct I/O is synchronous, and read ahead and flush behind are not performed.

The mincache option tmpcache provides the best performance, but less data integrity in the event of a system crash. The tmpcache option does not flush a file's buffers when it is closed, so there is a risk of losing data or getting garbage data in a file if the data is not flushed to disk before a system crash.

All of the mincache options except closesync require the HP OnlineJFS product when using VxFS 5.0 or earlier. The closesync and direct options are available with the Base JFS product on VxFS 5.0.1 and above.

convosync

Applications can request synchronous I/O by specifying the O_SYNC or O_DSYNC flag during the open() system call. The convosync options convert these synchronous operations so they behave differently. Normal asynchronous operations are not affected when using the convosync mount option.

The convosync option closesync converts O_SYNC operations to asynchronous operations. The file's dirty buffers/pages are then flushed when the file is closed. While this option speeds up applications that use O_SYNC, it may be harmful for applications that rely on data being written to disk before the write() system call completes.

The convosync option delay converts all O_SYNC operations to asynchronous operations. This option is similar to convosync=closesync except that the file's buffers/pages are not flushed when the file is closed. While this option will improve performance, applications that rely on the O_SYNC behavior may fail after a system crash.

The convosync option dsync converts O_SYNC operations to O_DSYNC operations. Data writes will continue to be synchronous, but the associated inode time updates will be performed asynchronously.

The convosync options direct and unbuffered cause O_SYNC and O_DSYNC operations to bypass the buffer/file cache and perform direct I/O. These options are similar to the dsync option in that the inode time update will be delayed.

All of the convosync options require the HP OnlineJFS product when using VxFS 5.0 or earlier. The direct option is available with the Base JFS product on VxFS 5.0.1.
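As a hedged example (the device, mount point, and option combination are illustrative only, not a recommendation for any particular workload), mincache and convosync advisories are combined on the mount command line:

# mount -F vxfs -o delaylog,mincache=direct,convosync=direct /dev/vg01/lvol1 /oradata

With these options, both normal buffered I/O and O_SYNC/O_DSYNC I/O to the file system are converted to direct I/O.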


cio

The cio option enables the file system for concurrent I/O. Prior to VxFS 5.0.1, a separate license was needed to use the concurrent I/O feature. Beginning with VxFS 5.0.1, concurrent I/O is available with the OnlineJFS license. Concurrent I/O is recommended for applications that support its use.

remount

The remount option allows a file system to be remounted online with different mount options. The mount -o remount operation can be done online without taking down the application accessing the file system. During the remount, all file I/O is flushed to disk and subsequent I/O operations are frozen until the remount is complete, so the remount can result in a stall for some operations. Applications that are sensitive to timeouts, such as Oracle Clusterware, should avoid having their file systems remounted online.

Intent log options

There are 3 levels of transaction logging:

tmplog – Most transaction flushes to the Intent Log are delayed, so some recent changes to the file system will be lost in the event of a system crash.
delaylog – Some transaction flushes are delayed.
log – Nearly all structural changes are logged before the system call returns to the application.

The tmplog, delaylog, and log options all guarantee the structural integrity of the file system in the event of a system crash by replaying the Intent Log during fsck. However, depending on the level of logging, some recent file system changes may be lost in the event of a system crash.

There are some common misunderstandings regarding the log levels. For read() and write() system calls, the log levels have no effect. For asynchronous write() system calls, the log flush is always delayed until after the system call is complete; the log flush is performed when the actual user data is written to disk. For synchronous write() system calls, the log flush is always performed prior to the completion of the write() system call. Outside of changing the file's access timestamp in the inode, a read() system call makes no structural changes and thus does not log any information.

The following table identifies which operations cause the Intent Log to be written to disk synchronously (flushed), or whether the flushing of the Intent Log is delayed until sometime after the system call is complete (delayed).


Table 4. Intent log flush behavior with VxFS 3.5 or above

Operation                       log       delaylog   tmplog
Async Write                     Delayed   Delayed    Delayed
Sync Write                      Flushed   Flushed    Flushed
Read                            n/a       n/a        n/a
Sync (fsync())                  Flushed   Flushed    Flushed
File Creation                   Flushed   Delayed    Delayed
File Removal                    Flushed   Delayed    Delayed
File Timestamp Changes          Flushed   Delayed    Delayed
Directory Creation              Flushed   Delayed    Delayed
Directory Removal               Flushed   Delayed    Delayed
Symbolic/Hard Link Creation     Flushed   Delayed    Delayed
File/Directory Renaming         Flushed   Flushed    Delayed

Most transactions that are delayed with the delaylog or tmplog options involve file and directory creation, creation of hard links or symbolic links, and inode time changes (for example, using touch(1) or the utime() system call). The major difference between delaylog and tmplog is how file renaming is performed. With the log option, most transactions are flushed to the Intent Log prior to performing the operation.

Note that using the log mount option has little to no impact on file system performance unless there are large amounts of file create/delete/rename operations. For example, if you are removing a directory with thousands of files in it, the removal will likely run faster using the delaylog mount option than the log mount option, since the flushing of the intent log is delayed after each file removal.

Also, using the delaylog mount option has little or no impact on data integrity, since the log level does not affect the read() or write() system calls. If data integrity is required, synchronous writes should be performed, as they are guaranteed to survive a system crash regardless of the log level.

The logiosize option can be used to increase the throughput of the transaction log flush by flushing up to 4 KB at a time.
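As a hedged example (the device and mount point are placeholders; valid logiosize values should be confirmed in mount_vxfs(1M) for your release), the log level and log I/O size are both chosen at mount time:

# mount -F vxfs -o delaylog,logiosize=4096 /dev/vg02/lvol4 /data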

The tranflush option causes all metadata updates (such as inodes, bitmaps, etc.) to be flushed before returning from a system call. Using this option will negatively affect performance, as all metadata updates will be done synchronously.

noatime

Each time a file is accessed, its access time (atime) is updated. This timestamp is only modified in memory and is periodically flushed to disk by vxfsd, so using noatime has virtually no impact on performance.

nomtime

The nomtime option is only used in cluster file systems to delay updating the inode modification time. This option should only be used in a cluster environment when the inode modification timestamp does not have to be up to date.


qio

The qio option enables Veritas Quick I/O for Oracle databases.

Dynamic file system tunables

File system performance can be impacted by a number of dynamic file system tunables. These values can be changed online using the vxtunefs(1M) command, or they can be set when the file system is mounted by placing the values in the /etc/vx/tunefstab file (see tunefstab(4)). Dynamic tunables that are not mentioned below should be left at their defaults.
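For example (the mount point, device path, and values are placeholders, not tuning recommendations), a tunable can be changed online with vxtunefs(1M):

# vxtunefs -o read_pref_io=262144 /fs

To make the value persistent across mounts, an equivalent /etc/vx/tunefstab entry might look like the following (see tunefstab(4) for the exact format on your release):

/dev/vg01/lvol1 read_pref_io=262144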

read_pref_io and read_nstream

The read_pref_io and read_nstream tunables are used to configure read ahead through the buffer/file cache. These tunables have no impact with direct I/O or concurrent I/O. The default values are good for most implementations. However, read_pref_io should be configured to be greater than or equal to the most common application read size for sequential reads (excluding reads that are greater than or equal to discovered_direct_iosz).

Increasing the read_pref_io and read_nstream values can be helpful in environments that can handle the increased I/O load on the SAN, such as environments with multiple Fibre Channel paths to the SAN to accommodate the increased I/O bandwidth. However, use caution when increasing these values, as performance can degrade if too much read ahead is done.

write_pref_io and write_nstream

The write_pref_io and write_nstream tunables are used to configure flush behind. These tunables have no impact with direct I/O or concurrent I/O. On HP-UX 11i v3, flush behind is disabled by default, but can be enabled with the system-wide tunable fcache_fb_policy. On 11i v3, write_pref_io and write_nstream should be set to be greater than or equal to read_pref_io and read_nstream, respectively.

max_buf_data_size

The max_buf_data_size tunable is only valid for HP-UX 11i v2 or earlier. This tunable does not impact direct I/O or concurrent I/O. The only possible values are 8 (the default) and 64. The default is good for most file systems, although file systems that are accessed with only sequential I/O will perform better if max_buf_data_size is set to 64. Avoid setting max_buf_data_size to 64 on file systems where random I/Os are performed.

Note
Only tune max_buf_data_size to 64 if all file access in a file system will be sequential.

max_direct_iosz

The max_direct_iosz tunable controls the maximum physical I/O size that can be issued with direct I/O or concurrent I/O. The default value is 1 MB and is good for most file systems. If a direct I/O read or write request is greater than max_direct_iosz, the request will be split into smaller I/O requests with a maximum size of max_direct_iosz and issued serially, which can negatively affect performance of the direct I/O request. Read or write requests greater than 1 MB may work better if max_direct_iosz is configured to be larger than the largest read or write request.


read_ahead

The default read_ahead value is 1 and is sufficient for most file systems. Some file systems with non-sequential access patterns may work best if enhanced read ahead is enabled by setting read_ahead to 2. Setting read_ahead to 0 disables read ahead. This tunable does not affect direct I/O or concurrent I/O.

discovered_direct_iosz

Read and write requests greater than or equal to discovered_direct_iosz are performed as direct I/O. This tunable has no impact if the file system is already mounted for direct I/O or concurrent I/O. The discovered_direct_iosz value should be increased if data needs to be cached for large read and write requests, or if read ahead or flush behind is needed for large read and write requests.

max_diskq

The max_diskq tunable controls how much data from a single file can be flushed to disk at one time. The value must be greater than or equal to 4 * write_pref_io; the default is 1 MB. To avoid delays when flushing dirty data from the cache, setting max_diskq to 1 GB is recommended, especially for cached storage arrays.

write_throttle

The write_throttle tunable controls how much data from a single file can be dirty at one time. The default value is 0, which means the write_throttle feature is disabled. The default value is recommended for most file systems.

Extent allocation policies

Since most applications use a write size of 8 KB or less, the first extent is often the smallest. If the file system holds mostly large files, then increasing the initial_extent_size tunable can reduce file fragmentation by allowing the first extent allocation to be larger. However, increasing the initial_extent_size may actually increase fragmentation if many small files are created, since the initial extents would be larger than those files need.



On HP-UX 11i v2 and earlier, the dynamic buffer cache is sized by the dbc_min_pct and dbc_max_pct kernel tunables. The buffer cache grows quickly as new buffers are needed, but it is slow to shrink, as memory pressure must be present in order to shrink the buffer cache.

Figure 13. HP-UX 11i v2 buffer cache tunables

The buffer cache should be configured large enough to contain the most frequently accessed data. However, processes often read large files once (for example, during a file copy), causing more frequently accessed pages to be flushed or invalidated from the buffer cache.

The advantage of the buffer cache is that frequently referenced data can be accessed through memory without requiring disk I/O. Also, data being read from disk or written to disk can be done asynchronously.

The disadvantage of the buffer cache is that data in the cache may be lost during a system failure. Also, all the buffers associated with a file system must be flushed to disk or invalidated when the file system is synchronized or unmounted; a large buffer cache can cause delays during these operations.

If data needs to be accessed once without keeping it in the cache, options such as direct I/O, discovered direct I/O, or the VX_SETCACHE ioctl with the VX_NOREUSE option may be used. For example, rather than using cp(1) to copy a large file, try using dd(1) with a large block size as follows:

# dd if=srcfile of=destfile bs=1024k

On HP-UX 11i v2 and earlier, cp(1) reads and writes data using 64 KB logical I/O. Using dd, data can be read and written using 256 KB I/Os or larger. The large I/O size will cause dd to engage the Discovered Direct I/O feature of HP OnlineJFS, and the data will be transferred using large direct I/O.

There are several advantages of using dd(1) over cp(1):

The transfer can be done using a larger I/O transfer size.

The buffer cache is bypassed, leaving other more important data in the cache.

Data is written synchronously instead of asynchronously, avoiding a large buildup of dirty buffers, which can potentially cause large I/O queues or cause processes that call sync()/fsync() to hang temporarily.

Note
On HP-UX 11i v3, the cp(1) command was changed to perform 1024 KB logical reads instead of 64 KB, which triggers discovered direct I/O similar to the dd(1) command mentioned above.


Unified File Cache on HP-UX 11i v3

filecache_min / filecache_max

The Unified File Cache provides a file caching function similar to the HP-UX buffer cache, but it is managed much differently. As mentioned earlier, the UFC is page-based rather than buffer-based. HP-UX 11i v2 maintained separate caches: a buffer cache for standard file access, and a page cache for memory-mapped file access. The UFC caches both normal file data and memory-mapped file data, resulting in a unified file and page cache.

The size of the UFC is managed through the kernel tunables filecache_min and filecache_max. Similar to dbc_max_pct, the default filecache_max value is very large (49% of physical memory), which can result in unnecessary memory pressure and paging of non-cache data. The filecache_min and filecache_max values should be evaluated based on how the system is used. For example, a system may have 128 GB of memory, but the primary application is Oracle and the database files are mounted for direct I/O. In that case, the UFC should be configured to handle only the non-Oracle load, for example by setting filecache_max to 4 GB of memory.
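As a hedged illustration of the 4 GB example above (the values are in bytes and are placeholders only; kctune(1M) reports whether a change can be applied without a reboot), the UFC limits can be adjusted with kctune:

# kctune filecache_min=1073741824
# kctune filecache_max=4294967296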

fcache_fb_policy

The fcache_fb_policy tunable controls the behavior of file system flush behind. The default value of 0 disables file system flush behind. A value of 1 enables file system flush behind via the fb_daemon threads, and a value of 2 enables “inline” flush behind, where the process that dirties the pages also initiates the asynchronous writes (similar to the behavior in 11i v2).

VxFS metadata buffer cache

vxfs_bc_bufhwm

File system metadata is the structural information in the file system, and includes the superblock, inodes, directory blocks, bitmaps, and the Intent Log. Beginning with VxFS 3.5, a separate metadata buffer cache is used to cache the VxFS metadata. The separate metadata cache allows VxFS to do some special processing on metadata buffers, such as shared read locks, which can improve performance by allowing multiple readers of a single buffer.

By default, the maximum size of the VxFS metadata buffer cache scales with the amount of physical memory in the system, according to the table below:

Table 5. Size of VxFS metadata buffer cache

Memory Size (MB)    VxFS Metadata Buffer Cache (KB)    VxFS Metadata Buffer Cache as a percent of memory
256                 32000                              12.2%
512                 64000                              12.2%
1024                128000                             12.2%
2048                256000                             12.2%
8192                512000                             6.1%
32768               1024000                            3.0%
131072              2048000                            1.5%


The kernel tunable vxfs_bc_bufhwm specifies the maximum amount of memory in kilobytes (the high water mark) to allow for the metadata buffer pages. By default, vxfs_bc_bufhwm is set to zero, which means the default maximum size is based on the physical memory size (see Table 5).

The vxfsstat(1M) command with the -b option can be used to verify the size of the VxFS metadata buffer cache. For example:

# vxfsstat -b /
: :
buffer cache statistics
120320 Kbyte current 544768 maximum
98674531 lookups 99.96% hit rate
3576 sec recycle age [not limited by maximum]

Note
The VxFS metadata buffer cache is memory allocated in addition to the HP-UX Buffer Cache or Unified File Cache.
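If the metadata cache needs to be capped, a hedged sketch (the value is in kilobytes and is purely illustrative) would be:

# kctune vxfs_bc_bufhwm=262144

Setting vxfs_bc_bufhwm back to 0 returns to the memory-based defaults shown in Table 5.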

VxFS inode cache

vx_ninode

VxFS maintains a cache of the most recently accessed inodes in memory. The VxFS inode cache is separate from the HFS inode cache, and it is dynamically sized: the cache grows as new inodes are accessed and contracts when old inodes are not referenced. At least one inode entry must exist in the JFS inode cache for each file that is open at a given time.

While the inode cache is dynamically sized, there is a maximum size for the VxFS inode cache. The default maximum size is based on the amount of memory present; for example, a system with 2 to 8 GB of memory will have a maximum of 128,000 inodes. The maximum number of inodes can be tuned using the system-wide tunable vx_ninode. Most systems do not need such a large VxFS inode cache, and a value of 128000 is recommended.

Note also that the HFS inode cache tunable ninode has no effect on the size of the VxFS inode cache. If /stand is the only HFS file system in use, ninode can be tuned lower (to 400, for example).

The vxfsstat(1M) command with the -i option can be used to verify the size of the VxFS inode cache:

# vxfsstat -i /
: :
inode cache statistics
58727 inodes current 128007 peak 128000 maximum
9726503 lookups 86.68% hit rate
6066203 inodes alloced 5938196 freed
927 sec recycle age
1800 sec free age

While the inode cache contains many inodes (58727 in the example above), not all of them are actually in use. The vxfsstat(1M) command with the -v option can be used to verify the number of inodes in use:

# vxfsstat -v / | grep inuse
vxi_icache_inuseino 1165 vxi_icache_maxino 128000


Note
When setting vx_ninode to reduce the size of the JFS inode cache, use the -h option with kctune(1M) to hold the change until the next reboot. This prevents temporary hangs or Serviceguard TOC events that can occur when the vxfsd daemon becomes very active shrinking the JFS inode cache.

vxfs_ifree_timelag

By default, the VxFS inode cache is dynamically sized. The inode cache typically expands very rapidly, then shrinks over time. The constant resizing of the cache results in additional memory consumption and memory fragmentation, as well as additional CPU used to manage the cache.

The recommended value for vxfs_ifree_timelag is -1, which allows the inode cache to expand to its maximum size based on vx_ninode, but prevents the inode cache from dynamically shrinking. This behavior is similar to the former HFS inode cache behavior.

Note
To reduce memory usage and memory fragmentation, set vxfs_ifree_timelag to -1 and vx_ninode to 128000 on most systems.
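Applying both recommendations with kctune(1M) might look like the following sketch (vx_ninode is held until the next reboot with -h, as noted above):

# kctune -h vx_ninode=128000
# kctune vxfs_ifree_timelag=-1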

Directory Name Lookup Cache

The Directory Name Lookup Cache (DNLC) is used to improve directory lookup performance. By default, the DNLC contains a number of directory and file name entries in a cache sized from the default size of the JFS inode cache. Prior to VxFS 5.0 on 11i v3, the size of the DNLC could be increased by raising the vx_ninode value, but it could not be decreased. With VxFS 5.0 on 11i v3, the DNLC can be increased or decreased by tuning the vx_ninode value. However, the size of the DNLC is not dynamic, so a change in size takes effect only after the next reboot.

The DNLC is searched first, prior to searching the actual directories. Only filenames with fewer than 32 characters can be cached. The DNLC may not help if an entire large directory cannot fit into the cache, and an ll(1) or find(1) of a large directory could push out other more useful entries in the cache. Also, the DNLC does not help when adding a new file to a directory or when searching for a non-existent directory entry.

When a file system is unmounted, all of the DNLC entries associated with the file system must be purged. If a file system has several thousand files and the DNLC is configured to be very large, a delay could occur when the file system is unmounted.

The vxfsstat(1M) command with the -i option can be used to verify the size of the VxFS DNLC:

# vxfsstat -i /
: :
Lookup, DNLC & Directory Cache Statistics
337920 maximum entries in dnlc
1979050 total lookups 90.63% fast lookup
1995410 total dnlc lookup 98.57% dnlc hit rate
25892 total enter 54.08 hit per enter
0 total dircache setup 0.00 calls per setup
46272 total directory scan 15.29% fast directory scan

While the size of the DNLC can be tuned using the vx_ninode tunable, the primary purpose of vx_ninode is to tune the JFS inode cache, so follow the same recommendations for tuning the size of the JFS inode cache discussed earlier.


VxFS ioctl() options

Cache advisories

While the mount options allow you to change the cache advisories on a per-file system basis, the VX_SETCACHE ioctl() allows an application to change the cache advisory on a per-file basis. The following options are available with the VX_SETCACHE ioctl:

VX_RANDOM - Treat reads as random I/O and do not perform read ahead.
VX_SEQ - Treat reads as sequential and perform the maximum amount of read ahead.
VX_DIRECT - Bypass the buffer cache. All I/O to the file is synchronous. The application buffer must be aligned on a word (4 byte) address, and the I/O must begin on a block (1024 byte) boundary.
VX_NOREUSE - Invalidate the buffer immediately after use. This option is useful for data that is not likely to be reused.
VX_DSYNC - Data is written synchronously, but the file's timestamps in the inode are not flushed to disk. This option is similar to using the O_DSYNC flag when opening the file.
VX_UNBUFFERED - Same behavior as VX_DIRECT, but updating the file size in the inode is done asynchronously.
VX_CONCURRENT - Enable concurrent I/O on the specified file.

See the vxfsio(7) manpage for more information. The HP OnLineJFS product is required to use the VX_SETCACHE ioctl().

Allocation policies

The VX_SETEXT ioctl() passes 3 parameters: a fixed extent size to use, the amount of space to reserve for the file, and an allocation flag defined below.

VX_NOEXTEND - Writes will fail if an attempt is made to extend the file past the current reservation.
VX_TRIM - Trim the reservation when the last close of the file is performed.
VX_CONTIGUOUS - The reservation must be allocated in a single extent.
VX_ALIGN - All extents must be aligned on an extent-sized boundary.
VX_NORESERVE - Reserve space, but do not record the reservation space in the inode. If the file is closed or the system fails, the reservation is lost.
VX_CHGSIZE - Update the file size in the inode to reflect the reservation amount. This allocation does not initialize the data, and reads from the uninitialized portion of the file will return unknown contents. Root permission is required.
VX_GROWFILE - Update the file size in the inode to reflect the reservation amount. Subsequent reads from the uninitialized portion of the file will return zeros. This option creates Zero Fill On Demand (ZFOD) extents. Writing to ZFOD extents negates the benefit of concurrent I/O.

Reserving file space ensures that you do not run out of space before you are done writing to the file, and the overhead of allocating extents is paid up front. Using a fixed extent size can also reduce fragmentation.

The setext(1M) command can also be used to set the extent allocation and reservation policies for a file.
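As a hedged sketch (the file name and sizes are placeholders; the -e extent size and -r reservation are specified in file system blocks, and the exact flag names should be confirmed in setext(1M) for your release), a fixed extent size with a contiguous reservation might be set as follows:

# setext -F vxfs -e 1024 -r 262144 -f contig /oradata/datafile01.dbf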

See the vxfsio(7) and setext(1M) manpages for more information. The HP OnLineJFS product is required to use the VX_SETEXT ioctl().


Patches

Performance problems are often resolved by patches to the VxFS subsystem, such as the patches for the VxFS read ahead issues on HP-UX 11i v3. Be sure to check the latest patches for fixes to various performance related problems.

Summary

There is no single set of tunable values that applies to every system. You must understand how your application accesses data in the file system to decide which options and tunables can be changed to maximize the performance of your file system. The VxFS tunables and options discussed in this paper are summarized below.

Table 6. List of VxFS tunables and options

newfs / mkfs options: bsize, logsize, version

Mount options: blkclear, datainlog, nodatainlog, mincache, convosync, cio, remount, tmplog, delaylog, log, logiosize, tranflush, noatime, nomtime, qio

File system tunables: read_pref_io, read_nstream, read_ahead, write_pref_io, write_nstream, max_buf_data_size, max_direct_iosz, discovered_direct_iosz, max_diskq, write_throttle, initial_extent_size, max_seqio_extent_size, qio_cache_enable

System wide tunables: nbuf, bufpages, dbc_max_pct, dbc_min_pct, filecache_min, filecache_max, fcache_fb_policy, vxfs_bc_bufhwm, vx_ninode, vxfs_ifree_timelag

Per-file attributes: VX_SETCACHE, VX_SETEXT



For additional information

For additional reading, please refer to the following documents:

HP-UX VxFS mount options for Oracle Database environments,
http://h20195.www2.hp.com/V2/GetDocument.aspx?docname=4AA1-9839ENW&cc=us&lc=en

Common Misconfigured HP-UX Resources,
http://bizsupport1.austin.hp.com/bc/docs/support/SupportManual/c01920394/c01920394.pdf

Veritas File System 5.0.1 Administrator's Guide (HP-UX 11i v3),
http://h20000.www2.hp.com/bc/docs/support/SupportManual/c02220689/c02220689.pdf

Performance Improvements using Concurrent I/O on HP-UX 11i v3 with OnlineJFS 5.0.1 and the HP-UX Logical Volume Manager,
http://h20195.www2.hp.com/V2/GetDocument.aspx?docname=4AA1-5719ENW&cc=us&lc=en

To help us improve our documents, please provide feedback at
http://h20219.www2.hp.com/ActiveAnswers/us/en/solutions/technical_tools_feedback.html.

© Copyright 2004, 2011 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.

Oracle is a registered trademark of Oracle and/or its affiliates.

c01919408, Created August 2004; Updated January 2011, Rev. 3
