
DB2 UDB for z/OS Version 8 Performance Topics - IBM Redbooks


Choosing between UTF-8 and UTF-16

When you store data in a Unicode database, you can use either UTF-8 (CHAR) or UTF-16 (GRAPHIC). The choice involves both performance and non-performance considerations. For example, because the collating sequences of UTF-8 and UTF-16 differ, you can choose column types based on the desired collating sequence rather than on performance.
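To make the collating difference concrete, here is a minimal Python sketch (not DB2 itself) that sorts the same two characters by their encoded bytes. It assumes a plain binary comparison of the stored bytes and big-endian UTF-16. The two sample characters are chosen only for illustration: U+FF5E lies in the Basic Multilingual Plane, while U+10000 needs a surrogate pair in UTF-16.

```python
# Illustration only: compare the binary sort order of UTF-8 and UTF-16 bytes.
# The sample characters are assumptions chosen to expose the difference.
samples = ["\uff5e", "\U00010000"]  # a BMP character and a supplementary one


def binary_order(chars, encoding):
    """Return the characters sorted by their encoded byte strings."""
    return sorted(chars, key=lambda ch: ch.encode(encoding))


utf8_order = binary_order(samples, "utf-8")
utf16_order = binary_order(samples, "utf-16-be")

print("UTF-8 order: ", [f"U+{ord(ch):04X}" for ch in utf8_order])
print("UTF-16 order:", [f"U+{ord(ch):04X}" for ch in utf16_order])
```

In UTF-8 the supplementary character sorts last, because its lead byte is the highest. In UTF-16 it sorts first, because its surrogate pair starts with a byte in the D8 range, which is lower than FF.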

However, the major performance consideration when comparing UTF-8 to UTF-16 is the amount of space used. Space matters not just because of the cost of the storage itself, but also because of its impact on I/O-bound utilities and queries, including the time to recover, which affects data availability. Space also affects buffer pool hit ratios.

A UTF-8 character is variable length, from 1 to 4 bytes, and the first 128 code points match 7-bit ASCII. DB2 uses UTF-8 to store CHAR and VARCHAR columns. By contrast, a UTF-16 character is 2 or 4 bytes. DB2 uses UTF-16 to store GRAPHIC and VARGRAPHIC columns. The first 128 code points match UTF-8 and ASCII, while the rest have a different collating sequence.
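The byte widths described above can be checked with a short Python sketch. It measures the encodings themselves, not DB2 storage, and the sample characters are assumptions picked to cover each width.

```python
# Illustration only: bytes needed per character in UTF-8 and UTF-16.
samples = {
    "A": "Latin letter (7-bit ASCII)",
    "\u00e9": "Latin letter with accent",
    "\u65e5": "CJK ideograph",
    "\U0001F600": "supplementary-plane character",
}


def encoded_width(ch, encoding):
    """Number of bytes one character occupies in the given encoding."""
    return len(ch.encode(encoding))


for ch, label in samples.items():
    utf8 = encoded_width(ch, "utf-8")
    # "utf-16-be" is used so that no 2-byte byte order mark is counted.
    utf16 = encoded_width(ch, "utf-16-be")
    print(f"{label:32} UTF-8: {utf8}  UTF-16: {utf16}")
```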

Consider the type of data you will be storing in DB2 and choose UTF-8 (CHAR) or UTF-16 (GRAPHIC) depending on the characteristics of the data. UTF-8 is efficient for text that is predominantly English alphanumeric, while UTF-16 is more efficient for Asian characters.
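One way to apply this guidance is to measure a representative sample of the data before choosing a column type. The Python sketch below is a hypothetical helper, not a DB2 tool. The function names and sample values are assumptions, and it compares only raw encoded sizes, ignoring compression and column overhead.

```python
def storage_bytes(text):
    """Return (utf8_bytes, utf16_bytes) for a string, without a byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


def smaller_encoding(sample_values):
    """Pick the encoding with the smaller total size for the sample values."""
    utf8_total = sum(storage_bytes(v)[0] for v in sample_values)
    utf16_total = sum(storage_bytes(v)[1] for v in sample_values)
    return "UTF-8 (CHAR)" if utf8_total <= utf16_total else "UTF-16 (GRAPHIC)"


# Made-up sample values for illustration.
english = ["Performance Topics", "Version 8"]
japanese = ["\u6027\u80fd", "\u30d0\u30fc\u30b8\u30e7\u30f3"]

print(smaller_encoding(english))
print(smaller_encoding(japanese))
```

A real assessment would sample actual column contents, since mixed data (for example, Asian text with many embedded ASCII digits) can shift the balance either way.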

To reduce the space required by Unicode tables, you can use DB2 compression. With no DB2 compression, and for alphanumeric characters, UTF-8 has the same space characteristics as EBCDIC, whereas UTF-16 doubles the storage requirements. Compressed alphanumeric UTF-16 strings use more storage than compressed alphanumeric UTF-8 strings, but the ratio is generally much less than 2 to 1. Tests have shown this ratio can be less than 1.5 to 1. Whether compression is used or not, the overall effect of Unicode on space usage depends on the amount of text data versus non-text data.
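The effect of compression on the UTF-16 to UTF-8 ratio can be explored with a rough Python sketch. Note the assumptions: zlib (DEFLATE) stands in for DB2 compression, which uses a different algorithm, and the rows are synthetic. The numbers it prints therefore say nothing about what DB2 would achieve. It only shows how the comparison can be set up.

```python
import zlib

# Synthetic alphanumeric rows; the contents are made up for illustration.
rows = [
    f"CUSTOMER {i:06d} ORDER STATUS SHIPPED REGION {i % 7}"
    for i in range(2000)
]
text = "\n".join(rows)

raw8 = text.encode("utf-8")
raw16 = text.encode("utf-16-be")  # no byte order mark

# zlib is a stand-in here; it is NOT the algorithm DB2 uses.
z8 = zlib.compress(raw8)
z16 = zlib.compress(raw16)

print(f"uncompressed UTF-16/UTF-8 ratio: {len(raw16) / len(raw8):.2f}")
print(f"compressed   UTF-16/UTF-8 ratio: {len(z16) / len(z8):.2f}")
```

For pure ASCII text the uncompressed ratio is exactly 2, because every character takes one byte in UTF-8 and two in UTF-16. The compressed ratio depends on the data and on the compressor.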

CPU time is another consideration. Compression itself costs a lot of CPU time, although its I/O benefits can outweigh the CPU costs. In some applications Unicode conversion can be avoided, but if conversion must be done, it is best done by a remote client in order to distribute the CPU overhead. If conversion must be done on the zSeries machine, then you should consider what the dominant application CCSID will be. Generally, the choice that tends to minimize space usage also tends to minimize CPU usage.

The choice between fixed and variable length strings is another consideration. Remember, the length of a UTF-8 string is unpredictable unless the application limits the characters to the common set and uses fixed length strings.

Consider, for example, a column that was defined as CHAR(1) in EBCDIC. If you try to insert a multibyte UTF-8 character into CHAR(1), you get SQLCODE -302, because the data does not fit into one byte. If you change the column definition to GRAPHIC(1), you need two bytes. If you change the column definition to VARCHAR, you need 3 bytes in order to account for the two-byte length field. Knowing that a field is restricted to alphanumeric characters allows you to continue using CHAR(1).
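The CHAR(1) example above can be sketched in Python to check candidate values before a migration. The helper names are hypothetical. The sketch models only the byte arithmetic described in the text (UTF-8 bytes for CHAR and VARCHAR, UTF-16 bytes for GRAPHIC, plus a two-byte VARCHAR length field) and does not call DB2.

```python
VARCHAR_LENGTH_PREFIX = 2  # two-byte length field, as described in the text


def fits_char(value, n):
    """True if the value's UTF-8 encoding fits in a CHAR(n) column of n bytes."""
    return len(value.encode("utf-8")) <= n


def bytes_needed(value):
    """Bytes needed to hold one value under each column choice."""
    utf8_len = len(value.encode("utf-8"))
    utf16_len = len(value.encode("utf-16-be"))  # no byte order mark
    return {
        "CHAR": utf8_len,
        "GRAPHIC": utf16_len,
        "VARCHAR": utf8_len + VARCHAR_LENGTH_PREFIX,
    }


print(fits_char("A", 1))       # alphanumeric: one byte in UTF-8
print(fits_char("\u00e9", 1))  # accented letter: two bytes in UTF-8
print(bytes_needed("A"))
```

A value that fails `fits_char(value, 1)` is the kind of data that would be rejected on insert into a Unicode CHAR(1) column.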

It is a good idea to avoid variable length fields in indexes, especially nonpadded indexes, since key comparisons of variable length fields are more expensive. On the other hand, you should try to avoid padding costs with fixed length fields if Unicode conversion is being done.

Remember, there is nothing stopping you from storing both UTF-8 and UTF-16 encoded data in the same table.

Chapter 4. DB2 subsystem performance 183
