Beginning Microsoft SQL Server 2008 ... - S3 Tech Training

cdn.s3techtraining.com
17.06.2013

Chapter 9: SQL Server Storage and Index Structures

The Pros

Clustered indexes are best for queries in which the column(s) in question will frequently be the subject of a range query. This kind of query is typified by use of the BETWEEN operator or the < and > operators. Queries that use a GROUP BY and make use of the MAX, MIN, and COUNT aggregates are also great examples of queries that use ranges and love clustered indexes. Clustering works well here because the search can go straight to a particular point in the physical data, keep reading until it gets to the end of the range, and then stop. It is extremely efficient.

Clusters can also be excellent when you want your data sorted (using ORDER BY) based on the cluster key.

The Cons

There are two situations in which you don’t want to create that clustered index. The first is fairly obvious: when there’s a better place to use it. I know I’m sounding repetitive here, but don’t use a clustered index on a column just because it seems like the thing to do (primary keys are the common culprit here). Be sure that you don’t have another column that it’s better suited to first.

Perhaps the much bigger no-no for clustered indexes, however, is a table that is going to see a lot of inserts in non-sequential order. Remember that concept of page splits? Well, here’s where it can come back and haunt you big time.

Imagine this scenario: You are creating an accounting system. You would like to make use of the concept of a transaction number for your primary key in your transaction files, but you would also like those transaction numbers to be somewhat indicative of what kind of transaction it is (it really helps your accountants with troubleshooting). So you come up with something of a scheme: you’ll place a prefix on all the transactions indicating which sub-system they come from.
They will look something like this:

ARXXXXXX    Accounts Receivable Transactions
GLXXXXXX    General Ledger Transactions
APXXXXXX    Accounts Payable Transactions

where XXXXXX will be a sequential numeric value.

This seems like a great idea, so you implement it, leaving the default of the clustered index going on the primary key.

At first glance, everything about this setup looks fine. You’re going to have unique values, and the accountants will love the fact that they can infer where something came from based on the transaction number. The clustered index seems to make sense since they will often be querying for ranges of transaction IDs.

Ah, if only it were that simple. Think about your inserts for a bit. With a clustered index, we originally had a nice mechanism to avoid much of the overhead of page splits. When a new record was inserted that was to go after the last record in the table, then even if there was a page split, only that record would go to the new page; SQL Server wouldn’t try to move any of the old data around. Now we’ve messed things up though.

New records inserted from the General Ledger will wind up going on the end of the file just fine (GL is last alphabetically, and the numbers will be sequential). The AR and AP transactions have a major problem though: they are going to be doing non-sequential inserts. When AP000025 gets inserted and there isn’t room on the page, SQL Server is going to see AR000001 in the table and know that it’s not a sequential insert. Half the records from the old page will be copied to a new page before AP000025 is inserted.

The overhead of this can be staggering. Remember that we’re dealing with a clustered index, and that the clustered index is the data. The data is in index order. This means that when index entries move to a new page, the data rows move with them. Now imagine that you’re running this accounting system in a typical OLTP environment (you don’t get much more OLTP-like than an accounting system) with a bunch of data-entry people keying in vendor invoices or customer orders as fast as they can. You’re going to have page splits occurring constantly, and every time you do, you’re going to see a brief hesitation for users of that table while the system moves data around.

Fortunately, there are a couple of ways to avoid this scenario:

❑ Choose a cluster key that is going to be sequential in its inserting. You can either create an identity column for this, or you may have another column that is logically sequential for any transaction entered, regardless of the system.

❑ Choose not to use a clustered index on this table. This is often the best option in a situation like this, since an insert into a non-clustered index on a heap is usually faster than one on a cluster key.

Even as I’ve told you to lean toward sequential cluster keys to avoid page splits, you also have to realize that there’s a cost there. One downside of sequential cluster keys is concurrency: because new rows all go to the end of the table, you have two or more people trying to get to the same object at the same time. It’s all about balancing out what you want, what you’re doing, and what it’s going to cost you elsewhere.

This is perhaps one of the best examples of why I have gone into so much depth about how things work.
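
To get a feel for the difference between the two key schemes, here is a minimal sketch in Python. It is a toy model and not SQL Server’s actual storage engine: the class name ToyClusteredIndex, the four-row PAGE_CAPACITY, and the interleaved arrival order of AR, GL, and AP transactions are all assumptions made up for illustration. It only mimics the two behaviors described above: an insert past the last record gets a fresh page to itself, while an insert into a full page elsewhere moves half of that page’s rows.

```python
from bisect import insort

PAGE_CAPACITY = 4  # hypothetical; a real page holds however many rows fit


class ToyClusteredIndex:
    """Toy model: an ordered list of pages, each a sorted list of keys."""

    def __init__(self):
        self.pages = [[]]
        self.mid_splits = 0  # splits that had to move existing rows
        self.rows_moved = 0  # rows copied to a new page by those splits

    def insert(self, key):
        # Pick the first page whose last key is >= key, else the last page.
        idx = len(self.pages) - 1
        for i, page in enumerate(self.pages):
            if page and key <= page[-1]:
                idx = i
                break
        page = self.pages[idx]
        if len(page) < PAGE_CAPACITY:
            insort(page, key)
            return
        if idx == len(self.pages) - 1 and key > page[-1]:
            # Sequential insert at the end: the new page holds only this row.
            self.pages.append([key])
            return
        # Non-sequential insert into a full page: move half the rows first.
        half = len(page) // 2
        new_page = page[half:]
        del page[half:]
        self.pages.insert(idx + 1, new_page)
        self.mid_splits += 1
        self.rows_moved += len(new_page)
        insort(page if key <= page[-1] else new_page, key)


def interleaved_keys(rounds):
    """Yield AR, GL, and AP keys in turn, as if three sub-systems were busy."""
    for n in range(1, rounds + 1):
        for prefix in ("AR", "GL", "AP"):
            yield f"{prefix}{n:06d}"


def load(keys):
    index = ToyClusteredIndex()
    for key in keys:
        index.insert(key)
    return index


prefixed = load(interleaved_keys(100))
sequential = load(f"{n:06d}" for n in range(1, 301))  # identity-style key

print("prefixed:  ", prefixed.mid_splits, "splits,", prefixed.rows_moved, "rows moved")
print("sequential:", sequential.mid_splits, "splits,", sequential.rows_moved, "rows moved")
```

In this model the sequential load never moves a row, while the prefixed load keeps splitting pages in the AR and AP ranges. The exact counts depend entirely on the made-up page size and arrival order, so treat them as a picture of the pattern and not as a measurement.
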
You need to think through how things are actually going to get done before you have a good feel for what the right index to use (or not to use) is.

Column Order Matters

Just because an index has two columns, it doesn’t mean that the index is useful for any query that refers to either column.

As a rule, an index is only considered for use if the first column listed in the index is used in the query. The bright side is that there doesn’t have to be an exact one-for-one match to every column, just the first. Naturally, the more columns that match (in order), the better, but only a missing first column creates a definite do-not-use situation.

Think about things this way. Imagine that you are using a phone book. Everything is indexed by last name and then first name. Does this sorting do you any real good if all you know is that the person you want to call is named Fred? On the other hand, if all you know is that his last name is Blake, the index will still serve to narrow the field for you.

One of the more common mistakes that I see in index construction is the belief that one index that includes all the columns is going to be helpful for all situations. Indeed, what you’re really doing is storing all the data a second time. For practical purposes, the index will be ignored if the first column of the index isn’t mentioned in the JOIN, ORDER BY, or WHERE clauses of the query.
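
The phone book analogy can be sketched in a few lines of Python. This is only an illustration of why the leading column matters, under the assumption that a two-column index behaves like a list sorted by (last name, first name); the names in PHONE_BOOK and both helper functions are hypothetical, and nothing here reflects how SQL Server’s optimizer actually chooses an index.

```python
from bisect import bisect_left

# Hypothetical entries, sorted by (last_name, first_name), the order a
# two-column index on last name and then first name would keep them in.
PHONE_BOOK = sorted([
    ("Adams", "Fred"),
    ("Blake", "Alice"),
    ("Blake", "Fred"),
    ("Blake", "Zoe"),
    ("Carter", "Fred"),
    ("Young", "Bob"),
])


def find_by_last_name(last_name):
    """Seek: binary-search to the first match, then read until the name changes."""
    start = bisect_left(PHONE_BOOK, (last_name,))
    matches = []
    for entry in PHONE_BOOK[start:]:
        if entry[0] != last_name:
            break
        matches.append(entry)
    return matches


def find_by_first_name(first_name):
    """No seek possible: the sort order says nothing about first names alone,
    so every entry has to be examined."""
    return [entry for entry in PHONE_BOOK if entry[1] == first_name]


print(find_by_last_name("Blake"))   # jumps straight to the Blakes and stops after them
print(find_by_first_name("Fred"))   # has to look at all six entries
```

Both lookups return the right answer; the difference is the work involved. Knowing only the last name still lets the search jump to one spot and stop early, while knowing only the first name gets no help from the ordering at all.
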

