
Programming Languages (CS302 2007S)

Codd - A Relational Model for Large Shared Data Banks

Comments on:

Codd, E. F. (1970). A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13(6), June 1970.

At what stage is the normalization procedure carried out on a relation? Why is it not possible to enter data (or form relations) in a normal form as an initialization step?

Remember, this was theory and not yet practice. The key idea was that people might think naturally in terms of nested relations (because there was some of that kind of approach in the more implementation-oriented database designs), which made it important to show that you could do without nested relations. In practice, one typically writes non-nested relations from the get-go.
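The normalization Codd describes is mechanical enough to sketch in code. This is my own illustration, not the paper's; the employee/jobhistory names echo the paper's example, but the data and the Python representation are invented.

```python
# Nested form: each employee tuple carries a subrelation of job history.
nested = [
    {"man": 1, "name": "Ada", "jobhistory": [("1965", "engineer"), ("1968", "manager")]},
    {"man": 2, "name": "Ben", "jobhistory": [("1970", "analyst")]},
]

# Normalized form: one flat relation per level, linked by the primary key man#.
employee = [(e["man"], e["name"]) for e in nested]
jobhistory = [(e["man"], date, title)
              for e in nested
              for (date, title) in e["jobhistory"]]

print(employee)    # [(1, 'Ada'), (2, 'Ben')]
print(jobhistory)  # [(1, '1965', 'engineer'), (1, '1968', 'manager'), (2, '1970', 'analyst')]
```

Note that the only cost of flattening is repeating the key man# once per job-history row.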

I do not fully understand the join operation described in section 2.1.3. In the first paragraph of page 384, the author writes that the join of "R with S" is different from the join of "S with R". Why is this the case?

Here's a simple one. Suppose R = { (1, x), (1, y), (1, z) } and S = { (a, 1), (b, 1) }. Since no right value in R is a left value in S, the join of R and S is the empty set. In contrast, the natural join of S and R is { (a, 1, x), (a, 1, y), (a, 1, z), (b, 1, x), (b, 1, y), (b, 1, z) }.
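The asymmetry can be checked mechanically. This is a sketch of my own, with pairs invented so that the join is empty in one direction but not the other.

```python
# Two binary relations: R's right values never appear as left values of S,
# but S's right values all appear as left values of R.
R = {(1, "x"), (1, "y"), (1, "z")}
S = {("a", 1), ("b", 1)}

def join(A, B):
    """Codd-style join of binary relations: match the right column of A
    against the left column of B, producing triples."""
    return {(a, m, b) for (a, m) in A for (m2, b) in B if m == m2}

print(join(R, S))       # set() -- no right value of R is a left value of S
print(len(join(S, R)))  # 6 -- every pair of S matches every pair of R
```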

Also, how does the join in the paper relate to the different joins (left, right, inner, outer) found in MySQL?

I'll admit that I don't know the different joins in SQL.
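For reference, here is a minimal sketch of my own of two standard SQL join flavors, run through Python's sqlite3 module; the table and column names are invented. An inner join keeps only rows whose keys match in both tables (the closest analogue of Codd's join), while a left outer join keeps every row of the left table, padding with NULL where the right table has no match. Right and full outer joins are the symmetric variants.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE r(k INTEGER, rv TEXT);
    CREATE TABLE s(k INTEGER, sv TEXT);
    INSERT INTO r VALUES (1, 'x'), (2, 'y');
    INSERT INTO s VALUES (1, 'a'), (3, 'b');
""")

# INNER JOIN: only rows with matching keys in both tables.
inner = con.execute(
    "SELECT r.k, rv, sv FROM r JOIN s ON r.k = s.k ORDER BY r.k").fetchall()
# LEFT OUTER JOIN: every row of r, with NULL where s has no match.
left = con.execute(
    "SELECT r.k, rv, sv FROM r LEFT JOIN s ON r.k = s.k ORDER BY r.k").fetchall()

print(inner)  # [(1, 'x', 'a')]
print(left)   # [(1, 'x', 'a'), (2, 'y', None)]
```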

This paper was written in 1970; how has the relational model evolved over time? Do many of the approaches noted in the paper (e.g., the method for dealing with inconsistency) still apply?

The key aspects of the model don't seem to have changed significantly, although some subtleties have been introduced. I think we've found that normal form is much more natural than he'd predicted.

Lastly, the author argues against tree-structured files and network models. Does this mean that relational databases do not store data in tree structures, or is he referring to the external presentation as opposed to the internal representation of databases? I ask this because I think I learned that 2-4 trees are one of the most efficient ways to store data, though I may be mistaken.

He is arguing for a clear distinction between the abstract interface and the underlying implementation. And, in some ways, 2-3 trees (and, I suppose, 2-4 trees, although I know little about them) are appropriate for some tasks, but not every task. For example, I expect that a tree representation makes joins difficult.



I must confess that for most of the reading for tonight, I had no idea what he was talking about. I am interested in how the previous data bank models worked and what was so disadvantageous about them.

The previous models focused much more on the underlying structure. If you think about how you might arrange a collection of data without using a DBMS, you're on your way toward some of those underlying models. The disadvantages were the obvious ones: because programmers knew enough about the implementation, they tended to program to it, so when the implementation changed, they ended up with non-working or badly-working code. Also, some model choices ended up working well for some kinds of data and less well for others. The abstraction that Codd emphasizes makes it easier to have the implementation adapt to the particular data or queries.

I am also unsure as to the relationship between the sublanguage and the expressible set. Are they the same thing? Or does the sublanguage describe the expressible set?

The sublanguage is the language used (in Codd's apparent intent, as a kind of library for another language) to describe the relations. The expressible set is the set of relations describable by the sublanguage.

Also, what, exactly, does Codd mean when he says that the universality of the data sublanguage lies in its descriptive ability, not its computing ability?

Abstraction, not implementation. His goal is to describe data, not to build a model of computation (which one would then be expected to prove equivalent to Turing machines).

It seems that some of the ideas presented in this paper are also found in object-oriented programming. There seem to be parallels between the idea of data encapsulation and interfaces, to reduce implementation dependencies, and the idea of using a relational view of data to avoid ordering dependencies, access path dependencies, and other implementation dependencies. Did the object-oriented paradigm influence the declarative one, or was it the other way around? Or were both views simply applying the same theoretical abstractions?

I think both are applying the same general principle: abstraction is your friend.

The paper says normalization is an advantage for storage (page 381, third paragraph). I don't know how to prove this to be true. From the employee example, I can see that the primary key man# is duplicated many times. Duplication doesn't sound like an advantage for storage.

The keys are typically integers, which don't add much overhead. I don't know enough about the previous designs to tell you how much overhead there was in storing subrelations as values.

Is there a systematic method for designing relations or database tables? For example, how does one decide what tables are needed, and how many, and which primary and foreign keys?

There are some clear ways to analyze problems to determine potential relations, but I expect that there is also some art (and therefore some experience) required.



Can you use a real problem to walk us through the process?

The paper walks you through a real example. If there's time on Friday, I'll try to walk you through another.

In the reading, on page 379 in the bottom right-hand paragraph, Codd first explains why, in a relation whose columns are labeled by the names of the corresponding domains, the ordering of columns matters. He gives an example of columns with the same name and why ordering is important there. However, Codd next says that several information systems fail to provide data representations for relations which have two or more identical domains. Why is that the case, and is he here referring to the case where the columns are unordered?

My guess is that the designers did not see natural problems that required multiple identical domains. The ordering does not seem all that relevant.

In what kinds of systems would the three kinds of data dependencies (i.e., ordering, indexing, and access path dependence) be similar or the same? And how, as Codd mentions on page 377 in the second-to-last paragraph, are they sometimes not clearly separable?

The three are different implementation techniques. They can be tied together, in that an implementation may use more than one, and they are interrelated.

What is the difference between a composition and a permutation? Aren't they both just projections?

A composition is a projection of a join. A permutation is a projection in which every column is preserved. So yes, they are both projections, but they are different kinds of projections.
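A small sketch of the distinction, using sets of tuples as relations; the particular relations here are my own invented examples.

```python
# Two invented binary relations sharing a middle domain.
R = {("a", 1), ("b", 2)}
S = {(1, "x"), (2, "y")}

# Join: triples matching R's right column against S's left column.
join_RS = {(r1, m, s2) for (r1, m) in R for (m2, s2) in S if m == m2}

# Composition: a projection of the join that drops the joined column.
composition = {(r1, s2) for (r1, _m, s2) in join_RS}

# Permutation: a projection that keeps every column, merely reordered.
R_permuted = {(right, left) for (left, right) in R}

print(sorted(composition))  # [('a', 'x'), ('b', 'y')]
print(sorted(R_permuted))   # [(1, 'a'), (2, 'b')]
```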

How are relationships between tables defined in SQL? I have worked with MySQL, but I used PHP to deal with the relationships, data dependencies, etc.

I'm not sure what you mean by "relationships". Typically, you indicate that one relation may be used in conjunction with another relation through foreign keys.
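A minimal sketch of my own of declaring such a relationship with a foreign key, run through Python's sqlite3 module; the table names echo the paper's employee example, but the schema is invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
con.executescript("""
    CREATE TABLE employee(man INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE jobhistory(
        man INTEGER REFERENCES employee(man),  -- the foreign key
        jobdate TEXT,
        title TEXT
    );
    INSERT INTO employee VALUES (1, 'Ada');
    INSERT INTO jobhistory VALUES (1, '1970', 'engineer');
""")

# With enforcement on, a jobhistory row for a nonexistent employee is rejected.
try:
    con.execute("INSERT INTO jobhistory VALUES (99, '1971', 'analyst')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The foreign key is what lets the DBMS know the two relations may be joined on man#, and it keeps the subordinate relation consistent with the primary one.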

Can Codd's relational model fit under a language paradigm outside of the declarative? It seems as though his ideas on "maximal independence between programs on the one hand and machine representation and organization of data on the other" fit perfectly under the declarative paradigm. It almost feels like Codd is creating a declarative language.

I think Codd was thinking about this relational model more in an imperative context (which is why he expresses it as a "sublanguage"). But yes, many computer scientists think of this as the creation of a declarative language.

Looking at the date of the article, it seems like a radical idea for 1970, when computers were slow and data probably wasn't stored in memory. Codd even seems like he is softening the blow at points, as when referring to the need for data independence as something needed in the "future".



I expect that "future" meant "near future". And yes, it took some time for his ideas to be accepted.

With the exception of the "indexing dependent" database, I don't understand how users would access things in these 'older' styles of database. The paper seems to gloss over that and assume we're already familiar.

It depends on the particular implementation, but, as I understand it, there was probably a decent amount of pointer chasing.

Because I don't know how users accessed these old databases, I don't really understand how relational databases differ. The first descriptions Codd offers seem to emphasize the underlying implementation (and some of the information that could be accessed by using knowledge of the implementation), whereas the rest of the paper focuses on theory. But since Codd doesn't give solid details for the implementation of this database, I don't see how it necessarily differs from the first examples. It seems that the earlier examples, with little to no additional abstraction, could be used in this relational way. That, or maybe I can't read.

You are correct. What he is suggesting is primarily that we need to separate more clearly the interface from the implementation.
