Talk:Database normalization/Archive 2

Is that really 3NF?
I'm not really an expert on the subject but from my understanding, I suspect there's an error in the BCNF section. There's an example provided in the BCNF section that is as follows...

BEGIN QUOTE

Example of a relation that is in 3NF form but not in BCNF:

Relation: {A,B,C,D} AB is a candidate key, BC is candidate key and A->C.

END QUOTE

For a relation to be in 3NF, it has to satisfy the conditions of 2NF too, and therefore 1NF.

In this example, AB is a candidate key, and A->C so there's partial dependency if that is chosen as the PK. BC is a candidate key, and C->A (since A->C) so there's partial dependency if that becomes the PK. Doesn't that violate the conditions for 2NF, and therefore the 3NF?

K nitin r 06:55, 10 July 2006 (UTC)

I guess the example isn't very clear. There's a better example available at:



K nitin r 07:20, 10 July 2006 (UTC)

I find this page very hard to understand...
...and I've been working with relational databases for years. I wonder why.


 * Could you be a little more specific? Remember, we are always seeking to improve the article, so if you could do a little thinking: what was hard to understand? &mdash; Ambush Commander (Talk) 03:15, 14 November 2005 (UTC)


 * I didn't really understand anything either. Now, I don't know much about databases at all, so maybe that's the reason. Honestly, I never understood the whole point of databases at all. People say normalization is one important reason for using databases, but this page didn't enlighten me much. --Apoc2400 02:47, 21 July 2006 (UTC)

I'm not sure I understood this, either, I'm pretty new to relational databases - only been working with them a few years, self-taught on mySQL, but I've had some experience with logic and symbolic logic. I can see how normalization helps the programmer - but it seems to me that normalization is more of a shortcut for programmers than a more AI heavy approach to actually analyzing and reconsidering the presented data might be. But I might be way off-base - I mean, I started learning with flat-files and personally "tweaking" each one - I could see how this could be cumbersome for larger data-sets, but normalization seems like it would lose some very important "facts" (or, in some cases, "opinions" or preferences of the person entering the data) coming in from the different data sources. I'll look into this further, and am limiting myself to only editing the article for readability. Perhaps one of the more experienced database engineers could steer me towards some methods of data reconciliation that doesn't involve removing redundancies, but makes use of them? gnomelock 06:21, 16 August 2006 (UTC)

How about this?
Look at the examples given on this page:

http://www.troubleshooters.com/littstip/ltnorm.html

Look at the examples, not the text. They are coherent. There are no examples of non-normalized within the normalized forms. Only one example per form. What do you think? I say do it.

Examples lacking/biased
I work with multi-valued databases (OpenQM, QM, uniVerse, UniDATA, mvBase, mvEnterprise, jBase, D3, etc.) and these examples are not good examples of how something as simple as an invoice would be organized. It looks like they were put together by some that doesn't use non-1st-normal forms, or has a Codd-like bias against them.

This is the first time I've encountered an article in Wikipedia that I can help fine-tune, so I'm really not sure of the best way to provide feedback, etc.

74.135.100.106 15:50, 7 February 2007 (UTC) Ray

5th Normal Form Subsection
Help!

I found the subsection on 5th normal form. Generally speaking, the subject is a welcome addition to complete the article. The contributor signed his addition in the article with

BY MAHESH M             MTECH -COMPUTER SCIENCE(BMSCE)

Mahesh M. seems to have supplied the makings of an understandable example &mdash; something involving students and classes &mdash; but the typesetting of the example data is bad.

This needs an editor to fix the typesetting, who clearly understands the 5th Normal Form. Unfortunately, I only understand enough to sort of get the idea that the contribution seems like it's good, if only the typesetting was clear. Tom Lougheed 18:30, 4 January 2006 (UTC)


 * The normal form subsection has been replaced, yet again, with a different set of information. Sigh... &mdash; Ambush Commander (Talk) 01:32, 8 January 2006 (UTC)


 * Yes, I replaced it because after reading the paper in the references and also my old textbook "Database System Concepts" by Abraham Silberschatz et. al., it wasn't clear to me that the person knew what he was talking about. Also, when I looked through the history right before he made his submission he wrote: "The information is not sufficient please any one update it.". This combined with the fact that the typesetting was bad, led me to think that he most likely had some school assignment involving the material, and either wasn't sure that his answer was right or didn't know what to put. Although some of the definitions looked consistent with "Database System Concepts" I guess I just felt it would be more confusing than enlightening. Jjjjjjjjjj 02:05, 8 January 2006 (UTC)


 * This section is wrong. Go ahead and compute any join of the three decomposed relation as the author suggests and you'll find that you do not recover the original relation: you will insert additional tuples that don't belog.  In fact, the original relation *is* in 5NF, since you cannot further decompose the relation without losing information.  I'll try to fix it at some point.


 * I've fixed this, I think. Hopefully the Psychiatrist example makes things more clear.  --Nabav 14:07, 15 August 2006 (UTC)

Style guide for this article
The manner in which tables are specified in this article number at about three. This is unacceptable. We should adopt a standard style for all tables in this article, and then switch them all to it. Although, as a developer, I would prefer working with SQL-style table definitions, I recognize that real tables with fake data help demonstrate the relationships better. In that case, would we use class="wikitable" to format those tables? Just throwing out some ideas. &mdash; Ambush Commander (Talk) 01:31, 8 January 2006 (UTC)


 * I have added cleanup to this article for this very reason. More discussion please! :-) &mdash; Ambush Commander (Talk) 21:55, 24 January 2006 (UTC)


 * Gonna be bold. &mdash; Edward Z. Yang (Talk) 01:02, 14 April 2006 (UTC)

5NF is a bad example
Kate knowing Classical Guitar and Jazz Violin does not imply in any way that Kate knows Jazz Guitar or Classical Violin. Somebody who understands this stuff better, please clean this up :) -- Vstarre 17:46, 7 March 2006 (UTC)

Hmm, I think that needs a completely new example. I will see what I can do to improve this. Jatos 11:41, 12 May 2006 (UTC)

Yep, try (james, piano, jazz) (james, trumpet, classical) (kate, trumpet, jazz) (kate, piano, classical). Join it back together and what do you get? A mess.

That's because in some respects, "jazz drumkit" and "rock violin" are still instruments. (e.g. a rock guitar and a classical guitar are two very different instruments.) --DavidHOzAu 11:24, 19 May 2006 (UTC)

I am only a student and I might be completely on the wrong track here, but possible alternatives are using an attribute for the instrument's type (wood, brass, strings, percussive, electronic) or shape (saxaphone, guitar, piano, drums). My reasoning is that once an instrument is a wooden instrument, it stays a wooden instrument. Also, learning to play a brass instrument is quite different to playing the drums, yet there isn't much difference between playing, say, a grand piano and an electric piano. Team this up with more than one instrument for a type, (e.g. violin, cello,) and the 5NF form will start to look much better. Please correct if I am wrong. --DavidHOzAu 11:24, 19 May 2006 (UTC)

Indeed, types of instruments such as brass, woodwind, strings would be better. The instrument - genre table is entirely bogus. --DavidConrad 21:23, 13 July 2006 (UTC)

-- This has the makings of a good example but I think goes slightly astray; for example it is not correct that “Joining these three tables together will return the original relation” (normalization is a reversible process). If the example were slightly different and the original table represented all permitted combinations, including the fact that Lois could play jazz piano, it would not be in fifth normal form and could be replaced by the three tables as shown. If the original table cannot be derived from (decomposed into) two or more simpler tables it is already in fifth normal form.

Assuming that the intention of the original table is to show actual combinations (not permitted), and is therefore already in fifth normal form, there is still value in creating separate tables to represent the constraints, but they do not allow the original table to be replaced. In fact these tables substitute for the permitted table that is otherwise concealed within (but derivable from) the table of actual occurrences (e.g. earlier it was possible to derive the fact that Lois could play jazz piano from: (a) Lois plays piano (b) piano is a jazz instrument (c) Lois plays jazz. There are two sets of multi-valued facts (actual and permitted) involving the same three values. To achieve fourth normal form they must be separated; to achieve fifth normal form the permitted table must be simplified into three pairs of tables!

There is a readable discussion of normalization in 'Data Modeling Essentials' by Graeme Simsion (http://www.simsion.com.au/)

Robertwilliamsau 08:10, 11 August 2006 (UTC)

--

[Hmmm. In the unnormalized table, Lois does NOT play Jazz Piano. But she appears to do so, in the normalized version.] User:206.11.112.251 16:59, 13 July 2006

[ Comment: As the comment above suggests, the three tables as presented do not accurately model the initial unnormalized table.

We can, however, "normalize" the initial table simply by interpreting it differently. Let's assume that we have three tables: a musician table, an instrument table and a genre table.

Each of these tables should have a primary key, which is generally numeric, but for this example assume that the primary keys in the musician table are "James", "Kate" and "Lois". The primary keys for the instrument table are "Piano", "Trumpet", "Drums", "Clarinet", "Saxophone", "Violin" and "Guitar". The primary keys for the genre table are "Classical", "Jazz", and "Rock".

With these assumptions, the above "unnormalized" table becomes a three-way many-to-many relationship table that is exactly the correct way to model the data. ] User:SteveHL 23:31, 8 August 2006

On reading the William Kent article and example, I think the problem with the example given is that 5NF only applies if there is a data rule such as "If person A plays instrument P and style Z, person A plays instrument I in style Z". This is (in real life) not true in the case of instruments and musical styles, which is why the example is unconvincing. The example given on the William Kent page talks about agents, vehicle manufacturers and vehicle types, but is also unconvincing in my opinion. I'm trying to think of a plausible example from a different situation so here are some suggestions, none of which I'm quite convinced about: The only ones of those where I think the rule is plausible are the last two, in particular the languages/books rule. But is it just a bit too artificial to have a table listing 'books people can read' with a separate entry for each person-language-book combination, to then normalise? --VinceBowdren 12:35, 9 August 2006 (UTC)
 * If engineer A knows the use of tool I and knows the maintenance of vehicle type P, then engineer A can use tool I to maintain vehicles of type P
 * If cook A knows the use of utensil I and knows the style of cooking P, then cook A can use utensil I to cook in style P.
 * If person A works for company I and has skill P, then person A uses skill P in the employ of company I.
 * If a company A makes product I and sells to market P, then company A sells product I in market P.
 * If a theme park A has customer I and ride P, then customer I can go on ride P.
 * If a person I knows language A and book P is in language A, person I can read book P.

I've altered the example to something I reckon is more realistic. --Nabav 17:30, 14 August 2006 (UTC)

Minimal Cover
Minimal Cover should be in mentioned   --Tim 17:45, 14 March 2006 (UTC)


 * Agreed, but if we go into minimal cover, then we need to cover F and X closures. No pun intended.  Railgun 18:05, 14 March 2006 (UTC)


 * The canonical (minimal) cover is not even mentioned in the functional dependency page, nor are closures. They would have to be elaborated there first. And because closures play a big role from BNCF and onward, it would make sense to explain them here as well. Then again, this is another reason to split the article because not all normal forms work with functional dependencies.--Citral 21:56, 14 August 2006 (UTC)

BCNF needs good example
Is it possible to put a practical/real example to understand BCNF? Bcos I'm finding it little difficult to understand this with just ABCDs! Many thanks for whoever does this.


 * I agree! The presentation lapses into shorthand that no-one besides an expert is likely to understand.   I am, however, curious; is this the relational algebra of which one hears but almost never sees?


 * In its earliest form, this article was completely mathematically oriented, an example: http://en.wikipedia.org/w/index.php?title=Database_normalization&oldid=8906405 I really couldn't tell you. &mdash; Edward Z. Yang (Talk) 21:20, 4 June 2006 (UTC)


 * An example has now been provided. --Nabav 14:14, 15 August 2006 (UTC)

No definition of multi-valued dependancies
The Fourth Normal form is described as being similar to the third normal form following the removal of multi-valued dependancies, which are not defined or explained at all.
 * I'm fixing that with the creation of the multivalued dependency page. Cheers, --Citral 12:22, 11 August 2006 (UTC)

expert?
Anyone know why this article has an "expert" tag on it? What was specifically in need of review? -- Mikeblas 15:01, 6 June 2006 (UTC)
 * While normal forms one through three are fairly sufficient for most needs, the examples for 4NF and 5NF are horrendous. That's primarily the reason. &mdash; Edward Z. Yang (Talk) 19:35, 6 June 2006 (UTC)

Normalization: not just for relational databases...or is it?
I've used the phrase "poorly normalized" as a way of characterizing all kinds of things (the duplication of content between wikipedia articles, for example). This article seems to be talking specifically about the relational model. This article spends its time talking about the formal characteristization of redundancy in relational databases. Do other database forms speak of "normalization"?

In any case, someone made a stub for "redundancy (database)" which I redirected here. That's based on the belief that there isn't enough to talk about on the subject of redundancy that isn't taken care of here or in the general definition of the term. Metaeducation 03:56, 25 October 2005 (UTC)


 * I have been a professional in database technology for 20+ yrs. Prior to there being a relational model I used normalization to design databases.  It was most beneficial when designing for a CODASYL database, but it was also helpful for designing a TOTAL or an IMS database.  It was even helpful when designing a set of VSAM files (which aren't managed by a DBMS at all).  Take a look at IDEF for a description of some of the history of data modeling was applied prior to the advent of relational technology. KeyStroke 04:36, 25 October 2005 (UTC)

If it is not relational how can it be normalized?
In this edit I removed:


 * &mdash;especially in systems that do not use the relational model (such as OLAP).

How can a system be normalized if it is not relational? Does this mean that non-relational systems (like OLAP) are denormalized to the extreme? For now, it has been removed: the abstract notion of the desirability of denormalization is good enough, methinks. &mdash; Ambush Commander (Talk) 21:52, 24 January 2006 (UTC)


 * Long prior to the acceptance of RDBMSs we in Data Processing (as it was called then) used normalization for various kinds of databases and even non-database systems. I, myself, have used normalization for two kinds of CODASYL databases as well as for two kinds of Hierarchical databases.  Normalization is a discipline for organizing data of any kind.  It does not have to be implemented into an RDBMS to be normalized. KeyStroke 17:54, 28 February 2006 (UTC))

2nd / 3rd NF
I'm revising for my exams on databases at the moment... and I'm sure that 2NF and 3NF are identical in this article - not the same as in my book... or am I just reading it all wrong? Paul 14:16, 17 June 2006 (UTC)

Yea, the formal definitions out of Elmasari [ISBN 0321122267] are easier to understand than this article. I think this is a perfect example of good brainstorming that is now in need of good editing. I'd like to see the normal forms defined both informally (for the non-techies) as well as their formal definitions. It's all there, just not organized. It's easy to say 2NF is no partial key dependencies and 3NF no transitive - but as we use to say, "that's the undergrad definition." Railgun 14:29, 17 June 2006 (UTC)

some more examples
You may find the examples here useful: http://www.robin-beaumont.co.uk/virtualclassroom/chap7/s7/index.htm (Link is now fixed)

Split the article
I am in favour of splitting each of the normal forms into a separate article. I think the separate articles should contain all the specifics and formal definitions, and we can maintain a pithy, abstract summary about each here. Any detractors? — Direvus 19:06, 6 August 2006 (UTC)

I agree: normal forms are a complex isssue that requires more in depth explanation of each. There might be a general article about normalization but there should also be a more profound article on each normal form. Relational databases are quite important these days and normalization is a big issue. --200.122.153.226 17:53, 12 August 2006 (UTC) (Crio de La Paz).

Is there any reason not to split this article? I think it's more than safe to go about splitting now. --Citral 22:01, 14 August 2006 (UTC)

I've started the split by moving the 5NF section out to a new article. --VinceBowdren 10:15, 15 August 2006 (UTC)

I support splitting each normal form as a seperate article as they are one of complex to understand yet very important topics in database designing. --Rahullahurikar 12:55, 2 September 2006 (UTC)

I too am for splitting each norm into separate articles.

I agree, great idea. 219.88.83.179 22:08, 6 October 2006 (UTC)

I strongly disagree, as most texts on normalization are unnecessarily wordy, and can be condensed as much as 80%. The main difficulty with normalization is teaching the underlying terms and concepts, such as modification anomalies, partial, transitive, functional, and multi-valued dependencies, determinents, candidate keys, etc. However, those terms should be linked to other articles, while Database normalization should retain the details on all normal forms, but without trying to explain the underlying terms. There are lots of Wikipedia articles much longer than required to adequately explain database normalization. See my entries on normal forms at Database. That's a quick approach, while the Database normalization should cover all normal forms and go into more depth on each. - Mugs 19:56, 6 March 2007 (UTC)

Mugs, the splitting into separate articles occurred some time ago; the discussion above, which ceased in October 2006, referred to an earlier incarnation of the Database Normalization article that was growing unwieldy. We resolved the problem by creating separate articles on First Normal Form, Second Normal Form, etc. so that each normal form could be explored in depth. In addition there are separate articles on Multivalued Dependency, Functional Dependency, and so on. But this isn't to say we don't attempt to give good descriptions of the normal forms (and the various dependencies, etc.) in the Database Normalization article too. We do; it's just that we do it fairly succinctly, and with links to the more specific articles in case the reader is after further details. --Nabav 16:57, 12 March 2007 (UTC)

Template and category for split articles
The Databases template is a bit big, and looks like overkill for the separate articles for 1NF, 2NF etc. How about we keep that template on the main article, but introduce a new one for the split articles, looking something like this:

Also a category called 'Database normalization' (probably a subcategory of the Database constraints category), to link them all together. --VinceBowdren 12:13, 16 August 2006 (UTC)


 * Done, see Template:Database normalization. I made the category a subcategory of the Databases category in the end, seemed more appropriate there. --VinceBowdren 19:07, 21 August 2006 (UTC)

Denormalization
Just happened across a separate article about Denormalization. I've now linked to it from here, so now we should get on with replacing the long description of denormalization here with a potted summary, and expanding the separate article. --VinceBowdren 18:57, 21 August 2006 (UTC)

Useful references
Just came across this. I haven't yet edited any of the articles as a result, so I'm just linking it here for now. --VinceBowdren 08:43, 22 August 2006 (UTC)

And another one: --VinceBowdren 08:52, 22 August 2006 (UTC)

minor edit
I changed the beginning bit from:

==Description=smita

to

Description
--Wakimakirolls 20:13, 14 October 2006 (UTC)

No defn of join dependencies
5NF seems to be defined in terms of join dependencies, which we don't have a definition of yet. --VinceBowdren 10:19, 15 August 2006 (UTC)

Generally Speaking
This article seems pretty straightforward. This type of information will only be understood by the technical database using community. Therefore, it summarizes Normalization quit adequately.

However if a non technical database individual accesses this page, then he or she will not know what is going on. However who that isn't into databases already would access this page. It would require more detailed information to be accessible by non technical individuals.

I believe there has been a suggestion to split each NF into a different page. This might be fine...it just about the granularity of the explaination. For my purposes, the current granularity is fine.

If someone has time and willingness to draw out examples and table diagrams; that would help.

But as I stated, for my purposes, this article doesn't need expert help.

References to 'the primary key' should refer to 'candidate keys' or 'a primary key'
In the summaries (on the page for which this is the discussion) and main articles on normal forms, there is frequent reference to the primary key. I think this is too restrictive and also confusing.

As an example, the summary of second normal form refers to a composite key then later in the sentence refers to the composite key, leaving it unclear whether second normal form requires that, should a relation have more than one composite key, its constraints are required to be satisfied for all of them or whether satisfying it for just one will do.

Also, the main article for second normal form almost immediately and exclusivey refers to the primary key, suggesting that second normal form is only interested in either (a) relations with a single composite key, or (b) relations with more than one composite key, but where one has already been chosen as 'primary'.

It is my understanding, which I need to recover my copy of TTM in order to confirm or correct, that if a relation is in first normal form, it may have several keys, some of which may be composite. In order for the relation to be in second normal form, the constraints of second normal form must hold for every composite key. The same is true for higher normal forms.

The use of primary key is generally confusing actually, as its meaning has changed since Codd used it in 1970; it appears that when Codd used the term primary key, he was referring to any non-redundant combination of a relation's domains that uniquely identifies tuples in the relation and that it was possible for there to be more than one such combination. Each of these combinations can be referred to as a primary key, once of which might be selected and referred to as the primary key. Hence the use of the indefinite or definite article becomes all important - far too subtle.

Lancewalton 16:27, 26 December 2006 (UTC)

Added more context to the introduction
There was a request to add more context. I took a stab at it. It needs to be edited to flow a little better now. Wilsondavidc 00:12, 11 January 2007 (UTC)

Good work, Wilsondavidc. Your first three sentences provide the lay person an excellent entry to the subject, and inspired me to work on that middle section of the introduction. The introduction section is now 5 paragraphs, a little long, but I think the length is justified given the difficulty in making the subject approachable.

I have structured the introduction in a series, each paragraph diving a little deeper, till the last paragraph mentions the relationship of higher -vs- lower normal forms (courtesy of Wilsondavidc).

Please have a look and let us know your comments or ideas. Do you think we can remove the insufficient context tag, or does it need more work?

-- Matthew 1130 14:13, 11 January 2007 (UTC)

Thanks Matthew! I tweaked the last sentence of the first paragraph to distinguish that relation db's are one subset of digital record storage and normalization applies to it. Is data quality the primary purpose of normalization? I would have guessed that transaction speed and data compression were the first benefits and quality is a primary advantage. I don't know where I got that impression. Wilsondavidc 15:53, 11 January 2007 (UTC)

Hi Wilsondavidc,

I had another crack at that third sentence, trying to improve the wording. You're welcome to change it some more if you can see another way to improve it.

Is data quality the primary purpose of normalization? For the sake of simplicity in the introduction, I've taken the most outstanding benefit of normalization today, and called it the purpose. In reality, people have as many purposes for normalization as there are benefits, which is something I'd like to expand upon further down in the article, but all that would be far too much to include in the introduction (refer Guide to layout).

Many years back, transaction speed and data compression probably were significant benefits of normalization, but this is becoming much less of a focus with the advent of larger storage and faster comms. System performance issues nowadays tend to be far more esoteric, not just about how to cram more data into the same number of bytes. A couple relevant examples are relational databases and XML, which frequently are rather space inefficient in their native format, but which yield key advantages to justify their use.

If saving space is the primary goal, using a compression algorythm is by far the cheapest approach, allowing all that redundancy to be tucked neatly away, without any need to think about it at all. Files with more redundant data have a correspondingly high compression ratio.

-- Matthew 1130 13:26, 12 January 2007 (UTC)

PS. Note that I'm also working on the related article Entity-relationship model if you're interested to be involved there too.

Actually speed is not typically a benefit of normalization; on the contrary, people frequently DE-normalize for performance reasons. There is a widespread sense that fully normalized databases are more likely to be slow. Now, I don't agree wholeheartedly with this: the picture is more complicated than most people realize (see the brief but excellent Wikipedia article on denormalization). But what is certain is that the reason we normalize tables is not to increase transaction speed or anything like that.

Normalization is a logical procedure that is concerned with guaranteeing that a table will not fall victim to certain types of inconsistencies and anomalies. The physical implications of implementing normalized tables (performance is degraded or performance improves; space requirements increase or space requirements decrease - any of these things may happen) are mere side-effects; they have nothing particularly to do with the purpose of normalization, which is to guarantee logical soundness as much as possible.

I've tried to get this point across in the revised introduction.

--Nabav 14:31, 13 February 2007 (UTC)