A Little Database Theory - Part 1

The other day I came across a forum on the web and spotted a post by someone who wanted to know the difference between relational calculus and relational algebra. As happens so often on the web, all of the responses were written by people who had virtually no idea what they were talking about. So, having a couple of hours to kill I thought I’d give it a go.

Almost everyone has come across the term “relational database” and probably to many people “relational databases” and “databases” are the same thing. They’re not, but the relational model has become so popular that non-relational databases have (except in very specialist situations) pretty much disappeared.

I thought I’d split this article into 3 separate posts:

What did databases look like before the relational model?
What does “relational model” actually mean?
A comparison between relational calculus and relational algebra

Let’s have a look at what the e-Quip database might have looked like before the Ted Codd presented his ground-breaking paper in 1969. Imagine that you have a book and you have decided to have one page per device. At the front of the book is an index which tells you the page number for a particular device. To get information about Equipment No 12345 you would look in the index and see that this device record can be found on page 253. You then quickly flick through to that page and Bob’s your uncle, you can see all the information you need.

If the book becomes a computer disk file then there were, and still are, plenty of ways of maintaining an index and quickly getting to the right part of the file based on the value from the index. Perhaps the most common was known as ISAM, or Indexed, Sequential Access Method. This meant that you used the index to get to the right place and then you read each character sequentially from that point onward.

If you’re in charge of the book you might come up with some rules, like:

The first 10 characters on line 1 are the Equipment No
The next 10 characters on line 1 are the Serial No
The 2nd line holds the model name and description
The first 10 characters on the 3rd line hold the purchase date in the format dd/mm/yyyy
The next 10 characters on the 3rd line hold the installation date in the same format
…

So far so good(-ish). How about jobs? You might decide to have 2 lines per job, starting from line 20, which might give you enough space for 20 jobs.

isam

I have actually worked on plenty of applications like this. The biggest problems that we had were a) the dates were in dd/mm/yy format (oops! what’s going to happen the day after 31/12/99?) and b) nobody ever got round to writing down the rules so if we ever needed to add anything we just picked a bit of space somewhere on the “page” that looked vacant and used that. Unfortunately, after a few years the vacant spaces get few and far between. When space gets really tight things get difficult. Suppose we need to add another date (e.g. Replacement Date) but there aren’t 10 unused spaces in the file. There might be 3 spaces available on line 3, one at the start of line 4 and another 2 at the end of line 4, and 4 in the middle of line 7. To read this date the programmer needs to read data from 4 different places in the file, join them together in memory and then interpret the whole as a date. Writing a date is even more complex. The programmer would start with “01/12/2015” and have to split this into 4 separate pieces (“01/”, “1”, “2/” & “2015”) then write each piece to the appropriate place in the file.

What happens when you finally run out of space on the page? Suppose a device has 21 jobs. You can’t just write this 21st job underneath the 20th because you will be overwriting the next equipment record. Well, there are things you can do, like start another book (with its own index) just for all the additional information do but it all gets complicated and the programmers have to handle the complexity.

Searching for things was very difficult. While you could easily find the right page for a particular device, how do you find all jobs completed last month? You start at page 1 and look through every line (from line 20 onward), then look at the date in columns 10 – 20. If that date is in last month, then add the Job No to your list. Then turn to page 2, and keep doing this until you have read the whole file. You could get clever and create more indexes at the front of the book but this all added to the complexity.

It’s not really surprising that these databases were often processed in batch. i.e. once a month you run a program which produces a whole load of printed reports. If you want to know how many jobs there were last month, then get last month’s report off the shelf and read it. Interactive software applications where you could ask any question any time you liked were not that common.

What do you do if some eejit writes the wrong thing in the wrong space? You have positioned the file at a place where you are expecting to see a date, but instead of “21/01/2015” you find “HELLO”. Just letting your program collapse in a heap wouldn’t be a good idea, although plenty did. You ended up having to write complex code to do simple things, just to take care of possible errors.

Another problem is that the file (i.e. the virtual equipment book) is completely separate from the application. A file is just a file – any programmer can write whatever they like to it, regardless of the format that you have specified. To keep the contents of the file meaningful you have to rely on every program that accesses it a) knowing the rules and b) sticking to the rules. Even if everyone does this, programs will inevitably have bugs. The programmer might have intended to write “Graseby MS26 Syringe Driver” starting from the 1st character on the 2nd line of the particular equipment record, but programmers make mistakes, especially when keeping track of where they are in a large, complicated file. The end result is that it is hugely difficult to prevent the file from being corrupted.

As an aside, I once worked on a database which managed the bonuses paid to travel agents depending on how quickly they submitted details of the travellers’ cheques that they had sold. The database was exactly like the ISAM file described above. In order to handle errors, the standard technique was:

1000 IF something goes wrong

1010 THEN LET X = 1/0

All of the applications were written Data General Business BASIC (which conveniently had ISAM built into the language). If you’ve ever used BASIC you’ll know that it is an interpreted language. The interpreter reads a line, translates it into machine code then executes it. It then moves on to the next line and does the same again. If something goes wrong the interpreter just stops and displays a message, then sits and waits for someone to tell it what to do. LET X = 1/0 is going to cause an error, since you can’t divide anything by 0. The interpreter will display something like “Divide by zero error at line 1010” then just sit and wait.

The computer operator could then call the programmers, who could then a) sit in front of the computer, b) find out what has gone wrong, c) fix it, then d) tell the interpreter to carry on.

From a programmer’s perspective this is awful. Every single database application is different, even in the same organisation. You might be an expert on one system and not have the slightest idea about another. Codd’s paper describing his new relational model, published in 1969, revolutionised the data processing world. Data processing was now based on mathematics. There were now rules which everyone knew and understood. That’s why virtually all databases that you see are relational and why the old-style ISAM (Indexed Sequential Access Method) databases disappeared.

My next post will look at what we actually mean by “relational model”.

A Little Database Theory – Part 1

Submit a Comment Cancel reply

Recent Posts

A Little Database Theory – Part 1

Submit a Comment Cancel reply

Recent Posts

Tag Cloud