Tuesday, November 27, 2012

Data processing... 70's style

noSQL is the new trend... Seriously?

It is really amazing to see that in a supposedly scientific/engineering discipline such as data processing (or what it should be) we find trends we could perfectly call "fashion" phenomena. Things that arrive in full force and disappear without a trace. Today's hot development platform will be forgotten (and badmouthed) tomorrow. I don't think I need to name an example of any of those. Especially in the field of "modern" java frameworks and environments. And meanwhile, in datacenters spread all over the world, we can find lots and lots of lovely handcrafted COBOL and PL/I code doing its job silently and faithfully, while a big part of the world economy relies on software built without all this modern fuss about patterns, architectures and... fancy trends...

Well, I guess the previous rant puts me definitely in the Grumpy Old Fart team. Not that I'm really that old (I am "just" 48 years old), but, to put it shortly, I'm really p*ss*d off by the younglings who try to teach me my job, day after day. Damn! I was crunching files when they were wearing diapers! :) Most of those guys are unable to code a master-transaction program without using 2 gigabytes of framework code, a lot of XML descriptors, a fancy GUI and a Nespresso machine! Well, forget about the Nespresso, I use one of those myself :). And, to be honest, I am not against the modern frameworks, which save a lot of time and a lot of boilerplate code... I can write java and I consider myself quite good at it, and I do know some of those modern frameworks and I actually find some of them really brilliant and smart. But, sometimes, the old fart in me has to rebel and yell out. Enter noSQL, the latest trend in database management.

Apparently some smart guys have noticed that the SQL layer adds a (sometimes) unnecessary overhead to a data processing application. And that the integrity safeguards imposed by SQL databases (not a good choice of words... relational databases would be more precise) impair scalability when an application reaches the multi-terabyte level. Then those smart guys had the really brilliant thought of getting rid not only of the SQL layer, but of the relational model itself, and going to a pure, non-constrained key-value pair model...

At that point my semi-obsolete neurons start to send signals to my visual cortex, and I can almost see, in green-over-black letters, the words ORGANIZATION IS INDEXED, ACCESS IS RANDOM. It looks like the smart java guys have rediscovered indexed files! Of course, they are not going to call their amazing discovery an "indexed file". It sounds too mainframish. They will use fancy words and fancy names for that old concept of tying a record... Ooops... I mean... an OBJECT to a key.
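The parallel is easy to sketch. Here is a minimal Python illustration (the record layout and keys are invented for the example, not taken from any real system) of the whole revolutionary idea: a store where a key gets you a value, which is exactly what ORGANIZATION IS INDEXED, ACCESS IS RANDOM gave you in COBOL:

```python
# A toy "indexed file": records reachable directly by key.
# The layout (customer id -> customer record) is made up for the example.
records = {}

def write_record(key, record):
    records[key] = record          # a real WRITE ... INVALID KEY would catch duplicates

def read_record(key):
    return records.get(key)        # READ ... KEY IS; None plays the INVALID KEY role

write_record("C001", {"name": "ACME", "balance": 120.0})
write_record("C002", {"name": "Widgets Ltd", "balance": 75.5})

assert read_record("C001")["name"] == "ACME"
assert read_record("C999") is None
```

Swap the dict for an on-disk B-tree and you have, roughly, both a 1970s indexed file and a 2012 key-value store.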

Now I will try a prescience exercise. I predict that some of those brilliant guys will realize, sometime in the next few years, that all those nice objects described by key-value pairs have... oh, I'm feeling smart now... RELATIONSHIPS between them. And that those relationships can be modeled and integrated into the data storage. Now that is a revolutionary concept :) The only problem is that this revolution happened 40 years ago. Before the Codd model went mainstream, there were at least two other database models. And, what is even more revolutionary, those database models are still widely used today...

Pre-relational database models: hierarchies and networks

You could build any information system using just plain old sequential files... but you don't want to do that. You could also do it using pure indexed files, and for very simple data models that might even be a good choice. But if your data model gets complex (let's say... three or more files) your data management code will explode in a burst of tedious file and record management routines. Database Management Systems (DBMS) were designed to help with this complexity. At a very basic level, a DBMS helps the programmer to:

  • Coordinate the changes between several data sets (record types), ensuring consistency between them.
  • Isolate the programmer from the physical design of the data storage. 
  • Protect the information, providing utilities to back up and restore the information to a consistent state.
  • Provide the developers with views of the data restricted to their needs.
The first available DBMSs followed two models:
  • A hierarchical model, in which the different record types are related via "parent-child" relationships (for instance, a machine to its parts).
  • A network model, in which the different record types can be related via arbitrary relationships, not necessarily hierarchical.
The network model can be seen as a generalization of the hierarchical model, and the hierarchical model as a restriction of the network one.
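The difference between the two models fits in a few lines of Python (the record types and names are invented for the example). In a hierarchy every record hangs from exactly one parent; in a network the same record can be a member of several arbitrary relationships:

```python
# Hierarchical: each part hangs from exactly one machine (a single parent).
machine = {"name": "PUMP-1", "parts": [{"name": "SEAL"}, {"name": "ROTOR"}]}

# Network: the same part record can be a member of several relationships
# (here, a supplier relationship and a warehouse relationship).
seal = {"name": "SEAL"}
supplier = {"name": "GASKETS-INC", "supplies": [seal]}
warehouse = {"name": "BAY-7", "stores": [seal]}

# One record, reachable through two different owners:
assert supplier["supplies"][0] is warehouse["stores"][0]
```

Restrict every record to a single owner and the network collapses into a hierarchy, which is exactly the generalization/restriction relationship described above.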

Take my data to the moon

The best-known implementation of the hierarchical model is IBM IMS/DB. IMS is not just a database system; it is also a transaction processing monitor. It runs on IBM zSeries mainframes, and it is widely used today in the banking and insurance industries, as well as in government-related information management. It is insanely powerful, and eats heavy transaction loads like a kid devours candy. I would love to introduce my readers to IMS, but unfortunately there is no legal way to run IMS at home unless you have an indecent amount of cash. And, of course, IMS design and programming is quite ugly... That does not mean it is not fun. Actually, the restrictions of the hierarchical model force the designer to think twice (and thrice!) about the design, and force the programmer to get a deep knowledge of the problem domain he is working in. Oh, and it is noSQL :). The basic method to access IMS data is to issue a 'GU' call, which returns the record ("segment" in IMS tongue) that satisfies an "SSA" (Segment Search Argument), which is basically a key search expression. So IMS gives you values associated to keys. Key-value pairs... with the add-on of chains of "child" segments physically attached to the main (or "root") segments. Oh, by the way, the origin of IMS was the need to keep track of the parts of one of the most complex machines ever built by mankind: the Apollo spacecraft. So, each time you use an ATM to get some cash you are probably using a piece of technology born to help put astronauts on the Moon!
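For readers who have never touched IMS, here is a very loose Python sketch of what that GU call feels like. The database content, segment names and the (grossly simplified) SSA handling are all invented for the illustration; real DL/I calls take PCBs, status codes and full SSA syntax:

```python
# Root segments indexed by key; each root physically "owns" its child segments,
# so retrieving the root gives you the whole chain.
database = {
    "ORD001": {"customer": "ACME", "ITEMS": [{"part": "BOLT", "qty": 100}]},
}

def gu(ssa_key):
    """Get Unique: return the root segment matching the (simplified) SSA."""
    return database.get(ssa_key)   # a real call would set a status code instead

order = gu("ORD001")
assert order["customer"] == "ACME"
assert order["ITEMS"][0]["part"] == "BOLT"   # children travel with the root
```

A key in, a value (plus its attached children) out: the noSQL access pattern, vintage 1968.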


Untangle the network

The network model allows more flexibility than the hierarchical one. Actually, it may look similar to the better-known relational model. The main difference is that in the relational model there are (mostly) no physical relationships between record types (relations, or tables); the relationships are built at run time using joins based on foreign keys. In a network database, the relationships do exist in the database, usually as physical pointers relating the different record types.
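The contrast is easy to show in Python (the tables and record types are invented for the example). The relational style computes the relationship at run time; the network style simply follows a pointer that is already stored:

```python
# Relational style: the relationship is computed at run time with a join
# over a foreign key.
artists = [{"id": 1, "name": "Pink Floyd"}]
albums  = [{"artist_id": 1, "title": "Animals"}]
joined = [(ar["name"], al["title"])
          for ar in artists for al in albums
          if al["artist_id"] == ar["id"]]

# Network style: the relationship is stored in the database as pointers.
floyd = {"name": "Pink Floyd", "albums": []}
animals = {"title": "Animals", "artist": floyd}   # pointer to the owner
floyd["albums"].append(animals)                    # pointer to the member

assert joined == [("Pink Floyd", "Animals")]
assert floyd["albums"][0]["artist"]["name"] == "Pink Floyd"
```

The stored pointer makes navigation fast and cheap, at the cost of freezing the relationships into the physical design, which is precisely the trade-off the relational model was invented to escape.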

The network model databases were standardized in the late sixties and early seventies by the CODASYL committee. Network databases are also known as CODASYL databases, and that is the name we are going to use from now on.

A CODASYL database is described by a SCHEMA definition. The schema definition contains:
  • The physical characteristics of the database, like the devices it uses, the areas (or parts) it has, the size of those areas in pages, and the size of the pages in bytes or words.
  • The different record types present in the database. For each record type the designer specifies the details about its physical placement, the retrieval and location methods (direct, by physical address, or using hashed keys) and the record structure (the composition of each record in fields). In contrast to the more formal relational model, a CODASYL record can contain arrays and "unstructured" byte bags.
  • The relationships between records, named "sets" in the CODASYL nomenclature. For each set, the designer specifies the "owner" and the "members". The owner and the members are joined using pointers, and the designer has some degree of control over which pointers to use. The most basic structure has a pointer in the "owner" to the first "member", and pointers relating each member to the one following it in a linked list. That linked list can be enhanced using backwards pointers (making it a doubly linked list) and pointers to the owner.
  • One or more "subschemas", which are subsets of the whole schema that can be used by the application programs. The designer/administrator has some (weak) tools to restrict the subschemas visible to the programmers; this allows him to hide the salary data of a personnel database from the people doing work not related to payroll. I think you get the idea.
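The pointer structure of a set is easy to model in Python (the class and field names are invented for the illustration): the owner points to its first member and each member points to the next, which is exactly what the NEXT, PRIOR and OWNER pointer options in the DDL buy you:

```python
# A CODASYL-style set: owner -> first member -> next member -> ... (a linked list).
class Record:
    def __init__(self, name):
        self.name = name
        self.first = None   # owner's pointer to the first member (NEXT in the DDL)
        self.next = None    # member's pointer to the following member
        self.owner = None   # optional OWNER pointer back to the set owner

def connect(owner, member):
    """Insert a member at the head of the owner's chain (CONNECT, roughly)."""
    member.next = owner.first
    member.owner = owner
    owner.first = member

lp = Record("ANIMALS")
for title in ["Dogs", "Pigs", "Sheep"]:
    connect(lp, Record(title))

# Walking the set, a bit like OBTAIN NEXT WITHIN SET:
tracks, m = [], lp.first
while m:
    tracks.append(m.name)
    m = m.next
assert tracks == ["Sheep", "Pigs", "Dogs"]   # head insertion reverses the order
assert lp.first.owner is lp
```

Add a `prior` field and you have the doubly linked variant mentioned above.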
The best-known CODASYL database is probably IDMS, owned by Computer Associates. It is still in use nowadays. We cannot use IDMS legally (as far as I know), but we can use one of its derivatives. DEC licensed IDMS for use on its PDP-10 mainframes, and sold it as DBMS-10 and DBMS-20. And, guess what? We can run those under SIMH!

A little bit of time travel

We will need some material to set up our retro-database management experiment.

If you are like me and decided to do the full install by yourself, you will find how to do it here. You will not find how to install COBOL and DBMS there, but you can look at the BASIC and FORTRAN instructions. The TOPS-10 installations are basically manual: you restore a save set to a working directory and then move the files by hand to the SYS: and HLP: directories. Piece of cake! Oh, I recommend installing EDT unless you want to add learning an editor to your pdp10 adventure. You will find EDT and other goodies in this tape.

Once you have the TOPS-10 system up and running with a working COBOL compiler and an installed DBMS-10 database, you will need to do some magic to add DBMS support to COBOL. Just follow the docs here and here. It is not really complex. Basically, restore the full DBMS tape into a working directory, copy C74LIB.REL and C74SHR.REL to this directory and submit a file to rebuild DBMS. When it is done, copy back both files AND C74O12.EXE to SYS: and you will be ready. If you get stuck, feel free to post a comment and I will try to help.

The real stuff: managing your LPs (not CDs, please) 70's style

I have written some code as an example of what a classic data management application could look like. You can find and download the code from this github repository. The files you will find are:
  • The schema file, RECRDS.DDL, which defines a database with two record types and one set:
    • LP-RECORD holds information about LPs (yes, those big circular black pieces of vinyl)
    • TRACK-RECORD holds information about the tracks of a record.
    • LP-SET relates a set of tracks to a record
  • REC001.CBL, a COBOL program to load the database from a flat, sequential file
  • REC002.CBL, a COBOL program to maintain the database using transactions from a sequential file
  • REC003.CBL, a COBOL program to empty the database.
  • LP.CBL, TRACK.CBL and TRANS.CBL: "copybook" files with the input records.
  • COMP.MIC, a command procedure to compile and link the above programs.
  • RECA01.DAT, a sample input file for REC001.
  • RECA02.DAT, a sample input file for REC002.
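REC002 follows the classic batch master-transaction pattern. A condensed Python sketch of the idea (the transaction codes, keys and fields here are invented for the illustration, not the repository's actual record formats):

```python
# Classic batch update: apply a transaction file against a keyed master.
master = {"LP001": {"title": "Animals"}, "LP002": {"title": "Wish You Were Here"}}

transactions = [
    ("A", "LP003", {"title": "The Wall"}),   # add a new LP
    ("D", "LP002", None),                    # delete an LP
    ("U", "LP001", {"title": "ANIMALS"}),    # update an existing LP
]

for code, key, data in transactions:
    if code == "A" and key not in master:
        master[key] = data
    elif code == "D" and key in master:
        del master[key]
    elif code == "U" and key in master:
        master[key].update(data)
    # a real program would report rejected transactions on an error listing

assert sorted(master) == ["LP001", "LP003"]
assert master["LP001"]["title"] == "ANIMALS"
```

The COBOL-74 version does the same thing with a READ loop over RECA02.DAT and DBMS STORE/MODIFY/DELETE calls instead of dict operations.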
By the way, the user executing this code must have the ENQ-DEQ privilege. You must use REACT to add it, or just use the OPERATOR account (which, in a real production system, would be anathema, of course).

Oh, remember this is COBOL-74. That means there are no scoped statements. No END-IF. No END-PERFORM. No inline PERFORM. You get the idea...

Last details

If you want to try to build the programs you will need to create a COBOL text library using LP.CBL, TRACK.CBL and TRANS.CBL. This is the recipe:

. R LIBARY
*LIBARY=LIBARY
*INSERT LP,LP.CBL
*INSERT TRACK,TRACK.CBL
*INSERT TRANS,TRANS.CBL
*END
*^C

You will probably want to read the docs.

And this is all for now. This is a different kind of post, leaving the system-level discussion aside for a while. Enjoy data processing, oldies style!