Agile Database Techniques

The goal of data architecture is to define the entities that are relevant to the enterprise and to avoid designing the logical and physical storage systems. At this point, you may be thinking about how to apply several of the topics discussed within this chapter for your database administrators. We feel that a slight deviation is warranted here. We will discuss techniques that database administrators (DBAs) may use to become more agile and to empower development teams.

For additional reading, see Agile Modeling: Effective Practices for Extreme Programming and the Unified Process by Scott Ambler and Ron Jeffries.

The traditional database management approach has multiple problems, including but not limited to the following:

Iterations of the data model or database are not allowed.
Changes to the production environment are rare or not allowed.
Migrating from one data model to another is a project in and of itself.
There is usually only one development, QA, or other such database.
The database administrator is a roadblock.

When considering how important databases are to the development of applications, one finds multiple opportunities for taking advantage of agile principles to save time and speed development. Now let's look at some agile principles we can apply to this discipline and then show how they can be realized.

Applying the Agile Approach

The agile approach to database management acknowledges that functionality in any undertaking should be done in increments. This is the first challenge to the previous bullet points. The agile approach is an advocate of automated testing. If one can have automated procedures for testing, one could potentially consider having more than one database for each release level (QA, development, etc.). If we could solve for this, both developers and QA could develop and test not only independently of each other but they could test different versions of future functionality. Also, the agile approach acknowledges the importance of team interaction. This challenges the notion that database administrators have to do everything related to databases, which sometimes causes developers to wait around while twiddling their thumbs.

Over time, both applications and databases grow more complex and undergo frequent rounds of refactoring. It becomes very important for architects to focus on the flexibility of the database. If an organization has an application in which business users can create complex scenarios within a half hour and run acceptance tests immediately, the importance of being agile is reduced. However, if the testing process requires intervention by multiple parties and running test scripts (even automated ones) for hours, if not days, then an enterprise can benefit by using agile approaches.

In an agile database approach, everyone gets his or her own copy of the database. A database should be thought of no differently than an application. In many Java shops, all developers are equipped with an application server on their desktops. The same should hold true for databases. This allows developers to work independently of other developers. Developers who understand how databases work (the vast majority) can have their own copies of the database so they can try out various changes to the database structure without involving the DBA. This allows developers to use the database as a scratch pad.

To learn about sandbox diagrams, see www.agiledata.org/essays/tools.html.

If each person were to have an individual copy of the database, matters could get out of hand quickly unless scripts (automation) allowed everyone to recreate a clean database at will. This illustrates a concept that can be manifested differently, depending upon the database in use (e.g., Oracle, MySQL, Pick, etc.).

Database administrators may ask "What's in it for me?" The answer is simple. In the agile approach, DBAs will receive fewer requests for work, allowing them to concentrate on what is truly important, which is maintaining the quality of the data and the structure of the production database. DBAs can maintain specialized data sets for each context (development, load testing, demos, etc.) that further empower the development team. By allowing each person to have a copy of their own database, developers can independently test incremental changes that allow for the migration of older data sets to newer versions of the application. For organizations that employ a release management process, the database can be thought of simply as just another component within the application and can be assigned a build number and context. In an agile approach, the DBA has more of a stewardship role toward databases instead of worker bee.

To make this work, several procedures must be put into place. First, modifications to the schema and setup data must be logged chronologically and the version controlled using the same manner as the application. By using version control practices, any member of the team can simply check out the appropriate scripts needed to support the version level they require and can execute them to get the proper version of the database. Since the schemas are within version control, this can happen on demand without DBA intervention. This same version control mechanism will allow old database instances to be rolled forward simply by applying the logged changes that were made to the schema and setup data.

In an agile database, everything starts with the notion of a database instance. Database instances are equivalent to application instances because they are a working copy of the system at a point in time. The vast majority of large projects use some form of source control (e.g., PVCS, Sourcesafe, Clearcase, etc.) that becomes the system of record for all project assets. For example, if the Canaxia development team were currently working on the sixth release of its enterprise customer relationship management system but expressed a desire to analyze the code base as of the fourth release, one would simply use the source control product and check out the code as of that release (Figure 11-7).

Figure 11-7. Development timeline.

graphics/11fig07.gif

As we can see, the development team is working on the current release of software (version 6.0). When this build of software has been signed off on, exactly one version of the code base will be present at the end of the iteration. Part of this iteration includes an instance of the database that is comprised of the database schema and setup data. Since Canaxia uses scripts to create its database schemas and maintain setup data, the development team is able to recreate a functional database instance at any point in time and in the same manner as code.

Table 11-4. Database iterations.
Version/Iteration	Function
4.0 Iteration 1	DBA makes a copy of certified schema from previous version and loads the database with programmatically generated data.
4.0 Iteration 2	DBA creates a new data set to allow for automated regression testing.
4.0 Iteration 3	DBA creates a new data set to allow for acceptance testing.
4.0 Iteration 4	Version has been certified. DBA creates production database.

The version control should also introduce the notion of context. Context represents the notion that multiple parties may require different sets of data loaded into the database for varying reasons, such as development, regression testing, acceptance testing, and production deployment. A development timeline is two-dimensional in that releases occur in order. Context establishes a third dimension that handles the usage of data within a development timeline. Table 11-4 presents a timeline to further explain this point.

Based on this scenario, you may have noted that each iteration had only one version of code data sets but multiple lineages (regression and acceptance testing). The schemas were the same throughout the iterations, yet the data sets could contain radically different data. For example, the development data set could contain only employees with the first name of Ernie. The acceptance data set could contain only data required to fulfill test cases that require preexisting data. Likewise, production databases created in the last iteration would be sizable and could require data conversion.

A lineage is a preestablished context that is tracked across time and inherits its attributes from its ancestors.

It is important to track the changes in database lineages in a manner different from what is done for typical code. Ideally, lineages should be maintained using scripts and storing them using version control. Sometimes, using lineages is not possible, so you may have to revert to snapshots. In any case, a lineage includes a master instance, a change log, an update list, and a collection of child instances that were derived from it. The first lineage of any application is the lineage that supports the application itself. Ideally, a new lineage is created every time an application branches or whenever the team requires a unique instance or specialized data set.

Working with Scripts

Over time, the quantity of scripts required to create a database can become excessive. In such a situation, we recommend periodically creating new master scripts that essentially start a new lineage. This is referred to as the master lineage. The master should contain all changes to schemas and setup data from prior lineages.

Let's look at a database script and show the information that we will record within it.

 /* **    SUPPLIER.SQL **    Creates table for supplier and signon. **    Copyright 2003, Canaxia.      All rights reserved. ** **    Build: 2226 **    Release: 4.0 **    Date: September 10, 2001 **    Author: Little James */ use CRMDB ; create table supplier (       suppid int not null,       name varchar(80) null,       status char(2) not null,       addr1 varchar(80) null,       addr2 varchar(80) null,       city varchar(80) null,       state varchar(80) null,       zip char(5) null,       phone varchar(80) null,       constraint pk_supplier primary key (suppid) ) ; create table signon (       username varchar(25) not null,       password varchar(25) not null,       constraint pk_signon primary key (username) ) ;

Scripts should be employed that handle the following tasks:

Copy last certified release of database locally.
Create a local copy of database specifying a version number.
Create a new baseline and assign it a version number.
Perform differential of local database versus last certified release.
Perform differential of local database versus baseline.

Having the ability to use version control for databases will allow an organization to do the following:

Conduct acceptance testing against a copy of current production data using a modified schema.
Allow production databases to be migrated to a future state of development.
Allow for changes to production databases to happen on a weekly or even daily basis since it is very easy to do so.

Normalization

Another important step toward having agile databases is not to be savage in the pursuit of normalization. In decision support applications, a star schema is typically used. Star schemas usually require fact tables to be third normal form and dimension tables to support second normal form. With most databases, 99 percent of the access occurs in a read-only fashion in which updates represent the other 1 percent. Using star schemas results in queries that are easy to write and run efficiently.

Agile modeling techniques are presented in detail in Chapter 8.

Online transaction processing (OLTP) databases are typically used nowadays in conjunction with application servers for data access. For example, J2EE-based application servers support the notion of entity beans, which provide an object representation to a relational database. Since entity beans map one for one with a database's underlying table/view structure, it makes sense to consider, where appropriate, a less normalized view of databases. Many application servers can enforce the integrity rules that would otherwise be contained within the database, so following strict rules of normalization is no longer required. Traditionally, following rules of normalization resulted in an increase in performance. However, using this same mind-set with J2EE applications, and specifically with entity beans, could result not only in additional coding but also in lost performance since the application server has to manage relationships as well as create an object representation of each table/row run-time.

Star schemas are also easier to refactor because they are less normalized. They also map better to an object-oriented run-time model, especially when used in conjunction with adaptive object models that can build their representations using metadata obtained from a database.

When using star schemas, it is better to have a column containing precalculated values instead of performing calculations at run-time. This also has the effect of simplifying queries. For example, if your database supported a calendar table represented as a dimension in which each row at the day level has a "number of days since time began" column, your applications would be made simpler because would not have to calculate across month boundaries and so on. Dimension tables that are wide and contain pre-exploded values allow one to develop applications faster. Besides, columns are very easy to add to a database today.

One of the thoughts behind normalization was related to increasing the quality of data. Using normalization beyond Third Normal Form (3NF) does have the effect of potentially increasing data quality but comes at the expense of being agile. We believe a better approach is to build the quality information directly into the tables. The quality information could explain when, where, how good, etc., which is simply additional metadata.

Automating processes using metadata, creation of database scripts, and giving everyone the tools they need to become productive are the key outcomes of an agile process. DBAs should automate every process they perform on a repeated basis and, whenever possible, should ensure that these processes and the tools they use are accessible by team members. Process automation is the only activity that will allow DBAs to respond in a timelier manner.

We now present final best practices, which range across all the topics discussed previously in this chapter.

Applying the Agile Approach

Figure 11-7. Development timeline.

Table 11-4. Database iterations.

Working with Scripts

Normalization