TITLE
VERSION
ARCHITECTURE OVERVIEW
DATA MODULES
LOGIC MODULES
PRESENTATION MODULES
CONFIGURATION MODULES
GENERAL FLOW FOR HANDLERS
SQL CONVENTIONS
TABLE DESCRIPTIONS
TIPS
AUTHOR

TITLE

CMap Code Overview

VERSION

$Revision: 1.10 $

CMap is a CGI application for viewing comparative and genetic maps. Written entirely in Perl, this application will run on many different operating systems and relational database management systems (RDBMS), including Oracle, MySQL, Sybase and PostgreSQL. CMap can create images using "libgd" for standard image formats like PNG and JPEG as well as creating SVG (Scalable Vector Graphics). The code was originally written for the Gramene project (http://www.gramene.org/), a comparative mapping resource for crop grasses, but much has been done to make the application generic enough to be used with many different types of data.

ARCHITECTURE OVERVIEW

Care has been taken to separate functionally different parts of the code into different modules, roughly corresponding to a traditional "three-tiered" structure of data, logic, and presentation layers. You'll find all the database interaction encapsulated into the Bio::GMOD::CMap::Data* modules, all the "logic" (the code that lays out the map components) in the Bio::GMOD::CMap::Drawer* modules, and all the HTML generation in the Bio::GMOD::CMap::Apache* modules.

DATA MODULES

As stated above, all the database interaction happens in the Bio::GMOD::CMap::Data* modules. One goal of this project has always been compatibility with multiple RDBMSs (perhaps if only from necessity, as the system was developed using MySQL but is deployed on Oracle). As a consequence, all the SQL will be placed (eventually) into object-oriented modules where the statements can be sub-classed and modified to run with a particular database without affecting any other SQL.

The Bio::GMOD::CMap::Data module has as a component an "SQL" object, with the choices right now confined to Bio::GMOD::CMap::Data::[Generic|MySQL|Oracle]. The "Generic" module is the superclass of the other two (and conceivably any others, such as classes for PostgreSQL, Sybase, etc.). All SQL statement methods are defined in the Generic module, and any that don't work for a particular RDBMS can be overridden in a subclass. This also allows users of other systems to create their own modules and drop them into place with very little effort. All that need happen is to subclass Bio::GMOD::CMap::Data::Generic (as noted in the perldocs), and then add a line to the Bio::GMOD::CMap::Constants to point to the new module.
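The subclassing pattern described above can be sketched as follows. This is a minimal illustration only: the package names (My::SQL::Generic, My::SQL::Oracle), the method names, and the SQL text are all hypothetical stand-ins, not the actual CMap classes or statements.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a "Generic" SQL class: every statement
# lives in its own method so that a subclass can replace just one.
package My::SQL::Generic;

sub new { my $class = shift; return bless {}, $class }

# Hypothetical statement method; the SQL is illustrative only.
sub feature_name_sql {
    return q[
        select feature_name
        from   cmap_feature
        where  upper(feature_name) like ?
    ];
}

# A second hypothetical statement that no subclass needs to change.
sub map_name_sql {
    return q[select map_name from cmap_map where map_id=?];
}

# Hypothetical RDBMS-specific subclass overriding a single statement.
package My::SQL::Oracle;
our @ISA = ('My::SQL::Generic');

sub feature_name_sql {
    return q[
        select feature_name
        from   cmap_feature
        where  upper(feature_name) like upper(?)
    ];
}

package main;

my $sql = My::SQL::Oracle->new;
print $sql->feature_name_sql;      # the overridden statement
print $sql->map_name_sql, "\n";    # inherited unchanged from the generic class
```

Only the statement that differs is redefined; everything else falls through to the generic class, which is what keeps a new RDBMS module small.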

LOGIC MODULES

All the modules that actually do something toward laying out the comparative maps live in the Bio::GMOD::CMap::Drawer* namespace. The top level, "Drawer.pm," is basically the coordinator of the objects it manipulates. The Drawer creates a "Map" object for each map (or map set) that the user has requested. It asks each Map to lay itself out, then it adjusts the frame, and writes the image to a file. It then is able to tell the calling object the filename of the image and its height and width.

Eventually other modules should fall within this classification, especially the module for administrative functions such as creating and editing map sets, maps, features, correspondences, etc. All of those functions are currently spread around in the Bio::GMOD::CMap::Admin and Bio::GMOD::CMap::Apache::AdminViewer modules and the cmap_admin.pl script. Eventually I hope to move all the logic into Bio::GMOD::CMap::Admin and have the web and command-line interfaces simply invoke methods on this Admin object.

PRESENTATION MODULES

The modules in the Bio::GMOD::CMap::Apache namespace are responsible for actually displaying the maps through a web interface. All of the modules are basic Perl classes that inherit from the Bio::GMOD::CMap::Apache superclass. This superclass creates the Template Toolkit object and the "page" object (see perldocs), and handles any errors thrown by the derived classes, reducing the amount of code needed to create a new handler.

You'll notice that there is no HTML mixed with Perl code, as all the web pages are generated with the Template Toolkit Perl module (http://www.template-toolkit.com/) written by Andy Wardley. Template Toolkit is a powerful and freely available Perl templating system, and the hope is that by using it, non-technical people who want to tweak the HTML can do so without interfering with the code.

CONFIGURATION MODULES

The Bio::GMOD::CMap::Config module handles reading and parsing the config files.

All local configuration of CMap should be done through the "cmap.conf" directory. Of course, the directory doesn't have to be called "cmap.conf." It can be called whatever you like, so long as the absolute path to the directory is in the Bio::GMOD::CMap::Constants file. This path is automatically written during installation if you do the standard "perl Build.PL; ./Build; ./Build install" process.

There are now two types of config files. The "global.conf" file handles information that all the data sources need, like the default data source. There is also one config file for each data source; this file handles most of the configurable options.

There are defaults provided for almost every option in the local config file, with the exception of the database connection info and the template and image cache directories. The latter two should be set during installation, and the first should be set by the installer after installation (the installer is prompted to do this after "./Build install"). If you comment out any of the options in "cmap.conf" (except "database," "template_dir" and "cache_dir"), there are defaults in the Bio::GMOD::CMap::Constants file.
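The fallback behaviour described above might look roughly like the sketch below. It is not the code in Bio::GMOD::CMap::Config: the default option names ("image_type," "max_features"), the flat "database" value, and the effective_config() function are hypothetical, and only the three required option names come from this document.

```perl
use strict;
use warnings;

# Hypothetical defaults; in CMap the real ones live in the Constants module.
my %DEFAULT = ( image_type => 'png', max_features => 200 );

# The three options that this document says have no default.
my @REQUIRED = qw( database template_dir cache_dir );

# Merge the local options over the defaults, refusing to continue
# if a required option was left out (hypothetical helper).
sub effective_config {
    my %local   = @_;
    my @missing = grep { !defined $local{$_} } @REQUIRED;
    die 'Missing required option(s): ', join( ', ', @missing ), "\n"
        if @missing;
    return ( %DEFAULT, %local );    # local values win over defaults
}

my %config = effective_config(
    database     => 'dbi:mysql:CMAP',    # hypothetical value
    template_dir => '/tmp/templates',    # hypothetical value
    cache_dir    => '/tmp/cache',        # hypothetical value
    image_type   => 'svg',               # overrides the default
);

printf "%s=%s\n", $_, $config{$_} for sort keys %config;
```

Options that are commented out simply never appear in the local hash, so the default shows through without any special handling.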

Feature, map and evidence types are now defined and controlled in the data source config files.

GENERAL FLOW FOR HANDLERS

The web presentation modules are all located under the Bio::GMOD::CMap::Apache namespace and are instantiated as objects. In order to understand how they are invoked, I will describe how the main map viewer (Bio::GMOD::CMap::Apache::MapViewer) works.

A user goes to "/cgi-bin/cmap/viewer" with a browser.
The "cmap" CGI script passes control to Bio::GMOD::CMap::Apache.
Bio::GMOD::CMap::Apache inspects the "path_info" and sees that the user is asking for the "viewer." It looks up an internal table to see if that is a valid path. It is, so the module then instantiates an object of the proper type (in this case, Bio::GMOD::CMap::Apache::MapViewer) which is a subclass to handle the request. Control is then passed to the new object.
The module needs some data for the <form> elements it has, so it calls the appropriate methods of the Bio::GMOD::CMap::Data module for what it needs.
If there are enough arguments to display a map, it will create a Bio::GMOD::CMap::Drawer object. The drawer will call the Bio::GMOD::CMap::Data object for the data it needs to draw the map(s) (features, relationships, map titles, etc.), lay out all the map elements, and finally write the image out to a temporary file.
If the image is created properly, then the drawer passes back to the map viewer handler the name of the map image and the coordinates on the map of elements (to make the image clickable) that the browser needs to display the map.
If all goes well, then the handler uses Template Toolkit to format the HTML so that the user sees a map that they can click on to see other things.

The above scenario is probably the most involved process in the comparative maps, but it shows the way that distinct pieces of the problem are split into specialized modules and objects.
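The path lookup and hand-off in the steps above can be sketched as a small dispatch table. This is an illustration under assumptions, not the code in Bio::GMOD::CMap::Apache: the handler classes, the "feature_search" path, the %DISPATCH hash, and the dispatch() function are hypothetical; only the "viewer" path comes from this document.

```perl
use strict;
use warnings;

# Hypothetical handler classes; real handlers would render a page.
package My::Handler::MapViewer;
sub new     { my $class = shift; return bless {}, $class }
sub handler { return 'map viewer page' }

package My::Handler::FeatureSearch;
sub new     { my $class = shift; return bless {}, $class }
sub handler { return 'feature search page' }

package main;

# Hypothetical lookup table from the requested path to a handler class.
my %DISPATCH = (
    viewer         => 'My::Handler::MapViewer',
    feature_search => 'My::Handler::FeatureSearch',
);

# Validate the path, instantiate the matching class, and pass control to it.
sub dispatch {
    my $path_info = shift // '';
    ( my $path = $path_info ) =~ s{^/+|/+$}{}g;    # "/viewer/" becomes "viewer"
    my $class = $DISPATCH{$path} or return "Unknown path: '$path'";
    return $class->new->handler;
}

print dispatch('/viewer'), "\n";
print dispatch('/bogus'),  "\n";
```

Adding a new page under this scheme means writing one handler class and adding one line to the table.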

SQL CONVENTIONS

The tables used by the comparative maps follow a fairly rigid naming convention so that they should be able to integrate easily with existing databases.

Every table used by the comparative maps begins with the prefix "cmap_" so that it would be highly unlikely that table names would conflict with those in an existing database.
Tables are always named in the singular ("cmap_feature" not "cmap_features").
Every descriptor (table, field, index names, etc.) is named in lower case letters (and some numbers) with underscores separating words. Additionally, certain conventions are followed for fields according to their function:

- Primary key:

The primary key of the table is always defined as the first field in the table (though the first field in a table is not always guaranteed to be the primary key). A table's primary key will be the name of the table minus the "cmap_" prefix plus the token "_id." So, for example, the table "cmap_feature" has as its primary key "feature_id." Additionally, the primary key of a table will always be an ascending integer value (like MySQL's "auto_increment" field, only handled in Perl code so as to be portable to databases without such types of fields). This naming convention is never used for any other type of field except the "accession id," which is exemplified by "feature_acc." The type and purpose of any field ending in "_id" are always obvious: if its name matches the table's name, then it is a primary key; otherwise it is a foreign key.

- Boolean fields:

Since not all databases have a boolean datatype ("Yes/No," "On/Off" kind of data), the fields that hold this kind of data are declared using a small integer datatype (e.g., "tinyint" in MySQL). Names of boolean fields include a verb indicating that the record "is" or "has" (or "can" or "wants" or whatever) some value, always in the affirmative, e.g., "is_relational_map."

- Date-time fields:

Date-time fields will be named like "*_on," e.g., "published_on."

There are no stored procedures or referential integrity (RI) constraints at the database level (apart from primary key and some unique indices/constraints). This is intentional in order to make the code portable to RDBMSs which don't fully support such features (such as MySQL, one of the databases used in the development of the code). All such logic and RI is addressed in the Perl code (for better or worse).
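The field-naming rules in this section are regular enough to be checked mechanically. The following is a minimal sketch, not part of CMap: the classify_field() function is hypothetical, the list of verb prefixes is an assumption based on the examples given here, and the sample field names are used only to exercise each rule.

```perl
use strict;
use warnings;

# Classify a field by its name alone, following the conventions in this section.
sub classify_field {
    my ( $table, $field ) = @_;
    ( my $base = $table ) =~ s/^cmap_//;    # "cmap_feature" becomes "feature"
    return 'primary key' if $field eq "${base}_id";
    return 'foreign key' if $field =~ /_id$/;
    return 'boolean'     if $field =~ /^(?:is|has|can|wants)_/;    # assumed prefixes
    return 'date-time'   if $field =~ /_on$/;
    return 'other';
}

# Illustrative field names; they are not a listing of any real table.
for my $field (qw( feature_id map_id is_enabled published_on feature_name )) {
    printf "%-14s %s\n", $field, classify_field( 'cmap_feature', $field );
}
```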

Table Aliases:

When writing queries using multiple tables, a query uses table aliases, e.g.:

  select map.map_name,
         f.feature_start
  from   cmap_map map,
         cmap_feature f
  where  upper(f.feature_name)=?
  and    f.map_id=map.map_id

Placeholders:

Whenever possible, placeholders are used in the SQL instead of direct variable substitution. The only place where this is not feasible is in "IN" statements [e.g., "foo_id in (1,2,3)"]. The advantages are numerous, the biggest being that the database ends up correctly quoting and escaping your parameters.
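For scripts written against the database, one common DBI idiom (not something this document says CMap does) is to generate one placeholder per value so that even an "IN" list is bound rather than interpolated. A minimal sketch, in which the in_clause() helper is hypothetical:

```perl
use strict;
use warnings;

# Build an "IN" clause with one placeholder per value and return it
# together with the values to bind (hypothetical helper).
sub in_clause {
    my ( $field, @values ) = @_;
    die "No values given for $field\n" unless @values;
    my $marks = join ',', ('?') x @values;
    return ( "$field in ($marks)", @values );
}

my ( $where, @bind ) = in_clause( 'f.feature_id', 1, 2, 3 );
my $sql = "select f.feature_name from cmap_feature f where $where";

print "$sql\n";
print "bind values: @bind\n";

# With a DBI handle in $db, the statement could then be run as:
#   my $rows = $db->selectall_arrayref( $sql, {}, @bind );
```

Some drivers limit the number of bind values in one statement, so very long lists may still need to be split up.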

TABLE DESCRIPTIONS

In the "docs" directory, you will find a schema diagram illustrating the structure and relationships of the tables. In the "sql" directory, you will find create statements of the tables for MySQL, Oracle and PostgreSQL. Following is a general description of each table, what kind of data it is supposed to hold, and how it fits in with the others. Most of the fields in the tables are described more fully in the ADMINISTRATION document when discussing the forms presented in the web admin tool.

cmap_map_set:
Arguably the "base" table, a "map set" is a collection of "maps," which themselves are just points on a line. The map set can be disabled (effectively removing it from all views) by setting the "is_enabled" field to zero.
cmap_map_type (deprecated):
This table has been removed and its functionality moved to the config file and other tables.

Maps are of some type determined by the data curator. It is useful to researchers to know that a map is a "genetic" map and that the distances given are in "centiMorgans" as opposed to a "physical" map where the distances are "bands." More importantly, though, the curator can decide how different maps are drawn so as to make them visually distinctive. By selecting from shapes, colors and widths, the curator has great discretion over the presentation of maps. Additionally, the curator can determine if the maps can stand alone (like most genetic maps) or if they are a special kind of map set that CMap calls "relational," meaning that the maps can only be shown *in relation to* some other map (of the first, "non-relational" kind, no less). Examples of this include physical fingerprint contig maps (FPC), QTLs, and haplotypes. These maps are usually composed of many smaller fragments that have some correspondences to other maps. As such, they are just drawn to a size relative to the distance their correspondences cover on the reference map.
cmap_species:
A fairly obvious table, given that we hold genetic data. Map sets are tied to a species.
cmap_map:
Like a "map" from Picayune to New Orleans might tell you that the distance is 50 miles and that there is a coffee shop on exit 3, a map says that it is some measure of distance long and that some "feature" can be found at some location.
cmap_map_cache (deprecated):
This table has been removed.

This is a "permanent temp" table used only to speed up the web queries. It is used to remember which maps occur in each of the "slots" while collating the data used to draw the maps for each request. The requests are identified by the PID (process ID) of the Apache child serving the user, and the records *should* be deleted after the request has finished. Records have a timestamp that indicates the age of the request. If you see records in this table that are older than even a few minutes, it should be safe to delete them.
cmap_feature:
This is the "something" that occurs on a map. It has a name and a starting position. The feature may span, in which case it will have a stop position, too. Certain features can be designated as being "landmarks," and the user can decide to only view landmarks so as to make the maps simpler.
cmap_feature_alias:
Holds any number of alternate names for each feature.
cmap_attribute:
This table allows you to attach any arbitrary key/value pair of data to almost any database object. See the "attributes-and-xrefs.pod" document for more information.
cmap_feature_type (deprecated):
This table has been removed and its functionality moved to the config file and other tables.

Each feature is of some type. What particular types of features exist are completely determined by the data curator. The data curator decides how to draw the feature on the map, selecting from pre-determined shapes and colors.
cmap_feature_correspondence:
The whole point of comparative maps is to show the relationships among maps. These correspondences are all held in this table, and they simply state that one feature is related to another. The correspondence can be disabled by setting the "is_enabled" field to zero.
cmap_correspondence_evidence:
This table holds the reasons why we can say that one feature has some correspondence to another. There may be several reasons, each of a different type, optionally with some sort of score.
cmap_evidence_type:
This table has been removed and its functionality moved to the config file and other tables.

These are the types of evidence that we can use in the previous table. For instance, we can say two features are related because they have the same name or the same sequence. Some types of evidence are stronger than others, so they can be ranked accordingly.
cmap_correspondence_lookup:
This is sort of a cheater table introduced in order to make it easier to find correspondences. It doubles the number of records for any single correspondence. E.g., if "foo" corresponds to "bar," then two records go into this table, one saying "foo=>bar" and the other saying "bar=>foo" and both pointing to the "canonical" correspondence record in the "cmap_feature_correspondence" table.
cmap_correspondence_matrix:
This is another cheater table used to make lookups of feature correspondences fast. The idea is that correspondences don't often change, so it's quicker and easier to pre-compute all the possible combinations for the matrix and cache them in a table. Every time the number of correspondences changes for any reason, it is necessary to manually reload this table using the cmap_admin.pl tool.
cmap_xref:
This table holds database cross-references which can attach to individual database objects (just like attributes) or to whole classes of objects. See "attributes-and-xrefs.pod" for more information.
cmap_next_number:
This table stores the next number for the primary key of every table needing such. It's basically a simple workaround for MySQL's "auto_increment" feature, Oracle's "sequence" type, PostgreSQL's "serial" type, etc. By implementing this in Perl, I don't have to worry how (or even "if") a particular RDBMS supports a "one-up" field.
cmap_saved_link:
This table stores the saved links. The "session_step_object" field contains a frozen object (see freeze() in Storable.pm) that holds the session information.
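For readers unfamiliar with Storable, the round trip behind a frozen object looks like the sketch below. The contents of $session are hypothetical; the actual structure stored in "session_step_object" is defined by the CMap code.

```perl
use strict;
use warnings;
use Storable qw( freeze thaw );

# Hypothetical session data; the real structure is defined by CMap.
my $session = {
    step  => 1,
    slots => { 0 => { map_set => 'MS1' } },
};

my $frozen   = freeze($session);    # serialized to a single binary string
my $restored = thaw($frozen);       # rebuilt as an independent deep copy

print "step: $restored->{step}\n";
print "map set: $restored->{slots}{0}{map_set}\n";
```

Because the frozen value is binary, it needs a column type and a database driver that preserve arbitrary bytes.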

TIPS

If you'd like to write a script to access the database directly, you can get a handle to the database quite easily using the Bio::GMOD::CMap modules. Here's an example:

  #!/usr/bin/perl

  use strict;
  use Bio::GMOD::CMap;

  my $cmap = Bio::GMOD::CMap->new or die Bio::GMOD::CMap->error;

  # optional, only if you have multiple data sources defined
  # $cmap->data_source('Foo');

  my $db = $cmap->db or die $cmap->error;

AUTHOR

Ken Y. Clark, kclark@cshl.edu
Ben Faga, faga@cshl.edu