Bio-graphics, BioSQL and Rails part 2

In  part 1 of this series we created a rails application and connected it to a BioSQL database. We also overwrote the rails convections to accommodate our legacy schema.

To understand the BioSQL schema, please review the documentation here. A brief overview of is as follows. Every record we enter into our database is a ‘bioentry’ and goes to the bioenty table. A bioentry can be composed of the following entities: the record’s public name, public accession and version, its description and an identifier field.

The actual sequence data is stored in the biosequence table which contains raw sequence information associated with a bioentry, and alphabet information (‘protein’, ‘dna’, ‘rna’). This is because not all records in our database need to be associated with a raw sequence. Additional sequence information is stored in the seqfeature table together with other qualifiers.

The location of each seqfeature (or sub-seqfeature) is defined by a location entity, describing the stop and start coordinates and strand. This information is stored in the location table.

In our rails application we are going to create some models and a few controllers. In RESTful language, we are actually creating resources. In this example we will be very simplistic and just create a biodatabase, taxon, bioentry, biosequence, seqfeature, location resources. We will also create associations between them in their model classes. But before that delete the index.html file from your rails application public folder and add the following line to your configurations/routes.rb file

 map.root :controller => "biosequences"

To quickly create the models, controllers, associated views and a test suite for each of our resources, just run the rails generate scaffold command, passing the name of the model as an argument. For example,

generate scaffold Bioentry

will create a bioentry model, a bioentries_controller, associated views (index,show,edit and new), a migration file, though in our case we do not need it. When you finish scaffolding, the routes.rb file should have the following resources declared.

  map.resources :seqfeatures
  map.resources :locations
  map.resources :bioentries
  map.resources :biosequences
  map.resources :taxon
  map.resources :biodatabases

Let us create some mandatory associations for the models.

Edit the /models/biodatabase.rb file by adding the following

 has_many :bioentries #a biodatabase is associated with many bioentries
 validates_uniqueness_of :name  #The name foe each biodatabase is unique!

Edit the /models/bioentry.rb file by adding the following

    belongs_to :biodatabase
    belongs_to :taxon
    has_one :biosequence

Edit the /models/taxon.rb and add

   has_one :bioentry

Edit the /models/biosequence.rb file by adding:

  set_primary_key :bioentry_id #biosequence uses bioentry_id as a primary key!
  belongs_to :bioentry

edit the /models/location.rb file by adding:

 belongs_to :seqfeature

Edit the /models/seqfeature.rb file by adding:

  belongs_to :bioentry
  has_many :locations

Note that most likely you will be adding huge files to the database. BioSQL comes with a set of  perl scripts to enable you do that. Until bioruby 1.3 is released you will have to use the perl scripts to add huge datasets. All the documentation to do that is available from the BioSQL website. I used a perl script load_ncbi_taxonomy.pl to load taxon data to my database. This script comes with the BioSQL. (It did not seem to work on my system, I will sort that later)

To make this post shorter and get to the meat of it, i will assume that you have some existing data in your biosql database. If not, create some dummy data to populate, the biodatabase, bioentry,biosequence, seqfeature and location tables. In Part 3, I will show you how to create the necessary views to populate the database. After all biologists don’t want to interact with raw SQL queries and sometimes have no idea of running scripts, however they are very web savy!

Edit the /biosequences/show.html.erb to look as follows:

<h2><%= @biosequence.bioentry.name%>(<%= @biosequence.alphabet %>)</h2>
<p>Sequence</p>
<%= @biosequence.seq %><br/>


<%= link_to 'Edit', edit_biosequence_path(@biosequence) %> 

Now navigate to http://localhost:3000/biosequences/1

and then navigate to http://locahost:3000/biosequences/1.xml The XML version of your sequence is also available!

Lets add some ability to render graphics for the sequences.

Add the following lines at the top of the biosequence.rb model file

 require 'stringio'
 require 'base64' 

In the biosequence.rb model class, create a new method called draw_graphic.

def self.draw_graphic(value)
      #get the name and length of the main feature to be drawn
     main_feature = Bioentry.find(value)
     len = main_feature.biosequence.length.to_i
     name = main_feature.name

    #create a Biographics panel and add a track
      @my_panel = Bio::Graphics::Panel.new(len,:width=> 900)
      @track = @my_panel.add_track("#{name}",:glyph=>'directed_generic')

     #specify the range for the main feature
     main_feature_range = "1..#{len}"
      @track.add_feature(Bio::Feature.new("#{name}",main_feature_range), :label=>" ")

    #write the output to memory
        output = StringIO.new
        @my_panel.draw(output)
        return output.string
  end

This method will be called by an action method in biosequence_controller.rb file.

  def to_image
    begin
      image = Biosequence.draw_graphic(Biosequence.find(params[:id]))
      send_data(image, :filename => "graphic.svg", :disposition => "inline")
    rescue  ActiveRecord::RecordNotFound
      add_error("Error:Attempt to call image without specifying a biosequence  ID")
      redirect_to :action=>'index'
    end
  end

We add a rescue block to capture record not found errors. In RESTful applications a controller is limited to seven actions. So we need to add a collection to our biosequence resource in routes.rb. This is how we do it.

  map.resources :biosequences,:collection=>{:to_image=>:get}

Now we need to modify our /biosequences/show.html.erb file, to enable rendering of the graphic. For that we will create a helper method so that our show.html.erb view is ‘clean’. In helpers/biosequences_helper.rb file, add the following code

  def render_image(feature_obj)
     image_tag(url_for({:action=>'to_image',:id=>feature_obj}))
  end

And in the /views/biosequences/show.html.erb file add the following line of code

<%= render_image(@biosequence) %><br/>

Now assuming  that you have a biosql database with valid data, navigate to

http://localhost:3000/biosequences/show/1

screenshort

screenshort

The above is a screen shot from my example application while I was writing this tutorial.

The source code for this example  application is available from github

For a full review of the methods available for biographics please check the project’s git repository and the rdoc.

Bio-graphics, BioSQL and Rails part 1

In these series I will show you how to quickly add graphics support to a bioinformatics database rails application. We are going to use the biographics library by Jan Aerts, the BioSQL database schema, and rails 2.2.2 (also works with 2.3.2)  In this simple example we want to represent a sequence as a graphic, such that we can view it in a web browser more or less the way Gbrowse works. Each main feature has different subfeatures at different locations along it.

——————————————————————- main feature

——- ——– ———— ——– ——    subfeatures

We need to have the following installed, rails 2.1.1, bio 2.1, biographics 1.4 all available as gems and a database based on BioSQL schema.

We need to download the BioSQL schema located here. The latest version as of this writing is BioSQL v1.0 (code-named Tokyo) release, v1.0.1. Create a database called biosql_development. I am on Ubuntu Linux with Mysql 5.0.

george:>mysql -u george -p
:enter password
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 27
Server version: 5.0.51a-3ubuntu5.4 (Ubuntu)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> create database biosql_development;
Query OK, 1 row affected (0.02 sec)

mysql> 

I have created a database called biosql_development. Why am i not using migrations? The reason is that BioSQL has some agreed standards on table names and schema convection which are not compatible with rails database creation and table naming conventions. However Rails allows us to override these default convections, when working with legacy databases, as will be our case.

After creating the database, load the BioSQL schema to the empty database. First we need to tell mysql which database to use.

mysql> use biosql_development;

then load the schema

mysql> source /home/george/Desktop/downloadsfolder/biosql-1.0.1/sql/biosqldb-mysql.sql;
Query OK, 0 rows affected, 1 warning (0.48 sec)

Query OK, 0 rows affected (0.15 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected, 1 warning (0.01 sec)

Query OK, 0 rows affected (0.03 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected, 1 warning (0.01 sec)

 ........ trucated
mysql>

Now we need to create a rails application and connect to this database.

I use the Netbeans IDE development environment for creating ruby and rails applications. Go ahead and create a rails application and specify to use mysql as the database adapter.

To connect to our legacy database, we need to override some convections. First disable table plurulization, and tell rails that the table primary name is named as tablename_id as opposed to just the id column expected by rails. To do that

Create a new file in your application configurations/initializers directory called override_rails.rb (you can call it whatever).

 class ActiveRecord::Base
  self.pluralize_table_names = false

  self.primary_key_prefix_type = :table_name_with_underscore
 end

The two lines above tells ActiveRecord not to expect the table names to be plural and that the primary key for each table is named as tablename_id format.

Also create another one called external_libraries.rb in the initializers directory, as you can tell this is where I want to put my require statements for loading external libraries.

require 'rubygems'

#load the bioinformatics library
require 'bio'

#load the biographics library
require 'bio-graphics'

#load the sql views extension library
gem 'rails_sql_views'
require 'rails_sql_views'

This file loads our gems. The rails_sql_views gem allows us to create views and access them by creating models corresponding to the views.

At this point if you run rake db:schema:dump, we will have a rails based BioSQL schema and which we can conveniently use to create a BioSQL database on any Relational database that rails supports and this includes Microsoft SQL server, DB2, Oracle, SQLlite and a host of others. All that would be required is to change the database.yml file to suit the adapter of choice and then execute rake db:schema:load to load the BioSQL schema.

Please note that if your are using rails 2.2.2,  you may want to comment the lines

unless Kernel.respond_to?(:gem)
  Kernel.send :alias_method, :gem, :require_gem
end

in rails_sql_views(0.6.1), otherwise running db:schema:dump will cause rake to abort.

In the next part I will describe how to create the necessary resources for our RESTful(Representational State Transfer) bioinformatics web application and rendering of the graphics.


What’s new in Bioruby edge

Changes to Bio::Blast

Naohisa Goto has announced changes to the Bio::Blast.reports to support default -m 0 and tabular -m 8
formats in addition to XML (-m 7) form. I think this is really nice and convenient!

Previously it meant that for bioruby to parse a Blast file, you had to have your blast results in XML output which Bio::Blast::Reports would understand. However, by default Blast gives an  -m0 output and without that prior knowledge you may spend hours wondering what is wrong when parsing default blast output files. With Bioruby 1.2.1 this will not work

require 'rubygems'
require 'bio'

report_file = "/home/george/esther_blast_files/blast_output2.txt"
Bio::Blast.reports(report_file) do |report|
  puts report.class
end

Unless blast_output2.txt is in XML format

In the upcoming Bioruby 1.3 the default Blast file can now be parsed, for example,

require 'rubygems'
require 'bio'
report_file = "/home/george/esther_blast_files/blast_output2.txt"
Bio::FlatFile.open(Bio::Blast::Default::Report,report_file) do |ff|
ff.each do |rep|
   puts rep.statistics
   rep.iterations.each do |itr|
      puts itr.hits.size

    itr.hits.each_with_index do |hit,i|
      puts hit.hit_id
      puts hit.len
     end
   end
 end
end

Bio::Blast.remote now supports DDBJ in addition to Genomenet. It would be a nice idea to support NCBI as well.

Changes to Bio:sequence

It is possible to create  sequence objects from Bio::GenBank, Bio::EMBL, and Bio::FastaFormat by using the to_biosequence method

gb = Bio::GenBank.new(genbank_file.gb)
gb.to_biosequence

Bio::SQL Support

Thanks to Raoul and Naohisa, support for BioSQL has been rewritten by using  ActiveRecord.

#Make a connection

connection = Bio::SQL.establish_connection(path_to_database.yaml,'development')
#list databases

databases =Bio::SQL.list_databases

#retrieve a sequence
sample_seq = Bio::SQL.fetch_accession('some_accession_number')

#get number of seqeunces in the database
puts Bio::SQL.list_entries

#get references associated with an entry
puts sample_seq.references

#create an embl format
puts sample_seq.to_biosequence.output(:embl)

Changes to Bio::GFF2 and Bio::GFF3

GFF2/GFF3 formatted texts are now supported but there will be backward portability issues with bio 1.2.1 since some incompatible changes have been incorporated.  Bio::GFF::Record.comments has been renamed to comment and comments= is now comment=

Both Bio::GFF::GFF2::Record.new and Bio::GFF::GFF3::Records.new, can now take 9 arguments that correspond to GFF columns making it easy to create a Record object directly without need for  formatted text.

Both Bio::GFF::GFF2::Record#attributes and Bio::GFF::GFF3::Record#attributes have been changed to return a nested array containing tag, value pairs, to obtain a  hash, use the to_hash method

To support data output for GFF2/GFF3, new methods have been added:  Bio::GFF::GFF2#to_s, Bio::GFF::GFF3#to_s, Bio::GFF::GFF2::Record#to_s,and Bio::GFF::GFF3::Record#to_s

Lots of other changes have been incorporated for the GFF classes and you can view the change log at github

CodeML parser

A wrapper for PAML codeml program that is used for estimating evolutinary rate has been added.  The class provides methods  for generating the necessary configuration file. The new Bio::PAML::Codeml::Report and PAML::Codeml::Rates  provides simpel classes  for accessing the codeml report and rates file.  This example is from the example given in the source code

require 'bio'
# Reads multi-fasta formatted file and gets a Bio::Alignment object.
     alignment = Bio::FlatFile.open(Bio::Alignment::MultiFastaFormat, 'example.fst').alignment
     # Reads newick tree from a file
     tree = Bio::FlatFile.open(Bio::Newick, 'example.tree').tree
  # Creates a Codeml object
  codeml = Bio::PAML::Codeml.new
     # Sets parameters
     codeml.parameters[:runmode] = 0
     codeml.parameters[:RateAncestor] = 1
     # You can also set many parameters at a time.
   codeml.parameters.update({ :alpha => 0.5, :fix_alpha => 0 })
     # Executes codeml with the alignment and the tree
     report = codeml.query(alignment, tree)

Lots of Bugs  have been fixed and also support for Ruby 1.9 has been added. Its great thanks to the bioruby developers for their time and the excellent new changes!



PlasmoDB 5.4 released

PlasmodbThe ApiDB team have announced a new release of the Plasmodb database version 5.4. The database hosts genomic and proteomic data for different species of the parasitic eukaryote Plasmodium, which is the causative agent for malaria. It brings together data provided by numerous laboratories worldwide. From an email sent to registered database users,

New data in the new release include:

  • A slightly modified reference genome for P. falciparum
  • P. berghei gametocyte proteomics data
  • Many additional P. falciparum SNPs
  • Additional ESTs
  • Expression profiling data for antigenic and adherent variants of P. falciparum 3D7
  • User comments submitted prior to June 2007 have now been incorporated into the official annotation.

A brief list of new features include:

  • faster loading of Gene and Genome Browser pages.
  • Improved synteny views in the Genome Browser.
  • Browser views of rodent malaria genomes colored to indicate chromosomes.
  • Gene page links to various external data sources (including PlasmoMAP, TDRtargets, UCSC P. falciparum genome browser, Ontology-based Pattern Identification and literature databases).
  • More convenient access to help … please click the “Ask us a Question” link on the left of every page, or the “Contact Us” at bottom to report problems or suggest improvements to the database.

Many thanks to the Plasmodb team and the entire ApiDB team for the the recent improvements and for the new datasets.

Plasmodium falciparum re-annotation workshop opens

The Plasmodium genome re-annotation workshop opened on 21st October at the Sanger center. The Workshop runs till the 26th and aims to re-annotate the P. falciparum genome. In a welcome message Prof David Roos pointed out that a major goal of the workshop is to ascribe new or updated functions to gene models, reflecting the current state of knowledge in the wider malaria community.

The Plasmodium falciparum sequencing project was completed in 2002 and since then the Plasmodb database which is currently at version 5 has been the primary source of P. falciparum data and genomic information. With 60% of the P. falciparum genes annotated as hypothetical, it is time to reduce the number of hypothetical genes by providing annotations where known and possible.

Issues that will be addressed and visited include:

- standards for the use of structured gene ontologies in gene/genome annotation

- naming conventions for “hypothetical proteins”, “conserved hypotheticals”, “putative kinases”, etc

- naming conventions for large gene families

- standards for inferring function from orthology, motif/domain conservation, or ‘guilt by association’ based on functional genomics data

- standards for transfering annotations to orthologs in other Plasmodium species

- plans and proposals for further Plasmodium sequencing and other genomics resources

- pipelines for ensuring currency and consistency of data in GenBank/EMBL, GeneDB, PlasmoDB, etc

- future requirements and needs for Plasmodium informatics resources

- annotation projects not completed during the workshop … and strategies for ensuring completion

The workshop is sponsored by Sanger institute and plasmodb