Tagged: bioruby

Bio-graphics, BioSQL and Rails part 1

In these series I will show you how to quickly add graphics support to a bioinformatics database rails application. We are going to use the biographics library by Jan Aerts, the BioSQL database schema, and rails 2.2.2 (also works with 2.3.2)  In this simple example we want to represent a sequence as a graphic, such that we can view it in a web browser more or less the way Gbrowse works. Each main feature has different subfeatures at different locations along it.

——————————————————————- main feature

——- ——– ———— ——– ——    subfeatures

We need to have the following installed, rails 2.1.1, bio 2.1, biographics 1.4 all available as gems and a database based on BioSQL schema.

We need to download the BioSQL schema located here. The latest version as of this writing is BioSQL v1.0 (code-named Tokyo) release, v1.0.1. Create a database called biosql_development. I am on Ubuntu Linux with Mysql 5.0.

george:>mysql -u george -p
:enter password
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 27
Server version: 5.0.51a-3ubuntu5.4 (Ubuntu)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> create database biosql_development;
Query OK, 1 row affected (0.02 sec)


I have created a database called biosql_development. Why am i not using migrations? The reason is that BioSQL has some agreed standards on table names and schema convection which are not compatible with rails database creation and table naming conventions. However Rails allows us to override these default convections, when working with legacy databases, as will be our case.

After creating the database, load the BioSQL schema to the empty database. First we need to tell mysql which database to use.

mysql> use biosql_development;

then load the schema

mysql> source /home/george/Desktop/downloadsfolder/biosql-1.0.1/sql/biosqldb-mysql.sql;
Query OK, 0 rows affected, 1 warning (0.48 sec)

Query OK, 0 rows affected (0.15 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected, 1 warning (0.01 sec)

Query OK, 0 rows affected (0.03 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected, 1 warning (0.01 sec)

 ........ trucated

Now we need to create a rails application and connect to this database.

I use the Netbeans IDE development environment for creating ruby and rails applications. Go ahead and create a rails application and specify to use mysql as the database adapter.

To connect to our legacy database, we need to override some convections. First disable table plurulization, and tell rails that the table primary name is named as tablename_id as opposed to just the id column expected by rails. To do that

Create a new file in your application configurations/initializers directory called override_rails.rb (you can call it whatever).

 class ActiveRecord::Base
  self.pluralize_table_names = false

  self.primary_key_prefix_type = :table_name_with_underscore

The two lines above tells ActiveRecord not to expect the table names to be plural and that the primary key for each table is named as tablename_id format.

Also create another one called external_libraries.rb in the initializers directory, as you can tell this is where I want to put my require statements for loading external libraries.

require 'rubygems'

#load the bioinformatics library
require 'bio'

#load the biographics library
require 'bio-graphics'

#load the sql views extension library
gem 'rails_sql_views'
require 'rails_sql_views'

This file loads our gems. The rails_sql_views gem allows us to create views and access them by creating models corresponding to the views.

At this point if you run rake db:schema:dump, we will have a rails based BioSQL schema and which we can conveniently use to create a BioSQL database on any Relational database that rails supports and this includes Microsoft SQL server, DB2, Oracle, SQLlite and a host of others. All that would be required is to change the database.yml file to suit the adapter of choice and then execute rake db:schema:load to load the BioSQL schema.

Please note that if your are using rails 2.2.2,  you may want to comment the lines

unless Kernel.respond_to?(:gem)
  Kernel.send :alias_method, :gem, :require_gem

in rails_sql_views(0.6.1), otherwise running db:schema:dump will cause rake to abort.

In the next part I will describe how to create the necessary resources for our RESTful(Representational State Transfer) bioinformatics web application and rendering of the graphics.


What’s new in Bioruby edge

Changes to Bio::Blast

Naohisa Goto has announced changes to the Bio::Blast.reports to support default -m 0 and tabular -m 8
formats in addition to XML (-m 7) form. I think this is really nice and convenient!

Previously it meant that for bioruby to parse a Blast file, you had to have your blast results in XML output which Bio::Blast::Reports would understand. However, by default Blast gives an  -m0 output and without that prior knowledge you may spend hours wondering what is wrong when parsing default blast output files. With Bioruby 1.2.1 this will not work

require 'rubygems'
require 'bio'

report_file = "/home/george/esther_blast_files/blast_output2.txt"
Bio::Blast.reports(report_file) do |report|
  puts report.class

Unless blast_output2.txt is in XML format

In the upcoming Bioruby 1.3 the default Blast file can now be parsed, for example,

require 'rubygems'
require 'bio'
report_file = "/home/george/esther_blast_files/blast_output2.txt"
Bio::FlatFile.open(Bio::Blast::Default::Report,report_file) do |ff|
ff.each do |rep|
   puts rep.statistics
   rep.iterations.each do |itr|
      puts itr.hits.size

    itr.hits.each_with_index do |hit,i|
      puts hit.hit_id
      puts hit.len

Bio::Blast.remote now supports DDBJ in addition to Genomenet. It would be a nice idea to support NCBI as well.

Changes to Bio:sequence

It is possible to create  sequence objects from Bio::GenBank, Bio::EMBL, and Bio::FastaFormat by using the to_biosequence method

gb = Bio::GenBank.new(genbank_file.gb)

Bio::SQL Support

Thanks to Raoul and Naohisa, support for BioSQL has been rewritten by using  ActiveRecord.

#Make a connection

connection = Bio::SQL.establish_connection(path_to_database.yaml,'development')
#list databases

databases =Bio::SQL.list_databases

#retrieve a sequence
sample_seq = Bio::SQL.fetch_accession('some_accession_number')

#get number of seqeunces in the database
puts Bio::SQL.list_entries

#get references associated with an entry
puts sample_seq.references

#create an embl format
puts sample_seq.to_biosequence.output(:embl)

Changes to Bio::GFF2 and Bio::GFF3

GFF2/GFF3 formatted texts are now supported but there will be backward portability issues with bio 1.2.1 since some incompatible changes have been incorporated.  Bio::GFF::Record.comments has been renamed to comment and comments= is now comment=

Both Bio::GFF::GFF2::Record.new and Bio::GFF::GFF3::Records.new, can now take 9 arguments that correspond to GFF columns making it easy to create a Record object directly without need for  formatted text.

Both Bio::GFF::GFF2::Record#attributes and Bio::GFF::GFF3::Record#attributes have been changed to return a nested array containing tag, value pairs, to obtain a  hash, use the to_hash method

To support data output for GFF2/GFF3, new methods have been added:  Bio::GFF::GFF2#to_s, Bio::GFF::GFF3#to_s, Bio::GFF::GFF2::Record#to_s,and Bio::GFF::GFF3::Record#to_s

Lots of other changes have been incorporated for the GFF classes and you can view the change log at github

CodeML parser

A wrapper for PAML codeml program that is used for estimating evolutinary rate has been added.  The class provides methods  for generating the necessary configuration file. The new Bio::PAML::Codeml::Report and PAML::Codeml::Rates  provides simpel classes  for accessing the codeml report and rates file.  This example is from the example given in the source code

require 'bio'
# Reads multi-fasta formatted file and gets a Bio::Alignment object.
     alignment = Bio::FlatFile.open(Bio::Alignment::MultiFastaFormat, 'example.fst').alignment
     # Reads newick tree from a file
     tree = Bio::FlatFile.open(Bio::Newick, 'example.tree').tree
  # Creates a Codeml object
  codeml = Bio::PAML::Codeml.new
     # Sets parameters
     codeml.parameters[:runmode] = 0
     codeml.parameters[:RateAncestor] = 1
     # You can also set many parameters at a time.
   codeml.parameters.update({ :alpha => 0.5, :fix_alpha => 0 })
     # Executes codeml with the alignment and the tree
     report = codeml.query(alignment, tree)

Lots of Bugs  have been fixed and also support for Ruby 1.9 has been added. Its great thanks to the bioruby developers for their time and the excellent new changes!

Bioruby mini-series: The Bio::Sequence::Common class

Sequence Transformation

Lets have a look at the Bio::Sequence::Common class module which provides us with most of the sequence transformation methods for biological sequences.


Implements methods which are common to both Bio::Sequence::AA and Bio::Sequence::NA, for example

A Bio::Sequence object is easily created like this;

require ‘bio’

my_dna = Bio::Sequence.auto("actagatatttgat") #=> actagatatttgat

my_dna is now a Bio::sequence object and you can use the various methods available for this class, which we are going to explore shortly.

Bio::Sequence::Common Non Modifying methods

  • to_s

    This method returns a sequence as a string. It does not modify the original sequence.

    puts my_dna.to_s #=> actagatatttgat

    puts my_dna.to_s.class #=> String

    An alias for this method is the to_str method.

    #=> actagatatttgat

  • seq

    This method will return a new Bio::Sequence::NA or Bio::Sequence::AA object. The original sequence remains unchanged. For example if you wished to assign a new instance of my_dna object that we created above ,such that you have a my_dna2 object, you would create that as follows,

    my_dna2 = my_dna.seq

    puts my_dna2 #=> actagatatttgat

    puts my_dna2.class #=>

Bio::Sequence::Common modifying methods

  • Normalize!

    This method removes all the white space and transforms all positions to uppercase if the sequence is an amino acid (AA) or transforms all positions to lowercase if the sequence is a nucleic acid (NA) sequence, leaving the original sequence modified

    For example

    test_seq = Bio::Sequence::NA.new(“ACTG”)

    puts test_seq.normalize! #=>

  • Concatenating

    Many times we want to append a new sequence or a set of bases/residues eg a poly A sequence to the end of a new sequence and modify the original sequence. This is achieved by the concat method.
    It is also referred to as << method.

    test_seq = Bio::Sequence::NA.new(“actg”)

    test_seq << “acagat”

    test_seq concat “acagat”

    puts test_seq #=>

Note that to create a new sequence that adds to an existing sequence without altering the original sequence you would use the + operator. It accepts a variable number of arguments. For example

test_seq = Bio::Sequence::NA.new(“actg”)

test_seq2 = test_seq + (“cttcccttttt” “tatatata”)

puts test_seq2 #=>

puts test_seq #=>actg

Working with subsequences

Please note that biological sequence numbering convections are one based as opposed to ruby’s zero based. Biological coordinate’s convection for BioSQL and Chado is zero based.

  • Subseq

    This method returns a new sequence containing the subsequence identified by the start and end values given as parameters. This method works in a similar way to the slice string method. For example

    my_seq = Bio::Sequence::NA.new(“agggatttc”)

    puts my_seq.subseq(2,5) #=>

    The first argument denotes the start and the second argument denotes the end of the subsequence. Both arguments must be positive integers

    When this method is used without arguments, the start defaults to 1 and the end defaults to the last element of the string. Therefore when subseq is called without any arguments, it returns a new sequence similar to the original sequence.

    puts my_seq.subseq #=> agggatttc

  • window_search

    This method is typically used with a block. The method is called if you wanted to step through a sequence given a length of a subsequence. Therefore the method accepts two arguments. Step_size which defines the size of your ‘steps’ and the window_size which defines the length of the stepping subsequence. Any remaining sequence at the terminal end will be returned. The default step size is one since its an optional argument.

    For example

    To print the average GC% on each 100bp you can write,

    s.window_search(100) do |subseq|

    puts subseq.gc