BioRuby mini-series: The Bio::Sequence::Common module

Sequence Transformation

Let's have a look at the Bio::Sequence::Common module, which provides most of the sequence transformation methods for biological sequences.

Bio::Sequence::Common

This module implements the methods that are common to both Bio::Sequence::AA and Bio::Sequence::NA.

A Bio::Sequence object is easily created like this:

require 'bio'

my_dna = Bio::Sequence.auto("actagatatttgat") #=> actagatatttgat

my_dna is now a Bio::Sequence object, and you can use the various methods available for this class, which we are going to explore shortly.

Bio::Sequence::Common non-modifying methods

  • to_s

    This method returns a sequence as a string. It does not modify the original sequence.

    puts my_dna.to_s #=> actagatatttgat

    puts my_dna.to_s.class #=> String

    An alias for this method is the to_str method.

    my_dna.to_str
    #=> actagatatttgat

  • seq

    This method returns a new Bio::Sequence::NA or Bio::Sequence::AA object. The original sequence remains unchanged. For example, to get such an object from the my_dna object created above and assign it to my_dna2, you would write:

    my_dna2 = my_dna.seq

    puts my_dna2 #=> actagatatttgat

    puts my_dna2.class #=> Bio::Sequence::NA
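
The difference between to_s and seq can be illustrated without BioRuby at all. The following is only a sketch in plain Ruby: ToySeq is a hypothetical String subclass standing in for Bio::Sequence::NA, and its seq method is an assumption about the general idea (build a fresh object of the same class), not a copy of the library's code.

```ruby
# ToySeq is a hypothetical stand-in for Bio::Sequence::NA, not a BioRuby class.
class ToySeq < String
  # Return a fresh object of the same class holding the same characters.
  def seq
    self.class.new(self)
  end
end

original = ToySeq.new("actagatatttgat")
copy     = original.seq   # a new ToySeq with the same content
plain    = original.to_s  # a plain String with the same content

copy << "aaaa"            # changing the copy leaves the original alone

puts original     # actagatatttgat
puts copy         # actagatatttgataaaa
puts copy.class   # ToySeq
puts plain.class  # String
```

Neither call touches the receiver, which is why both are listed as non-modifying.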

Bio::Sequence::Common modifying methods

  • normalize!

    This method removes all whitespace and converts every position to uppercase if the sequence is an amino acid (AA) sequence, or to lowercase if it is a nucleic acid (NA) sequence. The original sequence is modified in place.

    For example:

    test_seq = Bio::Sequence::NA.new("ACTG")

    puts test_seq.normalize! #=> actg

  • concat

    We often want to append another sequence or a run of bases/residues, e.g. a poly-A tail, to the end of an existing sequence, modifying the original. This is done with the concat method, which is also available as the << operator.

    test_seq = Bio::Sequence::NA.new("actg")

    test_seq << "acagat" # same as test_seq.concat("acagat")

    puts test_seq #=> actgacagat

Note that to create a new sequence by adding to an existing one without altering the original, you would use the + operator instead. In the example below, Ruby joins the two adjacent string literals into a single string before + is applied. For example:

test_seq = Bio::Sequence::NA.new("actg")

test_seq2 = test_seq + ("cttcccttttt" "tatatata")

puts test_seq2 #=> actgcttcccttttttatatata

puts test_seq #=> actg
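
The same in-place versus new-object distinction exists for ordinary Ruby strings, which is a convenient way to check your intuition without loading BioRuby. This is a minimal sketch that uses plain String objects as stand-ins for sequences; the variable names are made up for illustration.

```ruby
# Plain Ruby Strings stand in for sequence objects here.
seq = String.new("actg")
id_before = seq.object_id

seq << "acagat"                  # in-place append: still the same object
appended_in_place = (seq.object_id == id_before)

base   = String.new("actg")
joined = base + "cttcccttttt"    # builds a new object, base is untouched

puts seq                # actgacagat
puts appended_in_place  # true
puts base               # actg
puts joined             # actgcttcccttttt
```

Use << (or concat) when you want to grow the sequence you already hold, and + when the original must stay as it is.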

Working with subsequences

Please note that biological sequence numbering conventions are one-based, as opposed to Ruby's zero-based indexing. The coordinate convention for BioSQL and Chado, however, is zero-based.
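
To make the one-based convention concrete, here is a small sketch in plain Ruby. The helper one_based_slice is hypothetical (it is not part of BioRuby); it simply shows how one-based, inclusive start and end positions map onto Ruby's zero-based indexing.

```ruby
# one_based_slice is a hypothetical helper, not a BioRuby method.
# It takes one-based, inclusive start/end positions and maps them
# onto Ruby's zero-based indexing.
def one_based_slice(str, first, last)
  raise ArgumentError, "positions must be positive" unless first > 0 && last > 0
  str[(first - 1)..(last - 1)]
end

dna = "agggatttc"

puts one_based_slice(dna, 2, 5)  # ggga  (biological positions 2 to 5)
puts dna[2, 5]                   # ggatt (Ruby: start at index 2, take 5 characters)
```

The two calls take the same numbers but return different stretches of the sequence, which is exactly the off-by-one trap to watch for when mixing conventions.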

  • subseq

    This method returns a new sequence containing the subsequence identified by the start and end positions given as parameters. It works much like the String slice method, except that positions are one-based and the end position is inclusive. For example:

    my_seq = Bio::Sequence::NA.new("agggatttc")

    puts my_seq.subseq(2,5) #=> ggga

    The first argument denotes the start and the second argument denotes the end of the subsequence. Both arguments must be positive integers.

    When this method is used without arguments, the start defaults to 1 and the end defaults to the last position of the sequence. Therefore, when subseq is called without any arguments, it returns a new sequence with the same content as the original.

    puts my_seq.subseq #=> agggatttc

  • window_search

    This method is typically used with a block and lets you step through a sequence one window at a time. It accepts two arguments: window_size, which defines the length of each subsequence passed to the block, and an optional step_size, which defines how far the window moves at each step and defaults to 1. Any sequence left over at the terminal end is returned.

    For example, to print the GC percentage of each 100 bp window of a nucleic acid sequence s, you can write:

    s.window_search(100) do |subseq|
      puts subseq.gc_percent
    end
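
The stepping behaviour is easier to see on a short sequence. The sketch below is written in plain Ruby and does not use BioRuby: each_window and gc_percent_of are hypothetical helpers that imitate the idea of a sliding window with a leftover tail, under the assumption that the tail is whatever follows the last full window.

```ruby
# each_window is a hypothetical helper, not BioRuby's window_search.
# It yields every full window and returns whatever follows the last one.
def each_window(str, window_size, step_size = 1)
  consumed = 0
  0.step(str.length - window_size, step_size) do |start|
    yield str[start, window_size]
    consumed = start + window_size
  end
  str[consumed..-1]
end

# Hypothetical helper: percentage of g/c characters in a lowercase string.
def gc_percent_of(str)
  100.0 * str.count("gc") / str.length
end

dna = "atgcgcattag"
windows = []
rest = each_window(dna, 4, 3) { |w| windows << w }

puts windows.inspect                                  # ["atgc", "cgca", "atta"]
puts windows.map { |w| gc_percent_of(w) }.inspect     # [50.0, 75.0, 0.0]
puts rest                                             # g
```

With a window of 4 and a step of 3, consecutive windows overlap by one base; setting the step equal to the window size would give non-overlapping windows instead.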