
Keep track of Bioruby plugins

Biogems.info is a new site for keeping track of new and existing Bioruby plugins. Plugins are separate code libraries that split functionality out of the Bioruby main tree. The idea is to have a core Bioruby release and to allow Ruby developers to contribute to Bioruby through plugins. According to Bonnal, the maintainer of biogem (the bio-plugin crafting tool), plugins are separately maintained and may be experimental or a work in progress.

To read more about the Bioruby plugin system, please refer to the wiki page on plugins.

Happy biology!

Bioruby 1.4.2 released!

The Bioruby development team has continued to work tirelessly to bring us the latest release of the Ruby bioinformatics library commonly referred to as bioruby. A list of all the changes is available here. Particularly good news for beginners is that the Bioruby tutorial has been updated, thanks to Michael O’Keefe and Pjotr Prins. This is largely a bug-fix release, with web services updated from SOAP to REST interfaces. Upgrading to the latest release is easy:
gem update bio
or
gem install bio

Happy biology!

Processing netMHCII-pan prediction output

Like most high-throughput informatics methods, epitope prediction generates a lot of output, and in a format that is not very friendly to subsequent analysis. I considered writing a parser for the output in Ruby, but would that not take long? Instead, I added a simple vim function to my .vimrc file that formats the output with a quick key mapping. It worked like magic and saved time.

" formatting output from the netMHCII-pan program
function! FormatNetmhcOutput()
   " delete comment lines, separator lines and lines starting with Protein
   g/^\#/norm dd
   g/^--/norm dd
   g/^Protein/norm dd
   " left-align everything, then delete the column header lines
   %le
   g/^pos/norm dd
   " drop the weak binder (WB) and strong binder (SB) flags and trailing whitespace
   %s/<=\sWB//ge
   %s/<=\sSB//ge
   %s/\s\+$//e
   " turn runs of whitespace into commas and delete empty lines
   %s/\s\+/,/g
   g/^$/d
endfunction
nmap   ;h  :call FormatNetmhcOutput()<CR>

This function can be called by pressing ; followed by h in normal mode. It removes comment and header lines and produces CSV output that can be read with a single R statement.

data <- read.csv("file.csv", header = FALSE)  # the column header line was deleted by the function
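
For comparison, the Ruby parser I decided not to write would not have been very long either. Here is a minimal sketch that mirrors the steps of the vim function above, one rule per line. The method name format_netmhc_output is made up for this post, and the sample text in the usage note is a hypothetical stand-in, not real netMHCII-pan output.

```ruby
# Rough Ruby equivalent of the vim function: returns an array of CSV lines.
def format_netmhc_output(text)
  rows = []
  text.each_line do |raw|
    line = raw.chomp
    # comment lines, separator lines and lines starting with Protein
    next if line.start_with?("#", "--", "Protein")
    line = line.lstrip                       # same effect as :%le
    next if line.start_with?("pos")          # column header line
    line = line.gsub(/<=\s[WS]B/, "").rstrip # binder flags, trailing whitespace
    next if line.empty?
    rows << line.split(/\s+/).join(",")      # whitespace runs become commas
  end
  rows
end
```

Something like puts format_netmhc_output(File.read("prediction.out")) would then print the CSV lines, where prediction.out is a placeholder file name.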


Convert a fastA file to a hash

Sometimes you might want to convert a file of FASTA sequences to a hash. Here is a one-line method that might come in handy for that.

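require 'bio'

# The path is passed in as an argument: a local variable defined
# outside the method is not visible inside it.
def fasta_to_hash(file_path)
  Bio::FlatFile.auto(file_path) { |f| f.map { |entry| { entry.definition.to_sym => [entry.seq.to_s] } } }
end

fasta_to_hash("example.fasta")
 #=>[{:"seq1"=>["gatataggagatatcgttagag"]}]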

The result is an array of hashes, one per sequence. Each hash key corresponds to the sequence name, and each value is a one-element array holding the sequence string.
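
If a single hash keyed by sequence name is more convenient than an array of one-entry hashes, the same idea can be sketched in plain Ruby. Note that this is not the Bioruby reader: it is a small hand-rolled parser, the method name parse_fasta is made up for illustration, and it assumes well-formed input where every record starts with a > line.

```ruby
# Parse FASTA text into one hash: { "name" => "sequence", ... }.
# Plain Ruby, no Bioruby; assumes each record starts with a ">" line.
def parse_fasta(text)
  sequences = {}
  current = nil
  text.each_line do |line|
    line = line.strip
    next if line.empty?
    if line.start_with?(">")
      current = line[1..-1].strip      # definition line without the ">"
      sequences[current] = String.new  # fresh, unfrozen string to append to
    elsif current
      sequences[current] << line       # join wrapped sequence lines
    end
  end
  sequences
end
```

Calling parse_fasta(File.read("example.fasta")) would then give one entry per sequence. If two records share a name, the later one overwrites the earlier one.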

Translating a nucleotide sequence in six frames with bioruby

Bioruby offers a simple way to translate nucleotide sequences.

require 'bio'
seq = Bio::Sequence::NA.new("acctatagctctagcta")
seq.translate

We know that there are six possible reading frames for any given nucleotide sequence. Generally the longest open reading frame is taken to be the correct frame when we have no information about the protein encoded by a given gene. By default the translate method translates in the first frame, but it takes an argument that selects the translation frame:

seq.translate(2) #translate using the second reading frame.

Given a long list of sequences, how do we quickly determine the correct reading frame? We would want a method that translates a given sequence in all frames and picks the longest reading frame. Assuming that the correct reading frame has no stop codons, we can write a quick method to perform the six-frame translation.

def longest_reading_frame(sequence)
  orfs = [] # a container for ORFs (open reading frames)
  # translate the sequence in all 6 frames
  6.times do |frame|
    translated = Bio::Sequence::NA.new(sequence).translate(frame + 1)
    stop_codons = translated.scan(/\*/).size
    orfs << translated if stop_codons == 0
  end
  orfs[0]
end

This method uses an array to collect all translated sequences that contain no stop codons and returns the first sequence in the array (or nil when every frame contains a stop codon). This might not scale very well for very long sequences, but that will be a post for another day!
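
To make the six frames concrete without Bioruby, here is a rough plain-Ruby sketch. It is not how Bioruby implements translate: the codon table is the standard genetic code written out by hand, the method names are made up for illustration, and unknown codons are assumed to translate to X. It also uses a slightly different criterion from the method above: it returns the longest stretch without a stop codon across all six frames, and it does not require a start codon.

```ruby
BASES = %w[T C A G].freeze
# Standard genetic code, listed in TCAG order for each codon position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG".freeze
CODON_TABLE = BASES.product(BASES, BASES).map(&:join).zip(AMINO_ACIDS.chars).to_h.freeze

def reverse_complement(dna)
  dna.reverse.tr("ACGT", "TGCA")
end

# Translate one strand starting at offset 0, 1 or 2; incomplete codons are dropped.
def translate_frame(dna, offset)
  (dna[offset..-1] || "").scan(/.../).map { |codon| CODON_TABLE.fetch(codon, "X") }.join
end

# Frames 1-3 on the given strand, then frames 4-6 on its reverse complement.
def six_frame_translations(sequence)
  dna = sequence.upcase
  [dna, reverse_complement(dna)].flat_map do |strand|
    (0..2).map { |offset| translate_frame(strand, offset) }
  end
end

# Longest stop-free stretch found in any of the six frames.
def longest_stop_free_stretch(sequence)
  six_frame_translations(sequence)
    .flat_map { |protein| protein.split("*") }
    .max_by(&:length)
end
```

For example, six_frame_translations("atggcctaa") gives the six translated frames as an array of strings, and longest_stop_free_stretch picks the longest piece between stop codons from all of them.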

Happy Biology!

Converting sequence data from csv to fasta format

Many times I find someone storing sequence data in Excel workbooks (insert scream here). This is usually followed by a request that goes like this:

Someone: “I will send you some sequences and then we can perform xyz analysis, please?”

Me: “Are they in fasta format?”

Someone: “No, they are in Excel.”

Me: (suppressing a laugh) “OK, do you mind converting them to Fasta, and then we can do xyz?”

Someone: (with a wiggle on the face) “How do I do that? Is there a Windows program to do that?”

Me: (feeling superman-ish) “Eeh, we can create a quick script in Perl or Ruby, I prefer Ruby … but you should learn some basic Perl or Ruby … and run away from Windows. :)”

Me: “Save your data as CSV (File -> Save As -> csv), then send me that file.”

So here is a very simple script that reads a csv file and creates a fasta file using Ruby.

You need to specify the path to the input csv file and the output fasta file, the column number that contains the name of the sequence and the column number that contains the sequence data in the csv file.

require 'csv'

# read a csv file and create a fasta file
def csv_to_fasta(csv_file, output_file, name_col, seq_col)
  File.open(output_file, 'w') do |file|
    count = 0
    CSV.foreach(csv_file) do |row|
      sequence_id = row[name_col]
      seq = row[seq_col]

      count += 1
      puts sequence_id
      # one header line (no trailing space) followed by the sequence
      file.puts ">#{sequence_id}\n#{seq}"
    end
    puts "#{count} sequences processed"
  end
end

csv_file    = "#{ENV['HOME']}/path_to_csv_file.csv"
fasta_file  = "#{ENV['HOME']}/path_to_fasta_file.fasta"

seq_name_col = 0 # assumes the first column contains the names
seq_data_col = 1 # second column contains the seq data

csv_to_fasta(csv_file, fasta_file, seq_name_col, seq_data_col)
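
Spreadsheet exports are rarely tidy, so here is a slightly more defensive sketch of the same conversion. It works on CSV text rather than files, skips rows where the name or the sequence is missing, strips whitespace inside sequences and wraps them at a fixed width. The method name csv_rows_to_fasta and the default width of 60 are my own choices for illustration, not a requirement of the format.

```ruby
require 'csv'

# Convert CSV text to FASTA text, skipping incomplete rows.
def csv_rows_to_fasta(csv_text, name_col: 0, seq_col: 1, width: 60)
  records = CSV.parse(csv_text).map do |row|
    name = row[name_col].to_s.strip
    seq  = row[seq_col].to_s.gsub(/\s+/, "")  # drop stray spaces and newlines
    next if name.empty? || seq.empty?         # skip blank or half-filled rows
    ">#{name}\n" + seq.scan(/.{1,#{width}}/).join("\n")
  end
  records.compact.join("\n") + "\n"
end
```

Writing the result to disk is then a one-liner such as File.write(fasta_file, csv_rows_to_fasta(File.read(csv_file))).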

Happy biology!