Convert a fastA file to a hash

Sometimes you might want to convert a file of fastA sequences to a hash. Here is a one line method that might come handy for that.

require 'bio'
file_path = "example.fasta"

def fasta_to_hash{ |f| {|entry| Hash.[](entry.definition.to_sym,[entry.seq.to_s])}}


The result is an array of hashes. Each hash key corresponds to the sequence name

My general purpose bioinformatics toolbox

I spend most of my time writing code and using an range of bioinformatics analysis packages. Unlike in many other professions, sometimes  there are no best tools for accomplishing a bioinformatics task. The tools are continuously improved and the choice of tools is dependent on the research question and the biology of whatever you are investigating. However  I have come to rely on some general purpose resources that make me more productive. Let me introduce you to my general purpose spanner box.

Code Editing


I have finally found nirvana in MacVim, which is the preferred version of  Vim for Mac OS. It allows screen splitting, window resizing and integrates with the console, such that you can run system commands right in the editor. You have to install the necessary scripts or plugins to support what you want to do.

It has increased my productivity, although it has a slightly steep learning curve.  This is a tool I recommend to any bioinformatician, if you are not already using it!

Cost: Free

Source Control


Git Source control

I use git for source control. It is awesome and fits very well with my workflow. Git has powerful features and easy to use and work with. I like the idea of distributed source control and it makes it easier to work on different versions of the same project!

Cost: Free

Bibliography manager



I use Papers, which is a commercial tool but I would recommend it to anyone. It helps me sort,annotate and read research articles. I once used Mendeley, which is an awesome tool as well.

Cost: $42 (has academic discounts)




One of the best tools which we may forget is a tool is the terminal! Since I use Mac OS, I enjoy the best of both worlds, a powerful Unix command line support, excellent graphics and support for proprietary software if need arises.



Pilot V Ball RT Pen


I don’t keep an electronic notebook since I prefer jotting down my notes and having a Notebook. I use a liquid ink Pilot V ball RT pen. The pen has a retractable cone-tip liquid ink rollerball, rubber grip and metal pocket clip. It is airplane-safe and writes a 0.4mm line. 

Cost: $5



D66174 NoteBook

My preference for a notebook is an  A4  D66174 Notebook.  Each book has about 180 pages.  It comes with a protective handcover. This is an archive for my written thoughts, discussions and workflows.

Cost $90 for a pack of five books.

What is your general purpose  bioinformatics toolbox?

Translating a nucleotide sequence in six frames with bioruby

Bioruby offers a very easy and simple way to translate nucleotide sequences.


We know that there are six posible reading frames for any given nucleotide sequence. Generally the longests Open reading frame is taken to be the correct frame, when we do not have information about the possible protein that is encoded by a given gene. By default the translate method performs translation in the first frame but it can take an argument that defines the translation frame

seq.translate(2) #translate using the second reading frame.

Given a long list of sequences how do we quickly determine the correct reading frame. We would want to have method to translate a given  sequence in all frames and pick the longest reading frame. Assuming that the correct reading frame has no stop codons, we can write a quick method to perform  the six frame translation.

 def longest_reading_frame(sequence)
  orfs = [] #a container for orfs(open reading frames)
  #translate a sequence in all 6 frames
   6.times do |frame|
   translated = + 1)
   stop_codons = translated.scan(/\*/).size
    orfs << translated if stop_codons == 0

This method uses an array to collect all translated sequences that contain no stop codons and returns the first sequence in the array. This might not scale very well for very long sequences but that will be a post for another day!

Happy Biology!

Converting sequence data from csv to fasta format

Many  times I find someone storing sequence data in excel Workbooks.(insert scream here) This is usually followed by a request which goes like this,

Someone: ” I will send you some sequences and then we can perform xyz analysis please?”

Me: “Are they in fasta format?”

Someone: “No, they are in Excel ”

Me: (supressing a laugh) “Ok, do you mind to convert them to Fasta and then we can do xyz?”

Someone:(with a wiggle on the face)  “How do I do that?, Is there a windows  program to do that?”

Me: (feeling superman-nish) “eeh we can create a quick script in perl or Ruby, I prefer Ruby … but you should lean some basic perl or Ruby…. and run away from windows. :)”

Me: “Save your data as CSV(File ->Save As-> csv),  then send me that file”

So here is a very simple script that reads a csv file and creates a fasta file using Ruby.

You need to specify the path to the input csv file and the output fasta file, the column number that contains the name of the sequence and the column number that contains the sequence data in the csv file.

require 'csv'
# read a csv file and create a fasta file
def csv_to_fasta(csv_file,output_file,name_col,seq_col),'w') do |file|
  count = 0
  CSV.foreach(csv_file) do |row|
   sequence_id = row[name_col]
   seq = row[seq_col]

  count = count+1
  puts sequence_id
  file.puts ">#{sequence_id} \n#{seq}"
 puts "#{count} sequences processed"
csv_file    = "#{ENV['HOME']}/path_to_csv_file.csv"
fasta_file  = "#{ENV['HOME']}/path_to_fasta_file.fasta"

seq_name_col = 0 #assumes the first column contains the names
seq_data_col = 1 #second column contains the seq data


Happy biology!

My first Bioruby plugin calculates the isoelectric point of a protein

Late last year,  there was a lot of talk about creating a plugin system for Bioruby. The idea is that more people can start to develop bioinformatics libraries using the Ruby language and the libraries can leverage on the bioruby framework. Bioruby maintainers can then concentrate on yet to be defined “core” parts of the library to ensure compatibility and support for the plugins.Together with Pascal Bentz we have created a library to calculate the Isoelectric point of a protein given a Pka set and an  amino acid sequence of a peptide/protein. The project lay domant for a while at github until now! I am happy to release my first bioruby plugin, bio-isoelectric point! Download it at Fork it and check the usage at github


require 'bio'
require 'bio-isoelectric_point'
protein_seq ="KKGFTCGELA")

#what is the protein charge at ph 14?
charge = protein_seq.charge_at(14) #=>-2.999795857467562

#calculate the ph using dtaselect pka set and round off to 3 decimal places
isoelectric_point = protein_seq.isoelectric_point(‘dtaselect’, 3) #=>8.219

# calculate the isoelectric point pH with a custom set
custom_pka_set = { “N_TERMINUS” => 8.1,
“K” => 10.1,
“R” => 12.1,
“H” => 6.4,
“C_TERMINUS” => 3.15,
“D” => 4.34,
“E” => 4.33,
“C” => 8.33,
“Y” => 9.5
iep_ph = protein_seq.isoelectric_point(custom_pka_set, 3) #=> 8.193

This gem supports the following Pka sets, as well as allowing a user to provide a custom Pka set.

    * dta_select
    * emboss
    * rodwell
    * wikipedia
    * sillero

Happy biology!

What’s new in bioruby


  I scouted the bioruby git repository the other day to see what might be new in the current snapshot.   These are some of the notable changes:

Bug fixes;

Lots of bug fixes. For example the Bio::Fasta.remote bug has been fixed, workaround for Zlib error, fixed method names

Increased ruby 1.9 support

Renaming of files and modules

Some files and modules have been renamed for example Bio::Fastq:QualityScore has been renamed to Bio::Sequence::QualityScore

Better documentation

There is a samples folder that include sample usage of some classes and methods

PhyloXML support

Support for the phyloxml parser and writer has been included. A new version(1.10) of the PhyloXML schema has been added.

This contribution was provided by the awesome Latvian girl through a Google Summer of Code project and working for NESCent organization.

Meme and Mast support

Contributed by Adam Kraut. Minimal and basic support for the motif finding application Meme and Mast has been added.


Speed up of Bio::Tree.children

“For speed up of Bio::Tree#children and parent, internal cache of
the parent for each node is added. The cache is automatically
cleared when the tree is modified. Note that the cache can only
be accessed from inside Bio::Tree.
* Bio::Tree#parent is changed to directly raise IndexError when
both of the root specified in the argument and preset in the
tree are nil (previously, the same error is raised in the path
method which is internally called from the parent method).
* Bio::Tree#path is changed not to call bfs_shortest_path if the
node1 and node2 are adjacent.”
To build a gem based on the current snapshot,  make sure the following lines have been included in the bioruby.gemspecs file. The current(today’s) snapshot may have already fixed this by now. :)
Note that this is the breeding edge version and things are bound to break.
Thank you for the awesome work! 

Standalone BLAST with Ruby revisited

Earlier  I showed a very simple way to perform a BLAST  using Ruby. Today I would like to revisit that topic for two reasons.

  1. The “using ruby with blast” search term seems to be very common and actually one of the ways that people reach my blog.
  2. The original post was not very through.

BLAST aka Basic Local Alignment Tool is used to search a sequence (either DNA or protein) against a database of other sequences (either all nucleotide or all protein) in order to identify similar sequences. BLAST has many different flavors and can  search DNA against DNA or protein against protein and also can translate a nucleotide query and search it against a protein database  and vice versa. It can also compute a “profile” for the query sequence and use that for further searches as well as search the query against a database of profiles.

The BLAST tool is fundamental to molecular biologists and bioinformaticians. There are excellent books and tutorials on how to and when to use BLAST, so i will assume all you need is to automated your work and parse the results. The actual algorithm is implemented in C and freely  available from the NCBI website.The first thing  to do is to download the appropriate binaries for your platform. Instructions for setting up and installing BLAST

Once installed on your system  the primary method of interaction is using the command line. Use formatdb to create blast databases and blastall to search for sequence homology for a given sequence against a given blast database.

In Ruby, there are two ways you can call the BLAST program. First using the Bioruby library and second by writing your own ruby wrapper for the BLAST command line parameters and execution. Most often, one executes BLAST from the command line and then process the results file which is in either one of the many BLAST output formats. Bioruby is excellent  at parsing the results file. Using Bioruby with BLAST is  very straightforward:

#blasting the bioruby way #query_file: a list of query sequences in fasta format #database_path: a path to the actual BLAST formatted database #program: The BLAST program to call, either of blastp,blastn,tblastn e.t.c.
def bio_blast(program, database_path,query_file)

factory = Bio::Blast.local(program,database_path)
ff =, query_file)
ff.each do |entry|
report = factory.query(entry) # report will be a Blast::Report object
# iterate trough the hits
report.each do|hit|
puts hit.bit_score        # bit score (*)
puts hit.query_seq        # query sequence
puts hit.midline          # middle line string of alignment of homologous region (*)
puts hit.target_seq       # hit sequence
puts hit.evalue           # E-value
puts hit.identity         # % identity
puts hit.overlap          # length of overlapping region
puts hit.query_id         # identifier of query sequence
puts hit.query_def        # definition(comment line) of query sequence
puts hit.query_len        # length of query sequence
puts hit.target_id        # identifier of hit sequence
puts hit.target_def       # definition(comment line) of hit sequence
puts hit.target_len       # length of hit sequence
puts hit.query_start      # start position of homologous region in query sequence
puts hit.query_end        # end position of homologous region in query sequence
puts hit.target_start     # start position of homologous region in hit(target) sequence
puts hit.target_end       # end position of homologous region in hit(target) sequence
puts hit.lap_at           # array of above four numbers
hit.each do |hsp|
puts hsp.query_from