Most often in sequence analysis we want to compare how similar two sequences are. How can we quantify similarity by using a metric? That was my question yesterday and I went hunting for a ruby implementation for such metrics. Luckily I got a library called amatch which is an approximate string matching extension for ruby! amatch implements the following metrics:
Hamming distance, Levenshtein edit distance,longest subsequence common to two strings,longest substring common to two strings,sellers distance and pair distance which is based on the number of adjacent character pairs, that are contained in two strings.
This is the number of characters that are different between two strings. This is not recommended for the majority of string based information retrieval. Very similar strings can sometimes be given high hamming distances.
Leveshtein edit distance
Is defined as the minimal costs involved in transforming one string into another by using deletion, insertion and substitution of a character to one of the strings. The algorithm can associate a cost for performing each of the operations and for this metric it is usually 1.
Longest common substring
This is define as the contiguous chain of characters that exists in both strings. The longer the substring the better the match between the two strings. The problem with this approach is that if a difference was introduced in the middle of one string, the distance will be longer that if the same difference was introduced at the beginning of one of the strings.
Longest common Subsequence
The longer the common sub sequence is, the more similar the two strings will be. In this case a sub sequence does not have to be contiguous.
Look at the documentation for more explanations of the metrics and algorithms.
To use the library you need to first install the gem. I installed it on my Linux box running Ubuntu and ruby 1.8.6.
sudo gem install amatch
Then in script,
require 'rubygems' require 'amatch'
include Amatch require 'bio'
#with bioruby it would be easy to compare two sequence entries for example
seq_obj1 = Bio::Sequence.auto("actagatatttgat") seq_obj2 = Bio::Sequence.auto("gccagatagttaat") #calculate the hamming distance m = Hamming.new(seq_obj1.to_seq) m.match(seq_obj2.to_seq) #=> #calculate pair-distances between the two sequences pair_distance_obj = PairDistance.new(seq_obj1.seq) pair_distance_obj.match(seq_obj2.seq) #=> # note that you can just substitute the strings directly to the metric object creation method without creating the sequence objects!
Note that amatch failed to install on windows XP with the following error
Building native extensions. This could take a while…
ERROR: Error installing amatch:
ERROR: Failed to build gem native extension.
C:/ruby-1.8.6/ruby/bin/ruby.exe extconf.rb install amatch
‘nmake’ is not recognized as an internal or external command,
operable program or batch file.
Although i have nmake installed on my windows machine. I will look at that later.
Happy string matching!