Approximate string matching metrics with amatch


In sequence analysis we often want to compare how similar two sequences are. How can we quantify that similarity with a metric? That was my question yesterday, and I went hunting for a Ruby implementation of such metrics. Luckily I found a library called amatch, an approximate string matching extension for Ruby! amatch implements the following metrics:

Hamming distance, Levenshtein edit distance, the longest subsequence common to two strings, the longest substring common to two strings, Sellers distance, and pair distance, which is based on the number of adjacent character pairs that are contained in both strings.
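
To make the idea of adjacent character pairs concrete, here is a rough pure-Ruby sketch. It assumes the score is a Dice-style coefficient (twice the number of shared pairs divided by the total number of pairs); the helper names pairs and pair_similarity are made up for illustration, and amatch's own PairDistance may tokenize or weight things differently.

```ruby
# Hypothetical helpers for illustration only; not amatch's implementation.

# Split a string into its adjacent character pairs, e.g. "abc" -> ["ab", "bc"].
def pairs(s)
  s.chars.each_cons(2).map(&:join)
end

# Assumed scoring: 2 * shared pairs / total pairs (Dice-style coefficient).
def pair_similarity(a, b)
  pa = pairs(a)
  pb = pairs(b)
  total = pa.length + pb.length
  return 0.0 if total.zero?

  remaining = pb.dup
  # Count each pair of a that can still be matched against an unused pair of b.
  shared = pa.count do |pair|
    idx = remaining.index(pair)
    idx && remaining.delete_at(idx)
  end
  2.0 * shared / total
end

puts pair_similarity("night", "nacht") # only "ht" is shared: 2 * 1 / 8 = 0.25
```

A score of 1.0 means every pair is shared, and 0.0 means the strings have no adjacent pair in common.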

Hamming distance

This is the number of positions at which the corresponding characters of two strings differ. It is not recommended for the majority of string-based information retrieval, because very similar strings can sometimes be given high Hamming distances.
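
As a minimal sketch of the idea (a hypothetical helper, not amatch's implementation, and assuming both strings have the same length), the count of mismatched positions can be written in plain Ruby:

```ruby
# Hypothetical helper for illustration; assumes equal-length strings.
def hamming_distance(a, b)
  raise ArgumentError, "strings must have equal length" unless a.length == b.length

  # Pair up characters position by position and count the mismatches.
  a.chars.zip(b.chars).count { |x, y| x != y }
end

puts hamming_distance("actagatatttgat", "gccagatagttaat") # 4 positions differ
puts hamming_distance("abcdef", "bcdefa")                 # 6: every position differs
```

The second call shows the weakness mentioned above: the two strings differ only by a one-character rotation, yet every position mismatches.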

Levenshtein edit distance

This is defined as the minimal cost of transforming one string into the other using deletions, insertions, and substitutions of single characters. The algorithm can associate a cost with each of the operations; for this metric each cost is usually 1.
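
A small pure-Ruby sketch of the usual dynamic-programming approach, with every operation costing 1, might look like this. The function name is made up for illustration, and this is not how amatch (a C extension) implements it.

```ruby
# Hypothetical helper for illustration; unit cost for each operation.
def levenshtein_distance(a, b)
  # prev[j] holds the distance between the processed prefix of a and b[0, j].
  prev = (0..b.length).to_a

  a.each_char.with_index(1) do |ca, i|
    curr = [i] # turning a[0, i] into the empty string takes i deletions
    b.each_char.with_index(1) do |cb, j|
      substitution = prev[j - 1] + (ca == cb ? 0 : 1)
      deletion     = prev[j] + 1
      insertion    = curr[j - 1] + 1
      curr << [substitution, deletion, insertion].min
    end
    prev = curr
  end

  prev.last
end

puts levenshtein_distance("kitten", "sitting") # 3
```

Only two rows of the table are kept at a time, so memory use grows with the length of the second string rather than with the product of both lengths.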

Longest common substring

This is defined as the longest contiguous chain of characters that exists in both strings. The longer the substring, the better the match between the two strings. The problem with this approach is that a difference introduced in the middle of one string shortens the common substring, and so worsens the match, more than the same difference introduced at the beginning of the string.
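
The sensitivity to where a difference occurs can be seen with a short pure-Ruby sketch. The helper name is hypothetical and this is a plain dynamic-programming version for illustration, not amatch's code.

```ruby
# Hypothetical helper for illustration: returns the longest contiguous
# run of characters shared by both strings.
def longest_common_substring(a, b)
  best_len = 0
  best_end = 0 # index in a just past the end of the best run
  prev = Array.new(b.length + 1, 0)

  a.each_char.with_index(1) do |ca, i|
    curr = Array.new(b.length + 1, 0)
    b.each_char.with_index(1) do |cb, j|
      next unless ca == cb

      # Extend the run that ended at the previous character of both strings.
      curr[j] = prev[j - 1] + 1
      if curr[j] > best_len
        best_len = curr[j]
        best_end = i
      end
    end
    prev = curr
  end

  a[best_end - best_len, best_len]
end

puts longest_common_substring("abcdefgh", "abcdXfgh") # "abcd" (change in the middle)
puts longest_common_substring("abcdefgh", "Xbcdefgh") # "bcdefgh" (change at the start)
```

The same single-character change leaves a common substring of length 4 when it falls in the middle, but of length 7 when it falls at the beginning.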

Longest common subsequence

Here a subsequence does not have to be contiguous: its characters only need to appear in the same order in both strings. The longer the common subsequence is, the more similar the two strings are.
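
For comparison with the substring case, here is a minimal pure-Ruby sketch that computes the length of the longest common subsequence. Again, the helper name is made up and this is the textbook dynamic-programming version, not amatch's implementation.

```ruby
# Hypothetical helper for illustration: length of the longest sequence of
# characters appearing in the same order (not necessarily adjacent) in both.
def lcs_length(a, b)
  prev = Array.new(b.length + 1, 0)

  a.each_char do |ca|
    curr = [0]
    b.each_char.with_index(1) do |cb, j|
      curr << if ca == cb
                prev[j - 1] + 1             # extend the common subsequence
              else
                [prev[j], curr[j - 1]].max  # skip a character from a or from b
              end
    end
    prev = curr
  end

  prev.last
end

puts lcs_length("abcdefgh", "abcdXfgh") # 7: "abcd" + "fgh"
puts lcs_length("AGGTAB", "GXTXAYB")    # 4: "GTAB"
```

Because gaps are allowed, a single change in the middle of a string costs only one character here, wherever it occurs.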

Look at the documentation for more explanations of the metrics and algorithms.

To use the library you first need to install the gem. I installed it on my Linux box running Ubuntu and Ruby 1.8.6.

sudo gem install amatch

Then, in a script:

require 'rubygems'
require 'amatch'
require 'bio'
include Amatch

# With BioRuby it is easy to compare two sequence entries, for example:
seq_obj1 = Bio::Sequence.auto("actagatatttgat")
seq_obj2 = Bio::Sequence.auto("gccagatagttaat")

# Calculate the Hamming distance (use #seq to get the underlying sequence string)
m = Hamming.new(seq_obj1.seq)
m.match(seq_obj2.seq)
#=>

# Calculate the pair distance between the two sequences
pair_distance_obj = PairDistance.new(seq_obj1.seq)
pair_distance_obj.match(seq_obj2.seq)
#=>

# Note that you can pass plain strings directly to the metric constructors
# without creating the sequence objects!

Note that amatch failed to install on Windows XP with the following error:

Building native extensions.  This could take a while…
ERROR:  Error installing amatch:
ERROR: Failed to build gem native extension.

C:/ruby-1.8.6/ruby/bin/ruby.exe extconf.rb install amatch
creating Makefile

nmake
'nmake' is not recognized as an internal or external command,
operable program or batch file.

This happened even though I have nmake installed on my Windows machine. I will look into that later.

Happy string matching!



8 comments

  1. Rob Syme

    Just what I needed, thanks for the tip!

    I don’t think that Bio::Sequence objects have a ‘to_seq’ method, but they do have a ‘seq’ method :)
    -r

  2. John McLeod

    Hello,
    Just a question, how could I add ‘amatch’ to a standard find method?
    Thanks for any help.
    John

  3. Marwa

    I have been asked to implement a simple spell checker
    using both algorithms (longest common subsequence and edit distance).
    So my question is how I can combine the two to get
    an optimized solution.

  4. leo

    I have Windows XP, and I just installed the GnuWin32 make app, but I was not able to install the gem. This is what I got:
    F:\Program Files\Ruby191\bin>gem install amatch
    Building native extensions. This could take a while…
    ERROR: Error installing amatch:
    ERROR: Failed to build gem native extension.

    “F:/Program Files/Ruby191/bin/ruby.exe” extconf.rb
    creating Makefile

    make
    makefile:154: warning: overriding commands for target `F:/Program’
    makefile:148: warning: ignoring old commands for target `F:/Program’
    make: *** No rule to make target `”/F/Program’, needed by `amatch.o’. Stop.

    Gem files will remain installed in F:/Program Files/Ruby191/lib/ruby/gems/1.9.1/
    r inspection.
    Results logged to F:/Program Files/Ruby191/lib/ruby/gems/1.9.1/gems/amatch-0.2.5

    F:\Program Files\Ruby191\bin>

    any idea?

    • biorelated

      As I noted earlier, I was not able to install the gem on Windows using Ruby 1.8.6 at the time. Sorry, I never explored the cause of the issue, but I bet it has to do with the build and make programs needed to compile the source natively on Windows. For now I would urge you to try it on a Unix/Linux system. Windows and DOS are not good systems for this kind of stuff.
