Converting sequence data from CSV to FASTA format


Many times I find someone storing sequence data in Excel workbooks (insert scream here). This is usually followed by a request that goes like this:

Someone: “I will send you some sequences and then we can perform xyz analysis, please?”

Me: “Are they in FASTA format?”

Someone: “No, they are in Excel.”

Me: (suppressing a laugh) “OK, do you mind converting them to FASTA, and then we can do xyz?”

Someone: (with a wiggle on the face) “How do I do that? Is there a Windows program to do that?”

Me: (feeling Superman-ish) “Eeh, we can create a quick script in Perl or Ruby, I prefer Ruby … but you should learn some basic Perl or Ruby … and run away from Windows. :)”

Me: “Save your data as CSV (File -> Save As -> CSV), then send me that file.”

So here is a very simple Ruby script that reads a CSV file and creates a FASTA file.

You need to specify the path to the input CSV file, the path to the output FASTA file, the column number that contains the sequence name and the column number that contains the sequence data. Column numbers start at 0, so the first column is 0 and the second is 1.
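To see what those column numbers refer to, here is a tiny sketch that parses a single CSV row. The row contents are made up for illustration; only `CSV.parse_line` from Ruby's standard csv library is used.

```ruby
require 'csv'

# Hypothetical example row, roughly as Excel might export it.
row = CSV.parse_line("seq1,ACGTACGT,some note")

name_col = 0 # first column: sequence name
seq_col  = 1 # second column: sequence data

puts row[name_col]
puts row[seq_col]
```

If your names or sequences sit in other columns, count from 0 starting at the leftmost column.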

require 'csv'

# Read a CSV file and write one FASTA record per row.
# name_col and seq_col are zero-based column numbers.
def csv_to_fasta(csv_file, output_file, name_col, seq_col)
  count = 0
  File.open(output_file, 'w') do |file|
    CSV.foreach(csv_file) do |row|
      sequence_id = row[name_col]
      seq = row[seq_col]

      count += 1
      puts sequence_id
      # header line (">name"), then the sequence on the next line
      file.puts ">#{sequence_id}\n#{seq}"
    end
  end
  puts "#{count} sequences processed"
end

csv_file   = "#{ENV['HOME']}/path_to_csv_file.csv"
fasta_file = "#{ENV['HOME']}/path_to_fasta_file.fasta"

seq_name_col = 0 # assumes the first column contains the names
seq_data_col = 1 # second column contains the sequence data

csv_to_fasta(csv_file, fasta_file, seq_name_col, seq_data_col)
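Spreadsheets exported from Excel often come with a header row, empty rows or stray spaces inside the sequences, none of which the script above handles. The following is only a sketch of one way to deal with that; the function name `csv_text_to_fasta` and the `skip_header` option are my own illustrative choices, and it works on a string so you can try it without any files.

```ruby
require 'csv'

# Convert CSV text to FASTA text (illustrative helper, not part of the script above).
# Rows with an empty name or an empty sequence are skipped.
def csv_text_to_fasta(csv_text, name_col, seq_col, skip_header: false)
  records = []
  CSV.parse(csv_text).each_with_index do |row, index|
    next if skip_header && index.zero?     # drop the header row if asked to
    name = row[name_col].to_s.strip
    seq  = row[seq_col].to_s.gsub(/\s+/, '') # remove whitespace inside the sequence
    next if name.empty? || seq.empty?
    records << ">#{name}\n#{seq}"
  end
  records.empty? ? "" : records.join("\n") + "\n"
end

# Made-up sample data: a header row, an empty row and a sequence containing a space.
sample = "name,sequence\nseq1,ACGT\n,\nseq2,GG TT\n"
fasta = csv_text_to_fasta(sample, 0, 1, skip_header: true)
puts fasta
```

To use it on a real file, you could pass `File.read(csv_file)` as the first argument and write the returned string with `File.write(fasta_file, fasta)`.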

Happy biology!

5 comments

  1. GS

    Hi! First, thank you very much for this script. I have been trying to run the script in Ruby, but I do not get any output. I am sorry, but I am an absolute newbie to Ruby, so I am still figuring out how to make this work.

    Thanks in advance!

    • biorelated

      Hi,
      This script sort of assumes you already have a grasp of Ruby. Nevertheless, what is your target platform (Windows, Linux or Mac)? It also assumes that your sequences are stored somewhere in the home directory. This might not make sense if you are not experienced with a *nix platform, although you can change the path to c:\path_to_the_sequence.csv and do the same for the FASTA file path.

      Remember to put ruby before the name of the script when running it, like this:
      C:\> ruby myscript.rb

  2. a

    Okay, I am new to Ruby (downloaded it about half an hour ago!). I am using Windows and have read the brief Ruby tutorial.

    This page is perfect as I want to convert a CSV file to FASTA… I am just a bit confused with this script… is it pasted into Ruby all at once? Would you be able to highlight the bits that should be edited to make it run correctly with my own files?

    Sorry to be a pain, but it would be so useful to me right now!

    x

    • biorelated

      You would run the script all at once. You would change the file names, i.e. the path to the CSV file and the name of the new FASTA file. seq_name_col is the column number with the names of the sequences and seq_data_col is the column number with the sequences.

  3. rzenka

    I wrote a little Excel plugin that loads .fasta into Excel and can save as well. It is designed for relatively short protein sequences (splits them into two fields if too long, like in the case of titin). My colleagues love it. I will try to make it more generic and release it publicly.
