Searching for Christmas-sy proteins

(Mood: Tropical)

Hello from balmy Singapore! (jealous? 😎 )

Anyway, I promise today’s Advent calendar will be a blast!

Errr, yeah, the blast that I’m talking about is BLAST, which stands for Basic Local Alignment Search Tool, an algorithm for searching nucleotide and protein sequences. Sorry to blast, I mean burst, your bubble.

Here is a common thing you would hear in a biology lab: “Why don’t you blast that sequence?” They don’t mean that you should strap that protein to a rocket and blast it to outer space along with their worries and frustrations (although I’m pretty sure someone has meant it that way somewhere at some point). Rather, they mean that you should submit that sequence to the algorithm for it to find a match or near-match in the sequence database.

In other words, BLAST is like Google: it can find occurrences of a sequence in a sequence database (instead of webpages). Also like Google’s fuzzy search, it can find near-matches too. This is especially important for protein/nucleotide sequences, since a near-match may differ from the query only by a mutation, insertion or deletion. Me making silly typos in Google search: arguably less important.

Now, I deal with protein sequences a lot for my work and once in a while it brings a smile to my face to see some intelligible words appearing amidst the seemingly random letters. Since this is Advent, I wondered whether I could find some Christmas-sy words in the known protein universe.

In other words, let’s blast some Christmas-related terms!

This silly exercise will also give you a glimpse of a data processing workflow commonly encountered in bioinformatics. Who says you can’t learn something while having fun? 😉


Note that this was done on Linux with a local installation of blastp (protein BLAST).

1. First, let’s do a test case. Go to the NCBI blastp web interface and submit a peptide with the sequence CHRISTMAS. Enter this in the query box:

>prot    
CHRISTMAS

And let’s see the top 3 results…

Query 1   CHRISTMAS 9 
          CHRI TM S 
Sbjct 313 CHRITTMSS 321

Query 1  CHR---ISTMAS 9 
         CHR   ISTMAS 
Sbjct 12 CHRLEKISTMAS 23

Query 1  CHRISTMAS 9 
         CHR+S MAS 
Sbjct 97 CHRVSSMAS 105

Hmmm, ok, so there is no CHRISTMAS sequence occurring yet in all the proteins that humans currently know! Very sad 😦
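A quick note on reading these alignments: in the middle line, a letter marks an identical residue, a + marks a similar (positively scoring) substitution, and a blank marks a mismatch or a gap. As a minimal sketch, the identities in the three hits above can be counted like this (count_identities is a hypothetical helper name, not part of BLAST+):

```shell
#!/usr/bin/env bash
# count_identities QUERY SUBJECT
# Print "identical/total" for two aligned strings of equal length.
# Gap characters ("-") never count as identical.
count_identities() {
    awk -v q="$1" -v s="$2" 'BEGIN {
        if (length(q) != length(s)) {
            print "aligned strings must have equal length" > "/dev/stderr"
            exit 1
        }
        n = 0
        for (i = 1; i <= length(q); i++) {
            a = substr(q, i, 1)
            if (a != "-" && a == substr(s, i, 1)) n++
        }
        printf "%d/%d\n", n, length(q)
    }'
}

count_identities CHRISTMAS    CHRITTMSS      # first hit
count_identities CHR---ISTMAS CHRLEKISTMAS   # second hit
count_identities CHRISTMAS    CHRVSSMAS      # third hit
```

This prints 7/9, 9/12 and 7/9: close, but none of them is a perfect CHRISTMAS.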

To give you some idea of our search space, the protein database we are searching is nr, NCBI’s non-redundant protein sequence database, which as of 2018/11/22 has 178,521,967 sequences!

Now let’s repeat this with other Christmas-related words. Although we could still use the web interface, let’s try using a local installation of blastp to do the BLAST search instead. blastp is part of the BLAST+ software suite made available by NCBI (see here for more information).

2. Let’s source a list of Christmas-related words. A cursory Google search leads me here. Copy-paste this to a text file. Make sure every word is on a separate line. Save this as xmas.raw.txt.

3. Do some clean up. This is easy to do in a Linux terminal. You can of course do manual clean up, but humans are inconsistent. It is better to automate the process with a script. The result will be consistent, traceable, easily customisable, and reproducible.

# cleanup.sh: clean up list of christmas words for blastp search

# convert everything to lowercase
# note that we keep xmas.raw.txt untouched, in case we ever need to redo the cleanup
# (quote the character classes so the shell cannot expand them as globs)
tr '[:upper:]' '[:lower:]' < xmas.raw.txt > xmas.clean.txt
# further cleanup (GNU sed, edits xmas.clean.txt in place)
sed -i '/[bjouxz]/d      # delete all words containing non amino acid letters
        /^.$/d           # delete lines with just one letter
        s/ //g           # delete all spaces
        s/[[:punct:]]//g # remove punctuation
        /^$/d            # delete lines left empty
' xmas.clean.txt
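Before moving on, it is worth checking that the cleanup really left nothing but amino acid letters. A minimal sketch of such a check, assuming the 20 standard one-letter codes in lowercase (check_words is a hypothetical helper name, and the sample words in the demo are made up):

```shell
#!/usr/bin/env bash
# check_words FILE
# Print every line (with its line number) that is not made up purely of
# the 20 standard amino acid letters in lowercase.
# Returns 0 only if grep ran fine and found no offending line.
check_words() {
    local status=0
    grep -nvE '^[acdefghiklmnpqrstvwy]+$' "$1" || status=$?
    [ "$status" -eq 1 ]   # grep exits 1 when it selected no line
}

# Demo on a throwaway file with made-up sample words
demo=$(mktemp)
printf 'angel\nsnowman\nst nick\n\nelf\n' > "$demo"
check_words "$demo" || echo "cleanup incomplete"
rm -f "$demo"
```

On the real data this would be called as check_words xmas.clean.txt; no output means the list is ready for the next step.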

4. Make a FASTA entry for each word

# add FASTA header    
sed -i 's/^/>prot\n/' xmas.clean.txt
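Note that the command above gives every entry the same header, >prot. A possible variation (not what was run for this post) is to use the word itself as the FASTA header, so that each hit in the output can be traced back to its word. A minimal sketch, where words_to_fasta and the output name xmas.fasta are hypothetical:

```shell
#!/usr/bin/env bash
# words_to_fasta WORDS_FILE
# Emit one FASTA record per non-empty line, using the word itself
# as the record ID and as the sequence.
words_to_fasta() {
    awk 'NF { printf ">%s\n%s\n", $0, $0 }' "$1"
}

# Demo on a throwaway file with two sample words
demo=$(mktemp)
printf 'angel\nelf\n' > "$demo"
words_to_fasta "$demo"
rm -f "$demo"

# On the real data this would be something like:
#   words_to_fasta xmas.clean.txt > xmas.fasta
```

This variation reads the cleaned word list before any header has been added, so it would replace the sed command above rather than follow it.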

5. BLASTing through the snow, in a one-horse open sleigh…

Finally, it’s BLAST time! In my experience, the web and local blastp sometimes give different results because they use different parameters. To ensure consistency, you can save the search strategy from the blastp web page, which captures all the parameters. Save the search strategy file as xmas.asn. I further edited my search strategy file to point to my local protein database. Now, before running on all the words, do a test run to see if it gives you the same result as the web page.

blastp -import_search_strategy xmas.asn -out xmas.out

All good? Go ahead and run on all words. My desktop took around 20 minutes. Perfect time to take a fikapaus 🙂

# override the query stored in xmas.asn
blastp -import_search_strategy xmas.asn -query xmas.clean.txt -out xmas.out

6. Jingle bells, jingle bells, oh what fun it is to, ahem, analyse your result!

Matches: angel candle candy charity chill cider creche elves family festival garland greeting icicle kings lights manger merry mittens myrrh nativity navidad partridge presents reindeer scarf sleigh stnick sweater tidings tinsel wassail winter wintry wiseman wish wrap wreath

No match: chimney christmastide giftgiving iceskate mincepie santaselves santashelper santaslist

Surprisingly no match: cap card elf

And here are some near-matches, for your amusement:

cannycane (candycane)    
emergreen (evergreen)    
fathprrhrlssmas (fatherchristmas)    
firpeplace (fireplace)    
fradkiksense (frankincense)    
mevarrlchrisim (merrychristmas)    
pvnetree (pinetree)    
sinterrlass (sinterklaas)    
widrertime (wintertime)    
wrarpigdgpaper (wrappingpaper)

There are quite a number of Christmas-sy proteins!
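If you would rather not sort the hits by eye, the classification can be scripted. The sketch below is one way to do it, not necessarily how the lists above were produced. It assumes the search is rerun with tabular output, for example -outfmt "6 qseqid sseqid pident length qlen", and that each query has a distinct ID such as the word itself. A word counts as an exact match when some hit is 100% identical over the full query length. exact_matches is a hypothetical helper name, and the rows in the demo are made up for illustration, not real BLAST output.

```shell
#!/usr/bin/env bash
# exact_matches TABLE
# TABLE: tab-separated hits with columns qseqid sseqid pident length qlen
# Print each query ID once if it has at least one hit that is
# 100% identical and as long as the query itself.
exact_matches() {
    awk -F'\t' '$3 == 100 && $4 == $5 && !seen[$1]++ { print $1 }' "$1"
}

# Demo table with made-up rows (subject IDs are placeholders)
demo=$(mktemp)
{
    printf 'angel\tsubjA\t100.000\t5\t5\n'    # full-length, identical
    printf 'angel\tsubjB\t100.000\t5\t5\n'    # second hit, same word
    printf 'chimney\tsubjC\t85.714\t7\t7\n'   # full-length, one mismatch
    printf 'elf\tsubjD\t100.000\t2\t3\n'      # identical but only 2 of 3 letters
} > "$demo"
exact_matches "$demo"
rm -f "$demo"
```

On the demo table this prints only angel. Any query with hits that never appears in the output is a candidate for the near-match list.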

7. Just for fun, I also included NATALIE, our dear blog coach’s name, in the search, and coincidentally her name is Christmas-sy! (Latin: natalis dies Domini = birthday of the Lord). Her name turns out to be quite popular across the kingdoms, so to speak: from bacteria and fish to octopus and birds, there are proteins with NATALIE inside 😀

Here are some:

PTV49966.1 hypothetical protein DBL04_17595, partial [Acinetobacter seifertii]    
PAA83426.1 hypothetical protein BOX15_Mlig003849g1 [Macrostomum lignano]    
XP_009469590.1 PREDICTED: dynein heavy chain 5, axonemal-like [Nipponia nippon]    
XP_010576315.1 PREDICTED: dynein heavy chain 5, axonemal-like [Haliaeetus leucocephalus]    
XP_009884215.1 PREDICTED: dynein heavy chain 5, axonemal-like [Charadrius vociferus]    
XP_021242631.1 dynein heavy chain 5, axonemal-like isoform X1 [Numida meleagris]

Postscript

As I said in the beginning, there are some aspects of this exercise that can be applied to one’s real-life project:

1. Consistency
You might notice that consistency is a motif in this exercise. I first ran a test case on the blastp web page and compared the result with the one from my local blastp. I also wrote a script so that the data processing is consistent.

2. Automate as much as possible, not only for ease but also reproducibility
Not only should you write scripts, you should also pay attention to how you name your scripts and variables, and document them with comments. The future you will thank you! This really takes deliberate effort, and I can tell you that I did not do enough of it during my PhD and I am committing to do more.

So with these Christmas-sy proteins, may you have a protein-packed Christmas. MEVARRLCHRISIM!

Yossa is a former PhD student at KI and is now a structural bioinformatics postdoctoral fellow in Singapore. He can be reached through his personal webpage.
