# Import modules
import pandas as pd
import pprint
43 Central dogma
The central dogma of molecular biology explains the flow of genetic information within a biological system. At its core, it involves the processes of transcription and translation. In transcription, the genetic code in DNA is transcribed into messenger RNA (mRNA). This mRNA is then used in translation to create polypeptides, which are chains of amino acids that form proteins. Critical to this process are codons, which are sequences of three nucleotides in the mRNA. Each codon corresponds to a specific amino acid or to a signal that tells the ribosome to start or stop translation. This system of codons ensures that genetic information is accurately translated into the vast array of proteins essential for life.
In this exercise we will create a set of functions to decode DNA sequences that encode multiple polypeptides, in effect a sort of digital ribosome.
# Load data
lookup = pd.read_csv('../datasets/codon_aminoacids.csv')
lookup.head()
|   | codon | letter | aminoacid |
|---|---|---|---|
| 0 | AAA | K | Lysine |
| 1 | AAC | N | Asparagine |
| 2 | AAG | K | Lysine |
| 3 | AAU | N | Asparagine |
| 4 | ACA | T | Threonine |
Transcription
Given a sequence of DNA bases, we need to find the complementary strand. The catch here is that we also need to account for the fact that the base thymine (T) is replaced by the base uracil (U) in RNA.
To check for potential typos in the DNA sequence, or to prevent the user from feeding a sequence of mRNA instead of DNA to the transcription function, we will use the `raise` statement, which will automatically stop and exit the `for` loop and throw a custom error message if the code finds a base other than A, T, C, or G. The location of the `raise` statement is crucial since we only want to trigger this action if a certain condition is met (i.e. we find an unknown base). So, we will place the `raise` statement inside the `if` statement within the `for` loop. We will also report the location of the unknown base in the sequence using the `find()` method.
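The check described above could be sketched as follows. This is only an illustration under stated assumptions: the helper name `check_dna_bases`, the choice of `ValueError`, and the wording of the error message are hypothetical and not taken from the original notebook.

```python
def check_dna_bases(DNA):
    """Raise a ValueError at the first base that is not A, T, C, or G."""
    for base in DNA:
        if base not in 'ATCG':
            # find() returns the index of the first occurrence of this base
            position = DNA.find(base)
            raise ValueError(f'Unknown base {base} at position {position}')
    return True  # Only reached when every base is valid
```

Calling `check_dna_bases('TAUG')` stops at the uracil and reports position 2 (counting from zero), which is what would happen if a user passed mRNA instead of DNA.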
The error-catching method described above is simple and practical for small applications, but it has some limitations. For instance, it stops at the first unknown base, so we cannot tell whether there is more than one unknown base, and we cannot report the location of all of them to the user. Nonetheless, this is a good starting point.
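One possible way around that limitation is to collect every offending position before reporting anything. The sketch below assumes a hypothetical helper named `find_unknown_bases`; it is not part of the original exercise.

```python
def find_unknown_bases(DNA):
    """Return a list of (position, base) pairs for every base other than A, T, C, or G."""
    return [(i, base) for i, base in enumerate(DNA) if base not in 'ATCG']

# Example with two invalid characters
print(find_unknown_bases('TAUGXC'))  # [(2, 'U'), (4, 'X')]
```

An empty list means the sequence contains only valid DNA bases, so the caller can decide whether to raise an error or simply warn the user.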
def transcription(DNA):
    '''
    Function that converts a sequence of DNA bases into messenger RNA
    Input: string of DNA
    Author: Andres Patrignani
    Date: 3-Feb-2020
    '''
    # Translation table
    transcription_table = DNA.maketrans('ATCG','UAGC')
    #print(transcription_table) {65: 85, 84: 65, 67: 71, 71: 67}

    # Translate using table
    mRNA = DNA.translate(transcription_table)
    return mRNA
Translation
The logic of the translation function will be similar to our previous example. The only catch is that we need to keep track of the different polypeptides and the start and stop signals in the mRNA. These signals dictate the sequence of amino acids for each polypeptide. Here are some steps of the logic:
Scan the mRNA in steps of three bases
Trigger a new polypeptide only when we find the starting ‘AUG’ codon
After that we know the ribosome is inside the region of the mRNA that encodes amino acids
The end of the polypeptide occurs when the ribosome finds any of the stop codons: ‘UAA’, ‘UAG’, ‘UGA’
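The first step, scanning the mRNA in steps of three bases, can be illustrated on its own. The short sequence below is made up for this example and does not come from the exercise data.

```python
mRNA = 'AUGAGCUAAGG'  # Hypothetical sequence with two leftover bases at the end

# Stop at len(mRNA)-2 so that an incomplete trailing codon is never read
codons = [mRNA[i:i+3] for i in range(0, len(mRNA) - 2, 3)]
print(codons)  # ['AUG', 'AGC', 'UAA']
```

Here the scan yields a start codon, one coding codon, and a stop codon, and the trailing `GG` is ignored because it does not form a complete codon.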
# Translation function
def translation(mRNA):
    '''
    Function that decodes a sequence of mRNA into a chain of aminoacids
    Input: string of mRNA
    Author: Andres Patrignani
    Date: 27-Dec-2019
    '''
    # Initialize variables
    polypeptides = dict() # More convenient and human-readable than creating a list of lists
    start = False # Ribosome outside region of mRNA that encodes aminoacids
    polypeptide_counter = 0 # A counter to name our polypeptides

    for i in range(0,len(mRNA)-2,3):
        codon = mRNA[i:i+3] # Add 3 to avoid overlapping the bases between iterations.
        aminoacid_idx = lookup.codon == codon # Match current codon with all codons in lookup table
        aminoacid = lookup.aminoacid[aminoacid_idx].values[0]

        # Logic to find in which polypeptide the Ribosome is in
        if codon == 'AUG':
            start = True
            polypeptide_counter += 1
            polypeptide_name = 'P' + str(polypeptide_counter)
            polypeptides[polypeptide_name] = []

        elif codon == 'UAA' or codon == 'UAG' or codon == 'UGA':
            start = False

        # If the Ribosome found a starting codon (Methionine)
        if start:
            polypeptides[polypeptide_name].append(aminoacid)

    return polypeptides
In the translation function we could have used `if aminoacid == 'Methionine':` for the first logical statement and `elif aminoacid == 'Stop':` for the second logical statement. I decided to use the codons rather than the amino acids to closely match the mechanics of the ribosome, but the statements are equivalent in terms of the outputs that the function generates.
Q: What happens if you indent the line `return polypeptides` by four additional spaces in the translation function? You will need to modify, save, and call the function to see the answer to this question.
DNA = 'TACTCGTCACAGGTTACCCCAAACATTTACTGCGACGTATAAACTTACTGCACAAATGTGACT'
mRNA = transcription(DNA)
print(mRNA)

polypeptides = translation(mRNA)
pprint.pprint(polypeptides)
AUGAGCAGUGUCCAAUGGGGUUUGUAAAUGACGCUGCAUAUUUGAAUGACGUGUUUACACUGA
{'P1': ['Methionine',
'Serine',
'Serine',
'Valine',
'Glutamine',
'Tryptophan',
'Glycine',
'Leucine'],
'P2': ['Methionine', 'Threonine', 'Leucine', 'Histidine', 'Isoleucine'],
'P3': ['Methionine', 'Threonine', 'Cysteine', 'Leucine', 'Histidine']}