43  Central dogma

Author

Andres Patrignani

Published

January 17, 2024

The central dogma of molecular biology explains the flow of genetic information within a biological system. At its core, it involves the processes of transcription and translation. In transcription, the genetic code in DNA is transcribed into messenger RNA (mRNA). This mRNA is then used in translation to create polypeptides, which are chains of amino acids that form proteins. Critical to this process are codons, which are sequences of three nucleotides in the mRNA. Each codon corresponds to a specific amino acid or to a signal that tells the ribosome to start or stop translation. This system of codons ensures that genetic information is accurately translated into the vast array of proteins essential for life.

In this exercise we will create a set of functions to decode DNA sequences that encode multiple polypeptides, in effect a sort of digital ribosome.

# Import modules
import pandas as pd
import pprint
# Load data
lookup = pd.read_csv('../datasets/codon_aminoacids.csv')
lookup.head()
  codon letter   aminoacid
0   AAA      K      Lysine
1   AAC      N  Asparagine
2   AAG      K      Lysine
3   AAU      N  Asparagine
4   ACA      T   Threonine

Transcription

Given a sequence of DNA bases we need to find the complementary strand. The catch here is that we also need to account for the fact that the base thymine is replaced by the base uracil in RNA.

To catch typos in the DNA sequence, or to prevent the user from feeding a sequence of mRNA instead of DNA to the transcription function, we will use the raise statement, which stops the for loop, exits the function, and throws a custom error message if the code finds a base other than A, T, C, or G. The location of the raise statement is crucial since we only want to trigger this action when a certain condition is met (i.e. we find an unknown base). So, we will place the raise statement inside the if statement within the for loop. We will also report the location of the unknown base in the sequence using the find() method.
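The check described above could look like the following minimal sketch. The helper name `check_dna` and the choice of `ValueError` are assumptions made for illustration; they are not part of the functions defined in this exercise.

```python
def check_dna(DNA):
    """Raise a ValueError at the first base that is not A, T, C, or G."""
    for base in DNA:
        if base not in 'ATCG':
            # find() returns the index of the first occurrence of this base
            position = DNA.find(base)
            raise ValueError(f'Unknown base {base!r} at position {position}')
    return True

check_dna('TACTCG')        # Valid DNA, no error is raised
try:
    check_dna('TACUCG')    # A U suggests mRNA was passed instead of DNA
except ValueError as err:
    print(err)             # Unknown base 'U' at position 3
```

Calling a helper like this at the top of the transcription function would stop execution before any base is transcribed.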

The error-catching method described above is simple and practical for small applications, but it has some limitations. For instance, we cannot tell whether there is more than one unknown base, and we cannot let the user know the location of all of them. Nonetheless, this is a good starting point.
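One possible way around this limitation, shown here only as a sketch, is to collect every unknown base before deciding what to do. The function name `find_unknown_bases` is hypothetical and the example sequence is made up for illustration.

```python
def find_unknown_bases(DNA):
    """Return a list of (position, base) pairs for every base that is not A, T, C, or G."""
    return [(i, base) for i, base in enumerate(DNA) if base not in 'ATCG']

unknown = find_unknown_bases('TACUCGXTA')
print(unknown)  # [(3, 'U'), (6, 'X')]
if unknown:
    print(f'Found {len(unknown)} unknown bases')
```

An empty list means the sequence contains only valid DNA bases.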

def transcription(DNA):
    '''
    Function that converts a sequence of DNA bases into messenger RNA
    Input: string of DNA
    Author: Andres Patrignani
    Date: 3-Feb-2020
    '''
    # Mapping table from DNA bases to complementary RNA bases (A->U, T->A, C->G, G->C)
    transcription_table = str.maketrans('ATCG','UAGC')
    # print(transcription_table) shows the Unicode code points: {65: 85, 84: 65, 67: 71, 71: 67}
    
    # Translate using table
    mRNA = DNA.translate(transcription_table)
    return mRNA
    

Translation

The logic of the translation function is similar to our previous example. The only catch is that we need to keep track of the different polypeptides and of the start and stop signals in the mRNA. These signals delimit the sequence of amino acids for each polypeptide. Here are the main steps of the logic:

  • Scan the mRNA in steps of three bases

  • Trigger a new polypeptide only when we find the starting ‘AUG’ codon

  • After that, we know the ribosome is inside the region of the mRNA that encodes amino acids

  • The end of the polypeptide occurs when the ribosome finds any of the stop codons: ‘UAA’, ‘UAG’, ‘UGA’
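The steps above can be tried without the codon lookup table by grouping codons instead of amino acids. This is only a sketch of the scanning logic; the function name `group_codons` and the short test sequence are assumptions made for illustration.

```python
def group_codons(mRNA):
    """Group codons into polypeptides using the start and stop signals (no amino acid lookup)."""
    stop_codons = ('UAA', 'UAG', 'UGA')
    groups = []      # One list of codons per polypeptide
    inside = False   # True while the ribosome is in a coding region

    for i in range(0, len(mRNA) - 2, 3):   # Step of 3 so codons do not overlap
        codon = mRNA[i:i+3]
        if codon == 'AUG':
            inside = True
            groups.append([])              # Start a new polypeptide
        elif codon in stop_codons:
            inside = False                 # Leave the coding region
        if inside:
            groups[-1].append(codon)
    return groups

print(group_codons('AUGAGCUAAGGGAUGACGUGA'))
# [['AUG', 'AGC'], ['AUG', 'ACG']]
```

The codon GGG lies between a stop codon and the next start codon, so it is skipped.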

# Translation function

def translation(mRNA):
    '''
    Function that decodes a sequence of mRNA into chains of amino acids (polypeptides)
    Input: string of mRNA
    Author: Andres Patrignani
    Date: 27-Dec-2019
    '''
    
    # Initialize variables
    polypeptides = dict() # More convenient and human-readable than creating a list of lists
    start = False # Ribosome outside region of mRNA that encodes amino acids
    polypeptide_counter = 0 # A counter to name our polypeptides
    
    for i in range(0,len(mRNA)-2,3):
        codon = mRNA[i:i+3] # Add 3 to avoid overlapping the bases between iterations.
        aminoacid_idx = lookup.codon == codon # Match current codon with all codons in lookup table
        aminoacid = lookup.aminoacid[aminoacid_idx].values[0]
        
        # Logic to find which polypeptide the ribosome is in
        if codon == 'AUG':
            start = True
            polypeptide_counter += 1 
            polypeptide_name = 'P' + str(polypeptide_counter)
            polypeptides[polypeptide_name] = []
        
        elif codon in ('UAA', 'UAG', 'UGA'):
            start = False
        
        # If the ribosome found a starting codon (Methionine)
        if start:
            polypeptides[polypeptide_name].append(aminoacid)
        
    return polypeptides
    

In the translation function we could have used if aminoacid == 'Methionine': for the first logical statement and elif aminoacid == 'Stop': for the second. I decided to use the codons rather than the amino acids to closely match the mechanics of the ribosome, but the two approaches are equivalent in terms of the output that the function generates.

Q: What happens if you indent the line return polypeptides in the translation function by four additional spaces? You will need to modify, save, and call the function to see the answer to this question.

DNA = 'TACTCGTCACAGGTTACCCCAAACATTTACTGCGACGTATAAACTTACTGCACAAATGTGACT'
mRNA = transcription(DNA)
print(mRNA)
polypeptides = translation(mRNA)
pprint.pprint(polypeptides)
AUGAGCAGUGUCCAAUGGGGUUUGUAAAUGACGCUGCAUAUUUGAAUGACGUGUUUACACUGA
{'P1': ['Methionine',
        'Serine',
        'Serine',
        'Valine',
        'Glutamine',
        'Tryptophan',
        'Glycine',
        'Leucine'],
 'P2': ['Methionine', 'Threonine', 'Leucine', 'Histidine', 'Isoleucine'],
 'P3': ['Methionine', 'Threonine', 'Cysteine', 'Leucine', 'Histidine']}