ORF FINDER WITH PYTHON
Creating a Python program to extract all complete ORFs from metagenomic reads
ORF Finder searches for open reading frames (ORFs) in the DNA sequence you enter. The program returns all the ORFs it locates.
The purpose of an ORF finder is to locate potential genes in a given fragment.
In this lesson I will work only with pure Python (although Biopython gives you some great functions for handling DNA sequences and parsing FASTA files).
The first step is extracting all the fragments from the FASTA or multi-FASTA file. The same method works for both. In my case I am using a multi-FASTA file (an SRA file).
To extract all the fragments we first need to read our SRA file. Whenever the program encounters a ‘>’ it starts extracting the following lines, until another ‘>’ is encountered.
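Here is a minimal sketch of that reading step in pure Python. The function name read_fasta and the two example reads are made up for illustration, and the sketch assumes that every record starts with a ‘>’ header line and that a sequence may be wrapped over several lines.

```python
def read_fasta(lines):
    """Yield (description, sequence) pairs from an iterable of FASTA lines."""
    description = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if description is not None:
                yield description, "".join(chunks)
            description = line[1:]
            chunks = []
        elif description is not None:
            # Sequence line: keep it until the next '>' is encountered.
            chunks.append(line.upper())
    # The last record has no following '>' to close it, so emit it here.
    if description is not None:
        yield description, "".join(chunks)


# Hypothetical two-read example standing in for the real file.
example = """>read_1 hypothetical description
CGCTACGTCTTACGCTGGAGC
TCTCATGGATCGG
>read_2
ATGGCTAGCGATGTGA
""".splitlines()

fragments = list(read_fasta(example))
```

For a real file the same function can be fed an open file handle, for example with open(path) as handle: fragments = list(read_fasta(handle)), where path is the location of your own file.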
Now that we have extracted all the fragments and their descriptions (just in case we need them), we are going to extract the six reading frames of each fragment. Here is how that works:
1. Consider a hypothetical sequence:
CGCTACGTCTTACGCTGGAGCTCTCATGGATCGGTTCGGTAGGGCTCGATCACATCGCTAGCCAT
2. Divide the sequence into 6 different reading frames (+1, +2, +3, -1, -2 and -3). The first reading frame is obtained by reading the sequence in words of 3 nucleotides, starting from the first nucleotide.
FRAME +1: CGC TAC GTC TTA CGC TGG AGC TCT CAT GGA TCG GTT CGG TAG GGC TCG ATC ACA TCG CTA GCC AT
The second reading frame is formed by skipping the first nucleotide and then grouping the rest of the sequence into words of 3 nucleotides.
FRAME +2: C GCT ACG TCT TAC GCT GGA GCT CTC ATG GAT CGG TTC GGT AGG GCT CGA TCA CAT CGC TAG CCA T
The third reading frame is formed by skipping the first 2 nucleotides and then grouping the rest of the sequence into words of 3 nucleotides.
FRAME +3: CG CTA CGT CTT ACG CTG GAG CTC TCA TGG ATC GGT TCG GTA GGG CTC GAT CAC ATC GCT AGC CAT
The other 3 reading frames are read from the opposite strand, so we first need the reverse complement of the sequence.
Complement : GCGATGCAGAATGCGACCTCGAGAGTACCTAGCCAAGCCATCCCGAGCTAGTGTAGCGATCGGTA
Reverse complement: ATGGCTAGCGATGTGATCGAGCCCTACCGAACCGATCCATGAGAGCTCCAGCGTAAGACGTAGCG
Now the same process used for frames +1, +2 and +3 is repeated on the reverse complement sequence to obtain frames -1, -2 and -3.
FRAME -1: ATG GCT AGC GAT GTG ATC GAG CCC TAC CGA ACC GAT CCA TGA GAG CTC CAG CGT AAG ACG TAG CG
FRAME -2: A TGG CTA GCG ATG TGA TCG AGC CCT ACC GAA CCG ATC CAT GAG AGC TCC AGC GTA AGA CGT AGC G
FRAME -3: AT GGC TAG CGA TGT GAT CGA GCC CTA CCG AAC CGA TCC ATG AGA GCT CCA GCG TAA GAC GTA GCG
Now let's code this in Python.
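The following is a minimal sketch of the six-frame split walked through above, not a definitive implementation. The names reverse_complement, split_codons and six_frames are illustrative choices, the sequence is assumed to contain only A, C, G and T, and leftover nucleotides that do not fill a whole codon are dropped instead of being displayed.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]


def split_codons(seq, offset):
    """Return the complete codons of seq, starting at the given offset (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def six_frames(seq):
    """Return a dict mapping frame names (+1..+3, -1..-3) to lists of codons."""
    seq = seq.upper()
    rev = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = split_codons(seq, offset)
        frames[f"-{offset + 1}"] = split_codons(rev, offset)
    return frames


# The hypothetical sequence from the worked example.
sequence = "CGCTACGTCTTACGCTGGAGCTCTCATGGATCGGTTCGGTAGGGCTCGATCACATCGCTAGCCAT"
frames = six_frames(sequence)
for name in ("+1", "+2", "+3", "-1", "-2", "-3"):
    print("FRAME", name + ":", " ".join(frames[name]))
```

Storing each frame as a list of codons makes the next step simple, because looking for start and stop codons becomes a plain loop over that list.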
The last step consists of another loop, in which we go through the six reading frames and extract the ORFs from each one.
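A sketch of that loop could look like the code below. It rests on assumptions that are not spelled out in this post: a complete ORF is taken to start at ATG and end at the first in-frame stop codon (TAA, TAG or TGA), an ATG inside an already open ORF is ignored, and an ORF that never reaches a stop codon is discarded. The function name find_complete_orfs and the min_codons parameter are hypothetical. The frames are the codon lists from the worked example, typed in by hand so the snippet runs on its own.

```python
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_complete_orfs(codons, min_codons=2):
    """Return every ORF (start codon through stop codon) found in one reading frame."""
    orfs = []
    start = None  # index of the currently open start codon, if any
    for i, codon in enumerate(codons):
        if start is None:
            if codon == START_CODON:
                start = i  # open a new ORF
        elif codon in STOP_CODONS:
            # Close the ORF; keep it only if it is long enough (assumed filter).
            if i - start + 1 >= min_codons:
                orfs.append("".join(codons[start:i + 1]))
            start = None
    # An ORF still open here has no stop codon, so it is not complete.
    return orfs


# Complete codons of the six frames from the worked example.
frames = {
    "+1": "CGC TAC GTC TTA CGC TGG AGC TCT CAT GGA TCG GTT CGG TAG GGC TCG ATC ACA TCG CTA GCC".split(),
    "+2": "GCT ACG TCT TAC GCT GGA GCT CTC ATG GAT CGG TTC GGT AGG GCT CGA TCA CAT CGC TAG CCA".split(),
    "+3": "CTA CGT CTT ACG CTG GAG CTC TCA TGG ATC GGT TCG GTA GGG CTC GAT CAC ATC GCT AGC CAT".split(),
    "-1": "ATG GCT AGC GAT GTG ATC GAG CCC TAC CGA ACC GAT CCA TGA GAG CTC CAG CGT AAG ACG TAG".split(),
    "-2": "TGG CTA GCG ATG TGA TCG AGC CCT ACC GAA CCG ATC CAT GAG AGC TCC AGC GTA AGA CGT AGC".split(),
    "-3": "GGC TAG CGA TGT GAT CGA GCC CTA CCG AAC CGA TCC ATG AGA GCT CCA GCG TAA GAC GTA GCG".split(),
}

listOfOrf = []
for name, codons in frames.items():
    for orf in find_complete_orfs(codons):
        listOfOrf.append((name, orf))

for name, orf in listOfOrf:
    print(name, orf)
```

On this example only frames +2, -1, -2 and -3 contain an ATG followed by an in-frame stop codon. The ORF in frame -2 is just ATG TGA, which is why a minimum length such as min_codons is usually worth raising for real data.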
All the extracted ORFs are now stored in the listOfOrf list.
Summary
1. Extract all the fragments from the SRA file.
2. For each fragment, extract the six reading frames.
3. For each of the six frames, extract the complete ORFs.
For more posts, follow me on LinkedIn.