ORF FINDER WITH PYTHON

Elfermi Rachid
3 min read · Oct 28, 2020

Creating a Python program to extract all complete ORFs from metagenomic reads

An ORF finder searches for open reading frames (ORFs) in the DNA sequence you enter. The program returns all the ORFs it locates.

An ORF finder is used to locate potential genes in a given fragment.

In this lesson I will work only with pure Python (although Biopython gives you some great functions for handling DNA sequences and parsing FASTA files).

The first step is to extract all the fragments from the FASTA or multi-FASTA file. The same method works for both. In my case, I am using a multi-FASTA file (an SRA file).

To extract all the fragments, we first need to read our SRA file. Whenever the program encounters a '>' it starts collecting the lines that follow, until another '>' is encountered.
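The parsing step described above can be sketched as follows. This is a minimal sketch, not the original program: the function name read_fasta and the in-memory example records are assumptions made for illustration. It assumes each record starts with a '>' header line and that the sequence may span one or more lines.

```python
def read_fasta(lines):
    """Yield (description, sequence) pairs from an iterable of FASTA lines."""
    description, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:], []
        elif description is not None:
            # Sequence line: store it in upper case for later codon matching.
            chunks.append(line.upper())
    # Emit the last record, which has no following '>' to close it.
    if description is not None:
        yield description, "".join(chunks)


# Hypothetical records, used here instead of a real file.
example = [">read1 demo", "ACGT", "ttga", ">read2", "GGCC"]
fragments = list(read_fasta(example))
```

Because the function accepts any iterable of lines, an open file handle can be passed in the same way as the list used here.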

Now that we have extracted all the fragments and their descriptions (just in case we need them), we extract the six-frame translation for each fragment. Here is how the six frames are obtained:

1. Consider a hypothetical sequence:

CGCTACGTCTTACGCTGGAGCTCTCATGGATCGGTTCGGTAGGGCTCGATCACATCGCTAGCCAT

2. Divide the sequence into 6 different reading frames (+1, +2, +3, -1, -2 and -3). The first reading frame is obtained by reading the sequence in words of 3 nucleotides, starting from the first nucleotide.

FRAME +1: CGC TAC GTC TTA CGC TGG AGC TCT CAT GGA TCG GTT CGG TAG GGC TCG ATC ACA TCG CTA GCC AT

The second reading frame is formed by skipping the first nucleotide and then grouping the rest of the sequence into words of 3 nucleotides.

FRAME +2: C GCT ACG TCT TAC GCT GGA GCT CTC ATG GAT CGG TTC GGT AGG GCT CGA TCA CAT CGC TAG CCA T

The third reading frame is formed by skipping the first 2 nucleotides and then grouping the rest of the sequence into words of 3 nucleotides.

FRAME +3: CG CTA CGT CTT ACG CTG GAG CTC TCA TGG ATC GGT TCG GTA GGG CTC GAT CAC ATC GCT AGC CAT

The other 3 reading frames are read from the reverse complement, so it has to be computed first.

Complement : GCGATGCAGAATGCGACCTCGAGAGTACCTAGCCAAGCCATCCCGAGCTAGTGTAGCGATCGGTA

Reverse complement: ATGGCTAGCGATGTGATCGAGCCCTACCGAACCGATCCATGAGAGCTCCAGCGTAAGACGTAGCG

Now the same process used for the +1, +2 and +3 frames is repeated for the -1, -2 and -3 frames, using the reverse complement sequence.

FRAME -1: ATG GCT AGC GAT GTG ATC GAG CCC TAC CGA ACC GAT CCA TGA GAG CTC CAG CGT AAG ACG TAG CG

FRAME -2: A TGG CTA GCG ATG TGA TCG AGC CCT ACC GAA CCG ATC CAT GAG AGC TCC AGC GTA AGA CGT AGC G

FRAME -3: AT GGC TAG CGA TGT GAT CGA GCC CTA CCG AAC CGA TCC ATG AGA GCT CCA GCG TAA GAC GTA GCG

Now let's code this in Python.
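A minimal sketch of the six-frame step, using the hypothetical sequence from the walk-through. The helper names reverse_complement, six_frames and codons are illustrative choices, not names from the original program. The sketch assumes upper-case input made of A, C, G and T; any other character (such as N) is left unchanged by the complement.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Return a dict mapping the frame label to the sequence read in that frame."""
    rev = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = seq[offset:]  # forward strand
        frames[f"-{offset + 1}"] = rev[offset:]  # reverse complement strand
    return frames


def codons(frame_seq):
    """Split a frame into complete words of 3, dropping 1 or 2 trailing bases."""
    return [frame_seq[i:i + 3] for i in range(0, len(frame_seq) - 2, 3)]


seq = "CGCTACGTCTTACGCTGGAGCTCTCATGGATCGGTTCGGTAGGGCTCGATCACATCGCTAGCCAT"
frames = six_frames(seq)
for label in ("+1", "+2", "+3", "-1", "-2", "-3"):
    print("FRAME", label, " ".join(codons(frames[label])))
```

Each frame is kept as a plain string starting at its offset, so the nucleotides skipped at the start of frames 2 and 3 are simply not part of it.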

The last step consists of another loop, in which we go through the 6 reading frames and extract the ORFs.
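This loop might look like the sketch below. It assumes a complete ORF runs from an ATG start codon to the first in-frame stop codon (TAA, TAG or TGA), and that only the first ATG after the previous stop opens an ORF, so nested start codons are ignored. The function names find_orfs and six_frames are hypothetical; the list name listOfOrf follows the text.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def six_frames(seq):
    """Return the frames in the order +1, +2, +3, -1, -2, -3."""
    rev = seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    return [strand[offset:] for strand in (seq, rev) for offset in range(3)]


def find_orfs(frame_seq):
    """Return every complete ORF (start codon through stop codon) in one frame."""
    orfs = []
    start = None  # index of the currently open start codon, if any
    for i in range(0, len(frame_seq) - 2, 3):
        codon = frame_seq[i:i + 3]
        if start is None and codon == START:
            start = i
        elif start is not None and codon in STOPS:
            orfs.append(frame_seq[start:i + 3])
            start = None  # an ORF left open at the end of the read is discarded
    return orfs


seq = "CGCTACGTCTTACGCTGGAGCTCTCATGGATCGGTTCGGTAGGGCTCGATCACATCGCTAGCCAT"
listOfOrf = []
for frame in six_frames(seq):
    listOfOrf.extend(find_orfs(frame))
print(listOfOrf)
```

For the hypothetical sequence used earlier, this finds four complete ORFs: one in frame +2 and one in each of the three reverse frames. A real ORF finder would usually also apply a minimum length, which would filter out very short hits such as ATGTGA.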

All the extracted ORFs are now stored in the listOfOrf list.

Summary

1. Extract all the fragments from the SRA file.

2. For each fragment, extract the six-frame translation.

3. For each frame in the six-frame translation, extract the complete ORFs.

For more posts, follow me on LinkedIn.

Elfermi Rachid

bioinformatician | data scientist | machine learning enthusiast