Deep learning for gene prediction in metagenomics fragments.

Elfermi Rachid
4 min read · Oct 3, 2020

CNN, RNN, LSTM and GRU for gene prediction in metagenomics fragments.

In this final lesson on the application of deep learning to gene prediction, we implement 4 different algorithms and conclude which is best suited for predicting genes in metagenomic DNA.

The preprocessing is the same as in my last post, which you can find here.

Preprocessing the data

The preprocessing is the same for all the algorithms. We use the data provided by this article, which contains around 4 million ORFs; I extracted 200,000 ORFs for training and testing.

The data is available in my Drive.

As usual, we load our data into Google Colab.

Once the data is loaded into lists, we convert it into a DataFrame using pandas.
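A minimal sketch of how this step might look; the Drive file paths and the one-sequence-per-line format are assumptions for illustration, not the exact files used.

```python
# Assumed layout: two plain-text files in Drive, one ORF sequence per line.
import pandas as pd
from google.colab import drive

drive.mount('/content/drive')

def read_sequences(path):
    """Read one DNA sequence per line into a list."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

coding = read_sequences('/content/drive/MyDrive/coding_orfs.txt')        # hypothetical path
non_coding = read_sequences('/content/drive/MyDrive/non_coding_orfs.txt')  # hypothetical path

# One row per ORF: the raw sequence and a binary label (1 = coding, 0 = non-coding)
df = pd.DataFrame({
    'sequence': coding + non_coding,
    'label': [1] * len(coding) + [0] * len(non_coding)
})
```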

Now we treat the ORFs as a language by breaking the DNA down into overlapping k-mer "words"; for more explanation, see this post.

After the initial extraction of "words", we convert the list of k-mers for each gene into a string sentence.
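A sketch of the k-mer step; the choice of k = 6 is illustrative rather than the exact value used.

```python
def get_kmers(sequence, k=6):
    """Break a DNA sequence into overlapping k-mer 'words'."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Each ORF becomes a space-separated 'sentence' of k-mers
df['words'] = df['sequence'].apply(lambda seq: get_kmers(seq, k=6))
df['sentence'] = df['words'].apply(lambda kmers: ' '.join(kmers))
```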

Our data consists of 100,000 coding ORFs and 100,000 non-coding ones.

Here we use the Tokenizer class from Keras to convert the words/k-mers into integers, then apply padding to standardize the length of the inputs.
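A sketch of the tokenization and padding step; the padded length of 300 is a placeholder and should be chosen from the ORF length distribution.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Map each distinct k-mer 'word' to an integer index
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['sentence'])
encoded = tokenizer.texts_to_sequences(df['sentence'])

# Pad (or truncate) every ORF to the same length so they can be batched
max_len = 300  # illustrative choice
X = pad_sequences(encoded, maxlen=max_len, padding='post')
y = df['label'].values
```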

One last step before creating the models is splitting the data into training and testing sets and defining the vocabulary size.
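One way to do this, using scikit-learn's train_test_split; the 80/20 split ratio is an assumption.

```python
from sklearn.model_selection import train_test_split

# Vocabulary size = number of distinct k-mers + 1 (index 0 is reserved for padding)
vocab_size = len(tokenizer.word_index) + 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
```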

RNN + Embedding Layer

Personally, I don't expect much from the RNN because it suffers from the problem of vanishing gradients, which hampers learning on long data sequences (in our case, long ORFs). The gradients carry the information used in the RNN parameter update, and when the gradient becomes smaller and smaller, the parameter updates become insignificant, which means no real learning is done.
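A minimal sketch of a simple RNN classifier on top of an embedding layer, using the arrays from the preprocessing step; the layer sizes, epochs, and batch size are illustrative, not the exact configuration used.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

# Embedding turns k-mer indices into dense vectors; SimpleRNN reads them in order
model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=32, input_length=max_len),
    SimpleRNN(64),
    Dense(1, activation='sigmoid')  # binary output: coding vs non-coding
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=5, batch_size=128,
          validation_data=(X_test, y_test))
```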

The RNN was also more computationally expensive than the CNN, LSTM, and GRU, and, as expected, the model couldn't really learn.

With an accuracy of just 55%, the RNN performed the worst among all the algorithms implemented.

LSTM + Embedding Layer

LSTMs solve the problem of vanishing gradients. I recommend this post if you want to dive deeper into how LSTM networks do so.
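The LSTM model is the same sketch as above with the recurrent layer swapped out (again, sizes are illustrative); it is trained the same way.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Same embedding front-end, with an LSTM replacing the SimpleRNN
model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=32, input_length=max_len),
    LSTM(64),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```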

The LSTM improves the accuracy a lot and is able to classify the ORFs into coding and non-coding.

The accuracy improved to nearly 90%.

GRU + Embedding Layer

The GRU architecture is similar to the LSTM but with a few important differences, which you can read about here and here.
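And the GRU version of the same sketch, with the recurrent layer swapped once more:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dense

# Identical architecture, with a GRU as the recurrent layer
model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=32, input_length=max_len),
    GRU(64),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```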

The overall performance of the GRU is similar to that of the LSTM, which gives the GRU an advantage since it needs fewer computational resources.

CNN + Embedding Layer

CNN-based models have shown promising results in many fields. With an accuracy of 88%, the CNN scores below the LSTM and GRU, but on the other hand it is the algorithm that uses the least computational resources of all the ones implemented.
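A minimal sketch of a 1D-convolution version of the same classifier; the filter count and kernel size are illustrative assumptions.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

# 1D convolution slides over the embedded k-mer sequence; global max pooling
# keeps the strongest activation of each filter before the output layer
model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=32, input_length=max_len),
    Conv1D(filters=64, kernel_size=5, activation='relu'),
    GlobalMaxPooling1D(),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```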

Discussion of results

We can clearly see that the GRU and LSTM have nearly the same performance, which gives the GRU the advantage over the LSTM because it uses fewer computational resources and takes less time to run.

The CNN comes in third, but it is by far the fastest of them all, while the RNN is not worth mentioning.

I really recommend the post "The fall of RNN / LSTM", in which the author presents CNNs as an alternative to LSTMs and RNNs.

Check out my LinkedIn.

Till next time.

