Handling variable size DNA inputs

Elfermi Rachid
2 min read · Sep 25, 2020

DNA padding

Deep learning libraries assume a vectorized representation of your data.

In the case of variable length sequence prediction problems, this requires that your data be transformed such that each sequence has the same length.

This vectorization allows code to efficiently perform the matrix operations in batch for your chosen deep learning algorithms.

In this tutorial, you will discover techniques for preparing variable-length DNA sequence data for sequence prediction problems in Python with TensorFlow.

First, let’s create random DNA sequences.
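One way this step could look is sketched below. The helper name `random_dna`, the length range of 5 to 15 bases, and the fixed seed are assumptions made for illustration; they are not taken from the original code.

```python
import random

ALPHABET = "ACGT"  # the four DNA bases


def random_dna(min_len=5, max_len=15, rng=random):
    """Return a random DNA string with a length drawn from [min_len, max_len]."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(ALPHABET) for _ in range(length))


rng = random.Random(42)  # fixed seed (an assumption) so runs are repeatable
sequences = [random_dna(rng=rng) for _ in range(5)]
print(sequences)
```

Drawing the length at random is what gives the variable-size inputs that the rest of the tutorial has to deal with.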

We then display the generated DNA sequences along with their lengths.
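A minimal sketch of this inspection step, using a small hard-coded list of hypothetical sequences in place of the randomly generated ones:

```python
# Hypothetical sample data standing in for the generated sequences.
sequences = ["ACGTAC", "GGT", "TTAGCCGA", "CA"]

lengths = [len(seq) for seq in sequences]

# Print each sequence next to its length to make the size differences visible.
for seq, n in zip(sequences, lengths):
    print(f"{seq:<10} length={n}")
```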

Next, we apply one-hot encoding to the DNA sequences.
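The encoding could be written with NumPy roughly as follows. The column order A, C, G, T and the helper name `one_hot_encode` are assumptions for this sketch, and the sample sequences are hypothetical.

```python
import numpy as np

# Assumed column order for the one-hot vectors: A, C, G, T.
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def one_hot_encode(seq):
    """Encode a DNA string as an array of shape (len(seq), 4)."""
    encoded = np.zeros((len(seq), 4), dtype=np.int32)
    for i, base in enumerate(seq):
        encoded[i, BASE_INDEX[base]] = 1
    return encoded


# Hypothetical sample data.
sequences = ["ACGTAC", "GGT", "TTAGCCGA", "CA"]
encoded = [one_hot_encode(seq) for seq in sequences]

# Each array has 4 columns, but the number of rows follows the sequence length.
shapes = [e.shape for e in encoded]
print(shapes)
```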

Here we see that the encoded sequences have different lengths.

To fix this issue, we use the pad_sequences function from the Keras deep learning library, which pads variable-length sequences to a common length.

Pre-Sequence Padding

Pre-sequence padding is the default (padding='pre').

In short, the pad_sequences function adds vectors of zeros at the beginning of the shorter DNA sequences until every sequence is as long as the longest one.
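As a small illustration of pre-padding, the sketch below pads two hypothetical one-hot encoded sequences. It assumes TensorFlow 2.x is installed and that the one-hot columns are ordered A, C, G, T.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Two hypothetical one-hot encoded sequences (assumed column order: A, C, G, T).
seq_a = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]])  # "ACG"
seq_b = np.array([[0, 0, 0, 1]])                              # "T"

# padding="pre" is the default; it is spelled out here for clarity.
padded_pre = pad_sequences([seq_a, seq_b], padding="pre")

print(padded_pre.shape)  # (number of sequences, longest length, 4)
print(padded_pre[1])     # zero vectors come first, then the encoded "T"
```

Because no maxlen is passed, the shorter sequence is padded up to the length of the longest sequence in the batch.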

Post-Sequence Padding

Padding can also be applied to the end of the sequences, which may be more appropriate for some problem domains.

In short, with padding='post' the pad_sequences function adds vectors of zeros at the end of the shorter DNA sequences.
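The same kind of hypothetical data can be used to sketch post-padding. Again this assumes TensorFlow 2.x and an A, C, G, T column order.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Two hypothetical one-hot encoded sequences (assumed column order: A, C, G, T).
seq_a = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]])  # "ACG"
seq_b = np.array([[0, 0, 0, 1]])                              # "T"

# padding="post" appends the zero vectors after the real data.
padded_post = pad_sequences([seq_a, seq_b], padding="post")

print(padded_post.shape)  # (number of sequences, longest length, 4)
print(padded_post[1])     # the encoded "T" comes first, then zero vectors
```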

Summary

In this tutorial, you discovered how to prepare variable-length DNA sequences for use with sequence prediction problems in Python.

Till next time.

Follow me on LinkedIn, GitHub

