One hot encoding DNA

2 min read · Sep 25, 2020

One-hot encoding is a way to represent categorical data as binary vectors.

Most machine learning algorithms cannot work with categorical data directly. Instead, we use character-level one-hot encoding to represent the categorical data (for us, DNA).

One Hot Encoding and DNA sequences using numpy.

Each base is encoded as a vector of zeros with a single one at a base-specific position: A is encoded as (1,0,0,0), C as (0,1,0,0), G as (0,0,1,0), and T as (0,0,0,1).

DNA bases are encoded following their ASCII order (A, C, G, T).
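The mapping described here can be sketched with NumPy as follows. This is a minimal illustration, not necessarily the original code from the post: the helper name `one_hot_encode` and the constant `BASES` are assumptions, and the sequence is assumed to contain only A, C, G, and T.

```python
import numpy as np

# Bases listed in ASCII order; the index of a base is the position of its 1.
BASES = "ACGT"

def one_hot_encode(sequence):
    """Return an array of shape (len(sequence), 4), one row per base."""
    # Hypothetical helper: assumes every character is one of A, C, G, T.
    indices = [BASES.index(base) for base in sequence.upper()]
    # Row i of the identity matrix is the one-hot vector for BASES[i].
    return np.eye(len(BASES), dtype=int)[indices]

print(one_hot_encode("ACGT"))
```

Indexing the identity matrix by position avoids an explicit loop over vectors, and a character outside `BASES` raises a `ValueError` rather than being silently encoded.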

If you are working with raw data, in other words DNA sequences that may contain characters other than {“A”, “T”, “C”, “G”}, and sequencing errors have no significant impact on your study, a simple trick is to encode any unexpected character as a vector of zeros (0,0,0,0).
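One possible way to implement this trick with NumPy is sketched below. The helper name `one_hot_encode_tolerant` is hypothetical, and the choice to use a lookup table with an extra all-zero row is an assumption made for illustration.

```python
import numpy as np

BASES = "ACGT"  # ASCII order, as above

def one_hot_encode_tolerant(sequence):
    """One-hot encode a sequence, mapping unexpected characters to zeros."""
    # Rows 0-3 are the one-hot vectors; the extra last row is all zeros.
    table = np.vstack([np.eye(4, dtype=int), np.zeros((1, 4), dtype=int)])
    # str.find returns -1 for a character not in BASES, and index -1
    # selects the last (all-zero) row of the table.
    indices = [BASES.find(base) for base in sequence.upper()]
    return table[indices]

print(one_hot_encode_tolerant("ANT"))
```

Here the ambiguous character "N" becomes (0,0,0,0) while "A" and "T" keep their usual vectors.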

One Hot Encoding and DNA sequences using scikit-learn
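A minimal sketch of the same encoding with scikit-learn's `OneHotEncoder`, assuming the category order A, C, G, T is fixed up front and that unknown characters should become all-zero rows (`handle_unknown="ignore"`). The helper name `one_hot_encode_sklearn` is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fix the column order explicitly so it does not depend on the input data.
encoder = OneHotEncoder(categories=[["A", "C", "G", "T"]], handle_unknown="ignore")

def one_hot_encode_sklearn(sequence):
    """Return an array of shape (len(sequence), 4), one row per base."""
    # The encoder expects a 2D array: one row per sample, one column per feature.
    column = np.array(list(sequence.upper())).reshape(-1, 1)
    # The default output is a sparse matrix; convert it to a dense int array.
    return encoder.fit_transform(column).toarray().astype(int)

print(one_hot_encode_sklearn("ACGTN"))
```

Passing `categories` explicitly keeps the columns stable even when a sequence is missing one of the four bases.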

See you next time.

Also check out my LinkedIn and GitHub.


Written by Elfermi Rachid