One hot encoding DNA

Elfermi Rachid
2 min read · Sep 25, 2020

One-hot encoding is a way to represent categorical data as binary vectors.

Most machine learning algorithms cannot work with categorical data directly. Instead, we use character-level one-hot encoding to represent the categorical data (in our case, DNA).

One-hot encoding of DNA sequences using NumPy.

Each base is encoded as a vector of all zeros except for a single one at a specific position: A is encoded as (1,0,0,0), T as (0,0,0,1), C as (0,1,0,0), and G as (0,0,1,0).

DNA bases are encoded following their ASCII order (A, C, G, T).
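
The encoding described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the original code of this post: the names `BASES` and `one_hot_encode` are hypothetical, and the sketch assumes the sequence contains only A, C, G and T (any other character raises a `ValueError`).

```python
import numpy as np

# Bases in ASCII order, so the column index of each base matches the text:
# A -> 0, C -> 1, G -> 2, T -> 3
BASES = "ACGT"

def one_hot_encode(seq):
    """Return a (len(seq), 4) integer array with one one-hot row per base."""
    # Map each character to its column index (raises ValueError if unknown)
    idx = np.array([BASES.index(b) for b in seq.upper()], dtype=int)
    # Row i of the 4x4 identity matrix is the one-hot vector for index i
    return np.eye(4, dtype=int)[idx]

encoded = one_hot_encode("ATCG")
print(encoded)
```

Indexing the identity matrix with an array of positions avoids an explicit loop over the output rows, which keeps the function short for long sequences.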

If you are working with raw data, in other words DNA sequences that may contain characters other than {“A”, “T”, “C”, “G”}, and sequencing errors have no significant impact on your study, a simple trick is to encode any such character as a vector of zeros (0,0,0,0).
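
One possible way to implement this trick with NumPy is a lookup table that has an extra all-zero row for unknown characters. The sketch below is an assumption about how it could be done, not the original code; `one_hot_encode_tolerant` is a hypothetical name.

```python
import numpy as np

BASES = "ACGT"  # ASCII order: A -> 0, C -> 1, G -> 2, T -> 3

def one_hot_encode_tolerant(seq):
    """One-hot encode a DNA string; unknown characters become (0,0,0,0)."""
    # Rows 0..3 are the one-hot vectors, row 4 is the all-zero "error" vector
    table = np.vstack([np.eye(4, dtype=int), np.zeros((1, 4), dtype=int)])
    # str.find returns -1 for characters that are not in BASES
    idx = np.array([BASES.find(b) for b in seq.upper()], dtype=int)
    # Redirect every unknown character to the all-zero row
    idx[idx == -1] = 4
    return table[idx]

noisy = one_hot_encode_tolerant("ANTG")
print(noisy)
```

Here the "N" in the example sequence is encoded as a row of zeros, while the valid bases keep their usual vectors.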

One-hot encoding of DNA sequences using scikit-learn.
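
A scikit-learn version could look like the following sketch, which uses `OneHotEncoder` from `sklearn.preprocessing`. It is an illustration under assumptions rather than the original code: the helper name `sklearn_one_hot` is hypothetical, the category order is fixed to A, C, G, T, and `handle_unknown="ignore"` is used so that unknown characters are encoded as all zeros.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fix the category order to A, C, G, T so the columns match the ASCII order.
# handle_unknown="ignore" encodes any unseen character as an all-zero row.
encoder = OneHotEncoder(categories=[list("ACGT")],
                        handle_unknown="ignore",
                        dtype=int)
# The encoder must be fitted once; the four bases are enough
encoder.fit(np.array(list("ACGT")).reshape(-1, 1))

def sklearn_one_hot(seq):
    """Hypothetical helper: encode a DNA string as a (len(seq), 4) array."""
    # OneHotEncoder expects a 2-D input: one row per base, one feature column
    column = np.array(list(seq.upper())).reshape(-1, 1)
    # transform returns a sparse matrix by default; convert it to dense
    return encoder.transform(column).toarray()

result = sklearn_one_hot("ATNG")
print(result)
```

Passing the categories explicitly keeps the column order independent of which bases happen to appear in a given sequence.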

See you next time.

Also check out my LinkedIn and GitHub.

Elfermi Rachid

bioinformatician | data scientist | machine learning enthusiast