One hot encoding DNA

Elfermi Rachid
2 min read · Sep 25, 2020

One-hot encoding is a way to represent categorical data as binary vectors.

Most machine learning algorithms cannot work with categorical data directly. Instead, we use character-level one-hot encoding to represent the categorical data (in our case, DNA).

One Hot Encoding and DNA sequences using NumPy

Each base is encoded as a vector of zeros with a single 1 at a specific position: A is encoded as (1,0,0,0), T as (0,0,0,1), C as (0,1,0,0), and G as (0,0,1,0).

DNA bases are encoded following their ASCII order (A, C, G, T), so A maps to the first position and T to the last.
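The encoding described above can be sketched with NumPy as follows. This is a minimal illustration, not necessarily the original code of this post: the function name `one_hot_encode` and the constant `BASES` are names chosen here for the example, and the sketch assumes the sequence contains only A, C, G and T.

```python
import numpy as np

# Column order follows the ASCII order of the bases: A, C, G, T.
BASES = "ACGT"

def one_hot_encode(sequence):
    """Return a (len(sequence), 4) integer array, one row per base.

    Assumes every character is one of A, C, G, T (case-insensitive);
    any other character raises a ValueError.
    """
    # Map each base to its column index (A -> 0, C -> 1, G -> 2, T -> 3).
    index = np.array([BASES.index(base) for base in sequence.upper()], dtype=int)
    # Each row of the identity matrix is the one-hot vector for that index.
    return np.eye(len(BASES), dtype=int)[index]

encoded = one_hot_encode("ATCG")
print(encoded)
```

Indexing the identity matrix with an array of positions picks one row per base, which avoids an explicit loop over the output array.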

If you are working with raw data, that is, DNA sequences containing characters other than {“A”, “T”, “C”, “G”} (for example because of sequencing errors), and those errors are not significant for your study, a simple trick is to encode any such character as a vector of zeros (0,0,0,0).
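A minimal sketch of this trick, assuming the A, C, G, T column order used in this post. The function name `one_hot_encode_with_errors` is a name chosen here for illustration.

```python
import numpy as np

# Column order follows the ASCII order of the bases: A, C, G, T.
BASES = "ACGT"

def one_hot_encode_with_errors(sequence):
    """One-hot encode a DNA string; unknown characters become (0,0,0,0)."""
    seq = sequence.upper()
    column = {base: i for i, base in enumerate(BASES)}
    # Start from all zeros so unknown characters need no special handling.
    encoded = np.zeros((len(seq), len(BASES)), dtype=int)
    for row, base in enumerate(seq):
        if base in column:
            encoded[row, column[base]] = 1
    return encoded

noisy = one_hot_encode_with_errors("ANT")
print(noisy)
```

Here the middle character `N` is not a valid base, so its row stays all zeros while `A` and `T` are encoded as usual.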

One Hot Encoding and DNA sequences using scikit-learn
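One possible way to do this with scikit-learn is its `OneHotEncoder`. The sketch below is an illustration rather than the original code of this post: the helper name `one_hot_encode_sklearn` is hypothetical, the categories are fixed to A, C, G, T so the column order does not depend on which bases appear in the input, and `handle_unknown="ignore"` is used so that any other character is encoded as all zeros.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

def one_hot_encode_sklearn(sequence):
    """One-hot encode a DNA string with scikit-learn's OneHotEncoder."""
    # Fix the categories so the columns are always A, C, G, T.
    encoder = OneHotEncoder(
        categories=[["A", "C", "G", "T"]],
        handle_unknown="ignore",  # unknown characters -> all-zero row
        dtype=int,
    )
    # The encoder expects a 2D array: one row per base, one feature column.
    bases = np.array(list(sequence.upper())).reshape(-1, 1)
    # The default output is a sparse matrix; convert it to a dense array.
    return encoder.fit_transform(bases).toarray()

encoded_sk = one_hot_encode_sklearn("ATCGN")
print(encoded_sk)
```

The first four rows match the vectors given earlier for A, T, C and G, and the trailing `N` yields a row of zeros.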

See you next time.

Also check my LinkedIn and GitHub.

Written by Elfermi Rachid

bioinformatician | data scientist | machine learning enthusiast
