PROTEIN FAMILY CLASSIFICATION USING DEEP LEARNING MODELS
Understanding the relationship between an amino acid sequence and protein function is a long-standing problem in molecular biology with far-reaching scientific implications.
We develop a deep learning model that learns the relationship between unaligned amino acid sequences and their functional classification across all 17,929 families in the Pfam database.
Business/Real World problem:
The task is: given the amino acid sequence of a protein domain, predict which class it belongs to. There are about 1 million training examples and 17,929 output classes.
We use the unaligned sequence data to train deep learning models that learn the distribution of sequences across protein families, optimizing over all families jointly.
We divide the data into train, development, and test sets. The train set is used to fit our models, the development set is used for cross-validation and hyperparameter tuning, and the test set is reserved for final evaluation.
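A minimal sketch of such a split (the fractions and the plain random shuffle here are illustrative assumptions; the Pfam dataset also ships with predefined partitions):

```python
import random

def train_dev_test_split(examples, dev_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle and partition examples into train/dev/test lists."""
    rng = random.Random(seed)
    shuffled = examples[:]  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_dev = int(n * dev_frac)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test

train, dev, test = train_dev_test_split(list(range(1000)))
```

Fixing the seed keeps the partition reproducible across runs, so evaluation numbers stay comparable.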
The data contains the following fields:
- sequence_name: Sequence name, in the form “$uniprot_accession_id/$start_index-$end_index”.
- sequence: These are usually the input features to your model. Amino acid sequence for this domain. There are 20 very common amino acids (frequency > 1,000,000), and 4 amino acids that are quite uncommon: X, U, B, O, Z.
- family_accession: These are usually the labels for your model. Accession number in form PFxxxxx.y (Pfam), where xxxxx is the family accession, and y is the version number.
- family_id: One word name for family.
- aligned_sequence: Contains a single sequence from the multiple sequence alignment.
Exploratory Data Analysis:
We perform data visualization to understand the data and to remove the outliers present in it. We also analyse the behaviour of the data to guide feature extraction and feature engineering.
Feature extraction and feature engineering:
The input network maps a sequence of L amino acids to an (L, 26) binary array, where each row is the one-hot representation of one amino acid from the 26-letter alphabet.
Sequences are padded to the length of the longest sequence in the batch. The embedding network then maps the (L, 26) one-hot array to an (L, F) array containing an F-dimensional embedding for each sequence residue. All processing in the subsequent embedding network is designed to be invariant to the padding introduced for a given sequence.
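A minimal NumPy sketch of this one-hot encoding with batch padding (the exact alphabet ordering and the use of all-zero rows for padding are assumptions for illustration):

```python
import numpy as np

# 20 common amino acids plus the uncommon codes X, U, B, O, Z
# (25 letters here; the 26th slot is left free in this sketch).
VOCAB = "ACDEFGHIKLMNPQRSTVWYXUBOZ"
AA_INDEX = {aa: i for i, aa in enumerate(VOCAB)}

def one_hot_batch(sequences, width=26):
    """Encode a batch of sequences as a (B, L_max, width) one-hot array,
    padding shorter sequences with all-zero rows."""
    max_len = max(len(s) for s in sequences)
    batch = np.zeros((len(sequences), max_len, width), dtype=np.float32)
    for b, seq in enumerate(sequences):
        for pos, aa in enumerate(seq):
            batch[b, pos, AA_INDEX[aa]] = 1.0
    return batch

x = one_hot_batch(["ACDE", "GG"])  # shape (2, 4, 26); "GG" gets 2 padded rows
```

Because padding rows are all zero, downstream pooling can ignore them, which is what makes the later layers padding-invariant.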
The class labels are the family_accession values provided in the data; they are label-encoded to serve as the target variable.
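Label encoding simply assigns each distinct accession a stable integer id, equivalent to scikit-learn's LabelEncoder; a small sketch with made-up accession strings:

```python
def fit_label_encoder(family_accessions):
    """Map each distinct family accession to a stable integer id."""
    classes = sorted(set(family_accessions))
    return {fam: i for i, fam in enumerate(classes)}

# Hypothetical accession values, purely for illustration.
labels = ["PF00001.21", "PF00002.24", "PF00001.21"]
encoder = fit_label_encoder(labels)
y = [encoder[fam] for fam in labels]
```

Sorting the class set before assigning ids keeps the mapping deterministic between runs.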
Our ProtCNN networks use residual networks (ResNets), a variant of convolutional neural networks that train faster and are more stable, even with many layers. The ProtCNN networks are translationally invariant, an advantage when working with unaligned protein sequence data. Convolutional architectures build up layered representations spanning many residues. An n-dilated 1-D convolution applies the standard convolution operation over every nth element of a sequence, allowing local and global information to be combined without greatly increasing the number of model parameters.
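To make the dilation concrete, here is a single-channel NumPy sketch of an n-dilated 1-D convolution (toy weights; real layers operate over many channels and filters):

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation=1):
    """'Same'-padded 1-D convolution: each output position mixes inputs
    spaced `dilation` apart, so the kernel covers a wider context
    without any extra parameters."""
    k = len(kernel)
    span = (k - 1) * dilation          # distance covered by the kernel taps
    pad = span // 2
    xp = np.pad(x, (pad, span - pad))  # zero-pad so output length == input length
    out = np.zeros(len(x))
    for i in range(len(x)):
        for j in range(k):
            out[i] += kernel[j] * xp[i + j * dilation]
    return out
```

With dilation 1 this is an ordinary convolution; with dilation 2 the same 3-tap kernel already spans 5 input positions.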
We use the Adam optimizer. The learning rate is subject to exponential decay. At train time, we present the model with randomly drawn batches.
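Exponential decay scales the learning rate by a fixed factor every fixed number of steps, lr(t) = lr0 · rate^(t / steps); a small sketch (the constants here are illustrative, not the values used in training):

```python
def exponential_decay(initial_lr, decay_rate, decay_steps, step):
    """Learning rate after `step` training steps under exponential decay."""
    return initial_lr * decay_rate ** (step / decay_steps)

# After one full decay period the rate has shrunk by exactly `decay_rate`.
lr_start = exponential_decay(1e-3, 0.9, 1000, 0)
lr_later = exponential_decay(1e-3, 0.9, 1000, 1000)
```

Frameworks such as Keras expose the same schedule as a ready-made class, but the underlying formula is just this one-liner.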
An important composite hyperparameter is the receptive field size of each per-residue feature, which describes the length of the subsequence that affects its value. Using dilated convolutions enables larger receptive field sizes to be obtained without an explosion in the number of model parameters. To our knowledge, this is the first application of dilated convolutions to protein sequence classification. The (L, F) array is then pooled along the length of the sequence, ensuring invariance to padding.
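For stacked 1-D convolutions the receptive field can be computed in closed form: each layer with kernel size k and dilation d widens it by (k-1)·d. A short sketch (the layer configuration shown is an example, not the architecture used above):

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field of a stack of 1-D convolutions: starts at 1 and
    grows by (k - 1) * d for each layer."""
    return 1 + sum((k - 1) * d for k, d in zip(kernel_sizes, dilations))

# Doubling the dilation at each layer grows the field exponentially with
# depth while the parameter count grows only linearly.
rf = receptive_field([3, 3, 3, 3], [1, 2, 4, 8])  # four 3-tap layers
```

This is why dilation gives large receptive fields cheaply: the same four 3-tap layers without dilation would cover only 9 positions instead of 31.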
BiLSTM is an extension of the LSTM in which an additional recurrence starts from the last timestep of the forward recurrence and proceeds backward to the first. Information from "future" timesteps can thus be captured, aiding predictions at earlier timesteps.
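A minimal NumPy sketch of the bidirectional idea, using plain tanh RNN cells in place of LSTM cells for brevity (weights and shapes here are illustrative assumptions):

```python
import numpy as np

def simple_rnn(xs, W_x, W_h, reverse=False):
    """Run a tanh RNN over the timesteps, optionally from last to first."""
    order = range(len(xs) - 1, -1, -1) if reverse else range(len(xs))
    h = np.zeros(W_h.shape[0])
    hs = [None] * len(xs)
    for t in order:
        h = np.tanh(W_x @ xs[t] + W_h @ h)
        hs[t] = h
    return hs

def bidirectional(xs, W_x, W_h):
    """Concatenate forward and backward states, so each timestep's
    representation also carries information from 'future' steps."""
    fwd = simple_rnn(xs, W_x, W_h, reverse=False)
    bwd = simple_rnn(xs, W_x, W_h, reverse=True)
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

rng = np.random.default_rng(0)
xs = [rng.normal(size=26) for _ in range(5)]  # 5 encoded residues
W_x = rng.normal(size=(8, 26)) * 0.1
W_h = rng.normal(size=(8, 8)) * 0.1
states = bidirectional(xs, W_x, W_h)          # 5 vectors of size 16
```

The first timestep's output already depends on the last residue through the backward pass, which a forward-only recurrence cannot achieve.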
Increasing the number of model parameters via the number of filters or the kernel size, as well as increasing the batch size, can all produce performance improvements. Fundamentally, the memory footprint of the models we trained was limited by the memory available on a single GPU, necessitating trade-offs among these factors. Among the experiments we ran, the best-performing ProtCNN for Pfam full consisted of a single residual block with 2000 filters, a kernel size of 26, and a batch size of 128.
In addition to the CNN models, we also trained a recurrent neural network (RNN) with a single-layer bidirectional LSTM, which achieved an accuracy of 0.982 on the Pfam seed dataset.