Title: | Computational Prediction of Proteins Encoded by Circadian Genes |
---|---|
Description: | A computational model for predicting proteins encoded by circadian genes. A support vector machine with a Laplace kernel is employed for prediction of circadian proteins, with compositional, transitional and physico-chemical features used as numeric features. Users can predict labels for a test dataset using the proposed computational model. Users can also build their own model from their own training dataset and then predict on a test set. |
Authors: | Prabina Kumar Meher <[email protected]> |
Maintainer: | Prabina Kumar Meher <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.0.2 |
Built: | 2025-03-01 03:50:49 UTC |
Source: | https://github.com/cran/PredCRG |
The model1 object is the model trained on the Q1 dataset using the developed approach.
data("model1")
Here, 1558 sequences of the pos_Q1 and neg_Q1 datasets were used for training. For prediction, a support vector machine with a Laplace kernel was trained on compositional, transitional and physico-chemical features.
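For intuition about the Laplace kernel mentioned above, the following base-R sketch evaluates it for two numeric feature vectors, using the form exp(-sigma * ||x - y||) that kernlab documents for its laplacedot kernel. The function name laplace_kernel and the value of sigma are illustrative assumptions, not settings taken from PredCRG.

```r
# Illustrative Laplace kernel: k(x, y) = exp(-sigma * ||x - y||)
# 'laplace_kernel' and 'sigma = 0.1' are assumptions made for this sketch.
laplace_kernel <- function(x, y, sigma = 0.1) {
  stopifnot(is.numeric(x), is.numeric(y), length(x) == length(y))
  exp(-sigma * sqrt(sum((x - y)^2)))
}

x <- c(0, 0)
y <- c(3, 4)                    # Euclidean distance from x is 5
k_xy <- laplace_kernel(x, y)    # equals exp(-0.5)
k_xx <- laplace_kernel(x, x)    # identical vectors give 1
```

The kernel value decays with the distance between the two feature vectors, so sequences with similar encodings receive a similarity close to 1.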
PredCRG, PredCRG_Enc, PredCRG_training
library(kernlab)
data(test)
nam <- names(test)
# encoding of test set using compositional, transitional and physico-chemical features
enc <- PredCRG_Enc(test)
# predicting test set using model1 as CRG or non-CRG
pred <- predict(model1, newdata=enc[1:10,], type="response")
# predicting probabilities of the test sequences using model1
pred1 <- predict(model1, newdata=enc[1:10,], type="probabilities")
# combining predicted labels and probabilities
result <- data.frame(seq_name=nam[1:10], predicted_label=as.character(pred), predicted_probability=pred1[,"CRG"])
print(result)
The model2 object is the model trained on the Q2 dataset using the developed approach.
data("model2")
Here, 1596 sequences of the pos_Q2 and neg_Q2 datasets were used for training. For prediction, a support vector machine with a Laplace kernel was trained on compositional, transitional and physico-chemical features.
PredCRG, PredCRG_Enc, PredCRG_training
library(kernlab)
data(test)
nam <- names(test)
# encoding of test set using compositional, transitional and physico-chemical features
enc <- PredCRG_Enc(test)
# predicting test set using model2 as CRG or non-CRG
pred <- predict(model2, newdata=enc[1:10,], type="response")
# predicting probabilities of the test sequences using model2
pred1 <- predict(model2, newdata=enc[1:10,], type="probabilities")
# combining predicted labels and probabilities
result <- data.frame(seq_name=nam[1:10], predicted_label=as.character(pred), predicted_probability=pred1[,"CRG"])
print(result)
The model3 object is the model trained on the Q3 dataset using the developed approach.
data("model3")
Here, 1593 sequences of the pos_Q3 and neg_Q3 datasets were used for training. For prediction, a support vector machine with a Laplace kernel was trained on compositional, transitional and physico-chemical features.
PredCRG, PredCRG_Enc, PredCRG_training
library(kernlab)
data(test)
nam <- names(test)
# encoding of test set using compositional, transitional and physico-chemical features
enc <- PredCRG_Enc(test)
# predicting test set using model3 as CRG or non-CRG
pred <- predict(model3, newdata=enc[1:10,], type="response")
# predicting probabilities of the test sequences using model3
pred1 <- predict(model3, newdata=enc[1:10,], type="probabilities")
# combining predicted labels and probabilities
result <- data.frame(seq_name=nam[1:10], predicted_label=as.character(pred), predicted_probability=pred1[,"CRG"])
print(result)
The model4 object is the model trained on the Q4 dataset using the developed approach.
data("model4")
Here, 1365 sequences of the pos_Q4 and neg_Q4 datasets were used for training. For prediction, a support vector machine with a Laplace kernel was trained on compositional, transitional and physico-chemical features.
PredCRG, PredCRG_Enc, PredCRG_training
library(kernlab)
data(test)
nam <- names(test)
# encoding of test set using compositional, transitional and physico-chemical features
enc <- PredCRG_Enc(test)
# predicting test set using model4 as CRG or non-CRG
pred <- predict(model4, newdata=enc[1:10,], type="response")
# predicting probabilities of the test sequences using model4
pred1 <- predict(model4, newdata=enc[1:10,], type="probabilities")
# combining predicted labels and probabilities
result <- data.frame(seq_name=nam[1:10], predicted_label=as.character(pred), predicted_probability=pred1[,"CRG"])
print(result)
The user can classify protein sequences as CRG (circadian protein) or non-CRG (non-circadian protein), with an associated probability, by supplying the test sequences.
PredCRG(seq_data)
seq_data |
Sequence dataset in FASTA format consisting of protein sequences with standard amino acid residues only. It must be an object of class |
The user only has to supply the seq_data for which the prediction is to be made.
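Because the input must contain standard amino acid residues only, it can help to screen sequences before calling PredCRG. The base-R sketch below works on a plain character vector. The helper name has_only_standard_residues and the example sequences are hypothetical, and an AAStringSet would first need converting with as.character().

```r
# The 20 standard amino acids as one-letter codes.
standard_aa <- "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical helper: TRUE where a sequence uses only standard residues.
# Lower-case input is upper-cased before checking.
has_only_standard_residues <- function(seqs) {
  grepl(paste0("^[", standard_aa, "]+$"), toupper(seqs))
}

# Made-up example sequences (not taken from the package data).
seqs <- c(ok = "MKTAYIAKQR", has_x = "MKXAYIAKQR", has_b = "MKBAY")
keep <- has_only_standard_residues(seqs)
clean <- seqs[keep]   # only the sequence named "ok" remains
```

Sequences flagged FALSE contain ambiguity codes or other non-standard symbols and would need to be removed or curated before prediction.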
A data frame with three columns: sequence name, predicted label of each sequence (CRG or non-CRG) and probability of the prediction.
Prabina Kumar Meher, ICAR-Indian Agricultural Statistics Research Institute, New Delhi-110012, INDIA
PredCRG_Enc, PredCRG_training, model1, model2, model3, model4
data(test)
tst <- test[1:10]
PredCRG(seq_data=tst)
The dataset used to train the PredCRG model contains four sub-datasets (Q1, Q2, Q3 and Q4), prepared based on the homogeneity of sequence length. The positive sets of the sub-datasets are denoted pos_Q1, pos_Q2, pos_Q3 and pos_Q4, and the negative sets neg_Q1, neg_Q2, neg_Q3 and neg_Q4. Within each sub-dataset, the positive and negative sets contain the same number of sequences: 1588, 1596, 1593 and 1365 for Q1, Q2, Q3 and Q4 respectively. The sequence lengths in pos_Q1, pos_Q2, pos_Q3 and pos_Q4 range over 39-221, 221-363, 363-538 and 538-1000 amino acids respectively, and those in neg_Q1, neg_Q2, neg_Q3 and neg_Q4 over 43-407, 407-485, 485-607 and 607-1000 amino acids respectively. Only the Q1 sub-dataset is included in this dataset because of space constraints on CRAN. All four sub-datasets are available from the GitHub repository (https://github.com/meher861982/PredCRG_dataset).
data("PredCRG_data")
The datasets are in AAStringSet format, which can be obtained by reading a FASTA file using the readAAStringSet function available in the Biostrings package.
The protein sequences encoded by the circadian genes constitute the positive datasets, whereas a randomly selected dataset from UniProt for the clade Viridiplantae constitutes the negative dataset.
The circadian gene sequences were collected from the circadian gene database accessible at http://cgdb.biocuckoo.org/ .
PredCRG, PredCRG_Enc, PredCRG_training, model1, model2, model3, model4
data(PredCRG_data)
pos_Q1 <- PredCRG_data$pos_Q1 # positive set of Q1 dataset
neg_Q1 <- PredCRG_data$neg_Q1 # negative set of Q1 dataset
Before protein sequences can be used for prediction with the proposed model, they must be transformed into numeric feature vectors. The function PredCRG_Enc transforms each protein sequence into a numeric vector of 62 values, based on the compositional, physico-chemical and transitional features used in the PredCRG model.
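As a rough illustration of what a compositional feature looks like, the base-R sketch below computes the relative frequency of each of the 20 standard amino acids in a sequence. This is an example under stated assumptions, not the encoding that PredCRG_Enc actually performs (which yields 62 values per sequence). The function aa_composition and the sample sequence are hypothetical.

```r
# One-letter codes of the 20 standard amino acids.
aa_levels <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]

# Hypothetical helper: fraction of each standard residue in one sequence.
# Non-standard symbols are not counted but still add to the sequence length.
aa_composition <- function(seq) {
  residues <- strsplit(toupper(seq), "")[[1]]
  counts <- table(factor(residues, levels = aa_levels))
  comp <- as.numeric(counts) / length(residues)
  names(comp) <- aa_levels
  comp
}

# Made-up four-residue sequence: A appears twice, C and D once each.
comp <- aa_composition("AACD")
comp[c("A", "C", "D")]
```

Each sequence maps to a fixed-length numeric vector regardless of its length, which is the property a classifier such as a support vector machine needs.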
PredCRG_Enc(prot_seq)
prot_seq |
Sequence dataset to be supplied as input, must be an object of class |
The dataset must contain protein sequences having standard amino acid residues only. An object of class AAStringSet can be obtained by reading the FASTA file using readAAStringSet, available in the Bioconductor package Biostrings.
A matrix of dimension n x 62, where n is the number of sequences.
Prabina Kumar Meher, ICAR-Indian Agricultural Statistics Research Institute, New Delhi-110012, INDIA
PredCRG, PredCRG_training, model1, model2, model3, model4
data(test)
enc <- PredCRG_Enc(test) # encoding of test sequence data
enc[1:5,1:5]
Users can build their own PredCRG model from their own training dataset. The user has to supply protein sequence datasets for both the positive and negative classes, containing standard amino acid residues only.
PredCRG_training(pos_seq, neg_seq, kern)
pos_seq |
circadian protein sequence dataset (also called positive dataset), must be an object of class |
neg_seq |
non-circadian protein sequence dataset (also called negative dataset), must be an object of class |
kern |
Type of kernel to be used. It may be |
The sequences must be of AAStringSet type, which can be obtained by reading the FASTA file of the sequences using the function readAAStringSet available in the Biostrings package.
Support Vector Machine object of class ksvm
Prabina Kumar Meher, ICAR-Indian Agricultural Statistics Research Institute, New Delhi-110012, INDIA
PredCRG, PredCRG_Enc, model1, model2, model3, model4
library(kernlab)
data(PredCRG_data) # load the training data before extracting the Q1 sets
pos_Q1 <- PredCRG_data$pos_Q1
neg_Q1 <- PredCRG_data$neg_Q1
# training of the model using laplace kernel
user_model <- PredCRG_training(pos_seq=pos_Q1[1:100], neg_seq=neg_Q1[1:100], kern="laplace")
data(test)
tst_enc <- PredCRG_Enc(test[1:10]) # encoding of the test set
predict(user_model, tst_enc, type="response") # predicting the label of the test instances
predict(user_model, tst_enc, type="probabilities") # predicting the probability of the test instances
library(e1071)
# training of the model using RBF kernel
user_model <- PredCRG_training(pos_seq=pos_Q1[1:100], neg_seq=neg_Q1[1:100], kern="RBF")
predict(user_model, tst_enc, probability=TRUE) # predicting probability
predict(user_model, tst_enc) # predicting labels
A test dataset containing 54 circadian protein sequences collected from the literature. This dataset has been used as an independent test dataset for assessing the prediction accuracy of the PredCRG model.
data("test")
PredCRG, PredCRG_Enc, PredCRG_data
data(test)
PredCRG(test[1:10])