singular value decomposition for the selection of features
important for protein sequence classification
by using logistic regression models and singular
value decomposition
Braulio RGM Couto 1,2,*§, Marcelo M Santoro 3, Amjad Ali 4, Marcos A Santos 5*
1 Programa de Doutorado em Bioinformática, Universidade Federal de Minas Gerais, UFMG, Belo Horizonte, Minas Gerais, Brasil
2 Departamento de Ciências Exatas e Tecnologia, Centro Universitário de Belo Horizonte, UNI-BH, Belo Horizonte, Minas Gerais, Brasil
3 Departamento de Bioquímica e Imunologia, UFMG, Belo Horizonte, Minas Gerais, Brasil
4 Laboratory of Molecular and Cellular Genetics (LGCM), Departamento de Biologia Geral, ICB/UFMG, UFMG, Belo Horizonte, Minas Gerais, Brasil
5 Departamento de Ciência da Computação, UFMG, Av. Antonio Carlos 6627, Belo Horizonte, Minas Gerais, 31270-010, Brasil
*These authors contributed equally to this work
§Corresponding author
Email addresses:
BRGMC: [email protected]
MMS: [email protected]
AA: [email protected]
MAS: [email protected]
Abstract
Background
Searching for relevant patterns in protein sequences is a critical Bioinformatics goal.
In this work we present a computational tool to support genomic research that
uses logistic regression models and singular value decomposition for feature selection
and protein sequence classification. First, we treat a biomolecular sequence as a
complex written language and recode it as a p-peptide frequency vector, using all
possible overlapping windows of p residues. With 20 amino acids this produces a
high-dimensional vector with 20^p entries, where p is the word size. Each row of the
vector corresponds to a peptide, which is analyzed by logistic regression as a
candidate feature for protein sequence classification. With a word size of one (p=1),
the features analyzed are the individual amino acids, and we can identify which
amino acids are important for a group of proteins. With p=2 we can identify
bipeptides associated with a specific group of sequences. Besides peptides, we include
sequence length as another candidate feature. The model-building strategy for
feature selection was an automatic forward stepwise logistic regression. After the
feature selection step, proteins are recoded using only the p-peptides selected as
important for each sequence group. The rank of the protein frequency matrix
produced for each target group is reduced by singular value decomposition (SVD), and
the results are used to classify unknown sequences. The protein collection used in
all analyses was a database of 516,081 sequences from the Swiss-Prot section of the
Universal Protein Resource (UniProt). We tested the method on seven target
groups: insulin, globin, keratin, cytochrome, and proteins related to cystic fibrosis,
Alzheimer disease and schizophrenia. A case-control study was done to examine each
target group. In this approach, sequences from the target group (the cases) are selected
from the database for comparison with a series of random sequences (the controls).
The number of cases is restricted, much smaller than the number of controls. To
approach an optimal allocation of cases and controls in each feature selection
analysis, we used a 1:4 case:control ratio. Four random controls per case compensate
for the small number of cases and are enough to detect features related to each protein group.
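The recoding and sampling steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`peptide_index`, `frequency_vector`, `sample_controls`), the use of raw counts rather than normalized frequencies, the skipping of windows with non-standard residues, and the example sequence are all assumptions made for illustration.

```python
import random
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def peptide_index(p):
    """Map every possible p-peptide to a row index (20**p rows in total)."""
    return {"".join(w): i for i, w in enumerate(product(AMINO_ACIDS, repeat=p))}

def frequency_vector(sequence, p, index=None):
    """Count all overlapping p-peptide windows in `sequence`.

    Assumption: raw counts are used, and windows containing a
    non-standard residue are skipped.
    """
    index = index or peptide_index(p)
    counts = [0] * len(index)
    for start in range(len(sequence) - p + 1):
        row = index.get(sequence[start:start + p])
        if row is not None:
            counts[row] += 1
    return counts

def sample_controls(case_ids, all_ids, ratio=4, seed=0):
    """Draw `ratio` random controls per case from the non-case sequences."""
    cases = set(case_ids)
    pool = [i for i in all_ids if i not in cases]
    return random.Random(seed).sample(pool, ratio * len(cases))

# Hypothetical example: bipeptide (p=2) vector of a short made-up sequence.
index2 = peptide_index(2)
vec = frequency_vector("MKVLAAGIV", 2, index2)

# Hypothetical example: 2 cases out of 100 identifiers, 1:4 case:control ratio.
controls = sample_controls([3, 7], range(100))
```

With p=2 the vector has 400 rows, which matches the 400-row initial matrix mentioned in the Results. Sequence length, the extra candidate feature, would simply be appended as one more value per protein.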
Results
The combined method was able to identify the amino acids and bipeptides important to
each protein group. The sensitivity of classifying unknown sequences with the SVD
system based on the initial matrix with 400 rows ranged from 76% for proteins related
to Alzheimer disease to more than 90% for the other six groups. All specificities were
over 90%. After reconstruction of the frequency matrix using only the bipeptides
identified by the logistic regression, decomposition by SVD and subsequent rank
reduction, query retrieval had a sensitivity ranging from 74% for cytochrome to more
than 90% for globin, keratin and proteins related to cystic fibrosis and schizophrenia.
As with the initial matrix, all specificities in this setting were over 90%.
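To make the SVD-based retrieval and the reported measures concrete, the following NumPy sketch shows one possible reading of the procedure. The abstract does not specify how queries are scored, so the cosine similarity in the reduced space, the threshold of 0.9, the rank k=2, the function names and the toy matrix are all assumptions, not the authors' settings.

```python
import numpy as np

def reduce_rank(M, k):
    """Truncated SVD of a peptide-by-sequence matrix M (rows are peptides)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def retrieve(query, U_k, s_k, Vt_k, threshold=0.9):
    """Project a query vector into the rank-k space and compare it with
    every database sequence by cosine similarity (assumed scoring rule)."""
    q = U_k.T @ query                   # query coordinates in the reduced space
    docs = (s_k[:, None] * Vt_k).T      # one row per database sequence
    norms = np.linalg.norm(docs, axis=1) * np.linalg.norm(q) + 1e-12
    cos = docs @ q / norms
    return cos, cos >= threshold

def sensitivity_specificity(hits, is_case):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    hits, is_case = np.asarray(hits, bool), np.asarray(is_case, bool)
    tp = np.sum(hits & is_case)
    fn = np.sum(~hits & is_case)
    tn = np.sum(~hits & ~is_case)
    fp = np.sum(hits & ~is_case)
    return tp / (tp + fn), tn / (tn + fp)

# Toy matrix: 4 peptides (rows) x 4 sequences (columns); the first two
# sequences play the role of cases, the last two of controls.
M = np.array([[3, 2, 0, 0],
              [2, 3, 0, 0],
              [0, 0, 3, 2],
              [0, 0, 2, 3]], dtype=float)
U_k, s_k, Vt_k = reduce_rank(M, k=2)
cos, hits = retrieve(np.array([1.0, 1.0, 0.0, 0.0]), U_k, s_k, Vt_k)
sens, spec = sensitivity_specificity(hits, [True, True, False, False])
```

In this toy case the query shares peptides only with the two case sequences, so both are retrieved and neither control is. Real data would give the intermediate values reported above.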
Conclusions