Location
Ballroom
Start Date
4-5-2018 8:00 AM
End Date
4-5-2018 12:00 PM
Poster Number
51
Name of Project's Faculty Sponsor
Dr. Jeff Knisley
Faculty Sponsor's Department
Mathematics & Statistics
Type
Poster: Competitive
Project's Category
Natural Sciences
Abstract or Artist's Statement
Advancements in DNA microarray data sequencing have created the need for sophisticated machine learning algorithms and feature selection methods. Probabilistic graphical models, in particular, have been used to identify whether microarrays or genes cluster together in groups of individuals having a similar diagnosis. These clusters of genes are informative, but they can be misleading when every gene is used in the calculation. Feature reduction techniques are explored first; however, the size and nature of the data prevent traditional techniques from working efficiently. Our method uses the partial correlations between the features to create a precision matrix and to predict which associations between genes are most important for predicting a leukemia diagnosis. This technique reduces the number of genes to a fraction of the original. The partial correlations are then extended into a spectral clustering approach. In particular, a variety of Laplacian matrices are generated from the network of connections between features, and each implies a graphical network model of gene interconnectivity. Various edge-weighted and vertex-weighted Laplacians are considered and compared against each other in a probabilistic graphical modeling approach. The resulting multivariate Gaussian-distributed clusters are subsequently analyzed to determine which genes are activated in a patient with leukemia. Finally, the results of this method are compared against other feature engineering approaches to assess its accuracy on the leukemia data set. The initial results show that the partial correlation approach to feature selection predicts the diagnosis of a leukemia patient with almost the same accuracy as a machine learning algorithm applied to the full set of genes. More calculations of the precision matrix are needed to ensure that the set of most important genes is correct. Additionally, more machine learning algorithms will be implemented using the full and reduced data sets to further validate the current prediction accuracy of the partial correlation method.
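The pipeline described in the abstract (precision matrix, partial correlations, gene reduction, graph Laplacian) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the ridge regularizer, the threshold of 0.2, and the choice of the unnormalized Laplacian L = D - W are all assumptions made for illustration, and the data below are synthetic rather than the leukemia data set.

```python
import numpy as np


def partial_correlations(X, ridge=1e-2):
    """Partial correlation matrix for data X of shape (samples, genes).

    The ridge term is an assumption: it keeps the covariance matrix
    invertible when genes outnumber samples.
    """
    cov = np.cov(X, rowvar=False)
    precision = np.linalg.inv(cov + ridge * np.eye(cov.shape[0]))
    d = np.sqrt(np.diag(precision))
    # Standard identity: rho_ij = -P_ij / sqrt(P_ii * P_jj)
    pcorr = -precision / np.outer(d, d)
    np.fill_diagonal(pcorr, 1.0)
    return pcorr


def select_genes(pcorr, threshold=0.2):
    """Indices of genes with at least one strong partial correlation.

    The threshold is a hypothetical value, not one taken from the poster.
    """
    strong = np.abs(pcorr) >= threshold
    np.fill_diagonal(strong, False)
    return np.flatnonzero(strong.any(axis=1))


def laplacian_spectrum(pcorr, genes, k=2):
    """Eigenvalues and first k eigenvectors of the unnormalized Laplacian.

    Edge weights are absolute partial correlations between selected genes.
    """
    W = np.abs(pcorr[np.ix_(genes, genes)])
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W
    eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    return eigvals, eigvecs[:, :k]


if __name__ == "__main__":
    # Synthetic stand-in data: 40 samples, 12 genes, with genes 0 and 1 coupled.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 12))
    X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=40)

    pcorr = partial_correlations(X)
    genes = select_genes(pcorr)
    eigvals, embedding = laplacian_spectrum(pcorr, genes)
    print("selected genes:", genes)
    print("embedding shape:", embedding.shape)
```

The rows of the returned embedding could then be passed to a clustering step such as a Gaussian mixture model. Other weighted Laplacians, like those the abstract compares, would replace only the line that builds `L`.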
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.
A Comparison of Unsupervised Methods for DNA Microarray Leukemia Data