Email: zashwood at cs dot princeton dot edu
Adviser: Sebastian Seung
I am a first-year PhD student in the Computer Science Department at Princeton University, and I am interested in using machine learning to tackle longstanding questions in neuroscience. With the advent of terabyte-scale images of entire mouse brains and petabyte-scale electron microscopy datasets, I am working to process these data and find structure within them.
I completed my undergraduate degree (MPhys) in Mathematics and Theoretical Physics at the University of St Andrews in Scotland. Prior to coming to Princeton, I studied as a Robert T. Jones scholar at Emory University and worked for two years as a Research Fellow for Professor Daniel Ho at Stanford University. At Stanford, we assessed policy efficacy by carefully designing randomized controlled trials and applying appropriate statistical methods to measure a policy's effect.
I have various ongoing research projects in the Seung lab at Princeton. I also completed a number of projects for classes, some of which I will now describe:
In this project, I worked with Matt Myers to explore the Breast Cancer Proteomes dataset, associated with the Nature publication "Proteogenomics connects somatic mutations to signaling in Breast Cancer". We trained classifiers to predict breast cancer subtype (defined by a patient's mRNA expression for the "PAM50" genes) from the patient's expression levels for 12,553 proteins. In doing so, we explored whether a smaller group of proteins, compared to genes, could predict breast cancer subtype. Indeed, we found that a feature set of 14 proteins, only 2 of which were products of PAM50 genes, could be used to predict subtype with an accuracy of 86.2% and an F1 score of 85.8%. Because the biological data were highly sparse, we were also able to compare the performance of various feature selection methods. Methods that selected features directly, without linearly transforming the data before classification, substantially outperformed methods that did transform it (such as PCA and factor analysis): the accuracy gap between a classifier trained on PCA components and our best feature selection method was as much as 20%. While there is an intuitive explanation for this result (PCA and factor analysis do not preserve between-class variance), it was an interesting finding given the prevalence of these methods for dimensionality reduction.
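The contrast above can be illustrated with a minimal sketch: supervised feature selection (here scikit-learn's SelectKBest) versus an unsupervised linear projection (PCA) feeding the same classifier. The synthetic data, classifier choice, and parameter values are illustrative stand-ins, not the project's actual pipeline or dataset.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a wide proteomic matrix: many features,
# only a handful of them informative for the class label.
X, y = make_classification(n_samples=300, n_features=500,
                           n_informative=14, n_classes=4,
                           n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def evaluate(reducer):
    """Fit the reducer on the training split, then classify."""
    Z_tr = reducer.fit_transform(X_tr, y_tr)
    Z_te = reducer.transform(X_te)
    clf = LogisticRegression(max_iter=1000).fit(Z_tr, y_tr)
    return accuracy_score(y_te, clf.predict(Z_te))

# Both reduce to 14 dimensions; only SelectKBest uses the labels.
acc_select = evaluate(SelectKBest(f_classif, k=14))
acc_pca = evaluate(PCA(n_components=14))
print(f"SelectKBest: {acc_select:.3f}, PCA: {acc_pca:.3f}")
```

The key design difference is that SelectKBest scores each feature against the labels, while PCA chooses directions of maximum overall variance, which need not align with between-class variance.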
In this project, I worked with Diana Cai.
Probabilistic modeling provides essential statistical methodology for analyzing the massive amounts of data generated in modern applications such as online web services, biology, and healthcare. However, inferring the posterior distribution over the parameters of complex probabilistic models is challenging for massive, high-dimensional datasets. Here, we used a random hashing-based sketch to approximate the sufficient statistics of a model, and developed a more efficient inference algorithm using the sketch. We demonstrated our method in a topic modeling application: on a toy corpus of 100,000 words generated by a latent Dirichlet allocation model, we successfully recovered the topics in the documents, even when we incorporated the count-min sketch to reduce the size of the data structures used during inference.
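To illustrate the data structure at the core of this approach, here is a minimal count-min sketch using only the standard library. The width, depth, and hashing scheme are illustrative choices, not the parameters used in the project; queries can only overestimate a count, never underestimate it.

```python
import hashlib

class CountMinSketch:
    """Approximate counting in O(width * depth) space."""

    def __init__(self, width=1000, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One independent-ish hash per row, derived by salting with the row id.
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def query(self, item):
        # Collisions only inflate cells, so the minimum over rows
        # is the tightest available upper bound on the true count.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

cms = CountMinSketch()
for word in ["topic", "model", "topic", "sketch", "topic"]:
    cms.add(word)
print(cms.query("topic"))  # at least 3; exact unless a rare collision occurs
```

In the topic modeling setting, a structure like this can stand in for exact word-topic count tables, trading a bounded overestimate for a large reduction in memory.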