KPEC

About Me

Designing Algorithms To Aid Discovery by Chemical Robots (Source: ACS Cent. Sci. 2018, 4, 7, 793–804)

Education:

Undergrad (2011-2015): I majored in Chemistry and Biology at the Massachusetts Institute of Technology (MIT), where I participated in the UROP program, primarily in Prof. Jeremiah Johnson's Lab. I worked mostly on the synthesis of building blocks for polymers designed for drug-delivery applications. The main skills I gained were small molecule synthesis, polymer synthesis and polymer analysis.

Graduate I (2016-2019): This refers to part 1 of my PhD (in chemistry), which I did at Stanford University in Prof. Chaitan Khosla's Lab. I spent 3 years in this lab focused on 3 projects centered around DEBS, the canonical polyketide synthase: (1) I studied the conformational changes of a DEBS module using FRET, hoping to identify distinct catalytic states, (2) I studied the structure of hybrid DEBS module constructs to pin down interactions between module proteins, and (3) I studied small molecule activators of DEBS turnover. The main skills I gained were protein biochemistry (from cloning to protein expression to activity assays), FRET, protein crystallography and protein structure analysis.

Graduate II (2019-2022): This refers to part 2 of my PhD (still in chemistry), which was still at Stanford University but in Prof. Michael Snyder's Lab. I spent 3 years generating and analyzing multi-omic data for patients who had undergone a living-donor liver transplant and developed a rare post-operative condition called Segmental Graft Dysfunction. I used mass spectrometry to profile sera from sick and healthy patients and generated proteomic, lipidomic and metabolomic data, all of which was complemented with demographic and clinical information. The main skills I gained were mass spectrometry data generation and analysis (from sample prep to batch correction to differential expression), multi-omic data integration, linear-model-based differential abundance analysis, pathway analysis, and mortality prediction.

Graduate III (2022-Present): I moved out of the wet lab and decided to further solidify my data analysis and machine learning skills by getting a formal degree, so I enrolled in UIUC's Online Master of Computer Science (the Data Science track). I started the program in 2022 and it should end sometime in 2025. As I had already taken multiple statistics, machine learning and deep learning courses while at Stanford (during part 2 of my PhD), this is primarily to close any gaps that might have emerged during my ad hoc dive into the subject. So far, the most relevant courses I have taken are Deep Learning for Healthcare and Natural Language Processing.

Research Interests:

Mass Spectrometry + Deep Learning
I'm interested in self-supervised and semi-supervised learning models for small molecule mass spectrometry data.

For more details, see this GitHub repo: SSL4MS.

Protein-Ligand Binding Affinity Prediction
I'm interested in leveraging pre-trained language models for proteins and small molecules to predict the strength of protein-ligand interactions (IC50 values).

Publications:

You can find everything I have published in scientific journals on my Google Scholar page here or my ORCID page here.

Personal Research Projects

Mass Spectrometry + Deep Learning
For more details, see this GitHub repo: SSL4MS.

I'm interested in self-supervised and semi-supervised learning models for small molecule mass spectrometry data. A few clarifying points:

I emphasize self- and semi-supervised learning because there is a lot of unlabeled mass spec data for small molecules (unlabeled meaning we don't know what the underlying compound is), and models that could leverage all that data would open up a huge opportunity. In every untargeted metabolomics experiment, a large portion (sometimes more than 50%) of spectra cannot be matched to a source compound.

I emphasize small molecule mass spectrometry data because there are already lots of tools for proteomics data out there. Both ML and DL models exist for various proteomics-specific tasks, but metabolomics is still lagging behind on this front.

Something that makes small molecule mass spectral data difficult to work with is the choice of representation for the spectra. My take is to treat the spectrum as tabular data to generate embeddings and then treat the resulting embeddings as a sequence; the models I will consider employ 1D convolutions and transformer encoder blocks. For the self-supervised model, the main task will be spectrum reconstruction after the input has been corrupted using perturbations unique to mass spectrometry data. For the semi-supervised model, it will be a combination of reconstruction and prediction of fingerprints and/or molecular formula.
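As a rough illustration of the corruption step in the reconstruction task, here is a minimal Python sketch. The specific perturbations (random peak dropout, ppm-level m/z jitter, multiplicative intensity noise), the function name corrupt_spectrum, and all default values are assumptions chosen for illustration; they are not necessarily the perturbations used in SSL4MS.

```python
import random
from typing import List, Optional, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def corrupt_spectrum(
    peaks: List[Peak],
    drop_prob: float = 0.15,       # assumed chance of removing a peak
    mz_jitter_ppm: float = 10.0,   # assumed max m/z shift, in parts per million
    intensity_noise: float = 0.1,  # assumed max relative intensity change
    seed: Optional[int] = None,
) -> Tuple[List[Peak], List[bool]]:
    """Return a corrupted copy of a spectrum plus a per-peak dropout mask.

    mask[i] is True when original peak i was dropped. A reconstruction
    model would see the corrupted peaks and be trained to recover the
    original spectrum.
    """
    rng = random.Random(seed)
    corrupted: List[Peak] = []
    mask: List[bool] = []
    for mz, intensity in peaks:
        # Peak dropout: mimics peaks missing from a noisy acquisition.
        if rng.random() < drop_prob:
            mask.append(True)
            continue
        mask.append(False)
        # m/z jitter scaled to the peak's own mass, mimicking mass error.
        mz_shift = mz * rng.uniform(-mz_jitter_ppm, mz_jitter_ppm) * 1e-6
        # Multiplicative intensity noise, clipped at zero.
        scale = 1.0 + rng.uniform(-intensity_noise, intensity_noise)
        corrupted.append((mz + mz_shift, max(0.0, intensity * scale)))
    return corrupted, mask


# Toy spectrum with made-up peaks, for illustration only.
spectrum = [(100.0, 1.0), (200.0, 0.5), (300.0, 0.25)]
noisy, dropped = corrupt_spectrum(spectrum, seed=0)
```

The mask is returned alongside the corrupted peaks so that a training loop could, if desired, weight the reconstruction loss toward the peaks that were removed.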

What are some important tasks that could be solved by these models? Here are some possibilities (and I'll update them with comments as I work on them):

Spectrum Embedding & Search

Fingerprint Prediction

Molecular Formula Prediction

SMILES Prediction

Protein-Ligand Binding Affinity Prediction
Project description, including further links to posters, talks, publications.

Community

Teaching:

Outreach:

Equity & Inclusion:

Contact

Email: erazokp@gmail.com

Address: Bay Area, CA