Caroline Clark

Data Science / Machine Learning

Washington DC · (847) 987-0250 · caroline@humanperspectives.com

Starting my career at Google let me cultivate a passion for analytics and data visualization. I developed an interest in machine learning and AI and have recently been working in that space. I recently completed a data science bootcamp to round out my machine learning toolkit, and I am passionate about applying my technical and deep learning skills to solve problems. I am hard-working, collaborative, and enthusiastic.

Projects

Building a Convolutional Neural Network Using NumPy

What are the most critical components of an image classifier built using only NumPy?

The tools for developing neural networks are becoming increasingly streamlined and robust. This has the benefit of making deep learning very accessible, but it abstracts away neural network detail from the practitioner. Even with robust tools, training neural networks is still challenging. A good understanding of how neural nets work and converge may lead to greater project success.

This project builds a convolutional neural network from scratch using only NumPy. To achieve the forward and backward propagation steps necessary for training, the network is designed as a series of layers. In Python, these layers are represented as classes, with forward and backward propagation methods. A final network class stacks these layers together.

The library currently supports these layers: Convolutions, ReLU, Max Pooling, Flatten, Dense, and Softmax.
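The layer-as-class design described above can be sketched roughly as follows. This is an illustrative sketch, not the project's actual code: the class names, the `backward(grad_out, lr)` signature, the He-style initialization, and the plain SGD update inside `Dense` are all assumptions, and only two of the listed layer types are shown.

```python
import numpy as np


class ReLU:
    """Element-wise rectifier; forward caches a mask that backward reuses."""

    def forward(self, x):
        self.mask = x > 0
        return x * self.mask

    def backward(self, grad_out, lr):
        # No parameters to update; lr is accepted only for a uniform interface.
        return grad_out * self.mask


class Dense:
    """Fully connected layer that updates its own weights (assumed: plain SGD)."""

    def __init__(self, n_in, n_out, rng):
        self.W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x  # cache the input for the weight gradient
        return x @ self.W + self.b

    def backward(self, grad_out, lr):
        grad_in = grad_out @ self.W.T  # computed before W is changed
        self.W -= lr * (self.x.T @ grad_out)
        self.b -= lr * grad_out.sum(axis=0)
        return grad_in


class Network:
    """Stacks layers: forward runs them in order, backward in reverse."""

    def __init__(self, layers):
        self.layers = layers

    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x

    def backward(self, grad, lr):
        for layer in reversed(self.layers):
            grad = layer.backward(grad, lr)
        return grad


def softmax_cross_entropy(logits, labels):
    """Return mean cross-entropy loss and its gradient w.r.t. the logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    n = len(labels)
    loss = -np.log(probs[np.arange(n), labels]).mean()
    grad = probs.copy()
    grad[np.arange(n), labels] -= 1.0
    return loss, grad / n
```

A training step under these assumptions is then `logits = net.forward(X)`, `loss, grad = softmax_cross_entropy(logits, y)`, `net.backward(grad, lr)`, where `net = Network([Dense(...), ReLU(), Dense(...)])`.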

To see this project, click here

Research on Resolution and Human Classification Error

I co-authored a paper currently in pre-print: Deriving a Quantitative Relationship Between Resolution and Human Classification Error

For machine learning perception problems, human-level classification performance is used as an estimate of top algorithm performance. Knowing this 1) provides a benchmark for model performance, 2) tells a project manager what type of data to obtain for human labelers in order to get accurate labels, and 3) enables ground-truth analysis to be carried out smoothly. In this empirical study, we explored the relationship between resolution and human classification performance using the MNIST data set down-sampled to various resolutions.
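As a rough illustration of what down-sampling an image to a lower resolution can look like, the sketch below averages non-overlapping pixel blocks with NumPy. Block averaging, the `downsample` helper, and the chosen factors are assumptions made for illustration; the paper's actual down-sampling procedure may differ.

```python
import numpy as np


def downsample(image, factor):
    """Average non-overlapping factor x factor blocks of a 2-D image."""
    h, w = image.shape
    if h % factor or w % factor:
        raise ValueError("image sides must be divisible by factor")
    # Split each axis into (blocks, pixels-per-block), then average within blocks.
    blocks = image.reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))


# Synthetic stand-in for a 28x28 MNIST digit (not real data).
image = np.arange(28 * 28, dtype=float).reshape(28, 28)
for factor in (1, 2, 4, 7):
    print(factor, downsample(image, factor).shape)
```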

The quantitative heuristic we derived could prove useful for predicting machine model performance, predicting data storage requirements, and saving valuable resources in the deployment of machine learning projects.

To see this paper on arXiv.org, click here

Predicting COVID-19 Using Demographic Data

Is it possible to predict COVID-19 severity using demographic data?

Studies have indicated that COVID-19 affects parts of the population more severely, such as older individuals and males. At the same time, the World Health Organization has stated that lockdowns and other restrictive measures meant to stop the spread of the virus "can have a profound negative impact on individuals, communities, and societies by bringing social and economic life to a near stop."

This project examines county-level COVID-19 testing, case, and death data alongside county-level demographic data for five U.S. states. Demographic data such as age, gender, race, and income per capita is collected via the US Census Bureau's API. The data is modeled both across states and individually.

These models indicate that publicly available information has the potential to inform resource allocation and targeted preventative measures during the COVID-19 pandemic.

To see this project, click here

Text Classification With Natural Language Processing

Can a model be built to accurately classify text as belonging to either the r/artificial or r/datascience subreddits?

Subreddit post data was collected using the Pushshift API. The data was cleaned, and exploratory data analysis was conducted, including a look at the overlap in top words between the two subreddits.

Several models were compared, with Multinomial Naive Bayes used as a baseline. The best-performing model was Logistic Regression with TfidfVectorizer. It achieved 92.66% accuracy on test data, with a recall score of 93.08% and precision of 93.01%. Interpretable word coefficients were extracted from the model and visualized.

To see this project, click here

Skills

Programming Languages, Skills, & Tools

Machine Learning
Deep Learning
Data Visualization
Web Scraping
TensorFlow
Keras
NumPy
Pandas
Matplotlib
Seaborn
OpenCV
SQL
HTML
Spark
Tableau

Interests

A first-generation daughter of immigrants and entrepreneurs, I was raised with an appreciation both for hard work and different cultures. I have traveled to or lived in over 20 countries, with special places in my heart for Poland, Indonesia, and Mongolia. I've also volunteered with youth and sustainable agriculture organizations in Ireland, Australia, and Wyoming.

I love yoga, horseback riding, hiking, backpacking, and cycling. When indoors, I'm into sci-fi books and television shows, and debates about artificial intelligence.