This project fulfilled the senior thesis requirement for my undergraduate degree in Mathematics at Pomona College. It involved a yearlong independent study of a chosen research topic, three presentations, and a final write-up. I was advised throughout by Mike Izbicki.
To process natural language data computationally, one must first generate an adequate representation of it. One way to achieve this is to represent each word in the corpus as a vector of fixed (but arbitrarily chosen) length. Together these vectors form a space, and the goal is to construct them so that semantically similar words have similar vector representations. A model is good if the distances between vectors reflect the semantic relationships of the words they represent.
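As a concrete illustration, closeness between word vectors is usually measured with cosine similarity. The Python sketch below uses made-up four-dimensional vectors purely for illustration (real embeddings have hundreds of dimensions); a related pair like rex/regina should score higher than an unrelated pair like rex/arbor.

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two word vectors; values near 1 mean "similar".
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy embeddings, for illustration only.
rex    = np.array([0.9, 0.1, 0.3, 0.7])   # "king"
regina = np.array([0.8, 0.2, 0.4, 0.6])   # "queen"
arbor  = np.array([0.1, 0.9, 0.8, 0.1])   # "tree"

print(cosine_similarity(rex, regina))  # high: semantically related
print(cosine_similarity(rex, arbor))   # lower: semantically unrelated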
The use of embeddings to model natural language data is a critical component of many NLP tasks; however, the vast majority of research has focused on generating embeddings for English. In this project we apply word2vec to generate an embedding model for Latin. We chose Latin because of its distinct linguistic properties as well as its low-resource and historical characteristics. The project covers all the steps necessary to create a word embedding model, from corpus selection and creation, to training, to evaluation.
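For reference, below is a minimal sketch of what training such a model can look like using gensim's word2vec implementation. The corpus file name, preprocessing, and hyperparameters are illustrative assumptions, not the exact pipeline used in the thesis.

from gensim.models import Word2Vec

# Hypothetical corpus file: one pre-tokenized Latin sentence per line.
with open("latin_corpus.txt", encoding="utf-8") as f:
    sentences = [line.lower().split() for line in f]

model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimension
    window=5,         # context window size
    min_count=5,      # ignore very rare words
    sg=1,             # skip-gram rather than CBOW
    workers=4,
)

# Nearest neighbors in the learned space should be semantically related words.
print(model.wv.most_similar("rex", topn=5))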