Summary

The ability of a program to peak its runners well at the end of the season often determines national champions. To evaluate how well programs peak their athletes, we simulate a hypothetical “national meet” from cross country results over two years, and compare the simulated standings against the real results of the national meet to identify overperforming and underperforming programs. To run the simulation, we collect results for each relevant athlete in the Division at each meet. For each athlete, we maintain a vector of adjusted times from the current season; we then sample from a distribution built from that vector, sort the athletes into a meet ranking, and score the ranking as a cross country meet. Comparing the actual nationals result from that year with our simulated result tells us how well a team peaked. Repeating this over a number of years lets us evaluate different programs: we assign each program a score based on how consistently it peaked at the national meet and how far it exceeded our simulated expectation of its finish. We also created a Shiny app that lets users run the simulation themselves and assess the stability of team places.


The first step is to build a scraper capable of retrieving data from the official database of NCAA cross country and track and field times, TFRRS. The database is very robust but has no API for statisticians to pull data in a reasonable way, so this turns out to be quite difficult. TFRRS also threatens users with cease-and-desist messages if it detects too much traffic from one user, or if it notices a blog post on the web about scraping it. Despite these draconian data-usage policies, TFRRS remains far and away the most robust way to access data on collegiate runners.

The inaccessibility of TFRRS data raises the question of to whom the data truly belong. The athletes who ran the very times in the database are unable to retrieve them. Furthermore, the data are produced by third-party timing services that post meet results on their own websites; TFRRS only serves as an aggregator. As such, we felt no real ethical qualms about scraping TFRRS.

This first code chunk compiles results from each of the regional meets into a single data frame. To qualify for the national meet, each team must compete in a regional meet: two teams automatically qualify from each region, and the remainder of the nationals field is selected through a well-documented selection process.

### Initial scraping code
urls <- c("https://www.tfrrs.org/results/xc/16561", "https://www.tfrrs.org/results/xc/16562",
           "https://www.tfrrs.org/results/xc/16563", "https://www.tfrrs.org/results/xc/16564",
          "https://www.tfrrs.org/results/xc/16565", "https://www.tfrrs.org/results/xc/16567", 
          "https://www.tfrrs.org/results/xc/16568", "https://www.tfrrs.org/results/xc/16724")

nationals_urls <- c("https://www.tfrrs.org/results/xc/15028", "https://www.tfrrs.org/results/xc/13424",
                    "https://www.tfrrs.org/results/xc/9349/", "https://www.tfrrs.org/results/xc/11260/",
                    "https://www.tfrrs.org/results/xc/6216/")


df <- data.frame(PL=integer(),
                 NAME=character(),
                 YEAR=character(),
                 TEAM=character(),
                 TIME=character(),
                 COURSE=character(),
                 DATE=character()
                 )

for (url in urls) {
  webpage <- read_html(url)
  print("read webpage")
  # Every results table on the page, plus the headers and metadata nodes
  table_of_tables <- xml_find_all(webpage, "//table") %>% html_table()
  titles <- html_nodes(webpage, "h3")
  courses <- html_nodes(webpage, ".inline-block")
  # Find the men's 8K table by matching its header text
  for (i in seq_along(titles)) {
    if (grepl("8K|8000|8,000|8k", html_text(titles[[i]]))) {
      current_data <- table_of_tables[[i]] %>% select(NAME, YEAR, TEAM, TIME)
      # The fourth .inline-block node holds the date, the fifth the course name
      current_data <- current_data %>% mutate(DATE = html_text(courses[[4]]))
      if (str_length(html_text(courses[[5]])) > 4) {
        current_data <- current_data %>% mutate(COURSE = html_text(courses[[5]]))
      } else {
        current_data <- current_data %>% mutate(COURSE = NA_character_)
      }
      df <- rbind(df, current_data)
      break
    }
  }
}

Now we are ready to build a more relevant and complex data set, starting from the data gathered above. This code chunk grabs the names of all teams that competed at the national meet.

###########
##Meet Results for National Qualifiers
###########

#first get team names
df <- data.frame(PL=integer(),
                 NAME=character(),
                 YEAR=character(),
                 TEAM=character(),
                 TIME=character(),
                 COURSE=character(),
                 DATE=character()
)

url <- "https://www.tfrrs.org/results/xc/16726/NCAA_Division_III_Cross_Country_Championships"
webpage <- read_html(url)
print("read webpage")
table_of_tables <- xml_find_all(webpage, "//table") %>% html_table()
titles <- html_nodes(webpage, "h3")
courses <- html_nodes(webpage, ".inline-block")
# Find the men's 8K table by matching its header text
for (i in seq_along(titles)) {
  if (grepl("8K|8000|8,000|8k", html_text(titles[[i]]))) {
    current_data <- table_of_tables[[i]] %>% select(NAME, YEAR, TEAM, TIME)
    current_data <- current_data %>% mutate(DATE = html_text(courses[[4]]))
    if (str_length(html_text(courses[[5]])) > 4) {
      current_data <- current_data %>% mutate(COURSE = html_text(courses[[5]]))
    } else {
      current_data <- current_data %>% mutate(COURSE = NA_character_)
    }
    df <- rbind(df, current_data)
    break
  }
}

teamnames <- unique(df$TEAM)

We are now ready to build a more interesting data set. The NCAA cross-country regular season is about eight weeks long, and most teams compete in four to six meets over that stretch. To find these meets, we scrape each qualifying team's TFRRS page, which lists the meets at which the team competed.

2019

We grab all of the results from each meet that an NCAA-qualifying team raced during the regular season. Then we filter for the HTML table whose title contains some variation of 8K or 8000m, since this is the distance the men race.
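As an aside, the distance filter can be collapsed into a single case-insensitive pattern. The sample titles below are made up for illustration.

```r
# Hypothetical header titles, for illustration only
titles <- c("Men 8k Run CC", "Women 6K Run", "Men 8,000 Meters", "Men 5000m")

# One case-insensitive pattern covers 8K, 8k, 8000, and 8,000
is_8k <- grepl("8k|8,?000", titles, ignore.case = TRUE)
is_8k  # TRUE FALSE TRUE FALSE
```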

Now, every course runs a little differently. Many factors contribute to this, such as course surface, course condition, weather, and elevation profile. However, since athletes run on multiple courses per year, there is a reasonable way to adjust for these differences. Bijan Mazaheri, a Caltech PhD student, wrote a nifty program to generate these adjustments. The idea is to look at two courses at a time and pick all athletes who ran on both. Then take the difference in times for each such athlete, and average those differences to get a relative difference between the two courses. This is the pairwise adjustment for the pair: to convert a time from the first course to the second, add the adjustment factor to the athlete's time.

Repeat this for each pair of courses with overlapping runners. Once all of the pairwise adjustments have been calculated, pick one course to serve as a baseline, and then use least squares to fit a single adjustment for each course.
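To make the least-squares step concrete, here is a minimal sketch under made-up numbers: `pairwise` is a hypothetical matrix of pairwise adjustments (in seconds) between three invented courses, and we solve for one offset per course with course A fixed as the baseline.

```r
# Hypothetical pairwise adjustments (seconds) between three made-up courses
pairwise <- matrix(c(  0, 30, -10,
                     -30,  0, -45,
                      10, 45,   0),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("A", "B", "C"), c("A", "B", "C")))
courses <- rownames(pairwise)

# Each ordered pair (i, j) contributes one equation:
#   offset_i - offset_j = pairwise[i, j]
rows <- list()
rhs <- c()
for (i in seq_along(courses)) {
  for (j in seq_along(courses)) {
    if (i == j) next
    eq <- numeric(length(courses))
    eq[i] <- 1
    eq[j] <- -1
    rows[[length(rows) + 1]] <- eq
    rhs <- c(rhs, pairwise[i, j])
  }
}

# Fix course A's offset at 0 by dropping its column, then solve the
# overdetermined system in the least-squares sense via QR
X <- do.call(rbind, rows)[, -1, drop = FALSE]
offsets <- c(0, qr.solve(X, rhs))
names(offsets) <- courses
```

Because real pairwise adjustments are noisy and slightly inconsistent, the fitted offsets will not reproduce every pairwise entry exactly; least squares spreads the disagreement across courses.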

Since Bijan already did all this work, we used his adjustments for 2019, and then generated our own adjustments in a slightly simpler way for 2018.

###### Bijan Race Adjustments

webpage<-read_html("https://bijanmazaheri.wordpress.com/2019/11/18/2019-diii-meet-adjustments-men/")
table_of_tables <- xml_find_all(webpage, "//table") %>% html_table
course_corrections<-as.data.frame(table_of_tables[[1]])
colnames(course_corrections)<-c("Meet", "Score")
course_corrections<-course_corrections %>% filter(Meet!="Meet")

Then we filter the data further to select only each team's top seven runners (those who ran at the national meet). We also make a number of small modifications to clean up meet names and correct a mistake with a single Williams meet (domain-specific knowledge!). We omit this chunk.


We convert the times to seconds, and then apply the adjustment time.
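If lubridate is not available, the conversion can be sketched in base R; `to_seconds` is a hypothetical helper for a single "MM:SS.s" string, not part of the pipeline above.

```r
# Hypothetical base-R stand-in for period_to_seconds(ms(...)),
# handling a single "MM:SS.s" time string
to_seconds <- function(t) {
  parts <- as.numeric(strsplit(t, ":", fixed = TRUE)[[1]])
  parts[1] * 60 + parts[2]
}

to_seconds("25:30.1")  # 1530.1
```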

### Lubridate over the list

library(lubridate)
filtered_results <- filtered_results %>%
  mutate(ADJ_TIME = period_to_seconds(ms(TIME)) - as.numeric(Score))
write.csv(filtered_results, "2019_Results_Adjusted.csv")

Simulation time! For each athlete, we sample from a normal distribution centered on their average adjusted time from the season, with a standard deviation equal to their observed variability throughout the year. One thing we could change is the exact distribution we sample from: a one-off horrible race is far more likely than a one-off incredible race, so we could use a right-skewed distribution instead of a strictly normal one to model this. That would prevent a horrible, horrible day from inflating the fast side of the distribution and handing a runner an excellent simulated result. More simply: having a very bad day should not increase the upside potential of a runner.
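One way to implement that idea, sketched with a hypothetical helper: a standardized gamma draw has mean 0 and standard deviation 1 but positive skew, so slow outliers are more likely than fast ones. The function name and parameter values below are our own invention.

```r
# Hypothetical right-skewed replacement for rnorm() in the simulation.
# (rgamma(n, shape) - shape) / sqrt(shape) has mean 0 and sd 1;
# its skewness is 2 / sqrt(shape), so a smaller shape gives a heavier slow tail.
rskewed_time <- function(n, mean, sd, shape = 4) {
  mean + sd * (rgamma(n, shape = shape, rate = 1) - shape) / sqrt(shape)
}

set.seed(11)
times <- rskewed_time(10000, mean = 1500, sd = 25)
```

This drops into the simulation by swapping `rnorm(...)` for `rskewed_time(...)` inside the `mutate` call, since it vectorizes over per-athlete means and standard deviations the same way.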

set.seed(11)
sim_stats <- list()
total_stats <- list()
filtered_results <- read.csv("2019_Results_Adjusted.csv")

# The per-athlete mean and sd do not change between simulations,
# so compute them once
individuals <- filtered_results %>%
  group_by(NAME) %>%
  summarize(MEAN = mean(ADJ_TIME, na.rm = TRUE),
            SD = sd(ADJ_TIME, na.rm = TRUE))
# Join back to the full data set to recover each athlete's team
teams <- left_join(individuals, filtered_results, by = "NAME") %>%
  distinct(NAME, SD, MEAN, TEAM)

for (count in seq_len(100)) {
  # Simulate a time for each athlete and rank the field
  sims <- teams %>% mutate(SIMTIME = rnorm(n(), MEAN, SD))
  ordered_results_simulated <- sims %>% arrange(SIMTIME)
  ordered_results_simulated$SCORE <- seq_len(nrow(ordered_results_simulated))

  # Score like a cross country meet: sum of each team's top-5 places
  team_scores <- ordered_results_simulated %>%
    group_by(TEAM) %>%
    summarize(TEAM_SCORE = sum(head(SCORE, 5))) %>%
    arrange(TEAM_SCORE)

  sim_stats[[count]] <- data.frame(TEAMS = team_scores$TEAM[1:4],
                                   PLACES = 1:4,
                                   SCORES = team_scores$TEAM_SCORE[1:4])
  total_stats[[count]] <- data.frame(TEAMS = team_scores$TEAM,
                                     PLACES = seq_along(team_scores$TEAM),
                                     SCORES = team_scores$TEAM_SCORE)
}

sim_stats <- do.call(rbind, sim_stats)

total_stats <- do.call(rbind, total_stats)

Here’s an interactive plot made with Plotly.

#######
#Interactive Plotly Bar Chart w/ Various Place Probabilities
#######

b <- sim_stats %>%
  group_by(TEAMS) %>%
  summarize(On_Podium = n() / 100,
            P_first  = sum(PLACES == 1) / 100,
            P_second = sum(PLACES == 2) / 100,
            P_third  = sum(PLACES == 3) / 100,
            P_fourth = sum(PLACES == 4) / 100) %>%
  dplyr::arrange(desc(On_Podium))

# Build the hover text for each team's bar
text <- c()
for (i in 1:dim(b)[1]) {
  text <- c(text, paste("Prob Podium:", b[i, 2], "\n",
                        "Prob 1st:", b[i, 3], "\n",
                        "Prob 2nd:", b[i, 4], "\n",
                        "Prob 3rd:", b[i, 5], "\n",
                        "Prob 4th:", b[i, 6]))
}

dat <- data.frame(x = b$TEAMS, y = b$On_Podium, text = text)
dat$x <- factor(dat$x, levels = as.character(b$TEAMS))

q <- plot_ly(dat, x = ~x, y = ~y, type = 'bar', text = text,
        marker = list(color = c('rgb(85,13,152)', 'rgb(178,8,57)', 'rgb(232,168,8)', 'rgb(188,0,0)', 'rgb(13,43,129)', 'rgb(247,138,12)', 'rgb(31,56,107)', 'rgb(162,31,75)', 'rgb(0,115,96)', 'rgb(134,43,45)', 'rgb(245,130,43)', 'rgb(254,209,76)'),
                      line = list(color = 'rgb(8,48,107)',
                                  width = 1.5))) %>%
  layout(title = "Results of National Meet Simulations 2019",
         xaxis = list(title = ""),
         yaxis = list(title = "Probability of Podium"))

q
#chart_link = api_create(q, filename="bar-text")
#chart_link

Below we plot each team's ranking in the simulation minus its true finish at the national meet, giving us the "rankings jump" of each team.
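The jump itself can be sketched as follows; both data frames below are made-up stand-ins for the simulation output (`total_stats`) and the real nationals standings, with invented places.

```r
library(dplyr)

# Made-up stand-ins: three simulations' worth of places for two teams,
# and their (invented) actual nationals finishes
total_stats <- data.frame(
  TEAMS  = rep(c("Pomona-Pitzer", "Williams"), each = 3),
  PLACES = c(2, 3, 1, 1, 2, 6))
actual_places <- data.frame(
  TEAMS  = c("Pomona-Pitzer", "Williams"),
  ACTUAL = c(1, 4))

# Mean simulated place minus actual place: positive = beat the simulation
jumps <- total_stats %>%
  group_by(TEAMS) %>%
  summarize(MEAN_PLACE = mean(PLACES)) %>%
  inner_join(actual_places, by = "TEAMS") %>%
  mutate(JUMP = MEAN_PLACE - ACTUAL) %>%
  arrange(desc(JUMP))
```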


Same plot as above, sorted by the significance of the jump given the range of the team’s results in simulations.

## # A tibble: 5 x 2
##   TEAMS                  mean_diff
##   <fct>                      <dbl>
## 1 North Central (Ill.)        25.4
## 2 Pomona-Pitzer               21.5
## 3 Williams                    43.7
## 4 Claremont-Mudd-Scripps      19.4
## 5 SUNY Geneseo                12

2018

Repeat all of the above for the 2018 season.

For course corrections in 2018, we use the nationals course as a baseline instead of computing pairwise adjustments and fitting them with least squares. For each course, we take the runners who ran both it and the nationals course and use the median time difference as the adjustment factor.

### Correction = median difference in times between the national meet and the
### meet in question (for those who ran both)
resdf <- read.csv("2018resdf.csv")
resdf <- resdf %>% distinct()
meets <- unique(resdf$MEETNAME)
corrected <- matrix(ncol = 86)
for (i in seq_along(meets)) {
  meet_one <- resdf %>% filter(MEETNAME == meets[i])
  meet_one$TIME <- period_to_seconds(ms(meet_one$TIME))
  tmpvec <- c()
  for (j in seq_along(meets)) {
    meet_two <- resdf %>% filter(MEETNAME == meets[j]) %>% distinct()
    meet_two$TIME <- period_to_seconds(ms(meet_two$TIME))
    # Runners who appear in both meets provide the comparison
    tmp <- inner_join(meet_one, meet_two, by = "NAME")
    tmpvec <- c(tmpvec, median(tmp$TIME.x - tmp$TIME.y, na.rm = TRUE))
  }
  corrected <- rbind(corrected, tmpvec)
}

# Drop the empty initialization row, then keep the first row: each entry is
# a meet's median offset relative to the national meet (the first meet listed)
corrected <- corrected[2:dim(corrected)[1], ]
corrections <- corrected[1, ]
names(corrections) <- meets
corrections <- corrections[!is.na(corrections)]

########
#Adjust course times
########
# Nationals 2018
url <- "https://www.tfrrs.org/results/xc/15028/NCAA_Division_III_Cross_Country_Championships%250A"
webpage <- read_html(url)
table_of_tables <- xml_find_all(webpage, "//table") %>% html_table()
runners <- table_of_tables[[4]]$NAME

# Factor in course adjustments
all_results <- read.csv("2018_all_results.csv")
all_results <- all_results %>% filter(NAME %in% runners)
indices <- match(all_results$MEETNAME, str_trim(names(corrections)))
adjusted_results <- all_results %>%
  mutate(ADJ_TIME = period_to_seconds(ms(TIME)) + corrections[indices])
#write.csv(adjusted_results, "2018_Results_Adjusted.csv")

# Split results by athlete name
split_results_18 <- split(adjusted_results, f = adjusted_results$NAME, drop = TRUE)

We simulate again in the same way, omitting the code chunk.

b <- sim_stats %>%
  group_by(TEAMS) %>%
  summarize(On_Podium = n() / 100,
            P_first  = sum(PLACES == 1) / 100,
            P_second = sum(PLACES == 2) / 100,
            P_third  = sum(PLACES == 3) / 100,
            P_fourth = sum(PLACES == 4) / 100) %>%
  dplyr::arrange(desc(On_Podium))

# Build the hover text for each team's bar
text <- c()
for (i in 1:dim(b)[1]) {
  text <- c(text, paste("Prob Podium:", b[i, 2], "\n",
                        "Prob 1st:", b[i, 3], "\n",
                        "Prob 2nd:", b[i, 4], "\n",
                        "Prob 3rd:", b[i, 5], "\n",
                        "Prob 4th:", b[i, 6]))
}

dat <- data.frame(x = b$TEAMS, y = b$On_Podium, text = text)
dat$x <- factor(dat$x, levels = as.character(b$TEAMS))

h <- plot_ly(dat, x = ~x, y = ~y, type = 'bar', text = text,
        marker = list(color = c('rgb(178,8,57)', 'rgb(0,115,96)', 'rgb(162,31,75)', 'rgb(204,21,69)', 'rgb(242,206,15)', 'rgb(13,43,129)', 'rgb(71,11,119)', 'rgb(31,56,107)', 'rgb(99,35,153)','rgb(226,30,21)', 'rgb(245,130,43)', 'rgb(188,0,0)', 'rgb(124,20,21)', 'rgb(238,173,18)'),
                      line = list(color = 'rgb(8,48,107)',
                                  width = 1.5))) %>%
  layout(title = "Results of National Meet Simulations 2018",
         xaxis = list(title = ""),
         yaxis = list(title = "Probability of Podium"))

h
#chart_link = api_create(h, filename="interactive plot 2018 natties")
#chart_link

Exploring the Potential Effect of Travel Distance

We suspected that teams which travel farther for nationals have a harder time performing to expectation at the national meet. We use this map to illustrate the travel distance for each team. One thing of note is that most teams in the Midwest drive to nationals, almost regardless of where the meet is held. Air travel is stressful for recovery, which could affect runners' performance at the national meet.

Conclusion

We found a few compelling results with this project. 1) It is a reasonable assumption to model runners' performances as draws from a normal distribution. 2) Race adjustments need not be as computationally intensive as using least squares to fit an adjustment time to a matrix of pairwise adjustments. 3) While our simulations did a good job of predicting podium finishes, predictions for races carry a lot of uncertainty, as evidenced by the range of finishing places over 100 simulations. The most challenging and time-consuming part of this project was building the scraper to reliably parse results from TFRRS.
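The normality claim can be spot-checked per athlete. The vector below is a made-up season of adjusted times; a large Shapiro-Wilk p-value means the sample shows no evidence against normality.

```r
# Hypothetical adjusted times (seconds) for one athlete's season
athlete_times <- c(1492, 1505, 1488, 1510, 1497, 1501)

# Shapiro-Wilk test of normality; a small p-value would flag a poor fit
shapiro.test(athlete_times)$p.value
```

With only four to six races per athlete, the test has little power, so it can only flag gross departures from normality rather than confirm the model.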

Among the results, we note that Carnegie Mellon consistently peaked poorly at the national meet despite high projected finishes in our simulations, while Pomona-Pitzer and Wash U., among others, tend to perform at or above expectations at the national meet.