Mapping 16 Equity Indicators to the New York City Schools Public Datasets: Access, Opportunity, and Outcomes

Alex J. Bowers*, Yeonsoo Choi, Huijie Shi, Feiyi Sun, Haimei Sun, Jonathan Williams, Sarah Weeks

Teachers College, Columbia University

This document is the online R markdown supplement to the report titled ā€œMapping 16 Equity Indicators to the New York City Schools Public Datasets: Access, Opportunity, and Outcomesā€.

This document contains the R code for each of the public data figures in the report, in the order presented in the report. This document includes seven individual markdowns written by the coauthors of the report.

Note: For more information on each of the visualizations, please refer to the full report published online.

To cite this report supplement (authors in alphabetical order by last name): Bowers, A.J., Choi, Y., Shi, H., Sun, F., Sun, H., Williams, J., Weeks, S. (2022). Online Supplement to Mapping 16 Equity Indicators to the New York City Schools Public Datasets: Access, Opportunity, and Outcomes. Teachers College, Columbia University. New York, NY.

CC BY NC Creative Commons: Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/


Box and Whisker Plots

Feiyi Sun

# Load Package
library(openxlsx)
library(tidyverse)
## ā”€ā”€ Attaching packages ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ tidyverse 1.3.1 ā”€ā”€
## āœ” ggplot2 3.3.6     āœ” purrr   0.3.4
## āœ” tibble  3.1.7     āœ” dplyr   1.0.9
## āœ” tidyr   1.2.0     āœ” stringr 1.4.0
## āœ” readr   2.1.2     āœ” forcats 0.5.1
## ā”€ā”€ Conflicts ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ tidyverse_conflicts() ā”€ā”€
## āœ– dplyr::filter() masks stats::filter()
## āœ– dplyr::lag()    masks stats::lag()
library(visdat)
library(readxl)
# Include color-blind friendly color palette
cbPalette <- c( "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2")
# Create text function for ggplot
theme_text <- function(){
  theme(text = element_text(family = "serif", color = "gray25"),
        # change background fill and color
        plot.title = element_text(face = "bold"),
        plot.caption = element_text(face = "bold", hjust = 1))
}

Introduction

The box and whisker plot displays five-number summary of the data: minimum, first quartile, median, third quartile, and maximum. I borrowed the code of box and whisker plot from Bulut and Desjardinsā€™s Visualizing big data chapter and made a few changes to adapt to this project. The x-axis uses outcome variables such as on-time graduation rate and performance on tests (percent proficiency and average regent scores). The y-axis displays 32 districts ordered based on the median values of outcome variables. Color of the box plots is by borough.

On-time graduation (outcome)

Data Import

Go to New York City Department of Education Graduation Results webpage first. Scroll down to the bottom. Under ā€œGraduation Results for Cohorts 2001 to 2017ā€ section, right click on the ā€œSchool (Excel file)ā€ and ā€œCopy Link Addressā€. Select the ā€œAllā€ sheet within that file which includes the on-time graduation rate for each school in New York City.

# Load excel file directly from the website
grad1 <- read.xlsx("https://infohub.nyced.org/docs/default-source/default-document-library/2021-graduation_rates_public_school.xlsx", sheet= "All")

# Load excel file from local directory
# grad1 <- read_excel("2021-graduation_rates_public_school.xlsx", sheet = "All")

Variable Selection

I am using the data from the Cohort of 2015 since the Cohort of 2015 entered 9th grade in the 2015-2016 school year. If graduated on time (4-year), students would leave schools in 2018-2019 school year which was pre-pandemic period (Covid began in 2020 Spring). Also, there are two types of on-time graduates: 1) 4-Year June 2) 4-Year August. The 4-Year August cohort is cumulative (Business Rules | NYSED Data Site, n.d.). Through observation, the rate for those who graduated on time in August is a bit higher than that for those who graduated on time in June which confirms the cumulative rule (code and output are not included below).

# Split DBN column by number of characters using `substr` function
# District = first two characters of DBN column (start at 1, stop at 2)
# Borough = the third character of DBN column (start at 3, stop at 3)
grad2 <- transform(grad1, District = substr(DBN, 1, 2), Borough = substr(DBN, 3,3)) %>% 
  # Move the new District and Borough columns to the left 
  relocate("District", "Borough") %>% 
  # Select Cohort Year 2015
  subset(Cohort.Year == 2015) %>% 
  # Select 4-year August graduates
  subset(Cohort =="4 year August") %>% 
  # Select the first 10 columns
  select(1:10)

# Replace borough abbreviation with full name
grad2["Borough"][grad2["Borough"] == "K"] <- "Brooklyn"
grad2["Borough"][grad2["Borough"] == "M"] <- "Manhattan"
grad2["Borough"][grad2["Borough"] == "Q"] <- "Queens"
grad2["Borough"][grad2["Borough"] == "R"] <- "Staten Island"
grad2["Borough"][grad2["Borough"] == "X"] <- "Bronx"

Data Cleaning

There are no missing values. Suppressed values under ā€œ# Gradā€ column are coded as ā€œsā€ which means those schools have fewer than 5 students who first entered 9th grade in 2015-2016 school year (the Cohort of 2015) and the number of students who graduated from those schools in 2018-2019 school year are replaced with ā€œsā€. There are 11 out of 480 (2.29%) suppressed values which is smaller than 5% (codes are provided below but output are not shown). Thus, it is reasonable to delete them.

vis_miss(grad2) # no missing values
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## Please use `gather()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

# Feel free to observe the data below
# table(grad2$X..Total.Cohort) # no suppressed data
# table(grad2$X..Grads) # 11 suppressed
# grad2.s <- grad2[grad2$X..Grads == "s",] # 11 10
# View(grad2.s)
# grad2.st <- grad2[grad2$X..Total.Cohort <=5,] # 11 10
# View(grad2.st)
# Delete suppressed values from grad2
grad3 <- grad2 %>% 
  subset(X..Grads != "s")
# dim(grad2) # 480 10
# dim(grad3) # 469 10
# 2.29% missingness due to suppression

Data Visualization

ggplot(data = grad3,
       mapping = aes(y = reorder(District, as.numeric(X..Grads.1), median), 
                     x = as.numeric(X..Grads.1), 
                     fill = Borough)) +
  geom_boxplot() +
  scale_fill_manual(values = cbPalette) + 
  labs(y = "District", 
       x = "On-time Graduation Rate (%)",
       title = "Median On-time Graduation Rates in A Descending Order",
       caption = "Use the data for 4-year August graduates of the Cohort of 2015 from `2021-graduation_rates_public_school` dataset.") + 
  theme_bw() + 
  theme_text()

I focused on the on-time graduation rates for traditional public high schools. Since there are some extreme values such as 0% for District 7, 8, and 12 and lower than 25% for some other districts, the average on-time graduation rates can be distorted. Instead, I ordered the box plots based on the median on-time graduation rate for each district. Going from the top to the bottom, the median on-time graduation rate decreases. We can find out that all districts in Queens rank in the top half of all 32 districts, while all districts in the Bronx rank at the bottom two-third of all districts. Five districts that behave the worst are from Bronx and Brooklyn, while top five districts are from Manhattan and Queens.

The range of on-time graduation rates are pretty wide for District 1, 7, 15, and 23. There are about 28 schools with on-time graduation rates lower than 25%, 4 of which have even no students graduated on time in August during 2018-2019 school year. These schools are from District 7, 8, and 12. Through observation of the data (codes are provided below but outputs are not shown), prior to the Cohort of 2015, on-time graduation rates for these four schools were not 0%. The number of students in the Cohort of 2015 was dramatically fewer than the number of students in previous Cohorts. By consulting New York City Department of Education website, these schools were phasing out towards closing between 2015-2016 school year and 2018-2019 school year.

# If you are curious about which schools have fewer than 25% on-time graduation rate, feel free to try out the following code.

# Schools with on-time graduation rate as 0%
# grad3.0 <- grad3[grad3$X..Grads.1 == 0,]
# nrow(grad3.0) # 4
# View(grad3.0)

# Schools with on-time graduation rate lower than 25%
# grad3.25 <- grad3[grad3$X..Grads.1 < 25  & grad3$X..Grads.1 != 100,]
# nrow(grad3.25) # 28
# View(grad3.25)

# Observe the schools with 0% on-time graduation rates
# View(grad1[grad1$DBN == "07X547",]) 
# View(grad1[grad1$DBN == "08X519",])
# View(grad1[grad1$DBN == "12X372",])
# View(grad1[grad1$DBN == "12X692",])

Performance on tests (outcome): Math

Introduction: To examine the group differences in achievement levels and learning growth in math, I looked at the percent proficiency of math for elementary school (grades 3-5) and middle school (grades 6-8), as well as average regents score of Algebra I for high school using metric value (grades 9-12).

Data Import

To get the data about percent proficiency of Math for elementary school and middle school, please go to New York City Department of Education Test Results webpage. Scroll down to the bottom. Under the ā€œMath Test Results 2013 to 2019ā€ section, right click on the ā€œSchool (Excel file)ā€, ā€œCopy Link Addressā€, and select the ā€œAllā€ sheet.

To get the data about average regents score of Algebra I for high school, please go to New York City Department of Education School Quality Report Citywide Data Archives webpage. Find the ā€œ2018-19 School Quality Reportsā€ section. Right click on ā€œ2018-19 School Quality Reports Results for high schoolsā€, ā€œCopy Link Addressā€, and select ā€œStudent Achievementā€.

math <- read.xlsx("https://infohub.nyced.org/docs/default-source/default-document-library/school-math-results-2013-2019-(public).xlsx", sheet = "All")
# math <- read_excel("school-math-results-2013-2019-(public).xlsx", sheet = "All")
math <- math[,-1]
# Regent score
sqr1819 <- read.xlsx("https://infohub.nyced.org/docs/default-source/default-document-library/201819_hs_sqr_results.xlsx", sheet = "Student Achievement")
names(sqr1819) <- sqr1819[1,] # Copy the first row to header
sqr1819 <- sqr1819[-1,-1] # Delete the first row and first empty column
# dim(sqr1819) # 486 176

Variable Selection

For the percent proficiency of math for elementary and middle school, I took the ā€œ% Level3+4ā€ variable from ā€œschool-math-results-2013-2019-(public)ā€ dataset which refers to the percent of students who are proficient or excel in standards for Mathematics for their grade. For average regents score of Algebra I for high school, I took the ā€œMetric Value - Average Regents Score - Algebra I (Common Core)ā€ variable from ā€œ201819_hs_sqr_resultsā€ dataset. This is a 100-point scale score with 65 or higher as the standard for being eligible of getting a diploma and 80 or higher as the standard for being college ready proficient in Math.

Data Cleaning

8 out of 2275 elementary schools (grades 3-5) and 10 out of 1369 middle schools (grades 6-8 have 5 or fewer tested students which are suppressed with an ā€œsā€ and deleted for Figure 5.

vis_miss(math) # no missing values

# table(math$`#.Level.3+4`) # have 93 suppressed values, other values are all numeric
math2 <- transform(math, District = substr(DBN, 1, 2), Borough = substr(DBN, 3,3)) %>% 
  # move the new District and Borough columns to the left 
  relocate("District", "Borough")
  
# Replace borough abbreviation with full name
math2["Borough"][math2["Borough"] == "K"] <- "Brooklyn"
math2["Borough"][math2["Borough"] == "M"] <- "Manhattan"
math2["Borough"][math2["Borough"] == "Q"] <- "Queens"
math2["Borough"][math2["Borough"] == "R"] <- "Staten Island"
math2["Borough"][math2["Borough"] == "X"] <- "Bronx"
# Keep columns by name
keeps <- c("DBN", "School Name", "School Type",
           "Metric Value - Average Regents Score - English (Common Core)",
           "Metric Value - Average Regents Score - Global History",
           "Metric Value - Average Regents Score - US History",
           "Metric Value - Average Regents Score - Algebra I (Common Core)",
           "Metric Value - Average Regents Score - Living Environment")
sqr1819.2 <- sqr1819[keeps]

sqr1819.3 <- transform(sqr1819.2, District = substr(DBN, 1, 2), Borough = substr(DBN, 3,3)) %>% 
  # move the new District and Borough columns to the left 
  relocate("District", "Borough")
# dim(sqr1819.3) # 486  10

# Replace borough abbreviation with full name
sqr1819.3["Borough"][sqr1819.3["Borough"] == "K"] <- "Brooklyn"
sqr1819.3["Borough"][sqr1819.3["Borough"] == "M"] <- "Manhattan"
sqr1819.3["Borough"][sqr1819.3["Borough"] == "Q"] <- "Queens"
sqr1819.3["Borough"][sqr1819.3["Borough"] == "R"] <- "Staten Island"
sqr1819.3["Borough"][sqr1819.3["Borough"] == "X"] <- "Bronx"
vis_miss(sqr1819.3) # no missing values

# table(sqr1819.3$Metric.Value...Average.Regents.Score...Algebra.I..Common.Core.) # N<15: 30, other values are numeric

Math: Grades 3-5 (elementary school)

# number of elementary schools with or without suppressed values: 2275
math.elem.ws <- math2 %>% 
  # Select 2018-2019 school year
  subset(Year == 2018) %>% 
  # Select grade level 3-5
  subset(Grade == 3 | Grade == 4 | Grade == 5)
# dim(math.elem.ws) # 2275 19

# number of elementary schools with suppressed values (grades 3-5): 8
math.elem.s <- math2 %>% 
  # Select 2018-2019 school year
  subset(Year == 2018) %>% 
  # Select grade level 3-5
  subset(Grade == 3 | Grade == 4 | Grade == 5) %>% 
  filter(X..Level.3.4 == "s")
# dim(math.elem.s) # 8 19
# 8 out of 2275 elementary schools (grades 3-5) have 5 or fewer tested students which are suppressed with an ā€œsā€

# dataset without suppressed values
math.elem <- math2 %>% 
  # Select 2018-2019 school year
  subset(Year == 2018) %>% 
  # Select grade level 3-5
  subset(Grade == 3 | Grade == 4 | Grade == 5) %>% 
  # Filter out rows with no suppressed values under "# Level3+4" column
  filter(X..Level.3.4 != "s") %>% 
  group_by(DBN) %>% 
  # percentage of level 3 and 4 for grades 3-5 in total for each school
  mutate(per.Level.34 = sum(as.numeric(X..Level.3.4))/sum(as.numeric(Number.Tested))*100)
ggplot(data = math.elem,
       mapping = aes(y = reorder(District, as.numeric(per.Level.34), median), 
                     x = as.numeric(per.Level.34), 
                     fill = Borough)) +
  geom_boxplot() +
  scale_fill_manual(values = cbPalette) + 
  labs(y= "Districts", 
       x= "Percent proficiency (%)",
       title = "Median Percent Proficiency of Math for Elementary Schools (Grades 3-5) in A Descending Order",
       subtitle = "New York State Grades 3-5 Mathematics Assessment",
       caption = "Use the data for 2018-2019 school year from 'school-math-results-2013-2019-(public)' dataset.") + 
  theme_bw() + 
  theme_text()