Instructions

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see https://rmarkdown.rstudio.com/lesson-1.html. In the header section above, you can configure options for this document, including title, author(s), and additional style and output options. The nocite and bibliography lines automatically add a bibliography for the software packages you have used. Remove the nocite line to suppress references you haven’t cited. You may delete this instruction section.

Abstract

Chakraborty (2021) uses early COVID-19 Data to measure if counties with higher rates of disabilities are more prone to effects from the pandemic in terms of infection rates. The overall goal of the study is to investigate whether disabled people are disproportionately affected by COVID-19.

In order to measure this, Chakraborty tests the correlation of disability among people over 18, separated by race, poverty, age, sex, with rates of COVID-19 by county. These correlation values are tested for significant outcomes to determind what factors of population demographics can be used to predict a county’s COVID-19 rate.

Chakraborty also controls for spatial correlation by using clustering in a General Estimating Equation using both state boundaries and rates of COVID-19. For each category in race, sex, poverty status, and age, a separate GEE model was ran.

Chakraborty found significant positive relationships between rates of disability among traditionally marginalized groups and rates of COVID-19. This study aims to reproduce the results found in Chakraborty’s research, using the same models and data sources.

This reproduction study will include: a map of COVID-19 rates by county (fig. 1), a map of disability rates by county (fig. 2), a table of correlation values (fig. 3), a map of GEE clusters (fig. 4), a qualitative map for risk level (fig. 5), and a table of correlation after controlling for spatial dependence (fig. 6).

The reproduction study data and code are available in public a GitHub repository at https://github.com/jorredahl/RPr-Chakraborty-2021 and the analysis plans and reports are registered with OSF at https://doi.org/10.17605/OSF.IO/S5MTQ. The reproduction is implemented with R markdown using the SpatialEpi package for the Kulldorff spatial scan statistic packages and the geepack package for the generalized estimating equation.

Chakraborty, J. 2021. Social inequities in the distribution of COVID-19: An intra-categorical analysis of people with disabilities in the U.S. Disability and Health Journal 14:1-5. https://doi.org/10.1016/j.dhjo.2020.101007

Study design

The aim of this reproduction study is to implement the original study as closely as possible to reproduce the map of county level distribution of COVID-19 incidence rate, the summary statistics and bivariate correlation for disability characteristics and COVID-19 incidence, and the generalized estimating equations. Our two confirmatory hypotheses are that we will be able to exactly reproduce Chakraborty’s results as presented in table 1 and table 2. Stated as null reproduction study hypotheses (RPr-H):

RPr-H1: There is a less than perfect match between Chakraborty’s bivariate correlation coefficient for each disability/sociodemographic variable and COVID-19 incidence rate and our bivariate correlation coefficient for each disability/sociodemographic variable and COVID-19 incidence rate.

RPr-H2: There is a less than perfect match between Chakraborty’s beta coefficient for the GEE of each disability/sociodemographic variable and our beta coefficient for the GEE of each disability/sociodemographic variable.

Each hypothesis encompasses multiple models, testing across the demographic variables: race, ethnicity, sex, age, and poverty status.

Original study design

The original study is observational, with the exploratory objective of determining “whether COVID-19 incidence is significantly greater in counties containing higher percentages of socio-demographically disadvantaged [people with disabilities], based on their race, ethnicity, poverty status, age, and biological sex” (Chakraborty 2021).

In the original study, 18 implicit bivariate hypotheses are tested for correlation between COVID-19 cumulative incidence rates and specific categories of PwDs at the county level. Although the original publication does not state null hypotheses for each bivariate correlation, we may formulate the original research hypotheses (OR-H) as follows:

OR-H1.1: There is no correlation between the COVID-19 incidence rate and the percentage of people with disabilities at the county level. OR-H1.2: There is no correlation between the COVID-19 incidence rate and the percentage of white people with disabilities at the county level. … OR-H1.18 There is no correlation between the COVID-19 incidence rate and the percentage of female people with disabilities at the county level.

Five multi-variate hypotheses are tested for associations between COVID-19 cumulative incidence rates and subgroups of PwDs at the county level. Although the original publication does not state null hypotheses for each model, we may formulate them as follows:

OR-H2.1: The percentages of people with disability, categorized by race, are not associated with COVID-19 incidence at the county level when accounting for the state and risk level of COVID-19 clusters. … OR-H2.5: The percentages of people with disability, categorized by gender, are not associated with COVID-19 incidence at the county level when accounting for the state and risk level of COVID-19 clusters.

There is no randomization in the original study.

Study metadata

Materials and procedure

Computational environment

The study was originally conducted using SaTScan software to implement the Kulldorff spatial scan statistic. Other software are not specified in the publication; however data files suggest and communication with the author verifies that spatial analysis and mapping was conducted in ArcGIS, generalized estimating equation (GEE) models were calculated in SPSS, and the SaTScan software version was 9.6.

This reproduction study uses R, including the SpatialEpi package for the Kulldorff spatial scan statistics and the geepack package for GEE models.

Data

ACS Socio-demographic data

The American Community Survey (ACS) five-year estimate (2014-2018) variables used in the study are outlined in the table below. Details on ACS data collection can be found at https://www.census.gov/topics/health/disability/guidance/data-collection-acs.html and details on sampling methods and accuracy can be found at https://www.census.gov/programs-surveys/acs/technical-documentation/code-lists.html.

Disability Subgroup Variables
Variable Name in Study ACS Variable name
percent of total civilian non-institutionalized population with a disability S1810_C03_001E
Race
percent w disability: White alone S1810_C03_004E
percent w disability: Black alone S1810_C03_005E
percent w disability: Native American S1810_C03_006E
percent w disability: Asian alone S1810_C03_007E
percent w disability: Other race S1810_C03_009E
Ethnicity
percent w disability: Non-Hispanic White S1810_C03_0011E
percent w disability: Hispanic S1810_C03_012E
percent w disability: Non-Hispanic non-White (S1810_C02_001E - S1810_C02_011E - S1810_C02_012E) / (S1810_C01_001E - S1810_C01_011E - S1810_C01_012E) * 100
percent w disability: Other race S1810_C03_009E
Poverty
percent w disability: Below poverty level (C18130_004E + C18130_011E + C18130_018E) / C18130_001E * 100
percent w disability: Above poverty level (C18130_005E + C18130_012E + C18130_019E) / C18130_001E * 100
Age
percent w disability: 5-17 S1810_C03_014E
percent w disability: 18-34 S1810_C03_015E
percent w disability: 35-64 S1810_C03_016E
percent w disability: 65-74 S1810_C03_017E
percent w disability: 75+ S1810_C03_018E
Biological sex
percent w disability: male S1810_C03_001E
percent w disability: female S1810_C03_003E

American Community Survey (ACS) data for sociodemographic subcategories of people with disabilities can be accessed by using the tidycensus package to query the Census API. This requires an API key which can be acquired at api.census.gov/data/key_signup.html.

ACS data transformations

The original study extent is the lower 48 states and Washington D.C. Therefore, Alaska, Hawai’i and Puerto Rico are removed from the data. Data on people with disabilities in poverty is derived from a different census table (C18130) than data on people with disabilities and age, race, ethnicity, age, and biological sex (S1810). Therefore, join the poverty data to the other data using the GEOID. Also transform the ACS geographic data into Contiguous USA Albers Equal Area projection and fix geometry errors.

Optionally, save the raw ACS data to data/raw/public/acs.gpkg for use in GIS software.

Calculate independent socio-demographic variables of people with disabilities as percentages for each sub-category of disability (race, ethnicity, poverty, age, and biological sex) and remove raw census data from the data frame. Reproject the data into an Albers equal area conic projection.

COVID-19 data

Data on COVID-19 cases from the Johns Hopkins University dashboard have been provided directly with the research compendium because the data is no longer available online in the state in which it was downloaded on August 1, 2020. The dashboard and cumulative counts of COVID-19 cases and deaths were continually updated, so an exact reproduction required communication with the original author, Jayajit Chakraborty, for assistance with provision of data from August 1, 2020. The data includes an estimate of the total population (POP_ESTIMA) and confirmed COVID-19 cases (Confirmed). The COVID-19 case data expresses cumulative count of reported COVID-19 from 1/22/2020 to 8/1/2020. Although metadata for this particular resource is no longer available from the original source, one can reasonably assume that the total population estimate was based on the 2014-2018 5-year ACS estimate, as the 2019 estimates data had not been released yet.

Versions of the data can be found at the John Hopkins CCSE COVID-19 Data Repository (https://github.com/CSSEGISandData/COVID-19). However, archived data only provides summaries at the national scale. We received the COVID-19 case data through 8/1/2020 at the county level from the author, as there is no readily apparent way to access archived data from the Johns Hopkins University Center for Systems Science Engineering database.

COVID-19 data transformations

Calculate the COVID incidence rate as the cases per 100,000 people. Convert the COVID data to a non-geographic data frame.

Join dependent COVID data to independent ACS demographic data.

Missing data

Deviation: There is one county with missing disability and poverty data. This was not mentioned in the original study or in our pre-analyis plan. However, we replace the missing data with zeros, producing results identical to Chakraborty’s.

fips statefp county county_st covid_rate dis_pct white_pct black_pct native_pct asian_pct other_pct non_hisp_white_pct hisp_pct non_hisp_non_white_pct bpov_pct apov_pct pct_5_17 pct_18_34 pct_35_64 pct_65_74 pct_75 male_pct female_pct pop cases x y
35039 35 Rio Arriba Rio Arriba County, New Mexico 751.17 16.06467 10.77458 0.038371 2.744807 0.038371 2.468536 2.355981 11.39619 2.312494 NA NA 0.3069682 1.25857 6.781439 3.391998 4.279648 8.556738 7.50793 39006 293 -106.6932 36.50962

Analysis

Descriptive statistics

Calculate descriptive statistics for dependent COVID-19 rate and independent socio-demographic characteristics, reproducing the min, max, mean, and SD columns of original study table 1.

Deviation: We also calculate the Shapiro Wilk test for normality.

Reproduced Descriptive Statistics
min max mean SD ShapiroWilk p
covid_rate 0.00 14257.17 966.90 1003.96 0.74 0
dis_pct 3.83 33.71 15.95 4.40 0.98 0
white_pct 0.85 33.26 13.55 4.63 0.98 0
black_pct 0.00 20.70 1.48 2.66 0.61 0
native_pct 0.00 13.74 0.28 0.94 0.28 0
asian_pct 0.00 3.45 0.09 0.18 0.51 0
other_pct 0.00 15.24 0.55 0.65 0.57 0
non_hisp_white_pct 0.10 33.16 12.84 4.81 0.99 0
hisp_pct 0.00 25.26 0.99 2.15 0.42 0
non_hisp_non_white_pct 0.00 20.93 2.13 2.75 0.70 0
bpov_pct 0.00 14.97 3.57 1.85 0.93 0
apov_pct 0.00 27.30 12.48 3.06 0.99 0
pct_5_17 0.00 5.08 1.03 0.48 0.95 0
pct_18_34 0.00 5.59 1.56 0.67 0.96 0
pct_35_64 1.01 18.36 6.35 2.30 0.96 0
pct_65_74 0.00 12.73 3.09 1.16 0.95 0
pct_75 0.00 11.13 3.87 1.19 0.97 0
male_pct 1.30 18.19 8.06 2.37 0.98 0
female_pct 1.91 19.94 7.90 2.26 0.98 0

Compare reproduced descriptive statistics to original descriptive statistics. Difference is calculated as ‘reproduction study - original study’. Identical results will result in zero.

Descriptive Statistics Comparison
min max mean SD
covid_rate 0 0.17 -0.1 -0.04
dis_pct 0 0.00 0.0 0.00
white_pct 0 0.00 0.0 0.00
black_pct 0 0.00 0.0 0.00
native_pct 0 0.00 0.0 0.00
asian_pct 0 0.00 0.0 0.00
other_pct 0 0.00 0.0 0.00
non_hisp_white_pct 0 0.00 0.0 0.00
hisp_pct 0 0.00 0.0 0.00
non_hisp_non_white_pct 0 0.00 0.0 0.00
bpov_pct 0 0.00 0.0 0.00
apov_pct 0 0.00 0.0 0.00
pct_5_17 0 0.00 0.0 0.00
pct_18_34 0 0.00 0.0 0.00
pct_35_64 0 0.00 0.0 0.00
pct_65_74 0 0.00 0.0 0.00
pct_75 0 0.00 0.0 0.00
male_pct 0 0.00 0.0 0.00
female_pct 0 0.00 0.0 0.00

The descriptive statistics are identical, except that the original study seems to have rounded the COVID-19 statistics to zero decimal places.

Bivariate parametric correlation analysis

The county-level Pearson’s rho correlation coefficient was used to test association between intra-categorical rates of disability and COVID-19 incidence rates. As this was a parametric test, normality should be tested. A separate hypothesis was formulated for disability in aggregate and for each sociodemographic disability characteristic. Calculating this value is solely exploratory, as it does not control for spatial dependencies, like in later calculations.

Calculate Pearson’s R Correlation Coefficient of each independent variable and the COVID-19 incidence rate, reproducing the Pearson’s R column of original study Table 1.

Compare the reproduced Pearson’s r correlation coefficients to the original study’s Pearson’s r correlation coefficients. Stars indicates the significance level with two stars for p < 0.01 and one star for p < 0.05. Correlation difference rp_r_diff is calculated between the reproduction study rp_r and original study or_r as rp_r_diff = rp_r - or_r Direction difference rp_dir_diff is calculated as (rp_r > 0) - (or_r > 0), giving 0 if both coefficients have the same direction, 1 if the reproduction is positive and the original is negative, and -1 if the reproduction is negative but the original is positive.

Compare reproduced and original Pearson’s R
Original
Reproduced
Difference
Variable R Sig. Level R Sig. Level R Sig. Level Direction
dis_pct -0.056 2 -0.060 2 -0.004 0 0
white_pct -0.326 2 -0.332 2 -0.006 0 0
black_pct 0.456 2 0.460 2 0.004 0 0
native_pct 0.020 0 0.019 0 -0.001 0 0
asian_pct 0.097 2 0.094 2 -0.003 0 0
other_pct 0.028 0 0.026 0 -0.002 0 0
non_hisp_white_pct -0.355 2 -0.361 2 -0.006 0 0
hisp_pct 0.119 2 0.119 2 0.000 0 0
non_hisp_non_white_pct 0.439 2 0.442 2 0.003 0 0
bpov_pct 0.108 2 0.106 2 -0.002 0 0
apov_pct -0.146 2 -0.151 2 -0.005 0 0
pct_5_17 0.083 2 0.084 2 0.001 0 0
pct_18_34 0.066 1 0.063 2 -0.003 1 0
pct_35_64 -0.005 0 -0.008 0 -0.003 0 0
pct_65_74 -0.089 2 -0.091 2 -0.002 0 0
pct_75 -0.181 2 -0.186 2 -0.005 0 0
male_pct -0.131 2 -0.134 2 -0.003 0 0
female_pct 0.028 0 0.023 0 -0.005 0 0

The reproduced values were very similar in value to the original values, only deviating by as much as 0.006. Significance levels and direction were found to all be equal, showing a very similar results value from the original to the reproduced.

This is not ideal for reproducing the original results, likely this difference comes from the result of rows in the original dataset getting flipped in Excel, meaning some COVID rates belonged to other counties.

Kulldorff spatial scan statistic

The methods in the original study call for geographic clustering of counties to control for spatial dependence. These groups are made according to the Kulldorff spatial scan statistic for geographic clusters of high COVID-19 incidence.

The original study uses SaTScan software to implement the Kulldorff spatial scan statistic model. In SaTScan, models may be specified with many parameters having significant implications for results. The original manuscript only specifies that Poisson model should be used. The scan statistic in SaTScan used spherical great circle distance calculations based upon the latitude and longitude coordinates of the centroid of each county. For this purpose, we have used the X and Y attributes provided as geographic variables with ACS data.

Deviation: We opted to use the SpatialEpi package in R, selecting open source software with R integration over SatSCan software, which is free but not open. The Kulldorff spatial scan statistic model in SpatialEpi also supports a discrete Poisson spatial model, and uses the GINI coefficient to select secondary clusters with no geographical overlap that maximize the difference between locations inside of clusters and locations outside of clusters. We expected that this set of software options could reproduce identical results compared to SaTScan.

## [1] Run time: 4.85 minutes
## [1] Most likely cluster:
## $location.IDs.included
##  [1] 1824 1835 1797 1818 1825 1749 1854 1742 1837 1747 1838  280 1760 1846 1756
## 
## $population
## [1] 16949211
## 
## $number.of.cases
## [1] 469091
## 
## $expected.cases
## [1] 233805.6
## 
## $SMR
## [1] 2.006329
## 
## $log.likelihood.ratio
## [1] 97983.07
## 
## $monte.carlo.rank
## [1] 1
## 
## $p.value
## [1] 0.001
## [1] Number of Secondary clusters: 136

Preprocess data for GEE modelling

Deviation In this study, clustering will be decided more simply, involving only a single method of clustering, risk levels are included for all counties in the cluster rather than the center, and clusters are treated as a single category rather than a tiered system.

Unique GEE cluster IDs

GEE modeling requires specific clustering to correctly control for spatial dependence. We will use our SpatialEpi clusters to get (non-identical) similar clusters to those of the original study using SaTScan.

Filter and standardize data

First, cluster classes must be defined in 6 tiers, then combined with state ID codes to create unique cluster IDs to code for spatial relation.

Risk is classified according to this table:

Relative Risk Values Relative Risk Class
Outside of cluster 1
RR < 1 1
1 <= RR < 2 2
2 <= RR < 3 3
3 <= RR < 4 4
4 <= RR < 5 5
5 <= RR < 6 6

Counties falling outside of any cluster are assigned a score of 1.

Deviation SpatialEpi does not directly calculate relative risk on a local or cluster level, so it must be calculated on its own.

Deviation, in order to best replicate the process from Chakraborty’s study, we must find z-scores then filter for non-zero COVID rates.

Compare redults gathered in this method to the original results of this section from the study.s

## Summary of difference between reproduction independent variables and original independent variables
## Mean:
##              z_white_pct              z_black_pct             z_native_pct 
##                        0                        0                        0 
##              z_asian_pct              z_other_pct     z_non_hisp_white_pct 
##                        0                        0                        0 
##               z_hisp_pct z_non_hisp_non_white_pct               z_bpov_pct 
##                        0                        0                        0 
##               z_apov_pct               z_pct_5_17              z_pct_18_34 
##                        0                        0                        0 
##              z_pct_35_64              z_pct_65_74                 z_pct_75 
##                        0                        0                        0 
##               z_male_pct             z_female_pct 
##                        0                        0
## 
## Standard deviation:
##              z_white_pct              z_black_pct             z_native_pct 
##                    0.000                    0.000                    0.000 
##              z_asian_pct              z_other_pct     z_non_hisp_white_pct 
##                    0.001                    0.001                    0.000 
##               z_hisp_pct z_non_hisp_non_white_pct               z_bpov_pct 
##                    0.000                    0.000                    0.000 
##               z_apov_pct               z_pct_5_17              z_pct_18_34 
##                    0.000                    0.001                    0.001 
##              z_pct_35_64              z_pct_65_74                 z_pct_75 
##                    0.000                    0.000                    0.000 
##               z_male_pct             z_female_pct 
##                    0.000                    0.000

After changing the order to first z-score standardize and then filter for COVID rates > 0, we observed no mean difference between our reproduced variables and the original variables, and we find no standard deviation > 0.001 for the difference between reproduction independent variables and original variables. There are no major differences between the independent variables.

Save final derived data

Save data to file so it can be loaded from there.

GEE models

The generalized estimating equation (GEE) models work to find associations between COVID and disability rates, accounting for spatial clustering.

As specified by the author, “GEEs extend the generalized linear model to accommodate clustered data, in addition to relaxing several assumptions of traditional regression (i.e., normality)”. Additionally, the author noted that “clusters of observations must be defined based on the assumption that observations within a cluster are correlated while observations from different clusters are independent.” All five GEE models were specified with exchangeable correlation matrices, gamma distributions, and logarithmic link function.

In Chakraborty’s original study, he uses these GEE parameters to best display association.:

“The ‘exchangeable’ correlation matrix was selected for the results reported here, since this specification yielded the best statistical fit based on the QIC (quasi- likelihood under the independence) model criterion.” (Chakraborty 2021, Methods paragraph 5)

“The gamma distribution with logarithmic link function was chosen for all GEEs since this model specification provided the lowest QIC value.” (Chakraborty 2021, Methods paragraph 5)

GEE Function

Compare GEE results

First, get the original results of the GEE model:

Original Cluster IDs and Original COVID-19 Incidence
term estimate std.error conf.low conf.high stars p.value
Race model intercept 6.756 0.020 6.716 6.795 2 0.000
White -0.197 0.024 -0.245 -0.149 2 0.000
Black 0.269 0.020 0.229 0.308 2 0.000
Native American 0.022 0.023 -0.023 0.066 0 0.339
Asian 0.057 0.020 0.017 0.096 2 0.005
Other race 0.089 0.018 0.053 0.124 2 0.000
Ethnicity model intercept 6.743 0.020 6.705 6.782 2 0.000
Non-Hispanic White -0.224 0.024 -0.272 -0.176 2 0.000
Hispanic 0.120 0.016 0.088 0.152 2 0.000
Non-Hispanic non-White 0.270 0.019 0.233 0.308 2 0.000
Poverty model intercept 6.816 0.021 6.774 6.858 2 0.000
Below poverty level 0.223 0.029 0.166 0.280 2 0.000
Above poverty level -0.284 0.031 -0.345 -0.223 2 0.000
Age model intercept 6.822 0.021 6.780 6.864 2 0.000
Age 5-17 0.058 0.023 0.012 0.103 1 0.013
Age 18-34 0.038 0.035 -0.031 0.106 0 0.279
Age 35-64 0.003 0.041 -0.077 0.084 0 0.942
Age 65-74 -0.033 0.041 -0.113 0.046 0 0.412
Age 75 or more -0.156 0.027 -0.209 -0.102 2 0.000
Biological sex model intercept 6.814 0.021 6.772 6.856 2 0.000
Male -0.378 0.039 -0.456 -0.301 2 0.000
Female 0.279 0.042 0.197 0.361 2 0.000

Then, compare the rseults of the two methods:

Her, many of the values gathered from the SpatialEpi Clustering and GEE model differ from the original results. While the same conclusions can be drawn, it is important to note that running a GEE model through a different service will have different coefficient values. Most likely, SATScan has different parameters in their proprietary modeling functions that cause differing results to that of SpatialEpi.

Interpret GEE Model

## 
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: instead of `style = "equal"`, use fill.scale =
## `tm_scale_intervals()`.
## ℹ Migrate the argument(s) 'style', 'n', 'midpoint', 'palette' (rename to
##   'values') to 'tm_scale_intervals(<HERE>)'
## For small multiples, specify a 'tm_scale_' for each multiple, and put them in a
## list: 'fill'.scale = list(<scale1>, <scale2>, ...)'
## [v3->v4] `tm_polygons()`: use `col_alpha` instead of `border.alpha`.
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "RdBu" is named
## "brewer.rd_bu"
## Multiple palettes called "rd_bu" found: "brewer.rd_bu", "matplotlib.rd_bu". The first one, "brewer.rd_bu", is returned.

Calculate variance, covariance, correlation, and weights

Map each county’s weight in GEE model.

## 
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: instead of `style = "jenks"`, use fill.scale =
## `tm_scale_intervals()`.
## ℹ Migrate the argument(s) 'style', 'midpoint', 'palette' (rename to 'values')
##   to 'tm_scale_intervals(<HERE>)'
## For small multiples, specify a 'tm_scale_' for each multiple, and put them in a
## list: 'fill'.scale = list(<scale1>, <scale2>, ...)'
## [v3->v4] `tm_polygons()`: use `col_alpha` instead of `border.alpha`.
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "Oranges" is named
## "brewer.oranges"
## Multiple palettes called "oranges" found: "brewer.oranges", "matplotlib.oranges". The first one, "brewer.oranges", is returned.

Results

Map of COVID-19 Rates

This map will display each US County and its rate of COVID-19 Per 100,000 people.

Map of Disability Rates

Deviation This map will display each US County and its percentage of people over the age of 18 with a disability.

Table of Correlation values

This table will display correlation values using Pearson’s R and Spearman’s Rho, designed to reproduce Chakraborty’s first table.

Reproduced Pearson’s R
variable r t p rp_stars
dis_pct -0.060 3.350 0.000 2
white_pct -0.332 19.612 0.000 2
black_pct 0.460 28.847 0.000 2
native_pct 0.019 1.072 0.142 0
asian_pct 0.094 5.272 0.000 2
other_pct 0.026 1.460 0.072 0
non_hisp_white_pct -0.361 21.545 0.000 2
hisp_pct 0.119 6.686 0.000 2
non_hisp_non_white_pct 0.442 27.429 0.000 2
bpov_pct 0.106 5.914 0.000 2
apov_pct -0.151 8.513 0.000 2
pct_5_17 0.084 4.688 0.000 2
pct_18_34 0.063 3.493 0.000 2
pct_35_64 -0.008 0.460 0.323 0
pct_65_74 -0.091 5.113 0.000 2
pct_75 -0.186 10.541 0.000 2
male_pct -0.134 7.519 0.000 2
female_pct 0.023 1.305 0.096 0
pop 0.128 7.215 0.000 2
cases 0.209 11.891 0.000 2
x 0.099 5.540 0.000 2
y -0.412 25.195 0.000 2

Map of Kulldorff Clusters

Deviation This map will display each cluster as its own classification, illuminating the inner workings of GEE clustering.

## 
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_tm_polygons()`: migrate the argument(s) related to the scale of
## the visual variable `fill` namely 'palette' (rename to 'values') to fill.scale
## = tm_scale(<HERE>).
## [v3->v4] `tm_polygons()`: use 'fill' for the fill color of polygons/symbols
## (instead of 'col'), and 'col' for the outlines (instead of 'border.col').
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## Multiple palettes called "oranges" found: "brewer.oranges", "matplotlib.oranges". The first one, "brewer.oranges", is returned.
## 
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "-Oranges" is named
## "oranges" (in long format "brewer.oranges")

Map of Risk Scores

Deviation This map will display risk scores determined by each cluster based on COVID-19 rates, risk scores range from 1 to 6. Relative risk scores are calculated locally and by cluster, so both are shown here.

Local Relative Risk Scores:

## 
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: instead of `style = "cat"`, use fill.scale =
## `tm_scale_categorical()`.
## ℹ Migrate the argument(s) 'palette' (rename to 'values'), 'labels' to
##   'tm_scale_categorical(<HERE>)'
## [v3->v4] `tm_polygons()`: use `col_alpha` instead of `border.alpha`.
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "Oranges" is named
## "brewer.oranges"
## Multiple palettes called "oranges" found: "brewer.oranges", "matplotlib.oranges". The first one, "brewer.oranges", is returned.

Cluster Relative Risk Scores

## 
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: instead of `style = "cat"`, use fill.scale =
## `tm_scale_categorical()`.
## ℹ Migrate the argument(s) 'palette' (rename to 'values'), 'labels' to
##   'tm_scale_categorical(<HERE>)'
## [v3->v4] `tm_polygons()`: use `col_alpha` instead of `border.alpha`.
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "Oranges" is named
## "brewer.oranges"
## Multiple palettes called "oranges" found: "brewer.oranges", "matplotlib.oranges". The first one, "brewer.oranges", is returned.

Table of county weights and variance

This table will display results to reproduce Chakraborty’s second table, controlling for spatial correlation using the GEE clustering.

Reproduced SpatialEpi Cluster IDs and Reproduced COVID-19 Incidence
term estimate std.error conf.low conf.high stars p.value
Race model intercept 6.755 0.020 6.716 6.795 2 0.000
White -0.204 0.024 -0.250 -0.158 2 0.000
Black 0.265 0.021 0.224 0.305 2 0.000
Native American 0.020 0.023 -0.024 0.065 0 0.368
Asian 0.051 0.020 0.012 0.090 1 0.011
Other race 0.086 0.018 0.050 0.122 2 0.000
Ethnicity model intercept 6.742 0.020 6.703 6.781 2 0.000
Non-Hispanic White -0.230 0.024 -0.276 -0.184 2 0.000
Hispanic 0.115 0.016 0.083 0.146 2 0.000
Non-Hispanic non-White 0.267 0.019 0.229 0.304 2 0.000
Poverty model intercept 6.816 0.021 6.774 6.858 2 0.000
Below poverty level 0.219 0.029 0.162 0.277 2 0.000
Above poverty level -0.285 0.031 -0.346 -0.224 2 0.000
Age model intercept 6.823 0.022 6.781 6.865 2 0.000
Age 5-17 0.063 0.023 0.018 0.108 2 0.006
Age 18-34 0.032 0.035 -0.036 0.101 0 0.355
Age 35-64 -0.005 0.041 -0.086 0.076 0 0.903
Age 65-74 -0.025 0.041 -0.105 0.055 0 0.543
Age 75 or more -0.160 0.027 -0.214 -0.107 2 0.000
Biological sex model intercept 6.816 0.021 6.774 6.858 2 0.000
Male -0.368 0.039 -0.445 -0.292 2 0.000
Female 0.266 0.041 0.185 0.347 2 0.000

Discussion

Once results are found, compare table values with the table values in the original study using SATScan. Success will find the same values as those found in Chakrabory’s original study. A failure would involve producing significantly different values, or even a study that finds a different hypothesis to be true.

The results shown from the results figures would indicate a success for this reproduction, although this process had many deviations, some unplanned and others not, that allowed for this workflow to be recreated in RStudio.

Deviations

Our first deviation came from an error in the collected data for this reproduction, where not all counties were given disability and poverty statistics.

Most planned deviations were to provide more depth to the study, without changing the fundamental workflow. Changes like these include making more figures to make the process of the study more clear, or including numbers for the Shapiro Wilk test for Normality.

Most significantly, switching to SpatialEpi over SATScan resulted in slightly different coefficients across the board, and also meant we had to self-calculate relative risk scores.

For the differences in how clustering was determined, opting for a more simple approach, the belief was that hierarchical clustering may deliver cluster relative risk scores that were too sensitive or inaccurate to other lower-order clusters.

Conclusions

We find similar results to Chakraborty in our own reproduction. COVID Risk rates are positively correlated with higher disability rates among traditionally marginalized groups. While the results are not identical, the conclusions that can be made remain the same.

Differences

Differences in this reproduction to the original study largely stem from deviations in our own methods, opting for methods that are more available to the general public and offer a more streamlined workflow. Still, the biggest differences in the two studies come from the GEE model results, particularly the coefficient values for Black disability rate, Non Hispanic Non White Disability rate, below poverty disability rate, and female disability rate.

Acknowledgements

This report is based upon the template for Reproducible and Replicable Research in Human-Environment and Geographical Sciences, DOI:[10.17605/OSF.IO/W29MQ](https://doi.org/10.17605/OSF.IO/W29MQ)

Bivand, Roger. 2025. classInt: Choose Univariate Class Intervals. https://r-spatial.github.io/classInt/.
Chang, Winston. 2015. Downloader: Download Files over HTTP and HTTPS. https://github.com/wch/downloader.
Csárdi, Gábor. 2021. Dotenv: Load Environment Variables from .env. https://github.com/gaborcsardi/dotenv.
Dunnington, Dewey, Edzer Pebesma, and Ege Rubak. 2024. S2: Spherical Geometry Operators Using the S2 Geometry Library. https://r-spatial.github.io/s2/.
Grosjean, Philippe. 2022. svDialogs: SciViews - Standard Dialog Boxes for Windows, MacOS and Linuxes. https://github.com/SciViews/svDialogs.
———. 2024. SciViews::r. MONS, Belgium: UMONS. https://sciviews.r-universe.dev/.
Grosjean, Philippe, and Frederic Ibanez. 2024. Pastecs: Package for Analysis of Space-Time Ecological Series. https://github.com/SciViews/pastecs.
Halekoh, Ulrich, Søren Højsgaard, and Jun Yan. 2006. “The r Package Geepack for Generalized Estimating Equations.” Journal of Statistical Software 15/2: 1–11.
Højsgaard, Søren, Ulrich Halekoh, and Jun Yan. 2024. Geepack: Generalized Estimating Equation Package. https://CRAN.R-project.org/package=geepack.
Kim, Albert Y., Jon Wakefield, and Mikael Moise. 2023. SpatialEpi: Methods and Data for Spatial Epidemiology. https://github.com/rudeboybert/SpatialEpi.
Müller, Kirill. 2020. Here: A Simpler Way to Find Your Files. https://here.r-lib.org/.
Pebesma, Edzer. 2018. Simple Features for R: Standardized Support for Spatial Vector Data.” The R Journal 10 (1): 439–46. https://doi.org/10.32614/RJ-2018-009.
———. 2024. Sf: Simple Features for r. https://r-spatial.github.io/sf/.
Pebesma, Edzer, and Roger Bivand. 2023. Spatial Data Science: With applications in R. Chapman and Hall/CRC. https://doi.org/10.1201/9780429459016.
R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
———. 2025. Foreign: Read Data Stored by Minitab, s, SAS, SPSS, Stata, Systat, Weka, dBase, ... https://svn.r-project.org/R-packages/trunk/foreign/.
Robinson, David, Alex Hayes, and Simon Couch. 2024. Broom: Convert Statistical Objects into Tidy Tibbles. https://broom.tidymodels.org/.
Solt, Frederick, and Yue Hu. 2024. Dotwhisker: Dot-and-Whisker Plots of Regression Results. https://fsolt.org/dotwhisker/.
Tennekes, Martijn. 2018. tmap: Thematic Maps in R.” Journal of Statistical Software 84 (6): 1–39. https://doi.org/10.18637/jss.v084.i06.
———. 2025. Tmap: Thematic Maps. https://github.com/r-tmap/tmap.
Walker, Kyle, and Matt Herman. 2025. Tidycensus: Load US Census Boundary and Attribute Data as Tidyverse and Sf-Ready Data Frames. https://walker-data.com/tidycensus/.
Wickham, Hadley. 2023. Tidyverse: Easily Install and Load the Tidyverse. https://tidyverse.tidyverse.org.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, Jim Hester, and Jennifer Bryan. 2024. Readr: Read Rectangular Text Data. https://readr.tidyverse.org.
Xie, Yihui. 2014. “Knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman; Hall/CRC.
———. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC. https://yihui.org/knitr/.
———. 2024. Knitr: A General-Purpose Package for Dynamic Report Generation in r. https://yihui.org/knitr/.
Yan, Jun. 2002. “Geepack: Yet Another Package for Generalized Estimating Equations.” R-News 2/3: 12–14.
Yan, Jun, and Jason P. Fine. 2004. “Estimating Equations for Association Structures.” Statistics in Medicine 23: 859–80.
Zhu, Hao. 2024. kableExtra: Construct Complex Table with Kable and Pipe Syntax. http://haozhu233.github.io/kableExtra/.