This is an R Markdown document. Markdown is a simple formatting
syntax for authoring HTML, PDF, and MS Word documents. For more details
on using R Markdown see https://rmarkdown.rstudio.com/lesson-1.html. In the
header section above, you can configure options
for this document, including title, author(s), and additional style
and output options. The nocite
and
bibliography
lines automatically add a bibliography for the
software packages you have used. Remove the nocite
line to
suppress references you haven’t cited. You may delete this instruction
section.
Chakraborty (2021) uses early COVID-19 Data to measure if counties with higher rates of disabilities are more prone to effects from the pandemic in terms of infection rates. The overall goal of the study is to investigate whether disabled people are disproportionately affected by COVID-19.
In order to measure this, Chakraborty tests the correlation of disability among people over 18, separated by race, poverty, age, sex, with rates of COVID-19 by county. These correlation values are tested for significant outcomes to determind what factors of population demographics can be used to predict a county’s COVID-19 rate.
Chakraborty also controls for spatial correlation by using clustering in a General Estimating Equation using both state boundaries and rates of COVID-19. For each category in race, sex, poverty status, and age, a separate GEE model was ran.
Chakraborty found significant positive relationships between rates of disability among traditionally marginalized groups and rates of COVID-19. This study aims to reproduce the results found in Chakraborty’s research, using the same models and data sources.
This reproduction study will include: a map of COVID-19 rates by county (fig. 1), a map of disability rates by county (fig. 2), a table of correlation values (fig. 3), a map of GEE clusters (fig. 4), a qualitative map for risk level (fig. 5), and a table of correlation after controlling for spatial dependence (fig. 6).
The reproduction study data and code are available in public a GitHub
repository at https://github.com/jorredahl/RPr-Chakraborty-2021
and the analysis plans and reports are registered with OSF at https://doi.org/10.17605/OSF.IO/S5MTQ. The reproduction
is implemented with R markdown using the SpatialEpi
package
for the Kulldorff spatial scan statistic packages and the
geepack
package for the generalized estimating
equation.
Chakraborty, J. 2021. Social inequities in the distribution of COVID-19: An intra-categorical analysis of people with disabilities in the U.S. Disability and Health Journal 14:1-5. https://doi.org/10.1016/j.dhjo.2020.101007
The aim of this reproduction study is to implement the original study as closely as possible to reproduce the map of county level distribution of COVID-19 incidence rate, the summary statistics and bivariate correlation for disability characteristics and COVID-19 incidence, and the generalized estimating equations. Our two confirmatory hypotheses are that we will be able to exactly reproduce Chakraborty’s results as presented in table 1 and table 2. Stated as null reproduction study hypotheses (RPr-H):
RPr-H1: There is a less than perfect match between Chakraborty’s bivariate correlation coefficient for each disability/sociodemographic variable and COVID-19 incidence rate and our bivariate correlation coefficient for each disability/sociodemographic variable and COVID-19 incidence rate.
RPr-H2: There is a less than perfect match between Chakraborty’s beta coefficient for the GEE of each disability/sociodemographic variable and our beta coefficient for the GEE of each disability/sociodemographic variable.
Each hypothesis encompasses multiple models, testing across the demographic variables: race, ethnicity, sex, age, and poverty status.
The original study is observational, with the exploratory objective of determining “whether COVID-19 incidence is significantly greater in counties containing higher percentages of socio-demographically disadvantaged [people with disabilities], based on their race, ethnicity, poverty status, age, and biological sex” (Chakraborty 2021).
In the original study, 18 implicit bivariate hypotheses are tested for correlation between COVID-19 cumulative incidence rates and specific categories of PwDs at the county level. Although the original publication does not state null hypotheses for each bivariate correlation, we may formulate the original research hypotheses (OR-H) as follows:
OR-H1.1: There is no correlation between the COVID-19 incidence rate and the percentage of people with disabilities at the county level. OR-H1.2: There is no correlation between the COVID-19 incidence rate and the percentage of white people with disabilities at the county level. … OR-H1.18 There is no correlation between the COVID-19 incidence rate and the percentage of female people with disabilities at the county level.
Five multi-variate hypotheses are tested for associations between COVID-19 cumulative incidence rates and subgroups of PwDs at the county level. Although the original publication does not state null hypotheses for each model, we may formulate them as follows:
OR-H2.1: The percentages of people with disability, categorized by race, are not associated with COVID-19 incidence at the county level when accounting for the state and risk level of COVID-19 clusters. … OR-H2.5: The percentages of people with disability, categorized by gender, are not associated with COVID-19 incidence at the county level when accounting for the state and risk level of COVID-19 clusters.
There is no randomization in the original study.
Key words
: COVID-19, Disability, Intersectionality,
Race/ethnicity, Poverty, ReproducibilitySubject
: select from the BePress
TaxonomyDate created
: 3/3/25Date modified
: 3/10/25Spatial Coverage
: Continental United States (48
contiguous states and Washington D.C.)Spatial Resolution
: US CountiesSpatial Reference System
: Contiguous USA Albers Equal
Area projection EPSG:5070Temporal Coverage
: From 1/22/2020 (when John Hopkins
began collecting the data) to 8/1/2020 (when the data was retrieved for
the original study)Temporal Resolution
: Rates were collected as one
temporal unit.The study was originally conducted using SaTScan software to
implement the Kulldorff spatial scan statistic. Other software are not
specified in the publication; however data files suggest and
communication with the author verifies that spatial analysis and mapping
was conducted in ArcGIS, generalized estimating equation (GEE) models
were calculated in SPSS, and the SaTScan software version was
9.6
.
This reproduction study uses R, including the SpatialEpi package for the Kulldorff spatial scan statistics and the geepack package for GEE models.
The American Community Survey (ACS) five-year estimate (2014-2018) variables used in the study are outlined in the table below. Details on ACS data collection can be found at https://www.census.gov/topics/health/disability/guidance/data-collection-acs.html and details on sampling methods and accuracy can be found at https://www.census.gov/programs-surveys/acs/technical-documentation/code-lists.html.
Variable Name in Study | ACS Variable name |
---|---|
percent of total civilian non-institutionalized population with a disability | S1810_C03_001E |
Race | |
percent w disability: White alone | S1810_C03_004E |
percent w disability: Black alone | S1810_C03_005E |
percent w disability: Native American | S1810_C03_006E |
percent w disability: Asian alone | S1810_C03_007E |
percent w disability: Other race | S1810_C03_009E |
Ethnicity | |
percent w disability: Non-Hispanic White | S1810_C03_0011E |
percent w disability: Hispanic | S1810_C03_012E |
percent w disability: Non-Hispanic non-White | (S1810_C02_001E - S1810_C02_011E - S1810_C02_012E) / (S1810_C01_001E - S1810_C01_011E - S1810_C01_012E) * 100 |
percent w disability: Other race | S1810_C03_009E |
Poverty | |
percent w disability: Below poverty level | (C18130_004E + C18130_011E + C18130_018E) / C18130_001E * 100 |
percent w disability: Above poverty level | (C18130_005E + C18130_012E + C18130_019E) / C18130_001E * 100 |
Age | |
percent w disability: 5-17 | S1810_C03_014E |
percent w disability: 18-34 | S1810_C03_015E |
percent w disability: 35-64 | S1810_C03_016E |
percent w disability: 65-74 | S1810_C03_017E |
percent w disability: 75+ | S1810_C03_018E |
Biological sex | |
percent w disability: male | S1810_C03_001E |
percent w disability: female | S1810_C03_003E |
American Community Survey (ACS) data for sociodemographic
subcategories of people with disabilities can be accessed by using the
tidycensus
package to query the Census API. This requires
an API key which can be acquired at api.census.gov/data/key_signup.html.
The original study extent is the lower 48 states and Washington D.C. Therefore, Alaska, Hawai’i and Puerto Rico are removed from the data. Data on people with disabilities in poverty is derived from a different census table (C18130) than data on people with disabilities and age, race, ethnicity, age, and biological sex (S1810). Therefore, join the poverty data to the other data using the GEOID. Also transform the ACS geographic data into Contiguous USA Albers Equal Area projection and fix geometry errors.
Optionally, save the raw ACS data to
data/raw/public/acs.gpkg
for use in GIS software.
Calculate independent socio-demographic variables of people with disabilities as percentages for each sub-category of disability (race, ethnicity, poverty, age, and biological sex) and remove raw census data from the data frame. Reproject the data into an Albers equal area conic projection.
Data on COVID-19 cases from the Johns Hopkins University dashboard
have been provided directly with the research compendium because the
data is no longer available online in the state in which it was
downloaded on August 1, 2020. The dashboard and cumulative counts of
COVID-19 cases and deaths were continually updated, so an exact
reproduction required communication with the original author, Jayajit
Chakraborty, for assistance with provision of data from August 1, 2020.
The data includes an estimate of the total population
(POP_ESTIMA
) and confirmed COVID-19 cases
(Confirmed
). The COVID-19 case data expresses cumulative
count of reported COVID-19 from 1/22/2020 to 8/1/2020. Although metadata
for this particular resource is no longer available from the original
source, one can reasonably assume that the total population estimate was
based on the 2014-2018 5-year ACS estimate, as the 2019 estimates data
had not been released yet.
Versions of the data can be found at the John Hopkins CCSE COVID-19 Data Repository (https://github.com/CSSEGISandData/COVID-19). However, archived data only provides summaries at the national scale. We received the COVID-19 case data through 8/1/2020 at the county level from the author, as there is no readily apparent way to access archived data from the Johns Hopkins University Center for Systems Science Engineering database.
Calculate the COVID incidence rate as the cases per 100,000 people. Convert the COVID data to a non-geographic data frame.
Join dependent COVID data to independent ACS demographic data.
Deviation: There is one county with missing disability and poverty data. This was not mentioned in the original study or in our pre-analyis plan. However, we replace the missing data with zeros, producing results identical to Chakraborty’s.
fips | statefp | county | county_st | covid_rate | dis_pct | white_pct | black_pct | native_pct | asian_pct | other_pct | non_hisp_white_pct | hisp_pct | non_hisp_non_white_pct | bpov_pct | apov_pct | pct_5_17 | pct_18_34 | pct_35_64 | pct_65_74 | pct_75 | male_pct | female_pct | pop | cases | x | y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
35039 | 35 | Rio Arriba | Rio Arriba County, New Mexico | 751.17 | 16.06467 | 10.77458 | 0.038371 | 2.744807 | 0.038371 | 2.468536 | 2.355981 | 11.39619 | 2.312494 | NA | NA | 0.3069682 | 1.25857 | 6.781439 | 3.391998 | 4.279648 | 8.556738 | 7.50793 | 39006 | 293 | -106.6932 | 36.50962 |
Calculate descriptive statistics for dependent COVID-19 rate and independent socio-demographic characteristics, reproducing the min, max, mean, and SD columns of original study table 1.
Deviation: We also calculate the Shapiro Wilk test for normality.
min | max | mean | SD | ShapiroWilk | p | |
---|---|---|---|---|---|---|
covid_rate | 0.00 | 14257.17 | 966.90 | 1003.96 | 0.74 | 0 |
dis_pct | 3.83 | 33.71 | 15.95 | 4.40 | 0.98 | 0 |
white_pct | 0.85 | 33.26 | 13.55 | 4.63 | 0.98 | 0 |
black_pct | 0.00 | 20.70 | 1.48 | 2.66 | 0.61 | 0 |
native_pct | 0.00 | 13.74 | 0.28 | 0.94 | 0.28 | 0 |
asian_pct | 0.00 | 3.45 | 0.09 | 0.18 | 0.51 | 0 |
other_pct | 0.00 | 15.24 | 0.55 | 0.65 | 0.57 | 0 |
non_hisp_white_pct | 0.10 | 33.16 | 12.84 | 4.81 | 0.99 | 0 |
hisp_pct | 0.00 | 25.26 | 0.99 | 2.15 | 0.42 | 0 |
non_hisp_non_white_pct | 0.00 | 20.93 | 2.13 | 2.75 | 0.70 | 0 |
bpov_pct | 0.00 | 14.97 | 3.57 | 1.85 | 0.93 | 0 |
apov_pct | 0.00 | 27.30 | 12.48 | 3.06 | 0.99 | 0 |
pct_5_17 | 0.00 | 5.08 | 1.03 | 0.48 | 0.95 | 0 |
pct_18_34 | 0.00 | 5.59 | 1.56 | 0.67 | 0.96 | 0 |
pct_35_64 | 1.01 | 18.36 | 6.35 | 2.30 | 0.96 | 0 |
pct_65_74 | 0.00 | 12.73 | 3.09 | 1.16 | 0.95 | 0 |
pct_75 | 0.00 | 11.13 | 3.87 | 1.19 | 0.97 | 0 |
male_pct | 1.30 | 18.19 | 8.06 | 2.37 | 0.98 | 0 |
female_pct | 1.91 | 19.94 | 7.90 | 2.26 | 0.98 | 0 |
Compare reproduced descriptive statistics to original descriptive statistics. Difference is calculated as ‘reproduction study - original study’. Identical results will result in zero.
min | max | mean | SD | |
---|---|---|---|---|
covid_rate | 0 | 0.17 | -0.1 | -0.04 |
dis_pct | 0 | 0.00 | 0.0 | 0.00 |
white_pct | 0 | 0.00 | 0.0 | 0.00 |
black_pct | 0 | 0.00 | 0.0 | 0.00 |
native_pct | 0 | 0.00 | 0.0 | 0.00 |
asian_pct | 0 | 0.00 | 0.0 | 0.00 |
other_pct | 0 | 0.00 | 0.0 | 0.00 |
non_hisp_white_pct | 0 | 0.00 | 0.0 | 0.00 |
hisp_pct | 0 | 0.00 | 0.0 | 0.00 |
non_hisp_non_white_pct | 0 | 0.00 | 0.0 | 0.00 |
bpov_pct | 0 | 0.00 | 0.0 | 0.00 |
apov_pct | 0 | 0.00 | 0.0 | 0.00 |
pct_5_17 | 0 | 0.00 | 0.0 | 0.00 |
pct_18_34 | 0 | 0.00 | 0.0 | 0.00 |
pct_35_64 | 0 | 0.00 | 0.0 | 0.00 |
pct_65_74 | 0 | 0.00 | 0.0 | 0.00 |
pct_75 | 0 | 0.00 | 0.0 | 0.00 |
male_pct | 0 | 0.00 | 0.0 | 0.00 |
female_pct | 0 | 0.00 | 0.0 | 0.00 |
The descriptive statistics are identical, except that the original study seems to have rounded the COVID-19 statistics to zero decimal places.
The county-level Pearson’s rho correlation coefficient was used to test association between intra-categorical rates of disability and COVID-19 incidence rates. As this was a parametric test, normality should be tested. A separate hypothesis was formulated for disability in aggregate and for each sociodemographic disability characteristic. Calculating this value is solely exploratory, as it does not control for spatial dependencies, like in later calculations.
Calculate Pearson’s R Correlation Coefficient of each independent variable and the COVID-19 incidence rate, reproducing the Pearson’s R column of original study Table 1.
Compare the reproduced Pearson’s r correlation coefficients
to the original study’s Pearson’s r correlation coefficients.
Stars indicates the significance level with two stars for
p < 0.01
and one star for p < 0.05
.
Correlation difference rp_r_diff
is calculated between the
reproduction study rp_r
and original study
or_r
as rp_r_diff = rp_r - or_r
Direction
difference rp_dir_diff
is calculated as
(rp_r > 0) - (or_r > 0)
, giving 0
if
both coefficients have the same direction, 1
if the
reproduction is positive and the original is negative, and
-1
if the reproduction is negative but the original is
positive.
Variable | R | Sig. Level | R | Sig. Level | R | Sig. Level | Direction |
---|---|---|---|---|---|---|---|
dis_pct | -0.056 | 2 | -0.060 | 2 | -0.004 | 0 | 0 |
white_pct | -0.326 | 2 | -0.332 | 2 | -0.006 | 0 | 0 |
black_pct | 0.456 | 2 | 0.460 | 2 | 0.004 | 0 | 0 |
native_pct | 0.020 | 0 | 0.019 | 0 | -0.001 | 0 | 0 |
asian_pct | 0.097 | 2 | 0.094 | 2 | -0.003 | 0 | 0 |
other_pct | 0.028 | 0 | 0.026 | 0 | -0.002 | 0 | 0 |
non_hisp_white_pct | -0.355 | 2 | -0.361 | 2 | -0.006 | 0 | 0 |
hisp_pct | 0.119 | 2 | 0.119 | 2 | 0.000 | 0 | 0 |
non_hisp_non_white_pct | 0.439 | 2 | 0.442 | 2 | 0.003 | 0 | 0 |
bpov_pct | 0.108 | 2 | 0.106 | 2 | -0.002 | 0 | 0 |
apov_pct | -0.146 | 2 | -0.151 | 2 | -0.005 | 0 | 0 |
pct_5_17 | 0.083 | 2 | 0.084 | 2 | 0.001 | 0 | 0 |
pct_18_34 | 0.066 | 1 | 0.063 | 2 | -0.003 | 1 | 0 |
pct_35_64 | -0.005 | 0 | -0.008 | 0 | -0.003 | 0 | 0 |
pct_65_74 | -0.089 | 2 | -0.091 | 2 | -0.002 | 0 | 0 |
pct_75 | -0.181 | 2 | -0.186 | 2 | -0.005 | 0 | 0 |
male_pct | -0.131 | 2 | -0.134 | 2 | -0.003 | 0 | 0 |
female_pct | 0.028 | 0 | 0.023 | 0 | -0.005 | 0 | 0 |
The reproduced values were very similar in value to the original values, only deviating by as much as 0.006. Significance levels and direction were found to all be equal, showing a very similar results value from the original to the reproduced.
This is not ideal for reproducing the original results, likely this difference comes from the result of rows in the original dataset getting flipped in Excel, meaning some COVID rates belonged to other counties.
The methods in the original study call for geographic clustering of counties to control for spatial dependence. These groups are made according to the Kulldorff spatial scan statistic for geographic clusters of high COVID-19 incidence.
The original study uses SaTScan software to implement the Kulldorff
spatial scan statistic model. In SaTScan, models may be specified with
many parameters having significant implications for results. The
original manuscript only specifies that Poisson model should be used.
The scan statistic in SaTScan used spherical great circle distance
calculations based upon the latitude and longitude coordinates of the
centroid of each county. For this purpose, we have used the
X
and Y
attributes provided as geographic
variables with ACS data.
Deviation: We opted to use the SpatialEpi package in R, selecting open source software with R integration over SatSCan software, which is free but not open. The Kulldorff spatial scan statistic model in SpatialEpi also supports a discrete Poisson spatial model, and uses the GINI coefficient to select secondary clusters with no geographical overlap that maximize the difference between locations inside of clusters and locations outside of clusters. We expected that this set of software options could reproduce identical results compared to SaTScan.
## [1] Run time: 4.85 minutes
## [1] Most likely cluster:
## $location.IDs.included
## [1] 1824 1835 1797 1818 1825 1749 1854 1742 1837 1747 1838 280 1760 1846 1756
##
## $population
## [1] 16949211
##
## $number.of.cases
## [1] 469091
##
## $expected.cases
## [1] 233805.6
##
## $SMR
## [1] 2.006329
##
## $log.likelihood.ratio
## [1] 97983.07
##
## $monte.carlo.rank
## [1] 1
##
## $p.value
## [1] 0.001
## [1] Number of Secondary clusters: 136
Deviation In this study, clustering will be decided more simply, involving only a single method of clustering, risk levels are included for all counties in the cluster rather than the center, and clusters are treated as a single category rather than a tiered system.
GEE modeling requires specific clustering to correctly control for spatial dependence. We will use our SpatialEpi clusters to get (non-identical) similar clusters to those of the original study using SaTScan.
First, cluster classes must be defined in 6 tiers, then combined with state ID codes to create unique cluster IDs to code for spatial relation.
Risk is classified according to this table:
Relative Risk Values | Relative Risk Class |
---|---|
Outside of cluster | 1 |
RR < 1 | 1 |
1 <= RR < 2 | 2 |
2 <= RR < 3 | 3 |
3 <= RR < 4 | 4 |
4 <= RR < 5 | 5 |
5 <= RR < 6 | 6 |
Counties falling outside of any cluster are assigned a score of 1.
Deviation SpatialEpi does not directly calculate relative risk on a local or cluster level, so it must be calculated on its own.
Deviation, in order to best replicate the process from Chakraborty’s study, we must find z-scores then filter for non-zero COVID rates.
Compare redults gathered in this method to the original results of this section from the study.s
## Summary of difference between reproduction independent variables and original independent variables
## Mean:
## z_white_pct z_black_pct z_native_pct
## 0 0 0
## z_asian_pct z_other_pct z_non_hisp_white_pct
## 0 0 0
## z_hisp_pct z_non_hisp_non_white_pct z_bpov_pct
## 0 0 0
## z_apov_pct z_pct_5_17 z_pct_18_34
## 0 0 0
## z_pct_35_64 z_pct_65_74 z_pct_75
## 0 0 0
## z_male_pct z_female_pct
## 0 0
##
## Standard deviation:
## z_white_pct z_black_pct z_native_pct
## 0.000 0.000 0.000
## z_asian_pct z_other_pct z_non_hisp_white_pct
## 0.001 0.001 0.000
## z_hisp_pct z_non_hisp_non_white_pct z_bpov_pct
## 0.000 0.000 0.000
## z_apov_pct z_pct_5_17 z_pct_18_34
## 0.000 0.001 0.001
## z_pct_35_64 z_pct_65_74 z_pct_75
## 0.000 0.000 0.000
## z_male_pct z_female_pct
## 0.000 0.000
After changing the order to first z-score standardize and then filter for COVID rates > 0, we observed no mean difference between our reproduced variables and the original variables, and we find no standard deviation > 0.001 for the difference between reproduction independent variables and original variables. There are no major differences between the independent variables.
Save data to file so it can be loaded from there.
The generalized estimating equation (GEE) models work to find associations between COVID and disability rates, accounting for spatial clustering.
As specified by the author, “GEEs extend the generalized linear model to accommodate clustered data, in addition to relaxing several assumptions of traditional regression (i.e., normality)”. Additionally, the author noted that “clusters of observations must be defined based on the assumption that observations within a cluster are correlated while observations from different clusters are independent.” All five GEE models were specified with exchangeable correlation matrices, gamma distributions, and logarithmic link function.
In Chakraborty’s original study, he uses these GEE parameters to best display association.:
“The ‘exchangeable’ correlation matrix was selected for the results reported here, since this specification yielded the best statistical fit based on the QIC (quasi- likelihood under the independence) model criterion.” (Chakraborty 2021, Methods paragraph 5)
“The gamma distribution with logarithmic link function was chosen for all GEEs since this model specification provided the lowest QIC value.” (Chakraborty 2021, Methods paragraph 5)
First, get the original results of the GEE model:
term | estimate | std.error | conf.low | conf.high | stars | p.value |
---|---|---|---|---|---|---|
Race model intercept | 6.756 | 0.020 | 6.716 | 6.795 | 2 | 0.000 |
White | -0.197 | 0.024 | -0.245 | -0.149 | 2 | 0.000 |
Black | 0.269 | 0.020 | 0.229 | 0.308 | 2 | 0.000 |
Native American | 0.022 | 0.023 | -0.023 | 0.066 | 0 | 0.339 |
Asian | 0.057 | 0.020 | 0.017 | 0.096 | 2 | 0.005 |
Other race | 0.089 | 0.018 | 0.053 | 0.124 | 2 | 0.000 |
Ethnicity model intercept | 6.743 | 0.020 | 6.705 | 6.782 | 2 | 0.000 |
Non-Hispanic White | -0.224 | 0.024 | -0.272 | -0.176 | 2 | 0.000 |
Hispanic | 0.120 | 0.016 | 0.088 | 0.152 | 2 | 0.000 |
Non-Hispanic non-White | 0.270 | 0.019 | 0.233 | 0.308 | 2 | 0.000 |
Poverty model intercept | 6.816 | 0.021 | 6.774 | 6.858 | 2 | 0.000 |
Below poverty level | 0.223 | 0.029 | 0.166 | 0.280 | 2 | 0.000 |
Above poverty level | -0.284 | 0.031 | -0.345 | -0.223 | 2 | 0.000 |
Age model intercept | 6.822 | 0.021 | 6.780 | 6.864 | 2 | 0.000 |
Age 5-17 | 0.058 | 0.023 | 0.012 | 0.103 | 1 | 0.013 |
Age 18-34 | 0.038 | 0.035 | -0.031 | 0.106 | 0 | 0.279 |
Age 35-64 | 0.003 | 0.041 | -0.077 | 0.084 | 0 | 0.942 |
Age 65-74 | -0.033 | 0.041 | -0.113 | 0.046 | 0 | 0.412 |
Age 75 or more | -0.156 | 0.027 | -0.209 | -0.102 | 2 | 0.000 |
Biological sex model intercept | 6.814 | 0.021 | 6.772 | 6.856 | 2 | 0.000 |
Male | -0.378 | 0.039 | -0.456 | -0.301 | 2 | 0.000 |
Female | 0.279 | 0.042 | 0.197 | 0.361 | 2 | 0.000 |
Then, compare the rseults of the two methods:
Her, many of the values gathered from the SpatialEpi Clustering and GEE model differ from the original results. While the same conclusions can be drawn, it is important to note that running a GEE model through a different service will have different coefficient values. Most likely, SATScan has different parameters in their proprietary modeling functions that cause differing results to that of SpatialEpi.
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: instead of `style = "equal"`, use fill.scale =
## `tm_scale_intervals()`.
## ℹ Migrate the argument(s) 'style', 'n', 'midpoint', 'palette' (rename to
## 'values') to 'tm_scale_intervals(<HERE>)'
## For small multiples, specify a 'tm_scale_' for each multiple, and put them in a
## list: 'fill'.scale = list(<scale1>, <scale2>, ...)'
## [v3->v4] `tm_polygons()`: use `col_alpha` instead of `border.alpha`.
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "RdBu" is named
## "brewer.rd_bu"
## Multiple palettes called "rd_bu" found: "brewer.rd_bu", "matplotlib.rd_bu". The first one, "brewer.rd_bu", is returned.
Calculate variance, covariance, correlation, and weights
Map each county’s weight in GEE model.
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: instead of `style = "jenks"`, use fill.scale =
## `tm_scale_intervals()`.
## ℹ Migrate the argument(s) 'style', 'midpoint', 'palette' (rename to 'values')
## to 'tm_scale_intervals(<HERE>)'
## For small multiples, specify a 'tm_scale_' for each multiple, and put them in a
## list: 'fill'.scale = list(<scale1>, <scale2>, ...)'
## [v3->v4] `tm_polygons()`: use `col_alpha` instead of `border.alpha`.
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "Oranges" is named
## "brewer.oranges"
## Multiple palettes called "oranges" found: "brewer.oranges", "matplotlib.oranges". The first one, "brewer.oranges", is returned.
This map will display each US County and its rate of COVID-19 Per 100,000 people.
Deviation This map will display each US County and its percentage of people over the age of 18 with a disability.
This table will display correlation values using Pearson’s R and Spearman’s Rho, designed to reproduce Chakraborty’s first table.
variable | r | t | p | rp_stars |
---|---|---|---|---|
dis_pct | -0.060 | 3.350 | 0.000 | 2 |
white_pct | -0.332 | 19.612 | 0.000 | 2 |
black_pct | 0.460 | 28.847 | 0.000 | 2 |
native_pct | 0.019 | 1.072 | 0.142 | 0 |
asian_pct | 0.094 | 5.272 | 0.000 | 2 |
other_pct | 0.026 | 1.460 | 0.072 | 0 |
non_hisp_white_pct | -0.361 | 21.545 | 0.000 | 2 |
hisp_pct | 0.119 | 6.686 | 0.000 | 2 |
non_hisp_non_white_pct | 0.442 | 27.429 | 0.000 | 2 |
bpov_pct | 0.106 | 5.914 | 0.000 | 2 |
apov_pct | -0.151 | 8.513 | 0.000 | 2 |
pct_5_17 | 0.084 | 4.688 | 0.000 | 2 |
pct_18_34 | 0.063 | 3.493 | 0.000 | 2 |
pct_35_64 | -0.008 | 0.460 | 0.323 | 0 |
pct_65_74 | -0.091 | 5.113 | 0.000 | 2 |
pct_75 | -0.186 | 10.541 | 0.000 | 2 |
male_pct | -0.134 | 7.519 | 0.000 | 2 |
female_pct | 0.023 | 1.305 | 0.096 | 0 |
pop | 0.128 | 7.215 | 0.000 | 2 |
cases | 0.209 | 11.891 | 0.000 | 2 |
x | 0.099 | 5.540 | 0.000 | 2 |
y | -0.412 | 25.195 | 0.000 | 2 |
Deviation This map will display each cluster as its own classification, illuminating the inner workings of GEE clustering.
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_tm_polygons()`: migrate the argument(s) related to the scale of
## the visual variable `fill` namely 'palette' (rename to 'values') to fill.scale
## = tm_scale(<HERE>).
## [v3->v4] `tm_polygons()`: use 'fill' for the fill color of polygons/symbols
## (instead of 'col'), and 'col' for the outlines (instead of 'border.col').
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## Multiple palettes called "oranges" found: "brewer.oranges", "matplotlib.oranges". The first one, "brewer.oranges", is returned.
##
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "-Oranges" is named
## "oranges" (in long format "brewer.oranges")
Deviation This map will display risk scores determined by each cluster based on COVID-19 rates, risk scores range from 1 to 6. Relative risk scores are calculated locally and by cluster, so both are shown here.
Local Relative Risk Scores:
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: instead of `style = "cat"`, use fill.scale =
## `tm_scale_categorical()`.
## ℹ Migrate the argument(s) 'palette' (rename to 'values'), 'labels' to
## 'tm_scale_categorical(<HERE>)'
## [v3->v4] `tm_polygons()`: use `col_alpha` instead of `border.alpha`.
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "Oranges" is named
## "brewer.oranges"
## Multiple palettes called "oranges" found: "brewer.oranges", "matplotlib.oranges". The first one, "brewer.oranges", is returned.
Cluster Relative Risk Scores
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: instead of `style = "cat"`, use fill.scale =
## `tm_scale_categorical()`.
## ℹ Migrate the argument(s) 'palette' (rename to 'values'), 'labels' to
## 'tm_scale_categorical(<HERE>)'
## [v3->v4] `tm_polygons()`: use `col_alpha` instead of `border.alpha`.
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "Oranges" is named
## "brewer.oranges"
## Multiple palettes called "oranges" found: "brewer.oranges", "matplotlib.oranges". The first one, "brewer.oranges", is returned.
This table will display results to reproduce Chakraborty’s second table, controlling for spatial correlation using the GEE clustering.
term | estimate | std.error | conf.low | conf.high | stars | p.value |
---|---|---|---|---|---|---|
Race model intercept | 6.755 | 0.020 | 6.716 | 6.795 | 2 | 0.000 |
White | -0.204 | 0.024 | -0.250 | -0.158 | 2 | 0.000 |
Black | 0.265 | 0.021 | 0.224 | 0.305 | 2 | 0.000 |
Native American | 0.020 | 0.023 | -0.024 | 0.065 | 0 | 0.368 |
Asian | 0.051 | 0.020 | 0.012 | 0.090 | 1 | 0.011 |
Other race | 0.086 | 0.018 | 0.050 | 0.122 | 2 | 0.000 |
Ethnicity model intercept | 6.742 | 0.020 | 6.703 | 6.781 | 2 | 0.000 |
Non-Hispanic White | -0.230 | 0.024 | -0.276 | -0.184 | 2 | 0.000 |
Hispanic | 0.115 | 0.016 | 0.083 | 0.146 | 2 | 0.000 |
Non-Hispanic non-White | 0.267 | 0.019 | 0.229 | 0.304 | 2 | 0.000 |
Poverty model intercept | 6.816 | 0.021 | 6.774 | 6.858 | 2 | 0.000 |
Below poverty level | 0.219 | 0.029 | 0.162 | 0.277 | 2 | 0.000 |
Above poverty level | -0.285 | 0.031 | -0.346 | -0.224 | 2 | 0.000 |
Age model intercept | 6.823 | 0.022 | 6.781 | 6.865 | 2 | 0.000 |
Age 5-17 | 0.063 | 0.023 | 0.018 | 0.108 | 2 | 0.006 |
Age 18-34 | 0.032 | 0.035 | -0.036 | 0.101 | 0 | 0.355 |
Age 35-64 | -0.005 | 0.041 | -0.086 | 0.076 | 0 | 0.903 |
Age 65-74 | -0.025 | 0.041 | -0.105 | 0.055 | 0 | 0.543 |
Age 75 or more | -0.160 | 0.027 | -0.214 | -0.107 | 2 | 0.000 |
Biological sex model intercept | 6.816 | 0.021 | 6.774 | 6.858 | 2 | 0.000 |
Male | -0.368 | 0.039 | -0.445 | -0.292 | 2 | 0.000 |
Female | 0.266 | 0.041 | 0.185 | 0.347 | 2 | 0.000 |
Once results are found, compare table values with the table values in the original study using SATScan. Success will find the same values as those found in Chakrabory’s original study. A failure would involve producing significantly different values, or even a study that finds a different hypothesis to be true.
The results shown from the results figures would indicate a success for this reproduction, although this process had many deviations, some unplanned and others not, that allowed for this workflow to be recreated in RStudio.
Our first deviation came from an error in the collected data for this reproduction, where not all counties were given disability and poverty statistics.
Most planned deviations were to provide more depth to the study, without changing the fundamental workflow. Changes like these include making more figures to make the process of the study more clear, or including numbers for the Shapiro Wilk test for Normality.
Most significantly, switching to SpatialEpi over SATScan resulted in slightly different coefficients across the board, and also meant we had to self-calculate relative risk scores.
For the differences in how clustering was determined, opting for a more simple approach, the belief was that hierarchical clustering may deliver cluster relative risk scores that were too sensitive or inaccurate to other lower-order clusters.
We find similar results to Chakraborty in our own reproduction. COVID Risk rates are positively correlated with higher disability rates among traditionally marginalized groups. While the results are not identical, the conclusions that can be made remain the same.
Differences in this reproduction to the original study largely stem from deviations in our own methods, opting for methods that are more available to the general public and offer a more streamlined workflow. Still, the biggest differences in the two studies come from the GEE model results, particularly the coefficient values for Black disability rate, Non Hispanic Non White Disability rate, below poverty disability rate, and female disability rate.
This report is based upon the template for Reproducible and Replicable Research in Human-Environment and Geographical Sciences, DOI:[10.17605/OSF.IO/W29MQ](https://doi.org/10.17605/OSF.IO/W29MQ)