Abstract

The Social Vulnerability Index (SoVI) is a widely used tool for assessing community vulnerability to natural hazards by aggregating social and demographic variables. This study evaluates the relationship between SoVI and disaster outcomes, specifically property damage per capita from Hurricane Harvey, using American Community Survey (ACS) and Spatial Hazard Events and Losses Database for the United States (SHELDUS) data. SoVI was calculated at the county level using both national and regional datasets, followed by ordinary least squares regression to analyze its correlation with property damage. Results indicate a weak positive relationship between SoVI and property damage per capita, but the relationship was not statistically significant in either dataset. Challenges in replicating SoVI included ambiguities in variable selection, subjectivity in determining component cardinality, and concerns about potential inconsistencies and spatial granularity in SHELDUS data. These findings suggest that while SoVI provides useful insights into social vulnerability, its application as a predictor of disaster outcomes is limited without accounting for additional factors. The study highlights the need for improved methods to refine SoVI calculations and validate its effectiveness.

1 Introduction

1.1 Background

The Social Vulnerability Index (SoVI) aims to measure the vulnerability of communities to natural hazards by examining various social and demographic factors to assist policymakers and planners in risk assessment and disaster management with a standardized and repeatable method (Cutter & Finch, 2008). SoVI aggregates various variables, such as income, age, race, transportation access, and disability status, typically gathered from census data, to create a composite score, which is then applied to assess the susceptibility of communities to disasters. Since many variables and complex datasets are involved in this process, principal component analysis (PCA) is used to reduce complex data sets into lower dimensions to reveal the hidden, simplified structures that often underlie them (Shlens, 2014).
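As a toy illustration of this dimensionality reduction (using base R's prcomp rather than the full SoVI procedure), two strongly correlated indicators collapse onto a single dominant component:

```r
# Toy illustration: two highly correlated indicators collapse onto one component
set.seed(42)
income  <- rnorm(100)                        # synthetic indicator
poverty <- -income + rnorm(100, sd = 0.1)    # near-mirror of the first indicator

pc <- prcomp(data.frame(income, poverty), scale. = TRUE)

# The first principal component captures nearly all of the variance
summary(pc)$importance["Proportion of Variance", "PC1"]
```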

Since its creation, SoVI has been a critical tool in disaster risk management in the U.S. and abroad (Bronfman et al., 2021; Fekete, 2009). However, it still has some important drawbacks. Social vulnerability cannot be directly measured and quantified, so it must be inferred from indirect indicators. Distilling these complex and diverse social variables into a single index through PCA raises issues with its internal consistency (whether different measurements produce similar results) (Spielman et al., 2020) and its sensitivity to variable selection and geographic context (Flanagan et al., 2021; Rufat et al., 2019). The social vulnerability of a community may also differ based on the type of hazard encountered; it may not be a one-size-fits-all index (Rufat et al., 2019; Spielman et al., 2020; Tellman et al., 2020). Finally, spatial scale may influence the index, as SoVI may differ at the regional, county, or local level (Hinojos et al., 2023). Given the widespread use of SoVI by policymakers and planners, combined with these challenges and concerns, it is critical that SoVI be verified and validated. Despite SoVI's widespread acceptance as a predominant tool, validating its assessments remains a research challenge. Various methods have been used in attempts to validate SoVI, such as spatial autocorrelation, post-hazard analysis, and regression models, but there is no universally accepted standard. The inability to validate such a critical tool underscores the need for continued research into validation methods that can be applied consistently.

1.2 Validation Methods

Various validation methods have been used since SoVI's inception to test the validity of the index. One method is to utilize spatial autocorrelation, such as Moran's I and Local Indicators of Spatial Association (LISA), due to their ability to quantify and visualize spatial patterns. The use of spatial autocorrelation showed that social vulnerability is not randomly distributed but instead clustered; additionally, the research showed that vulnerability status in some areas remained persistent, while the status of other areas changed over time with shifting demographics and socioeconomic conditions (Bronfman et al., 2021; Park & Xu, 2020). The use of spatial autocorrelation also showed that using localized variables and considering higher spatial resolution (i.e., block groups or census tracts versus counties or their equivalent outside of the U.S.) to calculate SoVI could be more effective in capturing areas at risk (Hinojos et al., 2023; Park & Xu, 2021). Regarding indicators, research found income, housing, race, and age were key factors in determining vulnerability.

Post-hazard analysis is another method used to validate and verify social vulnerability. These studies utilized surveys to assess the impacts of natural hazards on different communities: one in Houston, TX, after Hurricane Harvey (Griego et al., 2020) and the other in Germany, focused on significant river flooding in 2002 (Fekete, 2009). These studies found the survey approach helped identify variables, such as income, housing quality, and age, that may reveal patterns of social vulnerability within communities affected by natural disasters. While these surveys effectively highlighted a correlation between these variables and disaster impacts, they did not necessarily establish causation (Fekete, 2009). These studies also faced the potential risk of recall bias in the household surveys, where participants might not accurately remember or report the details of their experiences during the disaster, leading to inaccuracies and inconsistencies (Griego et al., 2020). These challenges were compounded by the fieldwork required and the difficulty of securing enough participants to produce a suitable sample size.

A final common method to validate and verify social vulnerability indexes is to utilize regression models with post-hazard empirical data (i.e., damage and fatality assessments). Various regression methods have been utilized, such as ordinary least squares (OLS) (Rufat et al., 2019; Yoon, 2012), logistic regression (Rufat et al., 2019), and zero-inflated negative binomial (ZINB) regression (Zahran et al., 2008). Previous research validated SoVI using post-disaster data from the Federal Emergency Management Agency (FEMA), which records requests for disaster assistance, or the Spatial Hazard Events and Losses Database for the United States (SHELDUS), which records property damage in dollars and fatalities, aiming to determine whether specific vulnerability factors (e.g., income, race) correlated with disaster outcomes such as FEMA assistance or property damage and fatalities. While each of these studies found a correlation between SoVI variables (specifically income, housing quality, race, and age) and disaster outcomes, they also identified challenges. One primary challenge is data availability, with data often found to be incomplete, inconsistent, or not at an appropriate resolution (Tellman et al., 2020; Yoon, 2012; Zahran et al., 2008). Additionally, researchers determined that what may apply to one type of hazard or geographic location may not apply to other types of hazards or locations (Rufat et al., 2019; Tellman et al., 2020; Zahran et al., 2008).

1.3 Research Intent

While SoVI is a widely accepted index for identifying socially vulnerable areas in our communities, it is an indirect measurement that lacks a consistent method for validation and verification. Previous research has attempted to validate SoVI, each approach having strengths and weaknesses, but no standard method exists. Of the approaches investigated, those using post-disaster outcomes and regression analysis seem the most promising, as they utilize readily available empirical data and proven regression methods.

The intent of this research is to build upon these methods. It will first calculate SoVI at the county level using American Community Survey (ACS) data, as well as data from other sources where applicable, first for the entire United States and then using data from only Texas, Louisiana, and Mississippi. From there, SHELDUS data from Hurricane Harvey along the Gulf Coast will be used to compare SoVI values to property damage. By analyzing how, or whether, SoVI correlates with a disaster outcome, this research will attempt to determine how well SoVI captures actual social vulnerability.
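As a minimal sketch of the planned comparison, using hypothetical data and placeholder names (sovi, damage_pc) rather than the actual analysis objects, an OLS fit in R looks like:

```r
# Hypothetical sketch: regress per-capita property damage on SoVI scores
set.seed(1)
counties <- data.frame(
  sovi      = rnorm(50),               # placeholder SoVI scores
  damage_pc = rexp(50, rate = 0.01)    # placeholder damage per capita ($)
)

fit <- lm(damage_pc ~ sovi, data = counties)   # ordinary least squares
summary(fit)$coefficients                      # slope, std. error, t, p-value
```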

2 Data

Two primary datasets were used in this research: the American Community Survey (ACS) and the Spatial Hazard Events and Losses Database for the United States (SHELDUS).

2.1 Social Vulnerability Data

The ACS is an ongoing survey conducted by the U.S. Census Bureau that provides essential data on a wide range of demographic, social, economic, and housing characteristics of the U.S. population. For calculating the Social Vulnerability Index (SoVI), ACS data is crucial because it supplies detailed information on factors like income, age, race, housing, and transportation access — all of which can be key indicators of a community’s vulnerability. For the purposes of this research, data from the ACS 5-Year Estimates at the county level were used, specifically for 2017 and 2014, the latter of which was used to validate the SoVI calculation method.

In addition to the ACS data for SoVI, hospital locations were acquired from the Homeland Infrastructure Foundation-Level Data (HIFLD) portal. Unfortunately, the only available data was for 2024. This data was used to calculate the concentration of hospitals in each county.

2.2 Validation Data

The Spatial Hazard Events and Losses Database for the United States (SHELDUS), managed by Arizona State University, is a detailed database that records the impacts of natural hazards across U.S. counties. SHELDUS includes information on events such as hurricanes, floods, wildfires, and earthquakes, providing data on property damage, crop losses, injuries, and fatalities at the county level. For this research, SHELDUS data from Hurricane Harvey in 2017 was used. Of note, SHELDUS is not a free service.

3 Methods

3.1 Social Vulnerability Index Calculation

Prior to validating the Social Vulnerability Index (SoVI) using SHELDUS data, the index needed to be calculated using the method outlined by Cutter et al. (2008). This method involves using principal component analysis (PCA) to reduce the dimensionality of the data and create a composite index that represents social vulnerability. Throughout the literature reviewed, there was no description of how to calculate the index, no code was provided, nor is there a SoVI R package. One of the studies reviewed states that the SoVI data was purchased for its validation analysis (Tellman et al., 2020). There is, however, a "recipe" and "evolution" of the SoVI provided by the University of South Carolina College of Arts and Sciences.

Since the SoVI calculation method was not provided, the method was engineered based on the "recipe" and "evolution" provided by the University of South Carolina. To test the accuracy of the method, ACS data from 2014 was used and the results compared to the percent variance and component loadings provided in the "recipe". The results from the engineered method did not match those in the "recipe" in percent variance, loadings, or order. The engineered method did, however, produce an index, and since no other option was available, it was decided to move forward with it. Details on how SoVI was calculated are provided in the following sections, and thoughts on the discrepancies are provided in the discussion section.

3.1.1 2017 ACS Data Preparation

The first step in calculating SoVI was to download the ACS 5-Year Estimates data for 2017 from the Census Bureau at the county level using the tidycensus package in R. The ACS variables chosen were based on the "evolution" provided by the University of South Carolina and are the closest approximation of what is believed to be intended. This is an approximation because neither the "recipe" nor the "evolution" explicitly states which variables were used, leaving selection as a "best guess," which is addressed in the discussion section.

# Load required packages
library(tidycensus)
library(tidyverse)

# Set the Census API key (stored in api_key)
census_api_key(api_key, install = TRUE, overwrite = TRUE)

year <- 2017

# List of variables to retrieve
variables <- c(
  "B02001_005",  # 1.  Asian Alone (ASIAN) 
  "B02001_003",  # 2.  Black Alone (BLACK)
  "B03003_003",  # 3.  Hispanic or Latino (HISPANIC)
  "B02001_004",  # 4.  American Indian Alone (NATIVE)
  "B01001_003",  # 5a. Males under 5
  "B01001_020",  # 5b. Males 65-66
  "B01001_021",  # 5c. Males 67-69
  "B01001_022",  # 5d. Males 70-74
  "B01001_023",  # 5e. Males 75-79
  "B01001_024",  # 5f. Males 80-84
  "B01001_025",  # 5g. Males over 85
  "B01001_027",  # 5h. Females under 5
  "B01001_044",  # 5i. Females 65-66
  "B01001_045",  # 5j. Females 67-69
  "B01001_046",  # 5k. Females 70-74
  "B01001_047",  # 5l. Females 75-79
  "B01001_048",  # 5m. Females 80-84
  "B01001_049",  # 5n. Females over 85
  "B01001_001",  ##### Total Population (POP)
  "B09002_002",  # 6.  Children - Married Couples 
  "B09002_001",  ##### Total Children
  "B01002_001",  # 7.  Median Age
  "B19055_002",  # 8.  Households w/ SS Benefits
  "B11001_001",  ##### Total Households (HH)
  "B17001_002",  # 9.  Income in the past 12 months below poverty level
  "B17001_001",  ##### Total population for who poverty is determined
  "B19001_017",  # 10. Household income over $200,000
  ############   ##### Use Total Households for denominator
  "B19301_001",  # 11. Per Capita Income
  "C16002_004",  # 12a. Spanish - limited English
  "C16002_007",  # 12b. Other Indo-European - limited English
  "C16002_010",  # 12c. Asian - limited English
  "C16002_013",  # 12d. Other - limited English
  ############   ##### Use Total Households for denominator 
  "B01001_026",  # 13. Female
  ############   ##### Use Total Population for denominator
  "B11001_006",  # 14. Female Head of Household
  ############   ##### Use Total Households for denominator
  "B09019_038",  # 15. Pop in group quarters (not clear if this is just Nursing Home)
  ############   ##### Use Total Population for denominator
  ############   # 16. Hospitals - Derived from HIFLD
  "B27001_005",  # 17a. No Health Insurance - Male Under 6
  "B27001_008",  # 17b. No Health Insurance - Male 6-17
  "B27001_011",  # 17c. No Health Insurance - Male 18-24
  "B27001_014",  # 17d. No Health Insurance - Male 25-34
  "B27001_017",  # 17e. No Health Insurance - Male 35-44
  "B27001_020",  # 17f. No Health Insurance - Male 45-54
  "B27001_023",  # 17g. No Health Insurance - Male 55-64
  "B27001_026",  # 17h. No Health Insurance - Male 65-74
  "B27001_029",  # 17i. No Health Insurance - Male over 75
  "B27001_033",  # 17j. No Health Insurance - Female Under 6
  "B27001_036",  # 17k. No Health Insurance - Female 6-17
  "B27001_039",  # 17l. No Health Insurance - Female 18-24
  "B27001_042",  # 17m. No Health Insurance - Female 25-34
  "B27001_045",  # 17n. No Health Insurance - Female 35-44
  "B27001_048",  # 17o. No Health Insurance - Female 45-54
  "B27001_051",  # 17p. No Health Insurance - Female 55-64
  "B27001_054",  # 17q. No Health Insurance - Female 65-74
  "B27001_057",  # 17r. No Health Insurance - Female over 75
  "B27001_001",  #### Civilian non-institutionalized population
  "B15001_004",  # 18a. Education - Less than 9th grade - Male 18-24
  "B15001_005",  # 18b. Education - 9th-12th grade, no diploma - Male 18-24
  "B15001_012",  # 18c. Education - Less than 9th grade - Male 25-34
  "B15001_013",  # 18d. Education - 9th-12th grade, no diploma - Male 25-34
  "B15001_020",  # 18e. Education - Less than 9th grade - Male 35-44
  "B15001_021",  # 18f. Education - 9th-12th grade, no diploma - Male 35-44
  "B15001_028",  # 18g. Education - Less than 9th grade - Male 45-64
  "B15001_029",  # 18h. Education - 9th-12th grade, no diploma - Male 45-64
  "B15001_036",  # 18i. Education - Less than 9th grade - Make over 65
  "B15001_037",  # 18j. Education - 9th-12th grade, no diploma - Male over 65
  "B15001_045",  # 18k. Education - Less than 9th grade - Female 18-24
  "B15001_046",  # 18l. Education - 9th-12th grade, no diploma - Female 18-24
  "B15001_053",  # 18m. Education - Less than 9th grade - Female 25-34
  "B15001_054",  # 18n. Education - 9th-12th grade, no diploma - Female 25-34
  "B15001_061",  # 18o. Education - Less than 9th grade - Female 35-44
  "B15001_062",  # 18p. Education - 9th-12th grade, no diploma - Female 35-44
  "B15001_069",  # 18q. Education - Less than 9th grade - Female 45-64
  "B15001_070",  # 18r. Education - 9th-12th grade, no diploma - Female 45-64
  "B15001_077",  # 18s. Education - Less than 9th grade - Female over 65
  "B15001_078",  # 18t. Education - 9th-12th grade, no diploma -Female over 65
  "B15001_001",  ##### Total Population over 18
  "B23025_005",  # 19. Unemployed over 16
  "B23025_001",  ##### Total Labor Force over 16
  "B25001_001",  # 20. Total Housing Units (for People per unit)
  "B25003_003",  # 21. Renter Occupied Units
  "B25003_001",  ##### Total Occupied Units
  "B25077_001",  # 22. Median Value of Owner Occupied Units
  "B25064_001",  # 23. Median Gross Rent
  "B25024_010",  # 24. Mobile Homes
  "B25024_001",  ##### Total Housing Units
  "C24010_032",  # 25a. Construction and extraction - Male
  "C24010_068",  # 25b. Construction and extraction - Female
  "C24010_019",  # 26a. Service Occupations - Male
  "C24010_055",  # 26b. Service Occupations - Female
  "C24010_001",  ##### Civilian employed pop over 16
  "B23001_091",  # 27a. Females in armed forces - 16-19
  "B23001_093",  # 27b. Females employed - 16-19
  "B23001_098",  # 27c. Females in armed forces - 20-21
  "B23001_100",  # 27d. Females employed - 20-21
  "B23001_105",  # 27e. Females in armed forces - 22-24
  "B23001_107",  # 27f. Females employed - 22-24
  "B23001_112",  # 27g. Females in armed forces - 25-29
  "B23001_114",  # 27h. Females employed - 25-29
  "B23001_119",  # 27i. Females in armed forces - 30-34
  "B23001_121",  # 27j. Females employed - 30-34
  "B23001_126",  # 27k. Females in armed forces - 35-44
  "B23001_128",  # 27l. Females employed - 35-44
  "B23001_133",  # 27m. Females in armed forces - 45-54
  "B23001_135",  # 27n. Females employed - 45-54
  "B23001_140",  # 27o. Females in armed forces - 55-59
  "B23001_142",  # 27p. Females employed - 55-59 
  "B23001_147",  # 27q. Females in armed forces - 60-61
  "B23001_149",  # 27r. Females employed - 60-61
  "B23001_154",  # 27s. Females in armed forces - 62-64
  "B23001_156",  # 27t. Females employed - 62-64
  "B23001_161",  # 27u. Females employed - 65-69
  "B23001_166",  # 27v. Females employed - 70-74
  "B23001_171",  # 27w. Females employed - over 75
  "B23001_001",  ##### Population over 16
  "B25044_003",  # 28a. No vehicle - Housing Unit - Owner Occupied
  "B25044_010",  # 28b. No vehicle - Housing Unit - Renter Occupied
  ############   ##### Use Total Occupied units for denominator
  "B25002_003"  # 29. Unoccupied Units
  ############   ##### Use Total Housing Units for denominator
)

# Fetch data from Census
acs.raw <- get_acs(
  geography = "county", 
  variables = variables, 
  year = year, 
  survey = "acs5", 
  output = "wide"
)

In addition to the ACS data, hospital locations were acquired from the Homeland Infrastructure Foundation-Level Data (HIFLD) portal. These were then merged with county shapefiles in ArcGIS Pro to calculate hospital density (hospital count per county). The data was then joined to the ACS data and any counties with a population of 0 were removed.

# Import Hospital Density data created in ArcGIS Pro using HIFLD data
library(here)
hosp_dens <- read_csv(here("data", "hosp_dens.csv")) %>%
  select(GEOID, hosp_den)

# Add hospital density info to ACS data
acs.raw.combined <- left_join(acs.raw, hosp_dens, by = "GEOID")

# Removing rows where total population is 0
acs.raw.combined <- acs.raw.combined %>%
  filter(B01001_001E > 0)

With the required data downloaded, the actual variables for SoVI, as outlined in the "evolution", can be created and normalized, as discussed in the "recipe". Normalization is essential because it allows for the combination of variables that are on different scales and units, ensuring they contribute meaningfully and comparably to SoVI. The variables are then checked for missing values (i.e., NA), which are replaced with the mean of the variable.

### Create new dataframe with SoVI variables
# Normalized dataframe with geometry
acs.normal <- acs.raw.combined %>%
  transmute(
    GEOID = GEOID,                                            # GEOID
    NAME = NAME,                                              # County name
    QASIAN = B02001_005E / B01001_001E,                       # Percent Asian
    QBLACK = B02001_003E / B01001_001E,                       # Percent Black
    QHISP = B03003_003E / B01001_001E,                        # Percent Hispanic
    QNATAM = B02001_004E / B01001_001E,                       # Percent Native American
    QAGEDEP = (B01001_003E + B01001_020E + B01001_021E + B01001_022E + B01001_023E + 
                 B01001_024E + B01001_025E + B01001_027E + B01001_044E + B01001_045E + 
                 B01001_046E + B01001_047E + B01001_048E + B01001_049E) / B01001_001E,  # Percent below 5 and above 65
    QFAM = B09002_002E / B09002_001E,                         # Percent children in married-couple families
    MEDAGE = B01002_001E,                                     # Median Age
    QSSBEN = B19055_002E / B11001_001E,                       # Percent households with Social Security income
    QPOVTY = B17001_002E / B17001_001E,                       # Percent below poverty level
    QRICH = B19001_017E / B11001_001E,                        # Percent with income over $200,000
    PERCAP = B19301_001E,                                     # Per Capita Income
    QESL = (C16002_004E + C16002_007E + C16002_010E + C16002_013E) / B11001_001E,  # Percent limited English-speaking
    QFEMALE = B01001_026E / B01001_001E,                      # Percent Female
    QFHH = B11001_006E / B11001_001E,                         # Percent Female Head of Household
    QNRRES = B09019_038E / B01001_001E,                       # Percent in group quarters (nursing home proxy)
    HOSTPC = hosp_den,                                        # Hospital density (count per county)
    QNOHLTH = (B27001_005E + B27001_008E + B27001_011E + B27001_014E + B27001_017E + 
                 B27001_020E + B27001_023E + B27001_026E + B27001_029E + B27001_033E + 
                 B27001_036E + B27001_039E + B27001_042E + B27001_045E + B27001_048E + 
                 B27001_051E + B27001_054E + B27001_057E) / B27001_001E,  # Percent without health insurance
    QED12LES = (B15001_004E + B15001_005E + B15001_012E + B15001_013E + B15001_020E + 
                  B15001_021E + B15001_028E + B15001_029E + B15001_036E + B15001_037E + 
                  B15001_045E + B15001_046E + B15001_053E + B15001_054E + B15001_061E + 
                  B15001_062E + B15001_069E + B15001_070E + B15001_077E + B15001_078E) / B15001_001E,  # Percent with less than high school education
    QCVLUN = B23025_005E / B23025_001E,                       # Percent unemployed
    PPUNIT = B01001_001E / B25001_001E,                       # People per unit
    QRENTER = B25003_003E / B25003_001E,                      # Percent renter occupied
    MDHSEVAL = B25077_001E,                                   # Median value of owner-occupied units
    MDGRENT = B25064_001E,                                    # Median gross rent
    QMOHO = B25024_010E / B25024_001E,                        # Percent mobile homes
    QEXTRCT = (C24010_032E + C24010_068E) / C24010_001E,      # Percent in extraction occupations
    QSERV = (C24010_019E + C24010_055E) / C24010_001E,        # Percent in service occupations
    QFEMLBR = (B23001_091E + B23001_093E + B23001_098E + B23001_100E + B23001_105E + 
                 B23001_107E + B23001_112E + B23001_114E + B23001_119E + B23001_121E + 
                 B23001_126E + B23001_128E + B23001_133E + B23001_135E + B23001_140E + 
                 B23001_142E + B23001_147E + B23001_149E + B23001_154E + B23001_156E + 
                 B23001_161E + B23001_166E + B23001_171E) / B23001_001E,  # Percent Female in Labor Force
    QNOAUTO = (B25044_003E + B25044_010E) / B25003_001E,      # Percent no vehicle
    QUNOCCHU = B25002_003E / B25024_001E                      # Percent unoccupied units
  )

# Verify the data and replace problematic data
acs.normal <- acs.normal %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))

The final step, per the "recipe", is to standardize the variables by subtracting the mean and dividing by the standard deviation. This places all variables on a common scale, with a mean of 0 and a standard deviation of 1, so each contributes equally to the final index. This step prevents variables with larger ranges from dominating the analysis and is crucial for principal component analysis (PCA), which relies on comparably scaled data. Standardization also makes the index more interpretable and allows for meaningful comparisons across regions, helping SoVI capture true variations in vulnerability rather than differences in variable scales.

# Standardize the data - calculate z-scores where mean = 0 and sd = 1
# List of columns to standardize (excluding GEOID, NAME)
columns_to_select <- setdiff(colnames(acs.normal), c("GEOID", "NAME"))

# Standardize the selected variables
acs.standardized <- acs.normal %>%
  mutate(across(all_of(columns_to_select), ~ scale(.)[, 1]))

3.1.2 SoVI Calculation - Principal Component Analysis

With the standardized data, the next step is to perform principal component analysis (PCA) to reduce the dimensionality of the data and create the composite index that represents social vulnerability, or SoVI. Since the SoVI for each county is based on all the counties used to calculate the index, the selection of input counties is important and can change a county's SoVI, as shown by Spielman et al. (2020). For this reason, SoVI was calculated twice: once using the counties for the entire U.S. and once using only the counties from Texas, Louisiana, and Mississippi.

After the standardized data is separated into U.S. and Texas-Louisiana-Mississippi dataframes, PCA is performed on each dataset using the FactoMineR package for R. The PCA generates a set of principal components that are linear combinations of the standardized variables. The first principal component captures the most variance in the data, with each subsequent component capturing less. This can be seen in the scree plots, with Figure 3.1 depicting the first 10 components of the PCA calculated for the entire U.S., and Figure 3.2 depicting the same for TX, LA, and MS.

# Load PCA package
library(FactoMineR)

# Drop the NAME column
df.US <- acs.standardized %>%
  select(-NAME)

# Drop the NAME column and filter only for TX, LA, and MS
df.TXLAMS <- acs.standardized %>%
  filter(substr(GEOID, 1, 2) %in% c("48", "22", "28")) %>% # 48 = Texas, 22 = Louisiana, 28 = Mississippi
  select(-NAME)

# Perform PCA using FactoMineR (minus GEOID); data is already standardized,
# so scale.unit = FALSE
pca.US <- PCA(df.US[, -1],
              ncp = 15,
              scale.unit = FALSE,
              graph = FALSE)

pca.TXLAMS <- PCA(df.TXLAMS[, -1],
                  ncp = 15,
                  scale.unit = FALSE,
                  graph = FALSE)


Figure 3.1. Scree plot of PCA for the entire U.S. The figure depicts the first ten components, or loadings, for the PCA completed on the entire U.S. dataset. An examination of the percentage of variance explained shows a leveling off around the 8th and 9th components.

Figure 3.2. Scree plot of PCA for TX-LA-MS. The figure depicts the first ten components, or loadings, for the PCA completed on the Texas, Louisiana, and Mississippi dataset. An examination of the percentage of variance explained shows a leveling off around the 6th and 7th components.

To determine how many components to retain, PCA is performed twice on each dataframe. The first run is used to determine how many components to retain, and the second retains only those components. This can be accomplished by visually examining the scree plots (Figure 3.1 and Figure 3.2), but a more precise method is the Kaiser criterion, retaining components whose eigenvalues (the sums of the squared component loadings) are greater than 1.

# Calculate eigenvalues (sum of squared loadings per component)
eigenvalues.US <- apply(pca.US$var$coord^2, 2, sum)
eigenvalues.TXLAMS <- apply(pca.TXLAMS$var$coord^2, 2, sum)

# Determine number of factors to retain based on Kaiser's criterion
num_factors_kaiser.US <- sum(eigenvalues.US > 1)
num_factors_kaiser.TXLAMS <- sum(eigenvalues.TXLAMS > 1)

When retaining components using the Kaiser criterion, the goal is to balance simplicity and accuracy: too few components may oversimplify the data and lose important information, while too many may overcomplicate the analysis without much gain. Based on the Kaiser criterion, seven components were retained for the U.S. dataset and five for the Texas, Louisiana, and Mississippi dataset.

The next step involves applying a varimax rotation to the loadings. The varimax rotation simplifies the interpretation of the components by redistributing the variance among the squared loadings, ensuring that each variable aligns strongly with a single component.

# Re-run PCA, retaining only the components selected by the Kaiser criterion
pca.rotated.US <- PCA(df.US[, -1],
                      ncp = num_factors_kaiser.US,
                      scale.unit = FALSE,
                      graph = FALSE)

pca.rotated.TXLAMS <- PCA(df.TXLAMS[, -1],
                          ncp = num_factors_kaiser.TXLAMS,
                          scale.unit = FALSE,
                          graph = FALSE)

After rotation, the loadings are analyzed to determine cardinality, that is, whether each component increases or decreases social vulnerability. Per the "recipe", variables with loadings greater than 0.7 in absolute value are considered significant, guiding the assignment of cardinality to each component and reflecting its relationship to social vulnerability.

# Rotate the loadings using varimax rotation
pca.rotated.US$var$coord <- varimax(pca.rotated.US$var$coord)$loadings
pca.rotated.TXLAMS$var$coord <- varimax(pca.rotated.TXLAMS$var$coord)$loadings

For example, Table 3.1 illustrates the loadings for each component of the entire U.S. dataset. The "recipe" indicates 0.5 can be used as a cutoff in some instances; however, for this research, using 0.5 led to some variables loading on multiple components, so only loadings greater than +/-0.7 were considered.

The first component, Dimension 1, includes variables such as the percentage of the Asian population (QASIAN), the percentage of households earning over $200k (QRICH), and per capita income (PERCAP), among others. According to the literature, these variables are typically associated with reduced social vulnerability. Since the loadings are negative, the cardinality for Dimension 1 would be positive. Conversely, if the values in Dimension 1 were positive, the cardinality would be negative. A similar pattern can be observed in Dimension 2 of the U.S. dataset. While the literature suggests that the variables in Dimension 2 generally increase social vulnerability, most of the loadings are negative. As a result, the cardinality for Dimension 2 would be negative. The same process is applied to the Texas, Louisiana, and Mississippi dataset, as shown in Table 3.2.
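The cardinality assignments described above are ultimately used to flip the sign of each component's scores before summing them into a single index value per county (in this analysis, the scores come from the rotated PCA objects). A minimal, self-contained sketch, where the signs and scores are toy values rather than the actual assignments:

```r
# Toy sketch: apply assigned cardinalities (signs) to component scores
# and sum them into one index value per observation.
cardinality <- c(1, -1, 1)                  # placeholder signs, one per component

scores <- matrix(c( 0.5, -1.2, 0.3,         # toy component scores for two counties
                   -0.8,  0.4, 1.1),
                 nrow = 2, byrow = TRUE)

sovi <- as.vector(scores %*% cardinality)   # sign-adjusted sum of components
sovi
```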

Table 3.1. Loadings for the U.S. dataset. The table shows the loadings for variables in the U.S. dataset with absolute values greater than 0.7, which are considered significant. These values are then examined for each component to determine the cardinality of that component.

# View the rotated loadings with values greater than 0.7 to determine cardinality
print(pca.rotated.US$var$coord, cutoff = 0.7)
## 
## Loadings:
##          Dim.1  Dim.2  Dim.3  Dim.4  Dim.5  Dim.6  Dim.7 
## QASIAN   -0.702                                          
## QBLACK                  0.743                            
## QHISP                                 0.907              
## QNATAM                                             -0.842
## QAGEDEP         -0.837                                   
## QFAM                   -0.838                            
## MEDAGE          -0.865                                   
## QSSBEN          -0.768                                   
## QPOVTY                                                   
## QRICH    -0.863                                          
## PERCAP   -0.758                                          
## QESL                                  0.906              
## QFEMALE                       -0.858                     
## QFHH                    0.835                            
## QNRRES                         0.914                     
## HOSTPC                                                   
## QNOHLTH                                      0.706       
## QED12LES                                                 
## QCVLUN                                                   
## PPUNIT           0.885                                   
## QRENTER                                                  
## MDHSEVAL -0.909                                          
## MDGRENT  -0.886                                          
## QMOHO                                                    
## QEXTRCT                                                  
## QSERV                                                    
## QFEMLBR                                                  
## QNOAUTO                                                  
## QUNOCCHU        -0.757                                   
## 
##                Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## SS loadings    4.425 4.035 4.794 2.046 2.729 2.371 1.473
## Proportion Var 0.153 0.139 0.165 0.071 0.094 0.082 0.051
## Cumulative Var 0.153 0.292 0.457 0.528 0.622 0.703 0.754

Table 3.2. Loadings for the TX, LA, and MS dataset. The table shows the loadings for variables in the TX, LA, and MS dataset with absolute values greater than 0.7, which are considered significant. These values are then examined for each component to determine cardinality of the component.

# View the rotated loadings with values greater than 0.7 to determine cardinality
print(pca.rotated.TXLAMS$var$coord, cutoff = 0.7)
## 
## Loadings:
##          Dim.1  Dim.2  Dim.3  Dim.4  Dim.5 
## QASIAN                                     
## QBLACK    1.062                            
## QHISP                   1.127              
## QNATAM                                     
## QAGEDEP          0.928                     
## QFAM     -0.925                            
## MEDAGE           0.812                     
## QSSBEN           0.883                     
## QPOVTY    0.822                            
## QRICH                                      
## PERCAP                                     
## QESL                                       
## QFEMALE                        1.201       
## QFHH      0.982                            
## QNRRES                        -1.237       
## HOSTPC                                     
## QNOHLTH                 0.764              
## QED12LES                0.723              
## QCVLUN                                     
## PPUNIT          -0.874                     
## QRENTER                               0.724
## MDHSEVAL                                   
## MDGRENT                                    
## QMOHO                                      
## QEXTRCT                              -0.957
## QSERV                                      
## QFEMLBR                                    
## QNOAUTO                                    
## QUNOCCHU         0.723                     
## 
##                Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## SS loadings    6.181 4.273 3.841 3.736 2.430
## Proportion Var 0.213 0.147 0.132 0.129 0.084
## Cumulative Var 0.213 0.360 0.493 0.622 0.706

The final step in calculating SoVI is to create the index for each county. Per the “recipe” this is accomplished by adding each principal component (or dimension in the case of using FactoMineR) together with appropriate cardinality applied. The resulting equations can be seen for the entire U.S. and the TX, LA, and MS datasets in equations 1 and 2, respectively.


\[\begin{equation} \text{SoVI}_{\text{U.S.}} = \text{PC1} - \text{PC2} + \text{PC3} - \text{PC4} + \text{PC5} + \text{PC6} - \text{PC7} \tag{1} \end{equation}\]
\[\begin{equation} \text{SoVI}_{\text{TX-LA-MS}} = \text{PC1} + \text{PC2} + \text{PC3} + \text{PC4} + \text{PC5} \tag{2} \end{equation}\]

# Calculate the component scores
component_scores.US <- as.data.frame(pca.rotated.US$ind$coord)
component_scores.TXLAMS <- as.data.frame(pca.rotated.TXLAMS$ind$coord)


# Combine component scores with GEOID and calculate SoVI
#### For all US
county_sovi.US <- cbind(GEOID = df.US$GEOID, component_scores.US, 
                        SoVI = component_scores.US$Dim.1 
                              - component_scores.US$Dim.2
                              + component_scores.US$Dim.3
                              - component_scores.US$Dim.4
                              + component_scores.US$Dim.5
                              + component_scores.US$Dim.6
                              - component_scores.US$Dim.7)

#### For TX, LA, and MS
county_sovi.TXLAMS <- cbind(GEOID = df.TXLAMS$GEOID, component_scores.TXLAMS, 
                            SoVI = component_scores.TXLAMS$Dim.1 
                                  + component_scores.TXLAMS$Dim.2
                                  + component_scores.TXLAMS$Dim.3
                                  + component_scores.TXLAMS$Dim.4
                                  + component_scores.TXLAMS$Dim.5)

3.1.3 Visualizing SoVI - Plotting the Index

With SoVI calculated for each county, using both the U.S. county data and the Texas, Louisiana, and Mississippi data, the final step is to visualize the results by plotting SoVI values for the region of interest along the Gulf Coast. The maps show the distribution of social vulnerability across the country and the region, providing a visual representation of the index and how it varies geographically. To visualize, the “recipe” indicates using either a multiple of standard deviation (1.5 or 2 standard deviations) or quantiles. Considering how each of these methods breaks the data - deviations from the mean and equal-sized categories - it is important to examine how the data is distributed. For normally distributed data, standard deviations are appropriate, while quantiles are better for skewed data.

# Determine distribution of the data of SoVI
skewness_value.US <- skewness(county_sovi.US$SoVI, na.rm = TRUE)
skewness_value.TXLAMS <- skewness(county_sovi.TXLAMS$SoVI, na.rm = TRUE)

Using the R package e1071, the skewness can be calculated to determine the distribution. A skewness value between -0.5 and 0.5 is considered approximately normal; values between 0.5 and 1 (or -0.5 and -1) are considered slightly skewed, and values greater than 1 (or less than -1) are considered skewed. The skewness of SoVI for the U.S. is 0.021, indicating a normal distribution, while that for TX, LA, and MS is 0.521, indicating a slightly skewed distribution. Since the intent is to compare the datasets, and the TX, LA, and MS dataset is slightly skewed, quantiles are used to categorize the SoVI scores, which aligns with the methods used in the evolution of SoVI.
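For illustration, the quantile categorization can be sketched with base R’s `quantile()` and `cut()`; the short vector below is a stand-in for the actual county SoVI scores, not real data.

```r
# Hypothetical SoVI scores standing in for county_sovi.US$SoVI
sovi <- c(-2.1, -0.8, -0.3, 0.1, 0.4, 0.9, 1.6, 2.3)

# Quartile breaks; include.lowest = TRUE keeps the minimum in the first class
breaks <- quantile(sovi, probs = seq(0, 1, 0.25), na.rm = TRUE)
sovi_class <- cut(sovi, breaks = breaks, include.lowest = TRUE,
                  labels = c("Low", "Medium-Low", "Medium-High", "High"))

table(sovi_class)  # two counties fall into each quartile
```

The resulting factor can then be mapped directly to a color scale when plotting the choropleth maps.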

Figure 3.3 shows the distribution of social vulnerability across the U.S. and the Texas, Louisiana, and Mississippi region. The maps provide a visual representation of the index and how it varies geographically. The map on the left shows SoVI calculated using the entire U.S. dataset, while the map on the right shows SoVI calculated using data only from Texas, Louisiana, and Mississippi. The maps reveal that SoVI remains close in many of the counties regardless of the dataset used for calculation; however, there are some counties with significant variations. For example, the counties along the Mississippi River in Mississippi have a low vulnerability using the U.S. dataset but are considered highly vulnerable using the Texas, Louisiana, and Mississippi dataset. While this is a glaringly obvious example, there are other examples of this throughout the maps. This discrepancy highlights the importance of the dataset used in calculating SoVI and how it can impact the results.


Figure 3.3. SoVI for the U.S. and TX, LA, and MS. The maps show the distribution of social vulnerability across the country and the region, providing a visual representation of the index and how it varies geographically. On the left map, SoVI was calculated using the entire U.S. dataset, while that on the right was calculated using data only from TX, LA, and MS. SoVI remains close in many of the counties regardless of the dataset used for calculation; however, there are some counties with significant variations.

3.2 SHELDUS Data and SoVI - Correlation

With SoVI calculated for the Texas, Louisiana, and Mississippi region using two different datasets, the next step is to determine how well SoVI correlates with disaster data. For this analysis, the SHELDUS dataset was used to determine the correlation between SoVI and disaster data through ordinary least squares regression. SHELDUS is a comprehensive database of natural disasters in the U.S. and includes information on the type of disaster, the date it occurred, and the location. The dataset also includes information on property damage, which can be adjusted for inflation and expressed per capita for each county, as well as injuries and deaths. For this project, the initial focus is on property damage per capita in 2023 dollars.

# Read in the SHELDUS data, keeping only the columns needed for the analysis
sheldus <- read_csv(here("data/SHELDUS_Harvey", "SHELDUS_harvey.csv")) %>%
  select('County FIPS', 'PropertyDmgPerCapita(ADJ 2023)', 'InjuriesPerCapita',
         'FatalitiesPerCapita') %>%
  # Rename columns to match the SoVI dataframes
  rename(GEOID = 'County FIPS', Prop_DMG = 'PropertyDmgPerCapita(ADJ 2023)',
         Injuries = 'InjuriesPerCapita', Fatalities = 'FatalitiesPerCapita') %>%
  # Strip the quotes surrounding the FIPS codes
  mutate(GEOID = gsub("^'|'$", "", GEOID))

# Merge SHELDUS data with SoVI data, retaining all counties, including those without matching SHELDUS records
us.sovi.sheldus <- left_join(counties.US, sheldus, by = "GEOID") %>%
  filter(substr(GEOID, 1, 2) %in% c("48", "22", "28"))
txlams.sovi.sheldus <- left_join(counties.TXLAMS, sheldus, by = "GEOID") %>%
  filter(substr(GEOID, 1, 2) %in% c("48", "22", "28"))

The SHELDUS data was purchased from the Spatial Hazard Events and Losses Database for the United States (SHELDUS) website, which is managed by Arizona State University. The data was filtered to include only information related to Hurricane Harvey, with a date range spanning a few days prior to the event, which occurred from August 25–29, 2017, and several months after to ensure all information was captured. Property damage values were adjusted to 2023 dollars automatically. The SHELDUS data was then joined with county-level dataframes containing SoVI scores and further refined to include only counties in Texas, Louisiana, and Mississippi.

Figure 3.4 illustrates property damage per capita from Hurricane Harvey (adjusted to 2023 dollars). The majority of the highest per capita damage occurred along the Texas Gulf Coast, particularly around Houston, TX, and extended into western Louisiana. However, reported damages are observed as far north as DeSoto County in northern Mississippi and as far west as Ector County in western Texas. For visualization, the data was categorized into four quantiles, plus a group for “NA” or “No Data.” The map effectively highlights the spatial distribution of property damage across these five groups.


Figure 3.4. Property Damage from Hurricane Harvey (2017). The map shows the distribution of property damage per capita in 2023 dollars from Hurricane Harvey in the Texas, Louisiana, and Mississippi region. The property damage is categorized into five groups based on quantiles. The majority of the damage per capita occurs in the vicinity of Houston, TX.

To explore the relationship between social vulnerability and property damage, scatter plots were created for both the U.S. dataset (Figure 3.5) and the Texas, Louisiana, and Mississippi dataset (Figure 3.6) using ordinary least squares regression. These plots illustrate the relationship between property damage per capita from Hurricane Harvey and the SoVI scores. A slight positive trend is observed, suggesting that counties with higher social vulnerability tend to experience greater property damage. This pattern appears consistent across both datasets, implying that social vulnerability could be an important factor influencing the extent of damage communities face during natural disasters.

ggplot(us.sovi.sheldus, aes(x = Prop_DMG, y = SoVI)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Property Damage vs. SoVI (US)",
       x = "Property Damage Per Capita",
       y = "SoVI")


Figure 3.5. Property Damage vs. SoVI (U.S. Dataset). The figure shows the relationship between property damage per capita in 2023 dollars from Hurricane Harvey and the Social Vulnerability Index (SoVI) calculated using the entire U.S. dataset. The plot shows a positive relationship between property damage and social vulnerability.

ggplot(txlams.sovi.sheldus, aes(x = Prop_DMG, y = SoVI)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Property Damage vs. SoVI (TX-LA-MS)",
       x = "Property Damage Per Capita",
       y = "SoVI")

Figure 3.6. Property Damage vs. SoVI (TX-LA-MS Dataset). The figure shows the relationship between property damage per capita in 2023 dollars from Hurricane Harvey and the Social Vulnerability Index (SoVI) calculated using the TX, LA, and MS dataset. The plot shows a positive relationship between property damage and social vulnerability.

4 Results

To formally determine the relationship between social vulnerability and property damage, ordinary least squares (OLS) regression was used to examine the correlation between the two variables in both datasets. The regression analysis provides a p-value to determine the statistical significance of the relationship; a low p-value (typically < 0.05) indicates that the relationship is statistically significant. The direction and strength of the relationship are indicated by the correlation coefficient derived from the regression model, where a value of 1 indicates a perfect positive relationship, -1 indicates a perfect negative relationship, and 0 indicates no relationship.
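As a sketch of how these quantities come out of R, the snippet below fits an `lm` on made-up data (not the SHELDUS data) and extracts the slope, p-value, and R-squared from the model summary:

```r
# Toy data standing in for the property damage and SoVI columns
set.seed(42)
toy <- data.frame(SoVI = rnorm(50))
toy$Prop_DMG <- 5000 + 800 * toy$SoVI + rnorm(50, sd = 4000)

model <- lm(Prop_DMG ~ SoVI, data = toy)
s <- summary(model)

slope   <- coef(s)["SoVI", "Estimate"]   # sign gives the direction of the relationship
p_value <- coef(s)["SoVI", "Pr(>|t|)"]   # significance of the slope
r_sq    <- s$r.squared                   # share of variance explained

c(slope = slope, p_value = p_value, r_squared = r_sq)
```

The same three quantities are reported in the `summary()` output shown in Tables 4.1 and 4.2.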

4.1 U.S. County Data

An OLS regression was performed to examine the relationship between property damage per capita (response variable) and SoVI (predictor variable) for counties impacted by Hurricane Harvey in Texas, Louisiana, and Mississippi, using SoVI calculated from the entire U.S. dataset. The analysis suggests a marginal positive relationship between SoVI and property damage per capita. While counties with higher social vulnerability appear to have slightly higher property damage, the relationship is not statistically significant (p = 0.0562), as the p-value is slightly above the nominal threshold of 0.05, and the model explains only a small portion (5.5%) of the variability in property damage, as seen in Table 4.1. This indicates that while social vulnerability may play a role, additional factors likely contribute to property damage outcomes and should be explored in further analysis.

# OLS regression of US County Dataset and SHELDUS Prop DMG
us.model <- lm(Prop_DMG ~ SoVI, data = us.sovi.sheldus)
summary(us.model)
## 
## Call:
## lm(formula = Prop_DMG ~ SoVI, data = us.sovi.sheldus)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14090  -8765  -5185   3782  78539 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8828.1     2101.6   4.201 8.27e-05 ***
## SoVI          1039.7      534.7   1.944   0.0562 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16390 on 65 degrees of freedom
##   (333 observations deleted due to missingness)
## Multiple R-squared:  0.05496,    Adjusted R-squared:  0.04042 
## F-statistic:  3.78 on 1 and 65 DF,  p-value: 0.05619


Table 4.1. OLS Regression of U.S. County Data. The table shows the results of the ordinary least squares (OLS) regression analysis examining the relationship between property damage per capita from Hurricane Harvey and SoVI calculated using the entire U.S. dataset. The results (p = 0.0562) indicate a positive relationship between property damage and social vulnerability that is not statistically significant at the 0.05 level.

4.2 Texas-Louisiana-Mississippi Data

Similar to the U.S. dataset, an OLS regression was performed to examine the relationship between property damage per capita (response variable) and SoVI (predictor variable) for counties impacted by Hurricane Harvey in Texas, Louisiana, and Mississippi, this time using the Texas, Louisiana, and Mississippi dataset. The analysis again shows a positive, although extremely weak, relationship, indicating counties with higher social vulnerability tend to have higher property damage. However, the relationship is not statistically significant (p = 0.0797), and the model explains only a small portion (4.6%) of the variability in property damage, as seen in Table 4.2. Again, this suggests that while social vulnerability may influence property damage outcomes, other factors likely play a role and should be explored in further analysis.

# OLS regression of TX, LA, MS County Dataset and SHELDUS Prop DMG
txlams.model <- lm(Prop_DMG ~ SoVI, data = txlams.sovi.sheldus)
summary(txlams.model)
## 
## Call:
## lm(formula = Prop_DMG ~ SoVI, data = txlams.sovi.sheldus)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -15741  -7672  -5517    741  79300 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9293.6     2227.6   4.172 9.14e-05 ***
## SoVI           973.2      546.7   1.780   0.0797 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16470 on 65 degrees of freedom
##   (333 observations deleted due to missingness)
## Multiple R-squared:  0.04649,    Adjusted R-squared:  0.03182 
## F-statistic: 3.169 on 1 and 65 DF,  p-value: 0.0797


Table 4.2. OLS Regression of TX-LA-MS County Data. The table shows the results of the ordinary least squares (OLS) regression analysis examining the relationship between property damage per capita from Hurricane Harvey and SoVI calculated using the TX-LA-MS dataset. The results (p = 0.0797) indicate that there is no statistically significant relationship between property damage and social vulnerability.

5 Discussion

This research aimed to evaluate the relationship between social vulnerability, as measured by SoVI, and disaster outcomes, specifically property damage per capita from Hurricane Harvey. The OLS regression analysis revealed a positive relationship between SoVI and property damage per capita, but the relationship was not statistically significant in either dataset (SoVI calculated with U.S.-wide and TX, LA, and MS county-level ACS data). These findings suggest that social vulnerability alone is not a strong predictor of property damage outcomes. The weak relationship may indicate that other factors, such as local infrastructure or disaster preparedness, play a more significant role. Alternatively, discrepancies in the recreation of SoVI could have contributed to the results.

Prior to conducting the OLS regression analysis, SoVI values were calculated at the county level using the SoVI “recipe” provided by the University of South Carolina College of Arts and Sciences. This was an attempt to recreate the methods used by Cutter and others to quantify social vulnerability, and it took up the bulk of the research; an initial attempt to recreate the 2014 SoVI values provided in the “recipe” was unsuccessful. Recreating SoVI proved to be a challenge, as neither the “recipe” nor other literature was as clear as anticipated. The “recipe” provided a general outline of the steps to follow but did not provide specific details on how to calculate the index, which led to some ambiguity and subjectivity in the process.

5.1 Variable Selection Ambiguity

The SoVI “evolution” outlines 29 variables for calculating SoVI. Selecting these variables is a critical and foundational step, but it involves several challenges. While some variables, such as the percentage of Black or Hispanic populations, are directly tied to ACS data, others require additional computation or interpretation.

For example, the “recipe” suggests including the percentage of children living in two-parent families. The closest ACS proxy is “B09002_002: Children - Married Couples,” which excludes non-married couples and may underestimate this population. Similarly, the suggested variable “Nursing Home Residents per capita” does not have a direct equivalent in ACS data. The closest proxy, “B09019_038: Population in Group Quarters,” lacks age specificity and may include individuals in non-nursing home facilities, further complicating its use.

Additionally, due to data availability, 2024 hospital data was used instead of 2017 data. These substitutions introduce potential discrepancies in the recreated SoVI values, highlighting the ambiguity and subjectivity inherent in variable selection.

5.2 Cardinality Determination

Following variable selection, variables were normalized and standardized as prescribed by the SoVI “recipe.” However, determining the cardinality of each principal component after PCA and varimax rotation presented significant challenges. The “recipe” recommends using a loading threshold of 0.7 but suggests that a threshold of 0.5 is acceptable in some cases. With no clear guidance on when to apply each threshold, the decision becomes subjective. In this study, a 0.7 threshold was used, but the choice may have influenced the results, as applying a 0.5 threshold could yield different cardinalities.
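The sensitivity to this choice can be checked directly by counting, at each threshold, how many variables load on each component; the loadings matrix below is made up for illustration and is not the actual PCA output:

```r
# Hypothetical rotated loadings (3 variables x 2 components)
loadings_mat <- matrix(c(0.82, 0.55, 0.10,
                         0.05, 0.62, 0.91),
                       ncol = 2,
                       dimnames = list(c("QBLACK", "QPOVTY", "QAGEDEP"),
                                       c("Dim.1", "Dim.2")))

# Count "significant" variables per component at a given cutoff
sig_at <- function(L, cutoff) colSums(abs(L) > cutoff)

sig_at(loadings_mat, 0.7)  # one variable per component
sig_at(loadings_mat, 0.5)  # two per component; QPOVTY now loads on both
```

In this toy case, dropping the threshold from 0.7 to 0.5 causes QPOVTY to load on both components, which is exactly the multi-component problem that motivated the 0.7 cutoff in this study.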

Various research literature provides guidance on which variables increase vulnerability and which decrease it, helping to determine overall cardinality. Still, the process can be extremely subjective. For example, in Table 3.1, Dimension 4 includes a negative loading for “Percent Female” and a positive loading for “Nursing Home Residents per capita.” Literature suggests that “Percent Female” increases vulnerability, supporting a negative cardinality. However, the effect of nursing home residents is less clear: while their presence could indicate increased vulnerability due to age and need for assistance, it could also represent reduced vulnerability due to available care. Determining which variable has greater influence, or whether the overall cardinality should change, requires subjective interpretation.

This ambiguity extends to other potential scenarios, such as correlated variables. For instance, if “Percent Rich” and “Percent Poverty” appear in the same component with positive loadings, one might question which variable dominates and whether the cardinality should reflect a positive or negative influence. The “recipe” provides no detailed guidance on resolving such conflicts, leaving these decisions to researcher subjectivity and introducing potential inconsistencies.

5.3 SoVI Calculation - U.S. versus Local County Data

An important consideration when calculating SoVI is deciding the appropriate spatial extent: should the analysis use data from the entire U.S. or focus on local data? Previous research, such as that by Spielman et al. (2020), demonstrated that the spatial extent significantly impacts the results, a finding corroborated in this study (see Figure 3.3). This variation arises because PCA compares all data within the chosen extent, potentially amplifying national-level disparities while overlooking regional specifics. For example, the cost of living in DeSoto County, MS, is likely much lower than in Los Angeles County, CA. Since cost of living and other factors often vary by region, comparing counties across the entire U.S. may not appropriately represent local vulnerabilities. Expanding the spatial extent to include neighboring states rather than restricting the analysis to a single state might offer a more balanced comparison. However, the optimal extent remains unclear and requires further exploration.

5.4 SHELDUS Data

To determine the relationship between social vulnerability and disaster outcomes, property damage data from the SHELDUS database was used. The SHELDUS database provides comprehensive county-level information on natural disasters, including property damage, injuries, and fatalities. However, it has limitations, such as potential under-reporting or inconsistencies in data collection. For example, property damage values may not account for uninsured losses and are often based on estimates rather than verified costs. Also, disaster-related injuries and fatalities may be under-counted, resulting in incomplete datasets. Additionally, SHELDUS evenly distributes property damage estimates across all counties affected by a disaster, which can introduce spatial autocorrelation and mask the true distribution of impacts (Yoon, 2012). Furthermore, as Tellman et al. (2020) note, these damage estimates often rely on “guesstimates,” with variability in how counties define and report damage, leading to potential inaccuracies of up to 40%. Higher damage estimates may also not indicate higher vulnerability. As Rufat et al. (2019) found in their study of Hurricane Sandy, higher property values were associated with higher damage but decreased social vulnerability. It is possible this same relationship is present in the Houston area. Due to the lack of higher spatial granularity in the SHELDUS data (i.e., not available at the block group or census tract level), it is difficult to determine if this is the case.
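The even-distribution issue can be made concrete with a toy allocation (the event total and FIPS codes below are hypothetical): a single damage estimate split across every affected county produces identical per-county values, regardless of where the losses actually occurred.

```r
# Hypothetical event total and affected-county FIPS codes
event_total <- 10e6
affected    <- c("48201", "48167", "48039", "22023", "22019")

# Even split across affected counties, mirroring the allocation described above
allocated <- setNames(rep(event_total / length(affected), length(affected)),
                      affected)
allocated  # every county is assigned the same $2,000,000
```

Any county-level regression against such values inherits this artificial uniformity, which is one mechanism behind the spatial autocorrelation noted by Yoon (2012).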

6 Future Work

This research highlights several areas for improvement in the calculation and validation of SoVI. If using SoVI, future studies should prioritize refining variable selection to address ambiguities, create a consistent and repeatable process, and incorporate more granular spatial data, such as census tracts, to better capture localized vulnerability patterns. Another option would be to utilize the Centers for Disease Control and Prevention (CDC) Social Vulnerability Index (SVI) to test validation methods. Additionally, exploring hazard-specific variables and alternative statistical methods, such as geographically weighted PCA, could enhance the index’s relevance and accuracy. Finally, applying SoVI to diverse disaster events and integrating more detailed empirical data, such as FEMA disaster assistance records, would provide a more comprehensive assessment of its effectiveness.

References

Bronfman, N. C., Repetto, P. B., Guerrero, N., Castañeda, J. V., & Cisternas, P. C. (2021). Temporal evolution in social vulnerability to natural hazards in Chile [Journal Article]. Natural Hazards, 107(2), 1757–1784. https://doi.org/10.1007/s11069-021-04657-1
Cutter, S. L., & Finch, C. (2008). Temporal and spatial changes in social vulnerability to natural hazards [Journal Article]. Proceedings of the National Academy of Sciences, 105(7), 2301–2306.
Fekete, A. (2009). Validation of a social vulnerability index in context to river-floods in Germany [Journal Article]. Natural Hazards and Earth System Sciences, 9, 393–403.
Flanagan, B., Hallisey, E., Sharpe, J. D., Mertzlufft, C. E., & Grossman, M. (2021). On the validity of validation: A commentary on Rufat, Tate, Emrich, and Antolini’s “How valid are social vulnerability models?” [Journal Article]. Annals of the American Association of Geographers, 111(4), em-i-em-vi. https://doi.org/10.1080/24694452.2020.1857220
Griego, A. L., Flores, A. B., Collins, T. W., & Grineski, S. E. (2020). Social vulnerability, disaster assistance, and recovery: A population-based study of Hurricane Harvey in Greater Houston, Texas [Journal Article]. International Journal of Disaster Risk Reduction, 51. https://doi.org/10.1016/j.ijdrr.2020.101766
Hinojos, S., McPhillips, L., Stempel, P., & Grady, C. (2023). Social and environmental vulnerability to flooding: Investigating cross-scale hypotheses [Journal Article]. Applied Geography, 157. https://doi.org/10.1016/j.apgeog.2023.103017
Park, G., & Xu, Z. (2020). Spatial and temporal dynamics of social vulnerability in the United States from 1970 to 2010 [Journal Article]. International Journal of Applied Geospatial Research, 11(1), 36–54. https://doi.org/10.4018/ijagr.2020010103
Park, G., & Xu, Z. (2021). The constituent components and local indicator variables of social vulnerability index [Journal Article]. Natural Hazards, 110(1), 95–120. https://doi.org/10.1007/s11069-021-04938-9
Rufat, S., Tate, E., Emrich, C. T., & Antolini, F. (2019). How valid are social vulnerability models? [Journal Article]. Annals of the American Association of Geographers, 109(4), 1131–1153. https://doi.org/10.1080/24694452.2018.1535887
Shlens, J. (2014). A tutorial on principal component analysis [Journal Article]. arXiv preprint arXiv:1404.1100. https://arxiv.org/abs/1404.1100
Spielman, S. E., Tuccillo, J., Folch, D. C., Schweikert, A., Davies, R., Wood, N., & Tate, E. (2020). Evaluating social vulnerability indicators: Criteria and their application to the social vulnerability index [Journal Article]. Natural Hazards, 100(1), 417–436. https://doi.org/10.1007/s11069-019-03820-z
Tellman, B., Schank, C., Schwarz, B., Howe, P. D., & Sherbinin, A. de. (2020). Using disaster outcomes to validate components of social vulnerability to floods: Flood deaths and property damage across the USA [Journal Article]. Sustainability, 12(15). https://doi.org/10.3390/su12156006
Yoon, D. K. (2012). Assessment of social vulnerability to natural disasters: A comparative study [Journal Article]. Natural Hazards, 63(2), 823–843. https://doi.org/10.1007/s11069-012-0189-2
Zahran, S., Brody, S. D., Peacock, W. G., Vedlitz, A., & Grover, H. (2008). Social vulnerability and the natural and built environment: A model of flood casualties in Texas [Journal Article]. Disasters, 32(4), 537–560. https://doi.org/10.1111/j.1467-7717.2008.01054.x