library(foreign)
library(Hmisc)
library(sjmisc)
library(tidyverse)
library(ggplot2)
library(DT)

Load in data

We’re loading in the SPSS-formatted data from GSS year 2018. Need foreign::spss.get to read the data

Two things to watch out for when importing data

Data types

Whenever we’re loading in data from external sources, especially data as complex as the GSS, we need to make sure that we’re importing what we think we’re importing. Let’s look at a couple variables and see what kind of errors we could accidentally make

The variable age is the respondent’s age. Simple enough. Let’s try to plot the age distribution using hist(). Hmmm…

Hmm.. what’s going on?

summary(gss$age)

##          18          19          20          21          22          23 
##          22          26          15          27          40          29 
##          24          25          26          27          28          29 
##          38          43          31          39          45          43 
##          30          31          32          33          34          35 
##          50          34          43          42          65          40 
##          36          37          38          39          40          41 
##          40          43          38          55          41          40 
##          42          43          44          45          46          47 
##          40          39          40          29          42          30 
##          48          49          50          51          52          53 
##          33          31          37          41          29          48 
##          54          55          56          57          58          59 
##          35          48          50          39          37          46 
##          60          61          62          63          64          65 
##          46          39          24          37          33          39 
##          66          67          68          69          70          71 
##          36          27          39          41          45          33 
##          72          73          74          75          76          77 
##          22          20          29          23          19          28 
##          78          79          80          81          82          83 
##          11           9          12           9          12           7 
##          84          85          86          87          88 89 OR OLDER 
##          10          12          14           5           8          29 
##        NA's 
##           7

Uh huh… how do we fix it? What about this? Let’s just assume that those >89 are 90. They’re not, but it’s better than just leaving them out of the analysis, so….

gss$age_recoded = gss$age %>% 
  fct_recode("90"="89 OR OLDER") %>%
  as.numeric

hist(gss$age_recoded)

mean(gss$age_recoded,na.rm=TRUE)

## [1] 31.97138

Ok.. mean is 32.. seems reasonable? Is it correct though? Let’s look at a histogram. What do you notice?

hist(gss$age_recoded)

How do we fix this?

Now try it yourself by fixing the coding of realrinc (respondent’s annual income adjusted to 1986 dollars).

gss$age_recoded = gss$age %>% 
  fct_recode("90"="89 OR OLDER") %>%
  as.character %>%
  as.numeric

mean(gss$age_recoded,na.rm=TRUE)

## [1] 48.98377

What’s the mean and median income of participants younger than 25 in 2018 dollars? To get 2018 dollars (where these data are from), multiply realrinc (1986 dollars) by 2.29.

gss$age_recoded = gss$age %>% 
  fct_recode("90"="89 OR OLDER") %>% 
  as.character %>%
  as.numeric

gss <- gss %>% mutate(realrinc = as.numeric(as.character(realrinc)))
gss <- gss %>% mutate(realrinc_2016 = realrinc*2.29)

How many participants are there in this dataset who’re younger than 25? Break it down by the participant’s gender (sex). Do you notice anything about the mean vs. median for males vs. females? What is this telling you?

Solution

gss %>% filter(age_recoded<25) %>% 
  summarize(realrinc_2016_mean = mean(realrinc_2016,na.rm=TRUE),realrinc_2016_med = median(realrinc_2016,na.rm=TRUE),n=n())

##   realrinc_2016_mean realrinc_2016_med   n
## 1           16441.63           9356.94 197

Factor levels

The variable happy contains participant’s responses to this question: Taken all together, how would you say things are these days--would you say that you are very happy, pretty happy, or not too happy?

It’s coded as those three levels

levels(gss$happy)

## [1] "VERY HAPPY"    "PRETTY HAPPY"  "NOT TOO HAPPY"

Let’s look at respondents’ happiness as a function of self-reported health (health)

gss %>% drop_na(health) %>% group_by(health) %>% summarize(happy = mean(as.numeric(happy),na.rm=TRUE))

## # A tibble: 4 × 2
##   health    happy
##   <fct>     <dbl>
## 1 EXCELLENT  1.56
## 2 GOOD       1.85
## 3 FAIR       2.02
## 4 POOR       2.35

See the problem?

levels(gss$happy)

## [1] "VERY HAPPY"    "PRETTY HAPPY"  "NOT TOO HAPPY"

Let’s fix it:

gss$happy_rev <- fct_rev(gss$happy)

levels(gss$happy)

## [1] "VERY HAPPY"    "PRETTY HAPPY"  "NOT TOO HAPPY"

levels(gss$happy_rev)

## [1] "NOT TOO HAPPY" "PRETTY HAPPY"  "VERY HAPPY"

Do people know that the earth goes around the sun?

Ok, let’s switch gearts a bit. Let’s look at some relationships between degree (participant’s education level) and their knowledge of whether the earth goes around the sun or the sun goes around the earth (earthsun)

Contingency tables using base R

The first way we can do this is by using contingency tables found in the base package which by default will show us frequencies:

gss %>% xtabs(~degree+earthsun,data=.)

##                 earthsun
## degree           Earth around sun Sun around earth
##   LT HIGH SCHOOL               77               44
##   HIGH SCHOOL                 413              166
##   JUNIOR COLLEGE               57               26
##   BACHELOR                    192               31
##   GRADUATE                     97               14

or more usefully, proportions (notice how we’re piping in the xtabs output into prop.table)

gss %>% xtabs(~degree+earthsun,data=.) %>% prop.table(margin=1) %>% round(2)

##                 earthsun
## degree           Earth around sun Sun around earth
##   LT HIGH SCHOOL             0.64             0.36
##   HIGH SCHOOL                0.71             0.29
##   JUNIOR COLLEGE             0.69             0.31
##   BACHELOR                   0.86             0.14
##   GRADUATE                   0.87             0.13

People with less than a high school degree are at chance on this question… What does this mean?). Is that more or less troubling than that 7% of people with graduate degrees get this wrong? (not a rhetorical question!)

Contingency tables using `sjmisc`

Here’s what the same tabulation looks like using the sjmisc package which provides us some nice additional features:

gss %>% flat_table(degree,earthsun)

##                earthsun Earth around sun Sun around earth
## degree                                                   
## LT HIGH SCHOOL                        77               44
## HIGH SCHOOL                          413              166
## JUNIOR COLLEGE                        57               26
## BACHELOR                             192               31
## GRADUATE                              97               14

..and proportions:

gss %>% flat_table(degree,earthsun,margin=c("row"),digits=0) #the digits argument is the number of significant digits

##                earthsun Earth around sun Sun around earth
## degree                                                   
## LT HIGH SCHOOL                        64               36
## HIGH SCHOOL                           71               29
## JUNIOR COLLEGE                        69               31
## BACHELOR                              86               14
## GRADUATE                              87               13

sjmisc has additional grouping functions that allow us to split a variable into equal groups. Let’s do this for degree. Note that we need to first convert it to numeric.

gss %>% transform(degree=as.numeric(degree)) %>% 
  split_var(degree,n=3,val.labels = c("Lowest tertile","Middle tertle", "Highest tertile")) %>% 
  flat_table(degree_g,earthsun,margin=c("row"),digits=1)

##                 earthsun Earth around sun Sun around earth
## degree_g                                                  
## Lowest tertile                       63.6             36.4
## Middle tertle                        71.3             28.7
## Highest tertile                      83.0             17.0

Contingency tables using `dplyr`

Here’s what a basic contingency table looks like with dplyr

gss %>% group_by(degree,earthsun) %>% summarize(n=n())

## `summarise()` has grouped output by 'degree'. You can override using the
## `.groups` argument.

## # A tibble: 15 × 3
## # Groups:   degree [5]
##    degree         earthsun             n
##    <fct>          <fct>            <int>
##  1 LT HIGH SCHOOL Earth around sun    77
##  2 LT HIGH SCHOOL Sun around earth    44
##  3 LT HIGH SCHOOL <NA>               141
##  4 HIGH SCHOOL    Earth around sun   413
##  5 HIGH SCHOOL    Sun around earth   166
##  6 HIGH SCHOOL    <NA>               599
##  7 JUNIOR COLLEGE Earth around sun    57
##  8 JUNIOR COLLEGE Sun around earth    26
##  9 JUNIOR COLLEGE <NA>               113
## 10 BACHELOR       Earth around sun   192
## 11 BACHELOR       Sun around earth    31
## 12 BACHELOR       <NA>               242
## 13 GRADUATE       Earth around sun    97
## 14 GRADUATE       Sun around earth    14
## 15 GRADUATE       <NA>               136

Let’s get rid of those NAs:

gss %>% group_by(degree,earthsun) %>% drop_na(degree,earthsun) %>% summarize(n=n())

## `summarise()` has grouped output by 'degree'. You can override using the
## `.groups` argument.

## # A tibble: 10 × 3
## # Groups:   degree [5]
##    degree         earthsun             n
##    <fct>          <fct>            <int>
##  1 LT HIGH SCHOOL Earth around sun    77
##  2 LT HIGH SCHOOL Sun around earth    44
##  3 HIGH SCHOOL    Earth around sun   413
##  4 HIGH SCHOOL    Sun around earth   166
##  5 JUNIOR COLLEGE Earth around sun    57
##  6 JUNIOR COLLEGE Sun around earth    26
##  7 BACHELOR       Earth around sun   192
##  8 BACHELOR       Sun around earth    31
##  9 GRADUATE       Earth around sun    97
## 10 GRADUATE       Sun around earth    14

Let’s convert to proportions:

gss %>% group_by(degree,earthsun) %>% drop_na(degree,earthsun) %>% summarize(n=n()) %>% 
    mutate(proportion = n / sum(n))

## `summarise()` has grouped output by 'degree'. You can override using the
## `.groups` argument.

## # A tibble: 10 × 4
## # Groups:   degree [5]
##    degree         earthsun             n proportion
##    <fct>          <fct>            <int>      <dbl>
##  1 LT HIGH SCHOOL Earth around sun    77      0.636
##  2 LT HIGH SCHOOL Sun around earth    44      0.364
##  3 HIGH SCHOOL    Earth around sun   413      0.713
##  4 HIGH SCHOOL    Sun around earth   166      0.287
##  5 JUNIOR COLLEGE Earth around sun    57      0.687
##  6 JUNIOR COLLEGE Sun around earth    26      0.313
##  7 BACHELOR       Earth around sun   192      0.861
##  8 BACHELOR       Sun around earth    31      0.139
##  9 GRADUATE       Earth around sun    97      0.874
## 10 GRADUATE       Sun around earth    14      0.126

Notice that the data are in a long format. This is exactly what we want for graphing and statistical analyses, but if you want a more readable table, just pivot it like so:

gss %>% group_by(degree,earthsun) %>% drop_na(degree,earthsun) %>% 
    summarize(n=n()) %>% mutate(proportion = n / sum(n)) %>% 
  select(-n) %>% #get rid of the number of observations column since we're using proportions here 
    pivot_wider(names_from=earthsun,values_from=proportion)

## `summarise()` has grouped output by 'degree'. You can override using the
## `.groups` argument.

## # A tibble: 5 × 3
## # Groups:   degree [5]
##   degree         `Earth around sun` `Sun around earth`
##   <fct>                       <dbl>              <dbl>
## 1 LT HIGH SCHOOL              0.636              0.364
## 2 HIGH SCHOOL                 0.713              0.287
## 3 JUNIOR COLLEGE              0.687              0.313
## 4 BACHELOR                    0.861              0.139
## 5 GRADUATE                    0.874              0.126

Human evolution, politics, education

Let’s look at another relationship: endorsement of the statement “human beings developed from animals” (evolved) and see how it relates to political party affiliation (partyid)

gss %>% flat_table(polviews,evolved,margin=c("row"),digits=1)

##                      evolved True False
## polviews                               
## EXTREMELY LIBERAL            76.7  23.3
## LIBERAL                      83.1  16.9
## SLIGHTLY LIBERAL             66.0  34.0
## MODERATE                     54.8  45.2
## SLGHTLY CONSERVATIVE         52.0  48.0
## CONSERVATIVE                 35.3  64.7
## EXTRMLY CONSERVATIVE         36.8  63.2

Let’s see how the likelihood of endorsing human evolution is affected by education (degree):

gss %>% group_by(degree) %>% drop_na(degree,evolved) %>% summarize(evolved=round(mean(evolved=="True",na.rm=TRUE),2),n=n())

## # A tibble: 5 × 3
##   degree         evolved     n
##   <fct>            <dbl> <int>
## 1 LT HIGH SCHOOL    0.4     47
## 2 HIGH SCHOOL       0.53   258
## 3 JUNIOR COLLEGE    0.64    45
## 4 BACHELOR          0.64   117
## 5 GRADUATE          0.6     57

Now let’s see whether education has a differential effect for people of various political orientations. For simplicity let’s dichotomize education into respondents with a college degree or above, and those below a college degree:

gss %>% drop_na(polviews,degree,evolved) %>%
    mutate(college_or_more = (as.numeric(degree) > 3)) %>%
    group_by(polviews,college_or_more) %>% 
    summarize(humans_evolved=round(mean(evolved=="True",na.rm=TRUE),2),n=n()) %>% 
  datatable(options = list(pageLength = 20)) #check this out in a browser!

## `summarise()` has grouped output by 'polviews'. You can override using the
## `.groups` argument.

(notice the small Ns in some cells)

This is the kind of data that’s perfect for graphing to look at interactions. But we’re examining it in table form, we can pivot it to wide and make a difference column, e.g

gss %>% drop_na(polviews,degree,evolved) %>%
    mutate(college_or_more = (as.numeric(degree) > 3)) %>%
    group_by(polviews,college_or_more) %>% 
    summarize(humans_evolved=mean(evolved=="True",na.rm=TRUE)) %>%
    pivot_wider(names_from=college_or_more,values_from=humans_evolved) %>% rename("college_or_more" = `TRUE`, "lt_college" = `FALSE`) %>%
    mutate(evolution_diff = college_or_more - lt_college)

## `summarise()` has grouped output by 'polviews'. You can override using the
## `.groups` argument.

## # A tibble: 7 × 4
## # Groups:   polviews [7]
##   polviews             lt_college college_or_more evolution_diff
##   <fct>                     <dbl>           <dbl>          <dbl>
## 1 EXTREMELY LIBERAL         0.571           0.938         0.366 
## 2 LIBERAL                   0.714           0.935         0.221 
## 3 SLIGHTLY LIBERAL          0.684           0.6          -0.0842
## 4 MODERATE                  0.534           0.585         0.0511
## 5 SLGHTLY CONSERVATIVE      0.490           0.577         0.0871
## 6 CONSERVATIVE              0.404           0.25         -0.154 
## 7 EXTRMLY CONSERVATIVE      0.375           0.333        -0.0417

What does a regression model look like in the tidyverse? Simple! Let’s test for the main effects and interaction of polviews and degree on evolution using logistic regression (because our outcome variable is binary). We’re going to use scale to center the variables. We don’t need to explicitly use drop_na because glm will automatically drop missing values.

gss %>% glm(I(evolved=="True")~scale(as.numeric(polviews))*scale(as.numeric(degree)),data=.,family=binomial) %>% summary

## 
## Call:
## glm(formula = I(evolved == "True") ~ scale(as.numeric(polviews)) * 
##     scale(as.numeric(degree)), family = binomial, data = .)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2251  -1.1399   0.5613   1.1060   1.9199  
## 
## Coefficients:
##                                                       Estimate Std. Error
## (Intercept)                                            0.27419    0.09536
## scale(as.numeric(polviews))                           -0.58379    0.10311
## scale(as.numeric(degree))                              0.20776    0.09974
## scale(as.numeric(polviews)):scale(as.numeric(degree)) -0.31711    0.10670
##                                                       z value Pr(>|z|)    
## (Intercept)                                             2.875  0.00404 ** 
## scale(as.numeric(polviews))                            -5.662  1.5e-08 ***
## scale(as.numeric(degree))                               2.083  0.03725 *  
## scale(as.numeric(polviews)):scale(as.numeric(degree))  -2.972  0.00296 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 695.00  on 506  degrees of freedom
## Residual deviance: 640.54  on 503  degrees of freedom
##   (1841 observations deleted due to missingness)
## AIC: 648.54
## 
## Number of Fisher Scoring iterations: 4

Now it’s your turn!

You can look up info on the variables at https://gssdataexplorer.norc.org/variables/vfilter If you just need to quickly look up the description of the variable use head(gss$var_name) or summary(gss$var_name) to list its factors (which will be listed in order). Note that when the data were imported from SPSS, all variables were imported as factors, hence the need to convert to numeric as needed.

Children and happiness

Are people with children (childs!=0) happier (happy) than people without children? Does this relationship depend on the respondent’s gender? (sex) Their age? (age)?

Note Feel free to bin age into discrete groups (20-30, 31-40 etc.). The sjmisc package has nice functions for this. See the cheat sheet.

Income and happiness

What’s the relationship between respondent’s income (realrinc) and happiness (happy)? Use sjmisc::split_var to bin income into quintiles (i.e., 5 equal groups). How does this relationship vary for people who have a college degree or higher vs. people who don’t?

Tip Remember to convert income to numeric using as.numeric(as.character(realrinc)))

Tip

realrinc is income in 1986 dollars. To convert to 2018 dollars, multiply by 2.29. Also make sure you interpret the value of happy correctly.

Politics and cynicism

Have a look at the variable helpful. What’s the relationship between helpful and political party affiliation (partyid)? Notice something funny there) What about its relationship with religiosity (god)? Age (age)?

Since it’s hard to interpret “DEPENDS” as an answer, let’s focus on whether people are helpful or looking out for selves (the first two options).

Individual income, family income, and marital happiness

The data includes not only the respondent’s income (realrinc), but also the family income (realinc) This allows you to figure out how much other members of the family (typically the spouse) is earning. Do respondents whose income is a larger proportion of the family income have a different marital satisfaction (hapmar)?

Human evolution and elephant evolution

Half of the respondents were asked about human evolution (evolved). The other were asked about non-human animal evolution (evolved2). How does the difference between endorsing evolved and evolved2 vary by education? Political affiliation?

Examples of wrangling GSS data

Gary Lupyan

10/19/2023

Load in data

Two things to watch out for when importing data

Data types

Factor levels

Do people know that the earth goes around the sun?

Contingency tables using base R

Contingency tables using `sjmisc`

Contingency tables using `dplyr`

Human evolution, politics, education

Now it’s your turn!

Children and happiness

Income and happiness

Politics and cynicism

Individual income, family income, and marital happiness

Human evolution and elephant evolution

Examples of wrangling GSS data

Gary Lupyan

10/19/2023

Load in data

Two things to watch out for when importing data

Data types

Factor levels

Do people know that the earth goes around the sun?

Contingency tables using base R

Contingency tables using sjmisc

Contingency tables using dplyr

Human evolution, politics, education

Now it’s your turn!

Children and happiness

Income and happiness

Politics and cynicism

Individual income, family income, and marital happiness

Human evolution and elephant evolution

Contingency tables using `sjmisc`

Contingency tables using `dplyr`