library(foreign)
library(Hmisc)
library(sjmisc)
library(tidyverse)
library(ggplot2)
library(DT)
We’re loading in the SPSS-formatted data from GSS year 2018. Need foreign::spss.get to read the data
Whenever we’re loading in data from external sources, especially data as complex as the GSS, we need to make sure that we’re importing what we think we’re importing. Let’s look at a couple variables and see what kind of errors we could accidentally make
The variable age
is the respondent’s age. Simple enough.
Let’s try to plot the age distribution using hist()
.
Hmmm…
Hmm.. what’s going on?
summary(gss$age)
## 18 19 20 21 22 23
## 22 26 15 27 40 29
## 24 25 26 27 28 29
## 38 43 31 39 45 43
## 30 31 32 33 34 35
## 50 34 43 42 65 40
## 36 37 38 39 40 41
## 40 43 38 55 41 40
## 42 43 44 45 46 47
## 40 39 40 29 42 30
## 48 49 50 51 52 53
## 33 31 37 41 29 48
## 54 55 56 57 58 59
## 35 48 50 39 37 46
## 60 61 62 63 64 65
## 46 39 24 37 33 39
## 66 67 68 69 70 71
## 36 27 39 41 45 33
## 72 73 74 75 76 77
## 22 20 29 23 19 28
## 78 79 80 81 82 83
## 11 9 12 9 12 7
## 84 85 86 87 88 89 OR OLDER
## 10 12 14 5 8 29
## NA's
## 7
Uh huh… how do we fix it? What about this? Let’s just assume that those >89 are 90. They’re not, but it’s better than just leaving them out of the analysis, so….
gss$age_recoded = gss$age %>%
fct_recode("90"="89 OR OLDER") %>%
as.numeric
hist(gss$age_recoded)
mean(gss$age_recoded,na.rm=TRUE)
## [1] 31.97138
Ok.. mean is 32.. seems reasonable? Is it correct though? Let’s look at a histogram. What do you notice?
hist(gss$age_recoded)
How do we fix this?
Now try it yourself by fixing the coding of realrinc
(respondent’s annual income adjusted to 1986 dollars).
gss$age_recoded = gss$age %>%
fct_recode("90"="89 OR OLDER") %>%
as.character %>%
as.numeric
mean(gss$age_recoded,na.rm=TRUE)
## [1] 48.98377
What’s the mean and median income of participants younger than 25 in 2018 dollars? To get 2018 dollars (where these data are from), multiply realrinc (1986 dollars) by 2.29.
gss$age_recoded = gss$age %>%
fct_recode("90"="89 OR OLDER") %>%
as.character %>%
as.numeric
gss <- gss %>% mutate(realrinc = as.numeric(as.character(realrinc)))
gss <- gss %>% mutate(realrinc_2016 = realrinc*2.29)
How many participants are there in this dataset who’re younger than
25? Break it down by the participant’s gender (sex
). Do you
notice anything about the mean vs. median for males vs. females? What is
this telling you?
Solution
gss %>% filter(age_recoded<25) %>%
summarize(realrinc_2016_mean = mean(realrinc_2016,na.rm=TRUE),realrinc_2016_med = median(realrinc_2016,na.rm=TRUE),n=n())
## realrinc_2016_mean realrinc_2016_med n
## 1 16441.63 9356.94 197
The variable happy
contains participant’s responses to
this question:
Taken all together, how would you say things are these days--would you say that you are very happy, pretty happy, or not too happy?
It’s coded as those three levels
levels(gss$happy)
## [1] "VERY HAPPY" "PRETTY HAPPY" "NOT TOO HAPPY"
Let’s look at respondents’ happiness as a function of self-reported
health (health
)
gss %>% drop_na(health) %>% group_by(health) %>% summarize(happy = mean(as.numeric(happy),na.rm=TRUE))
## # A tibble: 4 × 2
## health happy
## <fct> <dbl>
## 1 EXCELLENT 1.56
## 2 GOOD 1.85
## 3 FAIR 2.02
## 4 POOR 2.35
See the problem?
levels(gss$happy)
## [1] "VERY HAPPY" "PRETTY HAPPY" "NOT TOO HAPPY"
Let’s fix it:
gss$happy_rev <- fct_rev(gss$happy)
levels(gss$happy)
## [1] "VERY HAPPY" "PRETTY HAPPY" "NOT TOO HAPPY"
levels(gss$happy_rev)
## [1] "NOT TOO HAPPY" "PRETTY HAPPY" "VERY HAPPY"
Ok, let’s switch gearts a bit. Let’s look at some relationships
between degree
(participant’s education level) and their
knowledge of whether the earth goes around the sun or the sun goes
around the earth (earthsun
)
The first way we can do this is by using contingency tables found in the base package which by default will show us frequencies:
gss %>% xtabs(~degree+earthsun,data=.)
## earthsun
## degree Earth around sun Sun around earth
## LT HIGH SCHOOL 77 44
## HIGH SCHOOL 413 166
## JUNIOR COLLEGE 57 26
## BACHELOR 192 31
## GRADUATE 97 14
or more usefully, proportions (notice how we’re piping in the
xtabs
output into prop.table
)
gss %>% xtabs(~degree+earthsun,data=.) %>% prop.table(margin=1) %>% round(2)
## earthsun
## degree Earth around sun Sun around earth
## LT HIGH SCHOOL 0.64 0.36
## HIGH SCHOOL 0.71 0.29
## JUNIOR COLLEGE 0.69 0.31
## BACHELOR 0.86 0.14
## GRADUATE 0.87 0.13
People with less than a high school degree are at chance on this question… What does this mean?). Is that more or less troubling than that 7% of people with graduate degrees get this wrong? (not a rhetorical question!)
sjmisc
Here’s what the same tabulation looks like using the
sjmisc
package which provides us some nice additional
features:
gss %>% flat_table(degree,earthsun)
## earthsun Earth around sun Sun around earth
## degree
## LT HIGH SCHOOL 77 44
## HIGH SCHOOL 413 166
## JUNIOR COLLEGE 57 26
## BACHELOR 192 31
## GRADUATE 97 14
..and proportions:
gss %>% flat_table(degree,earthsun,margin=c("row"),digits=0) #the digits argument is the number of significant digits
## earthsun Earth around sun Sun around earth
## degree
## LT HIGH SCHOOL 64 36
## HIGH SCHOOL 71 29
## JUNIOR COLLEGE 69 31
## BACHELOR 86 14
## GRADUATE 87 13
sjmisc
has additional grouping functions that allow us
to split a variable into equal groups. Let’s do this for
degree
. Note that we need to first convert it to
numeric.
gss %>% transform(degree=as.numeric(degree)) %>%
split_var(degree,n=3,val.labels = c("Lowest tertile","Middle tertle", "Highest tertile")) %>%
flat_table(degree_g,earthsun,margin=c("row"),digits=1)
## earthsun Earth around sun Sun around earth
## degree_g
## Lowest tertile 63.6 36.4
## Middle tertle 71.3 28.7
## Highest tertile 83.0 17.0
dplyr
Here’s what a basic contingency table looks like with
dplyr
gss %>% group_by(degree,earthsun) %>% summarize(n=n())
## `summarise()` has grouped output by 'degree'. You can override using the
## `.groups` argument.
## # A tibble: 15 × 3
## # Groups: degree [5]
## degree earthsun n
## <fct> <fct> <int>
## 1 LT HIGH SCHOOL Earth around sun 77
## 2 LT HIGH SCHOOL Sun around earth 44
## 3 LT HIGH SCHOOL <NA> 141
## 4 HIGH SCHOOL Earth around sun 413
## 5 HIGH SCHOOL Sun around earth 166
## 6 HIGH SCHOOL <NA> 599
## 7 JUNIOR COLLEGE Earth around sun 57
## 8 JUNIOR COLLEGE Sun around earth 26
## 9 JUNIOR COLLEGE <NA> 113
## 10 BACHELOR Earth around sun 192
## 11 BACHELOR Sun around earth 31
## 12 BACHELOR <NA> 242
## 13 GRADUATE Earth around sun 97
## 14 GRADUATE Sun around earth 14
## 15 GRADUATE <NA> 136
Let’s get rid of those NAs:
gss %>% group_by(degree,earthsun) %>% drop_na(degree,earthsun) %>% summarize(n=n())
## `summarise()` has grouped output by 'degree'. You can override using the
## `.groups` argument.
## # A tibble: 10 × 3
## # Groups: degree [5]
## degree earthsun n
## <fct> <fct> <int>
## 1 LT HIGH SCHOOL Earth around sun 77
## 2 LT HIGH SCHOOL Sun around earth 44
## 3 HIGH SCHOOL Earth around sun 413
## 4 HIGH SCHOOL Sun around earth 166
## 5 JUNIOR COLLEGE Earth around sun 57
## 6 JUNIOR COLLEGE Sun around earth 26
## 7 BACHELOR Earth around sun 192
## 8 BACHELOR Sun around earth 31
## 9 GRADUATE Earth around sun 97
## 10 GRADUATE Sun around earth 14
Let’s convert to proportions:
gss %>% group_by(degree,earthsun) %>% drop_na(degree,earthsun) %>% summarize(n=n()) %>%
mutate(proportion = n / sum(n))
## `summarise()` has grouped output by 'degree'. You can override using the
## `.groups` argument.
## # A tibble: 10 × 4
## # Groups: degree [5]
## degree earthsun n proportion
## <fct> <fct> <int> <dbl>
## 1 LT HIGH SCHOOL Earth around sun 77 0.636
## 2 LT HIGH SCHOOL Sun around earth 44 0.364
## 3 HIGH SCHOOL Earth around sun 413 0.713
## 4 HIGH SCHOOL Sun around earth 166 0.287
## 5 JUNIOR COLLEGE Earth around sun 57 0.687
## 6 JUNIOR COLLEGE Sun around earth 26 0.313
## 7 BACHELOR Earth around sun 192 0.861
## 8 BACHELOR Sun around earth 31 0.139
## 9 GRADUATE Earth around sun 97 0.874
## 10 GRADUATE Sun around earth 14 0.126
Notice that the data are in a long
format. This is
exactly what we want for graphing and statistical analyses, but if you
want a more readable table, just pivot it like so:
gss %>% group_by(degree,earthsun) %>% drop_na(degree,earthsun) %>%
summarize(n=n()) %>% mutate(proportion = n / sum(n)) %>%
select(-n) %>% #get rid of the number of observations column since we're using proportions here
pivot_wider(names_from=earthsun,values_from=proportion)
## `summarise()` has grouped output by 'degree'. You can override using the
## `.groups` argument.
## # A tibble: 5 × 3
## # Groups: degree [5]
## degree `Earth around sun` `Sun around earth`
## <fct> <dbl> <dbl>
## 1 LT HIGH SCHOOL 0.636 0.364
## 2 HIGH SCHOOL 0.713 0.287
## 3 JUNIOR COLLEGE 0.687 0.313
## 4 BACHELOR 0.861 0.139
## 5 GRADUATE 0.874 0.126
Let’s look at another relationship: endorsement of the statement
“human beings developed from animals” (evolved
) and see how
it relates to political party affiliation (partyid
)
gss %>% flat_table(polviews,evolved,margin=c("row"),digits=1)
## evolved True False
## polviews
## EXTREMELY LIBERAL 76.7 23.3
## LIBERAL 83.1 16.9
## SLIGHTLY LIBERAL 66.0 34.0
## MODERATE 54.8 45.2
## SLGHTLY CONSERVATIVE 52.0 48.0
## CONSERVATIVE 35.3 64.7
## EXTRMLY CONSERVATIVE 36.8 63.2
Let’s see how the likelihood of endorsing human evolution is affected
by education (degree
):
gss %>% group_by(degree) %>% drop_na(degree,evolved) %>% summarize(evolved=round(mean(evolved=="True",na.rm=TRUE),2),n=n())
## # A tibble: 5 × 3
## degree evolved n
## <fct> <dbl> <int>
## 1 LT HIGH SCHOOL 0.4 47
## 2 HIGH SCHOOL 0.53 258
## 3 JUNIOR COLLEGE 0.64 45
## 4 BACHELOR 0.64 117
## 5 GRADUATE 0.6 57
Now let’s see whether education has a differential effect for people of various political orientations. For simplicity let’s dichotomize education into respondents with a college degree or above, and those below a college degree:
gss %>% drop_na(polviews,degree,evolved) %>%
mutate(college_or_more = (as.numeric(degree) > 3)) %>%
group_by(polviews,college_or_more) %>%
summarize(humans_evolved=round(mean(evolved=="True",na.rm=TRUE),2),n=n()) %>%
datatable(options = list(pageLength = 20)) #check this out in a browser!
## `summarise()` has grouped output by 'polviews'. You can override using the
## `.groups` argument.
(notice the small Ns in some cells)
This is the kind of data that’s perfect for graphing to look at interactions. But we’re examining it in table form, we can pivot it to wide and make a difference column, e.g
gss %>% drop_na(polviews,degree,evolved) %>%
mutate(college_or_more = (as.numeric(degree) > 3)) %>%
group_by(polviews,college_or_more) %>%
summarize(humans_evolved=mean(evolved=="True",na.rm=TRUE)) %>%
pivot_wider(names_from=college_or_more,values_from=humans_evolved) %>% rename("college_or_more" = `TRUE`, "lt_college" = `FALSE`) %>%
mutate(evolution_diff = college_or_more - lt_college)
## `summarise()` has grouped output by 'polviews'. You can override using the
## `.groups` argument.
## # A tibble: 7 × 4
## # Groups: polviews [7]
## polviews lt_college college_or_more evolution_diff
## <fct> <dbl> <dbl> <dbl>
## 1 EXTREMELY LIBERAL 0.571 0.938 0.366
## 2 LIBERAL 0.714 0.935 0.221
## 3 SLIGHTLY LIBERAL 0.684 0.6 -0.0842
## 4 MODERATE 0.534 0.585 0.0511
## 5 SLGHTLY CONSERVATIVE 0.490 0.577 0.0871
## 6 CONSERVATIVE 0.404 0.25 -0.154
## 7 EXTRMLY CONSERVATIVE 0.375 0.333 -0.0417
What does a regression model look like in the tidyverse? Simple!
Let’s test for the main effects and interaction of polviews
and degree
on evolution
using logistic
regression (because our outcome variable is binary). We’re going to use
scale
to center the variables. We don’t need to explicitly
use drop_na
because glm will automatically drop missing
values.
gss %>% glm(I(evolved=="True")~scale(as.numeric(polviews))*scale(as.numeric(degree)),data=.,family=binomial) %>% summary
##
## Call:
## glm(formula = I(evolved == "True") ~ scale(as.numeric(polviews)) *
## scale(as.numeric(degree)), family = binomial, data = .)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2251 -1.1399 0.5613 1.1060 1.9199
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 0.27419 0.09536
## scale(as.numeric(polviews)) -0.58379 0.10311
## scale(as.numeric(degree)) 0.20776 0.09974
## scale(as.numeric(polviews)):scale(as.numeric(degree)) -0.31711 0.10670
## z value Pr(>|z|)
## (Intercept) 2.875 0.00404 **
## scale(as.numeric(polviews)) -5.662 1.5e-08 ***
## scale(as.numeric(degree)) 2.083 0.03725 *
## scale(as.numeric(polviews)):scale(as.numeric(degree)) -2.972 0.00296 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 695.00 on 506 degrees of freedom
## Residual deviance: 640.54 on 503 degrees of freedom
## (1841 observations deleted due to missingness)
## AIC: 648.54
##
## Number of Fisher Scoring iterations: 4
You can look up info on the variables at https://gssdataexplorer.norc.org/variables/vfilter If
you just need to quickly look up the description of the variable use
head(gss$var_name)
or summary(gss$var_name) to list its
factors (which will be listed in order). Note that when the data were
imported from SPSS, all variables were imported as factors, hence the
need to convert to numeric as needed.
Are people with children (childs!=0
) happier
(happy
) than people without children? Does this
relationship depend on the respondent’s gender? (sex
) Their
age? (age
)?
Note Feel free to bin age into discrete groups (20-30, 31-40 etc.). The sjmisc package has nice functions for this. See the cheat sheet.
What’s the relationship between respondent’s income
(realrinc
) and happiness (happy
)? Use
sjmisc::split_var
to bin income into quintiles (i.e., 5
equal groups). How does this relationship vary for people who have a
college degree or higher vs. people who don’t?
Tip Remember to convert income to numeric using
as.numeric(as.character(realrinc)))
Tip
realrinc
is income in 1986 dollars. To convert to 2018
dollars, multiply by 2.29. Also make sure you interpret the value of
happy
correctly.
Have a look at the variable helpful
. What’s the
relationship between helpful
and political party
affiliation (partyid
)? Notice something funny there) What
about its relationship with religiosity (god
)? Age
(age
)?
Since it’s hard to interpret “DEPENDS” as an answer, let’s focus on whether people are helpful or looking out for selves (the first two options).
The data includes not only the respondent’s income
(realrinc
), but also the family income
(realinc
) This allows you to figure out how much other
members of the family (typically the spouse) is earning. Do respondents
whose income is a larger proportion of the family income have a
different marital satisfaction (hapmar
)?
Half of the respondents were asked about human evolution
(evolved
). The other were asked about non-human animal
evolution (evolved2
). How does the difference between
endorsing evolved
and evolved2
vary by
education? Political affiliation?