Analyzing Survey Results Using Logistic Regression


Overview

Since 1972, the General Social Survey (GSS) has tracked changes in American society and examined the nuances of American life. Frequently cited by outlets such as The New York Times, The Wall Street Journal, and the Associated Press, it is among the most influential studies in the social sciences. Let's use Julius to analyze the survey data and see which factors contribute most to happiness in the US population.

Getting Started

Import your data into Julius. The AI can read data in multiple formats, including CSV, Excel, and Google Sheets, among others. Once your data is uploaded, Julius will automatically assess and understand the nature of the data.
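For readers who want to see roughly what this first look at the data involves, here is a minimal pandas sketch. It is not Julius's own implementation; the column names, the sample rows, and the in-memory CSV are all illustrative assumptions standing in for a real GSS export.

```python
import io
import pandas as pd

# Stand-in for an exported survey file. In practice this would be a file
# path (for example a hypothetical "gss.csv"); the rows below are made up.
raw_csv = io.StringIO(
    "happy,satjob,finrela\n"
    "very happy,very satisfied,average\n"
    "pretty happy,,above average\n"
    ",moderately satisfied,average\n"
)
df = pd.read_csv(raw_csv)

# A quick per-column overview: type, number of missing values,
# and number of distinct non-missing values.
summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "missing": df.isna().sum(),
    "unique": df.nunique(),
})
print(summary)
```

An overview like this makes it easier to decide which cleaning steps to ask for next.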

Initial analysis of General Social Survey results

Once your data is successfully imported, you can start your conversation with Julius.

Initial cleaning of the General Social Survey results

Before we begin analyzing the data, we first have to clean it. You can either give Julius a high-level instruction to clean the data according to best practices or, as we do here, specify the steps yourself. Our first cleaning steps are handling the NaN values and converting the variables to numeric values using ordinal encoding.

First cleaning step for General Social Survey dataset
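The first cleaning step could look something like the sketch below. The category labels and their ordering are assumptions made for illustration, not the actual GSS coding, and dropping rows with a missing target is one possible way of handling NaN values rather than the method confirmed by the tutorial.

```python
import numpy as np
import pandas as pd

# Toy stand-in for two GSS-style columns (values are illustrative).
df = pd.DataFrame({
    "happy": ["very happy", "pretty happy", None, "not too happy"],
    "satjob": ["very satisfied", np.nan, "a little dissatisfied", "very satisfied"],
})

# Hypothetical category orderings, from lowest to highest.
orders = {
    "happy": ["not too happy", "pretty happy", "very happy"],
    "satjob": ["very dissatisfied", "a little dissatisfied",
               "moderately satisfied", "very satisfied"],
}

def ordinal_encode(frame, orders):
    """Replace each ordered category with its rank (0, 1, 2, ...).

    Missing values and labels not listed in `orders` become NaN.
    """
    out = frame.copy()
    for col, levels in orders.items():
        rank = {level: i for i, level in enumerate(levels)}
        out[col] = out[col].map(rank)
    return out

encoded = ordinal_encode(df, orders)

# Assumption: rows with a missing target ("happy") cannot be used for
# training, so they are dropped; other NaN values are left for later steps.
encoded = encoded.dropna(subset=["happy"])
print(encoded)
```

Ordinal encoding keeps the natural ranking of the answers, which a plain one-hot encoding would discard.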

Additional cleaning steps include removing non-informative features and (eventually) converting the remaining data to a numeric type, filling in the missing values with 0.
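As a rough illustration of these additional steps, the sketch below treats identifier columns and columns with a single distinct value as "non-informative". That definition, the helper name drop_non_informative, and the sample columns are assumptions; the tutorial does not spell out which features were removed.

```python
import pandas as pd

# Toy data (illustrative only).
df = pd.DataFrame({
    "id": [1, 2, 3, 4],                  # unique identifier, no predictive signal
    "year": [2022, 2022, 2022, 2022],    # constant column
    "satjob": ["3", "2", None, "3"],     # numeric codes stored as text
    "finrela": [2, None, 1, 4],
})

def drop_non_informative(frame, id_cols=()):
    """Drop identifier columns and columns with at most one distinct value."""
    keep = [
        col for col in frame.columns
        if col not in id_cols and frame[col].nunique(dropna=True) > 1
    ]
    return frame[keep]

cleaned = drop_non_informative(df, id_cols=["id"])

# Coerce everything to numeric (unparseable entries become NaN),
# then fill the remaining missing values with 0.
cleaned = cleaned.apply(pd.to_numeric, errors="coerce").fillna(0)
print(cleaned)
```

One thing to keep in mind with this approach: if 0 is also a valid code for a real answer, filling with 0 makes missing responses indistinguishable from that answer.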

Selecting features with a chi-square test

To select the features most strongly associated with the target variable, which therefore merit inclusion in the regression analysis, we will prompt Julius to perform a chi-square test. Our goal is to cut the number of features from 800+ down to the top 75.

Selecting features with a chi-square test
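One common way to do this kind of selection in Python is scikit-learn's SelectKBest with the chi2 scorer, sketched below on synthetic data. Whether Julius uses this exact routine is an assumption. The feature names are made up, and k is set to 5 here only because the toy data has 20 columns; the tutorial's setting corresponds to k=75.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n_rows, n_features = 500, 20

# Synthetic binary target and non-negative, ordinal-style features.
y = rng.integers(0, 2, size=n_rows)
X = pd.DataFrame(
    rng.integers(0, 4, size=(n_rows, n_features)),
    columns=[f"feat_{i}" for i in range(n_features)],
)
# Make one feature depend directly on the target so it should rank first.
X["feat_0"] = y * 3

# chi2 requires non-negative feature values, which holds after
# ordinal encoding and filling missing values with 0.
selector = SelectKBest(score_func=chi2, k=5)
selector.fit(X, y)

selected = X.columns[selector.get_support()].tolist()
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)

print(selected)
print(scores.head())
```

The selected column names can then be used to subset the cleaned data before fitting the regression.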

Training a logistic regression model

Now, let's train our model. Logistic regression models the relationship between one or more independent variables and the probability of a particular outcome occurring.
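To make the idea concrete, here is a minimal scikit-learn sketch fitted on synthetic data. It assumes a binary target (for example, happy versus not happy), which the tutorial does not state explicitly, and the coefficients used to generate the data are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_rows = 1000

# Synthetic predictors and a binary outcome drawn from a logistic model:
# P(y = 1 | x) = 1 / (1 + exp(-(x . coef)))
X = rng.normal(size=(n_rows, 3))
true_coef = np.array([1.5, -1.0, 0.0])   # arbitrary illustrative values
prob = 1 / (1 + np.exp(-(X @ true_coef)))
y = rng.binomial(1, prob)

# Hold out 20% of the rows to check how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)        # share of correct predictions
probs = model.predict_proba(X_test)[:, 1]     # predicted P(y = 1) per row

print("test accuracy:", round(accuracy, 3))
print("coefficients:", model.coef_[0])
```

A positive coefficient means that higher values of the predictor raise the predicted probability of the outcome, and a negative one lowers it.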

While there is certainly room for further optimization, the results are adequate for our exploratory use case on a small dataset.

Finally, we will look at the significance of the coefficients to identify which predictor variables have a statistically significant relationship with the response variable.

Results of logistic regression analysis of General Social Survey dataset
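Coefficient significance is usually assessed with a Wald test: each coefficient is divided by its standard error, and the resulting z-score is converted to a p-value. The sketch below computes this by hand on synthetic data, assuming a binary target. The predictor names ("strong", "weak", "noise") and the generating coefficients are invented for illustration, and this is not necessarily how Julius produces its output.

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_rows = 2000
names = ["intercept", "strong", "weak", "noise"]

# Synthetic data: true coefficients are 1.2, 0.4 and 0.0, intercept 0.3.
X = rng.normal(size=(n_rows, 3))
logits = 0.3 + X @ np.array([1.2, 0.4, 0.0])
y = rng.binomial(1, 1 / (1 + np.exp(-logits)))

# A very large C makes the default L2 penalty negligible, so the fit is
# close to the plain maximum-likelihood estimate the Wald test assumes.
model = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)

# Design matrix with an intercept column, matching the fitted parameters.
design = np.column_stack([np.ones(n_rows), X])
beta = np.concatenate([model.intercept_, model.coef_[0]])

# Estimated covariance of the coefficients: inverse of X^T W X,
# where W holds p * (1 - p) for each row.
p_hat = 1 / (1 + np.exp(-(design @ beta)))
weights = p_hat * (1 - p_hat)
cov = np.linalg.inv(design.T @ (design * weights[:, None]))

std_err = np.sqrt(np.diag(cov))
z_scores = beta / std_err
p_values = 2 * norm.sf(np.abs(z_scores))   # two-sided p-values

for name, b, p in zip(names, beta, p_values):
    print(f"{name:>9}: coef={b: .3f}  p={p:.4g}")
```

Statistical packages such as statsmodels report the same quantities directly in their model summaries; the manual version is shown only to make the calculation visible.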

According to our model, the variables with the most significant relationships to happiness are:

Finalter: whether an individual's financial situation has been getting better, getting worse, or staying the same.
Hapmar: happiness level of marriage.
Finrela: perceived income level relative to "American families in general".
Marcohab: cohabitation status; whether married or not, and whether cohabiting with a partner or not.
Satjob: job satisfaction.

Training a logistic regression model is a great example of a relatively complex statistical analysis made far easier using Julius. Whether doing exploratory analysis or creating visualizations for a final report, Julius is a great addition to the academic data analysis stack.