This project explores how accurately a machine learning model (logistic regression trained with stochastic gradient descent) can predict the outcomes of the 2018 United States House of Representatives races from processed census data, whether demographic data are good indicators of electoral outcomes, and, if so, which demographics matter most.
The model uses a dataset built by combining the electoral results with the American Community Survey 2018 variables for each House district. It applies recursive feature elimination with cross-validation, narrowing the 12,783 census variables down to the 22 features it determined to be the most important for classifying each seat/district as Democrat or Republican.
Although the model ranked the following census variables as the most important for predicting whether a seat goes Democrat or Republican, it cannot describe how these variables actually influence the outcome. It indicates only that they matter and how they rank, so further research and exploration are needed before any assertions or conclusions can be drawn. The 22 highest-ranked census variables are paraphrased below:
How many people have a computer with internet?
How many single male households did not receive food stamps in the past year?
How many people worked part time year round who use Medicaid or other means tested health insurance?
How many people are self-employed and not in incorporated business with unpaid family workers in material moving occupations (truckers)?
How many 16 or 17 year old white males are at or above the poverty level?
How many females aged 6 to 18 years old are on Medicaid or other means tested health insurance?
How many females aged 19 to 25 are not enrolled in school and do not have health insurance coverage?
How many married couples with no children under the age of 18 earned between $15,000 and $19,999 in the past year?
How many females are married with an absent spouse, aged 85 years and older (widows)?
How many families of three to four people are at or above the poverty level?
How many people are at or above the poverty level who speak other Indo-European languages?
How many single males 65 years or older are renters?
How many homes are not occupied?
How many homes are occupied by the owner?
How many male family households have no wife present?
How many owner occupied mobile homes are there?
How many 5-17 year-olds are there?
How many females are health technologists and technicians?
How many white females 25 years or older have no high school diploma?
How many males 25 to 34 years old have no high school diploma?
How many families do not have any income saved?
How many white males 18 years or older have veteran status?
With a 60:40 split of training data to testing data, I obtained the following accuracy scores:
Training accuracy: 0.8837209302325582
Testing Accuracy: 0.8323699421965318
Overall Accuracy: 0.8631090487238979
Percentage of Misclassified Districts: 0.1368909512761021
The model misclassified 31 Democratic districts as Republican and 28 Republican districts as Democratic in the 2018 midterm elections.
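As a quick consistency check on the figures above, the short script below works backwards from the reported numbers. The total number of districts in the dataset is not stated in the text, so it is inferred from the misclassification rate; the rule that the 40% test share is rounded up is likewise an assumption.

```python
import math

# Figures reported above.
dem_as_rep = 31                      # Democratic districts predicted Republican
rep_as_dem = 28                      # Republican districts predicted Democratic
reported_error_rate = 0.1368909512761021
reported_train_acc = 0.8837209302325582
reported_test_acc = 0.8323699421965318

misclassified = dem_as_rep + rep_as_dem  # 59

# Implied number of districts in the dataset (inferred, not stated in the text).
total = round(misclassified / reported_error_rate)
overall_acc = (total - misclassified) / total

# Assumption: the 40% test share is rounded up to a whole district.
n_test = math.ceil(0.4 * total)
n_train = total - n_test

# Correct predictions implied by each reported accuracy.
train_correct = round(reported_train_acc * n_train)
test_correct = round(reported_test_acc * n_test)

print("districts:", total, "train:", n_train, "test:", n_test)
print("correct (train + test):", train_correct + test_correct)
print("correct (total - misclassified):", total - misclassified)
print("overall accuracy:", round(overall_acc, 4))
```

Under these assumptions the two counts of correct predictions agree, which suggests that the overall accuracy pools the training and testing districts. If so, the testing accuracy is the better guide to how the model would do on districts it has not seen.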
Here are some visualizations of how well my model did:
Here is how the model predicted the House to go:
The bar at the top shows how many seats it predicts for the Republicans and the Democrats. The red bar is the number of predicted Republican seats and the blue bar is the number of predicted Democratic seats. The map is broken up by congressional district with each district being colored red if it’s predicted to go Republican and blue if it’s predicted to go Democrat.
Here are the actual results of the 2018 congressional midterms:
The bar at the top shows how many seats went to the Republicans and the Democrats. The red bar is the number of actual Republican seats in 2018 and the blue bar is the number of actual Democratic seats. The map is broken up by congressional district with each district being colored red if it went Republican and blue if it went Democrat.
Here is a comparison of the two:
Here is an error analysis of which districts my model misclassified:
Admittedly, I am not yet sure why my model misclassified the following districts; however, they could be swing districts or vulnerable districts.
Each district shown is one where my model incorrectly classified the outcome of the race. It is colored red if my model predicted Republican and the district actually went Democratic. It is colored blue if my model predicted Democratic and the district actually went Republican in 2018. Each ‘X’ simply marks one incorrectly classified district and serves as another tooltip for viewing information on it.
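The coloring rule for the error map can be sketched in a few lines. The district IDs and labels below are hypothetical placeholders for illustration, not real results, and the helper name `error_colors` is made up for this example.

```python
# Hypothetical district IDs and party labels, for illustration only.
actual = {"district_a": "D", "district_b": "R", "district_c": "D", "district_d": "R"}
predicted = {"district_a": "R", "district_b": "R", "district_c": "D", "district_d": "D"}


def error_colors(actual, predicted):
    """Map each misclassified district to the color of the party the model predicted."""
    colors = {}
    for district, truth in actual.items():
        guess = predicted[district]
        if guess != truth:
            # Red: predicted Republican but went Democratic.
            # Blue: predicted Democratic but went Republican.
            colors[district] = "red" if guess == "R" else "blue"
    return colors


errors = error_colors(actual, predicted)
print(errors)  # {'district_a': 'red', 'district_d': 'blue'}
```

Correctly classified districts are left out of the result, so only the misclassified ones would be drawn and marked on the map.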