This was my course project in Introduction to Data Science, the first course in a three-course certificate program through the University of Washington.
I’m publishing it here, warts and all, because I think there are a few interesting facets to the analysis.
A study of colleges in the US has collected information regarding school and student demographics, degrees awarded by subject area as well as financial information including average degree cost by income bracket, student tuition and scholarships, repayment rates, and post-college average earnings.
0 Where are the colleges located?
US colleges appear to be distributed similarly to the overall US population.
Among college types, however, it appears that Public colleges are the most likely to be located in rural areas, while Private for-profit colleges are primarily centered around urban areas.
This is particularly noticeable given that there are nearly twice as many private for-profit colleges as public colleges:
## # A tibble: 3 x 3
## CONTROL `n()` type
## <int> <int> <chr>
## 1 1 2044 Public
## 2 2 1956 Private non-profit
## 3 3 3703 Private for-profit
2.1. Count of colleges by degree length and region
College ownership appears related to the degree-type awarded:
I’m not sure what conclusions we can draw about regional distribution of colleges, beyond the fact that the Southeast appears overrepresented. Deeper analysis should adjust college counts by total population to compare the number of colleges per capita.
2.2. Categorize student graduation counts by degree name
Health, business/marketing, cuinary, and humanities are all top majors:
Top 5 majors by college type:
#public
t.major2[t.major2$Type=="Public",][1:5,]
## Type variable value
## 106 Public health 0.22663601
## 46 Public humanities 0.17144393
## 109 Public business_marketing 0.11892869
## 82 Public security_law_enf 0.04044814
## 94 Public mechanic_repair_tech 0.03918093
#private non-prof
t.major2[t.major2$Type=="Private Non-Profit",][1:5,]
## Type variable value
## 107 Private Non-Profit health 0.18948482
## 110 Private Non-Profit business_marketing 0.15411169
## 71 Private Non-Profit theology_relig 0.07446778
## 104 Private Non-Profit visual_performing 0.06753585
## 26 Private Non-Profit education 0.05250905
#private for-prof
t.major2[t.major2$Type=="Private For-profit",][1:5,]
## Type variable value
## 108 Private For-profit health 0.37294668
## 24 Private For-profit culinary 0.28278959
## 111 Private For-profit business_marketing 0.07457577
## 21 Private For-profit computer 0.05469259
## 96 Private For-profit mechanic_repair_tech 0.03740514
Private for-profit schools are the main providers of culinary programs, while private non-profit colleges specialize in theology/religious studies (likely religiously-affiliated colleges) and the visual and performing arts.
3.1. Study type, experimental model, and approach
The College Scorecard (https://collegescorecard.ed.gov/data/) is a collection of data on US colleges regarding school and student demographics, degrees awarded by subject area, as well as financial information including average degree cost by income bracket, student tuition and scholarships, repayment rates, and post-college earnings.
Data are self-reported and observational (there are no tests or treatments being applied).
Experimental design:
3.2. Bias
3.3. Hypothesis
School selectivity - as measured by admissions rate and average SAT score - is positively correlated with graduation rate.
School ownership (Public/Private non-profit/Private for-profit) is correlated with graduation rate. Specifically, private non-profits have the highest graduation rate, followed by public colleges followed by private for-profit colleges.
Graduation rates are highest for graduate students, followed by Bachelor’s, Associate’s, and certificates.
4.0. Cost by College Type
Private non-profit colleges are more than twice as expensive as Public colleges and nearly 40% more expensive than Private for-profit colleges.
## # A tibble: 3 x 2
## type avg_cost
## <chr> <dbl>
## 1 Public 14922.46
## 2 Private Non-profit 35607.98
## 3 Private For-profit 25902.30
4.1. Cost by Degree
Most of the cost differential between Private non-profit and for-profit colleges is due to the face that Bachelor’s and graduate degrees at Private non-profit schools are significantly more expensive than at Private for-profits.
Surprisingly, there is not a large cost increase at Private for-profits as you go from certificate degrees to Associate’s and Bachelor’s.
4.1a. Cost by Major
The most expensive majors cost 2-3x the least expensive. The most expensive majors are those most-commonly awarded at Private colleges (Philosophy, Religion, Ethnic/Cutlural/Gender studies, Architecture), while the least-expensive majors are associated with skilled trades (Mechanic/Repair/Technician, Construction, and Precision Production).
4.2. Cost by College
The most expensive colleges are all Private non-profit schools, mostly located in major metropolitan areas. Community colleges are the most affordable, whereas if you want to attend a low-cost Private college, head to the Caribbean!
Top 5 most expensive colleges
## INSTNM COSTT4_A
## 1 University of Chicago 64988
## 2 Columbia University in the City of New York 64144
## 3 Sarah Lawrence College 64056
## 4 Washington University in St Louis 63755
## 5 Occidental College 63363
Top 5 least expensive colleges
## INSTNM COSTT4_A
## 1 Cleveland Community College 5536
## 2 Colegio Universitario de San Juan 5920
## 3 Pearl River Community College 6171
## 4 Harford Community College 6336
## 5 South Texas College 6374
Top 5 least expensive Private non-profit schools
## INSTNM COSTT4_A
## 1 Caribbean University-Bayamon 8180
## 2 Caribbean University-Ponce 8438
## 3 Pontifical Catholic University of Puerto Rico-Mayaguez 8716
## 4 Atlantic University College 9521
## 5 Dewey University-Hato Rey 10021
4.3. College completion rates by state, compared to national average
Target = predict student graduation rate Train model on year T-1 (2013/2014 data) and evaluate predictions accuracy in year T (2014/2015 data)
5.1. Evaluate the primary features that affect college program completion
I chose to start with a regression tree to give first impressions of the key predictors of colleges’ graduation rates.
#Add additional predictors to the "scorecard" dataset used for data exploration
new.vars <- c('RELAFFIL','ADM_RATE','SAT_AVG','INEXPFTE','AVGFACSAL',
'PFTFAC','PCTPELL','AGE_ENTRY','HCM2','MAIN','CCBASIC',
'CCUGPROF','CCSIZSET','PCTFLOAN','FEMALE','MARRIED',
'DEPENDENT','VETERAN','FIRST_GEN','ICLEVEL','FAMINC')
data14 <- cbind(scorecard,df_14_15[,new.vars])
#delete obs with missing target variable
data14 <- data14[!is.na(data14$C150_4),]
#import training data (2013/14 data)
df_13_14 <- as.data.frame(fread("C:/Users/nadav.rindler/OneDrive - American Red Cross/Training/UW Data Science Certificate/DS250/Course Project/CollegeScorecard_Raw_Data/MERGED2013_14_PP.csv",
na.strings=c("NULL","PrivacySuppressed")))
#fix UNITID name
colnames(df_13_14)[1] <- c("UNITID")
#training data
data13 <- df_13_14[!is.na(df_13_14$C150_4),which(names(df_13_14) %in% colnames(data14))]
#Regression Tree
library(rpart)
library(rpart.plot)
#check null values - some attributes are entirely null
null_list <- colSums(is.na(data13), na.rm=FALSE)
nulls <- null_list[null_list>0]
nulls
## HCM2 LOCALE CCBASIC CCUGPROF CCSIZSET RELAFFIL ADM_RATE
## 2448 2448 2448 2448 2448 2448 651
## SAT_AVG PCIP01 PCIP03 PCIP04 PCIP05 PCIP09 PCIP10
## 1069 1 1 1 1 1 1
## PCIP11 PCIP12 PCIP13 PCIP14 PCIP15 PCIP16 PCIP19
## 1 1 1 1 1 1 1
## PCIP22 PCIP23 PCIP24 PCIP25 PCIP26 PCIP27 PCIP29
## 1 1 1 1 1 1 1
## PCIP30 PCIP31 PCIP38 PCIP39 PCIP40 PCIP41 PCIP42
## 1 1 1 1 1 1 1
## PCIP43 PCIP44 PCIP45 PCIP46 PCIP47 PCIP48 PCIP49
## 1 1 1 1 1 1 1
## PCIP50 PCIP51 PCIP52 PCIP54 COSTT4_A AVGFACSAL PFTFAC
## 1 1 1 1 60 18 183
## PCTPELL PCTFLOAN AGE_ENTRY FEMALE MARRIED DEPENDENT VETERAN
## 2 2 12 118 379 144 1319
## FIRST_GEN FAMINC
## 170 12
#convert some numeric values to factor
data13$CONTROL <- as.factor(data13$CONTROL)
data13$REGION <- as.factor(data13$REGION)
data13$ICLEVEL <- as.factor(data13$ICLEVEL)
data14$CONTROL <- as.factor(data14$CONTROL)
data14$REGION <- as.factor(data14$REGION)
data14$ICLEVEL <- as.factor(data14$ICLEVEL)
#data transformations
data13$INEXPFTE_exp <- exp(data13$INEXPFTE)
data14$INEXPFTE_exp <- exp(data14$INEXPFTE)
#religious affiliation doesn't exist in 2013 data. Join from 2014 data
#create an indicator variable to identify ANY religiously-affiliated college
data13 <- data13 %>%
select(-RELAFFIL) %>%
left_join(data14[,c("UNITID","RELAFFIL")], by="UNITID") %>%
mutate(RELAFFIL_IND=ifelse(RELAFFIL>=0,1,0))
data14$RELAFFIL_IND <- ifelse(data14$RELAFFIL>=0,1,0)
#remove from predictors - target var, ID vars, null vars
exclude <- c("C150_4","UNITID","INSTNM","HCM2","LOCALE","CCBASIC","CCUGPROF",
"CCSIZSET","RELAFFIL","LNFAMINC","STABBR","COSTT4_A")
varnames <- names(data13)[-which(names(data13) %in% exclude)]
fmla <- as.formula(paste("C150_4 ~ ", paste(varnames, collapse= "+")))
#Train tree
grad.tree <- rpart(fmla,data = data13, method="anova",
control=rpart.control(cp=0.001)) #increase model complexity over default (cp=0.01)
#prune tree to node with min xerror
p.grad.tree <- prune(grad.tree, cp=grad.tree$cptable[which.min(grad.tree$cptable[,"xerror"]),"CP"])
#plot tree
prp(p.grad.tree,faclen=2,varlen=20,cex=0.5)
I’m surprised that the most important predictors of a college’s graduation rate relate to its students’ family incomes. Family income (FAMINC), percentage of the student body on Pell Grants (PCTPELL), average age at college entry (AGE_ENTRY), percentage of first generation students (FIRST_GEN), percentage of dependent students (DEPENDENT), and percentage of students on federal loans (PCTFLOAN) are all measures of the demographics and socioeconomic background of the student body.*
There are also several important predictors related to average SAT score and educational resources (instructional expenditure per student, INEXPFTE, and average faculty salary, AVGFACSAL)
*Note: Demographic data such as family income, etc. come from the National Student Loan Data System (NSLDS) which is pulling the data from students’ FAFSA forms.
#variable importance
importance <- as.data.frame(sort(p.grad.tree$variable.importance,decreasing=TRUE))
colnames(importance) <- "importance"
head(importance, n=10)
## importance
## FAMINC 49.833834
## PCTPELL 31.222093
## AGE_ENTRY 19.237627
## FIRST_GEN 18.525942
## INEXPFTE 14.680300
## DEPENDENT 14.319560
## PCTFLOAN 5.376605
## SAT_AVG 3.735139
## AVGFACSAL 3.725172
## PCIP54 3.247551
Let’s visualize some of these relationships:
As it turns out, there is a high degree of correlation between predictors related to (1) student body’s socioeconomic background; (2) average SAT score; and (3) college’s educational expenditure:
5.2. Predict student graduation rates by college
Given the high degree of collinearity between key predictors, I opted to use Principal Component Analysis (PCA) to reduce the set of predictors to a smaller set of linearly-uncorrelated variables, while still maximizing variance of each principal component.
I used the initial regression tree to do variable selection, plugging in the predictors identified by the tree as inputs to the PCA.
This approach has several limitations: - PCA can’t handle null values. Using PCA I was only able to predict graduation rates for ~50% of colleges. Deeper analysis should account for null values using imputation or selecting predictors that have good data coverage. - PCA can only handle numeric inputs, so categorical attributes must be converted to numeric. This works well for categorical variables which are ordinal, but not so well for those which are nominal (no inherent order between levels).
library(psych) #PCA package
library(FactoMineR) #additional PCA analysis
#prepare pca dataset
pca.df <- data13[,cor.vars]
pca.df$C150_4 <- NULL
pca.df <- as.data.frame(lapply(pca.df , as.numeric))
#train pca
pca.rotate = principal(pca.df, nfactors=3, rotate = "varimax")
#apply pca scores
pca.scores = as.data.frame(pca.rotate$scores)
pca.df <- cbind(pca.df, pca.scores, C150_4=data13$C150_4)
#linear regression using principal components
pca.lm = lm(C150_4~RC1+RC2+RC3, data=pca.df)
summary(pca.lm)
##
## Call:
## lm(formula = C150_4 ~ RC1 + RC2 + RC3, data = pca.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.44221 -0.05546 0.00234 0.05771 0.51543
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.472161 0.003412 138.392 <2e-16 ***
## RC1 0.153575 0.002917 52.644 <2e-16 ***
## RC2 0.005700 0.003595 1.585 0.113
## RC3 -0.079445 0.004135 -19.211 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0945 on 1298 degrees of freedom
## (1146 observations deleted due to missingness)
## Multiple R-squared: 0.6845, Adjusted R-squared: 0.6837
## F-statistic: 938.6 on 3 and 1298 DF, p-value: < 2.2e-16
#Note the 1146 obs deleted due to missing values
5.2a. Evaluate model results
Ultimately the PCA linear regression approach outperformed the simple regression tree trained previously.
However, this could be due to the fact that the PCA linear model is trained only on the subset of schools which report average SAT scores (roughly half the dataset), and when AVG_SAT is excluded from the PCA model, model MSE jumps and is closer to the regression tree error rate.
PCA linear model mean squared error (MSE) on test data:
## [1] 0.008997365
Regression tree mean squared error (MSE) on test data:
## [1] 0.02325999
Baseline for comparison - average graduation rate among colleges in test data:
## [1] 0.4779574
Examine residuals - which schools are the top over/underperformers for graduation rate compared to my prediction?
Overperforming colleges are likely to be religiously affiliated private non-profit schools. Several are health/nursing focused, and many have a high percentage of female students.
## INSTNM C150_4 CONTROL
## 2855 The Christ College of Nursing and Health Sciences 0.7857 2
## 1733 College of Our Lady of the Elms 0.7811 2
## 2888 Good Samaritan College of Nursing and Health Science 0.7143 2
## 2151 Bryan College of Health Sciences 0.8000 2
## 3126 Northwest Christian University 0.6625 2
## 1170 Bethel College-Indiana 0.6983 2
## 423 Mount Saint Mary's University 0.6550 2
## 331 Fresno Pacific University 0.6327 2
## 1637 Bay Path University 0.6319 2
## 3197 Cedar Crest College 0.6907 2
## RELAFFIL_IND FEMALE resid
## 2855 0 0.9047619 0.3698549
## 1733 1 0.7703180 0.3628841
## 2888 1 0.9379845 0.3118155
## 2151 0 0.8977273 0.2997813
## 3126 1 0.6088154 0.2686893
## 1170 1 0.6677890 0.2612087
## 423 1 0.9270705 0.2543667
## 331 1 0.7465668 0.2465729
## 1637 0 NA 0.2443710
## 3197 0 0.9345088 0.2434338
Underperforming colleges are a mix of public and private non-profit schools. There seems to be some geographic concentration in the southeast/
## INSTNM C150_4 CONTROL
## 3759 Paul Quinn College 0.0380 2
## 4094 West Virginia University Institute of Technology 0.1921 1
## 6948 Pennsylvania State University-World Campus 0.2500 1
## 945 Truett-McConnell College 0.2216 2
## 9 Auburn University at Montgomery 0.2194 1
## 1193 Indiana University-Purdue University-Fort Wayne 0.2640 1
## 888 College of Coastal Georgia 0.1547 1
## 862 Abraham Baldwin Agricultural College 0.1618 1
## 7082 Augusta University 0.3030 1
## 1232 Purdue University-North Central Campus 0.2415 1
## STABBR resid
## 3759 TX -0.3158816
## 4094 WV -0.3118111
## 6948 PA -0.2401925
## 945 GA -0.2399651
## 9 AL -0.2367177
## 1193 IN -0.2326425
## 888 GA -0.2292820
## 862 GA -0.2282257
## 7082 GA -0.2261134
## 1232 IN -0.2241535
To address the patterns observed in the residuals, I tried including religious affiliation and region into the PCA linear regression model, but these attributes did not improve model fit. It’s possible that: - Only certain kinds of religious schools are associated with overperformance. Or, there could be an interaction between religious affiliation and healthcare-focused schools. - The geographic relationship may be at the state level rather than at the regional level. Or, it could be that the region definition could be changed to better reflect certain underperforming regions (e.g. “Appalachian Region” vs. Mid-Atlantic and Southeast).