Excercise 1 - Introduction to Data Science

Exercise 1.1

This exercise relates to the College data set, which can be found in the file College.csv on the website for the main course textbook James et al. http://www-bcf.usc.edu/~gareth/ISL/data.html. It contains a number of variables for 777 different universities and colleges in the US. The variables are

Private : Public/private indicator
Apps : Number of applications received
Accept : Number of applicants accepted
Enroll : Number of new students enrolled
Top10perc : New students from top 10% of high school class
Top25perc : New students from top 25% of high school class
F.Undergrad : Number of full-time undergraduates
P.Undergrad : Number of part-time undergraduates
Outstate : Out-of-state tuition
Room.Board : Room and board costs
Books : Estimated book costs
Personal : Estimated personal spending
PhD : Percent of faculty with Ph.D.’s
Terminal : Percent of faculty with terminal degree
S.F.Ratio : Student/faculty ratio
perc.alumni : Percent of alumni who donate
Expend : Instructional expenditure per student
Grad.Rate : Graduation rate

Before reading the data into R, it can be viewed in Excel or a text editor.

Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.

college <- read.csv("https://www.statlearning.com/s/College.csv")

Look at the data using the View() function. This loads a matrix or data.frame object into the spreadhseet-like viewer in RStudio, just clicking the name of the object will do in the Environment panel. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later. Try the following commands:

# View(college)
rownames(college) <- college[, 1]

You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored. Try

college <- college[, -1]
head(college)

##                              Private Apps Accept Enroll Top10perc Top25perc
## Abilene Christian University     Yes 1660   1232    721        23        52
## Adelphi University               Yes 2186   1924    512        16        29
## Adrian College                   Yes 1428   1097    336        22        50
## Agnes Scott College              Yes  417    349    137        60        89
## Alaska Pacific University        Yes  193    146     55        16        44
## Albertson College                Yes  587    479    158        38        62
##                              F.Undergrad P.Undergrad Outstate Room.Board Books
## Abilene Christian University        2885         537     7440       3300   450
## Adelphi University                  2683        1227    12280       6450   750
## Adrian College                      1036          99    11250       3750   400
## Agnes Scott College                  510          63    12960       5450   450
## Alaska Pacific University            249         869     7560       4120   800
## Albertson College                    678          41    13500       3335   500
##                              Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Abilene Christian University     2200  70       78      18.1          12   7041
## Adelphi University               1500  29       30      12.2          16  10527
## Adrian College                   1165  53       66      12.9          30   8735
## Agnes Scott College               875  92       97       7.7          37  19016
## Alaska Pacific University        1500  76       72      11.9           2  10922
## Albertson College                 675  67       73       9.4          11   9727
##                              Grad.Rate
## Abilene Christian University        60
## Adelphi University                  56
## Adrian College                      54
## Agnes Scott College                 59
## Alaska Pacific University           15
## Albertson College                   55

Now you should see that the first data column is Private. Note that another column labeled row.names now appears before the Private column. However, this is not a data column but rather the name that R is giving to each row.

1. Use the summary() function to produce a numerical summary of the variables in the data set.

summary(college)

##    Private               Apps           Accept          Enroll    
##  Length:777         Min.   :   81   Min.   :   72   Min.   :  35  
##  Class :character   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242  
##  Mode  :character   Median : 1558   Median : 1110   Median : 434  
##                     Mean   : 3002   Mean   : 2019   Mean   : 780  
##                     3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902  
##                     Max.   :48094   Max.   :26330   Max.   :6392  
##    Top10perc       Top25perc      F.Undergrad     P.Undergrad     
##  Min.   : 1.00   Min.   :  9.0   Min.   :  139   Min.   :    1.0  
##  1st Qu.:15.00   1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0  
##  Median :23.00   Median : 54.0   Median : 1707   Median :  353.0  
##  Mean   :27.56   Mean   : 55.8   Mean   : 3700   Mean   :  855.3  
##  3rd Qu.:35.00   3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0  
##  Max.   :96.00   Max.   :100.0   Max.   :31643   Max.   :21836.0  
##     Outstate       Room.Board       Books           Personal   
##  Min.   : 2340   Min.   :1780   Min.   :  96.0   Min.   : 250  
##  1st Qu.: 7320   1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850  
##  Median : 9990   Median :4200   Median : 500.0   Median :1200  
##  Mean   :10441   Mean   :4358   Mean   : 549.4   Mean   :1341  
##  3rd Qu.:12925   3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700  
##  Max.   :21700   Max.   :8124   Max.   :2340.0   Max.   :6800  
##       PhD            Terminal       S.F.Ratio      perc.alumni   
##  Min.   :  8.00   Min.   : 24.0   Min.   : 2.50   Min.   : 0.00  
##  1st Qu.: 62.00   1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00  
##  Median : 75.00   Median : 82.0   Median :13.60   Median :21.00  
##  Mean   : 72.66   Mean   : 79.7   Mean   :14.09   Mean   :22.74  
##  3rd Qu.: 85.00   3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00  
##  Max.   :103.00   Max.   :100.0   Max.   :39.80   Max.   :64.00  
##      Expend        Grad.Rate     
##  Min.   : 3186   Min.   : 10.00  
##  1st Qu.: 6751   1st Qu.: 53.00  
##  Median : 8377   Median : 65.00  
##  Mean   : 9660   Mean   : 65.46  
##  3rd Qu.:10830   3rd Qu.: 78.00  
##  Max.   :56233   Max.   :118.00

Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].

# change Private to a factor variable
college$Private <- as.factor(college$Private)

pairs(college[,1:10])

Use the plot() function to produce side-by-side boxplots of Outstate versus Private.

plot(college$Private, college$Outstate,
     xlab = "Private University", ylab = "Tuition in $")

Boxplots of Outstate versus Private: Private universities have more out of state students

Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.

Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)

Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.

summary(Elite)

##  No Yes 
## 699  78

plot(college$Elite, college$Outstate, 
     xlab = "Elite University", ylab = "Tuition in $")

Boxplots of Outstate versus Elite: Elite universities have more out of state students

Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow=c(2,2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.

par(mfrow=c(2,2))
hist(college$Apps, xlab = "Applications Received", main = "")
hist(college$perc.alumni, col=2, xlab = "% of alumni who donate", main = "")
hist(college$S.F.Ratio, col=3, breaks=10, xlab = "Student/faculty ratio", main = "")
hist(college$Expend, breaks=100, xlab = "Instructional expenditure per student", main = "")

Continue exploring the data, and provide a brief summary of what you discover.

# Some interesting observations:

# what is the university with the most students in the top 10% of class
row.names(college)[which.max(college$Top10perc)]

## [1] "Massachusetts Institute of Technology"

acceptance_rate <- college$Accept / college$Apps

# what university has the smallest acceptance rate
row.names(college)[which.min(acceptance_rate)]

## [1] "Princeton University"

# what university has the most liberal acceptance rate
row.names(college)[which.max(acceptance_rate)]

## [1] "Emporia State University"

# High tuition correlates to high graduation rate
plot(college$Outstate, college$Grad.Rate)

# Colleges with low acceptance rate tend to have low S:F ratio.
plot(college$Accept / college$Apps, college$S.F.Ratio)

# Colleges with the most students from top 10% perc don't necessarily have
# the highest graduation rate. Also, rate > 100 is erroneous!
plot(college$Top10perc, college$Grad.Rate)

Exercise 1.2

This exercise involves the Auto data set available as Auto.csv from the website for the main course textbook James et al. http://www-bcf.usc.edu/~gareth/ISL/data.html. Make sure that the missing values have been removed from the data.

Auto <- read.csv("https://www.statlearning.com/s/Auto.csv", 
                 header = TRUE, na.strings = "?")
Auto <- na.omit(Auto)
dim(Auto)

## [1] 392   9

summary(Auto)

##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##   acceleration        year           origin          name          
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   Length:392        
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   Class :character  
##  Median :15.50   Median :76.00   Median :1.000   Mode  :character  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577                     
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000                     
##  Max.   :24.80   Max.   :82.00   Max.   :3.000

Which of the predictors are quantitative, and which are qualitative?

Note: Sometimes when you load a dataset, a qualitative variable might have a numeric value. For instance, the origin variable is qualitative, but has integer values of 1, 2, 3. From mysterious sources (Googling), we know that this variable is coded 1 = usa; 2 = europe; 3 = japan. So we can covert it into a factor, using:

Auto$originf <- factor(Auto$origin, labels = c("usa", "europe", "japan"))
with(Auto, table(originf, origin))

##         origin
## originf    1   2   3
##   usa    245   0   0
##   europe   0  68   0
##   japan    0   0  79

Quantitative: mpg, cylinders, displacement, horsepower, weight, acceleration, year. Qualitative: name, origin, originf

What is the range of each quantitative predictor? You can answer this using the range() function.

#Pulling together qualitative predictors
qualitative_columns <- which(names(Auto) %in% c("name", "origin", "originf"))
qualitative_columns

## [1]  8  9 10

# Apply the range function to the columns of Auto data
# that are not qualitative
sapply(Auto[, -qualitative_columns], range)

##       mpg cylinders displacement horsepower weight acceleration year
## [1,]  9.0         3           68         46   1613          8.0   70
## [2,] 46.6         8          455        230   5140         24.8   82

What is the mean and standard deviation of each quantitative predictor?

sapply(Auto[, -qualitative_columns], mean)

##          mpg    cylinders displacement   horsepower       weight acceleration 
##    23.445918     5.471939   194.411990   104.469388  2977.584184    15.541327 
##         year 
##    75.979592

sapply(Auto[, -qualitative_columns], sd)

##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.805007     1.705783   104.644004    38.491160   849.402560     2.758864 
##         year 
##     3.683737

Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

sapply(Auto[-seq(10, 85), -qualitative_columns], mean)

##          mpg    cylinders displacement   horsepower       weight acceleration 
##    24.404430     5.373418   187.240506   100.721519  2935.971519    15.726899 
##         year 
##    77.145570

sapply(Auto[-seq(10, 85), -qualitative_columns], sd)

##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.867283     1.654179    99.678367    35.708853   811.300208     2.693721 
##         year 
##     3.106217

Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.

# Part (e):
pairs(Auto)

## Error in pairs.default(Auto): non-numeric argument to 'pairs'

pairs(Auto[, -qualitative_columns])

# Heavier weight correlates with lower mpg.
with(Auto, plot(mpg, weight))

# More cylinders, less mpg.
with(Auto, plot(mpg, cylinders))

# Cars become more efficient over time.
with(Auto, plot(mpg, year))

# Lets plot some mpg vs. some of our qualitative features: 
# sample just 20 observations
Auto.sample <- Auto[sample(1:nrow(Auto), 20), ]
# order them
Auto.sample <- Auto.sample[order(Auto.sample$mpg), ]
# plot them using a "dotchart"
with(Auto.sample, dotchart(mpg, name, xlab = "mpg"))

with(Auto, plot(originf, mpg), ylab = "mpg")

Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

pairs(Auto)

## Error in pairs.default(Auto): non-numeric argument to 'pairs'

See descriptions of plots in (e). All of the predictors show some correlation with mpg. The name predictor has too little observations per name though, so using this as a predictor is likely to result in overfitting the data and will not generalize well.

Excercise 1 - Introduction to Data Science

Ken Benoit and Slava Mikhaylov

Exercise 1.1

Exercise 1.2