ME314 Managing and Visualizing Data

Day 3 Assignment, LSE ME314 2019


1. Normalizing data

This question uses this table: Not normalized data

from the lecture notes. For each answer, please explain as fully as possible, and feel free to use tables or diagrams if you prefer.

  1. Why does this table violate the first normal form, and what would be required to make it 1NF?

The rules to satisfy 1st normal form are:

  • That the data is in a database table.
  • The table stores information in rows and columns where one or more columns, called the primary key, uniquely identify each row.
  • Each column contains atomic values, and there are not repeating groups of columns.

This table violates the 1NF because it contains data customer data in more than one column, in this case, Customer1, Customer2, and Customer3.

We could solve this by moving the Customer information to its own table, and linking its primary key to a matching foreign key in the original table related to SalesStaff.

See this website for more details.

  1. What additional steps would be needed to make the table 2NF, and why?

A table is in 2nd Normal Form if: - The table is in 1st normal form, and - All the non-key columns are dependent on the table’s primary key.

We know that the tables are not in 1NF, but assuming that we had fixed that, there would still issues, for example, the SalesStaff table has two columns that do not depend on the EmployeeID: SalesOffice and OfficeNumber. We would better creating a separate table for Sales offices, and linking its primary key to the SalesStaff table via a matching foreign key in the SalesStaff table.

There are other potential issues that wherein, depending on how you achieved the 1NF step, there could still be table attributes that do not completely rely on that tables’s primary key. For more examples, see here.

  1. Why might we not want to normalize data to the fullest extent possible?

Sometimes a database table that is not fully normalized still provides sufficient information for our needs, without the complexity of full normalization providing gains that outweigh the loss of simplicity. Third and higher form normalization are often ignored, for instance, for small-scale databases built for specific purposes. For large scale database scheme that need the ability to be extended easily, however, and for which data integrity is critical, full normalization is generally the best strategy.

  1. In the table below, which of the three normalization rules does this violate, if any, and why?
countryID countryName EUmember EUjoindate
00001 France true 1958-01-01
00004 Hungary true 2004-05-01
00003 Serbia false NULL
00004 Finland true 1995-01-01
00005 Russia false NULL
00006 Ireland, UK true 1973-01-01

Are there any other problems with the table, besides normalization?

Yes:
- 1NF is violated because because countryName contains multiple country values in the last row. - 2NF is violated because we could have created a table of EU membership statuses (including soon, sadly, un-joining) that would link to the Country table via CountryID. - 3NF is violates because the EUmember can be determined by whether the EUjoindate is not NULL.

  1. What would it take to full (1NF-3NF) normalize this dataset?

Write out these tables, and describe why this meets each of the normal forms. This is a database of movies watched on NetBricks, a streaming movie service.

Name Address Movies Rented Salutation Category
Bob Smith 1 Houghton Street Star Wars, Inception Dr. Scifi, Scifi
Pry Ministair 10 Downing St Brexit the Movie Lady Tragedy
Joe Bloggs 8 Myhatt St. Fast and Furious 6, Fast and Furious 7 Mr. Action, Action

Create tables for:

  • User, with a UserID, which would record the address, and salutation
  • Movies, with a MovieID
  • Categories, with a CategoryID linking to the Movies table
  • Rental Table, linking to Movies and User tables, with added fields for the date rented
# create user table
user_table <- read.csv(textConnection("userid, Name, Address, Salutation
     1,     Bob Smith, 1 Houghton Street,        Dr.
     2, Pry Ministair,     10 Downing St,       Lady
     3,    Joe Bloggs,      8 Myhatt St.,        Mr."),
     stringsAsFactors = FALSE)

# create movies table
movie_table <- data.frame(movieid = 1:5,
                          title = c("Star Wars", "Inception", "Brexit the Movie", 
                                    "Fast and Furious 6", "Fast and Furious 7"),
                          categoryid = c(1,1, 2, 3, 3),
                          stringsAsFactors = FALSE)

# create rental table
rental_table <- data.frame(userid = c(1, 1, 2, 3, 3),
                           movieid = 1:5)

# create category table
category_table <- data.frame(categoryid = 1:3,
                             genres = c("Scifi", "Tragedy", "Action"),
                             stringsAsFactors = FALSE)

Here are the four tables in fully normalized database

User table

User Table
userid Name Address Salutation
1 Bob Smith 1 Houghton Street Dr.
2 Pry Ministair 10 Downing St Lady
3 Joe Bloggs 8 Myhatt St. Mr.

Rental table

Rental Table
userid movieid
1 1
1 2
2 3
3 4
3 5

Movie table

Movie Table
movieid title categoryid
1 Star Wars 1
2 Inception 1
3 Brexit the Movie 2
4 Fast and Furious 6 3
5 Fast and Furious 7 3

Category table

Category Table
categoryid genres
1 Scifi
2 Tragedy
3 Action

These tables fits to the 3 levels of normalization because

  • For achieving 1NF, the problem was the multiple rental records in each row. By separating this as the rental table and making each row correspond with a single rental record, this issue is resolved.
  • For achieving 2NF, the partial dependency of records on the primary keys would be an issue. This could be an issue if we have both movieid and movie information in rental record because the movie information can be uniquely identified by the movie id. By separating this information into a separate table (i.e. Movie Table), this issue was resolved.
  • For achieving 3NF, the tables should not have transitive dependency. Since the movie categories does not have any information other than the titles of categories, this cannot be an issue here unless we have a secondary key of categoryid in addition to the category name, but I created a new table categoryid just in case.

2. Reshaping data

For this exercise, we will use the nycflights13 R package, whose tables have been output in .csv form here. You may do the following in either R or Python. Note that this example is developed extensively in R for Data Science.

  1. Create a subtable of the flights data, that departed before 05:53 on 2013-02-28. How many rows and columns does this subtable have?
flights <- read.csv("nycflights13/flights.csv")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
flights_sub <- flights %>% 
    filter(year == 2013 & month == 2 & day == 28 & dep_time < 553)
sprintf("the selected data.frame has %s rows and %s columns\n", 
        nrow(flights_sub), ncol(flights_sub)) %>% 
    cat()
## the selected data.frame has 10 rows and 20 columns
  1. Merge or join the subtable from a. flights data, to produce a result that includes:
  • Departure time
  • Carrier (two digit code, from carrier)
  • Flight number
  • Destination airport name (hint: you will need to get this from the airports table)
airports <- read.csv("nycflights13/airports.csv")
tmp <- merge(flights_sub, airports, by.x = "dest", by.y = "faa", all.x = TRUE)
answer2 <- select(tmp, c("dep_time", "carrier", "flight", "name"))
print(answer2)
##    dep_time carrier flight                            name
## 1       551      MQ   4650 Hartsfield Jackson Atlanta Intl
## 2       540      B6    725                            <NA>
## 3       457      US   1117          Charlotte Douglas Intl
## 4       552      AA    707          Dallas Fort Worth Intl
## 5       549      EV   4911          Detroit Metro Wayne Co
## 6       552      B6    371  Fort Lauderdale Hollywood Intl
## 7       458      UA    697    George Bush Intercontinental
## 8       522      UA   1714    George Bush Intercontinental
## 9       539      AA   1141                      Miami Intl
## 10      551      AA    301              Chicago Ohare Intl
  1. For every airline that had flights in the flights data compute the average age of the planes it flew from the entire dataset. Age here will be defined as 2013 minus the year variable from the planes data. Hint: This involves a join operation on tailnum, but also a grouped mean to compute the age (and subtracting 2013, which you can do before or after the computation of the mean).
## your code
planes <- read.csv("nycflights13/planes.csv")

tmp <- flights %>% 
  filter(!duplicated(tailnum)) %>% 
  select(c("tailnum", "carrier") ) %>% 
  left_join(planes, by = "tailnum")
## Warning: Column `tailnum` joining factors with different levels, coercing
## to character vector
airlines <- read.csv("nycflights13/airlines.csv")

tmp %>% group_by(carrier) %>% 
  summarize(mean_age = mean(2013 - year, na.rm = TRUE)) %>%
  left_join(airlines, by = "carrier") %>% 
  select(c("name", "mean_age"))
## # A tibble: 16 x 2
##    name                        mean_age
##    <fct>                          <dbl>
##  1 Endeavor Air Inc.               8.29
##  2 American Airlines Inc.         25.4 
##  3 Alaska Airlines Inc.            5.16
##  4 JetBlue Airways                 6.50
##  5 Delta Air Lines Inc.           17.8 
##  6 ExpressJet Airlines Inc.       10.7 
##  7 Frontier Airlines Inc.          5   
##  8 AirTran Airways Corporation    10.8 
##  9 Hawaiian Airlines Inc.          1.23
## 10 Envoy Air                      35.5 
## 11 SkyWest Airlines Inc.           7.14
## 12 United Air Lines Inc.          13.1 
## 13 US Airways Inc.                11.0 
## 14 Virgin America                  4.29
## 15 Southwest Airlines Co.         11.0 
## 16 Mesa Airlines Inc.              9.44

3. Working with SQL

  1. Create a relational dataset in SQLite using the .csv data found here. Name each table so that it matches the base filenames of the input data.
library("RSQLite")
mydb <- dbConnect(SQLite(), "")

dbWriteTable(mydb, "flights", read.csv("nycflights13/flights.csv"))
dbWriteTable(mydb, "airlines", read.csv("nycflights13/airlines.csv"))
dbWriteTable(mydb, "airports", read.csv("nycflights13/airports.csv"))
dbWriteTable(mydb, "weather", read.csv("nycflights13/weather.csv"))
dbWriteTable(mydb, "planes", read.csv("nycflights13/planes.csv"))
  1. Replicate 2b above using an SQL query, including both the command and the output.
# dep_datetime is stored as the epoch second
# calculate the epoch sec with as.integer(lubridate::ymd_hm())
answer3b <- dbGetQuery(mydb, "SELECT dep_time, carrier, flight, name FROM 
            flights AS fl 
            LEFT JOIN airports AS ap ON
            fl.dest = ap.faa
            WHERE fl.year = 2013 AND month = 2 AND day = 28 AND dep_time < 553")
print(answer3b)
##    dep_time carrier flight                            name
## 1       457      US   1117          Charlotte Douglas Intl
## 2       458      UA    697    George Bush Intercontinental
## 3       522      UA   1714    George Bush Intercontinental
## 4       539      AA   1141                      Miami Intl
## 5       540      B6    725                            <NA>
## 6       549      EV   4911          Detroit Metro Wayne Co
## 7       551      AA    301              Chicago Ohare Intl
## 8       551      MQ   4650 Hartsfield Jackson Atlanta Intl
## 9       552      AA    707          Dallas Fort Worth Intl
## 10      552      B6    371  Fort Lauderdale Hollywood Intl
  1. Replicate 2c above using an SQL query, including both the command and the output.
tmp <- dbGetQuery(mydb, "SELECT flights.carrier,
                  2013 - AVG(planes.year) FROM 
                  (SELECT DISTINCT tailnum, carrier FROM flights) AS flights 
                  LEFT JOIN planes ON flights.tailnum = planes.tailnum
                  GROUP BY flights.carrier")
tmp
##    carrier 2013 - AVG(planes.year)
## 1       9E                8.287129
## 2       AA               25.402439
## 3       AS                5.156627
## 4       B6                6.497326
## 5       DL               17.671593
## 6       EV               10.558442
## 7       F9                5.000000
## 8       FL               10.794643
## 9       HA                1.230769
## 10      MQ               35.500000
## 11      OO                7.142857
## 12      UA               13.050934
## 13      US               10.996283
## 14      VX                4.288462
## 15      WN               11.005272
## 16      YV                9.438596