This question uses this table:
from the lecture notes. For each answer, please explain as fully as possible, and feel free to use tables or diagrams if you prefer.
The rules to satisfy 1st normal form are:
This table violates the 1NF because it contains data customer data in more than one column, in this case, Customer1
, Customer2
, and Customer3
.
We could solve this by moving the Customer information to its own table, and linking its primary key to a matching foreign key in the original table related to SalesStaff.
See this website for more details.
A table is in 2nd Normal Form if: - The table is in 1st normal form, and - All the non-key columns are dependent on the table’s primary key.
We know that the tables are not in 1NF, but assuming that we had fixed that, there would still issues, for example, the SalesStaff
table has two columns that do not depend on the EmployeeID: SalesOffice
and OfficeNumber
. We would better creating a separate table for Sales offices, and linking its primary key to the SalesStaff
table via a matching foreign key in the SalesStaff
table.
There are other potential issues that wherein, depending on how you achieved the 1NF step, there could still be table attributes that do not completely rely on that tables’s primary key. For more examples, see here.
Sometimes a database table that is not fully normalized still provides sufficient information for our needs, without the complexity of full normalization providing gains that outweigh the loss of simplicity. Third and higher form normalization are often ignored, for instance, for small-scale databases built for specific purposes. For large scale database scheme that need the ability to be extended easily, however, and for which data integrity is critical, full normalization is generally the best strategy.
countryID | countryName | EUmember | EUjoindate |
---|---|---|---|
00001 | France | true |
1958-01-01 |
00004 | Hungary | true |
2004-05-01 |
00003 | Serbia | false |
NULL |
00004 | Finland | true |
1995-01-01 |
00005 | Russia | false |
NULL |
00006 | Ireland, UK | true |
1973-01-01 |
Are there any other problems with the table, besides normalization?
Yes:
- 1NF is violated because because countryName
contains multiple country values in the last row. - 2NF is violated because we could have created a table of EU membership statuses (including soon, sadly, un-joining) that would link to the Country table via CountryID
. - 3NF is violates because the EUmember
can be determined by whether the EUjoindate
is not NULL
.
Write out these tables, and describe why this meets each of the normal forms. This is a database of movies watched on NetBricks, a streaming movie service.
Name | Address | Movies Rented | Salutation | Category |
---|---|---|---|---|
Bob Smith | 1 Houghton Street | Star Wars, Inception | Dr. | Scifi, Scifi |
Pry Ministair | 10 Downing St | Brexit the Movie | Lady | Tragedy |
Joe Bloggs | 8 Myhatt St. | Fast and Furious 6, Fast and Furious 7 | Mr. | Action, Action |
Create tables for:
UserID
, which would record the address, and salutationMovieID
CategoryID
linking to the Movies table# create user table
user_table <- read.csv(textConnection("userid, Name, Address, Salutation
1, Bob Smith, 1 Houghton Street, Dr.
2, Pry Ministair, 10 Downing St, Lady
3, Joe Bloggs, 8 Myhatt St., Mr."),
stringsAsFactors = FALSE)
# create movies table
movie_table <- data.frame(movieid = 1:5,
title = c("Star Wars", "Inception", "Brexit the Movie",
"Fast and Furious 6", "Fast and Furious 7"),
categoryid = c(1,1, 2, 3, 3),
stringsAsFactors = FALSE)
# create rental table
rental_table <- data.frame(userid = c(1, 1, 2, 3, 3),
movieid = 1:5)
# create category table
category_table <- data.frame(categoryid = 1:3,
genres = c("Scifi", "Tragedy", "Action"),
stringsAsFactors = FALSE)
Here are the four tables in fully normalized database
userid | Name | Address | Salutation |
---|---|---|---|
1 | Bob Smith | 1 Houghton Street | Dr. |
2 | Pry Ministair | 10 Downing St | Lady |
3 | Joe Bloggs | 8 Myhatt St. | Mr. |
userid | movieid |
---|---|
1 | 1 |
1 | 2 |
2 | 3 |
3 | 4 |
3 | 5 |
movieid | title | categoryid |
---|---|---|
1 | Star Wars | 1 |
2 | Inception | 1 |
3 | Brexit the Movie | 2 |
4 | Fast and Furious 6 | 3 |
5 | Fast and Furious 7 | 3 |
categoryid | genres |
---|---|
1 | Scifi |
2 | Tragedy |
3 | Action |
These tables fits to the 3 levels of normalization because
For this exercise, we will use the nycflights13 R package, whose tables have been output in .csv
form here. You may do the following in either R or Python. Note that this example is developed extensively in R for Data Science.
flights
data, that departed before 05:53 on 2013-02-28. How many rows and columns does this subtable have?flights <- read.csv("nycflights13/flights.csv")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
flights_sub <- flights %>%
filter(year == 2013 & month == 2 & day == 28 & dep_time < 553)
sprintf("the selected data.frame has %s rows and %s columns\n",
nrow(flights_sub), ncol(flights_sub)) %>%
cat()
## the selected data.frame has 10 rows and 20 columns
flights
data, to produce a result that includes:carrier
)airports
table)airports <- read.csv("nycflights13/airports.csv")
tmp <- merge(flights_sub, airports, by.x = "dest", by.y = "faa", all.x = TRUE)
answer2 <- select(tmp, c("dep_time", "carrier", "flight", "name"))
print(answer2)
## dep_time carrier flight name
## 1 551 MQ 4650 Hartsfield Jackson Atlanta Intl
## 2 540 B6 725 <NA>
## 3 457 US 1117 Charlotte Douglas Intl
## 4 552 AA 707 Dallas Fort Worth Intl
## 5 549 EV 4911 Detroit Metro Wayne Co
## 6 552 B6 371 Fort Lauderdale Hollywood Intl
## 7 458 UA 697 George Bush Intercontinental
## 8 522 UA 1714 George Bush Intercontinental
## 9 539 AA 1141 Miami Intl
## 10 551 AA 301 Chicago Ohare Intl
flights
data compute the average age of the planes it flew from the entire dataset. Age here will be defined as 2013 minus the year
variable from the planes
data. Hint: This involves a join operation on tailnum
, but also a grouped mean to compute the age (and subtracting 2013, which you can do before or after the computation of the mean).## your code
planes <- read.csv("nycflights13/planes.csv")
tmp <- flights %>%
filter(!duplicated(tailnum)) %>%
select(c("tailnum", "carrier") ) %>%
left_join(planes, by = "tailnum")
airlines <- read.csv("nycflights13/airlines.csv")
tmp %>% group_by(carrier) %>%
summarize(mean_age = mean(2013 - year, na.rm = TRUE)) %>%
left_join(airlines, by = "carrier") %>%
select(c("name", "mean_age"))
## # A tibble: 16 x 2
## name mean_age
## <chr> <dbl>
## 1 Endeavor Air Inc. 8.29
## 2 American Airlines Inc. 25.4
## 3 Alaska Airlines Inc. 5.16
## 4 JetBlue Airways 6.50
## 5 Delta Air Lines Inc. 17.8
## 6 ExpressJet Airlines Inc. 10.7
## 7 Frontier Airlines Inc. 5
## 8 AirTran Airways Corporation 10.8
## 9 Hawaiian Airlines Inc. 1.23
## 10 Envoy Air 35.5
## 11 SkyWest Airlines Inc. 7.14
## 12 United Air Lines Inc. 13.1
## 13 US Airways Inc. 11.0
## 14 Virgin America 4.29
## 15 Southwest Airlines Co. 11.0
## 16 Mesa Airlines Inc. 9.44
.csv
data found here. Name each table so that it matches the base filenames of the input data.library("RSQLite")
mydb <- dbConnect(SQLite(), "")
dbWriteTable(mydb, "flights", read.csv("nycflights13/flights.csv"))
dbWriteTable(mydb, "airlines", read.csv("nycflights13/airlines.csv"))
dbWriteTable(mydb, "airports", read.csv("nycflights13/airports.csv"))
dbWriteTable(mydb, "weather", read.csv("nycflights13/weather.csv"))
dbWriteTable(mydb, "planes", read.csv("nycflights13/planes.csv"))
# dep_datetime is stored as the epoch second
# calculate the epoch sec with as.integer(lubridate::ymd_hm())
answer3b <- dbGetQuery(mydb, "SELECT dep_time, carrier, flight, name FROM
flights AS fl
LEFT JOIN airports AS ap ON
fl.dest = ap.faa
WHERE fl.year = 2013 AND month = 2 AND day = 28 AND dep_time < 553")
print(answer3b)
## dep_time carrier flight name
## 1 457 US 1117 Charlotte Douglas Intl
## 2 458 UA 697 George Bush Intercontinental
## 3 522 UA 1714 George Bush Intercontinental
## 4 539 AA 1141 Miami Intl
## 5 540 B6 725 <NA>
## 6 549 EV 4911 Detroit Metro Wayne Co
## 7 551 AA 301 Chicago Ohare Intl
## 8 551 MQ 4650 Hartsfield Jackson Atlanta Intl
## 9 552 AA 707 Dallas Fort Worth Intl
## 10 552 B6 371 Fort Lauderdale Hollywood Intl
tmp <- dbGetQuery(mydb, "SELECT flights.carrier,
2013 - AVG(planes.year) FROM
(SELECT DISTINCT tailnum, carrier FROM flights) AS flights
LEFT JOIN planes ON flights.tailnum = planes.tailnum
GROUP BY flights.carrier")
tmp
## carrier 2013 - AVG(planes.year)
## 1 9E 8.287129
## 2 AA 25.402439
## 3 AS 5.156627
## 4 B6 6.497326
## 5 DL 17.671593
## 6 EV 10.558442
## 7 F9 5.000000
## 8 FL 10.794643
## 9 HA 1.230769
## 10 MQ 35.500000
## 11 OO 7.142857
## 12 UA 13.050934
## 13 US 10.996283
## 14 VX 4.288462
## 15 WN 11.005272
## 16 YV 9.438596