4/18/2018 sp18/hw2_solution.ipynb at master · DS-100/sp18 · GitHub






https://github.com/DS-100/sp18/blob/master/hw/hw2/solution/hw2_solution.ipynb 1/19





Homework 2: Food Safety
Course Policies
Here are some important course policies. These are also located at http://www.ds100.org/sp18/.

Collaboration Policy

Data science is a collaborative activity. While you may talk with others about the homework, we ask that you write your solutions individually. If
you do discuss the assignments with others, please include their names at the top of your solution.


Due Date
This assignment is due at 11:59pm Tuesday, February 6th. Instructions for submission are on the website.



Homework 2: Food Safety
Cleaning and Exploring Data with Pandas
<img src="scoreCard.jpg" width=400>

In this homework, you will investigate food safety scores for restaurants in San Francisco. Above is a sample score card for a restaurant.
The scores and violation information have been made available by the San Francisco Department of Public Health, and we have made these data
available to you via the DS 100 repository. The main goal of this assignment is to understand how restaurants are scored. We will walk through the
various steps of exploratory data analysis to do this, and to give you a sense of how we think about each discovery we make and what next steps it
leads to, we will provide comments and insights along the way.

As we clean and explore these data, you will gain practice with:

Reading simple CSV files
Working with data at different levels of granularity
Identifying the type of data collected, missing values, anomalies, etc.
Exploring characteristics and distributions of individual variables
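
To give a flavor of what these skills look like in practice, here is a minimal, hypothetical sketch of such checks. The column names and values below are invented for illustration and are not taken from the SF dataset:

```python
import pandas as pd
import numpy as np

# Hypothetical toy data -- not the real SF inspection records.
df = pd.DataFrame({
    "business_id": [10, 10, 24, 31],
    "score": [94, 88, np.nan, 72],
    "date": ["20160503", "20171201", "20160816", "20170228"],
})

# Identify the type of data collected in each column.
print(df.dtypes)

# Count missing values per column.
print(df.isnull().sum())

# Granularity: here one row per inspection, possibly several per business.
print(df["business_id"].value_counts())

# Summarize the distribution of a single variable.
print(df["score"].describe())
```

The same handful of calls (`dtypes`, `isnull().sum()`, `value_counts()`, `describe()`) will reappear throughout the assignment as the first questions we ask of any new table.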



Question 0
To start the assignment, run the cell below to set up some imports and the automatic tests that we will need for this assignment:

In many of these assignments (and your future adventures as a data scientist) you will use os, zipfile, pandas, numpy, matplotlib.pyplot, and
seaborn.

1. Import each of these libraries as their commonly used abbreviations (e.g., pd, np, plt, and sns).
2. Don't forget to use the Jupyter notebook "magic" to enable inline matplotlib plots
(http://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-matplotlib).
3. Add the line sns.set() to make your plots look nicer.

In [1]: import os
import zipfile
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

In [2]: import sys

assert 'zipfile' in sys.modules
assert 'pandas' in sys.modules and pd
assert 'numpy' in sys.modules and np
assert 'matplotlib' in sys.modules and plt
assert 'seaborn' in sys.modules and sns



Downloading the data
As you saw in lectures, we can download data from the internet with Python.
Using the utils.py file from the lectures (http://www.ds100.org/sp18/assets/lectures/lec05/utils.py), define a helper function
fetch_and_cache to download the data with the following arguments:

data_url: the web address to download
file: the file in which to save the results
data_dir: (default="data") the location to save the data
force: if true the file is always re-downloaded

This function should return a pathlib.Path object representing the file.

In [3]: import requests
from pathlib import Path

def fetch_and_cache(data_url, file, data_dir="data", force=False):
    """
    Download and cache a url and return the file object.

    data_url: the web address to download
    file: the file in which to save the results.
    data_dir: (default="data") the location to save the data
    force: if true the file is always re-downloaded

    return: The pathlib.Path object representing the file.
    """
    ### BEGIN SOLUTION
    data_dir = Path(data_dir)
    data_dir.mkdir(exist_ok=True)
    file_path = data_dir / Path(file)
    # If the file already exists and we want to force a download then
    # delete the file first so that the creation date is correct.
    if force and file_path.exists():
        file_path.unlink()
    if force or not file_path.exists():
        print('Downloading...', end=' ')
        resp = requests.get(data_url)
        with file_path.open('wb') as f:
            f.write(resp.content)
        print('Done!')
    else:
        import time
        last_modified_time = time.ctime(file_path.stat().st_mtime)
        print("Using cached version last modified (UTC):", last_modified_time)
    return file_path
    ### END SOLUTION


Now use the previously defined function to download the data from the following URL:
http://www.ds100.org/sp18/assets/datasets/hw2-SFBusinesses.zip

In [4]: data_url = 'http://www.ds100.org/sp18/assets/datasets/hw2-SFBusinesses.zip'
file_name = 'data.zip'
data_dir = '.'


dest_path = fetch_and_cache(data_url=data_url, data_dir=data_dir, file=file_name)
print('Saved at {}'.format(dest_path))

Using cached version last modified (UTC): Wed Feb 7 17:46:26 2018
Saved at data.zip



Loading Food Safety Data
To begin our investigation, we need to understand the structure of the data. Recall that this involves answering questions such as:

Is the data in a standard format or encoding?
Is the data organized in records?
What are the fields in each record?

There are 4 files in the downloaded archive. Let's use Python to understand how these data are laid out.

Use the zipfile library to list all the files stored in the dest_path archive.

Creating a ZipFile object might be a good start (the Python docs (https://docs.python.org/3/library/zipfile.html) have further details).

In [5]: # Fill in the list_files variable with a list of all the names of the files in the zip file
my_zip = ...
list_names = ...

### BEGIN SOLUTION
my_zip = zipfile.ZipFile(dest_path, 'r')
list_names = [f.filename for f in my_zip.filelist]
print(list_names)
### END SOLUTION

['violations.csv', 'businesses.csv', 'inspections.csv', 'legend.csv']

In [6]: assert isinstance(my_zip, zipfile.ZipFile)
assert isinstance(list_names, list)
assert all([isinstance(file, str) for file in list_names])
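
Once the archive's contents are known, a natural next step is to read a member CSV directly out of the zip without extracting it first, by passing the file-like handle from ZipFile.open to pandas. The snippet below is a self-contained sketch of that pattern using a small in-memory zip; the member name and columns are invented stand-ins, but the same calls apply to my_zip and the real member names listed above:

```python
import io
import zipfile
import pandas as pd

# Build a tiny in-memory zip standing in for the downloaded archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(
        "legend.csv",
        "Minimum_Score,Maximum_Score,Description\n0,70,Poor\n71,100,Good\n",
    )

demo_zip = zipfile.ZipFile(buf, "r")

# Read a member CSV straight from the zip via a file-like handle.
with demo_zip.open("legend.csv") as f:
    legend = pd.read_csv(f)

print(legend)
```

Reading from the handle avoids littering the working directory with extracted copies, though zipfile.ZipFile.extractall is a fine alternative when the files will be reused.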


