From Data to Deliverable

Steph Samson

hello@stephsamson.com

ETL

Extract

Getting data from wherever with whatever format.

Transform

Transforming data to the structure you want it to be.

Load

Loading data into some database or warehouse for access later.

Today's Use Case: Restaurant Data

Requirements

  • Vegan, vegetarian, or vegetarian-friendly

  • Located in Poland

Tools

  • Pickle to serialize responses from FourSquare

Scraping vegemapa.pl

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


browser = webdriver.Firefox(
    executable_path='/usr/local/bin/geckodriver'
)

browser.get('http://www.vegemapa.pl/')

try:
    element = WebDriverWait(browser, 3).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'gm-style'))
    )
finally:
    listings = [item.text for item in
                browser.find_elements_by_css_selector('.panel-inner .item')]
    browser.close()
In [2]:
listings[:5]
Out[2]:
['Green Way - Vegemania\nPabianicka 245\nŁódź, Polska, 93-457\nWegetariańska\nWskaż trasę',
 'Bioway\nAl. Politechniki 1\nŁódź, Polska, 93-590\nWegetariańska\nWskaż trasę',
 'Papuvege\nSienkiewicza 15\nŁódź, Polska, 90-114\nWegańska\nWskaż trasę',
 'Green Way\nAleja Piłsudskiego 15\nŁódź, Polska, 0\nWegetariańska\nWskaż trasę',
 'PORCJA\nRoosevelta 7\nŁódź , Polska, 90-056\nWskaż trasę']

Reading CSV data

In [3]:
import pandas as pd

poland_cities = pd.read_csv('poland_cities.csv')
poland_cities.head(10)
Out[3]:
city lat lng country iso2 admin capital population population_proper
0 Katowice 50.258415 19.027545 Poland PL Śląskie admin 2746000.0 308724.0
1 Warsaw 52.250000 21.000000 Poland PL Mazowieckie primary 1707000.0 1702139.0
2 Łódź 51.750000 19.466667 Poland PL Łódzkie admin 758000.0 758000.0
3 Kraków 50.057531 19.980216 Poland PL Małopolskie admin 756000.0 755050.0
4 Gdańsk 54.360850 18.658291 Poland PL Pomorskie admin 740000.0 455830.0
5 Bytom 50.348017 18.932822 Poland PL Śląskie minor 662247.0 189186.0
6 Wrocław 51.100000 17.033333 Poland PL Dolnośląskie admin 634893.0 610049.0
7 Poznań 52.416667 16.966667 Poland PL Wielkopolskie admin 623997.0 570352.0
8 Gliwice 50.297612 18.676583 Poland PL Śląskie minor 507670.0 198835.0
9 Szczecin 53.439531 14.593891 Poland PL Zachodniopomorskie admin 407811.0 372672.0

Getting FourSquare Data

  • API with rate limit

  • Used Pickle to store responses that had a status code of 200

import os
import requests


URL = 'https://api.foursquare.com/v2'

auth = {
    'client_id': os.environ['FOURSQUARE_CLIENT_ID'],
    'client_secret': os.environ['FOURSQUARE_CLIENT_SECRET'],
    'v': '20181118'
}

venues = [
    requests.get(
        f'{ URL }/venues',
        params={**auth,
                **{'near': city,
                   # Veg* Restaurants
                   'categoryId': '4bf58dd8d48988d1d3941735'}}
    ).content['response']['venues']
    for city in cities
]

venue_ids = [venue['id'] for venue in venues]

foursquare_listings = [
    requests.get(f'{ URL }/venues/{ v_id }', params=auth)
    for v_id in venue_ids
]

Abridged FourSquare Response

{
  "meta": { ... },
  "response": {
    "venue": {
      "id": "412d2800f964a520df0c1fe3",
      "name": "Central Park",
      "contact": { ... },
      "location": {
        "address": "59th St to 110th St",
        "crossStreet": "5th Ave to Central Park West",
        "lat": 40.78408342593807,
        "lng": -73.96485328674316,
        "postalCode": "10028",
        "cc": "US",
        "city": "New York",
        "state": "NY",
        "country": "United States",
        "formattedAddress": [
          "59th St to 110th St (5th Ave to Central Park West)",
          "New York, NY 10028",
          "United States"
        ]
      },
      { ... },
      "rating": 9.8
    }
  }
}

Transforming Data

Normalize the different data sources into a target format. Cleaning data is often necessary before it can be transformed.

Target Format

A listing represented as a JSON object.

{
    'title': 'Listing',
    'type': 'object',
    'properties': {
        'name': {'type': 'str'},
        'street': {'type': 'str'},
        'city': {'type': 'str'},
        'province': {'type': 'str'},
        'zip': {'type': 'str'},
        'category': {'type': 'str'},
        'rating': {'type': 'number'},
        'pricing_tier': {'type': 'str'}
    }
}

Cleaning

Where much of the work is performed before transforming the data.

  • Drop duplicates
  • Strip whitespace

Discarding Unnecessary Data - Wskaż trasę

['Green Way - Vegemania\nPabianicka 245\nŁódź, Polska, 93-457\nWegetariańska\nWskaż trasę',
 'Bioway\nAl. Politechniki 1\nŁódź, Polska, 93-590\nWegetariańska\nWskaż trasę',
 'Papuvege\nSienkiewicza 15\nŁódź, Polska, 90-114\nWegańska\nWskaż trasę',
 'Green Way\nAleja Piłsudskiego 15\nŁódź, Polska, 0\nWegetariańska\nWskaż trasę',
 'PORCJA\nRoosevelta 7\nŁódź , Polska, 90-056\nWskaż trasę']

Handling Inconsistency - Warsaw or Warszawa

CSV file containing data on Polish cities had Warsaw while vegemapa.pl uses Warszawa.

Null Values

The FourSquare API Venue endpoints specification describes ratings to be from 0 through 10 and also specifies that not all venues will have a rating.

Those that do not have a rating will be defaulted to -1.

Combining Multiple Sources into a Single Data Frame

CSV file and scraped website data is merged on the city.

In [7]:
listings_df.head(5)
Out[7]:
name street city province category
0 Green Way - Vegemania Pabianicka 245 Łódź Łódzkie Wegetariańska
1 Bioway Al. Politechniki 1 Łódź Łódzkie Wegetariańska
2 Papuvege Sienkiewicza 15 Łódź Łódzkie Wegańska
3 Green Way Aleja Piłsudskiego 15 Łódź Łódzkie Wegetariańska
4 Jaffa - hummus & the other stories Piotrowska 138 Łódź Łódzkie Wegetariańska

This resulting data set is then merged with the FourSquare data on name, city, street, province.

In [9]:
listings_df = listings_df.merge(foursquare_listings_df, how='outer', on=['name', 'city', 'street', 'province'])
In [12]:
listings_df.sample(10)
Out[12]:
name street city province category rating pricing_tier
244 Malinowy Ogród Pegaza 2 Gdańsk Pomorskie Wegetariańska -1.0 Unknown
823 Warzywniak Plac Grunwaldzki Wrocław Województwo dolnośląskie No category 7.2 Moderate
110 Petit Wojewódzka 21 Katowice Śląskie Wegetariańska -1.0 Unknown
104 Bellmer Cafe Plac Wolności 9 Katowice Śląskie Wegetariańska -1.0 Unknown
716 Tel Aviv Food & Wine Poznańska 11 Warszawa Województwo mazowieckie No category 8.0 Moderate
682 VEGAN RAMEN SHOP Finlandzka 12a Warszawa Województwo mazowieckie No category 8.9 Unknown
416 Veganka Restobar Batorego 2 Katowice Województwo śląskie No category -1.0 Moderate
713 Chwast Food Waryńskiego 9 Warszawa Województwo mazowieckie No category 8.3 Moderate
191 Brocci Kraszewskiego 14 Poznań Wielkopolskie Wegetariańska -1.0 Unknown
106 REburger Pocztowa 10 Katowice Śląskie Wegańska -1.0 Unknown

Target Format: JSON

In [13]:
import json
json.loads(listings_df.to_json(orient='records'))[0]
Out[13]:
{'name': 'Green Way - Vegemania',
 'street': 'Pabianicka 245',
 'city': 'Łódź',
 'province': 'Łódzkie',
 'category': 'Wegetariańska',
 'rating': -1.0,
 'pricing_tier': 'Unknown'}

Loading the Data

ETL Toolkits

Thanks!

Questions?

hello@stephsamson.com