Foursquare POI Data Example

Foursquare has recently released an interesting Point of Interest (POI) Dataset, which is what we will play with in this example.

Fused.io has made that data readily available, through Source Cooperative which is how we will access it.

import duckboat as uck

Load the data

We can download the data with duckboat and save it to a file for quick iteration.

# NOTE: You can comment out the following two lines once
# you have the data saved locally to avoid downloading it again
t = uck.Table('s3://us-west-2.opendata.source.coop/fused/fsq-os-places/2024-12-03/places/10.parquet').hide()
t.save('data/fsq_places_10.parquet')

100% ▕████████████████████████████████████████████████████████████▏

With the data saved to a file, we can create a duckboat table from a local Parquet file.

t = uck.Table('data/fsq_places_10.parquet')

Data processing

As noted elsewhere in the docs, we can store data operations in multiple ways. For example, as Python functions or as a list of SQL snippet strings. Here are some examples.

# Collect an operation as a sequence of SQL snippets,
# and parameterize the target H3 resolution.
def latlng_h3(res):
    return [
        f'select h3_latlng_to_cell(latitude, longitude, {res}) as hexid',
        'select h3_h3_to_string(hexid) as hexid',
    ]

# A data processing procedure stored as a list of SQL snippet strings
count_and_order = [
    'select hexid, count(*) as num group by 1',
    'order by num',
]

Parameterized evaluation

With our core data transformation logic defined and parameterized, we can run the same transformation over the data at different H3 resolutions.

t.do(latlng_h3(0), count_and_order)

┌─────────────────┬─────────┐
│      hexid      │   num   │
│     varchar     │  int64  │
├─────────────────┼─────────┤
│ 8027fffffffffff │ 1106440 │
└─────────────────┴─────────┘

t.do(latlng_h3(1), count_and_order)

┌─────────────────┬────────┐
│      hexid      │  num   │
│     varchar     │ int64  │
├─────────────────┼────────┤
│ 8128bffffffffff │    410 │
│ 8126fffffffffff │   2835 │
│ 81277ffffffffff │  25008 │
│ 81267ffffffffff │  30173 │
│ 8127bffffffffff │  31192 │
│ 8126bffffffffff │ 313007 │
│ 81263ffffffffff │ 703815 │
└─────────────────┴────────┘

t.do(latlng_h3(2), count_and_order)

┌─────────────────┬────────┐
│      hexid      │  num   │
│     varchar     │ int64  │
├─────────────────┼────────┤
│ 8226f7fffffffff │   1051 │
│ 822897fffffffff │   1574 │
│ 8226e7fffffffff │   2092 │
│ 822697fffffffff │   3206 │
│ 822657fffffffff │   3656 │
│ 82278ffffffffff │   5745 │
│ 822637fffffffff │   6113 │
│ 8226b7fffffffff │   8115 │
│ 8226a7fffffffff │  14692 │
│ 822617fffffffff │  15879 │
│ 82279ffffffffff │  22247 │
│ 82261ffffffffff │  26908 │
│ 822677fffffffff │  32458 │
│ 8226affffffffff │  48212 │
│ 822607fffffffff │  63129 │
│ 822627fffffffff │  79629 │
│ 822687fffffffff │ 107644 │
│ 82268ffffffffff │ 132766 │
│ 82260ffffffffff │ 168373 │
│ 822757fffffffff │ 174499 │
│ 82262ffffffffff │ 188452 │
├─────────────────┴────────┤
│ 21 rows        2 columns │
└──────────────────────────┘

Loading full dataset

It is also possible to load the full dataset with DuckDB glob pattern matching. Just be careful calling this multiple times, because you may get throttled.

t = uck.Table('s3://us-west-2.opendata.source.coop/fused/fsq-os-places/2024-12-03/places/*.parquet').hide()

It also seems like DuckDB is able to pull only the data needed for the computation. The following row count query is executed much faster than it would be if we needed to download the full dataset.

t.do('select count(*)')

100% ▕████████████████████████████████████████████████████████████▏

┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│    104199378 │
└──────────────┘