Lazy Evaluation Tutorial

Introduction

Since duckboat is based on the DuckDB relational API, all expressions are "lazy" in that they defer evaluation until a result or a data preview is requested. This allows us to build up complex data processing pipelines iteratively, but without needing to compute extraneous intermediate results. Instead, under the hood, DuckDB will gather the sequence of steps and pass it to a query optimizer, which will apply optimizations like predicate and projection pushdown. The full operation will be executed by DuckDB making full use of all the cores available on your machine, streaming the operations if possible, and even spilling to disk if the operation is too large to fit in memory.

import duckboat as uck

t0 = uck.Table('data/yellow_tripdata_2010-01.parquet')
t0.columns
['vendor_id',
'pickup_datetime',
'dropoff_datetime',
'passenger_count',
'trip_distance',
'pickup_longitude',
'pickup_latitude',
'rate_code',
'store_and_fwd_flag',
'dropoff_longitude',
'dropoff_latitude',
'payment_type',
'fare_amount',
'surcharge',
'mta_tax',
'tip_amount',
'tolls_amount',
'total_amount']
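The deferred evaluation described in the introduction can be modeled with a small toy class. This is a hypothetical `LazyTable` for illustration only, not duckboat's actual implementation: operations are recorded, and nothing runs until a result is requested.

```python
class LazyTable:
    """Toy model of a lazy pipeline: steps are recorded, not run."""

    def __init__(self, data, ops=()):
        self.data = data
        self.ops = ops

    def do(self, fn):
        # Return a new pipeline with one more deferred step; nothing executes yet.
        return LazyTable(self.data, self.ops + (fn,))

    def collect(self):
        # Only now do the recorded steps actually run, in order.
        out = self.data
        for fn in self.ops:
            out = fn(out)
        return out


pipeline = (
    LazyTable(range(10))
    .do(lambda rows: [r for r in rows if r % 2 == 0])  # like a 'where' clause
    .do(lambda rows: [r * r for r in rows])            # like a 'select' clause
)
# Building the pipeline was instantaneous; .collect() triggers evaluation.
print(pipeline.collect())  # -> [0, 4, 16, 36, 64]
```

A real engine like DuckDB additionally optimizes the recorded steps before running them; this sketch just replays them in order.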
count_rows = "select format('{:.2e}', 1.0*count(*)) as num_rows"
t0.do(count_rows)
┌──────────┐
│ num_rows │
│ varchar │
├──────────┤
│ 1.49e+07 │
└──────────┘
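The `format('{:.2e}', ...)` call uses DuckDB's fmt-style `format` function to render the count in scientific notation. The same rendering in Python, using an illustrative row count (the file's exact count isn't shown above):

```python
n = 14_863_778  # illustrative row count, not the file's actual figure
print(f"{1.0 * n:.2e}")  # -> 1.49e+07
```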
The following code is executed instantaneously, since no query operations are performed.
t1 = t0.do(
    'where (pickup_longitude != 0) and (pickup_latitude != 0)',
    'where total_amount > 0',
    'select *, h3_latlng_to_cell(pickup_latitude, pickup_longitude, 12) as hexid',
    'select * replace ( h3_h3_to_string(hexid) as hexid )',
    'select cast(pickup_datetime as timestamp) as ts, hexid, total_amount as amt',
)
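The `h3_latlng_to_cell` and `h3_h3_to_string` functions are not DuckDB built-ins; they come from the community `h3` extension. Assuming your environment can fetch community extensions, enabling it looks like:

```sql
INSTALL h3 FROM community;
LOAD h3;
```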
Lazy operations
Listing the table as the last expression in a Jupyter notebook makes Jupyter try to represent the table, which triggers DuckDB to either compute the full table or, when the table has many rows, compute just enough rows to show a preview. In many instances, the preview is faster to compute.
The following is still fast, but just a bit slower than the previous cell, since this is where the query associated with the operations above is actually performed.
t1
┌─────────────────────┬─────────────────┬───────────────────┐
│ ts │ hexid │ amt │
│ timestamp │ varchar │ double │
├─────────────────────┼─────────────────┼───────────────────┤
│ 2010-01-26 07:41:00 │ 8c2a100d45b01ff │ 5.0 │
│ 2010-01-30 23:31:00 │ 8c2a107258e61ff │ 16.3 │
│ 2010-01-18 20:22:20 │ 8c2a1008b82b5ff │ 12.7 │
│ 2010-01-09 01:18:00 │ 8c2a100d65653ff │ 14.3 │
│ 2010-01-18 19:10:14 │ 8c2a100d22945ff │ 6.67 │
│ 2010-01-17 09:18:00 │ 8c2a10725ac5bff │ 6.6 │
│ 2010-01-09 13:49:00 │ 8c2a100d620b7ff │ 7.4 │
│ 2010-01-09 00:25:00 │ 8c2a1072c86abff │ 12.3 │
│ 2010-01-27 18:15:00 │ 8c2a100d2bb69ff │ 12.0 │
│ 2010-01-08 16:05:00 │ 8c2a107250403ff │ 10.2 │
│ · │ · │ · │
│ · │ · │ · │
│ · │ · │ · │
│ 2010-01-05 16:50:04 │ 8c2a100d2ac47ff │ 6.0 │
│ 2010-01-03 01:30:26 │ 8c2a100d63201ff │ 8.3 │
│ 2010-01-26 18:10:00 │ 8c2a100d60839ff │ 13.8 │
│ 2010-01-26 20:23:00 │ 8c2a1072c9533ff │ 15.5 │
│ 2010-01-07 19:25:00 │ 8c2a100d654c7ff │ 6.8 │
│ 2010-01-03 19:00:53 │ 8c2a1008b368dff │ 8.199999999999998 │
│ 2010-01-28 14:10:00 │ 8c2a100d666e9ff │ 6.4 │
│ 2010-01-23 15:52:00 │ 8c2a1008bad2bff │ 30.79 │
│ 2010-01-27 15:32:00 │ 8c2a100d2280bff │ 55.07 │
│ 2010-01-01 02:58:00 │ 8c2a100d2aaa9ff │ 7.9 │
├─────────────────────┴─────────────────┴───────────────────┤
│ ? rows (>9999 rows, 20 shown) 3 columns │
└───────────────────────────────────────────────────────────┘
t2 = t1.alias('tbl1').do("""
    select
        a.hexid
        , a.ts as ts1
        , b.ts as ts2
        , a.amt as amt1
        , b.amt as amt2
    from
        tbl1 as a
    inner join
        tbl1 as b
    using
        (hexid)
""").hide()
Even though the computation for t2 is complex, we can compute a preview fairly quickly. The following runs in about 2 seconds on my laptop.
t2.show()
┌─────────────────┬─────────────────────┬─────────────────────┬────────┬───────────────────┐
│ hexid │ ts1 │ ts2 │ amt1 │ amt2 │
│ varchar │ timestamp │ timestamp │ double │ double │
├─────────────────┼─────────────────────┼─────────────────────┼────────┼───────────────────┤
│ 8c2a100d45b01ff │ 2010-01-26 07:41:00 │ 2010-01-27 20:32:00 │ 5.0 │ 48.5 │
│ 8c2a107258e61ff │ 2010-01-30 23:31:00 │ 2010-01-06 16:19:00 │ 16.3 │ 7.4 │
│ 8c2a1008b82b5ff │ 2010-01-18 20:22:20 │ 2010-01-13 14:38:14 │ 12.7 │ 8.199999999999998 │
│ 8c2a100d65653ff │ 2010-01-09 01:18:00 │ 2010-01-14 09:26:00 │ 14.3 │ 5.4 │
│ 8c2a100d22945ff │ 2010-01-18 19:10:14 │ 2010-01-28 18:43:52 │ 6.67 │ 7.0 │
│ 8c2a10725ac5bff │ 2010-01-17 09:18:00 │ 2010-01-18 21:15:00 │ 6.6 │ 7.9 │
│ 8c2a100d620b7ff │ 2010-01-09 13:49:00 │ 2010-01-16 00:08:00 │ 7.4 │ 5.5 │
│ 8c2a1072c86abff │ 2010-01-09 00:25:00 │ 2010-01-01 08:23:30 │ 12.3 │ 12.6 │
│ 8c2a100d2bb69ff │ 2010-01-27 18:15:00 │ 2010-01-20 15:07:00 │ 12.0 │ 13.0 │
│ 8c2a107250403ff │ 2010-01-08 16:05:00 │ 2010-01-25 15:10:15 │ 10.2 │ 7.92 │
│ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │
│ 8c2a100d279d9ff │ 2010-01-14 04:14:00 │ 2010-01-27 06:02:00 │ 8.3 │ 8.0 │
│ 8c2a100d29809ff │ 2010-01-25 21:58:00 │ 2010-01-16 14:14:00 │ 13.9 │ 8.199999999999998 │
│ 8c2a10725b28dff │ 2010-01-17 23:46:00 │ 2010-01-11 00:23:00 │ 5.1 │ 8.599999999999998 │
│ 8c2a100d2c85bff │ 2010-01-02 09:52:00 │ 2010-01-31 15:11:33 │ 5.2 │ 14.95 │
│ 8c2a100d654abff │ 2010-01-27 22:29:00 │ 2010-01-28 22:00:46 │ 7.5 │ 8.3 │
│ 8c2a100d3531dff │ 2010-01-21 12:55:00 │ 2010-01-31 15:47:00 │ 11.4 │ 12.9 │
│ 8c2a10089649dff │ 2010-01-20 15:11:00 │ 2010-01-07 13:02:00 │ 10.1 │ 8.0 │
│ 8c2a100d34965ff │ 2010-01-15 20:19:00 │ 2010-01-01 23:24:01 │ 7.5 │ 4.7 │
│ 8c2a10774924bff │ 2010-01-16 20:11:00 │ 2010-01-30 03:14:04 │ 13.1 │ 11.9 │
│ 8c2a100f34e99ff │ 2010-01-21 15:09:00 │ 2010-01-14 14:36:00 │ 16.2 │ 6.2 │
├─────────────────┴─────────────────────┴─────────────────────┴────────┴───────────────────┤
│ ? rows (>9999 rows, 20 shown) 5 columns │
└──────────────────────────────────────────────────────────────────────────────────────────┘
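The self-join pattern used to build t2 (pairing up rows that share a hexid) can be sketched on a tiny table with Python's stdlib sqlite3; the table name and sample values here are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table trips (hexid text, ts text, amt real)")
con.executemany(
    "insert into trips values (?, ?, ?)",
    [
        ("a", "2010-01-01 00:00:00", 5.0),
        ("a", "2010-01-01 00:00:30", 7.5),
        ("b", "2010-01-02 00:00:00", 6.0),
    ],
)
# Join the table to itself on hexid, keeping ordered pairs of distinct trips.
rows = con.execute("""
    select hexid, a.amt as amt1, b.amt as amt2
    from trips as a
    inner join trips as b using (hexid)
    where a.ts < b.ts
""").fetchall()
print(rows)  # -> [('a', 5.0, 7.5)]
```

Note that a self-join on a low-cardinality key can blow up quadratically, which is exactly why t2 has roughly 10 billion rows below.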
Avoiding Expensive Intermediates
However, running count_rows on t2 forces the full join operation to be performed (previously, we only computed a partial join to display the preview). The following takes about 50 seconds on my laptop.
Note that the row count for this intermediate table is about 10 billion rows. We deal with the table directly here for demonstration purposes, but as we continue the pipeline below, we will avoid ever forming this intermediate table.
# renders slowly because you have to do the full join
t2.do(count_rows)
100% ▕████████████████████████████████████████████████████████████▏
┌──────────┐
│ num_rows │
│ varchar │
├──────────┤
│ 1.05e+10 │
└──────────┘
Again, it is instantaneous to form the expression representing t3, as long as we don't need to compute the expression just yet.
Note that the timestamp filtering below could have also been given above as part of the join. We’re free to do it either way and the performance will be identical because DuckDB will push the filters down in its query planning/optimization step.
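A toy model of that pushdown in plain Python (not DuckDB's actual planner): a single-sided predicate applied after the join keeps exactly the rows you get by filtering before the join, and the pushed-down version scans fewer rows.

```python
left = [("a", 1), ("a", 5), ("b", 2)]
right = [("a", 3), ("b", 9)]

# Filter applied after forming the join result.
post = [(h, x, y) for (h, x) in left for (g, y) in right if h == g and y > 4]

# The same predicate pushed below the join, so the join sees fewer rows.
right_small = [(g, y) for (g, y) in right if y > 4]
pushed = [(h, x, y) for (h, x) in left for (g, y) in right_small if h == g]

print(post == pushed)  # -> True
```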
t3 = t2.do(
    'where ts1 < ts2',
    'where ts2 < ts1 + interval 1 minute',
    'select hexid, max(abs(amt1-amt2)) as diff group by 1',
    'where diff > 0',
    'order by diff',
    'hide',
)
Note that this is faster than the t2.do(count_rows) above, even though it does more work! This cell runs in about 44 seconds on my laptop.
Materialize smaller results
This final result has about 29 thousand rows, something much more reasonable to materialize directly as a Pandas dataframe, for instance.
t3.do(count_rows)
100% ▕████████████████████████████████████████████████████████████▏
┌──────────┐
│ num_rows │
│ varchar │
├──────────┤
│ 2.86e+04 │
└──────────┘
We can convert to a Pandas dataframe in about 53 seconds.
t3.df()
100% ▕████████████████████████████████████████████████████████████▏
                 hexid    diff
0      8c2a100d2d12bff    0.01
1      8c2a100d2a94dff    0.02
2      8c2a100891611ff    0.02
3      8c2a1072c846bff    0.02
4      8c2a100d67a93ff    0.02
...                ...     ...
28563  8c2a10aa2cb13ff  175.88
28564  8c2a100f52815ff  180.45
28565  8c2a108f664e7ff  203.00
28566  8c2a100d676d7ff  212.37
28567  8c2a10d76aa37ff  213.57

28568 rows × 2 columns
# TODO: get a polars dataframe