Lazy Evaluation Tutorial

Introduction

Since duckboat is based on the DuckDB relational API, all expressions are "lazy" in that they defer evaluation until a result or a data preview is requested. This allows us to build up complex data processing pipelines iteratively, but without needing to compute extraneous intermediate results. Instead, under the hood, DuckDB will gather the sequence of steps and pass it to a query optimizer, which will apply optimizations like predicate and projection pushdown. The full operation will be executed by DuckDB making full use of all the cores available on your machine, streaming the operations if possible, and even spilling to disk if the operation is too large to fit in memory.

import duckboat as uck

t0 = uck.Table('data/yellow_tripdata_2010-01.parquet')
t0.columns
['vendor_id',
'pickup_datetime',
'dropoff_datetime',
'passenger_count',
'trip_distance',
'pickup_longitude',
'pickup_latitude',
'rate_code',
'store_and_fwd_flag',
'dropoff_longitude',
'dropoff_latitude',
'payment_type',
'fare_amount',
'surcharge',
'mta_tax',
'tip_amount',
'tolls_amount',
'total_amount']
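The deferred evaluation described in the introduction can be modeled with a small toy class. This is a hypothetical `LazyTable` for illustration only, not duckboat's actual implementation: operations are recorded, and nothing runs until a result is requested.

```python
class LazyTable:
    """Toy model of a lazy pipeline: steps are recorded, not run."""

    def __init__(self, data, ops=()):
        self.data = data
        self.ops = ops

    def do(self, fn):
        # Return a new pipeline with one more deferred step; nothing executes yet.
        return LazyTable(self.data, self.ops + (fn,))

    def collect(self):
        # Only now do the recorded steps actually run, in order.
        out = self.data
        for fn in self.ops:
            out = fn(out)
        return out


pipeline = (
    LazyTable(range(10))
    .do(lambda rows: [r for r in rows if r % 2 == 0])  # like a 'where' clause
    .do(lambda rows: [r * r for r in rows])            # like a 'select' clause
)
# Building the pipeline was instantaneous; .collect() triggers evaluation.
print(pipeline.collect())  # -> [0, 4, 16, 36, 64]
```

A real engine like DuckDB additionally optimizes the recorded steps before running them; this sketch just replays them in order.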
count_rows = "select format('{:.2e}', 1.0*count(*)) as num_rows"
t0.do(count_rows)
┌──────────┐
│ num_rows │
│ varchar │
├──────────┤
│ 1.49e+07 │
└──────────┘
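The `format('{:.2e}', ...)` call uses DuckDB's fmt-style `format` function to render the count in scientific notation. The same rendering in Python, using an illustrative row count (the file's exact count isn't shown above):

```python
n = 14_863_778  # illustrative row count, not the file's actual figure
print(f"{1.0 * n:.2e}")  # -> 1.49e+07
```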
The following code is executed instantaneously, since no query operations are performed.
t1 = t0.do(
    'where (pickup_longitude != 0) and (pickup_latitude != 0)',
    'where total_amount > 0',
    'select *, h3_latlng_to_cell(pickup_latitude, pickup_longitude, 12) as hexid',
    'select * replace ( h3_h3_to_string(hexid) as hexid )',
    'select cast(pickup_datetime as timestamp) as ts, hexid, total_amount as amt',
)
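The `h3_latlng_to_cell` and `h3_h3_to_string` functions are not DuckDB built-ins; they come from the community `h3` extension. Assuming your environment can fetch community extensions, enabling it looks like:

```sql
INSTALL h3 FROM community;
LOAD h3;
```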
Lazy operations
Listing the table as the last expression in a Jupyter notebook makes Jupyter try to represent the table, which triggers DuckDB to either compute the full table or, when the table has many rows, compute just enough rows to show a preview. In many instances, the preview is faster to compute.
The following is still fast, but just a bit slower than the previous cell, since this is where the query associated with the operations above is actually performed.
t1
┌─────────────────────┬─────────────────┬───────────────────┐
│ ts │ hexid │ amt │
│ timestamp │ varchar │ double │
├─────────────────────┼─────────────────┼───────────────────┤
│ 2010-01-26 07:41:00 │ 8c2a100d45b01ff │ 5.0 │
│ 2010-01-30 23:31:00 │ 8c2a107258e61ff │ 16.3 │
│ 2010-01-18 20:22:20 │ 8c2a1008b82b5ff │ 12.7 │
│ 2010-01-09 01:18:00 │ 8c2a100d65653ff │ 14.3 │
│ 2010-01-18 19:10:14 │ 8c2a100d22945ff │ 6.67 │
│ 2010-01-17 09:18:00 │ 8c2a10725ac5bff │ 6.6 │
│ 2010-01-09 13:49:00 │ 8c2a100d620b7ff │ 7.4 │
│ 2010-01-09 00:25:00 │ 8c2a1072c86abff │ 12.3 │
│ 2010-01-27 18:15:00 │ 8c2a100d2bb69ff │ 12.0 │
│ 2010-01-08 16:05:00 │ 8c2a107250403ff │ 10.2 │
│ · │ · │ · │
│ · │ · │ · │
│ · │ · │ · │
│ 2010-01-05 16:50:04 │ 8c2a100d2ac47ff │ 6.0 │
│ 2010-01-03 01:30:26 │ 8c2a100d63201ff │ 8.3 │
│ 2010-01-26 18:10:00 │ 8c2a100d60839ff │ 13.8 │
│ 2010-01-26 20:23:00 │ 8c2a1072c9533ff │ 15.5 │
│ 2010-01-07 19:25:00 │ 8c2a100d654c7ff │ 6.8 │
│ 2010-01-03 19:00:53 │ 8c2a1008b368dff │ 8.199999999999998 │
│ 2010-01-28 14:10:00 │ 8c2a100d666e9ff │ 6.4 │
│ 2010-01-23 15:52:00 │ 8c2a1008bad2bff │ 30.79 │
│ 2010-01-27 15:32:00 │ 8c2a100d2280bff │ 55.07 │
│ 2010-01-01 02:58:00 │ 8c2a100d2aaa9ff │ 7.9 │
├─────────────────────┴─────────────────┴───────────────────┤
│ ? rows (>9999 rows, 20 shown) 3 columns │
└───────────────────────────────────────────────────────────┘
t2 = t1.alias('tbl1').do("""
    select
        a.hexid
        , a.ts as ts1
        , b.ts as ts2
        , a.amt as amt1
        , b.amt as amt2
    from
        tbl1 as a
    inner join
        tbl1 as b
    using
        (hexid)
""").hide()
Even though the computation for t2 is complex, we can compute a preview fairly quickly. The following runs in about 2 seconds on my laptop.
t2.show()
┌─────────────────┬─────────────────────┬─────────────────────┬────────┬───────────────────┐
│ hexid │ ts1 │ ts2 │ amt1 │ amt2 │
│ varchar │ timestamp │ timestamp │ double │ double │
├─────────────────┼─────────────────────┼─────────────────────┼────────┼───────────────────┤
│ 8c2a100d45b01ff │ 2010-01-26 07:41:00 │ 2010-01-27 20:32:00 │ 5.0 │ 48.5 │
│ 8c2a107258e61ff │ 2010-01-30 23:31:00 │ 2010-01-06 16:19:00 │ 16.3 │ 7.4 │
│ 8c2a1008b82b5ff │ 2010-01-18 20:22:20 │ 2010-01-13 14:38:14 │ 12.7 │ 8.199999999999998 │
│ 8c2a100d65653ff │ 2010-01-09 01:18:00 │ 2010-01-14 09:26:00 │ 14.3 │ 5.4 │
│ 8c2a100d22945ff │ 2010-01-18 19:10:14 │ 2010-01-28 18:43:52 │ 6.67 │ 7.0 │
│ 8c2a10725ac5bff │ 2010-01-17 09:18:00 │ 2010-01-18 21:15:00 │ 6.6 │ 7.9 │
│ 8c2a100d620b7ff │ 2010-01-09 13:49:00 │ 2010-01-16 00:08:00 │ 7.4 │ 5.5 │
│ 8c2a1072c86abff │ 2010-01-09 00:25:00 │ 2010-01-01 08:23:30 │ 12.3 │ 12.6 │
│ 8c2a100d2bb69ff │ 2010-01-27 18:15:00 │ 2010-01-20 15:07:00 │ 12.0 │ 13.0 │
│ 8c2a107250403ff │ 2010-01-08 16:05:00 │ 2010-01-25 15:10:15 │ 10.2 │ 7.92 │
│ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │
│ 8c2a100d279d9ff │ 2010-01-14 04:14:00 │ 2010-01-27 06:02:00 │ 8.3 │ 8.0 │
│ 8c2a100d29809ff │ 2010-01-25 21:58:00 │ 2010-01-16 14:14:00 │ 13.9 │ 8.199999999999998 │
│ 8c2a10725b28dff │ 2010-01-17 23:46:00 │ 2010-01-11 00:23:00 │ 5.1 │ 8.599999999999998 │
│ 8c2a100d2c85bff │ 2010-01-02 09:52:00 │ 2010-01-31 15:11:33 │ 5.2 │ 14.95 │
│ 8c2a100d654abff │ 2010-01-27 22:29:00 │ 2010-01-28 22:00:46 │ 7.5 │ 8.3 │
│ 8c2a100d3531dff │ 2010-01-21 12:55:00 │ 2010-01-31 15:47:00 │ 11.4 │ 12.9 │
│ 8c2a10089649dff │ 2010-01-20 15:11:00 │ 2010-01-07 13:02:00 │ 10.1 │ 8.0 │
│ 8c2a100d34965ff │ 2010-01-15 20:19:00 │ 2010-01-01 23:24:01 │ 7.5 │ 4.7 │
│ 8c2a10774924bff │ 2010-01-16 20:11:00 │ 2010-01-30 03:14:04 │ 13.1 │ 11.9 │
│ 8c2a100f34e99ff │ 2010-01-21 15:09:00 │ 2010-01-14 14:36:00 │ 16.2 │ 6.2 │
├─────────────────┴─────────────────────┴─────────────────────┴────────┴───────────────────┤
│ ? rows (>9999 rows, 20 shown) 5 columns │
└──────────────────────────────────────────────────────────────────────────────────────────┘
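The self-join pattern used to build t2 (pairing up rows that share a hexid) can be sketched on a tiny table with Python's stdlib sqlite3; the table name and sample values here are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table trips (hexid text, ts text, amt real)")
con.executemany(
    "insert into trips values (?, ?, ?)",
    [
        ("a", "2010-01-01 00:00:00", 5.0),
        ("a", "2010-01-01 00:00:30", 7.5),
        ("b", "2010-01-02 00:00:00", 6.0),
    ],
)
# Join the table to itself on hexid, keeping ordered pairs of distinct trips.
rows = con.execute("""
    select hexid, a.amt as amt1, b.amt as amt2
    from trips as a
    inner join trips as b using (hexid)
    where a.ts < b.ts
""").fetchall()
print(rows)  # -> [('a', 5.0, 7.5)]
```

Note that a self-join on a low-cardinality key can blow up quadratically, which is exactly why t2 has roughly 10 billion rows below.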
Avoiding Expensive Intermediates
However, running count_rows on t2 forces the full join operation to be performed (previously, we only computed a partial join to display the preview). The following takes about 50 seconds on my laptop.
Note that the row count for this intermediate table is about 10 billion rows. We deal with the table directly here for demonstration purposes, but as we continue the pipeline below, we will avoid ever forming this intermediate table.
# renders slowly because you have to do the full join
t2.do(count_rows)
100% ▕████████████████████████████████████████████████████████████▏
┌──────────┐
│ num_rows │
│ varchar │
├──────────┤
│ 1.05e+10 │
└──────────┘
Again, it is instantaneous to form the expression representing t3, as long as we don't need to compute the expression just yet.
Note that the timestamp filtering below could have also been given above as part of the join. We’re free to do it either way and the performance will be identical because DuckDB will push the filters down in its query planning/optimization step.
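A toy model of that pushdown in plain Python (not DuckDB's actual planner): a single-sided predicate applied after the join keeps exactly the rows you get by filtering before the join, and the pushed-down version scans fewer rows.

```python
left = [("a", 1), ("a", 5), ("b", 2)]
right = [("a", 3), ("b", 9)]

# Filter applied after forming the join result.
post = [(h, x, y) for (h, x) in left for (g, y) in right if h == g and y > 4]

# The same predicate pushed below the join, so the join sees fewer rows.
right_small = [(g, y) for (g, y) in right if y > 4]
pushed = [(h, x, y) for (h, x) in left for (g, y) in right_small if h == g]

print(post == pushed)  # -> True
```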
t3 = t2.do(
    'where ts1 < ts2',
    'where ts2 < ts1 + interval 1 minute',
    'select hexid, max(abs(amt1-amt2)) as diff group by 1',
    'where diff > 0',
    'order by diff',
    'hide',
)
Note that this is faster than the t2.do(count_rows) above, even though it does more work! This cell runs in about 44 seconds on my laptop.
Materialize smaller results
This final result has about 29 thousand rows, something much more reasonable to materialize directly as a Pandas dataframe, for instance.
t3.do(count_rows)
100% ▕████████████████████████████████████████████████████████████▏
┌──────────┐
│ num_rows │
│ varchar │
├──────────┤
│ 2.86e+04 │
└──────────┘
We can convert to a Pandas dataframe in about 53 seconds.
t3.df()
100% ▕████████████████████████████████████████████████████████████▏
                 hexid    diff
0      8c2a100d2d12bff    0.01
1      8c2a100d2a94dff    0.02
2      8c2a100891611ff    0.02
3      8c2a1072c846bff    0.02
4      8c2a100d67a93ff    0.02
...                ...     ...
28563  8c2a10aa2cb13ff  175.88
28564  8c2a100f52815ff  180.45
28565  8c2a108f664e7ff  203.00
28566  8c2a100d676d7ff  212.37
28567  8c2a10d76aa37ff  213.57

28568 rows × 2 columns
# TODO: get a polars dataframe