Concepts

Loading, saving, and transferring tables

import duckboat as uck

# read from a remote CSV file
tbl = uck.Table('https://raw.githubusercontent.com/mcnakhaee/palmerpenguins/refs/heads/master/palmerpenguins/data/penguins.csv')
tbl = uck.Table('data/penguins.csv')  # read from a local CSV file

df = tbl.df()  # write to Pandas DataFrame
uck.Table(df)  # read from Pandas DataFrame

df.to_parquet('data/penguins.parquet')  # use Pandas to write to a Parquet file
tbl = uck.Table('data/penguins.parquet')  # read from a local Parquet file

tbl_arrow = tbl.arrow()  # write to an  Arrow Table
tbl = uck.Table(tbl_arrow)  # read from an Arrow Table

Databases and Tables

The library revolves around two objects and a function: Database, Table.

  • Database is essentially a dictionary mapping names to objects that will get resolved to tables. Those objects might be a Pandas or Polars dataframe, a PyArrow Table, the local filename of a Parquet or CSV file, or a URL to a remote data file. These objects are used by duckboat and Duckdb lazily, so operations on them are deffered until a final result is requested.
  • Table is a wrapper around a DuckDB Relation.

Laziness

Since all operations are done through DuckDBPyRelation, the expressions are evaluated lazily, or only at the end when we want a table or want to display results to the console.

If you would like to materialze a Table or a Database concretely in terms of Pandas DataFrames or PyArrow Tables, you can use x.hold(kind='arrow') or x.do('arrow').

When you materialize a Database you can access the underlying tables with db[table_name].

Examples

Chaining

duckboat uses DuckDB to build up Relation expressions through chaining, which DuckDB will then execute after running the entire expression through a query planner to optimize execution.

import duckboat as uck

# uck.Table('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2010-01.parquet')
uck.Table('yellow_tripdata_2010-01.parquet').do(
    'select *, pickup_latitude as lat, pickup_longitude as lng',
    'select *, h3_latlng_to_cell(lat, lng, 8) as hexid',
    'select hexid, avg(tip_amount) as tip  group by 1',
    'select h3_h3_to_string(hexid) as hexid, tip',
    'where tip between 10 and 20',
    'order by hexid',
)

Gives the output:

┌─────────────────┬────────────────────┐
│      hexid      │        tip         │
│     varchar     │       double       │
├─────────────────┼────────────────────┤
│ 881bb2a0b5fffff │              12.22 │
│ 882a100299fffff │  10.02818181818182 │
│ 882a10029dfffff │ 11.666666666666666 │
│ 882a1002c3fffff │               10.0 │
│ 882a10034bfffff │               16.0 │
│ 882a100353fffff │               17.6 │
│ 882a10045bfffff │               10.0 │
│ 882a100487fffff │             11.525 │
│ 882a10060dfffff │               16.1 │
│ 882a100611fffff │               10.0 │
│        ·        │                 ·  │
│        ·        │                 ·  │
│        ·        │                 ·  │
│ 882a13c4d9fffff │               11.5 │
│ 882a13d281fffff │               20.0 │
│ 882a13d529fffff │               15.0 │
│ 882a1438c3fffff │               10.0 │
│ 882a15663bfffff │               10.0 │
│ 882a1ab9c1fffff │              10.01 │
│ 882a353927fffff │              11.05 │
│ 882aa16327fffff │              12.51 │
│ 882aaab9ebfffff │              10.01 │
│ 882ad09327fffff │              12.51 │
├─────────────────┴────────────────────┤
│ 193 rows (20 shown)        2 columns │
└──────────────────────────────────────┘

Alternatives

You can also write the above as

uck.Table('data/yellow_tripdata_2010-01.parquet').do(
    'select *, pickup_latitude as lat, pickup_longitude as lng',
    'select *, h3_latlng_to_cell(lat, lng, 8) as hexid',
    'select hexid, avg(tip_amount) as tip  group by 1',
    'select h3_h3_to_string(hexid) as hexid, tip',
    'where tip between 10 and 20',
    'order by hexid',
)

or

uck.Table('data/yellow_tripdata_2010-01.parquet').do(
    'select *, pickup_latitude as lat, pickup_longitude as lng',
).do(
    'select *, h3_latlng_to_cell(lat, lng, 8) as hexid',
).do(
    'select hexid, avg(tip_amount) as tip  group by 1',
).do(
    'select h3_h3_to_string(hexid) as hexid, tip',
).do(
   'where tip between 10 and 20',
).do(
    'order by hexid',
)

Pivot

TODO

Storing procedures

You might store a sequence of steps as a function like

def foo(rel, res=6):
    return (rel
    | 'select pickup_latitude as lat, pickup_longitude as lng, tip_amount'       
    | f'select h3_latlng_to_cell(lat, lng, {res}) as hexid, tip_amount as tip'
    | 'select hexid, avg(tip) as tip group by 1'
    | 'select h3_h3_to_string(hexid) as hexid, tip'
    | 'where tip > 0'
    )

which you could apply with any of the following syntax:

  • table.do(foo)
  • table | foo or table >> foo

Alternatively, you could store this as a sequence of strings:

foo_list = [
    'select pickup_latitude as lat, pickup_longitude as lng, tip_amount'       
    'select h3_latlng_to_cell(lat, lng, 6}) as hexid, tip_amount as tip'
    'select hexid, avg(tip) as tip group by 1'
    'select h3_h3_to_string(hexid) as hexid, tip'
    'where tip > 0'
]

which you could apply with something like

table.do(*foo_list)

or

table.do(foo_list)

or even

table | foo_list