Concepts
Loading, saving, and transferring tables
import duckboat as uck
# read from a remote CSV file
= uck.Table('https://raw.githubusercontent.com/mcnakhaee/palmerpenguins/refs/heads/master/palmerpenguins/data/penguins.csv')
tbl = uck.Table('data/penguins.csv') # read from a local CSV file
tbl
= tbl.df() # write to Pandas DataFrame
df # read from Pandas DataFrame
uck.Table(df)
'data/penguins.parquet') # use Pandas to write to a Parquet file
df.to_parquet(= uck.Table('data/penguins.parquet') # read from a local Parquet file
tbl
= tbl.arrow() # write to an Arrow Table
tbl_arrow = uck.Table(tbl_arrow) # read from an Arrow Table tbl
Databases and Tables
The library revolves around two objects and a function: Database
, Table
.
Database
is essentially a dictionary mapping names to objects that will get resolved to tables. Those objects might be a Pandas or Polars dataframe, a PyArrow Table, the local filename of a Parquet or CSV file, or a URL to a remote data file. These objects are used byduckboat
and Duckdb lazily, so operations on them are deffered until a final result is requested.Table
is a wrapper around a DuckDB Relation.
Laziness
Since all operations are done through DuckDBPyRelation
, the expressions are evaluated lazily, or only at the end when we want a table or want to display results to the console.
If you would like to materialze a Table
or a Database
concretely in terms of Pandas DataFrames or PyArrow Tables, you can use x.hold(kind='arrow')
or x.do('arrow')
.
When you materialize a Database
you can access the underlying tables with db[table_name]
.
Examples
Chaining
duckboat
uses DuckDB to build up Relation expressions through chaining, which DuckDB will then execute after running the entire expression through a query planner to optimize execution.
import duckboat as uck
# uck.Table('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2010-01.parquet')
'yellow_tripdata_2010-01.parquet').do(
uck.Table('select *, pickup_latitude as lat, pickup_longitude as lng',
'select *, h3_latlng_to_cell(lat, lng, 8) as hexid',
'select hexid, avg(tip_amount) as tip group by 1',
'select h3_h3_to_string(hexid) as hexid, tip',
'where tip between 10 and 20',
'order by hexid',
)
Gives the output:
┌─────────────────┬────────────────────┐
│ hexid │ tip │
│ varchar │ double │
├─────────────────┼────────────────────┤
│ 881bb2a0b5fffff │ 12.22 │
│ 882a100299fffff │ 10.02818181818182 │
│ 882a10029dfffff │ 11.666666666666666 │
│ 882a1002c3fffff │ 10.0 │
│ 882a10034bfffff │ 16.0 │
│ 882a100353fffff │ 17.6 │
│ 882a10045bfffff │ 10.0 │
│ 882a100487fffff │ 11.525 │
│ 882a10060dfffff │ 16.1 │
│ 882a100611fffff │ 10.0 │
│ · │ · │
│ · │ · │
│ · │ · │
│ 882a13c4d9fffff │ 11.5 │
│ 882a13d281fffff │ 20.0 │
│ 882a13d529fffff │ 15.0 │
│ 882a1438c3fffff │ 10.0 │
│ 882a15663bfffff │ 10.0 │
│ 882a1ab9c1fffff │ 10.01 │
│ 882a353927fffff │ 11.05 │
│ 882aa16327fffff │ 12.51 │
│ 882aaab9ebfffff │ 10.01 │
│ 882ad09327fffff │ 12.51 │
├─────────────────┴────────────────────┤
│ 193 rows (20 shown) 2 columns │
└──────────────────────────────────────┘
Alternatives
You can also write the above as
'data/yellow_tripdata_2010-01.parquet').do(
uck.Table('select *, pickup_latitude as lat, pickup_longitude as lng',
'select *, h3_latlng_to_cell(lat, lng, 8) as hexid',
'select hexid, avg(tip_amount) as tip group by 1',
'select h3_h3_to_string(hexid) as hexid, tip',
'where tip between 10 and 20',
'order by hexid',
)
or
'data/yellow_tripdata_2010-01.parquet').do(
uck.Table('select *, pickup_latitude as lat, pickup_longitude as lng',
).do('select *, h3_latlng_to_cell(lat, lng, 8) as hexid',
).do('select hexid, avg(tip_amount) as tip group by 1',
).do('select h3_h3_to_string(hexid) as hexid, tip',
).do('where tip between 10 and 20',
).do('order by hexid',
)
Pivot
TODO
Storing procedures
You might store a sequence of steps as a function like
def foo(rel, res=6):
return (rel
| 'select pickup_latitude as lat, pickup_longitude as lng, tip_amount'
| f'select h3_latlng_to_cell(lat, lng, {res}) as hexid, tip_amount as tip'
| 'select hexid, avg(tip) as tip group by 1'
| 'select h3_h3_to_string(hexid) as hexid, tip'
| 'where tip > 0'
)
which you could apply with any of the following syntax:
table.do(foo)
table | foo
ortable >> foo
Alternatively, you could store this as a sequence of strings:
= [
foo_list 'select pickup_latitude as lat, pickup_longitude as lng, tip_amount'
'select h3_latlng_to_cell(lat, lng, 6}) as hexid, tip_amount as tip'
'select hexid, avg(tip) as tip group by 1'
'select h3_h3_to_string(hexid) as hexid, tip'
'where tip > 0'
]
which you could apply with something like
*foo_list) table.do(
or
table.do(foo_list)
or even
| foo_list table