DirectScan, fallback modes, and EXPLAIN

Suggest edits

Postgres Lakehouse is fastest when it can "push down" your entire query to DataFusion, the vectorized query used for handling queries when possible. (In the future, this will be more fine-grained as we add support for partial pushdowns.)

Postgres Lakehouse can execute your query in two modes. First, it attempts to run the entire query using Seafowl (a dedicated columnar database based on DataFusion). If Seafowl can't run the entire query, for example, because it uses PostgreSQL-specific operations like JSON, then Postgres Lakehouse will fall back to using the PostgreSQL executor, with Seafowl streaming full table contents to it.

If your query is extremely slow, it's possible that's what's happening.

You can check which mode is being used by running an EXPLAIN on the query and making sure that the top-most query node is DirectScan. For example:

explain select count from (select count(*) from tpch_sf_1.lineitem);
                                                                                                                               QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=167.52..167.55 rows=1 width=8)
   ->  Append  (cost=0.00..165.01 rows=1001 width=0)
         ->  CompatScan on "16529" lineitem_2  (cost=100.00..150.00 rows=1000 width=0)
               SeafowlPlan: logical_plan
                 TableScan:  tpch_sf_1.lineitem projection=[l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate, l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment]
(6 rows)

In this case, the query is executed by PostgreSQL and Seafowl is only involved when scanning the table (see CompatScan at the bottom). The fix in this case is to explicitly name the inner COUNT(*) column, since Seafowl gives it an implicit name count(*) whereas PostgreSQL calls it count:

edb_admin=> explain select count from (select count(*) as count from tpch_sf_1.lineitem);
                             QUERY PLAN
--------------------------------------------------------------------
 DirectScan: logical_plan
   Projection:  COUNT(*) AS count
     Aggregate:  groupBy=[[]], aggr=[[COUNT(UInt8(1)) AS COUNT(*)]]
       TableScan:  tpch_sf_1.lineitem projection=[]
(4 rows)

Here, we can see the DirectScan at the top, which means that Seafowl is running the entire query.

If you're having trouble rewording your query to make it run fully on Seafowl, open a support ticket.


Could this page be better? Report a problem or suggest an addition!