Penca — The version-controlled lakebase

The problem

Two stitched systems, no version control.

Production writes to one database. Analytics and ML workloads read from another. CDC stitches them together. You pay twice for storage and get paged when the pipeline breaks.

Test a migration, tune a query, train a model. Each one requires a tedious data copy and environment setup, which limits you to a handful of simultaneous experiments.

When something breaks, you can't see who changed what. Debugging takes hours, and rolling back means manual surgery. Audit and restore end up as custom projects.

                                                                                                      one storage bill
  Postgres / Spanner / MySQL  ────  CDC ~~~ replication lag ~~~  ────►  Warehouse / Lakehouse  ────►  consistent reads
                                                                                                      simple ops

Why now

Agents make it worse.

The OLTP/OLAP split has been painful for a decade. Weekly refreshes worked when experiments took quarters. Now AI agents iterate by the minute. They branch aggressively, read data that was written ten seconds ago, and write data with no human in the loop to catch the mistake. Same problems, now at machine speed.

The solution

A version-controlled lakebase.

A lakebase is a transactional database that sits directly on an open lakehouse. Production and analytics run on the same open files. One copy, one storage bill, no CDC. Penca is the version-controlled lakebase. Start fresh and never worry about scaling, or point it at the lakehouse you already run.

Unified object storage

Penca saves data as open columnar files on object storage and registers them with an Iceberg REST catalog. There’s no CDC pipeline and no second database to pay for. Bring your own bucket and catalog, or let us host them.

Branch like Git, on real data

Fork live production into isolated branches in minutes, with zero data copy and zero shared compute. Run experiments in parallel without setup or a queue. Throw the branch away, or promote it.
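
To make "zero data copy" concrete, here is a minimal copy-on-write sketch in Python. It is an illustration of the general idea only, not Penca's implementation or API: the `Branch` class, its methods, and the last-writer-wins promote rule are all assumptions made for this example.

```python
class Branch:
    """Hypothetical copy-on-write branch (illustrative, not Penca's API)."""

    def __init__(self, name, layers=()):
        self.name = name
        self._layers = layers          # sealed layers, shared by reference
        self._base_len = len(layers)   # layers inherited at fork time
        self._changes = {}             # writes private to this branch

    def fork(self, name):
        # Seal pending writes into a shared layer. No rows are copied:
        # the child receives references to the same sealed layers.
        if self._changes:
            self._layers = self._layers + (self._changes,)
            self._changes = {}
        return Branch(name, self._layers)

    def put(self, key, row):
        self._changes[key] = row

    def get(self, key):
        # Newest data wins: private writes first, then sealed layers.
        if key in self._changes:
            return self._changes[key]
        for layer in reversed(self._layers):
            if key in layer:
                return layer[key]
        return None

    def promote_into(self, target):
        # Assumed merge rule for this sketch: last writer wins.
        own_layers = self._layers[self._base_len:] + (self._changes,)
        for layer in own_layers:
            target._changes.update(layer)


main = Branch("main")
main.put("order-1", {"price": 10})

experiment = main.fork("experiment")      # instant, no rows copied
experiment.put("order-1", {"price": 12})  # visible only on the branch
main.put("order-2", {"price": 5})         # invisible to the branch

before_promote = main.get("order-1")
experiment.promote_into(main)
after_promote = main.get("order-1")
```

Throwing a branch away in this sketch means dropping the object; the shared layers stay untouched. A real system would also need deletes, conflict detection, and durable storage, all of which are left out here.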

Row-level versioning

Every mutation is appended to an immutable log with its author and timestamp. Every consistent state of the database is auditable and recoverable at any point in time. It’s Git for data.
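
The idea behind an append-only log can be sketched in a few lines of Python. This is a conceptual illustration under stated assumptions, not Penca's storage format or API: `Mutation`, `MutationLog`, `state_at`, and `history` are hypothetical names, and the sample authors and rows are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class Mutation:
    """One immutable log entry (hypothetical shape)."""
    key: str
    value: Optional[dict]  # None records a delete
    author: str
    at: datetime


class MutationLog:
    """Append-only log: entries are never updated or removed."""

    def __init__(self):
        self._entries = []

    def append(self, key, value, author, at):
        if self._entries and at < self._entries[-1].at:
            raise ValueError("mutations must be appended in timestamp order")
        self._entries.append(Mutation(key, value, author, at))

    def state_at(self, when):
        # Replay every mutation up to and including `when`.
        state = {}
        for m in self._entries:
            if m.at > when:
                break
            if m.value is None:
                state.pop(m.key, None)
            else:
                state[m.key] = m.value
        return state

    def history(self, key):
        # Who changed this row, and when.
        return [m for m in self._entries if m.key == key]


t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
log = MutationLog()
log.append("order-1", {"price": 10}, "alice", t0)
log.append("order-1", {"price": 99}, "agent-7", t0 + timedelta(minutes=5))
log.append("order-1", None, "bob", t0 + timedelta(minutes=10))

# Recover the state as it was before agent-7's write.
restored = log.state_at(t0 + timedelta(minutes=1))
authors = [m.author for m in log.history("order-1")]
```

Replaying the whole log is the simplest way to show point-in-time recovery; a production system would index or checkpoint the log rather than scan it from the start.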

No lock-in

Every interface is an open standard.

Penca speaks the protocols your team already uses: Arrow Flight SQL for queries, gRPC for programmatic access, and Apache Iceberg at rest. Penca saves data directly to your bucket as Iceberg tables, so your tools work, your data is portable, and there’s nothing proprietary to migrate off.

SQL

Standard SQL via JDBC, ODBC, or ADBC. Your production applications and BI tools connect the way they already do, out of the box.

gRPC

Programmatic access via standard gRPC clients in any language. Branch, transact, and inspect row-level history straight from your code.

Iceberg

Iceberg tables in object storage at rest. Read your data with DuckDB, Polars, pandas, or Spark.

Query your data with SQL, gRPC, or your favorite analytics tool:

-- Standard SQL via JDBC, ODBC, or ADBC
SELECT transaction_id, customer_id, region, price
FROM checkout_events;

# gRPC API in any language
client.read_data(
    table_name="checkout_events",
    columns=["transaction_id", "customer_id", "region", "price"],
)

# Or read Iceberg tables at rest with your favorite engine
from pyiceberg.catalog import load_catalog

# Connect to the Iceberg REST catalog (replace the placeholder values)
catalog = load_catalog(
    "rest_catalog",
    **{
        "type": "rest",
        "uri": "https://your-catalog-url.com",
        "credential": "your_client_id:your_client_secret",
        "warehouse": "your_warehouse_name",
    },
)
# Scan only the columns you need and load them into a pandas DataFrame
catalog.load_table("penca.checkout_events").scan(
    selected_fields=("transaction_id", "customer_id", "region", "price")
).to_pandas()

Shape what’s next

Apply to be a design partner.

We want a handful of teams running production on a split system today and feeling the pain daily. You get direct access to the engineering team and a voice in the roadmap. We get sharp signal from people who live the problem.

Ready to delete your CDC pipeline?

Send a note about your team: what you’re running today and what you’re trying to unlock. We’ll reply within a few days.

[email protected]