Schema lost in transit, only to be recreated. WT*?


The current state of the art for Data Engineers is to build pipelines that ingest structured and semi-structured data (JSON, CSV, Avro) and store it as BLOBs on S3, where the schema is lost. “Schema on Read” technologies such as Snowflake, Dremio, or Presto then process these BLOBs by re-applying the schema that was lost in order to deliver insights.
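
To make the pattern concrete, here is a minimal sketch of that pipeline in Python. The `raw-events` bucket, the object key, and the toy click event are hypothetical; pandas stands in for the schema-on-read engine that re-infers what the ingest path threw away.

```python
import json

import boto3          # ingest side: write opaque blobs to S3
import pandas as pd   # read side: stand-in for a schema-on-read engine

s3 = boto3.client("s3")

# Ingest: the event clearly has a structure, but we serialize it to a JSON
# blob and upload it as an opaque object -- the schema travels nowhere.
event = {"user_id": 42, "action": "click", "ts": "2023-01-01T00:00:00Z"}
s3.put_object(
    Bucket="raw-events",                          # hypothetical bucket
    Key="events/2023/01/01/part-0000.json",       # hypothetical key
    Body=json.dumps(event).encode("utf-8"),
)

# Read: download the blob and re-infer the schema nobody kept. A warehouse
# would do this with VARIANT columns or external tables; pandas makes the
# same point in three lines.
obj = s3.get_object(Bucket="raw-events", Key="events/2023/01/01/part-0000.json")
record = json.loads(obj["Body"].read())
df = pd.DataFrame([record])
print(df.dtypes)  # the "recreated" schema: user_id int64, action/ts object
```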

So this raises a set of questions, starting with “Why is the schema lost in transit?” The industry gives us two primary reasons:

  1. Schema can change frequently, and “Schema on Write” data platforms cannot keep up with those changes (see the sketch after this list).
  2. Performance of the ingest pipelines – transferring opaque BLOBs is more efficient than parsing and understanding the schema.
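
To see why reason 1 pushes teams toward BLOBs, here is a small sketch that uses Python's sqlite3 as a stand-in for any schema-on-write store; the table, column names, and events are made up for illustration.

```python
import json
import sqlite3

# "Schema on write": a fixed table definition (sqlite3 as a stand-in for a warehouse).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT)")

old_event = {"user_id": 42, "action": "click"}
new_event = {"user_id": 43, "action": "click", "device": "ios"}  # schema change: new field

conn.execute("INSERT INTO events (user_id, action) VALUES (:user_id, :action)", old_event)

try:
    # The write path has to change (ALTER TABLE plus a new INSERT) before this record fits.
    conn.execute(
        "INSERT INTO events (user_id, action, device) VALUES (:user_id, :action, :device)",
        new_event,
    )
except sqlite3.OperationalError as exc:
    print("schema-on-write rejects the new field:", exc)

# "Schema on read": the blob path just appends bytes; the new field rides along
# unvalidated and is only interpreted when someone queries it later.
blob = json.dumps(new_event).encode("utf-8")
```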

Let’s compare the two approaches:

| Characteristic | Schema on Write | Schema on Read |
| --- | --- | --- |
| Reads | Fast(er) | Slow(er) |
| Writes | Slow(er) | Fast(er) |
| Post-facto Schema Changes | Nada | I don’t care |
| Data Format | Structured and Semi-structured(?) | Structured, Semi-Structured and Unstructured |
| Validation | Upfront | Let’s do ETL to validate |
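
The read/write rows of this table can be seen in miniature below: a rough sketch, assuming local Parquet and JSON-lines files in place of S3 objects, with pyarrow playing both the write-time and read-time roles. File names and the sample records are made up.

```python
import json

import pyarrow as pa
import pyarrow.json as pajson
import pyarrow.parquet as pq

records = [{"user_id": i, "action": "click"} for i in range(1000)]

# Schema on write: pay the cost up front -- parse, type, and encode the data
# into a columnar file. Later reads are fast because the schema and column
# layout are already on disk.
pq.write_table(pa.Table.from_pylist(records), "events.parquet")
print(pq.read_table("events.parquet", columns=["user_id"]).schema)

# Schema on read: the write is a cheap append of raw text, but every read
# pays to re-parse the JSON and re-infer the schema.
with open("events.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")
print(pajson.read_json("events.jsonl").schema)
```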

The truth is that Engineers need the flexibility to choose either option depending on the situation. The coming age of ML and JSON demands that flexibility to deliver both Analyst reports and real-time ML. What if a data platform could give you the best of both worlds? That data platform is bi(OS).

bi(OS) was designed not by asking what the differences are, but by asking why these differences exist, using first principles of computer science. The end result is a real-time, hyper-converged data platform that gives Data Engineers the best of both worlds. My blog next week will explain how bi(OS) does the impossible while helping Data Engineers do 10x more in days. In the meantime, I would love to hear your thoughts on why these differences exist. Please join the conversation in the 10x Data Engineer community.