Who is M/F? Real World Data Quality

Posted on: . by Monish Suvarna

My blog last week talked about Schema lost in transit. Let me tell you a real story about a customer. We uploaded data from their repository into bi(OS) and the image above is a screenshot of what we saw on bi(OS) after an hour. We were not surprised as we have seen this story at customer after customer. “Schema on Read” platforms that the overuse of the String data types with no validation during ingestion cause junk data to be stored wrecking havoc with models downstream.

The fundamental advantage that “Schema on Read” systems have is the flexibility of ignoring schema during ingestion, but it is also its achilles heel by having no schema validation during ingestion. Data nirvana would be having your cake [ easy schema changes] and eating [schema validation] it too. That would be impossibly hard was our first reaction, but we creatively found a way to achieve that outcome with bi(OS) by supporting Dynamic Schemas.

bi(OS) keeps a versioned schema that is atomically applied to a scale out distributed system. This allows Data Engineers to make changes to the schema while ingestion is in progress. Adding/deleting columns is just clicks or an API call. So now, Data Engineers can validate data during ingestion and change the schema in seconds if upstream systems change.

Data extraction is even simpler. bi(OS) associates ingested data with a schema version and automatically applies the right schema when the data is extracted. All a user has to do is call isQL.select() and bi(OS) ensures that the right version is applied.

With bi(OS), Data Engineers get the best of both worlds. A ‘Schema on Write” validation with “Schema on Read” flexibility to support upstream changes. On top of that bi(OS) is a scale out data platform that is blazing fast for both reads and writes. The only thing bi(OS) cannot handle is unstructured data such as images and video, but even that data once processed can be stored as structured data with a pointer in bi(OS) to the unstructured data Blob.

Characteristic Schema on Write Schema on Read bi(OS)
Reads Fast(er) Slow(er) Fast(er)
Writes Slow(er) Fast(er) Fast(er)
Post-facto Schema Changes Nada I don’t care Yes!
Data Format Structured and Semi-structured(?) Structured, Semi-Structured and Unstructured Structured and Semi-structured.
Validation Upfront Let’s do ETL to validate Upfront

You can try bi(OS) by signing up here for a free PoC. Our inbuilt connectors will allow you to solve a use case from ingest to insight in just days while experimenting with dynamic schema. We would love to hear your thoughts on our 10x Data Engineer Slack channel.

P.S – The picture is a screenshot from bi(OS). bi(OS) builds a word cloud for string data types in real time. The double bonus from bi(OS) is that you can validate values for string data types by allowing only “m”, “f”, “male”, “female” as valid values and bi(OS) will enforce it.