Databricks, the company founded by the original developers of the Apache Spark big data analytics engine, today announced that it has open-sourced Delta Lake, a storage layer that makes it easier to ensure data integrity as new data flows into an enterprise’s data lake by bringing ACID transactions to these vast data repositories.
Delta Lake, which has long been a proprietary part of Databricks’ offering, is already in production use by companies like Viacom, Edmunds, Riot Games and McGraw Hill.
The tool provides the ability to enforce specific schemas (which can be changed as necessary), to create snapshots and to ingest streaming data or backfill the lake as a batch job. Delta Lake also uses the Spark engine to handle the metadata of the data lake (which by itself is often a big data problem). Over time, Databricks also plans to add an audit trail, among other things.
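To make schema enforcement and snapshots a little more concrete, here is a minimal, purely illustrative sketch in plain Python. It is not Delta Lake's API or implementation: the `ToyTable` class, its `append` and `snapshot` methods, and the sample columns are all hypothetical names invented for this example. It only shows the general idea of rejecting a batch that does not match the declared schema and of reading the table as of an earlier version.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Hypothetical in-memory table; NOT Delta Lake's API."""
    schema: dict                               # column name -> expected Python type
    log: list = field(default_factory=list)    # one entry per committed batch

    def append(self, rows):
        """Validate the whole batch first, then commit it as one new version."""
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"unexpected columns: {sorted(row)}")
            for name, expected in self.schema.items():
                if not isinstance(row[name], expected):
                    raise ValueError(f"column {name!r} expects {expected.__name__}")
        self.log.append(list(rows))            # reached only if every row is valid
        return len(self.log)                   # version number after this commit

    def snapshot(self, version=None):
        """Return the rows visible as of a version (default: the latest)."""
        upto = len(self.log) if version is None else version
        return [row for batch in self.log[:upto] for row in batch]


table = ToyTable(schema={"user": str, "score": int})
v1 = table.append([{"user": "ana", "score": 3}])
v2 = table.append([{"user": "bo", "score": 5}, {"user": "cy", "score": 8}])

try:
    table.append([{"user": "di", "score": "high"}])   # wrong type for "score"
    rejected = False
except ValueError:
    rejected = True
```

In this toy version a bad batch leaves the log untouched, and `table.snapshot(1)` still returns only the rows from the first commit even after later batches arrive. A real system has to provide the same guarantees across distributed storage and concurrent writers, which is the hard part.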
“Today nearly every company has a data lake they are trying to get insights from, but data lakes have proven to lack data reliability. Delta Lake has eliminated these challenges for hundreds of enterprises. By making Delta Lake open source, developers will be able to easily build reliable data lakes and turn them into ‘Delta Lakes’,” said Ali Ghodsi, co-founder and CEO at Databricks.
What’s important to note here is that Delta Lake runs on top of existing data lakes and is compatible with the Apache Spark APIs.
The company is still looking at how the project will be governed in the future. “We are still exploring different models of open source project governance, but the GitHub model is well understood and presents a good trade-off between the ability to accept contributions and governance overhead,” Ghodsi said. “One thing we know for sure is we want to foster a vibrant community, as we see this as a critical piece of technology for increasing data reliability on data lakes. This is why we chose to go with a permissive open source license model: Apache License v2, same license that Apache Spark uses.”
To build this community, Databricks plans to take outside contributions, just like the Spark project.
“We want Delta Lake technology to be used everywhere on-prem and in the cloud by small and big enterprises,” said Ghodsi. “This approach is the fastest path to build something that can become a standard by having the community provide direction and contribute to the development efforts.” That’s also why the company decided against a Commons Clause license that some open-source companies now use to prevent others (and especially big clouds) from using their open source tools in their own commercial SaaS offerings. “We believe the Commons Clause license is restrictive and will discourage adoption. Our primary goal with Delta Lake is to drive adoption on-prem as well as in the cloud.”