LakeFS

lakeFS is free and open-source software developed by Treeverse. It provides scalable, format-agnostic version control for data lakes, using Git-like semantics to create and access different versions of data.

First released in August 2020, its features include data version tracking, isolated development and testing, repository rollback, and continuous data integration and deployment.

History
lakeFS was developed by Oz Katz and Einat Orr in 2020.

Treeverse published the first public release, v0.8.1, in August 2020. This version offered Git-like operations for any file format and compatibility with AWS S3 storage, and its versioning engine was based on multiversion concurrency control (MVCC).

In 2021, the versioning engine was replaced by Graveler, which allowed lakeFS to handle billions of objects with limited impact on performance.

In July 2021, Treeverse, the parent company of lakeFS, received an investment of $23 million in a Series A funding round, led by Dell Technologies Capital, Norwest Venture Partners, and Zeev Ventures.

In June 2022, lakeFS Cloud was introduced as a managed service to facilitate versioning in cloud data lakes. This service helps mitigate challenges related to tracking data changes and reverting to previous versions.

Overview
lakeFS is a data versioning engine that manages data in a way similar to code. By using operations such as branching, committing, merging, and reverting, which resemble those found in Git, it facilitates the handling of data and its corresponding schema throughout the entire data life cycle.
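
The commit and revert operations mentioned above can be illustrated with a small sketch. This is a toy in-memory model written for this article, not lakeFS code or its API: the class `ToyRepo`, its methods, and the object path are all hypothetical. It only shows the general idea that a commit records a snapshot and that a revert restores an earlier snapshot as a new commit.

```python
class ToyRepo:
    """Single-branch toy repository: each commit stores a path -> version map."""

    def __init__(self):
        self.working = {}   # uncommitted state: object path -> version label
        self.commits = []   # history: list of (message, snapshot) tuples

    def put(self, path, version):
        """Stage a new version of an object in the working state."""
        self.working[path] = version

    def commit(self, message):
        """Record a copy of the working state and return the commit's index."""
        self.commits.append((message, dict(self.working)))
        return len(self.commits) - 1

    def revert_to(self, commit_id):
        """Restore an earlier snapshot and record the restoration as a new commit."""
        _, snapshot = self.commits[commit_id]
        self.working = dict(snapshot)
        return self.commit(f"revert to commit {commit_id}")


repo = ToyRepo()
repo.put("events/day1.parquet", "v1")
good = repo.commit("load day 1")
repo.put("events/day1.parquet", "corrupted")
repo.commit("bad reload")
restored = repo.revert_to(good)
print(repo.commits[restored][1])
```

In this sketch the bad commit stays in the history, so the rollback itself is traceable, much as a revert is in Git.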

Features
lakeFS acts as an interface to object stores such as S3 and to data management systems such as AWS Glue and Databricks. It delegates the actual storage of data to backend services such as AWS, while lakeFS itself tracks branches and supports multiple storage providers.

lakeFS supports creating, tracking, merging, and deleting branches as required. Because experimental changes are isolated on their own branch, testing does not require a complete copy of the dataset. lakeFS also integrates with continuous integration and deployment pipelines via webhooks.

When working with arbitrary object storage, lakeFS processes data blocks via API calls. It stores branching information as metadata, which keeps subsequent object management efficient.
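
The idea of keeping branch information as metadata, separate from the stored data, can be sketched as follows. This is a simplified illustration under stated assumptions, not the design of lakeFS or Graveler: `ToyLake`, its methods, and the content-addressed layout are hypothetical, and the merge is deliberately naive (the source branch wins, with no conflict detection or handling of deletions).

```python
import hashlib


class ToyLake:
    """Toy model: an object store plus per-branch metadata pointing into it."""

    def __init__(self):
        self.objects = {}             # address -> bytes (stands in for the object store)
        self.branches = {"main": {}}  # branch name -> {path: address} (metadata only)

    def put(self, branch, path, data):
        """Store the bytes once and point the branch's path at them."""
        address = hashlib.sha256(data).hexdigest()
        self.objects.setdefault(address, data)
        self.branches[branch][path] = address

    def get(self, branch, path):
        return self.objects[self.branches[branch][path]]

    def create_branch(self, name, source):
        """Copy only the path -> address map; no object data is duplicated."""
        self.branches[name] = dict(self.branches[source])

    def merge(self, source, dest):
        """Naive merge: every path on the source branch overrides the destination."""
        self.branches[dest].update(self.branches[source])

    def delete_branch(self, name):
        del self.branches[name]


lake = ToyLake()
lake.put("main", "sales.csv", b"id,amount\n1,10\n")
lake.create_branch("experiment", "main")
objects_after_branch = len(lake.objects)  # still one stored object

lake.put("experiment", "sales.csv", b"id,amount\n1,99\n")
main_before_merge = lake.get("main", "sales.csv")  # main is unaffected

lake.merge("experiment", "main")
lake.delete_branch("experiment")
```

In this model, creating the branch adds no objects to the store, and the experimental write is invisible on `main` until the merge.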

lakeFS hooks
lakeFS hooks run checks and validations before key lifecycle events. Unlike Git hooks, they call remote servers to run the tests. For example, a hook can be configured to validate table schemas when data is merged from development or test branches into production; if validation fails, the merge is blocked. Hooks therefore serve as a tool for enforcing schemas and applying standardized rules across data sources and producers.

Events that can trigger hooks include commits, branch merges, branch creation, and changes to tags. In a merge, the pre-merge hook runs on the source branch before the merge is finalized.
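
The pre-merge behaviour described above can be sketched as a small event dispatcher. This is an illustrative model only: `HookRegistry`, `require_schema`, `merge`, and the event fields are hypothetical names, and the hook here is a local function standing in for the call to a remote server that a real lakeFS hook would make.

```python
from typing import Callable, Dict, List


class HookFailed(Exception):
    """Raised by a hook to veto the event that triggered it."""


Hook = Callable[[dict], None]


class HookRegistry:
    """Maps event types such as 'pre-merge' to the hooks that must pass."""

    def __init__(self) -> None:
        self._hooks: Dict[str, List[Hook]] = {}

    def register(self, event_type: str, hook: Hook) -> None:
        self._hooks.setdefault(event_type, []).append(hook)

    def run(self, event_type: str, event: dict) -> None:
        for hook in self._hooks.get(event_type, []):
            hook(event)  # any hook may raise HookFailed


def require_schema(expected: dict) -> Hook:
    """Build a hook that rejects a source branch whose schema differs from `expected`."""
    def check(event: dict) -> None:
        if event["source_schema"] != expected:
            raise HookFailed(f"schema mismatch on branch {event['source']}")
    return check


def merge(registry: HookRegistry, source: str, dest: str, schemas: dict) -> bool:
    """Run pre-merge hooks against the source branch; merge only if all pass."""
    event = {"source": source, "dest": dest, "source_schema": schemas[source]}
    try:
        registry.run("pre-merge", event)
    except HookFailed:
        return False  # merge blocked, destination left untouched
    schemas[dest] = dict(schemas[source])
    return True


registry = HookRegistry()
registry.register("pre-merge", require_schema({"user_id": "int", "event": "string"}))

schemas = {
    "main": {"user_id": "int", "event": "string"},
    "dev": {"user_id": "string", "event": "string"},  # wrong type for user_id
}
blocked = merge(registry, "dev", "main", schemas)

schemas["dev"]["user_id"] = "int"  # fix the schema on the source branch
allowed = merge(registry, "dev", "main", schemas)
```

The first merge attempt is rejected because the hook inspects the source branch before anything is written to the destination; the second succeeds once the schema matches.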