Parquet
- These are my notes taken from the 📘 Parquet docs
Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.
Consider a
The metadata of this format is specified in the parquet.thrift file, readable at: parquet.thrift
4-byte magic number "PAR1"
<Column 1 Chunk 1>
<Column 2 Chunk 1>
...
<Column N Chunk 1>
<Column 1 Chunk 2>
<Column 2 Chunk 2>
...
<Column N Chunk 2>
...
<Column 1 Chunk M>
<Column 2 Chunk M>
...
<Column N Chunk M>
File Metadata
4-byte length in bytes of file metadata (little endian)
4-byte magic number "PAR1"
Each column is saved with a chunk of
Readers are expected to first read the file metadata to find all the column chunks they are interested in. The column chunks should then be read sequentially.
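A minimal sketch of the first step a reader takes, based on the layout above: check the "PAR1" magic number at both ends, read the 4-byte little-endian length, and work out where the file metadata sits. The function name `locate_file_metadata` is hypothetical, and the sketch only locates the metadata bytes; decoding them would need a Thrift compact-protocol reader, which is not shown.

```python
import struct

MAGIC = b"PAR1"

def locate_file_metadata(buf: bytes) -> tuple[int, int]:
    """Return (offset, length) of the serialized file metadata inside buf.

    Assumes buf holds the whole file. Layout of the tail, per the notes:
    <file metadata> <4-byte little-endian length> <4-byte magic "PAR1">
    """
    # Smallest possible file: leading magic + length field + trailing magic.
    if len(buf) < 12 or buf[:4] != MAGIC or buf[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic number")
    # The length field sits just before the trailing magic number.
    (meta_len,) = struct.unpack("<I", buf[-8:-4])
    meta_start = len(buf) - 8 - meta_len
    if meta_start < 4:
        raise ValueError("metadata length exceeds file size")
    return meta_start, meta_len

# Toy buffer with placeholder bytes, not a real Parquet file.
toy = MAGIC + b"x" * 10 + b"META!" + struct.pack("<I", 5) + MAGIC
offset, length = locate_file_metadata(toy)
```

With a real file, a reader would typically seek to the end and read only the last 8 bytes first, rather than loading the whole file as this sketch does.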
Metadata:
- File Metadata is written after the data to allow for single pass writing.
- There are two types of metadata: file metadata and page header metadata. All thrift structures are serialized using the TCompactProtocol.
Supported types:
BOOLEAN: 1 bit boolean
INT32: 32 bit signed ints
INT64: 64 bit signed ints
INT96: 96 bit signed ints
FLOAT: IEEE 32-bit floating point values
DOUBLE: IEEE 64-bit floating point values
BYTE_ARRAY: arbitrarily long byte arrays
FIXED_LEN_BYTE_ARRAY: fixed length byte arrays
Nested Encoding:
- Definition levels specify how many optional fields in the path for the column are defined.
- Repetition levels specify at which repeated field in the path the value is repeated.
- The maximum definition and repetition levels can be computed from the schema (that is, from how much nesting there is). This value defines the maximum number of bits required to store the levels (levels are defined for all values in the column).
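As a rough illustration of how the maximum levels and their bit width follow from the schema, here is a small sketch. The helper names and the representation of a column path as a list of repetition types are assumptions made for the example. It also assumes that repeated fields count toward the definition level alongside optional ones, since a repeated field can have zero entries.

```python
def max_levels(path):
    """Compute (max_definition_level, max_repetition_level) for one column.

    `path` lists the repetition type of every field from the root down to
    the leaf column: "required", "optional" or "repeated".
    """
    # Every field that may be absent adds one possible definition level.
    max_def = sum(1 for r in path if r in ("optional", "repeated"))
    # Only repeated fields add a repetition level.
    max_rep = sum(1 for r in path if r == "repeated")
    return max_def, max_rep

def bit_width(max_level):
    """Bits needed to store any level in the range 0..max_level."""
    return max_level.bit_length()

# Hypothetical column: optional group -> repeated group -> required leaf.
levels = max_levels(["optional", "repeated", "required"])
widths = tuple(bit_width(level) for level in levels)
```

A flat column of required fields gets a maximum level of 0 for both, so it needs 0 bits and the levels do not have to be stored at all.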
Nulls:
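- Nullity is encoded in the definition levels; NULL values are not encoded in the data. For example, in a non-nested schema, a column with 1000 NULLs would be encoded with run-length encoding (0, 1000 times) for the definition levels and nothing else.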
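The 1000-NULL example can be sketched with a toy run-length encoder. This is a simplification: it produces plain (value, count) pairs to show the idea, not the actual byte-level encoding Parquet writes. The function name is hypothetical.

```python
from itertools import groupby

def run_length_encode(levels):
    """Collapse consecutive equal levels into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(levels)]

# Non-nested optional column: definition level 0 means NULL, 1 means present.
all_nulls = [0] * 1000
encoded = run_length_encode(all_nulls)  # a single run of 1000 zeros
```

Because every value is NULL, the definition levels collapse into one run and there are no encoded values to store after them.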
Data Pages:
- The 3 pieces of information are encoded back to back, after the page header. No padding allowed.
- In order we have:
- Repetition levels data
- Definition levels data
- Encoded values
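Since the three sections sit back to back with no padding, splitting a page body is only a matter of slicing. The sketch below assumes the byte lengths of the two level sections are already known to the reader. How they are obtained is not covered in these notes, so `rep_len` and `def_len` are hypothetical inputs, and `split_data_page` is a hypothetical helper.

```python
def split_data_page(body: bytes, rep_len: int, def_len: int):
    """Split a data page body (the bytes after the page header) into
    (repetition_levels, definition_levels, encoded_values).

    rep_len and def_len are assumed to be known byte lengths; either may be
    0 when the corresponding levels are not stored.
    """
    if rep_len < 0 or def_len < 0 or rep_len + def_len > len(body):
        raise ValueError("level sections do not fit in the page body")
    rep_levels = body[:rep_len]
    def_levels = body[rep_len:rep_len + def_len]
    # Everything after the two level sections is the encoded values.
    values = body[rep_len + def_len:]
    return rep_levels, def_levels, values

# Toy bytes standing in for real encoded data.
parts = split_data_page(b"RRDDDVVVV", rep_len=2, def_len=3)
```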
Column chunks
- Column chunks are composed of pages written back to back.
- Pages share a common header, so readers can skip over pages they are not interested in.
Error recovery:
- If the file metadata is lost or corrupted, the whole file is lost.
- If the column metadata is corrupted, that column chunk is lost.
Separation design:
- The format is explicitly designed to separate the metadata from the data.
- This allows splitting columns into multiple files, as well as having a single metadata file reference multiple parquet files.