- Maintainer: Qianqian Fang <q.fang at neu.edu>
- License: Apache License, Version 2.0
- Version: 1 (Draft 4-preview)
- URL: https://neurojson.org/bjdata/
- Status: Under development
- Development: https://github.com/NeuroJSON/bjdata
- Acknowledgement: This project is supported by the US National Institutes of Health (NIH) grant U24-NS124027 (NeuroJSON)
- Abstract:
The Binary JData (BJData) Specification defines an efficient serialization protocol for unambiguously storing complex and strongly-typed binary data found in diverse applications. The BJData specification is the binary counterpart to the JSON format, both of which are used to serialize complex data structures supported by the JData specification (https://neurojson.org/jdata). The BJData spec is derived and extended from the Universal Binary JSON (UBJSON, https://ubjson.org) specification (Draft 12). It adds support for N-dimensional packed arrays and extended binary data types.
- Introduction
- License
- Format Specification
- Recommended File Specifiers
- Acknowledgement
The JavaScript Object Notation (JSON) format is ubiquitously used in today's web and native applications. JSON offers numerous advantages, including simplicity, human- and machine-readability, and versatility, with a large toolchain ecosystem ranging from numerous parsers to highly efficient document-store databases. However, its plain-text nature and schema-less design make it difficult to efficiently store and exchange strongly-typed binary data, including numerical and multi-dimensional arrays and other complex data structures generated in scientific research and imaging applications.
In recent years, efforts to address these limitations have resulted in an array of binary JSON formats, such as BSON (Binary JSON, https://bson.org), UBJSON (Universal Binary JSON, https://ubjson.org), MessagePack (https://msgpack.org), and CBOR (Concise Binary Object Representation, [RFC 7049], https://cbor.io), among others. These binary JSON counterparts are broadly adopted in speed/space-sensitive data applications and vary in terms of supported data types, flexibility of containers (arrays and objects), and associated libraries.
The Binary JData (BJData) format is a binary JSON format extended from the UBJSON Specification Draft 12 (https://github.com/ubjson/universal-binary-json) by adding native support for N-dimensional packed arrays - an essential data structure for scientific applications - and columnar table format for packed objects to offer efficient storage of tables and repeating objects, such as Python Pandas DataFrame and NumPy arrays-of-objects. These optimized container constructs greatly reduce storage overhead and redundancy, enhancing data exchange efficiency.
In addition, BJData adds binary data types that UBJSON does not support, including unsigned integer types, half-precision floating-point numbers, a native byte type, as well as user-defined binary extensions. With these extensions, a BJData file can store binary arrays larger than 4 GB in size, which is not possible with MessagePack (maximum data record size is limited to 4 GB) or BSON (maximum total file size is 4 GB). Despite these extensions, the BJData format remains simple to parse.
An outstanding benefit of the BJData format (and its predecessor UBJSON) as opposed to other more popular binary JSON formats, such as BSON, CBOR, and MessagePack, is its quasi-human-readability - a unique characteristic that is absent from almost all other binary formats. This is because all data semantic elements in a BJData file, e.g. the "name" fields and data-type markers, are defined in human-readable strings. The resulting binary can be partially readable using an editor with minimal or no processing. We anticipate that such a unique capability makes a data file self-explanatory and easy to support without compromising computational and storage efficiency.
The Binary JData Specification is licensed under the Apache 2.0 License.
A single construct with two optional segments (length and data) is used for all types:
[type, 1-byte char]([integer numeric length])([data])
Each element in the tuple is defined as:
- type - A 1-byte ASCII char (Marker) used to indicate the type of the data following it.
- length (optional) - A positive, integer numeric type (uint8, int8, uint16, int16, uint32, int32, uint64 or int64) specifying the length of the following data payload.
- data (optional) - A contiguous byte-stream containing serialized binary data representing the actual binary data for this type of value.

- Some values are simple enough that just writing the 1-byte ASCII marker into the stream is enough to represent the value (e.g. null), while others have a type that is specific enough that no length is needed as the length is implied by the type (e.g. int32). Yet others still require both a type and a length to communicate their value (e.g. string). In addition, some values (e.g. array) have additional (optional) parameters to improve decoding efficiency and/or to reduce the size of the encoded value even further.
- The BJData specification (since Draft-2) requires that all numeric values must be written in the Little-Endian order. This is a breaking change compared to BJData Draft-1 and UBJSON Draft-12, where numeric values are in Big-Endian order.
- The array and object data types are container types, similar to JSON arrays and objects. They help partition and organize data records of all types, including container types, into composite and complex records.
- Using a strongly-typed container construct can further reduce data file sizes as it extracts common data type markers to the header. It also helps a parser to pre-allocate necessary memory buffers before reading the data payload.
In the following sections, we use a block-notation to illustrate the layout
of the encoded data. In this notation, the data type markers and individual
data payloads are enclosed by a pair of [], strictly for illustration purposes.
Both illustration markers [ and ] as well as the whitespaces between these
data elements, if present, shall be ignored when performing the actual data storage.
| Type | Total size | ASCII Marker(s) | Length required | Data (payload) |
|---|---|---|---|---|
| null | 1 byte | Z | No | No |
| no-op | 1 byte | N | No | No |
| true | 1 byte | T | No | No |
| false | 1 byte | F | No | No |
| int8 | 2 bytes | i | No | Yes |
| uint8 | 2 bytes | U | No | Yes |
| int16 | 3 bytes | I (upper case i) | No | Yes |
| uint16* | 3 bytes | u | No | Yes |
| int32 | 5 bytes | l (lower case L) | No | Yes |
| uint32* | 5 bytes | m | No | Yes |
| int64 | 9 bytes | L | No | Yes |
| uint64* | 9 bytes | M | No | Yes |
| float16/half* | 3 bytes | h | No | Yes |
| float32/single | 5 bytes | d | No | Yes |
| float64/double | 9 bytes | D | No | Yes |
| high-precision number | 1 byte + int num val + string byte len | H | Yes | Yes |
| char | 2 bytes | C | No | Yes |
| byte | 2 bytes | B | No | Yes |
| extension | 1 byte + int num val + int num val + payload | E | Yes | Yes (if not empty) |
| string | 1 byte + int num val + string byte len | S | Yes | Yes (if not empty) |
| array | 2+ bytes | [ and ] | Optional | Yes (if not empty) |
| object | 2+ bytes | { and } | Optional | Yes (if not empty) |
* Data type markers that are not defined in the UBJSON Specification (Draft 12)
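As a rough illustration of the fixed-length rows in the table above, the following Python sketch maps each numeric marker to a format code of the standard `struct` module and writes the marker followed by a Little-Endian payload. The names `FIXED_TYPES` and `encode_fixed` are hypothetical and not part of the specification.

```python
import struct

# Marker -> struct format code for the fixed-length numeric types in the table above.
FIXED_TYPES = {
    "i": "b", "U": "B", "I": "h", "u": "H",
    "l": "i", "m": "I", "L": "q", "M": "Q",
    "h": "e", "d": "f", "D": "d",
}

def encode_fixed(marker, value):
    """Return the 1-byte marker followed by the Little-Endian payload."""
    return marker.encode("ascii") + struct.pack("<" + FIXED_TYPES[marker], value)

# An int16 value takes 3 bytes in total: marker 'I' plus 2 payload bytes.
print(encode_fixed("I", 1137).hex())
```

The total sizes in the table follow directly from this layout: one marker byte plus the payload width of the type.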
The null value is equivalent to the null value from the JSON specification.
In JSON:
{
"passcode": null
}
In BJData (using block-notation):
[{]
[i][8][passcode][Z]
[}]
There is no equivalent to the no-op value in the original JSON specification. When
decoding, No-Op values should be skipped.
The intended usage of the no-op value is as a valueless signal between a
producer (most likely a server) and a consumer (most likely a client) to indicate
activity, for example, as a keep-alive signal so that a client knows a server is
still working and hasn't hung or timed out.
A Boolean type is equivalent to the Boolean value (true or false) defined in
the JSON specification.
In JSON:
{
"authorized": true,
"verified": false
}
In BJData (using block-notation):
[{]
[i][10][authorized][T]
[i][8][verified][F]
[}]
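The block-notation above can be turned into actual bytes with a few lines of Python. This is only a sketch: the helper names `encode_name` and `encode_flags` are made up for illustration, and names longer than 127 bytes are rejected to keep the length field a single int8.

```python
def encode_name(name):
    """Encode an object key: int8 marker, length byte, UTF-8 bytes (no S marker)."""
    raw = name.encode("utf-8")
    if len(raw) > 127:
        raise ValueError("this sketch only handles names up to 127 bytes")
    return b"i" + bytes([len(raw)]) + raw

def encode_flags(flags):
    """Encode a dict of name -> bool as an unoptimized BJData object."""
    out = b"{"
    for name, value in flags.items():
        out += encode_name(name) + (b"T" if value else b"F")
    return out + b"}"

print(encode_flags({"authorized": True, "verified": False}))
```

Each Boolean costs a single byte after its name because T and F carry no payload.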
Unlike in JSON, which has a single Number type (used for both integers and floating point numbers), BJData defines multiple types for integers. The minimum/maximum of values (inclusive) for each integer type are as follows:
| Type | Signed | Minimum | Maximum |
|---|---|---|---|
| int8 | Yes | -128 | 127 |
| uint8 | No | 0 | 255 |
| int16 | Yes | -32,768 | 32,767 |
| uint16 | No | 0 | 65,535 |
| int32 | Yes | -2,147,483,648 | 2,147,483,647 |
| uint32 | No | 0 | 4,294,967,295 |
| int64 | Yes | -9,223,372,036,854,775,808 | 9,223,372,036,854,775,807 |
| uint64 | No | 0 | 18,446,744,073,709,551,615 |
| float16/half | Yes | See IEEE 754 Spec | See IEEE 754 Spec |
| float32/single | Yes | See IEEE 754 Spec | See IEEE 754 Spec |
| float64/double | Yes | See IEEE 754 Spec | See IEEE 754 Spec |
| high-precision number | Yes | Infinite | Infinite |
Notes:
- Numeric values of +infinity, -infinity and NaN are to be encoded using their respective IEEE 754 binary form; this is different from the UBJSON specification where NaN and infinity are converted to null.
- It is advisable to use the smallest applicable type when encoding a number.
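One possible way to apply the "smallest applicable type" advice is to walk the integer ranges from the table in increasing width. The sketch below assumes that, when a signed and an unsigned type of the same width both fit, the signed one is tried first; the specification does not mandate that tie-break, and the names `INT_RANGES` and `smallest_int_marker` are hypothetical.

```python
# (marker, minimum, maximum), ordered from the narrowest to the widest type.
INT_RANGES = [
    ("i", -2**7, 2**7 - 1),
    ("U", 0, 2**8 - 1),
    ("I", -2**15, 2**15 - 1),
    ("u", 0, 2**16 - 1),
    ("l", -2**31, 2**31 - 1),
    ("m", 0, 2**32 - 1),
    ("L", -2**63, 2**63 - 1),
    ("M", 0, 2**64 - 1),
]

def smallest_int_marker(value):
    """Return the marker of the first (narrowest) integer type that can hold value."""
    for marker, lo, hi in INT_RANGES:
        if lo <= value <= hi:
            return marker
    raise OverflowError("value needs the high-precision number type (H)")

for v in (16, 255, 32768, 9223372036854775808):
    print(v, smallest_int_marker(v))
```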
All integer types (uint8, int8, uint16, int16, uint32, int32, uint64 and
int64) are written in Little-Endian order (this is different from UBJSON, where all
integers are written in Big-Endian order).
All float types (half, single, double) are written in Little-Endian order
(this is different from UBJSON which does not specify the Endianness of floats).
- float16 or half-precision values are written in IEEE 754 half precision floating point format, which has the following structure:
  - Bit 15 (1 bit) - sign
  - Bit 14-10 (5 bits) - exponent
  - Bit 9-0 (10 bits) - fraction (significand)
- float32 or single-precision values are written in IEEE 754 single precision floating point format, which has the following structure:
  - Bit 31 (1 bit) - sign
  - Bit 30-23 (8 bits) - exponent
  - Bit 22-0 (23 bits) - fraction (significand)
- float64 or double-precision values are written in IEEE 754 double precision floating point format, which has the following structure:
  - Bit 63 (1 bit) - sign
  - Bit 62-52 (11 bits) - exponent
  - Bit 51-0 (52 bits) - fraction (significand)
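To make the bit layout concrete, the following sketch packs a value as a Little-Endian float16 with Python's `struct` module (format code `e`) and then extracts the three fields with shifts and masks. The function name `float16_fields` is hypothetical.

```python
import struct

def float16_fields(value):
    """Split a number into the sign, exponent and fraction bits of its float16 form."""
    (bits,) = struct.unpack("<H", struct.pack("<e", value))
    sign = bits >> 15               # bit 15
    exponent = (bits >> 10) & 0x1F  # bits 14-10
    fraction = bits & 0x3FF         # bits 9-0
    return sign, exponent, fraction

print(float16_fields(1.5))    # 1.5 = 1.1b x 2^0, stored exponent is biased by 15
print(float16_fields(-2.0))
```

The same approach applies to float32 and float64 with the field widths listed above.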
These are encoded as a string and thus are only limited by the maximum string size. Values must be written out in accordance with the original JSON number type specification.
Numeric values in JSON:
{
"int8": 16,
"uint8": 255,
"int16": 32767,
"uint16": 32768,
"int32": 2147483647,
"int64": 9223372036854775807,
"uint64": 9223372036854775808,
"float32": 3.14,
"float64": 113243.7863123,
"huge1": "3.14159265358979323846",
"huge2": "-1.93E+190",
"huge3": "719..."
}
In BJData (using block-notation):
[{]
[i][4][int8][i][16]
[i][5][uint8][U][255]
[i][5][int16][I][32767]
[i][6][uint16][u][32768]
[i][5][int32][l][2147483647]
[i][5][int64][L][9223372036854775807]
[i][6][uint64][M][9223372036854775808]
[i][7][float32][d][3.14]
[i][7][float64][D][113243.7863123]
[i][5][huge1][H][i][22][3.14159265358979323846]
[i][5][huge2][H][i][10][-1.93E+190]
[i][5][huge3][H][U][200][719...]
[}]
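The high-precision entries in the example above follow the same marker, length, data pattern as strings. A minimal sketch, assuming the digits are plain ASCII and limiting the length field to int8, uint8 or int16 for brevity (the name `encode_high_precision` is hypothetical):

```python
def encode_high_precision(text):
    """Encode a decimal string as H + length + ASCII bytes."""
    raw = text.encode("ascii")
    n = len(raw)
    if n <= 127:
        length = b"i" + n.to_bytes(1, "little")
    elif n <= 255:
        length = b"U" + n.to_bytes(1, "little")
    elif n <= 32767:
        length = b"I" + n.to_bytes(2, "little")
    else:
        raise ValueError("this sketch stops at int16 lengths")
    return b"H" + length + raw

print(encode_high_precision("3.14159265358979323846"))
```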
The char type in BJData is an unsigned byte meant to represent a single
printable ASCII character (decimal values 0-127). It must not have a
decimal value larger than 127. It is functionally identical to the uint8 type,
but semantically is meant to represent a character and not a numeric value.
Char values in JSON:
{
"rolecode": "a",
"delim": ";"
}
BJData (using block-notation):
[{]
[i][8][rolecode][C][a]
[i][5][delim][C][;]
[}]
The byte type in BJData is functionally identical to the uint8 type,
but semantically is meant to represent a byte and not a numeric value. In
particular, when used as the strong type of an array container it provides
a hint to the parser that an optimized data storage format may be used as
opposed to a generic array of integers.
See also optimized format below.
Byte values in JSON:
{
"binary": [222, 173, 190, 239],
"val": 123
}
BJData (using block-notation):
[{]
[i][6][binary] [[] [$][B] [#][i][4] [222][173][190][239]
[i][3][val][B][123]
[}]
The string type in BJData is equivalent to the string type from the JSON
specification apart from the fact that BJData string value requires UTF-8
encoding.
String values in JSON:
{
"username": "andy",
"imagedata": "...huge string payload..."
}
BJData (using block-notation):
[{]
[i][8][username][S][i][4][andy]
[i][9][imagedata][S][l][2097152][...huge string payload...]
[}]
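Because BJData strings are UTF-8, the length field counts encoded bytes rather than characters. The sketch below illustrates this; the function name `encode_string` is hypothetical, and only signed length types are used for simplicity.

```python
import struct

def encode_string(text):
    """Encode text as S + length + UTF-8 bytes; the length counts bytes, not characters."""
    raw = text.encode("utf-8")
    n = len(raw)
    if n <= 127:
        length = b"i" + struct.pack("<b", n)
    elif n <= 32767:
        length = b"I" + struct.pack("<h", n)
    else:
        length = b"l" + struct.pack("<i", n)  # assumes n fits in int32
    return b"S" + length + raw

print(encode_string("andy"))
print(encode_string("é"))  # one character, two UTF-8 bytes
```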
See also optimized format below.
The array type in BJData is equivalent to the array type from the JSON
specification.
The child elements of an array are ordered and can be accessed by their indices.
Array in JSON:
[
null,
true,
false,
4782345193,
153.132,
"ham"
]
BJData (using block-notation):
[[]
[Z]
[T]
[F]
[L][4782345193]
[d][153.132]
[S][i][3][ham]
[]]
The object type in BJData is equivalent to the object type from the JSON
specification. Because value names can only be strings, the S (string) marker
is redundant and must not be included.
The child elements of an object are ordered and can be accessed by their names.
Object in JSON:
{
"post": {
"id": 1137,
"author": "Andy",
"timestamp": 1364482090592,
"body": "The quick brown fox jumps over the lazy dog"
}
}
BJData (using block-notation):
[{]
[i][4][post][{]
[i][2][id][I][1137]
[i][6][author][S][i][4][Andy]
[i][9][timestamp][L][1364482090592]
[i][4][body][S][i][43][The quick brown fox jumps over the lazy dog]
[}]
[}]
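The value types described so far are enough to write a small recursive encoder for JSON-like data. The following is a sketch under simplifying assumptions, not a reference implementation: it uses signed integer types only, always writes floats as float64, and emits no optimized container headers. The names `_int` and `encode` are hypothetical.

```python
import struct

def _int(value):
    """Encode an integer with the narrowest signed type that holds it."""
    for marker, fmt in ((b"i", "<b"), (b"I", "<h"), (b"l", "<i"), (b"L", "<q")):
        try:
            return marker + struct.pack(fmt, value)
        except (struct.error, OverflowError):
            continue
    raise OverflowError("value does not fit in int64")

def encode(value):
    """Encode None, bool, int, float, str, list and dict without optimized headers."""
    if value is None:
        return b"Z"
    if isinstance(value, bool):  # must precede int: bool is a subclass of int
        return b"T" if value else b"F"
    if isinstance(value, int):
        return _int(value)
    if isinstance(value, float):
        return b"D" + struct.pack("<d", value)
    if isinstance(value, str):
        raw = value.encode("utf-8")
        return b"S" + _int(len(raw)) + raw
    if isinstance(value, list):
        return b"[" + b"".join(encode(v) for v in value) + b"]"
    if isinstance(value, dict):
        body = b""
        for name, child in value.items():
            raw = name.encode("utf-8")
            body += _int(len(raw)) + raw + encode(child)  # names carry no S marker
        return b"{" + body + b"}"
    raise TypeError(f"unsupported type: {type(value).__name__}")

print(encode({"post": {"id": 1137, "author": "Andy"}}))
```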
Both container types (array and object) support optional parameters that can
help optimize the container for better parsing performance and smaller size.
When a type is specified, all value types stored in the container (either array or object) are considered to be of that singular type and, as a result, type markers are omitted for each value within the container. This can be thought of as providing the ability to create a strongly-typed container in BJData.
A major difference between BJData and UBJSON is that the type in a BJData
strongly-typed container is limited to non-zero-fixed-length data types. Therefore,
only integers (i,U,I,u,l,m,L,M), floating-point numbers (h,d,D), char (C) and byte (B)
are qualified. All zero-length types (T,F,Z,N), variable-length types (S,H)
and container types ([,{) shall not be used in an optimized type header.
This restriction reduces the risk of buffer-overflow attacks that exploit
zero-length markers, and avoids the hampered readability and diminished benefit
of using variable-length or container types in an optimized format.
The requirements for type are:
- If a type is specified, it must be one of i,U,I,u,l,m,L,M,h,d,D,C,B.
- If a type is specified, it must be done so before a count.
- If a type is specified, a count must be specified as well. (Otherwise it is impossible to tell when a container is ending, e.g. did you just parse ] or the int8 value of 93?)
[$][U]
When the count marker # is followed by a single non-negative integer record, i.e. one of
i,U,I,u,l,m,L,M, it specifies the total child element count. This allows the
parser to pre-size any internal construct used for parsing, verify that the
promised number of child values were found, and avoid scanning for any terminating
bytes while parsing.
- A count can be specified without a type.
[#][i][64]
An optimized array has a uniform payload, i.e., all data records stored inside the payload section have the same data type and byte length, and thus can help accelerate loading and saving of such data.
When both type and count are specified and the count marker # is followed
by [, the parser should expect the following sequence to be a 1-D array with
zero or more (Ndim) integer elements (Nx, Ny, Nz, ...). This specifies an
Ndim-dimensional array of uniform type specified by the type marker after $.
The array data are serialized in the row-major format.
For example, the below two block sequences both represent an Nx*Ny*Nz*... array of
uniform numeric type:
[[] [$] [type] [#] [[] [$] [Nx type] [#] [Ndim type] [Ndim] [Nx Ny Nz ...] [Nx*Ny*Nz*...*sizeof(type)]
or
[[] [$] [type] [#] [[] [Nx type] [Nx] [Ny type] [Ny] [Nz type] [Nz] ... []] [Nx*Ny*Nz*...*sizeof(type)]
where Ndim is the number of dimensions, and Nx, Ny, and Nz ... are
all non-negative numbers specifying the dimensions of the N-dimensional array.
Nx/Ny/Nz/Ndim types must be one of the integer types (i,U,I,u,l,m,L,M).
The binary data of the N-dimensional array is then serialized into a 1-D vector
in the row-major element order (similar to C, C++, JavaScript or Python).
To store an N-dimensional array that is serialized using the column-major element
order (as used in MATLAB and FORTRAN), the count marker # should be followed by
an array of a single element, which must be a 1-D array of integer type as the
dimensional vector above. Either of the arrays can be in optimized or non-optimized
form. For example, either of the following
[[] [$] [type] [#] [[] [[] [$] [Nx type] [#] [Ndim type] [Ndim] [Nx Ny Nz ...] []] [a11 a21 a31 ... a12 a22 ...]
or
[[] [$] [type] [#] [[] [[] [Nx type] [Nx] [Ny type] [Ny] [Nz type] [Nz] ... []] []] [a11 a21 a31 ... a12 a22 ...]
represents the same column-major N-dimensional array of type and size [Nx, Ny, Nz, ...].
The following 2x3x4 3-D uint8 array
[
[
[1,9,6,0],
[2,9,3,1],
[8,0,9,6]
],
[
[6,4,2,7],
[8,5,1,2],
[3,3,2,6]
]
]
shall be stored using row-major serialized form as
[[] [$][U] [#][[] [$][U][#][U][3] [2][3][4]
[1][9][6][0] [2][9][3][1] [8][0][9][6] [6][4][2][7] [8][5][1][2] [3][3][2][6]
or column-major serialized form as
[[] [$][U] [#][[] [[] [$][U][#][U][3] [2][3][4] []]
[1][6][2][8] [8][3][9][4] [9][5][0][3] [6][2][3][1] [9][2][0][7] [1][2][6][6]
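The two element orders can be reproduced in pure Python for the 2x3x4 example. This sketch assumes a nested list of uint8 values and writes the dimension vector as an optimized uint8 array whose count carries an explicit U type marker, following the generic [#] [Ndim type] [Ndim] layout given earlier. The function name `serialize_3d` is hypothetical.

```python
from itertools import product

def serialize_3d(array, column_major=False):
    """Flatten a nested 3-D list of uint8 values and prepend the ND-array header."""
    dims = (len(array), len(array[0]), len(array[0][0]))
    if column_major:
        # First index varies fastest: iterate the reversed dimensions, then swap back.
        order = [(i, j, k) for k, j, i in product(*(range(d) for d in reversed(dims)))]
    else:
        # Last index varies fastest (C order).
        order = list(product(*(range(d) for d in dims)))
    payload = bytes(array[i][j][k] for i, j, k in order)
    dim_vector = b"[$U#U" + bytes([len(dims)]) + bytes(dims)
    if column_major:
        dim_vector = b"[" + dim_vector + b"]"  # single-element array wrapping the dimensions
    return b"[$U#" + dim_vector + payload

A = [[[1, 9, 6, 0], [2, 9, 3, 1], [8, 0, 9, 6]],
     [[6, 4, 2, 7], [8, 5, 1, 2], [3, 3, 2, 6]]]
print(list(serialize_3d(A)[-24:]))
print(list(serialize_3d(A, column_major=True)[-24:]))
```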
- A count must be >= 0.
- A count can be specified alone.
- If a count is specified, the container must not specify an end-marker.
- A container that specifies a count must contain the specified number of child elements.
- If a type is specified, it must be done so before count.
- If a type is specified, a count must also be specified. A type cannot be specified alone.
- A container that specifies a type must not contain any additional type markers for any contained value.
Optimized with count
[[][#][i][5] // An array of 5 elements.
[d][29.97]
[d][31.13]
[d][67.0]
[d][2.113]
[d][23.8889]
// No end marker since a count was specified.
Optimized with both type and count
[[][$][d][#][i][5] // An array of 5 float32 elements.
[29.97] // Value type is known, so type markers are omitted.
[31.13]
[67.0]
[2.113]
[23.8889]
// No end marker since a count was specified.
Optimized with count
[{][#][i][3] // An object of 3 name:value pairs.
[i][3][lat][d][29.976]
[i][4][long][d][31.131]
[i][3][alt][d][67.0]
// No end marker since a count was specified.
Optimized with both type and count
[{][$][d][#][i][3] // An object of 3 name:float32-value pairs.
[i][3][lat][29.976] // Value type is known, so type markers are omitted.
[i][4][long][31.131]
[i][3][alt][67.0]
// No end marker since a count was specified.
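For a strongly-typed array, the header is written once and the values follow as a packed block. A minimal round-trip sketch for float32 arrays, assuming the count is always written as an int8 (the function names are hypothetical):

```python
import struct

def encode_float32_array(values):
    """Encode numbers as [$d#i<count> followed by packed Little-Endian float32 values."""
    n = len(values)
    if n > 127:
        raise ValueError("this sketch writes the count as int8 only")
    header = b"[$d#i" + struct.pack("<b", n)
    return header + struct.pack(f"<{n}f", *values)  # no ']' because a count is given

def decode_float32_array(blob):
    """Inverse of encode_float32_array for the same restricted header."""
    if blob[:5] != b"[$d#i":
        raise ValueError("unexpected header")
    n = blob[5]
    return list(struct.unpack_from(f"<{n}f", blob, 6))

blob = encode_float32_array([29.97, 31.13, 67.0, 2.113, 23.8889])
print(len(blob), decode_float32_array(blob))
```

Compared with the count-only form, the typed form saves one marker byte per element, and the decoder can read the whole payload in a single call.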
BJData supports Structure-of-Arrays (SoA) as a special type of optimized container to store packed object data in either column-major or row-major order. The payload in an SoA record is packed in column or row order, in chunks of binary data of identical byte length. This uniform columnar payload structure makes it efficient to parse and store.
[[][$] [{]<schema>[}] [#]<count> <payload> // row-major (interleaved)
[{][$] [{]<schema>[}] [#]<count> <payload> // column-major (columnar)
where:
- [ or { - container type (determines memory layout)
- $ - optimized type marker
- {<schema>} - payload-less object defining the record structure
- # - count marker
- <count> - 1D integer OR ND dimension array
- <payload> - tightly packed data
The schema is a payload-less object: keys followed by type markers only, no values.
schema = '{' 1*(field-def) '}'
field-def = name type-spec
name = int-type length string-bytes
type-spec = fixed-type | bool-type | null-type | string-spec | highprec-spec
| nested-schema | array-spec
fixed-type = 'U' | 'i' | 'u' | 'I' | 'l' | 'm' | 'L' | 'M' | 'h' | 'd' | 'D' | 'C' | 'B'
bool-type = 'T' // boolean (1 byte: T or F in payload)
null-type = 'Z' // null (0 bytes in payload)
string-spec = fixed-string | dict-string | offset-string
fixed-string = 'S' int-type length // fixed-size string
dict-string = '[' '$' 'S' '#' count 1*(string-value) // dictionary-based string
offset-string = '[' '$' int-type ']' // offset-table-based variable string
highprec-spec = fixed-highprec | dict-highprec | offset-highprec
fixed-highprec= 'H' int-type length // fixed-size high-precision number
dict-highprec = '[' '$' 'H' '#' count 1*(highprec-value) // dictionary-based high-precision
offset-highprec = '[' '$' int-type ']' // offset-table-based variable high-precision (same as string)
nested-schema = '{' 1*(field-def) '}'
array-spec = '[' 1*(type-spec) ']' // fixed array with explicit element types
Key rules:
1. Fixed-length numeric types: U i u I l m L M h d D C B
2. T in schema means "boolean type" - each value is 1 byte (T or F marker) in payload
3. Z in schema means "null field" - no bytes in payload (placeholder/reserved field)
4. Strings (S) and high-precision numbers (H) support three storage modes:
   - Fixed-length: S <int-type> <length> or H <int-type> <length>
   - Dictionary-based: [$S#<count><str1><str2>... or [$H#<count><val1><val2>...
   - Offset-table-based: [$<int-type>]
5. Nested objects {...} are allowed if all fields use supported types
6. No optimized containers can be used inside the schema, except when serving as dictionary/offset-table markers for variable-length strings, as described in #4 above
7. F and N are not used in schema (use T for boolean, Z for null)
In a schema context, S and H followed by an integer define fixed-length strings or high-precision numbers:
{ i4 name S i 16 } // "name" is a 16-byte fixed string
{ i5 value H i 32 } // "value" is a 32-byte fixed high-precision number
In the payload, each record contributes exactly the specified bytes - no length prefix. Strings shorter than the length are right-padded with null bytes (0x00).
Use case: Strings with known maximum length (codes, IDs, short labels).
A dictionary for mapping string-value rows/columns should be indicated by a payload-less optimized string-array.
Use case: Repeated/categorical string values.
Schema syntax:
[$S#<count><str1><str2>...
[$H#<count><val1><val2>...
where each string/value is encoded as a standard BJData string or high-precision number
(with length prefix). No closing ] is needed because the count is specified.
Example schema:
{
i6 status [$S#i 3 // dictionary with 3 string values
i 6 active // index 0: "active"
i 8 inactive // index 1: "inactive"
i 7 pending // index 2: "pending"
}
Payload encoding: Each record stores a single integer index referencing the dictionary. The index type is automatically determined as the smallest unsigned integer type that can represent the dictionary size:
- count ≤ 255: U (uint8, 1 byte)
- count ≤ 65535: u (uint16, 2 bytes)
- count ≤ 4294967295: m (uint32, 4 bytes)
- otherwise: M (uint64, 8 bytes)
Benefits:
- Excellent compression for low-cardinality categorical data
- O(1) lookup for string values
- Payload remains fixed-size per record
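The index-type rule for dictionary-based strings translates directly into a small lookup. The function name `dictionary_index_marker` is hypothetical; the thresholds are the ones listed above.

```python
def dictionary_index_marker(count):
    """Return the marker of the unsigned integer type used for dictionary indices."""
    if count <= 255:
        return "U"  # uint8, 1 byte per record
    if count <= 65535:
        return "u"  # uint16, 2 bytes per record
    if count <= 4294967295:
        return "m"  # uint32, 4 bytes per record
    return "M"      # uint64, 8 bytes per record

print(dictionary_index_marker(3))
```

For the three-entry status dictionary in the example schema, every record therefore stores a single uint8 index.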
An offset-table based storage is used for storing string vectors of variable lengths.
It first concatenates all strings into a single linear buffer, and assigns an integer as
the offset from the beginning of the buffer for each string element. The use of
offset-table should be indicated by a payload-less optimized array containing only the
optimized type [$<type>].
Use case: Diverse strings with highly variable lengths (names, descriptions, free text).
Schema syntax:
[$<offset-type>]
Where <offset-type> is an integer type (i, U, I, u, l, m, L, M)
specifying the byte-offset type stored in the payload.
Example schema:
{
i2 id m // uint32 (4 bytes in payload)
i4 name [$l] // variable string with int32 offsets
i5 value D // float64 (8 bytes in payload)
}
Storage structure:
The offset-table-based string field stores an offset index in the fixed payload area. After all record payloads, an offset table and concatenated string buffer are appended:
┌─────────────────────────────────────────────────────────────────┐
│ Schema Header │
├─────────────────────────────────────────────────────────────────┤
│ Fixed-size payload (N records) │
│ - Each string field position stores a sequential index (0..N-1)│
│ - Other fields store actual values │
├─────────────────────────────────────────────────────────────────┤
│ Offset Table: (N+1) offsets of <offset-type> │
│ [0, end1, end2, ..., end_N] │
├─────────────────────────────────────────────────────────────────┤
│ String Buffer: concatenated strings (no length prefixes) │
│ str1 ∥ str2 ∥ str3 ∥ ... ∥ str_N │
└─────────────────────────────────────────────────────────────────┘
Decoding string i:
offset_start = offset_table[i]
offset_end = offset_table[i+1]
string_i = string_buffer[offset_start:offset_end]
Multiple variable-length fields: When a schema contains multiple offset-table-based fields, their offset tables and string buffers are appended in schema field order.
Empty strings: Represented by consecutive identical offsets in the offset table.
Benefits:
- Efficient for highly variable string lengths
- No wasted padding bytes
- Random access via offset table
- String buffer can be memory-mapped
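The offset-table layout and the decoding rule described above can be sketched as follows. The helper names `build_offset_table` and `string_at` are hypothetical, and the offsets are kept as a Python list rather than serialized with a particular offset type.

```python
def build_offset_table(strings):
    """Return (offsets, buffer): N+1 byte offsets and the concatenated UTF-8 buffer."""
    offsets, buffer = [0], b""
    for text in strings:
        buffer += text.encode("utf-8")
        offsets.append(len(buffer))
    return offsets, buffer

def string_at(offsets, buffer, i):
    """Decode string i using two consecutive entries of the offset table."""
    return buffer[offsets[i]:offsets[i + 1]].decode("utf-8")

offsets, buffer = build_offset_table(["Alice", "Bob", "Dr. Christopher Williams"])
print(offsets)
print(string_at(offsets, buffer, 1))
```

An empty string adds no bytes to the buffer, so its start and end offsets are equal.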
In normal BJData, T and F are zero-length value markers:
T // true (no payload)
F // false (no payload)
In a schema context, T means "boolean type" - a 1-byte field:
{ i6 active T } // "active" is a boolean field
In the payload, each boolean value is stored as a single byte: T (0x54) for true,
F (0x46) for false.
In a schema context, Z means "null/placeholder field" with zero bytes in payload:
{
i2 id m // uint32 (4 bytes)
i8 reserved Z // placeholder (0 bytes)
i4 data D // float64 (8 bytes)
}
This is useful for:
- Reserved fields for future expansion
- Marking fields that exist in the schema but carry no data
- Sparse structures where some fields are always null
Using existing container markers:
| Syntax | Layout | Description |
|---|---|---|
| [$ | Row-major (Interleaved) | Array of records - each complete record stored sequentially |
| {$ | Column-major (Columnar) | Object of arrays - all values of each field stored together |
[$ {<schema>} #<count> <interleaved-payload>
Payload order: <record₁><record₂><record₃>...
Example: 3 particles with {x:float64, y:float64, id:uint32, active:bool}
[ $ { i1 x D i1 y D i2 id m i6 active T } # i 3
<x₁:8><y₁:8><id₁:4><active₁:1> <x₂:8><y₂:8><id₂:4><active₂:1> ...
Payload: 3 × 21 bytes = 63 bytes, interleaved
{$ {<schema>} #<count> <columnar-payload>
Payload order: <all field₁ values><all field₂ values>...
Example: Same 3 particles
{ $ { i1 x D i1 y D i2 id m i6 active T } # i 3
<x₁:8><x₂:8><x₃:8> <y₁:8><y₂:8><y₃:8> <id₁:4><id₂:4><id₃:4> <T><F><T>
Payload: (3×8) + (3×8) + (3×4) + (3×1) = 63 bytes, columnar
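The difference between the two payload orders can be checked with Python's `struct` module. This sketch builds the payload only (no schema header), uses made-up particle values, and treats x and y as 8-byte float64 fields as stated in the example; the function names are hypothetical.

```python
import struct

# Made-up sample data: (x, y, id, active)
particles = [(1.0, 2.0, 10, True), (3.0, 4.0, 11, False), (5.0, 6.0, 12, True)]

def flag(b):
    """Booleans are stored as the 1-byte markers T or F."""
    return b"T" if b else b"F"

def pack_row_major(records):
    """Interleaved payload: one 21-byte record after another."""
    return b"".join(struct.pack("<ddI", x, y, pid) + flag(on)
                    for x, y, pid, on in records)

def pack_column_major(records):
    """Columnar payload: all x values, then all y, then all ids, then all flags."""
    xs, ys, ids, flags = zip(*records)
    n = len(records)
    return (struct.pack(f"<{n}d", *xs) + struct.pack(f"<{n}d", *ys)
            + struct.pack(f"<{n}I", *ids) + b"".join(flag(b) for b in flags))

print(len(pack_row_major(particles)), len(pack_column_major(particles)))
```

Both layouts carry the same bytes in a different order, so the choice mainly affects access patterns: interleaved favors reading whole records, columnar favors scanning one field.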
Why this design:
- [ = "ordered sequence" → sequence of records (row-major)
- { = "named fields" → fields as separate arrays (column-major)
- No new markers needed
{
i4 name S i 32 // 32-byte fixed string
i8 position { // nested object (24 bytes total)
i1 x D
i1 y D
i1 z D
}
i6 active T // boolean (1 byte)
i5 flags U // uint8 (1 byte)
}
Record size: 32 + 24 + 1 + 1 = 58 bytes
Use array syntax with repeated type markers:
{
i2 id m // uint32 (4 bytes)
i3 pos [D D D] // array of 3 float64 (24 bytes)
i5 color [U U U U] // array of 4 uint8 (4 bytes)
i5 flags [T T T T] // array of 4 booleans (4 bytes)
}
Record size: 4 + 24 + 4 + 4 = 36 bytes
For longer arrays, repeat the type marker:
{
i4 data [D D D D D D D D D D] // array of 10 float64 (80 bytes)
}
{
i6 vertex [D D D] // position: 3 float64 (24 bytes)
i6 normal [h h h] // normal: 3 float16 (6 bytes)
i5 color [U U U U] // RGBA: 4 uint8 (4 bytes)
i7 visible T // visibility: boolean (1 byte)
}
Record size: 24 + 6 + 4 + 1 = 35 bytes
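The record-size arithmetic in these examples can be automated. The sketch below uses a hypothetical in-memory representation of a schema (a dict for objects, a list for fixed arrays, a tuple for fixed-length S/H fields, a marker string otherwise) and takes the per-marker sizes from the data-type table, where D denotes float64. Dictionary-based and offset-table-based fields are not covered.

```python
SIZES = {"U": 1, "i": 1, "u": 2, "I": 2, "l": 4, "m": 4, "L": 8, "M": 8,
         "h": 2, "d": 4, "D": 8, "C": 1, "B": 1, "T": 1, "Z": 0}

def record_size(spec):
    """Bytes contributed per record by a schema fragment."""
    if isinstance(spec, str):    # single type marker
        return SIZES[spec]
    if isinstance(spec, tuple):  # ("S", n) or ("H", n): fixed-length string or number
        return spec[1]
    if isinstance(spec, list):   # fixed array with explicit element types
        return sum(record_size(s) for s in spec)
    if isinstance(spec, dict):   # object or nested object: name -> spec
        return sum(record_size(s) for s in spec.values())
    raise TypeError("unsupported schema fragment")

vertex = {"vertex": ["D"] * 3, "normal": ["h"] * 3, "color": ["U"] * 4, "visible": "T"}
print(record_size(vertex))
```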
Both [$ and {$ support ND dimensions:
[$ {<schema>} #[<dim₁> <dim₂> ...] <payload>
{$ {<schema>} #[<dim₁> <dim₂> ...] <payload>
Example: 4×3 grid of particles (row-major)
[ $ { i1 x D i1 y D i6 active T } # [ i 4 i 3 ]
<12 records in row-major order>
Total: 12 records × 17 bytes = 204 bytes
Data: 2 sensors
[
{"id": 1, "pos": {"x": 1.0, "y": 2.0}, "val": [0.1, 0.2, 0.3], "on": true},
{"id": 2, "pos": {"x": 3.0, "y": 4.0}, "val": [0.4, 0.5, 0.6], "on": false}
]
Row-major encoding:
Byte Hex Meaning
---- ---- -------
0 5B [ (array-style SoA = row-major)
1 24 $
2 7B { (schema start)
3 69 i (int8 key length)
4 02 2
5-6 6964 "id"
7 6D m (uint32)
8 69 i
9 03 3
10-12 706F73 "pos"
13 7B { (nested object start)
14 69 i
15 01 1
16 78 "x"
17 44 D (float64)
18 69 i
19 01 1
20 79 "y"
21 44 D (float64)
22 7D } (nested object end)
23 69 i
24 03 3
25-27 76616C "val"
28 5B [ (array start)
29 44 D (float64)
30 44 D
31 44 D
32 5D ] (array end)
33 69 i
34 02 2
35-36 6F6E "on"
37 54 T (boolean type)
38 7D } (schema end)
39 23 #
40 69 i
41 02 2 (count = 2)
--- PAYLOAD (2 records × 45 bytes) ---
42-45 id₁: uint32 = 1
46-53 pos.x₁: float64 = 1.0
54-61 pos.y₁: float64 = 2.0
62-69 val₁[0]: float64 = 0.1
70-77 val₁[1]: float64 = 0.2
78-85 val₁[2]: float64 = 0.3
86 on₁: T (true)
87-90 id₂: uint32 = 2
91-98 pos.x₂: float64 = 3.0
99-106 pos.y₂: float64 = 4.0
107-114 val₂[0]: float64 = 0.4
115-122 val₂[1]: float64 = 0.5
123-130 val₂[2]: float64 = 0.6
131 on₂: F (false)
Record size: 4 + 8 + 8 + 24 + 1 = 45 bytes
Total: 42 (header) + 90 (payload) = 132 bytes
Data: 3 users with variable-length names and categorical status
[
{"id": 1, "status": "active", "name": "Alice", "code": "U001"},
{"id": 2, "status": "pending", "name": "Bob", "code": "U002"},
{"id": 3, "status": "active", "name": "Dr. Christopher Williams", "code": "U003"}
]
Schema (block notation):
[{]
[i][2][id][m] // uint32 (4 bytes)
[i][6][status][[][$][S][#][i][3] // dictionary with 3 values
[i][6][active] // index 0
[i][8][inactive] // index 1
[i][7][pending] // index 2
[i][4][name][[][$][l][]] // offset-based variable string (int32 offsets)
[i][4][code][S][i][4] // fixed 4-byte string
[}]
Record payload size: 4 (id) + 1 (status index) + 4 (name offset index) + 4 (code) = 13 bytes
Memory layout (row-major):
┌──────────────────────────────────────────────────────────────┐
│ HEADER (schema + count = 3) │
├──────────────────────────────────────────────────────────────┤
│ Record 1: [id=1] [status_idx=0] [name_idx=0] [code="U001"] │ 13 bytes
│ Record 2: [id=2] [status_idx=2] [name_idx=1] [code="U002"] │ 13 bytes
│ Record 3: [id=3] [status_idx=0] [name_idx=2] [code="U003"] │ 13 bytes
├──────────────────────────────────────────────────────────────┤
│ Name Offset Table (4 × int32): │
│ [0, 5, 8, 32] │ 16 bytes
├──────────────────────────────────────────────────────────────┤
│ Name String Buffer: │
│ "AliceBobDr. Christopher Williams" │ 32 bytes
└──────────────────────────────────────────────────────────────┘
Total: header + 39 (records) + 16 (offset table) + 32 (strings) = header + 87 bytes
| Marker | In Schema Means | Payload Size | Notes |
|---|---|---|---|
| U | uint8 | 1 byte | |
| i | int8 | 1 byte | |
| u | uint16 | 2 bytes | |
| I | int16 | 2 bytes | |
| l | int32 | 4 bytes | |
| m | uint32 | 4 bytes | |
| L | int64 | 8 bytes | |
| M | uint64 | 8 bytes | |
| h | float16 | 2 bytes | |
| d | float32 | 4 bytes | |
| D | float64 | 8 bytes | |
| C | char | 1 byte | |
| B | byte | 1 byte | |
| T | boolean | 1 byte | Payload: T or F marker |
| Z | null/placeholder | 0 bytes | No payload |
| S <int> <len> | fixed string | len bytes | No length prefix in payload |
| H <int> <len> | fixed high-precision | len bytes | No length prefix in payload |
| [$S#<n>... | dictionary string | 1-8 bytes | Index into embedded dictionary |
| [$H#<n>... | dictionary high-precision | 1-8 bytes | Index into embedded dictionary |
| [$<type>] | offset-based string/H | sizeof(type) | Offset table + buffer appended |
| {...} | nested object | sum of fields | All fields must be supported types |
| [...] | fixed array | sum of elements | Explicit element types listed |
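The per-record payload size follows from summing the sizes in this table. The sketch below is one possible way to do that in Python. The field representation (a marker string, or a `(marker, length)` tuple for fixed `S`/`H` fields) and the names `FIXED_SIZES`, `field_size` and `record_payload_size` are assumptions made for illustration. Dictionary and offset-based fields are simplified here to the integer marker of their per-record index.

```python
# Payload sizes in bytes for the fixed-size markers listed in the table
FIXED_SIZES = {"U": 1, "i": 1, "u": 2, "I": 2, "l": 4, "m": 4, "L": 8, "M": 8,
               "h": 2, "d": 4, "D": 8, "C": 1, "B": 1, "T": 1, "Z": 0}

def field_size(field):
    """Return the payload size of one schema field."""
    if isinstance(field, tuple):          # ("S", len) or ("H", len)
        marker, length = field
        if marker not in ("S", "H"):
            raise ValueError("only S and H take an explicit length")
        return length
    return FIXED_SIZES[field]

def record_payload_size(fields):
    """Sum the payload sizes of all fields in a record."""
    return sum(field_size(f) for f in fields)

# A record with a 4-byte id, a 1-byte dictionary index, a 4-byte offset
# index and a fixed 4-byte string (marker choices are assumed)
print(record_payload_size(["l", "U", "l", ("S", 4)]))  # 13
```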
The extension type in BJData provides a mechanism for storing application-specific
or predefined binary data types that are not natively supported by the core BJData
specification. This enables interoperability with other systems and future extensibility
without modifying the core format.
An extension value is encoded using the marker E followed by two integers and
a binary payload:
[E][type-id][byte-length][payload]
where:
- E (0x45) - The 1-byte ASCII marker indicating an extension type
- type-id - An integer value (i, U, I, u, l, m, L, or M) specifying the extension type identifier
- byte-length - An integer value (i, U, I, u, l, m, L, or M) specifying the length of the payload in bytes
- payload - A contiguous byte-stream of the specified length containing the extension data
| Range | Description |
|---|---|
| 0–255 | Reserved for predefined types defined by this specification |
| 256+ | Application-specific types for user-defined extensions |
Applications can assign type IDs 256 and above for custom data types. The meaning of these IDs is determined by the application and should be documented separately.
Reserved extension types (0–255) are limited to fixed-byte-length records to ensure predictable parsing and efficient storage. Each reserved type ID corresponds to exactly one fixed payload size.
An extension with type ID 10 (UUID) containing 16 bytes of data:
[E][U][10][U][16][...16 bytes of UUID data...]
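The framing above can be sketched in Python as follows. This is a minimal illustration, not a reference encoder: `encode_uint` and `encode_extension` are hypothetical names, and choosing the smallest unsigned integer marker for the type ID and the byte length is an assumption (any of the listed integer markers is permitted).

```python
import struct

# (upper bound, marker, struct format) for non-negative values, smallest first
_UINT_FORMATS = [(0xFF, b"U", "<B"), (0xFFFF, b"u", "<H"),
                 (0xFFFFFFFF, b"m", "<I"), (0xFFFFFFFFFFFFFFFF, b"M", "<Q")]

def encode_uint(value):
    """Encode a non-negative integer as marker + Little-Endian bytes."""
    for upper, marker, fmt in _UINT_FORMATS:
        if 0 <= value <= upper:
            return marker + struct.pack(fmt, value)
    raise ValueError("value out of range for an unsigned BJData integer")

def encode_extension(type_id, payload):
    """Build [E][type-id][byte-length][payload]."""
    payload = bytes(payload)
    return b"E" + encode_uint(type_id) + encode_uint(len(payload)) + payload

record = encode_extension(10, bytes(16))   # 16 zero bytes as a stand-in UUID
print(record[:5])  # b'EU\nU\x10'
```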
The following extension type IDs are reserved and defined by this specification. Each reserved type has exactly one fixed payload size for unambiguous parsing. Parsers that do not recognize a reserved type ID should treat the extension as opaque binary data and preserve it for round-trip serialization.
| Type ID | Name | Payload Size | Description |
|---|---|---|---|
| 0 | Reserved | — | Reserved for future use |
| 1 | epoch_s | 4 bytes | Epoch time in seconds (uint32) |
| 2 | epoch_us | 8 bytes | Epoch time in microseconds (int64) |
| 3 | epoch_ns | 12 bytes | Epoch time in nanoseconds (int64 + uint32) |
| 4 | date | 4 bytes | Calendar date (year, month, day) |
| 5 | time_s | 4 bytes | Time of day in seconds |
| 6 | datetime_us | 8 bytes | Date and time in microseconds since epoch |
| 7 | timedelta_us | 8 bytes | Time duration in microseconds |
| 8 | complex64 | 8 bytes | Complex number (single precision) |
| 9 | complex128 | 16 bytes | Complex number (double precision) |
| 10 | uuid | 16 bytes | Universally Unique Identifier (RFC 4122) |
| 11–255 | Reserved | — | Reserved for future specification |
Represents an instantaneous point in time as seconds since the Unix epoch (1970-01-01 00:00:00 UTC).
seconds (uint32)
║──────┬──────┬──────┬──────║
0 1 2 3 4
- seconds: Unsigned 32-bit integer, Little-Endian
- Range: 1970-01-01 00:00:00 to 2106-02-07 06:28:15 UTC
- Precision: 1 second
Timestamp for 2024-01-15 10:30:00 UTC (epoch = 1705314600 = 0x65A50928):
[E][U][1][U][4][0x28][0x09][0xA5][0x65]
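A payload of this type can be produced and read back with Python's standard library, as sketched below. The helper names `epoch_s_payload` and `epoch_s_decode` are hypothetical, and the sketch assumes timezone-aware `datetime` objects (a naive `datetime` would be interpreted in local time by `timestamp()`).

```python
import struct
from datetime import datetime, timezone

def epoch_s_payload(dt):
    """Pack a timezone-aware datetime as uint32 seconds, Little-Endian."""
    seconds = int(dt.timestamp())
    if not 0 <= seconds <= 0xFFFFFFFF:
        raise ValueError("outside the epoch_s range (1970 to 2106)")
    return struct.pack("<I", seconds)

def epoch_s_decode(payload):
    """Unpack a 4-byte epoch_s payload into a UTC datetime."""
    (seconds,) = struct.unpack("<I", payload)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

stamp = datetime(2024, 1, 15, 10, 30, 0, tzinfo=timezone.utc)
payload = epoch_s_payload(stamp)
print(payload.hex(" "), epoch_s_decode(payload).isoformat())
```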
Represents an instantaneous point in time as microseconds since the Unix epoch
(1970-01-01 00:00:00 UTC). Compatible with Python's datetime.timestamp() * 1e6.
microseconds (int64)
║──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────║
0 1 2 3 4 5 6 7 8
- microseconds: Signed 64-bit integer, Little-Endian
- Positive values: dates after 1970-01-01 00:00:00 UTC
- Negative values: dates before 1970-01-01 00:00:00 UTC
- Range: approximately ±292,471 years from epoch
- Precision: 1 microsecond
epoch_us = unix_timestamp_seconds * 1_000_000 + microseconds
Timestamp for 2024-01-15 10:30:00.123456 UTC (epoch_us = 1705314600123456 = 0x00060EF97EF87C40):
[E][U][2][U][8][0x40][0x7C][0xF8][0x7E][0xF9][0x0E][0x06][0x00]
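Multiplying a floating-point `timestamp()` by 1e6 can introduce rounding in the last digit, so an encoder may prefer integer arithmetic. The following sketch shows one way to do this with `timedelta` floor division; `to_epoch_us` and `from_epoch_us` are hypothetical helper names.

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_epoch_us(dt):
    """Exact integer microseconds since the Unix epoch (negative before 1970)."""
    return (dt - EPOCH) // timedelta(microseconds=1)

def from_epoch_us(value):
    """Inverse of to_epoch_us, returning a UTC datetime."""
    return EPOCH + timedelta(microseconds=value)

stamp = datetime(2024, 1, 15, 10, 30, 0, 123456, tzinfo=timezone.utc)
epoch_us = to_epoch_us(stamp)
payload = struct.pack("<q", epoch_us)          # signed int64, Little-Endian
assert from_epoch_us(struct.unpack("<q", payload)[0]) == stamp

before_1970 = to_epoch_us(datetime(1969, 12, 31, 23, 59, 59, tzinfo=timezone.utc))
print(before_1970)  # -1000000
```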
Represents an instantaneous point in time with nanosecond precision, stored as
seconds plus nanoseconds since the Unix epoch. Values within the range of NumPy's datetime64[ns] (a single int64 nanosecond count) can be converted to and from that type.
seconds (int64) nanoseconds (uint32)
║────┬────┬────┬────┬────┬────┬────┬────╫────┬────┬────┬────║
0 1 2 3 4 5 6 7 8 9 10 11 12
- seconds: Signed 64-bit integer (bytes 0–7), Little-Endian
- Signed to support dates before 1970
- Range: approximately ±292 billion years
- nanoseconds: Unsigned 32-bit integer (bytes 8–11), Little-Endian
- Range: [0, 999999999]
- Sub-second component, always non-negative
- Precision: 1 nanosecond
Timestamp for 2024-01-15 10:30:00.123456789 UTC:
- Seconds: 1705314600
- Nanoseconds: 123456789
[E][U][3][U][12][0x28][0x09][0xA5][0x65][0x00][0x00][0x00][0x00][0x15][0xCD][0x5B][0x07]
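Because the nanoseconds field is always non-negative, instants before 1970 need a floor-style split into the two fields. The sketch below assumes the input is a single integer count of nanoseconds since the epoch; `epoch_ns_payload` and `epoch_ns_decode` are hypothetical names.

```python
import struct

NS_PER_S = 1_000_000_000

def epoch_ns_payload(total_ns):
    """Split an integer nanosecond count into (int64 seconds, uint32 nanoseconds)."""
    # divmod floors toward negative infinity, so the remainder stays in [0, 999999999]
    seconds, nanoseconds = divmod(total_ns, NS_PER_S)
    return struct.pack("<qI", seconds, nanoseconds)   # 8 + 4 bytes, no padding

def epoch_ns_decode(payload):
    """Recombine the two fields into a single nanosecond count."""
    seconds, nanoseconds = struct.unpack("<qI", payload)
    return seconds * NS_PER_S + nanoseconds

# Half a second before the epoch: seconds = -1, nanoseconds = 500000000
print(divmod(-500_000_000, NS_PER_S))        # (-1, 500000000)
print(len(epoch_ns_payload(-500_000_000)))   # 12
```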
Represents a calendar date without time-of-day information.
year (int16) month day
║──────┬──────╫──────╫──────║
0 1 2 3 4
- year: Signed 16-bit integer (bytes 0–1), Little-Endian
- Range: -32768 to 32767
- month: Unsigned 8-bit integer (byte 2)
- Range: [1, 12]
- day: Unsigned 8-bit integer (byte 3)
- Range: [1, 31]
- Uses Gregorian calendar
Date for 2024-01-15:
- Year: 2024 (0x07E8)
- Month: 1
- Day: 15
[E][U][4][U][4][0xE8][0x07][0x01][0x0F]
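The date layout maps directly onto a `struct` format string. The sketch below uses Python's `datetime.date`, which only covers years 1 to 9999 and is therefore narrower than the int16 year field; `date_payload` and `date_decode` are hypothetical names.

```python
import struct
from datetime import date

def date_payload(d):
    """Pack year (int16, Little-Endian), month (uint8) and day (uint8)."""
    return struct.pack("<hBB", d.year, d.month, d.day)

def date_decode(payload):
    """Unpack a 4-byte date payload; date() validates month and day."""
    year, month, day = struct.unpack("<hBB", payload)
    return date(year, month, day)

payload = date_payload(date(2024, 1, 15))
print(payload.hex(" "))  # e8 07 01 0f
```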
Represents a time of day with second precision, without date information.
hour min sec (rsv)
║──────╫──────╫──────╫──────║
0 1 2 3 4
- hour: Unsigned 8-bit integer (byte 0), range [0, 23]
- minute: Unsigned 8-bit integer (byte 1), range [0, 59]
- second: Unsigned 8-bit integer (byte 2), range [0, 60] (60 for leap second)
- reserved: Byte 3, set to 0
Time for 10:30:45:
[E][U][5][U][4][0x0A][0x1E][0x2D][0x00]
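A minimal packing sketch for this layout is shown below. It takes plain integers instead of Python's `datetime.time`, since that class cannot represent a leap second (second = 60); `time_s_payload` is a hypothetical name.

```python
import struct

def time_s_payload(hour, minute, second):
    """Pack hour, minute, second and a zero reserved byte."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59 and 0 <= second <= 60):
        raise ValueError("time-of-day field out of range")
    return struct.pack("<4B", hour, minute, second, 0)

print(time_s_payload(10, 30, 45).hex(" "))  # 0a 1e 2d 00
```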
Represents a date and time as microseconds since the Unix epoch. This is
functionally identical to epoch_us (Type ID: 2) but semantically emphasizes
the datetime interpretation.
microseconds (int64)
║──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────║
0 1 2 3 4 5 6 7 8
- microseconds: Signed 64-bit integer, Little-Endian
- Range: approximately ±292,471 years from epoch
- Precision: 1 microsecond
An extension datetime record for 2024-01-15 10:30:00.123456 UTC:
[E][U][6][U][8][0x40][0x7C][0xF8][0x7E][0xF9][0x0E][0x06][0x00]
Represents a time duration in microseconds. Compatible with Python's
timedelta.total_seconds() * 1e6.
microseconds (int64)
║──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────║
0 1 2 3 4 5 6 7 8
- microseconds: Signed 64-bit integer, Little-Endian
- Positive values: forward duration
- Negative values: backward duration
- Range: approximately ±292,471 years
- Precision: 1 microsecond
total_seconds = timedelta_us / 1_000_000
days = timedelta_us // 86_400_000_000
Duration of 5 days, 3 hours, 30 minutes, 15.5 seconds (= 444615500000 μs):
[E][U][7][U][8][0xE0][0x20][0x26][0x85][0x67][0x00][0x00][0x00]
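The conversion formulas above can be checked with Python's `timedelta`, as in the following sketch. Integer floor division by a one-microsecond `timedelta` avoids floating-point rounding; `timedelta_us_payload` and `timedelta_us_decode` are hypothetical names.

```python
import struct
from datetime import timedelta

def timedelta_us_payload(delta):
    """Pack a duration as signed int64 microseconds, Little-Endian."""
    return struct.pack("<q", delta // timedelta(microseconds=1))

def timedelta_us_decode(payload):
    """Unpack an 8-byte timedelta_us payload into a timedelta."""
    (micros,) = struct.unpack("<q", payload)
    return timedelta(microseconds=micros)

delta = timedelta(days=5, hours=3, minutes=30, seconds=15.5)
micros = delta // timedelta(microseconds=1)
print(micros)                     # 444615500000
print(micros // 86_400_000_000)   # 5 (whole days)
print(micros / 1_000_000)         # 444615.5 (total seconds)
```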
Represents a complex number with single-precision (32-bit) floating-point
components. Compatible with NumPy's complex64 data type.
real (float32) imaginary (float32)
║──────┬──────┬──────┬──────╫──────┬──────┬──────┬──────║
0 1 2 3 4 5 6 7 8
- real: 32-bit float (bytes 0–3), Little-Endian, IEEE 754
- imaginary: 32-bit float (bytes 4–7), Little-Endian, IEEE 754
A complex number z = a + bi is stored as [a][b].
Complex number 3.0 + 4.0i:
[E][U][8][U][8][0x00][0x00][0x40][0x40][0x00][0x00][0x80][0x40]
Represents a complex number with double-precision (64-bit) floating-point
components. Compatible with NumPy's complex128 data type and Python's complex.
real (float64) imaginary (float64)
║────┬────┬────┬────┬────┬────┬────┬────╫────┬────┬────┬────┬────┬────┬────┬────║
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
- real: 64-bit float (bytes 0–7), Little-Endian, IEEE 754
- imaginary: 64-bit float (bytes 8–15), Little-Endian, IEEE 754
Complex number 3.0 + 4.0i:
[E][U][9][U][16][0x00][0x00][0x00][0x00][0x00][0x00][0x08][0x40]
[0x00][0x00][0x00][0x00][0x00][0x00][0x10][0x40]
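Both complex layouts are a pair of IEEE 754 floats stored real part first, so one helper can cover them. The sketch below is illustrative only: `complex_payload` and `complex_decode` are hypothetical names, and selecting the precision by payload length relies on the fixed sizes of the two types (8 and 16 bytes).

```python
import struct

def complex_payload(z, double=False):
    """Pack z as [real][imaginary], float32 by default or float64 if double."""
    return struct.pack("<dd" if double else "<ff", z.real, z.imag)

def complex_decode(payload):
    """Unpack an 8-byte (complex64) or 16-byte (complex128) payload."""
    fmt = {8: "<ff", 16: "<dd"}[len(payload)]
    real, imag = struct.unpack(fmt, payload)
    return complex(real, imag)

z = 3.0 + 4.0j
print(complex_payload(z).hex(" "))                      # 00 00 40 40 00 00 80 40
print(complex_decode(complex_payload(z, double=True)))  # (3+4j)
```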
Represents a 128-bit Universally Unique Identifier as defined in RFC 4122.
time_low time_mid time_hi clk_hi clk_lo node
║────┬────┬────┬────╫────┬────╫────┬────╫────╫────╫────┬────┬────┬────┬────┬────║
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
- time_low: bytes 0–3, Big-Endian
- time_mid: bytes 4–5, Big-Endian
- time_hi_and_version: bytes 6–7, Big-Endian
- clock_seq_hi_and_reserved: byte 8
- clock_seq_low: byte 9
- node: bytes 10–15
The 16 bytes are stored in standard UUID byte order (network byte order / Big-Endian) as per RFC 4122. This is an exception to BJData's Little-Endian convention.
The format supports all UUID versions (1, 4, 5, 7, etc.) as defined by RFC 4122 and RFC 9562.
UUID 550e8400-e29b-41d4-a716-446655440000:
[E][U][10][U][16][0x55][0x0e][0x84][0x00][0xe2][0x9b][0x41][0xd4]
[0xa7][0x16][0x44][0x66][0x55][0x44][0x00][0x00]
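In Python, the standard `uuid` module already exposes the Big-Endian byte order described above through the `UUID.bytes` attribute. The sketch below uses it for the payload. Note that `UUID.bytes_le` stores the first three fields in Little-Endian order and would not match this layout.

```python
import uuid

u = uuid.UUID("550e8400-e29b-41d4-a716-446655440000")
payload = u.bytes                       # 16 bytes, RFC 4122 (Big-Endian) order
print(payload.hex(" "))
# 55 0e 84 00 e2 9b 41 d4 a7 16 44 66 55 44 00 00

restored = uuid.UUID(bytes=payload)     # decoding side
print(restored == u)                    # True
```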
When a parser encounters an extension type ID it does not recognize:
- Reserved range (0–255): The parser should preserve the extension as opaque binary data to enable round-trip serialization. It may optionally issue a warning.
- Application range (256+): The parser should either:
  - Pass the extension to an application-provided handler, or
  - Preserve it as opaque binary data, or
  - Raise an error (configurable behavior)
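One possible shape for this policy is sketched below in Python. The function names, the `handlers` dictionary, the `strict` flag and the `("opaque", type_id, payload)` return value are all assumptions made for illustration; the specification does not prescribe a decoder API.

```python
import struct

# Integer markers accepted for the type-id and byte-length fields
_INT_FORMATS = {b"U": "<B", b"i": "<b", b"u": "<H", b"I": "<h",
                b"l": "<i", b"m": "<I", b"L": "<q", b"M": "<Q"}

def _read_int(buf, pos):
    """Read one marker-prefixed integer; return (value, next position)."""
    fmt = _INT_FORMATS.get(buf[pos:pos + 1])
    if fmt is None:
        raise ValueError("expected an integer marker")
    (value,) = struct.unpack_from(fmt, buf, pos + 1)
    return value, pos + 1 + struct.calcsize(fmt)

def decode_extension(buf, handlers=None, strict=False):
    """Decode one extension value, applying the unknown-type policy."""
    if buf[:1] != b"E":
        raise ValueError("not an extension value")
    type_id, pos = _read_int(buf, 1)
    length, pos = _read_int(buf, pos)
    payload = bytes(buf[pos:pos + length])
    if len(payload) != length:
        raise ValueError("truncated extension payload")
    handler = (handlers or {}).get(type_id)
    if handler is not None:
        return handler(payload)              # recognized or application-handled
    if type_id >= 256 and strict:
        raise ValueError("unrecognized application extension %d" % type_id)
    return ("opaque", type_id, payload)      # preserved for round-tripping

# Unknown reserved ID 200 is kept as opaque data
print(decode_extension(b"EU\xc8U\x02\x01\x02"))  # ('opaque', 200, b'\x01\x02')
```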
| Payload Size | Type IDs |
|---|---|
| 4 bytes | epoch_s (1), date (4), time_s (5) |
| 8 bytes | epoch_us (2), datetime_us (6), timedelta_us (7), complex64 (8) |
| 12 bytes | epoch_ns (3) |
| 16 bytes | complex128 (9), uuid (10) |
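Since every defined reserved type has exactly one payload size, a decoder can check the byte-length field against a lookup table before interpreting the payload. The sketch below shows such a check in Python; `RESERVED_PAYLOAD_SIZES` and `has_expected_size` are hypothetical names, and how a mismatch is handled (error, warning, or opaque preservation) is left to the implementation.

```python
# Payload size in bytes for each reserved type ID defined above
RESERVED_PAYLOAD_SIZES = {
    1: 4, 4: 4, 5: 4,          # epoch_s, date, time_s
    2: 8, 6: 8, 7: 8, 8: 8,    # epoch_us, datetime_us, timedelta_us, complex64
    3: 12,                     # epoch_ns
    9: 16, 10: 16,             # complex128, uuid
}

def has_expected_size(type_id, payload):
    """Return True if the payload length matches the defined size, or if
    the type ID has no defined size (undefined reserved or application ID)."""
    expected = RESERVED_PAYLOAD_SIZES.get(type_id)
    return expected is None or len(payload) == expected

print(has_expected_size(10, bytes(16)))  # True
print(has_expected_size(3, bytes(8)))    # False
```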
All multi-byte numeric values in extension payloads are stored in Little-Endian order, consistent with the rest of the BJData specification, with the exception of UUID which follows RFC 4122 (network byte order / Big-Endian).
For Binary JData files, the recommended file suffix is ".bjd".
The MIME type for a Binary JData document is "application/jdata-binary".
The BJData spec is derived from the Universal Binary JSON (UBJSON, https://ubjson.org) specification (Draft 12) developed by Riyad Kalla and other UBJSON contributors.
The initial version of this Markdown-formatted specification was derived from the documentation included in the Py-UBJSON repository (Commit 5ce1fe7).
This specification was developed as part of the NeuroJSON project (https://neurojson.org) with funding support from the US National Institutes of Health (NIH) under grant U24-NS124027.