Binary JData: A portable interchange format for complex binary data

The Binary JData (BJData) Specification defines an efficient serialization protocol for unambiguously storing complex and strongly-typed binary data found in diverse applications. The BJData specification is the binary counterpart to the JSON format, both of which are used to serialize complex data structures supported by the JData specification (https://neurojson.org/jdata). The BJData spec is derived and extended from the Universal Binary JSON (UBJSON, https://ubjson.org) specification (Draft 12). It adds support for N-dimensional packed arrays and extended binary data types.

Introduction

The JavaScript Object Notation (JSON) format is ubiquitously used in today's web and native applications. JSON offers numerous advantages, including simplicity, human- and machine-readability, and versatility, with a large toolchain ecosystem ranging from numerous parsers to highly efficient document-store databases. However, its plain-text nature and schema-less design make it difficult to efficiently store and exchange strongly-typed binary data, including numerical and multi-dimensional arrays and other complex data structures generated in scientific research and imaging applications.

In recent years, efforts to address these limitations have resulted in an array of binary JSON formats, such as BSON (Binary JSON, https://bson.org), UBJSON (Universal Binary JSON, https://ubjson.org), MessagePack (https://msgpack.org), and CBOR (Concise Binary Object Representation, [RFC 7049], https://cbor.io), among others. These binary JSON counterparts are broadly adopted in speed/space-sensitive data applications and vary in terms of supported data types, flexibility of containers (arrays and objects), and associated libraries.

The Binary JData (BJData) format is a binary JSON format extended from the UBJSON Specification Draft 12 (https://github.com/ubjson/universal-binary-json) by adding native support for N-dimensional packed arrays - an essential data structure for scientific applications - and columnar table format for packed objects to offer efficient storage of tables and repeating objects, such as Python Pandas DataFrame and NumPy arrays-of-objects. These optimized container constructs greatly reduce storage overhead and redundancy, enhancing data exchange efficiency.

In addition, BJData adds binary data types that UBJSON does not support, including unsigned integer types, half-precision floating-point numbers, a native byte type, as well as user-defined binary extensions. With these extensions, a BJData file can store binary arrays larger than 4 GB in size, which is not possible with MessagePack (maximum data record size is limited to 4 GB) or BSON (maximum total file size is 4 GB). Despite these extensions, the BJData format remains simple to parse.

An outstanding benefit of the BJData format (and its predecessor UBJSON) as opposed to other more popular binary JSON formats, such as BSON, CBOR, and MessagePack, is its quasi-human-readability - a unique characteristic that is absent from almost all other binary formats. This is because all data semantic elements in a BJData file, e.g. the "name" fields and data-type markers, are defined in human-readable strings. The resulting binary can be partially readable using an editor with minimal or no processing. We anticipate that such a unique capability makes a data file self-explanatory and easy to support without compromising computational and storage efficiency.

License

The Binary JData Specification is licensed under the Apache 2.0 License.

Format Specification

A single construct with two optional segments (length and data) is used for all types:

[type, 1-byte char]([integer numeric length])([data])

Each element in the tuple is defined as:

  • type - A 1-byte ASCII char (Marker) used to indicate the type of the data following it.

  • length (optional) - A positive, integer numeric type (uint8, int8, uint16, int16, uint32, int32, uint64 or int64) specifying the length of the following data payload.

  • data (optional) - A contiguous byte-stream containing serialized binary data representing the actual binary data for this type of value.

Notes

  • Some values are simple enough that just writing the 1-byte ASCII marker into the stream is enough to represent the value (e.g. null), while others have a type that is specific enough that no length is needed as the length is implied by the type (e.g. int32). Yet others still require both a type and a length to communicate their value (e.g. string). In addition, some values (e.g. array) have additional (optional) parameters to improve decoding efficiency and/or to reduce the size of the encoded value even further.

  • The BJData specification (since Draft-2) requires that all numeric values must be written in the Little-Endian order. This is a breaking change compared to BJData Draft-1 and UBJSON Draft-12, where numeric values are in Big-Endian order.

  • The array and object data types are container types, similar to JSON arrays and objects. They help partition and organize data records of all types, including container types, into composite and complex records.

  • Using a strongly-typed container construct can further reduce data file sizes as it extracts common data type markers to the header. It also helps a parser to pre-allocate necessary memory buffers before reading the data payload.

In the following sections, we use a block-notation to illustrate the layout of the encoded data. In this notation, the data type markers and individual data payloads are enclosed by a pair of [], strictly for illustration purposes. Both illustration markers [ and ] as well as the whitespaces between these data elements, if present, shall be ignored when performing the actual data storage.

| Type | Total size | ASCII Marker(s) | Length required | Data (payload) |
|---|---|---|---|---|
| null | 1 byte | Z | No | No |
| no-op | 1 byte | N | No | No |
| true | 1 byte | T | No | No |
| false | 1 byte | F | No | No |
| int8 | 2 bytes | i | No | Yes |
| uint8 | 2 bytes | U | No | Yes |
| int16 | 3 bytes | I (upper case i) | No | Yes |
| uint16* | 3 bytes | u | No | Yes |
| int32 | 5 bytes | l (lower case L) | No | Yes |
| uint32* | 5 bytes | m | No | Yes |
| int64 | 9 bytes | L | No | Yes |
| uint64* | 9 bytes | M | No | Yes |
| float16/half* | 3 bytes | h | No | Yes |
| float32/single | 5 bytes | d | No | Yes |
| float64/double | 9 bytes | D | No | Yes |
| high-precision number | 1 byte + int num val + string byte len | H | Yes | Yes |
| char | 2 bytes | C | No | Yes |
| byte | 2 bytes | B | No | Yes |
| extension | 1 byte + int num val + int num val + payload | E | Yes | Yes (if not empty) |
| string | 1 byte + int num val + string byte len | S | Yes | Yes (if not empty) |
| array | 2+ bytes | [ and ] | Optional | Yes (if not empty) |
| object | 2+ bytes | { and } | Optional | Yes (if not empty) |

* Data type markers that are not defined in the UBJSON Specification (Draft 12)
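The [type]([length])([data]) construct can be illustrated with a minimal decoder for non-container values. This is only a sketch: decode_scalar, FIXED and SIMPLE are hypothetical names, the no-op (N), extension (E) and container markers are deliberately left out, and the struct format codes are Python's own and unrelated to BJData markers.

```python
import struct

# BJData marker -> Python struct format (all little-endian).
FIXED = {
    b"i": "<b", b"U": "<B", b"I": "<h", b"u": "<H",
    b"l": "<i", b"m": "<I", b"L": "<q", b"M": "<Q",
    b"h": "<e", b"d": "<f", b"D": "<d",
}
SIMPLE = {b"Z": None, b"T": True, b"F": False}


def decode_scalar(buf, pos=0):
    """Decode one non-container value; return (value, next_position)."""
    marker = buf[pos:pos + 1]
    pos += 1
    if marker in SIMPLE:              # marker only: no length, no payload
        return SIMPLE[marker], pos
    if marker in FIXED:               # payload length is implied by the type
        fmt = FIXED[marker]
        value = struct.unpack_from(fmt, buf, pos)[0]
        return value, pos + struct.calcsize(fmt)
    if marker == b"C":                # single ASCII character
        return buf[pos:pos + 1].decode("ascii"), pos + 1
    if marker == b"B":                # single raw byte
        return buf[pos], pos + 1
    if marker in (b"S", b"H"):        # explicit integer length, then payload
        length, pos = decode_scalar(buf, pos)
        return buf[pos:pos + length].decode("utf-8"), pos + length
    raise ValueError(f"unsupported marker {marker!r}")
```

In this sketch a high-precision number (H) is returned as its decimal string; converting it to a numeric type is left to the caller.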

The null value is equivalent to the null value from the JSON specification.

Example

In JSON:

{
    "passcode": null
}

In BJData (using block-notation):

[{]
    [i][8][passcode][Z]
[}]

There is no equivalent to the no-op value in the original JSON specification. When decoding, No-Op values should be skipped.

The intended usage of the no-op value is as a valueless signal between a producer (most likely a server) and a consumer (most likely a client) to indicate activity, for example, as a keep-alive signal so that a client knows a server is still working and hasn't hung or timed out.


A Boolean type is equivalent to the Boolean value (true or false) defined in the JSON specification.

Example

In JSON:

{
    "authorized": true,
    "verified": false
}

In BJData (using block-notation):

[{]
    [i][10][authorized][T]
    [i][8][verified][F]
[}]

Unlike in JSON, which has a single Number type (used for both integers and floating point numbers), BJData defines multiple types for integers. The minimum/maximum of values (inclusive) for each integer type are as follows:

| Type | Signed | Minimum | Maximum |
|---|---|---|---|
| int8 | Yes | -128 | 127 |
| uint8 | No | 0 | 255 |
| int16 | Yes | -32,768 | 32,767 |
| uint16 | No | 0 | 65,535 |
| int32 | Yes | -2,147,483,648 | 2,147,483,647 |
| uint32 | No | 0 | 4,294,967,295 |
| int64 | Yes | -9,223,372,036,854,775,808 | 9,223,372,036,854,775,807 |
| uint64 | No | 0 | 18,446,744,073,709,551,615 |
| float16/half | Yes | See IEEE 754 Spec | See IEEE 754 Spec |
| float32/single | Yes | See IEEE 754 Spec | See IEEE 754 Spec |
| float64/double | Yes | See IEEE 754 Spec | See IEEE 754 Spec |
| high-precision number | Yes | Infinite | Infinite |

Notes:

  • Numeric values of +infinity, -infinity and NaN are to be encoded using their respective IEEE 754 binary form; this is different from the UBJSON specification where NaN and infinity are converted to null.
  • It is advisable to use the smallest applicable type when encoding a number.
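One way to pick the smallest applicable integer type is to test the ranges from the table above in increasing width. The sketch below is illustrative only: smallest_int_marker is a hypothetical helper, and trying the signed type before the unsigned type of the same width is an assumption, not something the specification mandates.

```python
# (marker, minimum, maximum), narrowest first; signed is tried before unsigned.
INT_TYPES = [
    ("i", -2**7, 2**7 - 1), ("U", 0, 2**8 - 1),
    ("I", -2**15, 2**15 - 1), ("u", 0, 2**16 - 1),
    ("l", -2**31, 2**31 - 1), ("m", 0, 2**32 - 1),
    ("L", -2**63, 2**63 - 1), ("M", 0, 2**64 - 1),
]


def smallest_int_marker(value):
    """Return the marker of the narrowest integer type that can hold value."""
    for marker, lo, hi in INT_TYPES:
        if lo <= value <= hi:
            return marker
    # Outside every fixed-width range: fall back to a high-precision number.
    return "H"
```

For instance, 255 maps to U, 32768 maps to u, and any value beyond the uint64 range falls back to H.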

Integer

All integer types (uint8, int8, uint16, int16, uint32, int32, uint64 and int64) are written in Little-Endian order (this is different from UBJSON, where all integers are written in Big-Endian order).

Float

All float types (half, single, double) are written in Little-Endian order (this is different from UBJSON which does not specify the Endianness of floats).
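The byte-order rule can be checked with a short sketch using Python's struct module, where the "<" prefix selects little-endian packing. The helper names encode_int32 and encode_float32 are hypothetical.

```python
import struct


def encode_int32(value):
    """Marker 'l' followed by 4 little-endian bytes."""
    return b"l" + struct.pack("<i", value)


def encode_float32(value):
    """Marker 'd' followed by the 4-byte IEEE 754 single, little-endian."""
    return b"d" + struct.pack("<f", value)


little = encode_int32(1)            # least-significant byte comes first
big = b"l" + struct.pack(">i", 1)   # big-endian layout, shown for contrast only
infinity = encode_float32(float("inf"))  # stored in IEEE 754 form, not as null
```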

High-Precision

These are encoded as a string and thus are only limited by the maximum string size. Values must be written out in accordance with the original JSON number type specification.

Examples

Numeric values in JSON:

{
    "int8": 16,
    "uint8": 255,
    "int16": 32767,
    "uint16": 32768,
    "int32": 2147483647,
    "int64": 9223372036854775807,
    "uint64": 9223372036854775808,
    "float32": 3.14,
    "float64": 113243.7863123,
    "huge1": "3.14159265358979323846",
    "huge2": "-1.93+E190",
    "huge3": "719..."
}

In BJData (using block-notation):

[{]
    [i][4][int8][i][16]
    [i][5][uint8][U][255]
    [i][5][int16][I][32767]
    [i][6][uint16][u][32768]
    [i][5][int32][l][2147483647]
    [i][5][int64][L][9223372036854775807]
    [i][6][uint64][M][9223372036854775808]
    [i][7][float32][d][3.14]
    [i][7][float64][D][113243.7863123]
    [i][5][huge1][H][i][22][3.14159265358979323846]
    [i][5][huge2][H][i][10][-1.93+E190]
    [i][5][huge3][H][U][200][719...]
[}]

The char type in BJData is an unsigned byte meant to represent a single printable ASCII character (decimal values 0-127). It must not have a decimal value larger than 127. It is functionally identical to the uint8 type, but semantically is meant to represent a character and not a numeric value.

Example

Char values in JSON:

{
    "rolecode": "a",
    "delim": ";"
}

BJData (using block-notation):

[{]
    [i][8][rolecode][C][a]
    [i][5][delim][C][;]
[}]

The byte type in BJData is functionally identical to the uint8 type, but semantically is meant to represent a byte and not a numeric value. In particular, when used as the strong type of an array container it provides a hint to the parser that an optimized data storage format may be used as opposed to a generic array of integers.

See also optimized format below.

Example

Byte values in JSON:

{
    "binary": [222, 173, 190, 239],
    "val": 123
}

BJData (using block-notation):

[{]
    [i][6][binary] [[] [$][B] [#][i][4] [222][173][190][239]
    [i][3][val][B][123]
[}]

The string type in BJData is equivalent to the string type from the JSON specification apart from the fact that BJData string value requires UTF-8 encoding.

Example

String values in JSON:

{
    "username": "andy",
    "imagedata": "...huge string payload..."
}

BJData (using block-notation):

[{]
    [i][8][username][S][i][4][andy]
    [i][9][imagedata][S][l][2097152][...huge string payload...]
[}]

See also optimized format below.

The array type in BJData is equivalent to the array type from the JSON specification.

The child elements of an array are ordered and can be accessed by their indices.

Example

Array in JSON:

[
    null,
    true,
    false,
    4782345193,
    153.132,
    "ham"
]

BJData (using block-notation):

[[]
    [Z]
    [T]
    [F]
    [L][4782345193]
    [d][153.132]
    [S][i][3][ham]
[]]

The object type in BJData is equivalent to the object type from the JSON specification. Since value names can only be strings, the S (string) marker must not be included since it is redundant.

The child elements of an object are ordered and can be accessed by their names.

Example

Object in JSON:

{
    "post": {
        "id": 1137,
        "author": "Andy",
        "timestamp": 1364482090592,
        "body": "The quick brown fox jumps over the lazy dog"
    }
}

BJData (using block-notation):

[{]
    [i][4][post][{]
        [i][2][id][I][1137]
        [i][6][author][S][i][4][Andy]
        [i][9][timestamp][L][1364482090592]
        [i][4][body][S][i][43][The quick brown fox jumps over the lazy dog]
    [}]
[}]
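The non-optimized array and object layouts can be sketched as a small recursive encoder. Everything here is illustrative: encode and enc_int are hypothetical names, Python floats are always written as float64 (D) for simplicity, and integer widths are chosen narrowest-first as an assumption. Note that object names are written as length plus bytes, without the S marker.

```python
import struct

INT_TYPES = [
    (b"i", "<b", -2**7, 2**7 - 1), (b"U", "<B", 0, 2**8 - 1),
    (b"I", "<h", -2**15, 2**15 - 1), (b"u", "<H", 0, 2**16 - 1),
    (b"l", "<i", -2**31, 2**31 - 1), (b"m", "<I", 0, 2**32 - 1),
    (b"L", "<q", -2**63, 2**63 - 1), (b"M", "<Q", 0, 2**64 - 1),
]


def enc_int(value):
    """Marker plus little-endian payload, using the narrowest fitting type."""
    for marker, fmt, lo, hi in INT_TYPES:
        if lo <= value <= hi:
            return marker + struct.pack(fmt, value)
    raise OverflowError("value needs a high-precision number")


def encode(value):
    if value is None:
        return b"Z"
    if value is True:
        return b"T"
    if value is False:
        return b"F"
    if isinstance(value, int):
        return enc_int(value)
    if isinstance(value, float):
        return b"D" + struct.pack("<d", value)
    if isinstance(value, str):
        data = value.encode("utf-8")
        return b"S" + enc_int(len(data)) + data
    if isinstance(value, list):
        return b"[" + b"".join(encode(item) for item in value) + b"]"
    if isinstance(value, dict):
        out = b"{"
        for name, item in value.items():
            key = name.encode("utf-8")
            out += enc_int(len(key)) + key + encode(item)  # no S marker on names
        return out + b"}"
    raise TypeError(f"cannot encode {type(value).__name__}")
```

Calling encode({"post": {"author": "Andy"}}) yields the bytes {, i, 0x04, post, {, i, 0x06, author, S, i, 0x04, Andy, }, }, mirroring the block notation of the object example.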

Both container types (array and object) support optional parameters that can help optimize the container for better parsing performance and smaller size.

Type - $

When a type is specified, all value types stored in the container (either array or object) are considered to be of that singular type and, as a result, type markers are omitted for each value within the container. This can be thought of as providing the ability to create a strongly-typed container in BJData.

A major difference between BJData and UBJSON is that the type in a BJData strongly-typed container is limited to non-zero, fixed-length data types. Therefore, only integers (i,U,I,u,l,m,L,M), floating-point numbers (h,d,D), char (C) and byte (B) qualify. All zero-length types (T,F,Z,N), variable-length types (S,H) and container types ([,{) shall not be used in an optimized type header. This restriction reduces the security risk of buffer-overflow attacks using zero-length markers, and avoids the hampered readability and diminished benefit of using variable-length or container types in an optimized format.

The requirements for type are

  • If a type is specified, it must be one of i,U,I,u,l,m,L,M,h,d,D,C,B.
  • If a type is specified, it must be done so before a count.
  • If a type is specified, a count must be specified as well. (Otherwise it is impossible to tell when a container is ending, e.g. did you just parse ] or the int8 value of 93?)

Example (uint8 type):

[$][U]

Count - #

When a count is followed by a single non-negative integer record, i.e. one of i,U,I,u,l,m,L,M, it specifies the total child element count. This allows the parser to pre-size any internal construct used for parsing, verify that the promised number of child values were found, and avoid scanning for any terminating bytes while parsing.

  • A count can be specified without a type.

Example (count of 64):

[#][i][64]

Optimized array

An optimized array has a uniform payload, i.e., all data records stored inside the payload section have the same data type and byte length, and thus can help accelerate loading and saving of such data.

Optimized N-dimensional array of uniform type

When both type and count are specified and the count marker # is followed by [, the parser should expect the following sequence to be a 1-D array with zero or more (Ndim) integer elements (Nx, Ny, Nz, ...). This specifies an Ndim-dimensional array of uniform type specified by the type marker after $. The array data are serialized in the row-major format.

For example, the below two block sequences both represent an Nx*Ny*Nz*... array of uniform numeric type:

[[] [$] [type] [#] [[] [$] [Nx type] [#] [Ndim type] [Ndim] [Nx Ny Nz ...]  [Nx*Ny*Nz*...*sizeof(type)]
  or
[[] [$] [type] [#] [[] [Nx type] [Nx] [Ny type] [Ny] [Nz type] [Nz] ... []] [Nx*Ny*Nz*...*sizeof(type)]

where Ndim is the number of dimensions, and Nx, Ny, Nz, ... are all non-negative numbers specifying the dimensions of the N-dimensional array. The Nx/Ny/Nz/Ndim types must be one of the integer types (i,U,I,u,l,m,L,M). The binary data of the N-dimensional array is then serialized into a 1-D vector in the row-major element order (as used in C, C++, JavaScript or Python).

To store an N-dimensional array that is serialized using the column-major element order (as used in MATLAB and FORTRAN), the count marker # should be followed by an array of a single element, which must be a 1-D array of integer type as the dimensional vector above. Either of the arrays can be in optimized or non-optimized form. For example, either of the following

[[] [$] [type] [#] [[] [[] [$] [Nx type] [#] [Ndim type] [Ndim] [Nx Ny Nz ...] []]  [a11 a21 a31 ... a12 a22 ...]
  or
[[] [$] [type] [#] [[] [[] [Nx type] [Nx] [Ny type] [Ny] [Nz type] [Nz] ... []] []] [a11 a21 a31 ... a12 a22 ...]

represents the same column-major N-dimensional array of type and size [Nx, Ny, Nz, ...].

Example (a 2x3x4 uint8 array):

The following 2x3x4 3-D uint8 array

[
     [
          [1,9,6,0],
          [2,9,3,1],
          [8,0,9,6]
      ],
      [
          [6,4,2,7],
          [8,5,1,2],
          [3,3,2,6]
      ]
]

shall be stored using row-major serialized form as

 [[] [$][U] [#][[] [$][U][#][U][3] [2][3][4]
    [1][9][6][0] [2][9][3][1] [8][0][9][6] [6][4][2][7] [8][5][1][2] [3][3][2][6]

or column-major serialized form as

 [[] [$][U] [#][[] [[] [$][U][#][U][3] [2][3][4] []]
    [1][6][2][8] [8][3][9][4] [9][5][0][3] [6][2][3][1] [9][2][0][7] [1][2][6][6]
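The two element orders of the 2x3x4 example can be reproduced with a short pure-Python sketch. The variable names are illustrative, and the header is built with a uint8 (U) count for the dimension vector, which is one of several valid choices.

```python
from itertools import product

data = [
    [[1, 9, 6, 0], [2, 9, 3, 1], [8, 0, 9, 6]],
    [[6, 4, 2, 7], [8, 5, 1, 2], [3, 3, 2, 6]],
]
dims = (2, 3, 4)


def element(index):
    i, j, k = index
    return data[i][j][k]


# Row-major: the last index varies fastest (the natural order of product()).
row_major = [element(idx) for idx in product(*(range(n) for n in dims))]

# Column-major: the first index varies fastest, so iterate over the reversed
# dimensions and flip each index tuple back before the lookup.
col_major = [
    element(idx[::-1]) for idx in product(*(range(n) for n in reversed(dims)))
]

# Row-major stream: [$U# then the optimized dimension vector [$U#U 3 | 2 3 4.
header = b"[$U#" + b"[$U#U" + bytes([len(dims)]) + bytes(dims)
encoded = header + bytes(row_major)
```

The resulting stream is 13 header bytes followed by 24 payload bytes, with no end marker because the dimensions fix the element count.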

Additional rules

  • A count must be >= 0.
  • A count can be specified alone.
  • If a count is specified, the container must not specify an end-marker.
  • A container that specifies a count must contain the specified number of child elements.
  • If a type is specified, it must be done so before count.
  • If a type is specified, a count must also be specified. A type cannot be specified alone.
  • A container that specifies a type must not contain any additional type markers for any contained value.

Array Examples

Optimized with count

[[][#][i][5] // An array of 5 elements.
    [d][29.97]
    [d][31.13]
    [d][67.0]
    [d][2.113]
    [d][23.8889]
// No end marker since a count was specified.

Optimized with both type and count

[[][$][d][#][i][5] // An array of 5 float32 elements.
    [29.97] // Value type is known, so type markers are omitted.
    [31.13]
    [67.0]
    [2.113]
    [23.8889]
// No end marker since a count was specified.
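Because a typed and counted array carries no per-element markers and no end marker, its payload can be packed and unpacked in a single call. The sketch below uses hypothetical helpers and, as a simplifying assumption, always writes the count as uint32 (m) rather than the smallest fitting type.

```python
import struct


def encode_float32_array(values):
    """[$d#m<count> followed by count little-endian float32 values."""
    header = b"[$d#" + b"m" + struct.pack("<I", len(values))
    return header + struct.pack(f"<{len(values)}f", *values)


def decode_float32_array(buf):
    """Inverse of encode_float32_array; rejects any other header layout."""
    if buf[:5] != b"[$d#m":
        raise ValueError("expected a float32 array with a uint32 count")
    (count,) = struct.unpack_from("<I", buf, 5)
    return list(struct.unpack_from(f"<{count}f", buf, 9))


encoded = encode_float32_array([29.97, 31.13, 67.0, 2.113, 23.8889])
decoded = decode_float32_array(encoded)
```

The encoded array occupies 9 header bytes plus 5 x 4 payload bytes. Decoded values match the inputs only to float32 precision.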

Object Examples

Optimized with count

[{][#][i][3] // An object of 3 name:value pairs.
    [i][3][lat][d][29.976]
    [i][4][long][d][31.131]
    [i][3][alt][d][67.0]
// No end marker since a count was specified.

Optimized with both type and count

[{][$][d][#][i][3] // An object of 3 name:float32-value pairs.
    [i][3][lat][29.976] // Value type is known, so type markers are omitted.
    [i][4][long][31.131]
    [i][3][alt][67.0]
// No end marker since a count was specified.

BJData supports Structure-of-Arrays (SoA) as a special type of optimized container to store packed object data in either column-major or row-major order. The payload of an SoA record is packed in column or row order as chunks of binary data of identical byte length. This uniform payload structure makes it efficient to parse and store.

Core Syntax

[[][$]  [{]<schema>[}]  [#]<count>  <payload>    // row-major (interleaved)
[{][$]  [{]<schema>[}]  [#]<count>  <payload>    // column-major (columnar)

where:

  • [ or { - container type (determines memory layout)
  • $ - optimized type marker
  • {<schema>} - payload-less object defining the record structure
  • # - count marker
  • <count> - 1D integer OR ND dimension array
  • <payload> - tightly packed data

Schema Definition

The schema is a payload-less object: keys followed by type markers only, no values.

schema        = '{' 1*(field-def) '}'
field-def     = name type-spec
name          = int-type length string-bytes
type-spec     = fixed-type | bool-type | null-type | string-spec | highprec-spec
              | nested-schema | array-spec
fixed-type    = 'U' | 'i' | 'u' | 'I' | 'l' | 'm' | 'L' | 'M' | 'h' | 'd' | 'D' | 'C' | 'B'
bool-type     = 'T'                        // boolean (1 byte: T or F in payload)
null-type     = 'Z'                        // null (0 bytes in payload)
string-spec   = fixed-string | dict-string | offset-string
fixed-string  = 'S' int-type length        // fixed-size string
dict-string   = '[' '$' 'S' '#' count 1*(string-value)   // dictionary-based string
offset-string = '[' '$' int-type ']'       // offset-table-based variable string
highprec-spec = fixed-highprec | dict-highprec | offset-highprec
fixed-highprec= 'H' int-type length        // fixed-size high-precision number
dict-highprec = '[' '$' 'H' '#' count 1*(highprec-value) // dictionary-based high-precision
offset-highprec = '[' '$' int-type ']'     // offset-table-based variable high-precision (same as string)
nested-schema = '{' 1*(field-def) '}'
array-spec    = '[' 1*(type-spec) ']'      // fixed array with explicit element types

Key rules:

  1. Fixed-length numeric types: U i u I l m L M h d D C B
  2. T in schema means "boolean type" - each value is 1 byte (T or F marker) in payload
  3. Z in schema means "null field" - no bytes in payload (placeholder/reserved field)
  4. Strings (S) and high-precision numbers (H) support three storage modes:
    • Fixed-length: S <int-type> <length> or H <int-type> <length>
    • Dictionary-based: [$S#<count><str1><str2>... or [$H#<count><val1><val2>...
    • Offset-table-based: [$<int-type>]
  5. Nested objects {...} are allowed if all fields use supported types
  6. No optimized containers can be used inside the schema, except in the case of serving as dictionary/offset-table markers for variable-length strings, as described in #4 above
  7. F and N are not used in schema (use T for boolean, Z for null)

Mode 1: Fixed-length storage (S/H with length)

In a schema context, S and H followed by an integer define fixed-length strings or high-precision numbers:

{ i4 name S i 16 }       // "name" is a 16-byte fixed string
{ i5 value H i 32 }      // "value" is a 32-byte fixed high-precision number

In the payload, each record contributes exactly the specified bytes - no length prefix. Strings shorter than the length are right-padded with null bytes (0x00).

Use case: Strings with known maximum length (codes, IDs, short labels).


Mode 2: Dictionary-based storage ([$S#<n>... or [$H#<n>...)

A dictionary for mapping string-value rows/columns should be indicated by a payload-less optimized string-array.

Use case: Repeated/categorical string values.

Schema syntax:

[$S#<count><str1><str2>...
[$H#<count><val1><val2>...

where each string/value is encoded as a standard BJData string or high-precision number (with length prefix). No closing ] is needed because the count is specified.

Example schema:

{
  i6 status [$S#i 3                    // dictionary with 3 string values
    i 6 active                         // index 0: "active"
    i 8 inactive                       // index 1: "inactive"  
    i 7 pending                        // index 2: "pending"
}

Payload encoding: Each record stores a single integer index referencing the dictionary. The index type is automatically determined as the smallest unsigned integer type that can represent the dictionary size:

  • count ≤ 255: U (uint8, 1 byte)
  • count ≤ 65535: u (uint16, 2 bytes)
  • count ≤ 4294967295: m (uint32, 4 bytes)
  • otherwise: M (uint64, 8 bytes)
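The index-type rule translates directly into a small lookup. In this sketch, dictionary_index_type is a hypothetical helper and the status values are sample data.

```python
def dictionary_index_type(count):
    """Return (marker, byte width) of the index type for a dictionary size."""
    if count <= 255:
        return ("U", 1)
    if count <= 65535:
        return ("u", 2)
    if count <= 4294967295:
        return ("m", 4)
    return ("M", 8)


dictionary = ["active", "inactive", "pending"]
statuses = ["active", "pending", "active"]          # sample column values

marker, width = dictionary_index_type(len(dictionary))
# Each record stores only its index; the strings live once in the schema.
payload = b"".join(
    dictionary.index(s).to_bytes(width, "little") for s in statuses
)
```

With three dictionary entries, every record contributes a single uint8 index to the payload.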

Benefits:

  • Excellent compression for low-cardinality categorical data
  • O(1) lookup for string values
  • Payload remains fixed-size per record

Mode 3: Offset-table-based storage ([$<type>])

Offset-table-based storage is used for storing string vectors of variable lengths. It first concatenates all strings into a single linear buffer, and assigns to each string element an integer offset from the beginning of the buffer. The use of an offset table should be indicated by a payload-less optimized array containing only the optimized type [$<type>].

Use case: Diverse strings with highly variable lengths (names, descriptions, free text).

Schema syntax:

[$<offset-type>]

Where <offset-type> is an integer type (i, U, I, u, l, m, L, M) specifying the byte-offset type stored in the payload.

Example schema:

{
  i2 id m                    // uint32 (4 bytes in payload)
  i4 name [$l]               // variable string with int32 offsets
  i5 value D                 // float64 (8 bytes in payload)
}

Storage structure:

The offset-table-based string field stores an offset index in the fixed payload area. After all record payloads, an offset table and concatenated string buffer are appended:

┌─────────────────────────────────────────────────────────────────┐
│ Schema Header                                                   │
├─────────────────────────────────────────────────────────────────┤
│ Fixed-size payload (N records)                                  │
│   - Each string field position stores a sequential index (0..N-1)│
│   - Other fields store actual values                            │
├─────────────────────────────────────────────────────────────────┤
│ Offset Table: (N+1) offsets of <offset-type>                    │
│   [0, end1, end2, ..., end_N]                                    │
├─────────────────────────────────────────────────────────────────┤
│ String Buffer: concatenated strings (no length prefixes)        │
│   str1 ∥ str2 ∥ str3 ∥ ... ∥ str_N                               │
└─────────────────────────────────────────────────────────────────┘

Decoding string i:

offset_start = offset_table[i]
offset_end = offset_table[i+1]
string_i = string_buffer[offset_start:offset_end]

Multiple variable-length fields: When a schema contains multiple offset-table-based fields, their offset tables and string buffers are appended in schema field order.

Empty strings: Represented by consecutive identical offsets in the offset table.
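The offset table and string buffer can be built and read back as follows. This is a minimal sketch assuming int32 (l) offsets and UTF-8 byte offsets; build_offset_table and read_string are hypothetical helpers and the names are sample data.

```python
import struct


def build_offset_table(strings):
    """Return (offset table bytes, concatenated string buffer)."""
    encoded = [s.encode("utf-8") for s in strings]
    offsets = [0]
    for chunk in encoded:
        offsets.append(offsets[-1] + len(chunk))     # N strings -> N+1 offsets
    table = struct.pack(f"<{len(offsets)}i", *offsets)
    return table, b"".join(encoded)


def read_string(table, buffer, i):
    """Random access to string i using two adjacent int32 offsets."""
    start, end = struct.unpack_from("<2i", table, 4 * i)
    return buffer[start:end].decode("utf-8")


table, buffer = build_offset_table(["Alice", "", "Bob"])
```

The empty second string shows up as two identical consecutive offsets (5, 5), and any string can be read without scanning the ones before it.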

Benefits:

  • Efficient for highly variable string lengths
  • No wasted padding bytes
  • Random access via offset table
  • String buffer can be memory-mapped

Boolean type (T)

In normal BJData, T and F are zero-length value markers:

T                        // true (no payload)
F                        // false (no payload)

In a schema context, T means "boolean type" - a 1-byte field:

{ i6 active T }          // "active" is a boolean field

In the payload, each boolean value is stored as a single byte: T (0x54) for true, F (0x46) for false.


Null type (Z)

In a schema context, Z means "null/placeholder field" with zero bytes in payload:

{ 
  i2 id m                // uint32 (4 bytes)
  i8 reserved Z          // placeholder (0 bytes)
  i4 data D              // float64 (8 bytes)
}

This is useful for:

  • Reserved fields for future expansion
  • Marking fields that exist in the schema but carry no data
  • Sparse structures where some fields are always null

Using existing container markers:

| Syntax | Layout | Description |
|---|---|---|
| [$ | Row-major (Interleaved) | Array of records - each complete record stored sequentially |
| {$ | Column-major (Columnar) | Object of arrays - all values of each field stored together |

Row-major layout: [$

[$  {<schema>}  #<count>  <interleaved-payload>

Payload order: <record₁><record₂><record₃>...

Example: 3 particles with {x:float64, y:float64, id:uint32, active:bool}

[ $ { i1 x D  i1 y D  i2 id m  i6 active T } # i 3
  <x₁:8><y₁:8><id₁:4><active₁:1>  <x₂:8><y₂:8><id₂:4><active₂:1>  ...

Payload: 3 × 21 bytes = 63 bytes, interleaved

Column-major layout: {$

{$  {<schema>}  #<count>  <columnar-payload>

Payload order: <all field₁ values><all field₂ values>...

Example: Same 3 particles

{ $ { i1 x D  i1 y D  i2 id m  i6 active T } # i 3
  <x₁:8><x₂:8><x₃:8>  <y₁:8><y₂:8><y₃:8>  <id₁:4><id₂:4><id₃:4>  <T><F><T>

Payload: (3×8) + (3×8) + (3×4) + (3×1) = 63 bytes, columnar
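The two payload orders for three particles with fields {x:float64, y:float64, id:uint32, active:bool} can be compared with a short sketch. The particle values are made up for illustration, and the struct format codes are Python's own (there, "d" is an 8-byte double and "I" a 4-byte unsigned integer); they are unrelated to BJData markers.

```python
import struct

# Hypothetical records: (x, y, id, active)
particles = [(1.0, 2.0, 7, True), (3.0, 4.0, 8, False), (5.0, 6.0, 9, True)]


def flag(value):
    """Booleans occupy one payload byte: the T or F marker itself."""
    return b"T" if value else b"F"


# Row-major (interleaved): one complete 21-byte record after another.
row_major = b"".join(
    struct.pack("<ddIc", x, y, pid, flag(active))
    for x, y, pid, active in particles
)

# Column-major (columnar): all x, then all y, then all ids, then all flags.
n = len(particles)
col_major = (
    struct.pack(f"<{n}d", *(p[0] for p in particles))
    + struct.pack(f"<{n}d", *(p[1] for p in particles))
    + struct.pack(f"<{n}I", *(p[2] for p in particles))
    + b"".join(flag(p[3]) for p in particles)
)
```

Both layouts hold the same 63 bytes of data; only the ordering differs, which is what lets a reader pick the layout that matches its access pattern.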

Why this design:

  • [ = "ordered sequence" → sequence of records (row-major)
  • { = "named fields" → fields as separate arrays (column-major)
  • No new markers needed

Nested objects

{
  i4 name S i 32           // 32-byte fixed string
  i8 position {            // nested object (24 bytes total)
    i1 x D
    i1 y D  
    i1 z D
  }
  i6 active T              // boolean (1 byte)
  i5 flags U               // uint8 (1 byte)
}

Record size: 32 + 24 + 1 + 1 = 58 bytes

Fixed-length arrays in schema

Use array syntax with repeated type markers:

{
  i2 id m                  // uint32 (4 bytes)
  i3 pos [D D D]           // array of 3 float64 (24 bytes)
  i5 color [U U U U]       // array of 4 uint8 (4 bytes)
  i5 flags [T T T T]       // array of 4 booleans (4 bytes)
}

Record size: 4 + 24 + 4 + 4 = 36 bytes

For longer arrays, repeat the type marker:

{
  i4 data [D D D D D D D D D D]   // array of 10 float64 (80 bytes)
}

Nested Arrays with Mixed Types

{
  i6 vertex [D D D]        // position: 3 float64 (24 bytes)
  i6 normal [h h h]        // normal: 3 float16 (6 bytes)
  i5 color [U U U U]       // RGBA: 4 uint8 (4 bytes)
  i7 visible T             // visibility: boolean (1 byte)
}

Record size: 24 + 6 + 4 + 1 = 35 bytes
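Record sizes for schemas with fixed-length fields can be computed recursively. The sketch below uses a hypothetical in-memory form of a schema (dicts for objects, lists for fixed arrays, ("S", n) or ("H", n) tuples for fixed-length strings and numbers) and the per-marker payload sizes of the type table, where D is float64 (8 bytes) and d is float32 (4 bytes). Dictionary-based and offset-table-based fields are left out.

```python
# Payload bytes contributed by each marker inside a schema.
SIZES = {
    "U": 1, "i": 1, "u": 2, "I": 2, "l": 4, "m": 4, "L": 8, "M": 8,
    "h": 2, "d": 4, "D": 8, "C": 1, "B": 1, "T": 1, "Z": 0,
}


def record_size(spec):
    """Bytes contributed per record by a schema or one of its fields."""
    if isinstance(spec, str):            # single type marker
        return SIZES[spec]
    if isinstance(spec, tuple):          # ("S", length) or ("H", length)
        return spec[1]
    if isinstance(spec, list):           # fixed array of element types
        return sum(record_size(item) for item in spec)
    if isinstance(spec, dict):           # object or nested object
        return sum(record_size(item) for item in spec.values())
    raise TypeError(f"unsupported schema entry: {spec!r}")


nested = {
    "name": ("S", 32),
    "position": {"x": "D", "y": "D", "z": "D"},
    "active": "T",
    "flags": "U",
}
mesh = {
    "vertex": ["D", "D", "D"],
    "normal": ["h", "h", "h"],
    "color": ["U", "U", "U", "U"],
    "visible": "T",
}
```

Here record_size(nested) gives 58 and record_size(mesh) gives 35, matching the hand-computed totals in the nested-object and mixed-type examples.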


Both [$ and {$ support ND dimensions:

[$  {<schema>}  #[<dim₁> <dim₂> ...]  <payload>
{$  {<schema>}  #[<dim₁> <dim₂> ...]  <payload>

Example: 4×3 grid of particles (row-major)

[ $ { i1 x D  i1 y D  i6 active T } # [ i 4  i 3 ]
  <12 records in row-major order>

Total: 12 records × 17 bytes = 204 bytes


Example 1: Fixed-Length Fields Only

Data: 2 sensors

[
  {"id": 1, "pos": {"x": 1.0, "y": 2.0}, "val": [0.1, 0.2, 0.3], "on": true},
  {"id": 2, "pos": {"x": 3.0, "y": 4.0}, "val": [0.4, 0.5, 0.6], "on": false}
]

Row-major encoding:

Byte  Hex   Meaning
----  ----  -------
0     5B    [ (array-style SoA = row-major)
1     24    $
2     7B    { (schema start)
3     69    i (int8 key length)
4     02    2
5-6   6964  "id"
7     6D    m (uint32)
8     69    i
9     03    3
10-12 706F73 "pos"
13    7B    { (nested object start)
14    69    i
15    01    1
16    78    "x"
17    44    D (float64)
18    69    i
19    01    1
20    79    "y"
21    44    D (float64)
22    7D    } (nested object end)
23    69    i
24    03    3
25-27 76616C "val"
28    5B    [ (array start)
29    44    D (float64)
30    44    D
31    44    D
32    5D    ] (array end)
33    69    i
34    02    2
35-36 6F6E  "on"
37    54    T (boolean type)
38    7D    } (schema end)
39    23    #
40    69    i
41    02    2 (count = 2)
--- PAYLOAD (2 records × 45 bytes) ---
42-45       id₁: uint32 = 1
46-53       pos.x₁: float64 = 1.0
54-61       pos.y₁: float64 = 2.0
62-69       val₁[0]: float64 = 0.1
70-77       val₁[1]: float64 = 0.2
78-85       val₁[2]: float64 = 0.3
86          on₁: T (true)
87-90       id₂: uint32 = 2
91-98       pos.x₂: float64 = 3.0
99-106      pos.y₂: float64 = 4.0
107-114     val₂[0]: float64 = 0.4
115-122     val₂[1]: float64 = 0.5
123-130     val₂[2]: float64 = 0.6
131         on₂: F (false)

Record size: 4 + 8 + 8 + 24 + 1 = 45 bytes
Total: 42 (header) + 90 (payload) = 132 bytes


Example 2: Variable-Length String Fields

Data: 3 users with variable-length names and categorical status

[
  {"id": 1, "status": "active", "name": "Alice", "code": "U001"},
  {"id": 2, "status": "pending", "name": "Bob", "code": "U002"},
  {"id": 3, "status": "active", "name": "Dr. Christopher Williams", "code": "U003"}
]

Schema (block notation):

[{]
  [i][2][id][m]                        // uint32 (4 bytes)
  [i][6][status][[][$][S][#][i][3]     // dictionary with 3 values
    [i][6][active]                     // index 0
    [i][8][inactive]                   // index 1
    [i][7][pending]                    // index 2
  [i][4][name][[][$][l][]]             // offset-based variable string (int32 offsets)
  [i][4][code][S][i][4]                // fixed 4-byte string
[}]

Record payload size: 4 (id) + 1 (status index) + 4 (name offset index) + 4 (code) = 13 bytes

Memory layout (row-major):

┌──────────────────────────────────────────────────────────────┐
│ HEADER (schema + count = 3)                                  │
├──────────────────────────────────────────────────────────────┤
│ Record 1: [id=1] [status_idx=0] [name_idx=0] [code="U001"]   │  13 bytes
│ Record 2: [id=2] [status_idx=2] [name_idx=1] [code="U002"]   │  13 bytes
│ Record 3: [id=3] [status_idx=0] [name_idx=2] [code="U003"]   │  13 bytes
├──────────────────────────────────────────────────────────────┤
│ Name Offset Table (4 × int32):                               │
│   [0, 5, 8, 32]                                              │  16 bytes
├──────────────────────────────────────────────────────────────┤
│ Name String Buffer:                                          │
│   "AliceBobDr. Christopher Williams"                         │  32 bytes
└──────────────────────────────────────────────────────────────┘

Total: header + 39 (records) + 16 (offset table) + 32 (strings) = header + 87 bytes
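
The offset table and string buffer in the layout above can be reproduced with a short sketch. It assumes UTF-8 string bytes and Little-Endian int32 offsets; `build_offsets` is a hypothetical helper name used only for illustration.

```python
import struct

names = ["Alice", "Bob", "Dr. Christopher Williams"]

def build_offsets(strings):
    """Return (offsets, buffer): N+1 cumulative byte offsets and the
    concatenated bytes, so string k is buffer[offsets[k]:offsets[k+1]]."""
    offsets, chunks = [0], []
    for s in strings:
        data = s.encode("utf-8")
        chunks.append(data)
        offsets.append(offsets[-1] + len(data))
    return offsets, b"".join(chunks)

offsets, buffer = build_offsets(names)
offset_table = struct.pack(f"<{len(offsets)}i", *offsets)  # int32, Little-Endian

record_bytes = 3 * (4 + 1 + 4 + 4)  # id + status index + name index + code
total = record_bytes + len(offset_table) + len(buffer)
print(offsets, total)  # [0, 5, 8, 32] 87
```

Storing N+1 offsets lets a reader slice any string without scanning the buffer, which is why the table holds four entries for three names.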


| Marker | In Schema Means | Payload Size | Notes |
|--------|-----------------|--------------|-------|
| `U` | uint8 | 1 byte | |
| `i` | int8 | 1 byte | |
| `u` | uint16 | 2 bytes | |
| `I` | int16 | 2 bytes | |
| `l` | int32 | 4 bytes | |
| `m` | uint32 | 4 bytes | |
| `L` | int64 | 8 bytes | |
| `M` | uint64 | 8 bytes | |
| `h` | float16 | 2 bytes | |
| `d` | float32 | 4 bytes | |
| `D` | float64 | 8 bytes | |
| `C` | char | 1 byte | |
| `B` | byte | 1 byte | |
| `T` | boolean | 1 byte | Payload: `T` or `F` marker |
| `Z` | null/placeholder | 0 bytes | No payload |
| `S <int> <len>` | fixed string | len bytes | No length prefix in payload |
| `H <int> <len>` | fixed high-precision | len bytes | No length prefix in payload |
| `[$S#<n>...` | dictionary string | 1-8 bytes | Index into embedded dictionary |
| `[$H#<n>...` | dictionary high-precision | 1-8 bytes | Index into embedded dictionary |
| `[$<type>]` | offset-based string/H | sizeof(type) | Offset table + buffer appended |
| `{...}` | nested object | sum of fields | All fields must be supported types |
| `[...]` | fixed array | sum of elements | Explicit element types listed |

The extension type in BJData provides a mechanism for storing application-specific or predefined binary data types that are not natively supported by the core BJData specification. This enables interoperability with other systems and future extensibility without modifying the core format.

Format

An extension value is encoded using the marker E followed by two integers and a binary payload:

[E][type-id][byte-length][payload]

where:

  • E (0x45) - The 1-byte ASCII marker indicating an extension type
  • type-id - An integer value (i, U, I, u, l, m, L, or M) specifying the extension type identifier
  • byte-length - An integer value (i, U, I, u, l, m, L, or M) specifying the length of the payload in bytes
  • payload - A contiguous byte-stream of the specified length containing the extension data

Type ID Ranges

| Range | Description |
|-------|-------------|
| 0–255 | Reserved for predefined types defined by this specification |
| 256+ | Application-specific types for user-defined extensions |

Applications can assign type IDs 256 and above for custom data types. The meaning of these IDs is determined by the application and should be documented separately.

Reserved extension types (0–255) are limited to fixed-byte-length records to ensure predictable parsing and efficient storage. Each reserved type ID corresponds to exactly one fixed payload size.

Example (block notation)

An extension with type ID 10 (UUID) containing 16 bytes of data:

[E][U][10][U][16][...16 bytes of UUID data...]
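
The `[E][type-id][byte-length][payload]` framing can be sketched in a few lines of Python. The specification allows any integer marker for the two header integers; picking the smallest unsigned type is an assumption made here for compactness, and `encode_uint` and `encode_extension` are hypothetical helper names.

```python
import struct

# (marker, struct format, largest value) for the unsigned BJData integer types
_UNSIGNED = [(b"U", "<B", 0xFF), (b"u", "<H", 0xFFFF),
             (b"m", "<I", 0xFFFFFFFF), (b"M", "<Q", 0xFFFFFFFFFFFFFFFF)]

def encode_uint(value):
    """Encode a non-negative integer with the smallest unsigned marker."""
    if value < 0:
        raise ValueError("expected a non-negative integer")
    for marker, fmt, limit in _UNSIGNED:
        if value <= limit:
            return marker + struct.pack(fmt, value)
    raise OverflowError("value does not fit in uint64")

def encode_extension(type_id, payload):
    """Build [E][type-id][byte-length][payload] as bytes."""
    payload = bytes(payload)
    return b"E" + encode_uint(type_id) + encode_uint(len(payload)) + payload

# 16 zero bytes stand in for real UUID data
encoded = encode_extension(10, bytes(16))
```

With these choices, a type ID of 256 or above is written with the `u` (uint16) marker, so application-specific extensions cost one extra header byte.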

The following extension type IDs are reserved and defined by this specification. Each reserved type has exactly one fixed payload size for unambiguous parsing. Parsers that do not recognize a reserved type ID should treat the extension as opaque binary data and preserve it for round-trip serialization.

Type Summary

| Type ID | Name | Payload Size | Description |
|---------|------|--------------|-------------|
| 0 | Reserved | | Reserved for future use |
| 1 | epoch_s | 4 bytes | Epoch time in seconds (uint32) |
| 2 | epoch_us | 8 bytes | Epoch time in microseconds (int64) |
| 3 | epoch_ns | 12 bytes | Epoch time in nanoseconds (int64 + uint32) |
| 4 | date | 4 bytes | Calendar date (year, month, day) |
| 5 | time_s | 4 bytes | Time of day in seconds |
| 6 | datetime_us | 8 bytes | Date and time in microseconds since epoch |
| 7 | timedelta_us | 8 bytes | Time duration in microseconds |
| 8 | complex64 | 8 bytes | Complex number (single precision) |
| 9 | complex128 | 16 bytes | Complex number (double precision) |
| 10 | uuid | 16 bytes | Universally Unique Identifier (RFC 4122) |
| 11–255 | Reserved | | Reserved for future specification |

Represents an instantaneous point in time as seconds since the Unix epoch (1970-01-01 00:00:00 UTC).

Payload Format (4 bytes)

    seconds (uint32)
║──────┬──────┬──────┬──────║
0      1      2      3      4
  • seconds: Unsigned 32-bit integer, Little-Endian
  • Range: 1970-01-01 00:00:00 to 2106-02-07 06:28:15 UTC
  • Precision: 1 second
Example

Timestamp for 2024-01-15 10:30:00 UTC (epoch = 1705314600 = 0x65A50928):

[E][U][1][U][4][0x28][0x09][0xA5][0x65]
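
A payload of this form can be produced from a timezone-aware Python `datetime`. The sketch below is illustrative only: `encode_epoch_s` and `decode_epoch_s` are hypothetical names, and rejecting out-of-range values with an exception is an assumption rather than a requirement of the format.

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def encode_epoch_s(dt):
    """Return the 4-byte epoch_s payload for a timezone-aware datetime."""
    seconds = (dt - EPOCH) // timedelta(seconds=1)  # whole seconds, floored
    if not 0 <= seconds <= 0xFFFFFFFF:
        raise ValueError("timestamp outside the uint32 range of epoch_s")
    return struct.pack("<I", seconds)  # Little-Endian uint32

def decode_epoch_s(payload):
    """Return the UTC datetime stored in a 4-byte epoch_s payload."""
    (seconds,) = struct.unpack("<I", payload)
    return EPOCH + timedelta(seconds=seconds)

payload = encode_epoch_s(datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc))
print(payload.hex(" "))
```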

Represents an instantaneous point in time as microseconds since the Unix epoch (1970-01-01 00:00:00 UTC). Compatible with Python's datetime.timestamp() * 1e6.

Payload Format (8 bytes)

                    microseconds (int64)
║──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────║
0      1      2      3      4      5      6      7      8
  • microseconds: Signed 64-bit integer, Little-Endian
  • Positive values: dates after 1970-01-01 00:00:00 UTC
  • Negative values: dates before 1970-01-01 00:00:00 UTC
  • Range: approximately ±292,471 years from epoch
  • Precision: 1 microsecond
Conversion
epoch_us = unix_timestamp_seconds * 1_000_000 + microseconds
Example

Timestamp for 2024-01-15 10:30:00.123456 UTC (epoch_us = 1705314600123456 = 0x00060EF97EF87C40):

[E][U][2][U][8][0x40][0x7C][0xF8][0x7E][0xF9][0x0E][0x06][0x00]
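
Because `datetime.timestamp()` returns a float, multiplying it by `1e6` may not give an exact integer. The sketch below uses integer `timedelta` arithmetic instead; it assumes timezone-aware inputs, and `encode_epoch_us` and `decode_epoch_us` are hypothetical helper names.

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def encode_epoch_us(dt):
    """Return the 8-byte epoch_us payload (Little-Endian int64)."""
    micros = (dt - EPOCH) // timedelta(microseconds=1)  # exact integer count
    return struct.pack("<q", micros)

def decode_epoch_us(payload):
    """Return the UTC datetime stored in an 8-byte epoch_us payload."""
    (micros,) = struct.unpack("<q", payload)
    return EPOCH + timedelta(microseconds=micros)

stamp = datetime(2024, 1, 15, 10, 30, 0, 123456, tzinfo=timezone.utc)
payload = encode_epoch_us(stamp)
print(payload.hex(" "))
```

The same functions handle instants before 1970, which simply produce negative counts.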

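Represents an instantaneous point in time with nanosecond precision, stored as seconds plus nanoseconds since the Unix epoch. A value converts to NumPy's datetime64[ns], which stores a single int64 nanosecond count, as seconds × 10^9 + nanoseconds, provided the result fits the narrower datetime64[ns] range.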

Payload Format (12 bytes)
            seconds (int64)              nanoseconds (uint32)
║────┬────┬────┬────┬────┬────┬────┬────╫────┬────┬────┬────║
0    1    2    3    4    5    6    7    8    9    10   11   12
  • seconds: Signed 64-bit integer (bytes 0–7), Little-Endian
    • Signed to support dates before 1970
    • Range: approximately ±292 billion years
  • nanoseconds: Unsigned 32-bit integer (bytes 8–11), Little-Endian
    • Range: [0, 999999999]
    • Sub-second component, always non-negative
  • Precision: 1 nanosecond

Example

Timestamp for 2024-01-15 10:30:00.123456789 UTC:

  • Seconds: 1705314600
  • Nanoseconds: 123456789
[E][U][3][U][12][0x28][0x09][0xA5][0x65][0x00][0x00][0x00][0x00][0x15][0xCD][0x5B][0x07]
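
The split into a signed seconds field and a non-negative nanoseconds field follows from floor division. The sketch below assumes the instant is already available as an integer nanosecond count since the epoch; `encode_epoch_ns` and `decode_epoch_ns` are hypothetical names.

```python
import struct

NS_PER_S = 1_000_000_000

def encode_epoch_ns(total_ns):
    """Split integer nanoseconds since the epoch into the 12-byte payload:
    int64 seconds followed by uint32 nanoseconds, both Little-Endian."""
    # divmod floors toward negative infinity, so nanos stays in [0, 1e9)
    seconds, nanos = divmod(total_ns, NS_PER_S)
    return struct.pack("<qI", seconds, nanos)

def decode_epoch_ns(payload):
    """Return integer nanoseconds since the epoch from a 12-byte payload."""
    seconds, nanos = struct.unpack("<qI", payload)
    return seconds * NS_PER_S + nanos

payload = encode_epoch_ns(1705314600 * NS_PER_S + 123456789)
half_second_before_epoch = struct.unpack("<qI", encode_epoch_ns(-500_000_000))
print(half_second_before_epoch)  # (-1, 500000000)
```

An instant half a second before the epoch is therefore stored as seconds = -1 and nanoseconds = 500000000, not as a negative sub-second part.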

Represents a calendar date without time-of-day information.

Payload Format (4 bytes)
   year (int16)    month   day
║──────┬──────╫──────╫──────║
0      1      2      3      4
  • year: Signed 16-bit integer (bytes 0–1), Little-Endian
    • Range: -32768 to 32767
  • month: Unsigned 8-bit integer (byte 2)
    • Range: [1, 12]
  • day: Unsigned 8-bit integer (byte 3)
    • Range: [1, 31]
  • Uses Gregorian calendar
Example

Date for 2024-01-15:

  • Year: 2024 (0x07E8)
  • Month: 1
  • Day: 15
[E][U][4][U][4][0xE8][0x07][0x01][0x0F]
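
The date payload maps directly onto a three-field `struct` format. This is a sketch under the assumption that Python's `datetime.date` is the in-memory type; note that `date` only supports years 1 to 9999, which is narrower than the int16 range of the format. The function names are illustrative.

```python
import struct
from datetime import date

def encode_date(d):
    """Return the 4-byte date payload: int16 year (Little-Endian), month, day."""
    return struct.pack("<hBB", d.year, d.month, d.day)

def decode_date(payload):
    """Return a datetime.date from a 4-byte date payload."""
    year, month, day = struct.unpack("<hBB", payload)
    return date(year, month, day)  # raises ValueError for invalid dates

payload = encode_date(date(2024, 1, 15))
print(payload.hex(" "))  # e8 07 01 0f
```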

Represents a time of day with second precision, without date information.

Payload Format (4 bytes)
 hour   min    sec    (rsv)
║──────╫──────╫──────╫──────║
0      1      2      3      4
  • hour: Unsigned 8-bit integer (byte 0), range [0, 23]
  • minute: Unsigned 8-bit integer (byte 1), range [0, 59]
  • second: Unsigned 8-bit integer (byte 2), range [0, 60] (60 for leap second)
  • reserved: Byte 3, set to 0
Example

Time for 10:30:45:

[E][U][5][U][4][0x0A][0x1E][0x2D][0x00]
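
A decoder for this layout might validate the field ranges itself, since Python's `datetime.time` does not accept a second value of 60. The sketch below returns a plain tuple for that reason; ignoring the reserved byte on read is an assumption, and `encode_time_s` and `decode_time_s` are hypothetical names.

```python
import struct

def encode_time_s(hour, minute, second):
    """Return the 4-byte time_s payload with the reserved byte set to 0."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59 and 0 <= second <= 60):
        raise ValueError("time field out of range")
    return struct.pack("4B", hour, minute, second, 0)

def decode_time_s(payload):
    """Return (hour, minute, second) from a 4-byte time_s payload."""
    hour, minute, second, _reserved = struct.unpack("4B", payload)
    if hour > 23 or minute > 59 or second > 60:  # 60 allows a leap second
        raise ValueError("time field out of range")
    return hour, minute, second

payload = encode_time_s(10, 30, 45)
print(payload.hex(" "))  # 0a 1e 2d 00
```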

Represents a date and time as microseconds since the Unix epoch. This is functionally identical to epoch_us (Type ID: 2) but semantically emphasizes the datetime interpretation.

Payload Format (8 bytes)
                    microseconds (int64)
║──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────║
0      1      2      3      4      5      6      7      8
  • microseconds: Signed 64-bit integer, Little-Endian
  • Range: approximately ±292,471 years from epoch
  • Precision: 1 microsecond
Example

An extension datetime record for 2024-01-15 10:30:00.123456 UTC:

[E][U][6][U][8][0x40][0x7C][0xF8][0x7E][0xF9][0x0E][0x06][0x00]

Represents a time duration in microseconds. Compatible with Python's timedelta.total_seconds() * 1e6.

Payload Format (8 bytes)
                    microseconds (int64)
║──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────║
0      1      2      3      4      5      6      7      8
  • microseconds: Signed 64-bit integer, Little-Endian
  • Positive values: forward duration
  • Negative values: backward duration
  • Range: approximately ±292,471 years
  • Precision: 1 microsecond
Conversion
total_seconds = timedelta_us / 1_000_000
days = timedelta_us // 86_400_000_000
Example

Duration of 5 days, 3 hours, 30 minutes, 15.5 seconds (= 444615500000 μs):

[E][U][7][U][8][0xE0][0x20][0x26][0x85][0x67][0x00][0x00][0x00]
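
The duration payload can be derived from a Python `timedelta` with integer arithmetic, which avoids the float rounding that `total_seconds() * 1e6` may introduce. This is a minimal sketch; `encode_timedelta_us` and `decode_timedelta_us` are hypothetical names.

```python
import struct
from datetime import timedelta

def encode_timedelta_us(delta):
    """Return the 8-byte timedelta_us payload (Little-Endian int64)."""
    micros = delta // timedelta(microseconds=1)  # exact integer microseconds
    return struct.pack("<q", micros)

def decode_timedelta_us(payload):
    """Return a timedelta from an 8-byte timedelta_us payload."""
    (micros,) = struct.unpack("<q", payload)
    return timedelta(microseconds=micros)

duration = timedelta(days=5, hours=3, minutes=30, seconds=15.5)
payload = encode_timedelta_us(duration)
print(struct.unpack("<q", payload)[0])  # 444615500000
```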

Represents a complex number with single-precision (32-bit) floating-point components. Compatible with NumPy's complex64 data type.

Payload Format (8 bytes)
      real (float32)              imaginary (float32)
║──────┬──────┬──────┬──────╫──────┬──────┬──────┬──────║
0      1      2      3      4      5      6      7      8
  • real: 32-bit float (bytes 0–3), Little-Endian, IEEE 754
  • imaginary: 32-bit float (bytes 4–7), Little-Endian, IEEE 754
Mathematical Notation

A complex number z = a + bi is stored as [a][b].

Example

Complex number 3.0 + 4.0i:

[E][U][8][U][8][0x00][0x00][0x40][0x40][0x00][0x00][0x80][0x40]

Represents a complex number with double-precision (64-bit) floating-point components. Compatible with NumPy's complex128 data type and Python's complex.

Payload Format (16 bytes)
            real (float64)                       imaginary (float64)
║────┬────┬────┬────┬────┬────┬────┬────╫────┬────┬────┬────┬────┬────┬────┬────║
0    1    2    3    4    5    6    7    8    9    10   11   12   13   14   15   16
  • real: 64-bit float (bytes 0–7), Little-Endian, IEEE 754
  • imaginary: 64-bit float (bytes 8–15), Little-Endian, IEEE 754
Example

Complex number 3.0 + 4.0i:

[E][U][9][U][16][0x00][0x00][0x00][0x00][0x00][0x00][0x08][0x40]
                [0x00][0x00][0x00][0x00][0x00][0x00][0x10][0x40]
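
Both complex layouts differ only in component width, so one pair of helpers can cover them. The sketch below is an illustration using Python's built-in `complex`; the `double` flag and the function names are assumptions, not part of the specification.

```python
import struct

def encode_complex(z, double=True):
    """Return the payload [real][imaginary] as Little-Endian IEEE 754 floats:
    16 bytes for complex128 (default) or 8 bytes for complex64."""
    fmt = "<dd" if double else "<ff"
    return struct.pack(fmt, z.real, z.imag)

def decode_complex(payload):
    """Return a Python complex; the payload length selects the precision."""
    fmt = {8: "<ff", 16: "<dd"}[len(payload)]
    return complex(*struct.unpack(fmt, payload))

single = encode_complex(3.0 + 4.0j, double=False)
double = encode_complex(3.0 + 4.0j)
print(single.hex(" "))
print(double.hex(" "))
```

Selecting the format by payload length works only because the type ID has already identified the value as complex; the length alone does not distinguish complex64 from other 8-byte extension types.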

Represents a 128-bit Universally Unique Identifier as defined in RFC 4122.

Payload Format (16 bytes)
    time_low        time_mid  time_hi  clk_hi clk_lo        node
║────┬────┬────┬────╫────┬────╫────┬────╫────╫────╫────┬────┬────┬────┬────┬────║
0    1    2    3    4    5    6    7    8    9    10   11   12   13   14   15   16
  • time_low: bytes 0–3, Big-Endian
  • time_mid: bytes 4–5, Big-Endian
  • time_hi_and_version: bytes 6–7, Big-Endian
  • clock_seq_hi_and_reserved: byte 8
  • clock_seq_low: byte 9
  • node: bytes 10–15

The 16 bytes are stored in standard UUID byte order (network byte order / Big-Endian) as per RFC 4122. This is an exception to BJData's Little-Endian convention.

UUID Versions

The format supports all UUID versions (1, 4, 5, 7, etc.) as defined by RFC 4122 and RFC 9562.

Example

UUID 550e8400-e29b-41d4-a716-446655440000:

[E][U][10][U][16][0x55][0x0e][0x84][0x00][0xe2][0x9b][0x41][0xd4]
                 [0xa7][0x16][0x44][0x66][0x55][0x44][0x00][0x00]
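
In Python, the standard `uuid` module already exposes this byte order through `UUID.bytes`. The sketch below is illustrative, with hypothetical function names; note that `UUID.bytes_le` swaps the first three fields to Little-Endian and would not match the layout above.

```python
import uuid

def encode_uuid(u):
    """Return the 16-byte payload in RFC 4122 (Big-Endian) field order."""
    return u.bytes

def decode_uuid(payload):
    """Return a uuid.UUID from a 16-byte payload."""
    return uuid.UUID(bytes=bytes(payload))

u = uuid.UUID("550e8400-e29b-41d4-a716-446655440000")
payload = encode_uuid(u)
print(payload.hex(" "))
```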

Implementation Notes

Handling Unknown Extension Types

When a parser encounters an extension type ID it does not recognize:

  1. Reserved range (0–255): The parser should preserve the extension as opaque binary data to enable round-trip serialization. It may optionally issue a warning.

  2. Application range (256+): The parser should do one of the following:

    • Pass the extension to an application-provided handler, or
    • Preserve it as opaque binary data, or
    • Raise an error (configurable behavior)
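
One way to implement these rules is a small dispatch function with a handler registry. This is a sketch only: `OpaqueExtension`, `resolve_extension`, the `handlers` mapping, and the `strict` flag are all hypothetical names chosen for illustration, and optional warnings are omitted.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OpaqueExtension:
    """Unrecognized extension kept verbatim for round-trip serialization."""
    type_id: int
    payload: bytes

def resolve_extension(type_id, payload, handlers=None, strict=False):
    """Decode an extension via a registered handler, or fall back.

    handlers: dict mapping type ID to a callable that takes the payload.
    strict:   if True, unknown application-range IDs (256+) raise an error.
    """
    handlers = handlers or {}
    if type_id in handlers:
        return handlers[type_id](payload)
    if type_id >= 256 and strict:
        raise ValueError(f"unknown application extension type {type_id}")
    # Reserved range, or lenient mode: preserve as opaque binary data
    return OpaqueExtension(type_id, bytes(payload))

kept = resolve_extension(200, b"\x01\x02")
decoded = resolve_extension(300, b"hi", handlers={300: lambda p: p.decode("ascii")})
```

Keeping the type ID alongside the raw bytes is what allows a writer to emit the value unchanged later.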

Type Summary by Payload Size

| Payload Size | Type IDs |
|--------------|----------|
| 4 bytes | epoch_s (1), date (4), time_s (5) |
| 8 bytes | epoch_us (2), datetime_us (6), timedelta_us (7), complex64 (8) |
| 12 bytes | epoch_ns (3) |
| 16 bytes | complex128 (9), uuid (10) |
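
Since each defined reserved type has exactly one payload size, a parser can check the declared byte-length before decoding. The sketch below encodes the table as a dictionary; raising an error on a mismatch is an assumption about parser policy, and the names are hypothetical.

```python
# Fixed payload sizes in bytes for reserved type IDs 1-10
RESERVED_PAYLOAD_SIZES = {
    1: 4, 4: 4, 5: 4,          # epoch_s, date, time_s
    2: 8, 6: 8, 7: 8, 8: 8,    # epoch_us, datetime_us, timedelta_us, complex64
    3: 12,                     # epoch_ns
    9: 16, 10: 16,             # complex128, uuid
}

def check_reserved_payload(type_id, payload):
    """Return True if type_id is a defined reserved type with a valid length,
    False if the type is not defined here; raise on a length mismatch."""
    expected = RESERVED_PAYLOAD_SIZES.get(type_id)
    if expected is None:
        return False
    if len(payload) != expected:
        raise ValueError(
            f"type {type_id} expects {expected} bytes, got {len(payload)}")
    return True

ok = check_reserved_payload(3, bytes(12))
```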

Extension Data Endianness

All multi-byte numeric values in extension payloads are stored in Little-Endian order, consistent with the rest of the BJData specification, with the exception of UUID which follows RFC 4122 (network byte order / Big-Endian).

Recommended File Specifiers

For Binary JData files, the recommended file suffix is ".bjd". The MIME type for a Binary JData document is "application/jdata-binary".

Acknowledgement

The BJData spec is derived from the Universal Binary JSON (UBJSON, https://ubjson.org) specification (Draft 12) developed by Riyad Kalla and other UBJSON contributors.

The initial version of this Markdown-formatted specification was derived from the documentation included in the Py-UBJSON repository (Commit 5ce1fe7).

This specification was developed as part of the NeuroJSON project (https://neurojson.org) with funding support from the US National Institutes of Health (NIH) under grant U24-NS124027.