Skip to content

Tracking Issue: PatchedArray #7818

@a10y

Description

@a10y

Motivation

The idea of "patching" another encoding is pervasive in Vortex. The idea is to use the output of another encoding that is partially applicable to your array as if it were applicable at all positions, but then to store the extra values to "patch" as a second step in decoding.

We currently have three ways of representing Patching in the codebase:

  1. Patches child arrays of encodings such as BitPackedArray and ALPArray
  2. SparseArray which stores a default value and a set of patches
  3. A new PatchedArray that is based on the RFC

We think that it would be good to shift all uses of (1) and (2) over to the new PatchedArray to unify all of these usages in one single codepath that participates in the optimizer rules and iterative execution.

Design

The original PatchedArray RFC implemented a FastLanes-specific ordering of the patch indices and values, based on the lane that they are assigned at decoding time. This was done to optimize for GPU decoding, where we wanted to make sure each CUDA thread is able to seek to its patches in constant time.

However, as we found in #7440, we can get similar performance just from the existing sorted patches, using the chunk_offsets.

Currently PatchedArray is an experimental encoding that is disabled by default, and we have no known users of the encoding. To keep things simple and make it easy to migrate from the existing interior Patches layouts, we should try and make PatchedArray mirror the Patches array, i.e. having an indices and values child along with a (non-optional) chunk_offsets child to allow for random seeks.

Steps

  • Update PatchedArray to have indices/values/chunk_offsets children instead
  • Enable compressor to write PatchedArray by default for BitPacked, ALP and others
  • Switch over usage of SparseArray to PatchedArray and mark SparseArray as deprecated

Unresolved questions

  • Do we want to keep the G-ALP style transposed patches encoding and make this a new encoding instead?

Implementation history

Metadata

Metadata

Assignees

No one assigned

    Labels

    tracking-issueShared implementation context for work likely to span multiple PRs.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions