FIP-0071 Source

TitleDeterministic State Access
AuthorSteven
Discussions-Tohttps://github.com/filecoin-project/FIPs/discussions/764
StatusFinal
TypeTechnical Core
Created2023-07-25

Spec Sections

FIP-0071: Deterministic State Access

Simple Summary

This FIP brings us a step closer to user-defined WebAssembly actors by introducing deterministic rules defining what “state” (IPLD blocks) an actor is and is not allowed to read. Specifically, this FIP specifies that an actor may only read state “reachable” from:

  1. The actor’s state-tree.
  2. Parameters passed into the actor from other actors.
  3. Blocks returned to the actor from other actors.

Where “reachable” means that the state can be “reached” by traversing IPLD links (CIDs) from the “roots” listed above.

Abstract

This FIP introduces additional logic into the FVM’s state management system to ensure that:

  1. An actor may only read its own state.
  2. When an actor writes new state, that new state may only reference an actors existing state.

This ensures that an actor cannot read and/or reference random IPLD data in the client’s blockstore which could lead to network forks (once users are allowed to deploy arbitrary native actors to the network).

Change Motivation

Currently, built-in actors (but not EVM smart contracts) can read arbitrary data “blocks” (IPLD blocks) out of the Filecoin client’s “state” store. Nothing ensures that these blocks are actually a part of the current tipset’s state-tree, let alone the executing actor’s state-tree. Instead, they could be “garbage” left over from previous tipsets and/or forks.

This is currently “safe” as all built-in actors only access “their” state by traversing links (CIDs) down from their state-tree root. However, if a (not currently allowed) user-defined WebAssembly actor were to try to access a random block not in its state-tree, it could cause the Filecoin network to fork as said block may be present in some client’s datastores but not others.

Specification

This FIP proposes to enforce the existing built-in actor behavior in the FVM itself. That is, actors will be required to traverse CIDs from their state-root when accessing state and will not be able to link to “arbitrary” CIDs not present in their state when writing new blocks.

It does this by introducing a “reachable set”, the set of currently “accessible” blocks (specifically, the CIDs of said blocks), and the following two rules:

  1. The actor may only read (open) blocks in the reachable set. Blocks [referenced by][ block-analysis ] newly opened blocks are added to the reachable set.
  2. The actor may only write (create) blocks that [referenced][ block-analysis ] blocks currently in the reachable set. Newly created blocks are added to the reachable set.

This should allow for natural manipulation of IPLD state-trees while preventing an actor from reading arbitrary state.

Background

This section describes the FVM as it exists today. For more context and history, you may want to read about the IPLD memory model.

IPLD API

The FVM’s exposes 4 syscalls for reading/writing IPLD state to actors:

  1. ipld::block_create(codec, data) -> handle registers a new IPLD block with the FVM, returning a block “handle” (similar to a file descriptor).
  2. ipld::block_open(cid) -> handle “opens” the IPLD block referenced by the passed CID, returning a handle to the block.
  3. ipld::block_read(handle, offset, length) -> data reads data from an open block into the actor’s memory.
  4. ipld::block_link(handle, multihash_code, multihash_length) creates a CID for the block referenced by the given block “handle”, using the specified multihash code and length (currently limited to blake2b-256).

Additionally, it exposes 2 syscalls for manipulating the actor’s state-tree itself:

  1. self::set_root(cid) update’s the actor’s state-root CID.
  2. self::root() -> cid returns the actor’s current state-root CID.

Finally, IPLD blocks can be sent between actors by as message parameters/return values (via send::send). Such IPLD blocks are sent by-handle, not by-cid.

NOTE: the function signatures above are for illustrative purposes. The actual syscalls are defined in FIP0030.

Supported IPLD Codecs

The FVM currently supports reading (ipld::block_open) and writing (ipld::block_create) state with the following IPLD codecs (SUPPORTED_CODECS):

  1. Raw (0x55) – raw data
  2. CBOR (0x51) – cbor-encoded data
  3. DagCBOR (0x71) – cbor-encoded linked (IPLD) data

Of these codecs, only the last one, DagCBOR, can “link” to other objects.

Gas

We define 4 new gas fees:

Name Gas Description
ipld_cbor_scan_per_field 85 the fee charged per CBOR field read
ipld_cbor_scan_per_cid 950 the fee charged per CID read while parsing CBOR
ipld_link_tracked 550 the fee charged per “reachable” CID tracked
ipld_link_checked 500 the fee charged per “reachable” CID checked

Read on for how these fees will be applied.

IPLD blocks may link to (reference by CID) IPLD blocks with a SUPPORTED_CODECS using either either of the following multihash constructions:

  1. A 32-byte Blake2b-256 hash (code 0xb220). E.g., a cid of the form cidv1-CODEC-blake2b256-32-DIGEST.
  2. A “identity” hash (code 0x0) with up to a 64 byte digest, “inlining” the referenced block into the CID itself.

We will call these “allowed CIDs”.

Additionally, FVM block link analysis will ignore CIDs of sealed and unsealed commitments, as long as the multihash digest is less than 64 bytes. These CIDs already exist in the Filecoin state-tree, but the data they refer to (sectors and pieces) aren’t considered to be a part of the state-tree itself.

  • FILCommitmentSealed (0xf102)
  • FILCommitmentUnsealed (0xf101)

We will call these “ignored CIDs”.

For IPLD block link analysis, we define a function that takes in a codec and an IPLD block, and returns the CIDs of all blocks directly “reachable” (linked from) said block. That is:

func ListReachable(codec u64, block []byte) ([]Cid, error)

This function will parse the provided block according to the specified codec (see below) and:

  1. If it encounters an “allowed CID”:
    1. If the CID uses the identity hash function (inlines a block), this function will not return the CID directly but will instead recursively parse the inlined block.
    2. Otherwise, this function will emit the CID.
  2. Any “ignored CIDs” will be skiped and ignroed.
  3. Any other CIDs will cause this function to signal an error.

Codec: Raw & CBOR

Raw & (non-IPLD) CBOR blocks contain no links and will not be parsed. ListReachable will simply return an empty list of CIDs and will never error.

Codec: DagCBOR

For DagCBOR, we read the CBOR field-by-field according to the CBOR specification with minimal validation. Importantly, we do not require canonical CBOR although we do reject indefinite-length fields.

We start by setting the expected number of fields to 1. If this number would ever exceed 2^64, we abort processing the block with an error.

Then, while the expected number of fields is non-zero (and we are not out of gas), we:

  1. Charge gas (ipld_cbor_scan_per_field) for reading a single CBOR field.
  2. Decrement the expected number of fields by 1.
  3. Read the CBOR field “header” (major type + immediate value):
    1. We read one byte where the first 3 bits (0-7) are the major type and the remaining 5 are the additional information.
    2. If the additional information is in the range 0-23 inclusive, we treat it as the immediate value.
    3. If the additional information is 24, 25, 26, 27; we read an additional 1, 2, 4, 8 bytes respectively; decode said bytes as a big-endian integer; and treat that as the immediate value. We do not validate that such integers are “minimally encoded” (e.g., 0x1 could be encoded in 8 bytes).
    4. If the additional information is greater than 27, we treat the CBOR as malformed. Importantly, this means that the FVM will not support indefinite length strings, maps, arrays, etc.
  4. If the major type is 0 (integer), 1 (negative integer), or 7 (special), we continue.
  5. If the major type is 2 (byte string) or 3 (string), we seek forward “immediate value” bytes and continue. We do not validate the string’s encoding.
  6. If the major type is 4 (array) we add the immediate value to the expected number of fields, and continue.
  7. If the major type is 5 (map) we add two times the immediate value to the expected number of fields, and continue. We do not or restrict the allowed key/value types, nor do we validate the order of keys, nor do we check for duplicates.
  8. If the major type is 6 (tag):
    1. If the immediate value is 42, we:
      1. Charge gas (ipld_cbor_scan_per_cid) for reading a single CID from a CBOR field.
      2. Read the next CBOR header as described in step 3.
      3. If the field is not a byte-string, abort with an error.
      4. If the field does not start with a single 0x0 byte, abort with an error.
      5. We attempt to parse the rest of the byte-string as a CID with a maximum hash digest size of 64 bytes. If that fails, we abort with an error.
      6. Finally, we handle the CID per the definition of the ListReachable function and continue.
    2. Otherwise, we add we add 1 to the number of expected fields and continue.

Finally, we abort with an error if either:

  1. We reach the end of the block unexpectedly.
  2. The number of expected fields reaches zero before we reach the end of the block.

Reachable Set

Next, we define a “reachable set”. This is the set of CIDs that can currently be “opened” by a running actor instance.

Importantly, this set is per-actor-instance, not global. For example, if actor A calls actor B calls back into actor A, there are three distinct “reachable sets”, one for each actor invocation.

ipld::block_create

On ipld::block_create(codec, data) -> handle, the FVM:

  1. Validates that the codec is in the SUPPORTED_CODECS set.
  2. Calls ListReachable(codec, data), charges ipld_link_checked per CID, then validates that all returned CIDs are currently in the reachable set.
    1. If this function returns an error, the syscall fails with Serialization.
    2. If this function returns any CIDs not currently in the reachable set, the syscall fails with NotFound.
  3. Continues to create the block as usual, recording the reachable CIDs alongside the block.

ipld::block_open

On ipld::block_open(cid) -> handle, the FVM:

  1. Validates that the requested CID is in the reachable set, charging ipld_link_checked gas.
  2. Loads the block from the client’s blockstore.
  3. Calls ListReachable on the newly opened block, adding the referenced CIDs to the reachable set and charging , ipld_link_tracked gas per CID referenced.

NOTE: The CID must be blake2b-256, not an identity hash. Actors must handle blocks inlined into identity hashes internally as said CIDs are not tracked in the reachable set.

On ipld::block_link(handle, hash_code, hash_len) -> cid, the FVM:

  1. Validates that the handle references an open block and that the hash code/len are exactly blake2b-256 and 32 respectively.
  2. Puts the block into the client’s blockstore (or, at least, a write buffer for said blockstore).
  3. Adds the newly created CID to the reachable set, charging ipld_link_tracked gas.

self::root

On self::root() -> cid, the FVM:

  1. Adds the root CID to the reachable set, charging ipld_link_tracked gas.
  2. Returns the root CID.

self::set_root

On self::set_root(cid), the FVM:

  1. Validates that the cid is in the reachable set, charging ipld_link_checked gas.
  2. Updates the actor’s state-root.

NOTE: The root CID must be a blake2b-256 CID, not an identity-hashed inlined block.

send::send

On send::send(..., parameters_handle, ...) -> (..., return_handle, ...), the FVM:

  1. Validates that parameters_handle is a valid IPLD block handle.
  2. Copies the referenced block into the receiving actor’s open block table.
  3. Adds the CIDs reachable from the sent (parameters) block to the receiving actor’s reachable set (charging ipld_link_tracked gas per CID). This operation requires no additional parsing as the reachable CIDs are already recorded alongside the block.
  4. Invokes the receiving actor.
  5. On return, copies the returned block into the caller’s open block table.
  6. Adds the CIDs reachable from the returned block to the caller’s reachable set (charging ipld_link_tracked gas per CID).
  7. Returns to the caller.

Design Rationale

Per-Instance v. Per-Message “reachable” Sets

This FIP could have proposed a single “reachable” set maintained across an entire top-level message invocation. In many ways this would have been simpler however:

  1. Behavior in some actor A would influence the reachability of IPLD blocks in some unrelated block B just because A was invoked before B.
  2. This FIP is written in such a way to enable future optimizations where temporary IPLD blocks may be garbage collected in-between calls if/when they don’t end up linked into the actor’s state (potentially leading to significant gas/memory savings). Any form of “global” reachable set would make this kind of optimization impossible.

Inlined blocks as state-root CIDs

This FIP forbids “inlining” blocks into state-root CIDs as we’d otherwise need to perform IPLD Link Analysis on the inlined block when calling ipld::set_root.

Ignoring v. Rejecting Unknown CIDs

The link analysis algorithm defined in this FIP rejects CIDs with unknown codecs and/or hash functions instead of simply ignoring such CIDs for better forward-compatibility. This way, new codecs and hash functions can be supported in the future without worrying that such CIDs may already be present in the state (where they would have previously not been covered by link analysis).

Native v. Wasm

We previously considered performing link analysis in a Wasm module to:

  1. Take advantage of our existing Wasm gas accounting rather than manually charging for gas based on the CBOR’s structure.
  2. Ensure that all implementations used the exact same Wasm parsing and validation logic.

However, this would have increased implementation complexity and introduced additional runtime costs (switching in and out of Wasm isn’t free).

Instead, for improved performance and simplicity, we chose to make the CBOR parsing logic as simple as possible, forgoing any validation beyond what was strictly necessary to extract the CIDs from a block. This let us define a simple gas model based solely on the number of CBOR fields and CIDs present in the block.

Validation

We chose not to perform extensive CBOR validation when creating new DagCBOR blocks and instead do the minimum validation required to extract links. We only guarantee that all written DagCBOR blocks are correctly structured: have valid CBOR major types and can be traversed, field by field.

We made this choice for a few reasons:

  1. Performance. Fully validating CBOR would require validating UTF-8 (walking all strings byte by byte), map key order, etc.
  2. Security. The validation logic would become a part of the spec. Any deviation from said validation logic in any client implementation would cause forks. Any bug would either reject valid blocks or accept invalid blocks (betraying the user’s expectation that all blocks in the state-tree are fully validated CBOR).

To highlight this point:

  1. The DagCBOR spec has been tweaked many times and may change again in the future.
  2. The validation rules are complicated.
  3. Most DagCBOR implementations don’t (or haven’t until recently) correctly implemented this spec.

So, instead, we went for a small but easily verifiable implementation that can be concisely specified with no strange edge-cases and/or exceptions.

Backwards Compatibility

This FIP requires a network upgrade to account for changes in the gas model, but it’s otherwise backwards compatible and will require no state migration.

Test Cases

https://github.com/filecoin-project/ref-fvm/pull/1824

Security Considerations

This FIP primarily aims to increase the security of the FVM and guard against malicious and/or buggy Wasm actors. However, as with any code:

  1. This FIP increases the complexity of the protocol, potentially introducing bugs.
  2. This FIP proposes algorithms to parse potentially user-specified data. If the parsing algorithms and gas costs are not carefully designed/implemented, this could introduce an attack vector.

Incentive Considerations

This FIP will slightly increase the gas costs of reading/writing state (due to the IPLD link analysis). However, it is not expected that these costs will be significant: A single IPLD read costs at least 200k gas while link analysis is 85 gas per CBOR field and ~1500 gas per CID (both for reads and writes).

E.g., the Miner actor’s root state has 16 CBOR fields (15 fields plus the object itself) and 7 links leading to ~11.7k gas. The current cost of opening the miner state-root is ~220k gas, so the read cost increase is ~5.3%. Write cost overhead should be negligible as we already charge 3340gas/byte.

Product Considerations

This is a necessary step towards allowing arbitrary Wasm/IPLD actors.

Implementation

https://github.com/filecoin-project/ref-fvm/pull/1824

Copyright and related rights waived via CC0.

Citation

Please cite this document as:

Steven, "FIP-0071: Deterministic State Access," Filecoin Improvement Proposals, no. 0071, July 2023. [Online serial]. Available: https://fips.filecoin.io/fips/fip-0071.