Q: By what process do we identify content?
A: By generating a CUUID, i.e. a Content-addressed Universally Unique Identifier.
This is an early work in progress. There are always issues.
Define a commitment identifier that is secure, interoperable, evolvable, and cute.
A CUUID is a UUIDv8 that is derived from some content, i.e. what the UUID RFC calls a name-based UUID for names within a namespace. This document defines several kinds of CUUIDs and their corresponding namespaces.
Sometimes there is data, and it is big and unwieldy and very uncute.
So we give the data an identifier that is usually smaller, in some text format that is likely to be human-readable, unique, and usable in many contexts. That way we can store the commitment as an identifier that references the data instead of storing the data itself.
For decades now, folks have used UUIDs for this purpose, i.e. for universally unique identifiers. However, some UUIDs are more powerful than others. These content-addressing UUIDs, i.e. CUUIDs, can be derived from any given data stream, which makes them useful as data commitments.
It would be useful to have a well-defined and secure content-addressing UUID scheme, and this document defines one that uses a more secure cryptographic hash function than the normatively defined UUIDv3 and UUIDv5 schemes in RFC 9562.
CUUIDs are UUIDv8s that verify against some namespace + some name (i.e. some data) using one of the following schemes:
Note that every CUUID verifies only against both the namespace and the name. CUUIDs allow for a number of data commitments to different kinds of data, each with its own specified namespace value, e.g. Data Stream CUUIDs, Canonicalized Media CUUIDs, and URL CUUIDs.
CUUIDs are inspired by UUIDv3 (MD5) and UUIDv5 (SHA-1), but aim to use more modern cryptography while conforming to the latest recommendation from RFC 9562 that name-based UUIDs using new cryptographic hash functions (e.g. SHA-2) should use UUIDv8.
CUUID-conformant implementations MUST support verification of CUUID-UUIDv8-SHA2 CUUIDs.
A CUUID-UUIDv8-SHA2 is identical to the UUIDv8 format described in RFC 9562 Appendix B.2, "Example of a UUIDv8 Value (Name-Based)".
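As a rough illustration of the derivation in RFC 9562 Appendix B.2, the sketch below hashes the namespace bytes followed by the name with SHA-256, keeps the first 128 bits, and overwrites the version and variant fields. The helper names `cuuid_sha2` and `verify_cuuid` are hypothetical. Treating the hash input as the namespace bytes followed by the name bytes is an assumption based on that appendix, not something this document spells out.

```python
import hashlib
import uuid


def cuuid_sha2(namespace: uuid.UUID, name: bytes) -> uuid.UUID:
    """Sketch of a name-based UUIDv8 using SHA-256 (hypothetical helper)."""
    # Assumption: the hash input is the 16 namespace bytes followed by the name.
    digest = hashlib.sha256(namespace.bytes + name).digest()
    raw = bytearray(digest[:16])      # keep the first 128 bits of the digest
    raw[6] = (raw[6] & 0x0F) | 0x80   # set the 4-bit version field to 8
    raw[8] = (raw[8] & 0x3F) | 0x80   # set the 2-bit variant field to 0b10
    return uuid.UUID(bytes=bytes(raw))


def verify_cuuid(candidate: uuid.UUID, namespace: uuid.UUID, name: bytes) -> bool:
    """A candidate verifies only against the exact namespace and name."""
    return candidate == cuuid_sha2(namespace, name)


# Same inputs as the RFC 9562 B.2 example: the DNS namespace and a host name.
example = cuuid_sha2(uuid.NAMESPACE_DNS, b"www.example.com")
print(example, example.version)
```

Because the version and variant fields take up 6 bits, 122 bits of the digest survive in the resulting identifier.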
Data Stream CUUIDs may be used to commit to untyped data of any size.
A Data Stream CUUID MUST be a CUUID derived with namespace 026d1093-7ee7-570b-b78c-add35fa5ec5b. The name may be any binary data.
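A minimal sketch of deriving a Data Stream CUUID from in-memory bytes might look like the following. The function name `data_stream_cuuid` is hypothetical, and the SHA-256 derivation (namespace bytes followed by the data, truncated to 128 bits) is assumed from the RFC 9562 B.2 pattern rather than defined here.

```python
import hashlib
import uuid

# Namespace value taken from the requirement above.
DATA_STREAM_NAMESPACE = uuid.UUID("026d1093-7ee7-570b-b78c-add35fa5ec5b")


def data_stream_cuuid(data: bytes) -> uuid.UUID:
    """Hypothetical helper: commit to arbitrary untyped bytes."""
    # Assumption: SHA-256 over namespace bytes followed by the data.
    digest = hashlib.sha256(DATA_STREAM_NAMESPACE.bytes + data).digest()
    raw = bytearray(digest[:16])
    raw[6] = (raw[6] & 0x0F) | 0x80   # version 8
    raw[8] = (raw[8] & 0x3F) | 0x80   # variant 0b10
    return uuid.UUID(bytes=bytes(raw))


empty = data_stream_cuuid(b"")
hello = data_stream_cuuid(b"hello")
print(empty, hello)
```

Any byte string is a valid name, including the empty one, so even an empty stream gets a well-formed identifier.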
Media CUUIDs are commitments to content with an associated media type.
A Media CUUID MUST be a CUUID derived with a namespace which is the UUIDv5 of the URL of the media type specification. For example, one could use a URL like https://www.iana.org/assignments/media-types/application/cbor which would correspond to the URL UUIDv5 088a66e5-2323-5f85-bb2a-dd8c23a8d388.
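The namespace rule above can be sketched with the standard library's `uuid.uuid5` and the RFC 9562 URL namespace. The helper names `media_type_namespace` and `media_cuuid` are hypothetical, and the SHA-256 step is again assumed from the RFC 9562 B.2 pattern. Printing the namespace lets a reader compare it against the value quoted in the text.

```python
import hashlib
import uuid


def media_type_namespace(spec_url: str) -> uuid.UUID:
    """UUIDv5 of the media type specification URL, in the URL namespace."""
    return uuid.uuid5(uuid.NAMESPACE_URL, spec_url)


def media_cuuid(spec_url: str, content: bytes) -> uuid.UUID:
    """Hypothetical helper: commit to content under a media type namespace."""
    namespace = media_type_namespace(spec_url)
    # Assumption: SHA-256 over namespace bytes followed by the content.
    digest = hashlib.sha256(namespace.bytes + content).digest()
    raw = bytearray(digest[:16])
    raw[6] = (raw[6] & 0x0F) | 0x80   # version 8
    raw[8] = (raw[8] & 0x3F) | 0x80   # variant 0b10
    return uuid.UUID(bytes=bytes(raw))


CBOR_URL = "https://www.iana.org/assignments/media-types/application/cbor"
cbor_namespace = media_type_namespace(CBOR_URL)
print(cbor_namespace)

# 0xA0 is the CBOR encoding of an empty map, used here as sample content.
print(media_cuuid(CBOR_URL, b"\xa0"))
```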
When a CUUID is used in the context of some explicit media type, it may be a Canonicalized Media CUUID.
Canonicalized Media CUUIDs are Media CUUIDs for media data in its canonicalized form.
For example, the JSON values { "a": 1 } and {"a":1} would parse the same, and both have the same result after applying RFC 8785 JSON Canonicalization Scheme (JCS). The two example JSON values have different Data Stream CUUIDs but identical application/json Canonicalized Media CUUIDs.
A Canonicalized Media CUUID MUST be derived from binary data that has been canonicalized according to the canonicalization scheme matching its media type in the Media Type Canonicalization Scheme Registry below.
A Data CUUID that cannot be recreated from data with the same bits as the output of a registered canonicalization scheme is not a Canonicalized Media CUUID.
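The JSON example above can be sketched as follows. This is only an approximation: `json.dumps` with sorted keys and compact separators matches RFC 8785 for simple inputs like these, but it is not a full JCS implementation (number serialization and key ordering rules differ in edge cases). The helper names, the application/json specification URL, and the SHA-256 derivation are all assumptions made for illustration.

```python
import hashlib
import json
import uuid

DATA_STREAM_NAMESPACE = uuid.UUID("026d1093-7ee7-570b-b78c-add35fa5ec5b")
# Assumption: the JSON specification URL follows the IANA pattern used for CBOR.
JSON_NAMESPACE = uuid.uuid5(
    uuid.NAMESPACE_URL,
    "https://www.iana.org/assignments/media-types/application/json",
)


def cuuid_sha2(namespace: uuid.UUID, name: bytes) -> uuid.UUID:
    """Hypothetical helper: SHA-256 name-based UUIDv8."""
    raw = bytearray(hashlib.sha256(namespace.bytes + name).digest()[:16])
    raw[6] = (raw[6] & 0x0F) | 0x80   # version 8
    raw[8] = (raw[8] & 0x3F) | 0x80   # variant 0b10
    return uuid.UUID(bytes=bytes(raw))


def approx_jcs(text: str) -> bytes:
    """Approximate canonical form; NOT a complete RFC 8785 implementation."""
    value = json.loads(text)
    canonical = json.dumps(
        value, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return canonical.encode("utf-8")


a = '{ "a": 1 }'
b = '{"a":1}'

# The raw byte streams differ, so their Data Stream CUUIDs differ.
stream_a = cuuid_sha2(DATA_STREAM_NAMESPACE, a.encode("utf-8"))
stream_b = cuuid_sha2(DATA_STREAM_NAMESPACE, b.encode("utf-8"))

# After canonicalization the bytes match, so the Media CUUIDs match.
canon_a = cuuid_sha2(JSON_NAMESPACE, approx_jcs(a))
canon_b = cuuid_sha2(JSON_NAMESPACE, approx_jcs(b))
print(stream_a != stream_b, canon_a == canon_b)
```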
Note: registrations with provisional status are subject to change.
Media Type | Canonicalization | Status |
---|---|---|
application/json | RFC 8785 JSON Canonicalization Scheme (JCS) | final |
application/n-quads | W3C RDFC-1.0 | final |
application/cbor | dCBOR: A Deterministic CBOR Application Profile | provisional pending RFC |
URL CUUIDs are CUUIDs that are derived from a URL.
URL CUUIDs MUST be name-based UUIDs that use namespace 6ba7b811-9dad-11d1-80b4-00c04fd430c8, i.e. the namespace registered for URLs in RFC 9562.
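A URL CUUID could be derived along these lines. Python's `uuid.NAMESPACE_URL` constant carries the same value as the namespace named above. The function name `url_cuuid` is hypothetical, and both the SHA-256 derivation and the choice to hash the URL's UTF-8 bytes without any normalization are assumptions.

```python
import hashlib
import uuid

# Namespace from the requirement above; the standard library exposes the same value.
URL_NAMESPACE = uuid.UUID("6ba7b811-9dad-11d1-80b4-00c04fd430c8")


def url_cuuid(url: str) -> uuid.UUID:
    """Hypothetical helper: commit to a URL string."""
    # Assumption: the name is the URL encoded as UTF-8, with no normalization.
    digest = hashlib.sha256(URL_NAMESPACE.bytes + url.encode("utf-8")).digest()
    raw = bytearray(digest[:16])
    raw[6] = (raw[6] & 0x0F) | 0x80   # version 8
    raw[8] = (raw[8] & 0x3F) | 0x80   # variant 0b10
    return uuid.UUID(bytes=bytes(raw))


print(URL_NAMESPACE == uuid.NAMESPACE_URL)
print(url_cuuid("https://example.com/"))
```

Without a normalization step, URLs that differ only in case or a trailing slash produce different identifiers, so callers would need to agree on a single spelling.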
CUUID URNs are URNs in the uuid namespace like those in RFC 9562 Figure 4.
Example CUUID URN: urn:uuid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6
The urn URI Scheme is registered with IANA in IETF RFC 8141.
The uuid URN namespace is registered with IANA in IETF RFC 9562.
Because all URNs are URIs, all CUUID URNs MAY also be considered CUUID URIs.
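Converting between a UUID and its URN form needs nothing beyond Python's standard `uuid` module, as the short sketch below suggests. The function names are hypothetical, and the example URN from the text is used purely for its syntax.

```python
import uuid

URN_PREFIX = "urn:uuid:"


def to_cuuid_urn(value: uuid.UUID) -> str:
    """Render a UUID in the urn:uuid: form."""
    return value.urn


def from_cuuid_urn(urn: str) -> uuid.UUID:
    """Parse a urn:uuid: string, rejecting anything without that prefix."""
    if not urn.lower().startswith(URN_PREFIX):
        raise ValueError(f"not a uuid URN: {urn!r}")
    return uuid.UUID(urn[len(URN_PREFIX):])


example_urn = "urn:uuid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6"
parsed = from_cuuid_urn(example_urn)
print(to_cuuid_urn(parsed))
```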
In order to be secure, a commitment scheme must commit to data, and that commitment (usually) must not commit to additional data besides the originally committed-to data. Otherwise, we'd call it an insecure commitment scheme.
There is a tradeoff between CUUID size and the complexity of content it can commit to. An identifier of size zero commits to nothing and everything. An identifier too big isn't cute enough for our acute purpose, and on top of that it may hinder interoperability.
Note that the UUID specification limits CUUIDs to 128 bits in size, which constrains their bit security and resilience to preimage attacks. This constraint did not prevent UUIDv3 and UUIDv5 from being useful, even though their hash functions are weaker than the one used by CUUID-UUIDv8-SHA2.
It's hard to be perfectly secure. The goal should be to:
UUIDs have been interoperating for 40 years.
UUIDs have been standardized by the UN/ITU, and represent the consensus of the IETF community. They've received public review and been approved for publication by the Internet Engineering Steering Group (IESG).
If you've never heard of them before, they're easy to get the gist of quickly. They're ergonomic, and we're already used to seeing them everywhere. They work well in URLs, filesystems, common databases, etc. Hakuna Matata.
CUUIDs are evolvable because UUIDv8 is evolvable and the syntax of URIs is evolvable.
UUIDv8 is evolvable because UUIDs are self-describing. A UUID has var (variant) and ver (version) fields that respectively indicate the overall bit layout variant (e.g. non-IETF UUIDs) and, for IETF UUIDs, which version of IETF UUID it is.
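To make the var and ver fields concrete, the sketch below reads them straight out of the 128-bit integer and compares the result with what Python's `uuid` module reports. The function name `read_ver_and_var` is hypothetical, and the sample value is the example UUID from the URN section.

```python
import uuid


def read_ver_and_var(value: uuid.UUID) -> tuple[int, int]:
    """Return (ver, var) extracted directly from the UUID's bits."""
    ver = (value.int >> 76) & 0xF   # 4-bit version: high nibble of byte 6
    var = (value.int >> 62) & 0x3   # top 2 bits of byte 8; 0b10 is the IETF layout
    return ver, var


sample = uuid.UUID("f81d4fae-7dec-11d0-a765-00a0c91e6bf6")
ver, var = read_ver_and_var(sample)
print(ver, bin(var), sample.version, sample.variant)
```

A verifier can use these fields to decide up front whether a given UUID could be a UUIDv8 at all before attempting any hash comparison.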
URIs are evolvable because their semantics are identified by URI Scheme, and URI Scheme evolution is extensible via registration with IANA.
Alice has a local file system with many files.
Alice also has access to one or more file metadata services she can query for related information about any file.
Some of Alice's files are really big: terabytes in size. It's expensive for her to send the file to each file metadata service.
Alice and her file metadata services want to agree on a standard deterministic file identifier they can use to talk about any file without having to pass big files around. However, they want to make sure adopting these identifiers doesn't increase the risk of a malice-in-the-middle attack where the file is mutated by an intermediary.
Alice and her file metadata services can all adopt Data Stream CUUIDs as a deterministic file identifier. Because it is derived from a cryptographic commitment to the underlying file data, the Data CUUID can also be used as a checksum to verify that any data identified as the Data CUUID has not been tampered with. Alice can fetch data by the CUUID, verify it, and only if it verifies, use the data without worrying about it containing maliciously modified data.
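For files too large to hold in memory, the check Alice performs could be sketched by feeding the data to the hash in chunks. The helper names below are hypothetical, and the SHA-256 derivation over namespace bytes followed by the file bytes is assumed from the RFC 9562 B.2 pattern. An in-memory stream stands in for a real file so the sketch stays self-contained.

```python
import hashlib
import io
import uuid
from typing import BinaryIO, Iterable, Iterator

DATA_STREAM_NAMESPACE = uuid.UUID("026d1093-7ee7-570b-b78c-add35fa5ec5b")


def read_chunks(stream: BinaryIO, size: int = 1 << 20) -> Iterator[bytes]:
    """Yield the stream's contents in fixed-size pieces."""
    while chunk := stream.read(size):
        yield chunk


def cuuid_from_chunks(chunks: Iterable[bytes]) -> uuid.UUID:
    """Hypothetical helper: Data Stream CUUID computed incrementally."""
    hasher = hashlib.sha256(DATA_STREAM_NAMESPACE.bytes)
    for chunk in chunks:
        hasher.update(chunk)
    raw = bytearray(hasher.digest()[:16])
    raw[6] = (raw[6] & 0x0F) | 0x80   # version 8
    raw[8] = (raw[8] & 0x3F) | 0x80   # variant 0b10
    return uuid.UUID(bytes=bytes(raw))


def verifies(expected: uuid.UUID, stream: BinaryIO) -> bool:
    """True only if the fetched bytes hash to the expected identifier."""
    return cuuid_from_chunks(read_chunks(stream)) == expected


original = b"example file contents"
expected = cuuid_from_chunks([original])

print(verifies(expected, io.BytesIO(original)))          # untouched copy
print(verifies(expected, io.BytesIO(original + b"!")))   # modified copy
```

The identifier depends only on the concatenated bytes, not on how they are split into chunks, so Alice and each service may pick whatever read size suits them.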