For the last few months, I've been thinking about data portability in ActivityPub and, more generally, throughout the web. In doing so I've been trying to synthesize various patterns from Dmitri and bumblefudge's FEP-7952: Roadmap for Actor and Object Portability, Lisa's LOLA Portability for Activity Pub, evanpro's Data Portability in ActivityPub, with ideas I was immersed in while working for Protocol Labs from 2022-2024 on web3.storage and IPFS (e.g. IPLD, which bsky.app's AT Protocol relies on).
For me, Web 3.0 has always been about data portability.
Almost 20 years ago, when I was a teenager, I was reading things like this from 2006[^1]:
"People keep asking what Web 3.0 is," Berners-Lee said. "I think maybe when you've got an overlay of scalable vector graphics - everything rippling and folding and looking misty - on Web 2.0 and access to a semantic Web integrated across a huge space of data, you'll have access to an unbelievable data resource." Said Sheehan: "I believe the semantic Web will be profound. In time, it will be as obvious as the Web seems obvious to us today."
― A 'more revolutionary' Web, nytimes.com 2006
(Let's just ignore for now that "Web 3.0" got detourned into "Web3" by cryptobros a mere 8 years later.)
Despite the temptations for so-called "web2" companies (well, and most web3 companies), whose business incentives tend to lead toward various kinds of Vendor Lock-in, the revolutionary potential of "Web 3.0", and really "Web 1.0" (before venture capitalists contrived "Web 2.0" just like they contrived "web3" and "Web5"), is building technologies that allow end-users to escape the powerful forces that inevitably converge toward locking in what Julia Netter calls digital bodies.
Whether the lock is attached to the chains of a single SaaS vendor, a standard controlled by a single corporation, or a single blockchain's idiosyncrasies, it's hard to escape chains and lock-in. But not impossible. In my view, it's less about the technology storing the data, or even the mechanics of how a protocol says to pass it around. I think it's more about socially constructing syntax and semantics for how a broad ecosystem can interpret some information.
In July 2014, w3.org chartered a Social Web Working Group and the first use case listed was:
User control of personal data: Some users would like to have autonomous control over their own social data, and share their data selectively across various systems.
and the first deliverable:
Social Data Syntax
A JSON-based syntax to allow the transfer of social information, such as status updates, across differing social systems. One input to this deliverable is ActivityStreams 2.0
Historical note: most of ActivityStreams 2.0 was done by James Snell and published at IETF before W3C was ever involved and I'd been implementing it at my dayjob at Livefyre.
I share this as context to make the point that, in ActivityPub, data portability comes largely from the Social Data Syntax, not the network protocols.
The data syntax and semantics enable, amongst other things, data portability and a good federation protocol.
But the various endeavors above are trying to solve data portability primarily via network protocols. I've been wondering whether that's the right layer on which to tackle this challenge, or whether some enhancement at the data syntax layer might help.
There are quite a few things that make social data portability hard in the ActivityPub ecosystem.
To some extent, data portability would be a lot easier if only end-users would use ActivityPub implementations that implement the protocol fully and conformantly, especially the Client to Server Interactions.
One of the whole goals of W3C's Social WG was to deliver a Social API, i.e.:
A document that defines a specification for a client-side API that lets developers embed and format third party information such as social status updates inside Web applications.
To a large extent, ActivityPub did deliver this. It describes HTTP APIs for a client to fetch collections of activities and objects, e.g. an actor's outbox (stuff they post) and an actor's inbox (stuff others posted to them). This enables data portability away from any particular server, because anyone can just write a script or make a GUI app that fetches all the items from an actor's outbox and inbox, and help the end-user port them wherever they want!
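To make that concrete, here's a minimal sketch of such a script in TypeScript. The actor URL is a placeholder, and this assumes the server serves the outbox as a (possibly paginated) ActivityStreams OrderedCollection; a real server may also require authorization before it will serve the full collection.

```typescript
// Minimal sketch: walk an actor's outbox and collect every item.
type AS2Object = Record<string, unknown>;

const AS2_ACCEPT =
  'application/ld+json; profile="https://www.w3.org/ns/activitystreams"';

// Collections can reference pages by IRI or embed them as objects.
const iriOf = (x: unknown): string | undefined =>
  typeof x === "string" ? x : ((x as AS2Object | undefined)?.id as string | undefined);

async function getJson(url: string): Promise<AS2Object> {
  const res = await fetch(url, { headers: { Accept: AS2_ACCEPT } });
  if (!res.ok) throw new Error(`GET ${url} failed with ${res.status}`);
  return (await res.json()) as AS2Object;
}

export async function exportOutbox(actorUrl: string): Promise<AS2Object[]> {
  const actor = await getJson(actorUrl);
  const outbox = await getJson(iriOf(actor.outbox)!);
  const items: AS2Object[] = [];
  // Either the collection embeds orderedItems directly, or it links to pages.
  let page: AS2Object | undefined = outbox.orderedItems
    ? outbox
    : outbox.first
      ? await getJson(iriOf(outbox.first)!)
      : undefined;
  while (page) {
    items.push(...(((page.orderedItems ?? page.items) as AS2Object[] | undefined) ?? []));
    page = page.next ? await getJson(iriOf(page.next)!) : undefined;
  }
  return items;
}

// Usage (placeholder actor id):
// exportOutbox("https://server.example/users/alice").then((items) => console.log(items.length));
```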
Note: I'm going to focus a bit on Mastodon here because 1) it's what many people use and 2) it's what I personally use day-to-day. I want to be accurate, but my goal here isn't to pick on Mastodon in particular; it is to illustrate that any implementation playing fast and loose with protocol conformance and marketing presents a challenge for end-user understanding and data portability.
Many end-users use Mastodon servers, and, unfortunately, Mastodon is not an ActivityPub conformant Server. According to nightpool et al, "Mastodon does not support the ActivityPub standard C2S API (GET inbox)" (which would enable data portability). If I understand the docs correctly, Mastodon also doesn't implement any handler for POST requests to an actor's outbox, for which several requirements are described in ActivityPub section 6.
What that means is that end-users that use Mastodon for their 'open social web' server are actually still experiencing Vendor lock-in (to Mastodon) that ActivityPub was supposed to solve!
What's more, not only do Mastodon users not have access to the in-protocol mechanism for data export, the custom export/import process they came up with also doesn't fulfill end-user expectations. Yes, users can go into their Mastodon settings and export their posts. Unfortunately, they can't actually import those posts to another implementation, not even another Mastodon instance.
This was surprising to me the first time I used Mastodon's export. Erin Kissane's "Notes from Mastodon Migration" chronicles a similar disappointment.
So… yes, you can move your account. The process isn’t that difficult. But even if it works well, which it doesn’t always, you lose a lot—more than I think is reasonable to ask of people who just want to hang out with their friends.
https://docs.joinmastodon.org/user/moving/ says
Mastodon currently does not support importing posts or media due to technical limitations, but your archive can be viewed by any software that understands how to parse Activity Streams 2.0 documents.
Now, Mastodon is great and I use it regularly, but for the sake of end-users and developers, I feel obligated to point out very clearly that it is a bit misleading that joinmastodon.org says it is "Interoperable" and "Built on open web protocols, Mastodon can speak with any other platform that implements ActivityPub."
This is at best a partial truth. Mastodon can't speak with ActivityPub conformant Clients that rely on the requirements related to writing to ActivityPub outboxes. For example, I believe this is why evanpro's client ap and the AndStatus Android app are unable to write to Mastodon servers in a standard way.
This lack of conformance has ripple effects throughout the ecosystem. Because of Mastodon's market share but lack of protocol conformance and interoperability, developers of open social web applications have had a hard choice to make for 5 years now: do you build for ActivityPub's protocol, or do you build for the Mastodon Protocol? The former is fairly well-specified and has consensus-driven governance via W3C / FEPs, but is not implemented by Mastodon. The latter has all the end-users. I can't say I blame indie devs and small businesses who want to build for real users for building "Mastodon Apps" and not ActivityPub Apps. At the same time, end-users may suffer in the long run as they realize they are locked into Mastodon's ecosystem. Despite what they were perhaps-falsely advertised on joinmastodon.org ("Mastodon can speak with any other platform that implements ActivityPub"), they will soon find out that their friends who chose a product built on the ActivityPub specification may not actually interoperate with the Mastodon server (or "Mastodon App") they chose.
For the sake of end-users (like me), I wish Mastodon would implement the entirety of ActivityPub and deliver on the promises of ActivityPub-based interoperation made on joinmastodon.org. I admit it's not clear to me what their incentive to do so is, since this lack of interoperability via ActivityPub is leading to a really big ecosystem around Mastodon's own protocol. The only path towards ActivityPub-based interoperation I see is more competition from other server implementations and somehow getting users to switch to them. I'd also like to see disincentives for advertising wide ActivityPub interoperability that Mastodon doesn't provide, which is one of the reasons I'm calling that out here. It's false advertising that causes confusion amongst users and other stakeholders and ultimately hurts the ActivityPub ecosystem and broader open web standards movements.
While this has long been a sort of uncomfortable truth for the more self-congratulatory parts of the Mastodon and even ActivityPub ecosystems, it is definitely not the only challenge to data portability.
To be able to port data around, we need some kind of identifier by which to refer to the data that is being ported, accessed, or otherwise used.
In the internet architecture, most protocols use Internationalized Resource Identifiers (aka IRIs) for this (or the related and more colloquially used URI or URL).
ActivityStreams2 defines an id property (equivalent to JSON-LD's @id) that must have a single IRI value.
Now, there are dozens of different URI Schemes that define the exact semantics of a URI or IRI. You're probably most familiar with https: like you often see in your web browser (or the less privacy-preserving http: we'll see in some examples).
Example 7 of activitystreams-core has an id:
```json
{
  "@context": "https://www.w3.org/ns/activitystreams",
  "id": "http://example.org/foo",
  "type": "Note",
  "name": "My favourite stew recipe",
  "attributedTo": {
    "id": "http://joe.website.example/",
    "type": "Person",
    "name": "Joe Smith"
  },
  "published": "2014-08-21T12:34:56Z"
}
```
What I mean by "Provider-dependent Identifier" is that the way this example chose to identify the note is based on the name of the HTTP server on which it (hopefully 😅) can be found (example.org).
If this were a name that Joe controlled indefinitely, like joe.org, then this is a pretty darn good way of identifying things, because it wouldn't be dependent on a specific service provider. If Joe wanted to switch server providers, he could port the data to another server with a different IP address, then change the DNS record for joe.org to point to the new IP address. Eventually, web clients would resolve joe.org to the new IP address, and as long as the new server served the same logical object at the same HTTP path, things would work and clients wouldn't even necessarily know or care that joe.org's data had been ported around.
However, many ActivityPub users do not use identifiers based on names they control indefinitely, myself included! A lot of times I link to my Mastodon posts like https://mastodon.social/@bengo/113131359767970893. This post identifier depends on the server name mastodon.social, not a domain name I control like bengo.is. Every time we link to a post like this, using an identifier with a server name that we can't later redirect to a new service provider, we further lock ourselves in to the ActivityPub service provider we currently use. And we give up one of the ways data portability becomes possible: a DNS name controller's ability to change DNS records to point at a new service provider.
So what kinds of identifiers are there that don't depend on a provider that the end-user might later want to port away from?
Instead of identifying content by the location of a provider that may serve it, identify the content by a checksum of the content itself. Content-addressable Storage (aka CAS) was used by Bell Labs in the 1980s and has more recently been popularized by BitTorrent magnet URIs and, to some extent, by ipfs. IETF has published an RFC defining a ni: URI Scheme for this as RFC6920 Naming Things with Hashes.
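For a concrete sense of what that looks like, here's a small sketch (assuming Node.js crypto) that derives an RFC 6920 ni: URI for a piece of content; anyone who later receives the content can recompute the digest and check it against the name.

```typescript
// Sketch: derive an RFC 6920 "ni" URI for some content.
// Shape: ni:///sha-256;<base64url (unpadded) SHA-256 digest of the bytes>
import { createHash } from "node:crypto";

export function niUri(content: string | Uint8Array): string {
  const digest = createHash("sha256").update(content).digest("base64url");
  return `ni:///sha-256;${digest}`;
}

// RFC 6920's own example content:
// niUri("Hello World!") === "ni:///sha-256;f4OxZX_x_FO5LcGBSKHWXfwtSx-j1ncoSt3SABJtkGk"
```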
Instead of identifying content by location or even by checksum, identify it by a name that contains enough information for others to verify any content certified by it. For example, if a name contains the public key from a public/private keypair, then the controller of the private key can create content, sign it with the private key, and append the signature to a content envelope. Readers can then check that the signature extracted from any content envelope verifies against the content in the envelope and the public key appended to the envelope. Often, the self-certifying name is a hash of a public key, in which case this is a bit of a generalization of Content Addressing: instead of content-addressing a specific unit of content, it's content-addressing a public key. Unlike most content-hash-based content addressing schemes, these self-certifying name schemes allow for associating many pieces of content with the name, e.g. allowing the controller of the name to 'Update' (or append to) the referent of the name. This technique has been popularized recently by IPNS, which itself was influenced by "Escaping the Evils of Centralized Control with self-certifying pathnames" (Mazières, 1997).
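Here's a minimal sketch of that idea using Ed25519 keys from Node's crypto module. The envelope shape and the choice to hash the SPKI-encoded public key are illustrative assumptions, not any particular standard.

```typescript
// Sketch: a self-certifying name is derived from a public key, so any content
// envelope signed by the matching private key can be verified against the name alone.
import { createHash, generateKeyPairSync, sign, verify } from "node:crypto";

const { publicKey, privateKey } = generateKeyPairSync("ed25519");
const spki = publicKey.export({ type: "spki", format: "der" });

// The name content-addresses the *key*, not any one piece of content.
const name = `ni:///sha-256;${createHash("sha256").update(spki).digest("base64url")}`;

// The key's controller can publish (and later update) signed envelopes.
const content = Buffer.from('{"type":"Note","content":"hello from my own key"}');
const envelope = { content, signature: sign(null, content, privateKey), publicKey: spki };

// Anyone who knows only `name` can verify an envelope, wherever it was fetched from.
const keyMatchesName =
  name === `ni:///sha-256;${createHash("sha256").update(envelope.publicKey).digest("base64url")}`;
const signatureOk = verify(
  null,
  envelope.content,
  { key: envelope.publicKey, format: "der", type: "spki" },
  envelope.signature
);
console.log(keyMatchesName && signatureOk); // true
```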
Decentralized Identifiers is a W3C Standard for IRIs starting with did:{method-name}:{method-specific-id} (as opposed to the common https: IRI scheme). Precise name creation, updating, and resolution techniques vary based on the did method in the IRI immediately after the did: prefix. There is a registry of did methods. DIDs can be used for self-certifying names, e.g. did:tdw.
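Just to make the shape concrete, here's a trivial sketch that splits a DID into its method name and method-specific id (the example DIDs below are illustrative).

```typescript
// Sketch: pull apart a DID's method name and method-specific id.
export function parseDid(did: string): { method: string; id: string } {
  const match = /^did:([a-z0-9]+):(.+)$/.exec(did);
  if (!match) throw new Error(`not a DID: ${did}`);
  return { method: match[1], id: match[2] };
}

// parseDid("did:web:example.com") -> { method: "web", id: "example.com" }
// parseDid("did:example:12345")   -> { method: "example", id: "12345" }
```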
This question will be explored in further research. The goal of this writing is mostly to identify the challenges with https: IRIs, and especially those whose authority is not controlled by the ActivityPub actor who created the content being linked to.
The last challenge I want to name is that it's hard to enable end-to-end portability of ActivityPub data as originated on a client, because so much of the data we read on the fediverse is actually generated by servers and not by clients.
Almost all ActivityPub servers throw away the original object submitted by the end-user's client and instead generate a new object on behalf of the end-user. Many servers do this ostensibly to be helpful, but it also makes it almost impossible to do end-to-end data integrity verification.
I already described above how nonconforming servers like Mastodon are a barrier to the data portability affordances built in to the ActivityPub protocol. Mastodon doesn't even implement the POST to outbox API, which is the only way a client that cares about end-to-end data integrity could submit a proof to the ActivityPub Outbox server. But let's say that it did. There's still another challenge. Many ActivityPub Outbox servers then add data on top of what the end-user submitted or (understandably) supplement what the client submitted with extra metadata that is specific to the server that is serving the object. This is what I'm calling a Server-generated Object.
In fact, the ActivityPub spec essentially requires that outbox servers generate objects on behalf of the end-user, for example with requirements like
If an Activity is submitted with a value in the id property, servers MUST ignore this and generate a new id for the Activity
Personally, I think this requirement to 'ignore' the client-chosen id should be reconsidered; servers should consider respecting the client-generated id as long as it is sufficiently unique, similar to how email allows clients to choose the Message-ID. This could make it easier for a client to identify data after it has been ported across several servers and/or other systems.
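To illustrate (and only as a sketch of what respecting client-chosen ids could look like, since the spec as written requires servers to ignore them), a client could mint its own globally unique id, e.g. a UUID URN, before ever contacting a server:

```json
{
  "@context": "https://www.w3.org/ns/activitystreams",
  "id": "urn:uuid:3f9a6dce-3c8b-4eb2-9c4d-8f6a2e5d9b01",
  "type": "Create",
  "actor": "https://social.example/actors/alice",
  "object": {
    "id": "urn:uuid:0d3c1d5e-7a42-4b9f-b6f1-2a9c4d8e5f10",
    "type": "Note",
    "content": "A note whose id the client chose before it ever reached a server."
  }
}
```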
Another requirement encourages server-generated activities that are distinct from what comes from the client:
The server MUST remove the bto and/or bcc properties, if they exist, from the ActivityStreams object before delivery, but MUST utilize the addressing originally stored on the bto / bcc properties for determining recipients in delivery.
The server MUST then add this new Activity to the outbox collection.
―https://www.w3.org/TR/activitypub/#client-to-server-interactions
Server-generated objects aren't invalid. They are required by the spec. But when servers throw away the original data submitted by the client, which might have carried a data integrity proof, that is a challenge to data portability: the end-user's client no longer has any checksum to verify that their destination server has the same data as their first server or, perhaps better yet, the same data their client submitted to their first server.
What do we want out of ActivityPub Data Portability? I think we want to be able to ensure end-to-end integrity of data as it originates on ActivityPub clients, and we want to ensure authenticity of the data as authored by the end-user before it ever gets sent to a server. Moreover, we should always make sure we have loose coupling between providers of end-user identity, authenticity, and social data serving, to prevent any one server implementation from locking in end-users and their data.
Data integrity is the maintenance of, and the assurance of, data accuracy and consistency over its entire life-cycle.
―https://en.wikipedia.org/wiki/Data_integrity
When exercising data portability, we want to make sure that the data ported to a target destination was not corrupted along the path from its source.
It would be really embarrassing to kick off a transfer process, see it complete, and only months later realize that some of your images had been lost along the way when a friend complains they can't make sense of your old posts. (Or worse, a malicious transfer agent or target server could insert bad words into your posts!)
One useful tool to help with data integrity verification is a Checksum, which is often created with a Cryptographic Hash. This allows for the creation of a short summary string that can be deterministically recreated from the content used to produce it. If every ActivityPub server maintained a checksum of the data it stored on behalf of an end-user, then after a data porting process, the destination server and end-user could ensure the integrity of the transferred data on the destination server by making sure the checksum was the same as on the source server.
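Here's a very rough sketch of that kind of check, assuming you can fetch the "same" object from both servers. Note that hashing raw response bytes is only illustrative; a real comparison would need a canonical serialization first, since two servers will rarely serialize identical JSON byte-for-byte.

```typescript
// Sketch: compare checksums of an object as served by the source and destination
// servers after a migration. Hashing raw bytes is only illustrative; real checks
// need a canonical serialization first.
import { createHash } from "node:crypto";

async function sha256OfUrl(url: string): Promise<string> {
  const res = await fetch(url, {
    headers: { Accept: 'application/ld+json; profile="https://www.w3.org/ns/activitystreams"' },
  });
  if (!res.ok) throw new Error(`GET ${url} failed with ${res.status}`);
  return createHash("sha256").update(new Uint8Array(await res.arrayBuffer())).digest("hex");
}

export async function sameChecksum(sourceUrl: string, destinationUrl: string): Promise<boolean> {
  const [a, b] = await Promise.all([sha256OfUrl(sourceUrl), sha256OfUrl(destinationUrl)]);
  return a === b;
}
```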
Integrity isn't only a concern in the context of data portability from server to server. A malicious ActivityPub Outbox server could tamper with the content submitted by a client as soon as it receives it. How would you detect this?
Ideally, all ActivityPub data would have some kind of data integrity proof before it ever leaves an ActivityPub Client, i.e. an integrity proof of the Client-generated Object. That way, after successfully submitting it to an ActivityPub Outbox Server, the client could verify the integrity of their submission as represented on the server, e.g. by following the link in the Location HTTP header that Outbox submission responses MUST include, then ensuring that the data served there verifies against the integrity proof created before the data was ever submitted to the outbox.
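Here's a naive sketch of that round trip. The outbox URL and bearer token are assumptions about a hypothetical c2s-conformant server, and verifyProof is left as a parameter so you can plug in a Data Integrity verifier like the one discussed below.

```typescript
// Naive sketch: submit a client-signed activity, follow the Location header the
// outbox response MUST include, and check that what the server serves still verifies.
type AS2Object = Record<string, unknown>;

const AS2_TYPE = 'application/ld+json; profile="https://www.w3.org/ns/activitystreams"';

export async function submitAndCheck(
  outboxUrl: string,
  signedActivity: AS2Object,
  token: string,
  // Placeholder: plug in a real Data Integrity proof verifier here.
  verifyProof: (object: AS2Object) => Promise<boolean>
): Promise<boolean> {
  const res = await fetch(outboxUrl, {
    method: "POST",
    headers: { "Content-Type": AS2_TYPE, Authorization: `Bearer ${token}` },
    body: JSON.stringify(signedActivity),
  });
  const location = res.headers.get("Location");
  if (!res.ok || !location) throw new Error(`outbox POST failed: ${res.status}`);

  const served = (await (
    await fetch(location, { headers: { Accept: AS2_TYPE } })
  ).json()) as AS2Object;

  // If the server regenerated the object (as described below), this check will fail.
  return verifyProof(served);
}
```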
A similar integrity check would be useful after initiating a data transfer from server to server. Either the source client or the first server (or both) could check the integrity proof against the data served by the destination server.
Unfortunately, it's not quite this easy.
If you're still reading, you may think the idea of end-to-end verification of what was submitted by the client sounds a little pedantic or overkill. Well, you're right, but also, I'm not making this up.
The 2017-09-07 Candidate Recommendation for ActivityPub explicitly described adding proofs into the JSON-LD of client-generated objects sent to an ActivityPub outbox:
ActivityPub implementations may use Linked Data Signatures and signed HTTP messages for authentication and authorization. (Linked Data Signatures are best used when authentication is meant to be "long lived" and attached to an object, such as verifying that an object truly was posted by this actor, and signed HTTP messages should be used when authentication or authorization is ephemeral.) This has the advantage of clean integration with existing JSON-LD based technologies already used by ActivityPub. However, at the time of specification, Linked Data Signatures and HTTP signatures are very young, particularly in adoption.
As part of the SocialWG process of building consensus for a final Recommendation, this language was removed, along with a lot of other text that would have helped with data integrity and authentication, because the other standards it linked to weren't done by the deadline that ActivityPub was targeting (though a small, non-hyperlinked reference to "Linked Data Signatures" remains in § 4.1 of the final ActivityPub TR).
Back then, the hyperlink would have resolved to this draft report from the W3C Digital Verification Community Group. But later, that work was folded into the excellent W3C Credentials Community Group that incubates work related to Verifiable Credentials and Decentralized Identifiers and became the Verifiable Credential Data Integrity 1.0 TR finalized this year. That means we can now rely on a stable standard for the end-to-end integrity that ActivityPub was always meant to provide, both from client-to-server and in data portability from provider to provider. Good things come to those who wait.
Example 2 in vc-data-integrity gives a preview of what an integrity proof looks like.
```json
{
  "myWebsite": "https://hello.world.example/",
  "proof": {
    "type": "DataIntegrityProof",
    "cryptosuite": "eddsa-jcs-2022",
    "created": "2023-03-05T19:23:24Z",
    "verificationMethod": "https://di.example/issuer#z6MkjLrk3gKS2nnkeWcmcxiZPGskmesDpuwRBorgHxUXfxnG",
    "proofPurpose": "assertionMethod",
    "proofValue": "zQeVbY4oey5q2M3XKaxup3tmzN4DRFTLVqpLMweBrSxMY2xHX5XTYV8nQApmEcqaqA3Q1gVHMrXFkXJeV6doDwLWx"
  }
}
```
While this proof is attached to a very simple JSON object, using it with ActivityPub would simply involve applying the same algorithm to a JSON-LD ActivityPub object that uses properties from the ActivityPub and ActivityStreams 2.0 standards.
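For instance, a client-signed Note might look something like the following. All of the values here are illustrative (including the @context needed for the proof terms), modeled on the example above rather than produced by a real signing run.

```json
{
  "@context": [
    "https://www.w3.org/ns/activitystreams",
    "https://w3id.org/security/data-integrity/v2"
  ],
  "type": "Note",
  "content": "A note signed by its author's client before any server saw it.",
  "attributedTo": "https://social.example/actors/alice",
  "proof": {
    "type": "DataIntegrityProof",
    "cryptosuite": "eddsa-jcs-2022",
    "created": "2025-01-01T00:00:00Z",
    "verificationMethod": "did:key:z6Mk...#z6Mk...",
    "proofPurpose": "assertionMethod",
    "proofValue": "z...(signature over the canonicalized Note)"
  }
}
```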
One thing that might help with the data portability challenge of server-generated objects is the conceptual distinction between the ActivityPub data submitted by the client to the ActivityPub server (and its integrity) and the data later generated by the ActivityPub server based on that submission. In the context of data portability to a new server, both the client-generated object and the server-generated object may be ported, and thus both may benefit from integrity proofs.
But, forced to choose one, I prefer to port the client-generated object from which the server-generated object was generated. A lot of the server-generated object's metadata may be obsolete after migrating to a new server, where e.g. the URLs of the replies and likes collections will all be different.
It may very well be easier for the target server of an ActivityPub data migration to use the original client-generated object instead of the resulting server-generated object. And even if not, as long as an end-user can get their original client-generated object from the server, they may be able to simply replay each of them as POSTs to the outbox of their new c2s-conformant ActivityPub Server.
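Here's a sketch of that replay, picking up from the export sketch earlier. The new outbox URL and bearer token are placeholders, and a real migration tool would need to handle rate limits, failures, and objects the destination refuses.

```typescript
// Sketch: replay previously exported client-generated activities into a new,
// c2s-conformant outbox, oldest first, so the new server can rebuild the collection.
type AS2Object = Record<string, unknown>;

export async function replayToOutbox(
  newOutboxUrl: string,
  clientGeneratedActivities: AS2Object[],
  token: string
): Promise<void> {
  for (const activity of clientGeneratedActivities) {
    const res = await fetch(newOutboxUrl, {
      method: "POST",
      headers: {
        "Content-Type": 'application/ld+json; profile="https://www.w3.org/ns/activitystreams"',
        Authorization: `Bearer ${token}`,
      },
      body: JSON.stringify(activity),
    });
    if (!res.ok) {
      // A real migration tool would record and retry these rather than bail out.
      const id = (activity.id as string | undefined) ?? "(no id)";
      throw new Error(`replay failed for ${id}: ${res.status}`);
    }
  }
}
```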
Perhaps it is a good idea for servers that generate ActivityPub objects to explicitly indicate that via the AS2 generator property.
```json
{
  "@context": "https://www.w3.org/ns/activitystreams",
  "summary": "A simple note",
  "type": "Note",
  "content": "This is all there is.",
  "generator": {
    "type": ["Application", "MastodonServer"],
    "id": "https://mastodon.social"
  }
}
```
Or, for an ActivityPub Outbox server that generates an object based on a client-generated Object submitted to the ActivityPub Outbox:
```json
{
  "@context": "https://www.w3.org/ns/activitystreams",
  "summary": "A simple note",
  "type": "Note",
  "content": "This is all there is.",
  "generator": {
    "type": ["Application", "ActivityPubOutboxPostProcess"],
    "id": "https://example.com/outbox",
    "input": "https://example.com/link-to-unmodified-client-generated-object"
  }
}
```
When followed, the link to the input of the ActivityPubOutboxPostProcess would be the exact request body of the HTTP POST request sent by the client to the outbox server, and would include any DataIntegrityProof created by the client.
In information security, message authentication or data origin authentication is a property that a message has not been modified while in transit (data integrity) and that the receiving party can verify the source of the message.
―https://en.wikipedia.org/wiki/Message_authentication
Data Integrity Proofs provide a means to ensure the integrity of data, i.e. that it wasn't tampered with after the creation of the proof. But what if a malicious server serving the integrity proof is really motivated to modify the data, and also generates a new integrity proof / checksum after modifying the data from what the end-user and their client intended?
Detecting this requires authenticating the generator not just of the ActivityPub data but of the integrity proof itself. We don't want an integrity proof from just anyone, and maybe not even only from an ActivityPub actor identified by a provider-dependent identifier. We want a proof we believe was generated by the end-user that created the object (e.g. their ActivityPub client and/or the device it ran on) and not just by the server they happen to use. This protects against a malicious server, but perhaps more importantly, it also aids in data portability, because even when ActivityPub data is ported between servers, the original proof of authenticity can still be used.
Whether it's in online server-to-server federation or the more rare (but very real!) case where an end-user wants to port their social presence from one ActivityPub server to another, a big challenge to data portability and the related problem of data integrity verification is the fact that today on the fediverse authenticity is frequently rooted in an ActivityPub Actor Server.
When a request comes in to an Actor's inbox to federate some message, the inbox server may want to authenticate the party sending the request (e.g. to determine if that party is authorized to send something). While this isn't required by ActivityPub, it is an option, and (iiuc) it is required by Mastodon, as described in their Security docs on HTTP Signatures.
Mastodon requires the use of HTTP Signatures in order to validate that any activity received was authored by the actor generating it.
But wait, in general in ActivityPub, your Outbox server is not necessarily the same as the Actor server that serves an Actor's representation over HTTPS. There's not even a requirement in ActivityPub that an Actor's URI be served over HTTPS vs another URI scheme. Even for https: actor URIs, it should be considered a bad practice to let your ActivityPub Server have the private key associated with the end-user.
It does make sense to me that an ActivityPub server would want to require that another server (acting as an HTTP Client) add authentication to its request to prove the server is not on a block list, but I think Mastodon is conflating things here by requiring that an inbox request include actor authentication for server to server federation. Actor authentication can be provided by client-generated Data Integrity Proofs.
The more concerning part of what Mastodon has normalized is that the Mastodon Server generates the keypairs, not the end-user that is the true entity the social objects are attributedTo. If I understand right, Mastodon doesn't even allow end-users to download these private keys or configure the public keys associated with an actor. I can empathize with how Mastodon got here (since it was largely created before any of ActivityPub or these authenticity standards existed), but 5 years after ActivityPub was standardized, this is a huge challenge to end-user security and data portability, and we have to do better, even if it means moving away from Mastodon.
ActivityPub Actor servers should really let an end-user that the Actor represents bring their own verification methods, and only share the public keys with the ActivityPub Server. The private keys then can remain on the clients, ideally in a non-exportable way. An ActivityPub outbox only needs to verify that client-generated objects are authorized to add to the outbox, not create new signatures on behalf of an actor. There's no reason an ActivityPub server should demand to control the end-user's private keys.
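As a sketch of what bring-your-own verification methods could look like on an actor document: the property names below follow Data Integrity / Multikey conventions, but treat the exact vocabulary, context URLs, and key values as illustrative assumptions.

```json
{
  "@context": [
    "https://www.w3.org/ns/activitystreams",
    "https://w3id.org/security/multikey/v1"
  ],
  "type": "Person",
  "id": "https://social.example/actors/alice",
  "inbox": "https://social.example/actors/alice/inbox",
  "outbox": "https://social.example/actors/alice/outbox",
  "assertionMethod": [
    {
      "id": "https://social.example/actors/alice#client-key-1",
      "type": "Multikey",
      "controller": "https://social.example/actors/alice",
      "publicKeyMultibase": "z6Mk...(public key generated and held on Alice's client)"
    }
  ]
}
```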
Mastodon's popularity as an "ActivityPub Server" has led to a bunch of confusion about the protocol itself, in part because Mastodon and other implementations inspired by it conflate at least three components that the original ActivityPub editor (Christine) meant to keep separate: Identity, Authentication, and data Serving.
Mastodon puts all of these under one roof, and in many ways that is awesome because it makes things easy to use, but it also stands in the way of what's best for end-users, as opposed to what's best for instance operators and/or the organization that makes Mastodon.
Let me define these terms: Identity is the name (e.g. the IRI from which an actor's id is dereferenced) that refers to the end-user across systems; Authentication is the mechanism (e.g. keys) used to verify that content really came from that identity; and Serving is the system that stores and serves the identity's social data. In Mastodon, all of these are the same system :/ I didn't really anticipate this. I thought things might be more like this: my Identity could live at a name I control, like bengo.is, which could redirect (with the Accept header that is required by ActivityPub clients) to social.bengo.is/actor. The response after the redirect would include public keys for the Authentication mechanism.

I want to see more 'unbundling' of these monolithic ActivityPub servers. Unbundling authentication and moving it to the client will help with data portability via DataIntegrityProofs that are useful even when an end-user moves from server to server.
One great thing that shakes out of unbundling Identity from the Social Server is something I really think we would have had by now: Single Sign-On. I think a lot of people want 'Account Portability' because what they really want is Single Sign-On. When they first signed up for Mastodon, there was no way to bring your own identity. And when they want to try out another provider, even another Mastodon deployment, there's no way to use their first fediverse identity to log into their second and third ActivityPub communities.
Something that shakes out of unbundling Authentication from Social Servers and even Actor Servers (e.g. using cryptographic authentication and not an actor-server-dependent authentication scheme) is the ability to fully author signed social content without an internet connection, a gateway into Local First computing. Local-first authentication also allows us to move toward removing the 1:1 coupling between a verifiable social object and any particular ActivityPub Outbox Server. A signed ActivityPub Object could be sent in parallel to several Outboxes, similar to what nostr calls relays.
This was meant to illustrate some of the very real challenges inherent to ActivityPub data portability both in the protocol and in the common applications that the vast majority of end-users use and want to port to/from.
The good news is that from the beginning, ActivityPub was designed to make data portability easier than some popular implementations have made it. We will need to change those implementations or choose to use new ones, but we don't necessarily need to change the whole protocol.
Here is a loose sketch of a path forward to develop ActivityPub services that new users adopting the fediverse can use on day 1 to have a better data portability experience.
[^1]: For more history, watch my 2022 dwebcamp.org talk about the 15th Anniversary of Web 3.0 Social