There is an ongoing discussion about introducing a High Performance Masternode (HPMN) type with increased hardware requirements, with the goal of reducing Platform data fees. Three options are proposed by Dash Core Group in the Introductory presentation on High Performance Masternodes.
Here, I propose another option that seamlessly distributes platform storage to a subset of nodes, enabling low fees without requiring changes to the masternode network. Also, the level of decentralization can (but does not have to) be selected per data contract.
I am looking for feedback from a platform developer or a technically savvy person on the feasibility of the proposed implementation. I have not studied the Platform code yet, so there may be technical difficulties that I am not aware of. I propose the solution based on high-level assumptions. If any of the explicit or implicit assumptions do not hold, the proposed solution may need to be refined or discarded.
Proposed solution
The key idea is not to store platform data on all nodes or use HPMN, but instead to distribute it randomly and deterministically to a subset of nodes from a deterministic masternode list (*).
(*) Some refer to this type of solution as "sharding", but I will avoid that term. The term is somewhat misleading and overly simplistic because in this case only the storage is sharded, the shards are distributed (the same masternode typically participates in multiple shards), the shards are not defined at the platform/network level, etc.
The subset of nodes that store platform data varies for each data contract. Each node in the subset contains all data of the respective data contract. The size of the subset, i.e. the number of redundant nodes, is defined by the owner of the contract.
How many redundant nodes?
The data contract (DC) owner chooses the number of redundant nodes, DC_N, when creating the data contract, based on the desired level of decentralization. The owner knows best what the decentralization needs of the application are. If the need is low, the owner can reduce the fees for users and itself by choosing a low DC_N (e.g. DC_N=10). If it is a critical application that needs to be uncensorable even by governments, the owner can choose a high DC_N (e.g. DC_N=1000, or even DC_N=MaxInt, i.e. all nodes in the list).
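Purely as an illustration of where DC_N could live (the field names and shape below are hypothetical, not the actual Platform data contract schema), the redundancy target would simply be one more value the owner sets when registering the contract:

# Hypothetical contract metadata; the real Platform schema will differ.
data_contract = {
    "id": "<data contract id>",   # content-addressed id, DC_ID
    "redundantNodes": 1000,       # DC_N, chosen by the owner at creation time
    "documents": {},              # document schemas, unchanged from today
}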
Which nodes? How is the subset of masternodes that holds the application data selected, and how is it tracked?
The nodes used for data storage are randomly and deterministically derived from the hash of the data contract. The number of redundant nodes can be obtained from the metadata of the data contract.
In Dash the list of masternodes is deterministic and their IP addresses are public. An example (in Python for simplicity) that randomly but deterministically builds the subset of masternodes for a data contract:
Inputs: data contract id (DC_ID), number of redundant nodes (DC_N)
Python:
import hashlib, random

random.seed(hashlib.sha256(DC_ID).digest())  # seed the PRNG with the hash of DC_ID
subset = random.sample(deterministic_masternode_list, DC_N)  # draw DC_N distinct masternodes
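The same logic can be wrapped in a small helper so that any node (or even a client) can recompute the subset independently. A minimal sketch (select_nodes is a name assumed here for illustration, not an existing Platform function) that also covers the DC_N=MaxInt case by capping at the list length:

import hashlib
import random

def select_nodes(dc_id, dc_n, masternode_list):
    # Deterministic PRNG seeded with the hash of the data contract id,
    # so every caller draws exactly the same subset from the same list.
    prng = random.Random(hashlib.sha256(dc_id).digest())
    return prng.sample(masternode_list, min(dc_n, len(masternode_list)))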
There is no need to index or track which nodes store data for which data contract.
No changes are made to the API with the proposed solution. All requests (queries) can be made to any enabled masternode, as before (even if it is not included in the subset for that application, i.e. does not hold that application's data). The additional step of finding a node that actually has the data is performed by the node that receives the query (you can think of this solution as a storage virtualization layer introduced in the middle to make access to stored data seamless and transparent). For the user, the result is as if the data were available on the node itself.
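A minimal sketch of that lookup layer, reusing select_nodes from the sketch above; handle_query, run_query and forward_query are hypothetical names used only for illustration, not existing Platform APIs:

import random

def handle_query(self_node, dc_id, dc_n, query, masternode_list, local_store):
    # Any enabled masternode can receive the query; it first recomputes the subset.
    subset = select_nodes(dc_id, dc_n, masternode_list)
    if self_node in subset and dc_id in local_store:
        return run_query(local_store[dc_id], query)  # serve from local storage
    target = random.choice(subset)                   # otherwise pick a node that should hold the data
    return forward_query(target, dc_id, query)       # relay the result back to the caller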
The list of nodes for a data contract can be cached to speed up node selection.
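Because the subset depends only on the contract id, DC_N and the current masternode list, memoizing it is straightforward; for example (a sketch, assuming the list is passed as a hashable tuple):

from functools import lru_cache

@lru_cache(maxsize=4096)
def cached_subset(dc_id, dc_n, masternode_list):
    # masternode_list must be a tuple so the arguments are hashable;
    # passing the latest deterministic list naturally produces fresh entries when it changes.
    return tuple(select_nodes(dc_id, dc_n, list(masternode_list)))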
(Masternode churn) What if a node storing DC data becomes unavailable, e.g. it is shut down, leaves the masternode network because the owner sold the collateral, etc.?
Do nothing. The churn rate for masternodes is low. When a node leaves, another one takes its place, but that node does not initially contain any data, even if it is designated to store some. The key takeaway is that the new node does not need to synchronize platform data right away. It starts empty. When a request for that data arrives (the new node can still be selected), it cannot serve the data itself, so another node must be selected from the list.
Similar to the copy-on-write technique and to caching, the node that receives a request and cannot respond will select another node from the list of (redundant) nodes for that data contract, get the data, update its local storage, and respond to the request. The next time, the node will be ready. All of this is transparent to the user making the request.
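A sketch of that fallback path, under the same hypothetical helpers as above (fetch_from stands in for whatever inter-node transfer mechanism would be used):

import random

def answer_or_repair(self_node, dc_id, dc_n, query, masternode_list, local_store):
    subset = select_nodes(dc_id, dc_n, masternode_list)
    if dc_id not in local_store:
        # This node is in the subset but started empty (e.g. it replaced a churned node):
        # pull the contract's data from another redundant node and keep a local copy.
        others = [m for m in subset if m != self_node]
        local_store[dc_id] = fetch_from(random.choice(others), dc_id)
    return run_query(local_store[dc_id], query)  # from now on the node can answer directly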
This affects performance slightly, but is a rare occurrence given the low churn rate of masternodes and typically high data redundancy.
Recap
Two main concepts determine the implementation:
- (1) distribution of application data to a randomly selected subset of nodes
- (2) shifting the responsibility for selecting the targeted number of redundant nodes (degree of decentralization) to the data contract owner
Advantages of the proposed solution:
- less need to increase the storage requirements for masternodes (the load is distributed)
- low fees (again, thanks to the distribution of storage costs)
- no changes to the masternode network
- maximum flexibility in decentralization (L1 decentralization is preserved, while L2 applications can set the desired level of decentralization as they wish)
- seamless for users (user interaction with the platform is unchanged, developer interfaces are unchanged, performance loss is likely to be small in practice)