PoSe banned Evo node "is not a Regular"?

Theia

New member
I got an Evo node PoSe banned by a hard restart of the VPS. This node had received payouts before, so I know the setup, ProRegTx IDs, BLS keys etc. were correct. When it relaunched, one of the databases was corrupt (can't this be prevented?):

2024-05-14T08:51:16Z cl-schdlr thread start
2024-05-14T08:51:16Z Fatal LevelDB error: Corruption: checksum mismatch: /home/theia/.dashcore/llmq/isdb/000406.log
2024-05-14T08:51:16Z You can use -debug=leveldb to get more complete diagnostic messages
2024-05-14T08:51:16Z Fatal LevelDB error: Corruption: checksum mismatch: /home/theia/.dashcore/llmq/isdb/000406.log
2024-05-14T08:51:16Z : Error opening block database.
Please restart with -reindex or -reindex-chainstate to recover.

I had set up the node according to this guide: https://www.dash.org/forum/index.ph...e-setup-with-systemd-auto-re-start-rfc.39460/ so instead of the usual sudo systemctl start dashd, I su'ed into the dash user and called dashd manually:
/opt/dash/bin/dashd -reindex
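
(Roughly, what I did, generalising a bit and assuming the systemd unit from the guide is called dashd, was:)

Code:
sudo systemctl stop dashd        # make sure systemd isn't trying to (re)start it in parallel
sudo su - dash                   # switch to the dash service user
/opt/dash/bin/dashd -reindex     # rebuild the block database
# ...wait for the reindex to finish, then reboot and let systemd take over again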

When it was done, I rebooted the server again, this time from the command line, and let the systemd setup auto-relaunch dashd. The node appears to be fully synced:

dash-cli mnsync status
{
"AssetID": 999,
"AssetName": "MASTERNODE_SYNC_FINISHED",
"AssetStartTime": 1715691591,
"Attempt": 0,
"IsBlockchainSynced": true,
"IsSynced": true
}

The log shows tens of thousands of lines like:
2024-05-14T14:13:54Z ThreadSocketHandler -- removing node: peer=42303 nRefCount=1 fInbound=1 m_masternode_connection=1 m_masternode_iqr_connection=0
I wasn't sure if this is part of a clean-up after the reindex, but the peer=x number just keeps counting up seemingly endlessly. Is this normal?

Anyway, I tried to unban the node and constructed the protx update_service command in Dash-QT (the desktop wallet) according to https://docs.dash.org/en/stable/docs/user/masternodes/maintenance.html#proupservtx and double- and triple-checked that all parameters are correct, including the empty one ("") for operatorPayoutAddress. The response is:

masternode with proTxHash [hash] is not a Regular (code -1)

What does that mean? "Not a Regular" as in "it's an Evo node"? Does that require a different protx update_service? What can I do?
 
Yes! Evo nodes use a different update transaction:

Code:
protx update_service_evo "proTxHash" "ipAndPort" "operatorKey" "platformNodeID" platformP2PPort platformHTTPPort ( "operatorPayoutAddress" "feeSourceAddress" )

Creates and sends a ProUpServTx to the network. This will update the IP address and the Platform fields
of an EvoNode.
If this is done for an EvoNode that got PoSe-banned, the ProUpServTx will also revive this EvoNode.

Requires wallet passphrase to be set with walletpassphrase call if wallet is encrypted.

Arguments:
1. proTxHash                (string, required) The hash of the initial ProRegTx.
2. ipAndPort                (string, required) IP and port in the form "IP:PORT". Must be unique on the network.
3. operatorKey              (string, required) The operator BLS private key associated with the
                            registered operator public key.
4. platformNodeID           (string, required) Platform P2P node ID, derived from P2P public key.
5. platformP2PPort          (numeric, required) TCP port of Dash Platform peer-to-peer communication between nodes (network byte order).
6. platformHTTPPort         (numeric, required) TCP port of Platform HTTP/API interface (network byte order).
7. operatorPayoutAddress    (string, optional, default=) The address used for operator reward payments.
                            Only allowed when the ProRegTx had a non-zero operatorReward value.
                            If set to an empty string, the currently active payout address is reused.
8. feeSourceAddress         (string, optional, default=) If specified wallet will only use coins from this address to fund ProTx.
                            If not specified, payoutAddress is the one that is going to be used.
                            The private key belonging to this address must be known in your wallet.
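
For illustration only, a filled-in call might look like the line below (runnable from dash-cli or the Dash-QT debug console). Every value is a placeholder to be replaced with your node's own data; the Platform ports shown are just commonly used mainnet defaults, not values from this thread.

Code:
protx update_service_evo "your_proTxHash" "203.0.113.5:9999" "your_operator_BLS_secret" "your_platform_node_id" 26656 443 "" "your_fee_source_address"

Passing "" for operatorPayoutAddress keeps the currently active payout address, as described in the help text above.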
 
The log shows tens of thousands of lines like:
2024-05-14T14:13:54Z ThreadSocketHandler -- removing node: peer=42303 nRefCount=1 fInbound=1 m_masternode_connection=1 m_masternode_iqr_connection=0
I wasn't sure if this is part of a clean-up after the reindex, but the peer=x number just keeps counting up seemingly endlessly. Is this normal?
I think this was more of an issue on some previous release versions. Recent releases don't seem to do this as much (at least not counting into the thousands like you're showing). When it does count up into the many thousands, it can start affecting dashd's resource usage (which risks PoSe scores). These peer removals are not saved permanently; they are in memory only for the currently running instance. A simple restart every once in a while keeps the memory usage low. Also, make sure you've updated to the latest version of Dash.

Another thing: after finishing a full reindex/resync you should stop and restart dashd to reclaim some extra memory. The goal is for it to have as many resources available as possible when it is randomly selected for quorum duties.
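
If you want to automate the occasional restart, a rough sketch using cron, assuming the node runs under a systemd unit named dashd (adjust to your setup):

Code:
# /etc/cron.d/dashd-restart -- restart dashd at 04:00 on the 1st of each month
0 4 1 * * root /usr/bin/systemctl restart dashd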
 
@Theia

Currently Evonodes have just L1 access and basically operate as normal masternodes. Once Platform activates, Evonodes will need L2 (Platform) access as well, at which point you may find that this is not supported by the 'System wide Masternode Setup with Systemd auto (re)start RFC' method.

I hope you have thought of this and made plans to either use Dashmate (which uses Docker) or Masternode Zeus (which does not use Docker) once Platform activates on mainnet and the new payment scheme (masternode rewards & Platform credit rewards) takes effect for your Evonode.

Or be really, really sure that the 'System wide Masternode Setup with Systemd auto (re)start RFC' method will actually fully support Evonodes with both L1 & L2 access.
 
Yes! Evo nodes use a different update transaction:

Code:
protx update_service_evo "proTxHash" "ipAndPort" "operatorKey" "platformNodeID" platformP2PPort platformHTTPPort ( "operatorPayoutAddress" "feeSourceAddress" )

Thanks! The documentation doesn't explain this:
[Screenshot: the protx documentation page, which does not list update_service_evo]


Is the documentation a community effort? Can I help update it somehow?
 
You can do a pull request on the site. I agree that there does seem to be a deficit when it comes to the Evonodes.

 
I think this was more of an issue on some previous release versions. Recent releases don't seem to do this as much (at least not counting into the thousands like you're showing). When it does count up into the many thousands, it can start affecting dashd's resource usage (which risks PoSe scores). These peer removals are not saved permanently; they are in memory only for the currently running instance. A simple restart every once in a while keeps the memory usage low. Also, make sure you've updated to the latest version of Dash.
It's at over 90,000 now:

2024-05-15T09:24:43Z ThreadSocketHandler -- removing node: peer=90118 nRefCount=1 fInbound=1 m_masternode_connection=1 m_masternode_iqr_connection=0

My version is 20.1.0. I see there is a 20.1.1 point release where the release notes say:

Work Queue RPC Fix / Deadlock Fix
A deadlock caused nodes to become non-responsive and RPC to report "Work depth queue exceeded". Thanks to Konstantin Akimov (knst) who discovered the cause. This previously caused masternodes to become PoSe banned.


Is that this issue?

I've had a regular MN crash every 1-2 weeks in the past if not manually rebooted in time. This happened across several major releases. Maybe it's related? Never noticed this wall of ThreadSocketHandler log messages before though.

The simple restart is OK, as long as it doesn't end up corrupting the database and forcing me to reindex, which would take long enough to get me banned. That's why I was asking at the beginning whether this can be prevented. I thought Dash would be resilient against reboots.
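
(Side question, since I'd like to avoid this in the future: for normal reboots, would giving the systemd unit a longer stop timeout help, so dashd gets time to flush its databases before being killed? A hard power cycle obviously can't be protected against. Something like the drop-in below; the unit name dashd and the path are just my guess:)

Code:
# /etc/systemd/system/dashd.service.d/override.conf
[Service]
TimeoutStopSec=300

followed by sudo systemctl daemon-reload.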
 
It's at over 90,000 now:

2024-05-15T09:24:43Z ThreadSocketHandler -- removing node: peer=90118 nRefCount=1 fInbound=1 m_masternode_connection=1 m_masternode_iqr_connection=0

My version is 20.1.0. I see there is a 20.1.1 point release where the release notes say:

Work Queue RPC Fix / Deadlock Fix
A deadlock caused nodes to become non-responsive and RPC to report "Work depth queue exceeded". Thanks to Konstantin Akimov (knst) who discovered the cause. This previously caused masternodes to become PoSe banned.


Is that this issue?

I've had a regular MN crash every 1-2 weeks in the past if not manually rebooted in time. This happened across several major releases. Maybe it's related? Never noticed this wall of ThreadSocketHandler log messages before though.


This is a complete red herring. The message you are seeing is completely fine.

 
It's at over 90,000 now:

2024-05-15T09:24:43Z ThreadSocketHandler -- removing node: peer=90118 nRefCount=1 fInbound=1 m_masternode_connection=1 m_masternode_iqr_connection=0

My version is 20.1.0. I see there is a 20.1.1 point release where the release notes say:

Work Queue RPC Fix / Deadlock Fix
A deadlock caused nodes to become non-responsive and RPC to report "Work depth queue exceeded". Thanks to Konstantin Akimov (knst) who discovered the cause. This previously caused masternodes to become PoSe banned.


Is that this issue?

I've had a regular MN crash every 1-2 weeks in the past if not manually rebooted in time. This happened across several major releases. Maybe it's related? Never noticed this wall of ThreadSocketHandler log messages before though.

The simple restart is OK, as long as it doesn't end up corrupting the database and forcing me to reindex, which would take long enough to get me banned. That's why I was asking at the beginning whether this can be prevented. I thought Dash would be resilient against reboots.
I have noticed greater stability (fewer PoSe scores for no reason) with 20.1.1 compared to 20.1.0, so you should definitely upgrade. The correlation with a high removed-peer count could just be coincidental, but from my observation a randomly PoSe-banned node would also have a high count and high memory usage compared to a "healthy" node on the same system specs.

With regard to corruption, I think overall resiliency is good. In my experience, the worst that happens is you might have to delete a corrupt sporks.dat or settings.json before the wallet will start again after a system crash or after running out of disk space.
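
For the record, that recovery path is only a few commands; roughly the sketch below, assuming the systemd setup from earlier in this thread and a default ~/.dashcore datadir. Only delete the file the log actually reports as corrupt; it gets recreated on startup.

Code:
sudo systemctl stop dashd
rm ~/.dashcore/sporks.dat     # or ~/.dashcore/settings.json, whichever is reported corrupt
sudo systemctl start dashd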
 
Update: It still happens with 20.1.1. Dashd seems to eventually eat up 3x its "normal" memory usage. Only fix I've found is to periodically restart. Once a month or every few weeks should be sufficient.
 
Update: It still happens with 20.1.1. Dashd seems to eventually eat up 3x its "normal" memory usage. Only fix I've found is to periodically restart. Once a month or every few weeks should be sufficient.

How much RAM usage are you seeing? My server is still seeing the same 11GB per instance.


 
Update: It still happens with 20.1.1. Dashd seems to eventually eat up 3x its "normal" memory usage. Only fix I've found is to periodically restart. Once a month or every few weeks should be sufficient.
My Evonode just got PoSe banned and its core.log is also just full of lines like these:

2024-05-24T03:04:31Z ThreadSocketHandler -- removing node: peer=109248 nRefCount=1 fInbound=1 m_masternode_connection=1 m_masternode_iqr_connection=0
2024-05-24T03:04:31Z ThreadSocketHandler -- removing node: peer=109231 nRefCount=1 fInbound=1 m_masternode_connection=1 m_masternode_iqr_connection=0
2024-05-24T03:04:31Z ThreadSocketHandler -- removing node: peer=109251 nRefCount=1 fInbound=1 m_masternode_connection=1 m_masternode_iqr_connection=0
2024-05-24T03:04:31Z ThreadSocketHandler -- removing node: peer=109247 nRefCount=1 fInbound=1 m_masternode_connection=1 m_masternode_iqr_connection=0
2024-05-24T03:04:31Z ThreadSocketHandler -- removing node: peer=109253 nRefCount=1 fInbound=1 m_masternode_connection=1 m_masternode_iqr_connection=0
2024-05-24T03:04:31Z ThreadSocketHandler -- removing node: peer=109252 nRefCount=1 fInbound=1 m_masternode_connection=1 m_masternode_iqr_connection=0
2024-05-24T03:04:32Z ThreadSocketHandler -- removing node: peer=109256 nRefCount=2 fInbound=1 m_masternode_connection=1 m_masternode_iqr_connection=0
2024-05-24T03:04:32Z New outbound peer connected: version: 70231, blocks=2076323, peer=109301 (full-relay)
2024-05-24T03:04:32Z ThreadSocketHandler -- removing node: peer=109119 nRefCount=1 fInbound=0 m_masternode_connection=1 m_masternode_iqr_connection=1
2024-05-24T03:04:32Z New outbound peer connected: version: 70230, blocks=2076323, peer=109299 (full-relay)

Stats from my Evonode from a few days ago (when it was operating well): 4.28 GB memory usage on the host system (ps -opid,vsz,cmd -C dashd) and 3.89 GB memory usage under Docker (docker stats --no-stream).

[Screenshot: periodic memory-usage log collected with docker stats]

Note: I stopped that logging two days ago, because I worried about the relatively high CPU usage that docker stats --no-stream causes on one of my server's CPU cores (it has 8 cores, but I am still a bit worried about it).

So I do not really see an issue with memory; I see more of an issue with Evonodes getting instantly PoSe banned a lot more often with v20.1.1 (at least one specific Evonode of mine is):

2024-05-24T01:03:58Z penalty 0->2507 (max=3799)
2024-05-24T01:04:01Z 2482->3799 (max=3799)
2024-05-24T01:04:01Z Pose banned

This is the second time that Evonode got instantly PoSe banned like this since updating to v20.1.1 (on the 19th of April this Evonode was instantly PoSe banned as well). I never got a PoSe ban with v20.1.0, but after updating to v20.1.1 I am getting far more frequent instant PoSe bans with this Evonode, after which the debug.log / core.log just gets littered with the lines shown above (I suspect that is normal behaviour after a PoSe ban).

The instant PoSe bans that have hit my Evonode so far (leading to 8 missed payments) seem, at first glance, related to failed quorum participation (high PoSe scores). With the latest PoSe ban, the Evonode was in sync with the latest block and had no problem resyncing after a restart (I just needed to clear the PoSe ban).

The sooner Evonodes get access to L2 and receive rewards/credits from there too, the better. This PoSe penalty scoring is just weighing too heavily on Evonodes right now, which run the risk of missing 4 payments at once because of it.

Update: I checked the logs of my VPS and it turns out it had a network traffic disruption for two hours or more last night, so that is most likely the reason behind the latest PoSe ban. I am checking with my VPS provider whether there was an outage on their network last night or whether they were doing maintenance around that time (maintenance normally does not affect my VPS). It turns out my VPS provider actually did some maintenance on their network yesterday that caused downtime for my VPS.
 