Command times out but responds anyway

Discussion in 'C-Bus Toolkit and C-Gate Software' started by more-solutions, Jan 26, 2022.

  1. more-solutions

    more-solutions

    Joined:
    Apr 23, 2006
    Messages:
    278
    Likes Received:
    4
    Location:
    Peterborough, UK
    I'm hoping someone can explain these lines from my logs. The wrapping is mine; otherwise they are lifted directly from the logs, unedited.

    A command is issued, C-Gate sends it to the network, times out awaiting a response, and then returns a response as if everything had worked normally?

    20220126-151645 761 cmd149 - Command: ID 254/p/99 type
    20220126-151645 735 //MSL_6C/254 f290be50-60b4-103a-befc-c47c3ff2f3b3
    cc088 sent cmd (fastpci): \4663002101u (254: \4663002101u)
    20220126-151645 765 //MSL_6C/254 f290be50-60b4-103a-befc-c47c3ff2f3b3
    got packet confirm: u.
    20220126-151650 759 //MSL_6C/254 f290be50-60b4-103a-befc-c47c3ff2f3b3
    command timed out: \4663002101u after 5500ms, retry limit of 0 reached
    20220126-151650 759 //MSL_6C/254 f290be50-60b4-103a-befc-c47c3ff2f3b3
    command-fail-count=1 unit=99 reason=no response (\4663002101)
    20220126-151650 766 cmd149 - Response: 300 p/254/99 type=RELDN4
     
    more-solutions, Jan 26, 2022
    #1

  2. ashleigh

    ashleigh Moderator

    Joined:
    Aug 4, 2004
    Messages:
    2,371
    Likes Received:
    14
    Location:
    Adelaide, South Australia
    The "u" that was returned is the PCI or CNI confirming that it received the packet. It will then push the packet out onto the bus in its own time.

    It is worth decoding the command 4663002101 to see what it means.

    46: This is a point-to-point command, i.e. a command addressed to a specific unit.

    6300: The unit address is 63 (hex), with no further routing, so it is addressed to a device on the locally connected network.

    2101: This is the command being sent to unit 63 (hex). From what I remember from a long time ago, this asks the device whether it is there, and the device should reply.

    FWIW, the address 0x63 corresponds to 99 decimal.
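    The field breakdown above can be sketched as a small decoder for the raw string in the log. This is a sketch, not the C-Bus specification: it assumes the layout exactly as described in this post (header, unit, routing byte, then the command bytes) and treats a trailing letter g-z as the confirmation tag the PCI/CNI echoes back ("got packet confirm: u."). The function name decode_pci_command is hypothetical.

```python
def decode_pci_command(raw: str) -> dict:
    """Split a raw PCI command such as '\\4663002101u' into the fields
    described in the post. Field layout is assumed, not taken from a spec."""
    body = raw.lstrip("\\")
    confirm = None
    # Assumption: a trailing letter g-z (outside the hex digits a-f) is the
    # confirmation tag the PCI/CNI echoes back, e.g. "got packet confirm: u."
    if body and "g" <= body[-1] <= "z":
        confirm, body = body[-1], body[:-1]
    data = bytes.fromhex(body)
    if len(data) < 3:
        raise ValueError(f"too short to decode: {raw!r}")
    header, unit, routing = data[0], data[1], data[2]
    return {
        "point_to_point": header == 0x46,  # per the post: 46 = point-to-point
        "unit": unit,                      # 0x63 == 6 * 16 + 3 == 99 decimal
        "local_network": routing == 0x00,  # 00 = no further routing
        "command": data[3:].hex(),         # "2101" = "are you there?", per the post
        "confirm": confirm,
    }


print(decode_pci_command(r"\4663002101u"))
```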

    What this means is that you are seeing normal behaviour when testing a device that is not there. Later in the log you see the timeout and the report that there was no response. In other words, it's a "ping" of the device at address 99, which is not there.
     
    ashleigh, Jan 28, 2022
    #2

  3. more-solutions

    more-solutions

    Joined:
    Apr 23, 2006
    Messages:
    278
    Likes Received:
    4
    Location:
    Peterborough, UK
    It's the final line I don't understand:
    20220126-151650 766 cmd149 - Response: 300 p/254/99 type=RELDN4

    Immediately after the timeout (so ~5 s after the ID command), I get a response to the ID command which suggests that the command succeeded (and gives an answer to the ID request), even though the command failed. The RELDN4 value is correct (or would be, had the device responded), so I assume it is a cached value, but it doesn't seem to make sense to send a cached value after the timeout.
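    The "~5 s" gap can be checked directly from the log timestamps. A minimal sketch, assuming the prefix is always "YYYYMMDD-HHMMSS" followed by a space and three millisecond digits (inferred from the lines quoted above, not from C-Gate documentation); parse_cgate_timestamp is a hypothetical helper name.

```python
from datetime import datetime


def parse_cgate_timestamp(stamp: str) -> datetime:
    """Parse the 'YYYYMMDD-HHMMSS mmm' prefix seen in the log lines above.
    The format is inferred from this thread's log excerpt."""
    date_time, millis = stamp.split()
    base = datetime.strptime(date_time, "%Y%m%d-%H%M%S")
    return base.replace(microsecond=int(millis) * 1000)


sent = parse_cgate_timestamp("20220126-151645 761")      # cmd149 - Command
answered = parse_cgate_timestamp("20220126-151650 766")  # cmd149 - Response
gap = (answered - sent).total_seconds()
print(f"response arrived {gap:.3f} s after the command")
```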

    What I see at the application level is:

    15:16:45 << ID 254/p/99 type
    15:16:50 >> 300 p/254/99 type=RELDN4

    i.e. the "ping" responds despite having failed, albeit about 5 s later than usual. In fact it arrives after my own code has already timed out, which is how I discovered it: I then saw this response wrongly taken as the answer to a ping of a different unit, which has a different type.
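    One way to avoid the wrong-unit mix-up described above is to check the address in each reply against the request it is supposed to answer, rather than assuming replies arrive in order. A sketch under one assumption: that a successful ID reply always has the shape of the single example in this thread, "300 p/254/99 type=RELDN4". The regex and the accept_reply name are hypothetical.

```python
import re

# Assumed reply shape, based only on the one example in this thread:
#   "300 p/254/99 type=RELDN4"  ->  network 254, unit 99, key "type", value "RELDN4"
REPLY_RE = re.compile(r"^300 p/(\d+)/(\d+) (\w+)=(\S+)$")


def accept_reply(line: str, network: int, unit: int):
    """Return (key, value) only if the reply is for the unit we asked about.
    A late reply to an earlier request, or an unparseable line, gives None."""
    m = REPLY_RE.match(line.strip())
    if not m:
        return None
    if (int(m.group(1)), int(m.group(2))) != (network, unit):
        return None  # stale answer to a previous ID command
    return m.group(3), m.group(4)
```

    This only stops a late reply being credited to the wrong unit. It cannot tell whether the value came from the device or from a cache, which is the concern raised above.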
     
    more-solutions, Jan 29, 2022
    #3
  4. ashleigh

    ashleigh Moderator

    Joined:
    Aug 4, 2004
    Messages:
    2,371
    Likes Received:
    14
    Location:
    Adelaide, South Australia
    Ouch. Good point. It's quite possibly cached. You could try shutting down C-Gate and restarting it, then see what happens.

    The other thing to do is to monitor the line separately, using the monitoring software and a separate interface unit. Instead of relying on what C-Gate tells you, see what's going over the wire. That's the ultimate source of truth.
     
    ashleigh, Feb 2, 2022
    #4
  5. more-solutions

    more-solutions

    Joined:
    Apr 23, 2006
    Messages:
    278
    Likes Received:
    4
    Location:
    Peterborough, UK
    The trouble is that this unit seems a little intermittent, so the chances of it failing immediately after a C-Gate restart, letting me catch the uncached state, are pretty slim. In any case, while that might explain it, it doesn't help me fix it :)

    What I'm trying to do is a basic replication of what the diagnostics tool does, but via C-Gate, so it can be run on a system that has active C-Gate connections to the CNI without having to close those connections first. However, if pinging a unit reports a healthy state even when the unit has gone AWOL, it's not going to work. (I could monitor the events to pick that up, but it all starts to get way more complicated!)

    I'm using ID since that is essentially what the diagnostics tool uses. Suggestions for a better command are welcome! (Presumably I could use NET PINGU? I assumed there was a reason the diag tool didn't go that route.)
     
    more-solutions, Feb 2, 2022
    #5
  6. ashleigh

    ashleigh Moderator

    Joined:
    Aug 4, 2004
    Messages:
    2,371
    Likes Received:
    14
    Location:
    Adelaide, South Australia
    The diagnostic program hangs off the PCI because it's about as close as you can get to the actual bus.

    C-Gate has many layers of magic in it; it is essentially too far removed from the bus to perform the low-level monitoring functions.
     
    ashleigh, Feb 12, 2022
    #6
  7. more-solutions

    more-solutions

    Joined:
    Apr 23, 2006
    Messages:
    278
    Likes Received:
    4
    Location:
    Peterborough, UK
    And there was me thinking it's just because it's as old as the hills! :)

    Understood. But its downside is that it requires taking the network down, so I'm looking for some kind of compromise, i.e. the best I can get without affecting the network (which means not closing it in C-Gate, but also not overloading the network while testing). More of a background health check, I guess.

    I had considered going direct to the CNI and effectively reproducing what the diag tool does, but in a way that can be scripted (so I could automate the process of taking the network offline, running the test for, say, 1 hour or 1000 cycles, then re-enabling the network in C-Gate); the site has redundant CNI connections, so this wouldn't be too bad. But if I can keep C-Gate in the loop, that's still preferable.
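    The scripted test described above (run for about an hour or 1000 cycles, whichever comes first, while keeping bus load low) can be sketched as a loop around a ping function. The ping callable is a hypothetical placeholder for whatever mechanism is used (direct CNI or C-Gate); taking the network offline and back online in C-Gate is deliberately left out, since that depends on the site setup. The pause parameter is an assumed knob for spacing out pings.

```python
import time
from collections import Counter
from typing import Callable, Iterable


def run_health_check(ping: Callable[[int], bool], units: Iterable[int],
                     max_cycles: int = 1000, max_seconds: float = 3600.0,
                     pause: float = 0.0) -> Counter:
    """Ping every unit once per cycle until either limit is reached.
    `ping` is a hypothetical callable returning True if the unit answered.
    Returns a Counter of failed pings per unit."""
    units = list(units)
    failures: Counter = Counter()
    deadline = time.monotonic() + max_seconds
    for _ in range(max_cycles):
        if time.monotonic() >= deadline:
            break
        for unit in units:
            if not ping(unit):
                failures[unit] += 1
            time.sleep(pause)  # space out pings so the test doesn't flood the bus
    return failures


# Example with a fake ping in which unit 99 never answers.
print(run_health_check(lambda u: u != 99, [12, 99], max_cycles=5))
```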

    I've rewritten my code to send ID commands but ignore the C-Gate sanitised responses, and to separately watch the event stream and match the responses against the requests that way. That looks like it might be more successful.
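    The approach above (send ID, ignore the sanitised reply, settle each request from the event stream) needs some bookkeeping of outstanding requests per unit. A minimal sketch, assuming failure events look like the log line quoted at the top of the thread ("command-fail-count=1 unit=99 reason=no response (...)"); the exact text on the event stream may differ. PendingPings, on_reply and the 5.5 s default (chosen to match "after 5500ms" in the log) are hypothetical, and what counts as a genuine reply event is left for the caller to decide.

```python
import re
import time
from typing import Dict, Optional

# Failure format copied from the log line in this thread; the event-stream
# wording is an assumption and may differ.
FAIL_RE = re.compile(r"command-fail-count=\d+ unit=(\d+) reason=(.+)")


class PendingPings:
    """Track outstanding ID requests per unit and settle them from events."""

    def __init__(self, timeout: float = 5.5) -> None:
        self.timeout = timeout
        self.pending: Dict[int, float] = {}  # unit -> time the ID was sent
        self.results: Dict[int, str] = {}    # unit -> "ok" or failure reason

    def sent(self, unit: int, now: Optional[float] = None) -> None:
        self.pending[unit] = time.monotonic() if now is None else now

    def on_reply(self, unit: int) -> None:
        # Hypothetical hook: call when an event shows the unit itself answered.
        if self.pending.pop(unit, None) is not None:
            self.results[unit] = "ok"

    def on_event(self, line: str) -> None:
        m = FAIL_RE.search(line)
        if m:
            unit = int(m.group(1))
            if self.pending.pop(unit, None) is not None:
                self.results[unit] = m.group(2).strip()

    def expire(self, now: Optional[float] = None) -> None:
        # Anything still unanswered after `timeout` seconds is marked failed.
        now = time.monotonic() if now is None else now
        for unit, started in list(self.pending.items()):
            if now - started > self.timeout:
                del self.pending[unit]
                self.results[unit] = "local timeout"
```

    Note that the failure line in the log carries only the unit number, not the network, so on a multi-network site the pending table would need to be kept per network.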
     
    more-solutions, Feb 13, 2022
    #7