CBUS network being intermittently flooded with strange traffic

Discussion in 'C-Bus Wired Hardware' started by asimcox, Jul 9, 2019.

  1. asimcox

    asimcox

    Joined:
    Jul 19, 2017
    Messages:
    15
    Likes Received:
    0
    Location:
    Melbourne
    Hi All,

    I've been chasing a problem on my CBUS install and am starting to run out of ideas to find the issue. I'm hoping someone might be able to point me in the right direction.
    The system I have has been running well for around 6 years. For the past year or so I have also had it working in an OpenHab home automation system via a Raspberry Pi which is also running a CGATE instance.

    The problem I am having is that I am getting large amounts of strange traffic flooding the network. These include messages turning random groups on or off. Some of these groups exist in my system, some do not. When the messages turn a real group on or off, it is usually already in that state, but occasionally it does something like turning my sweep fans on full when they were previously off, and once it turned on my MRA amps full blast in the middle of the night. Luckily the MRA was switched to an unused source so it was not really loud. This has only been happening for about the last couple of months. The symptoms are similar to those documented elsewhere on this forum, but tracking the cause of the issue is proving fruitless so far. The traffic appears to come from my touchscreens (a black and white and a colour) and Multi Room audio matrix switcher. The problem seems to resolve itself after a few days or after I power down the whole system, only for it to appear again, sometimes days later and sometimes within minutes. The toolkit diagnostics show voltages between 28V and 32V on all devices, the network burden present, and the clock enabled on 3 devices.
    From my reading, this type of issue is often associated with a power supply problem, and the origin of the messages is likely to be the output units themselves rather than the units shown in the Toolkit Application Log.
    I contacted Clipsal support about this and they told me how to check the power supplies on my system. Unfortunately every supply is giving the correct voltage, and the voltages from positive to earth and from negative to earth are less than 0.1V from each other. Briefly looking at the supply with a CRO to look for noise didn't help because I'm not sure what the waveform with the cbus traffic is supposed to look like.
    I had the diagnostic software running at one point, but from my limited experience with it, I wasn't able to get any useful information.
    Having confirmed this I still went through a process of taking each of the output units with supplies out of the network one by one and in combinations. This is a long process because of the intermittent nature of the problem. Sometimes it looks like I have found the culprit only to have the issue pop up a day or so later.
    I have more recently been going through a process of disconnecting units in the network and breaking the network at that point to see if the fault is cable or unit related. While I have found that some short sections of the network appear to be working ok, I haven't been able to isolate a particular section of devices or cable which is causing the fault leading me to think the problem could still be power supply related.
    I have completely swapped out each of the dimmer output units one by one with a brand new dimmer but it didn't help the situation.

    I have 49 devices on my network (plus an additional PCI I have on while trying to fix the network). According to Toolkit current consumption is 922mA and my supplies give 1950mA. Network impedance is 55ohms and the burden is present.
    I have the following units in the network
    4 x 8 channel dimmers with PS - one of which has the network burden turned on
    4 x 12 channel Voltage Free Relays with PS
    2 x Sweep Fan Controllers
    1 x 1 gang key input unit (plastic slimline)
    9 x 4 gang key input units (plastic slimline)
    8 x 5 gang saturn DLTs
    7 x 6 gang saturn key inputs
    6 x SENPIRIB PIRs
    1 x Black and white touchscreen with logic (no logic programmed, only some schedules)
    1 x Colour C-Touch Spectrum touchscreen with logic (no logic programmed)
    1 x 1st gen wiser (currently only being used as a CNI when required)
    1 x CBUS SIM (connected to the raspberry Pi)
    1 x Multi room audio switcher
    3 x MRA 25W amplifiers.
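
    As a quick sanity check on the figures above, here is a minimal Python sketch that tallies the unit list and compares the current draw Toolkit reports with the supply capacity. The quantities and the 922mA / 1950mA figures are copied from this post; the variable names are only illustrative.

```python
# Unit quantities copied from the list in the post (description -> count).
units = {
    "8 channel dimmer with PS": 4,
    "12 channel voltage free relay with PS": 4,
    "Sweep fan controller": 2,
    "1 gang key input unit": 1,
    "4 gang key input unit": 9,
    "5 gang Saturn DLT": 8,
    "6 gang Saturn key input": 7,
    "SENPIRIB PIR": 6,
    "Black and white touchscreen": 1,
    "Colour C-Touch Spectrum touchscreen": 1,
    "1st gen Wiser": 1,
    "CBUS SIM": 1,
    "Multi room audio switcher": 1,
    "MRA 25W amplifier": 3,
}

total_units = sum(units.values())

# Figures reported by Toolkit in the post, in milliamps.
consumption_ma = 922
supply_ma = 1950

headroom_ma = supply_ma - consumption_ma
utilisation = consumption_ma / supply_ma

print(f"Units on network: {total_units}")
print(f"Supply headroom:  {headroom_ma} mA ({utilisation:.0%} of capacity in use)")
```

    The tally agrees with the 49 devices quoted above, and the supplies are running at a little under half of their rated capacity.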

    Another issue has appeared in the last week or so. One of the MRA 25W amplifiers which was previously working fine has stopped responding. It still shows up on a toolkit network scan, but if I try to open it in toolkit, it goes to 30% loading, briefly goes to 40% then quickly back to 10, 20, then 30% and stays there until toolkit gives an error saying it failed to load the unit programming (error 3036). According to a toolkit scan summary the firmware on this unit is v5.4.00 which is the same as the other MRA units. I'm hoping this is not the first failure of more to come due to the issue I have been having.

    If anyone has any ideas what I can try, please let me know. I can send through more details and screen captures if that will help.

    Anthony
     
    Last edited: Jul 9, 2019
    asimcox, Jul 9, 2019
    #1

  2. asimcox

    DarylMc

    Joined:
    Mar 24, 2006
    Messages:
    1,089
    Likes Received:
    12
    Location:
    Brisbane, QLD, Australia
    Hi Anthony
    I get the impression you are using CBus Toolkit Application log to monitor network activity.
    Have you had a look at the CGate logs on the raspberry pi?
    By default it should be always logging at a high level and located at
    /usr/local/bin/cgate/logs
    I haven't had an issue as you describe but if messages are being sent unexpectedly I think it is going to be one of your logic capable devices or whatever is talking to CGate on the RPI.
     
    DarylMc, Jul 9, 2019
    #2

  3. asimcox

    asimcox

    Hi Daryl,

    Thanks for your response.
    I also had some suspicions about the Raspberry Pi which is why I borrowed a serial PCI from my brother to monitor the network as well.
    I have tried moving the Raspberry Pi CGate to use the PCI from my wiser rather than the SIM it was connected to just in case that was causing an issue (and the SIM is now disconnected from the network). I have also completely disconnected the Pi. None of this seems to have helped. I still get the messages.

    My Raspberry Pi is back on the network using the wiser PCI and I end up with between 4 and 10 log files per day when the network is acting up. When it is not, I get one file.
    I had previously looked at the log files but they didn't seem to give much more information than the toolkit log except there were a fair few commands which appear to be from the OpenHab CBUS binding which were just NOOPs. I'm assuming those are a keep alive as every now and then the OpenHab binding does a poll of the network status as well. The logs which show the "phantom" messages seemed to show they were indeed coming from each of the "smarter" or "unusual" devices, these being the touchscreens, Wiser (when its PCI is not being used by the Pi) and the MRA switcher and Amplifiers. The only problem is I have taken each of these out of the network one at a time and the other devices just take over and issue more messages.
    However after your prompting I have noticed one line I had missed in the cgate logs which precedes the phantom messages, but doesn't appear in the toolkit logs. This command is

    20190710-122940 734 //SIMCOX/254 3c594f90-82df-1037-ba1e-817d69fcadce response: 05C8380001300132013363 processed by network.

    The response is different every time, but the first part of the command is the same. The only problem is I don't know cgate well enough to know what this command is. From what I can see from the cgate manual, command 734 is a "Response line:" which I had assumed was a response to another command, but since there doesn't appear to be a preceding command that doesn't make much sense. If I could find out which group/unit the 3c594f90-82df-1037-ba1e-817d69fcadce refers to it might give me a clue. I have worked out the identifier for some groups in the network by scanning both the cgate log and checking the corresponding toolkit log. (eg 3c60a290-82df-1037-ba4b-817d69fcadce appears to be group 048 GarageHallLight )

    I have copied a snippet of the cgate logs and the corresponding toolkit logs below. Hopefully someone might be able to shed some light on this a bit further.


    CGATE LOGS
    20190710-122940 734 //SIMCOX/254 3c594f90-82df-1037-ba1e-817d69fcadce response: 05C8380001300132013363 processed by network.

    20190710-122940 730 //SIMCOX/254/56/48 3c60a290-82df-1037-ba4b-817d69fcadce new level=0 sourceunit=200 ramptime=0
    20190710-122940 730 //SIMCOX/254/56/50 3c60f0b0-82df-1037-ba4d-817d69fcadce new level=0 sourceunit=200 ramptime=0
    20190710-122940 730 //SIMCOX/254/56/51 3c6117c0-82df-1037-ba4e-817d69fcadce new level=0 sourceunit=200 ramptime=0
    20190710-122943 734 //SIMCOX/254 3c594f90-82df-1037-ba1e-817d69fcadce response: 05C8380079107914E5 processed by network.
    20190710-122943 730 //SIMCOX/254/56/16 3c642500-82df-1037-ba7c-817d69fcadce new level=255 sourceunit=200 ramptime=0
    20190710-122943 730 //SIMCOX/254/56/20 3c64c140-82df-1037-ba81-817d69fcadce new level=255 sourceunit=200 ramptime=0
    20190710-122946 734 //SIMCOX/254 3c594f90-82df-1037-ba1e-817d69fcadce response: 05C8380001000108790A6E processed by network.
    20190710-122946 730 //SIMCOX/254/56/0 3c5c35c0-82df-1037-ba22-817d69fcadce new level=0 sourceunit=200 ramptime=0
    20190710-122946 730 //SIMCOX/254/56/8 3c63fdf0-82df-1037-ba71-817d69fcadce new level=0 sourceunit=200 ramptime=0
    20190710-122946 730 //SIMCOX/254/56/10 3c63d6e0-82df-1037-ba75-817d69fcadce new level=255 sourceunit=200 ramptime=0
    20190710-122946 734 //SIMCOX/254 3c594f90-82df-1037-ba1e-817d69fcadce response: 05C9380001060107EB processed by network.
    20190710-122946 730 //SIMCOX/254/56/6 3c6361b0-82df-1037-ba66-817d69fcadce new level=0 sourceunit=201 ramptime=0
    20190710-122946 730 //SIMCOX/254/56/7 3c5f6a10-82df-1037-ba2c-817d69fcadce new level=0 sourceunit=201 ramptime=0
    20190710-122949 761 cmd63 - Command: [25618] noop
    20190710-122949 766 cmd63 - Response: [25618] 200 OK.
    20190710-122952 734 //SIMCOX/254 3c594f90-82df-1037-ba1e-817d69fcadce response: 05C83800792A58 processed by network.
    20190710-122952 730 //SIMCOX/254/56/42 3c60f0b0-82df-1037-ba42-817d69fcadce new level=255 sourceunit=200 ramptime=0
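
    To see at a glance which units the phantom messages are attributed to, the 730 "new level" lines can be tallied by source unit. This is a minimal Python sketch run against a few lines copied from the excerpt above. The regular expression is based only on the line layout visible in this excerpt, so treat it as an assumption rather than a description of the full C-Gate log format; the function name is made up for illustration.

```python
import re
from collections import Counter

# A few lines copied from the C-Gate log excerpt in this post.
SAMPLE_LOG = """\
20190710-122940 734 //SIMCOX/254 3c594f90-82df-1037-ba1e-817d69fcadce response: 05C8380001300132013363 processed by network.
20190710-122940 730 //SIMCOX/254/56/48 3c60a290-82df-1037-ba4b-817d69fcadce new level=0 sourceunit=200 ramptime=0
20190710-122940 730 //SIMCOX/254/56/50 3c60f0b0-82df-1037-ba4d-817d69fcadce new level=0 sourceunit=200 ramptime=0
20190710-122940 730 //SIMCOX/254/56/51 3c6117c0-82df-1037-ba4e-817d69fcadce new level=0 sourceunit=200 ramptime=0
20190710-122946 734 //SIMCOX/254 3c594f90-82df-1037-ba1e-817d69fcadce response: 05C9380001060107EB processed by network.
20190710-122946 730 //SIMCOX/254/56/6 3c6361b0-82df-1037-ba66-817d69fcadce new level=0 sourceunit=201 ramptime=0
20190710-122946 730 //SIMCOX/254/56/7 3c5f6a10-82df-1037-ba2c-817d69fcadce new level=0 sourceunit=201 ramptime=0
20190710-122949 761 cmd63 - Command: [25618] noop
"""

# Layout assumed from the excerpt: timestamp, code 730, //project/net/app/group,
# an object identifier, then level, source unit and ramp time.
EVENT_RE = re.compile(
    r"^(?P<stamp>\d{8}-\d{6}) 730 "
    r"//(?P<project>[^/\s]+)/(?P<network>\d+)/(?P<app>\d+)/(?P<group>\d+) "
    r"\S+ new level=(?P<level>\d+) sourceunit=(?P<unit>\d+) ramptime=(?P<ramp>\d+)"
)


def parse_level_events(text):
    """Return one dict per 730 'new level' line; all other lines are skipped."""
    events = []
    for line in text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:
            events.append({
                "stamp": match["stamp"],
                "app": int(match["app"]),
                "group": int(match["group"]),
                "level": int(match["level"]),
                "unit": int(match["unit"]),
            })
    return events


events = parse_level_events(SAMPLE_LOG)
per_unit = Counter(event["unit"] for event in events)
for unit, count in sorted(per_unit.items()):
    print(f"source unit {unit}: {count} level changes")
```

    Running the same function over a whole log file (read with open(...).read()) would give the totals per source unit for that file.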

    Toolkit Logs

    DateTime= 10/07/2019 12:29:39.849 App= 056 Lighting Group= 048 GarageHallLight Unit= 200 PC_CTBL/UPSTAIRS Event= Group off
    DateTime= 10/07/2019 12:29:39.890 App= 056 Lighting Group= 050 GarageLight Unit= 200 PC_CTBL/UPSTAIRS Event= Group off
    DateTime= 10/07/2019 12:29:39.893 App= 056 Lighting Group= 051 GarageExternalRear Unit= 200 PC_CTBL/UPSTAIRS Event= Group off
    DateTime= 10/07/2019 12:29:42.837 App= 056 Lighting Group= 016 StudyOuter Unit= 200 PC_CTBL/UPSTAIRS Event= Group on
    DateTime= 10/07/2019 12:29:42.878 App= 056 Lighting Group= 020 StudyInner Unit= 200 PC_CTBL/UPSTAIRS Event= Group on
    DateTime= 10/07/2019 12:29:45.852 App= 056 Lighting Group= 000 KitchenIslandLights Unit= 200 PC_CTBL/UPSTAIRS Event= Group off
    DateTime= 10/07/2019 12:29:45.894 App= 056 Lighting Group= 008 RumpusFront Unit= 200 PC_CTBL/UPSTAIRS Event= Group off
    DateTime= 10/07/2019 12:29:45.896 App= 056 Lighting Group= 010 BedRoom3Light Unit= 200 PC_CTBL/UPSTAIRS Event= Group on
    DateTime= 10/07/2019 12:29:45.938 App= 056 Lighting Group= 006 UpStairsBathroomLight Unit= 201 PC_CTDL/FAMILY Event= Group off
    DateTime= 10/07/2019 12:29:45.979 App= 056 Lighting Group= 007 EnsuiteLight Unit= 201 PC_CTDL/FAMILY Event= Group off
    DateTime= 10/07/2019 12:29:51.823 App= 056 Lighting Group= 042 WorkshopLight Unit= 200 PC_CTBL/UPSTAIRS Event= Group on
    I think I recall there is a way to get more in depth logging from CGate but I'll have to look into that a bit later. I am just on the way out to return my brother's PCI to him.

    Anthony

     
    asimcox, Jul 10, 2019
    #3
  4. asimcox

    DarylMc

    Hello Anthony
    Hopefully someone else might see something in those logs.
    CGate log settings are in the CGate config file and you can read in the CGate manual how to change them.
    They should already be at the highest level 9.
    It is interesting that there are many log files on the days which have issues, because CGate will start a new log file daily, every time it restarts, and whenever the file gets above 5MB.
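
    One way to keep an eye on this is to count the log files and bytes written per day. The Python sketch below groups files by their modification date, which is an assumption made because the log file naming is not shown in this thread. The directory is the default location mentioned earlier; the function name is hypothetical.

```python
from datetime import datetime
from pathlib import Path


def logs_per_day(log_dir):
    """Map each modification date to (file count, total bytes) for files in log_dir."""
    directory = Path(log_dir)
    per_day = {}
    if not directory.is_dir():
        return per_day  # nothing to report if the directory is missing
    for path in directory.iterdir():
        if not path.is_file():
            continue
        info = path.stat()
        day = datetime.fromtimestamp(info.st_mtime).date()
        count, size = per_day.get(day, (0, 0))
        per_day[day] = (count + 1, size + info.st_size)
    return per_day


if __name__ == "__main__":
    # Default C-Gate log location quoted earlier in the thread.
    for day, (count, size) in sorted(logs_per_day("/usr/local/bin/cgate/logs").items()):
        print(f"{day}: {count} file(s), {size / 1_000_000:.1f} MB")
```

    A day with several files close to 5MB each points to heavy traffic, while several small files on one day would point to CGate restarting.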
    I'd start with the Wiser but I still think you should get all the Touchscreens and RPI off the CBus network and let it run for a while if you haven't already.
     
    DarylMc, Jul 10, 2019
    #4
  5. asimcox

    DarylMc

    Since you are using a remote CGate it occurs to me an easy way to break the network might be if the project xml file on the RPI was different to the xml on your Windows machine and you transferred some changes to your network recently with CBus Toolkit using the local CGate.
     
    DarylMc, Jul 10, 2019
    #5
  6. asimcox

    asimcox

    Hi Daryl,

    Thanks for your ideas. I appreciate your help.

    Yes, I am aware of the risk of having a different xml on the remote and local. I don't run CGate locally on my normal windows machine. All toolkit work is done via the remote CGate. The test machine I had set up was a laptop with my brother's CNI, and it was used only for monitoring and for running the diagnostic software. I didn't make any changes to the network with this machine.

    The logs are all 5MB. Four days ago there were 5 files, three days ago there were 8 of these, the last two days there were 4, and so far up to mid day today only 1 which is around 2MB. So guess what?? I have some time today to track issues down and of course I can't get it to show a single fault!! According to the logs, it has not had a single issue all day.

    While looking through the logs I have noticed that OpenHab generates a lot of commands when checking network status. The version of the binding I am using seems to do a brief scan and an in depth scan. Both check addresses not in use. The in depth scan actually checks through every address for every in use application and every possible level/selector, even if they are not defined. I just had a look and some of the great folks working on the CBUS OpenHab binding have been really active lately, and it looks like there might be a newer binding available. I might try backing up my existing Pi and trying their binding. If they have made it smart enough to just look at in use addresses it might cut down the network traffic and log use, and maybe make checking network status faster.
    I might even spin up a whole new OpenHab/CGATE since it just means using a new memory card. Having to do all the configuration in OpenHab again is likely to be a PITA though.

    Anthony
     
    asimcox, Jul 11, 2019
    #6
  7. asimcox

    ashleigh Moderator

    Joined:
    Aug 4, 2004
    Messages:
    2,350
    Likes Received:
    3
    Location:
    Adelaide, South Australia
    Dumb question: are you using a SIM to get into the C-Bus network? If you are, send me a PM.
     
    ashleigh, Jul 13, 2019
    #7
  8. asimcox

    asimcox

    Hi Ashleigh

    I normally use a SIM from a Cbus alarm system I never installed, but while tracking down the issues I have switched to using the CNI from my wiser 1.
    Anthony
     
    asimcox, Jul 13, 2019
    #8
  9. asimcox

    NickD Moderator

    Joined:
    Nov 1, 2004
    Messages:
    1,381
    Likes Received:
    33
    Location:
    Adelaide
    A few comments.

    That network impedance is too low... is that a typo? Regardless, with a network that size you should not have a burden enabled.

    This scan failure at 30% is normal for an MRA unit if the unit is not powered... it can read the parameters from the PCI in the unit but can't read from the second (main) processor because it's not responding.

    The 734 response line (05C8380001300132013363) is a message from unit $C8 (200 decimal). The commands are 01 30, 01 32, and 01 33. 01 is an OFF command, and it's being sent to groups $30, $32, and $33 (48, 50, and 51).

    In the other messages you're seeing similar things on different groups from units C8 and C9 (200, 201). My guess is these are units trying to correct MMI errors.
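
    The reading described above can be sketched in a few lines of Python. This is only an illustration built from the sample lines in this thread, not a full C-Bus decoder. It assumes byte 1 is the source unit, byte 2 is the application ($38 = 56, Lighting), the body is a run of two-byte command/group pairs, and the last byte is a checksum (in all five sample lines the bytes add up to a multiple of 256). Only the OFF ($01) and ON ($79) commands seen here are named; ramp commands, which carry an extra level byte, are not handled. The function name is made up.

```python
# Only the two lighting commands that appear in this thread's log lines.
OPCODES = {0x01: "OFF", 0x79: "ON"}


def decode_response(hex_payload):
    """Decode the hex string from a C-Gate 734 response line (simple ON/OFF only)."""
    data = bytes.fromhex(hex_payload)
    body = data[4:-1]  # skip the 4 header bytes and the trailing checksum byte
    commands = []
    for i in range(0, len(body) - 1, 2):
        opcode, group = body[i], body[i + 1]
        commands.append((OPCODES.get(opcode, f"0x{opcode:02X}"), group))
    return {
        "source_unit": data[1],
        "application": data[2],
        "commands": commands,
        # Assumption from the samples: all bytes sum to 0 modulo 256.
        "checksum_ok": sum(data) % 256 == 0,
    }


for payload in ("05C8380001300132013363", "05C8380079107914E5", "05C9380001060107EB"):
    decoded = decode_response(payload)
    print(payload, "->", decoded)
```

    For the first payload this gives source unit 200, application 56, and OFF commands to groups 48, 50 and 51, which lines up with the 730 lines that follow it in the log.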

    If you turn on event level 9 in the logs I think you should be able to see the MMIs which might be able to confirm this.

    If it is MMI errors, this could be due to poor network communications. I would try turning off the burden and possibly removing the MRA Matrix switcher (the 1950mA suggests you have an old one with an integrated power supply).

    Nick
     
    NickD, Jul 16, 2019 at 2:01 AM
    #9
  10. asimcox

    asimcox

    Hi Nick,

    Thanks for your reply.

    Yes, sorry, there is a 5 missing from the impedance. It should be 555.

    When this all started happening I did try switching off the burden, but that made the network comms so unreliable that it was hard to get it to do anything without erroring out. I actually had to isolate unit 1 just to turn the burden back on.

    At the moment, the amplifiers are being powered by the Matrix Switcher. All three are configured the same but only this one is showing this problem. This particular unit is located next to one of the others and the cbus daisy chains between them. I have tried swapping cabling etc as part of the fault finding. I have some external power supplies for these which I have never used so I will try powering this unit from a supply and see what happens. That however starts me wondering about the stability of the switcher power supply so perhaps the network issues might be related. I have never loaded the switcher up much. I have a bunch of 10W amps I haven't installed yet, but now the kids are no longer toddlers, I was intending to get them up and running. I might have to check out the power supply on the switcher before I try installing those.

    Unit 200 is one of my touch screens and 201 is the other one. In all of the logs it has been these two screens or the MRA Switcher/amps or Wiser which have been the ones generating all the traffic. However removing one of these from the network just transfers the traffic generation to one of the others.

    Unfortunately (fortunately) the network has been stable and hasn't missed a beat since last Thursday, so it has behaved for nearly a week. As soon as it starts playing up again I will try your suggestions and see if it helps the situation. I might try to get hold of a hardware burden so I can easily remove it without reprogramming. Unfortunately they only seem to sell those in packs of 10. I'll report back once the network fails again and I get a chance to test any or all of this out.

    I have been suspicious about the cabling for this install. The electrical company who originally put this in didn't inspire a lot of confidence and they had no clue when it came to programming. It was installed by the builder's electrician. Seeing some of their cabling (mains and CBUS cat 5) made me shudder. I am a licensed cabler and installed all the rest of the data/comms cabling in the house while it was being built, but the builder said all the electrical had to be done by their electrician. The comment made by the electrician's apprentice when he saw my cabling was "I hope you don't expect our cables to look as neat as yours", which didn't help. As part of this fault finding I have been re-terminating and fixing up a lot of the pink cable runs as they are pretty poor. However it has been working for about 6 years with only a few glitches so I guess I might just be a bit of a perfectionist, especially since it is my place.

    Anthony
     
    asimcox, Jul 16, 2019 at 2:02 PM
    #10
  11. asimcox

    asimcox

    Hi Nick,

    So the fun continues ......

    Last night the network went haywire again after more than a week where everything was normal.

    This time all of the PIRs have ceased functioning. I can still see them on the network, but none of them are working. Even the red light in the sensor itself is not turning on. I tried resetting one of them and reprogramming it, but this made no difference.

    Then this morning none of the switches in the network functioned at all. When I looked in Toolkit, I couldn't even turn loads on or off. When I looked at a physical unit's programming it was screwed up: all of the physical unit's key functions referred to the correct group, but instead of the group name, they showed the group number. The database units were all still correct. I transferred the programming for one of the dimmers from the database to the physical unit and I was then able to use Toolkit to turn those loads on. I then transferred a switch's programming and then I could also use that switch. So in the end I transferred the whole database to the physical units and the switches all started working again, but the PIRs remain non functional.

    I have tried turning off the burden which is on unit one (a dimmer), but the network comms becomes very bad. A network scan from Toolkit finds fewer than half the units, and finds some other units at addresses which are unused. Each scan shows different units and different phantom units, and most times the scan errors out. Turning the burden back on restores better communication, although with all the extra traffic being generated it is slow.

    I have tried powering down the entire network a couple of times, which sometimes stops the issues, but this time they remain.

    All of the MRA devices have been removed from the network.

    Tomorrow's job will be to pull the touchscreens from the network to see what effect that has. I am also going to try physically disconnecting the PIRs in case they are causing problems, but I am doubtful that all of them would have failed at the same time.

    I have tried to set the CGATE logging level to 9 and restarted CGATE and Toolkit, but it doesn't seem to make any difference to the logging. Is the latest version of CGATE already logging at level 9? I tried the instructions at https://www.cbusforums.com/threads/capturing-c-gate-logs.4724/. Is there some other way I should be telling CGATE to log at a higher level?

    Anthony
     
    asimcox, Jul 20, 2019 at 1:35 PM
    #11