Monday, October 3, 2011

The Case of the Mysterious Computer Resets

A computer and digital signal processor for a military airborne radar system I worked on was based on mid-1970s technology. That meant multilayer printed circuit boards and through-pin DIPs, usually in 16-pin packages. In order to load, trace, and test the computer’s software, you had to connect a test console with a paper-tape reader to a test port on the outside of the computer box. The console port connected directly to the address and data pins of the CPU, so the cable was kept very short, less than eight inches, to prevent inductive/capacitive loading from degrading the signals.

In the early 1990s, our customer started to experience spurious computer resets. They were rare and would only happen on a test bench when the console was connected. The first assumption was a bad cable or circuit in the console, but swapping those out made no difference. Eventually, they sent a computer that was intermittently failing back to the factory for further analysis. I was the lead test and integration engineer, but this problem strained my nascent troubleshooting skills.

I put a signal analyzer anywhere I thought we might find the problem, but that yielded no results. The signals were clean as a whistle. I rarely saw the reset condition, and it was nearly impossible to catch one on a logic analyzer.

Rare, intermittent problems like this are difficult to investigate under the best conditions. In a meeting to discuss the situation with some of our engineering managers, one of them asked me if I remembered a similar problem from the time when the system was in development. “When was that?” I asked. He replied, “Oh, around 1972 or 73.” “No,” I told him. “I was in fifth or sixth grade then.” I never quite caught the ensuing stream of epithets he muttered under his breath, but it included something about smart-aleck young engineers.

Since the problem only occurred when the console cable was connected, we decided to build a longer cable to see if that would make the problem happen more consistently. We had the technicians build a three-foot-long cable extender, and, bingo, with it connected, we had a nearly continuous string of resets. We quickly traced the problem to one of the lines that could trigger the reset pin on the CPU. The line was buffered and didn’t connect to the console, which confounded us some more.

We got the layouts for the printed circuit board and looked at the reset signal paths and the paths of nearby circuits. We found one of the CPU address lines (which WAS connected to the console interface) that ran across the board and made a U-turn right around a pin connected to the reset circuit. When the console was attached, the address line could sometimes trigger a reset. To test it, we isolated that address-line trace and rerouted the signal with short point-to-point wiring soldered in place. Voilà! No more resets, even with our long extension cable attached.

We also noticed the board was a recent revision, and this layout was unique to that revision. It was typical to lay out boards again as parts became unavailable and newer substitutes were inserted into the design. But board-level and assembly-level testing didn’t generally use the test console, so we didn’t catch this new problem.
