Saturday, February 12, 2011

INTEL : RETURN OF THE BUG

BUGS are like a career partner for a verification engineer. Every new bug discovery brings a feeling of elation and satisfaction. A verification engineer probably says the word BUG more often than his or her kid's name :).
However, bugs aren't always welcome, especially when discovered late. Intel, the top semiconductor company, has recently been bitten by another nasty bug.
OLD BUG – In late 1994, Intel confirmed a bug, popularly known as the FDIV bug (FDIV is the x86 assembly language floating-point divide instruction), in the hardware divide block of the Pentium processor. According to Intel, the bug was rare (once every 9 to 10 billion operand pairs) and its occurrence depended on the frequency of FP instruction usage, the input operands, how the output of this unit propagated into further computation, and the way in which the final results were interpreted. The bug was root-caused to a few missing entries in the lookup table used by the division algorithm and was fixed with a mask change in the re-spin. The total cost of replacing the processors was estimated at $475 million in 1995.
One example used to test for the bug: (824633702441.0) x (1/824633702441.0) should be exactly 1, while the affected chips returned 0.999999996274709702 for this calculation.
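
For the curious, this check can be reproduced in a few lines of ordinary floating-point code. The sketch below is only an illustration, assuming a C compiler with IEEE-754 double precision; the tolerance used in the comparison is an arbitrary choice and not part of Intel's original test case.

    /* Minimal sketch of the classic FDIV sanity check (illustrative only).
       On an affected Pentium the product below came out as
       0.999999996274709702; a correct divider returns a value equal to
       1.0 within double-precision rounding. */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double x = 824633702441.0;
        double product = x * (1.0 / x);   /* divide, then multiply back */

        printf("x * (1/x) = %.18f\n", product);

        if (fabs(product - 1.0) < 1e-12)  /* arbitrary tolerance */
            printf("Product is 1.0 to within rounding: divider looks fine.\n");
        else
            printf("Product deviates from 1.0: suspect the divider.\n");

        return 0;
    }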
NEW BUG – A few weeks back, Intel confirmed another bug in the recently announced (at CES) Cougar Point chipsets, which have two sets of SATA ports (3 Gbps & 6 Gbps). The problem was traced to a transistor in the 3 Gbps PLL clocking tree that has a very thin gate oxide, allowing it to turn on at a very low voltage. This transistor was biased with a high voltage, leading to high leakage current. Continuous usage of these ports would lead to transistor failure, estimated to occur within 2 to 3 years. The problem was confirmed in Intel's reliability lab while testing for accelerated-lifetime performance (~a time machine :)). The remedy is a metal-layer fix. Intel has decided to replace the chipsets and has put the approximate worst-case cost at $700 million.
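
As a side note on how accelerated-lifetime testing projects a failure window like "2 to 3 years", the sketch below uses the generic Arrhenius temperature-acceleration model often applied in reliability labs. This is not Intel's methodology, and the activation energy, temperatures and test duration are placeholder values chosen purely for illustration (the Cougar Point issue itself was a voltage-stress problem, for which other acceleration models apply).

    /* Minimal sketch of accelerated life testing: project field lifetime
       from a short stress test using the generic Arrhenius model.
       All numbers below are assumed placeholders, not Intel data. */
    #include <stdio.h>
    #include <math.h>

    #define BOLTZMANN_EV 8.617e-5             /* Boltzmann constant, eV/K */

    int main(void)
    {
        double ea         = 0.7;              /* assumed activation energy, eV   */
        double t_use      = 328.0;            /* assumed field temperature, K    */
        double t_stress   = 398.0;            /* assumed stress temperature, K   */
        double stress_hrs = 1000.0;           /* hypothetical stress-test length */

        /* Arrhenius acceleration factor between stress and use conditions */
        double af = exp((ea / BOLTZMANN_EV) * (1.0 / t_use - 1.0 / t_stress));

        printf("Acceleration factor : %.1f\n", af);
        printf("Projected field life: %.0f hours (~%.1f years)\n",
               stress_hrs * af, stress_hrs * af / (24.0 * 365.0));
        return 0;
    }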

Points to ponder
-  The time between the two bugs (both of which reached customers) is about 15 years.
-  Intel's handling of such crisis situations has improved a lot since the first one.
-  Corrected samples took months for the first bug but only weeks for the second.
-  The FDIV bug could have been caught with random verification (which was still evolving at that time).
-  The new one is a reliability issue; we probably need modeling techniques to uncover such issues early enough.
-  The cost of this bug to Intel is more than the annual revenue of many semiconductor companies.
BUGS, an inevitable part of our careers, can be costly at times. Ongoing developments in verification methodologies, standards, modeling and EDA technology all work towards taming those hidden defects that tend, some day, to prove Murphy's law (if anything can go wrong, it will go wrong when you least expect it).
Be cautious & Happy bug hunting!

4 comments:

  1. In my humble opinion, such a bug means either that a reliability check is missing from Intel's design process or that the device taped out without running (or fixing) that check.
    Assuming the check was missing, the right way to handle it is to develop a check that covers such reliability issues, run it and fix all the issues it uncovers.
    If such a check existed but was ignored or not run, the right way is to run it and fix the findings.
    In both cases, statistically there is a good chance that similar reliability issues exist in other devices taped out in the same process technology.
    All those devices may need to be re-run through the check to make sure no such reliability issues exist.

  2. Hi siddhakarana,

    Thank you, this is really a nice article. It will make people think twice before declaring the verification of such a complex chip to be over.
    It will also help spur new innovation in the verification field.

    Regards,
    Jack

  3. From Sunil Kakkar - linkedin

    http://www.linkedin.com/groupItem?view=&gid=117645&type=member&item=43557870&qid=b04c971d-049a-47cf-93ac-8b5a5b3ca6a3&goback=%2Egmp_117645

    The projected $$$$ COST to rectify a functional/structural BUG, although titanic by any standard, is still dwarfed by the opportunity cost that any company incurs when a bug is found in silicon. About 25 years ago, a silicon bug used to cost around $100,000 to put things back on track. That number has now grown to ONE BILLION DOLLARS. But even this cost is minuscule when compared to the lost potential customers, the effect on the product roadmap for years to come, the increased cost from then on because of paranoid behavior on future products, and the hierarchy put in place to prevent a recurrence of something which should not have happened in the first place.

    It has been a long-standing gripe of the silicon community that functional verification programs are of no use on the silicon, since functional test benches are not portable or scalable to silicon debug. Nor do the functional verification tests adequately stress chip-level concurrent activity that can be accurately measured for performance by appropriate assertions. I find that chip-level tests, if run at all (not merely integration tests which combine block-level tests into system-level tests and call them chip level), are too simplistic, often taking refuge under the excuse of the cycle limitations of modern-day simulators. In other words, the chip-level tests are not wicked enough to grab the attention of any of the blocks on the chip. They are ignored and hence they flash a pass message. They are given what they want and hence the scoreboards have no problem declaring success. In other words, they can be handled, and hence they do not fail. But if they were fiendish enough and did manage to create a panic among the blocks by making lots of audible noise (meaning choking up the traffic on all the buses and bringing the chip to a standstill), to an extent that the chip could not handle such traffic, and if this went on persistently for extended periods of time, one would start seeing the individual blocks panicking and committing errors, as well as "bursting at the seams" - that is, the interface signals caving in.

    The next challenge would be to use synthesizable test benches and assertions so that these functional test benches could be reused on the actual silicon, FPGA, emulation platforms and so-called evaluation boards. When this is done consistently, no matter what verification tool/environment is used, we can rise above the fear of Murphy's law since nothing can go wrong. A design must work the first time it is taped out. There is never a second chance in this field.

    Sunil Kakkar
    Chief Architect - Verification
    SKAK INC.

  4. Visit the LinkedIn group for more comments.

    http://www.linkedin.com/groupItem?view=&gid=117645&type=member&item=43557870&qid=b04c971d-049a-47cf-93ac-8b5a5b3ca6a3&goback=%2Egmp_117645
