Sunday, October 7, 2012

Verifying with JUGAAD

The total effort spent on verification is continuously on the rise. This increase can be attributed to multiple reasons, such as:
- Rising design complexity, driven by Moore’s law
- Constrained random test benches coupled with complex cross cover bins
- Incorporating multiple techniques to confirm verification closure
- Debugging RTL and the Verification environment
 
A study conducted by Wilson Research Group, commissioned by Mentor Graphics, revealed that the mean time a design engineer spends on verification increased from an average of 46% in 2007 to 50% in 2010. It also confirmed that debugging claims most of a verification engineer's bandwidth. While the effort spent on RTL debugging may rise gradually with design size and complexity, testbench (TB) debugging shows frequent spikes. The absence of a planned approach and limited tool support add to the woes. Issues in the verification environment arise mainly due to:
- Incorrect understanding of the protocol
- Limited understanding of the language and methodology features
- First timers making silly mistakes
- ‘JUGAAD’ (Hindi word for workaround)
 
Unlike design code, verification code was never subjected to area and performance optimization, so verification engineers were liberal in developing it. If something didn’t work, they found a quick workaround (jugaad) and got it working without considering the impact on testbench efficiency. Market dynamics now demand faster product turnaround, and sluggish verification affects the product development schedule considerably. Below is one such case study from past experience, in which a complex core with parallel datapaths converging on a memory arbiter (MA) block was to be verified.
 
GIVEN
 
Constrained random verification (CRV) with Vera+RVM was used to verify MA and the block (XYZ) feeding MA. 100% functional coverage was achieved at block level for both modules. The XYZ environment used the complete frame structure, so the average simulation time of a test was 30 mins, while the MA environment used just a series of bytes and words long enough to fill the FIFOs, and its simulation time was <5 mins. To stress MA further with complete frames of data and to confirm that it works with XYZ, CRV was chosen with XYZ+MA as the DUT. The rest of the datapath feeding XYZ was left to directed verification at the top level, as the total size of the core was quite large.
 
EXECUTION
 
The team quickly integrated the two environments and started simulating the tests. But the new env took ~16X more time than the XYZ standalone environment, severely impacting regression time. This kicked off a debugging effort to find the bottleneck. The first approach was to comment out the instances of the MA monitor and scoreboard in the integrated env and rerun. If the simulation time dropped, the plan was to uncomment the instances and their tasks one by one to root-cause the problem. On rerunning with this change, there was no drop in simulation time. Damn! How was that possible?
 
Reviewing the changes, the team figured out that instead of commenting out the instances, the engineer had commented out only the start of transactions. He claimed that just having an instance in the env shouldn’t affect simulation time as long as no transactions are processed by the MA components. Made sense! But then why this Kolaveri (killer instinct)?
 
To nail down the problem, multiple approaches such as code review, increased log verbosity and profiling were kicked off in parallel.
 
ROOT CAUSE
 
The MA TB had 2 mistakes. First, a thread was spawned from the new() task of the scoreboard to maintain its data structure. Second, this thread's loop contained a delay(1). The delay had been added by the MA engineer as a JUGAAD at some point while debugging the standalone env.
 
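task ma_xyz :: abc()
{
     variable declarations…
     while(1)
     {
        // maintain the scoreboard data structure
        delay(1);   // the JUGAAD: wakes this loop every simulation time unit
     }
}
task new()
{
    // spawned from the constructor, so it runs even if the transactor is never started
    fork
      abc();
    join none
}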
 
Since this thread was spawned from new(), it stayed active even though the start_xactor task was never called, and its delay(1) loop kept slowing the simulation, presumably because the loop woke up every simulation time unit rather than once per clock cycle. Replacing the delay with posedge(clock) solved the issue, and to respect the guidelines the task was moved to a suitable place in the TB.
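To get a feel for why the posedge(clock) change helps, here is a rough back-of-the-envelope model in Python. It is not the original Vera code and not a simulator. It only counts how often a looping thread would wake up if it sleeps 1 time unit per iteration versus one clock period per iteration. The clock period and run length below are made-up numbers for illustration, not values from the project, and the resulting ratio is not meant to explain the ~16X slowdown that was observed.

```python
def count_wakeups(sim_time, step):
    """Count wake-ups of a loop that sleeps `step` time units per iteration."""
    wakeups = 0
    t = 0
    while t < sim_time:
        t += step      # thread sleeps for `step` time units
        wakeups += 1   # then wakes up and runs its loop body once
    return wakeups


# Hypothetical numbers, chosen only for illustration.
CLOCK_PERIOD = 10   # simulation time units per clock cycle (assumed)
SIM_TIME = 1000     # total simulated time units (assumed)

per_time_unit = count_wakeups(SIM_TIME, 1)              # models delay(1)
per_clock_edge = count_wakeups(SIM_TIME, CLOCK_PERIOD)  # models posedge(clock)
ratio = per_time_unit / per_clock_edge

print(per_time_unit, per_clock_edge, ratio)
```

With these assumed numbers the delay(1) style loop wakes up once for every time unit, while the clock-synchronized loop wakes up only once per cycle, so the extra scheduler activity grows with the number of time units per clock.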
 
Lesson learnt: the 'Jugaad' of yesteryear's verification envs doesn’t work so well in modern-day verification environments. Think twice while fixing your verification code, or else the debugging effort on your project will again overshoot the average!
 
I invite you to share your experiences with such goofups! Drop an email to siddhakarana@gmail.com
