Parallel Computing and Simulation of Nanometer VLSI Systems Based on Multi/Many-Core and GPU Platforms

Principal Investigator

Dr. Sheldon Tan (PI)

 

Collaborators

  • Dr. Yici Cai, Tsinghua University, China
  • Dr. Hao Yu, NTU, Singapore
  • Dr. Lifeng Wu, ProPlus Design Solutions Inc.

Graduate Students

Current graduate students

  • Kai He
  • Hengyang Zhao

Graduate Students (graduated)

  • Joseph Gordon (M.S.)
  • Xue-Xin Liu (Ph.D)

Funding support

We thank the National Science Foundation for its support of this research work. Any opinions, findings, and conclusions or recommendations expressed in the materials below are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

  • National Science Foundation, “CAREER: Career Development Plan: Behavioral Modeling, Simulation and Optimization for Mixed-Signal System in a Chip”, (CCF-0448534, a number of REU supplements), 6/1/2005-5/31/2011. PI: Sheldon Tan.
  • National Science Foundation, “US-Singapore Planning Visit: Collaborative Research on Design and Verification of 60GHz RF/MM Integrated Circuits”, (OISE-1051797), 4/1/2011-3/30/2013, PI: Sheldon Tan, co-PI: Hao Yu.
  • National Science Foundation, “SHF: Small: GPU-Based Many-Core Parallel Simulation of Interconnect and High-Frequency Circuits”, (CCF-1017090), 9/1/2010-8/30/2013, PI: Sheldon Tan

 

Project Descriptions

Background

Modern computer architecture has shifted towards designs that employ multiple processor cores on a chip, so-called multicore processors or chip multiprocessors (CMPs). The leap from single-core to multi-core technology has permanently altered the course of computing. CMPs enable increased productivity, powerful energy-efficient performance, and breakthroughs in parallel computing capability and scalability. We expect a continuing trend of increasing the number of cores on a die to maximize the performance/power efficiency of a single chip. The graphics processing unit (GPU) is one of the most powerful many-core computing systems in use. For instance, the Nvidia GeForce GTX 280 chip has a peak performance of over 900 GFLOPS, versus 12.8 GFLOPS for a 3.2 GHz Pentium IV CPU. In addition to the primary use of GPUs in accelerating graphics rendering operations, there has been considerable interest in exploiting GPUs for general-purpose computation (GPGPU). The introduction of parallel programming interfaces for general-purpose computation, such as the Compute Unified Device Architecture (CUDA), Stream SDK and OpenCL, has made GPUs a powerful and attractive choice for developing high-performance numerical and scientific computations and for solving practical engineering problems.

However, programming GPUs remains challenging. Many modern GPUs have a complex memory organization, with multiple low-latency on-chip memories in addition to the off-chip memory. The access latencies and optimal access patterns of these memories vary significantly, which makes it difficult to develop techniques that optimally use the various memories to tolerate latency and improve memory throughput. The memory hierarchy, together with the highly parallel execution model, makes application optimization difficult. The challenges increase many-fold when the application to be optimized and parallelized is a memory-intensive operation such as sparse matrix-vector multiplication (SpMV), which is the critical kernel for most analysis and simulation tasks in VLSI chip design.
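
Since SpMV dominates so many of these simulation kernels, the following is a minimal, self-contained sketch of a CSR-format SpMV on the GPU with one thread per row; the thread-to-row mapping, the small 3x3 test matrix and all identifiers are illustrative examples, not code from this project.

    // Minimal CSR sparse matrix-vector product y = A*x on the GPU (one thread per row).
    // Illustrative sketch only: matrix, sizes and names are made up for this example.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void spmv_csr(int n_rows, const int *row_ptr, const int *col_idx,
                             const double *vals, const double *x, double *y) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n_rows) {
            double sum = 0.0;
            // Irregular, latency-bound reads of x are what makes SpMV memory-intensive.
            for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
                sum += vals[j] * x[col_idx[j]];
            y[row] = sum;
        }
    }

    int main() {
        // 3x3 tridiagonal test matrix [[4,-1,0],[-1,4,-1],[0,-1,4]] stored in CSR form.
        int    h_row_ptr[] = {0, 2, 5, 7};
        int    h_col_idx[] = {0, 1, 0, 1, 2, 1, 2};
        double h_vals[]    = {4, -1, -1, 4, -1, -1, 4};
        double h_x[] = {1, 2, 3}, h_y[3];

        int *d_row_ptr, *d_col_idx;
        double *d_vals, *d_x, *d_y;
        cudaMalloc(&d_row_ptr, sizeof(h_row_ptr));
        cudaMalloc(&d_col_idx, sizeof(h_col_idx));
        cudaMalloc(&d_vals, sizeof(h_vals));
        cudaMalloc(&d_x, sizeof(h_x));
        cudaMalloc(&d_y, sizeof(h_y));
        cudaMemcpy(d_row_ptr, h_row_ptr, sizeof(h_row_ptr), cudaMemcpyHostToDevice);
        cudaMemcpy(d_col_idx, h_col_idx, sizeof(h_col_idx), cudaMemcpyHostToDevice);
        cudaMemcpy(d_vals, h_vals, sizeof(h_vals), cudaMemcpyHostToDevice);
        cudaMemcpy(d_x, h_x, sizeof(h_x), cudaMemcpyHostToDevice);

        spmv_csr<<<1, 32>>>(3, d_row_ptr, d_col_idx, d_vals, d_x, d_y);
        cudaMemcpy(h_y, d_y, sizeof(h_y), cudaMemcpyDeviceToHost);
        printf("y = [%g %g %g]\n", h_y[0], h_y[1], h_y[2]);  // expected: [2 4 10]
        return 0;
    }

In practice, performance depends heavily on how rows are mapped to threads or warps and on coalescing the reads of col_idx and vals, which is exactly the memory-hierarchy tuning discussed above.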

Simulation of VLSI chips at the circuit level remains a difficult task, as more than half of the design time is spent on verification. This percentage keeps growing as technologies scale down, due to the exploding growth of extracted parasitics and increasing operating frequencies in high-performance microprocessors and radio-frequency (RF) and microwave/millimeter-wave (MM) integrated circuits (ICs). The staggering amount of design data generated during the chip design process poses tremendous design and verification challenges.

In this project, we focus on the parallelization of simulation and analysis algorithms for VLSI systems such as global interconnects and high-frequency RF/MM integrated circuits. VLSI simulation tasks are typically memory-intensive operations, as they need to analyze and transform huge amounts of design data. As GPUs become viable, commodity high-performance computing platforms, accelerating those critical simulation tasks on GPUs will bring tremendous and immediate benefits to the design community.

To further promote the research, the PI also proposed and taught a new graduate-level course, EE-CS 217: GPU Architecture and Parallel Programming (http://www.ee.ucr.edu/~stan/courses/eecs217/eecs217_home.htm). The PI taught the course in Winter 2011 and Winter 2012 at UCR, and it was well received by UCR graduate students from both the EE and CS departments. In addition, the PI established a UCR CUDA Teaching Center (2010-2011), sponsored by Nvidia Corporation, CA, to advance the state of parallel education using CUDA C/C++. The CUDA Teaching Center comes with equipment donations, funding support for course development, course material assistance and software licenses from Nvidia. Nvidia also provided partial TA support for the EE-CS 217 course as part of the CUDA Teaching Center program. The PI used the donation to build a CUDA teaching lab at UCR so that students can perform the labs for the EE-CS 217 course. See http://www.marketwire.com/press-release/NVIDIA-Names-20-New-CUDA-Research-and-Training-Centers-in-Seven-Nations-NASDAQ-NVDA-1372887.htm

The PI’s group (together with our collaborators) has developed several parallel simulation techniques based on multi-core and GPU platforms:

GPU friendly fast analysis for structured on-chip power grid networks

Power integrity verification of on-chip power grid networks is critical for design sign-off due to aggressive scaling of supply voltages and shrinking design margins. We proposed a novel simulation algorithm for large-scale structured power grid networks. The new method formulates the traditional linear system as a special two-dimensional Poisson equation and solves it using analytical expressions based on the FFT. The computational complexity of the new algorithm is O(N log N), which is much lower than the O(N^1.5) complexity of traditional sparse solvers such as SuperLU and PCG. Also, due to the special formulation, the graphics processing unit (GPU) can be exploited to further speed up the algorithm. Initial results show that the new algorithm is stable and can achieve a 100X speedup on GPU over the widely used SuperLU solver with a very small memory footprint. [C1]
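
The heart of such an FFT-based Poisson solve parallelizes trivially: once the right-hand side is transformed, every spectral coefficient is simply divided by the corresponding eigenvalue of the discrete Laplacian. The kernel below is a minimal sketch of that scaling step only, assuming a 5-point Laplacian with Dirichlet boundaries; the forward/inverse (sine) transforms and the actual power grid formulation of [C1] are omitted, and all names and indexing conventions are illustrative.

    // Illustrative sketch: divide the transformed RHS by the eigenvalues of the
    // 2D 5-point Laplacian (Dirichlet boundaries). One thread per spectral mode.
    // The forward/inverse sine transforms (e.g., built on cuFFT) are not shown.
    #include <cuda_runtime.h>

    __global__ void scale_by_eigenvalues(double *u_hat, int nx, int ny,
                                         double hx, double hy) {
        const double PI = 3.141592653589793;
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // x mode index, 0..nx-1
        int j = blockIdx.y * blockDim.y + threadIdx.y;   // y mode index, 0..ny-1
        if (i < nx && j < ny) {
            double sx = sin(0.5 * PI * (i + 1) / (nx + 1));
            double sy = sin(0.5 * PI * (j + 1) / (ny + 1));
            double lambda = 4.0 * (sx * sx / (hx * hx) + sy * sy / (hy * hy));
            u_hat[j * nx + i] /= lambda;                 // one divide per mode: O(N) work
        }
    }

With the transforms costing O(N log N) and this scaling step only O(N), the overall cost matches the O(N log N) complexity quoted above.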

Parallel shooting-based method for RF circuits based on GPU platforms

Recent multi/many-core CPUs and GPUs provide an ideal parallel computing platform to accelerate the time-consuming analysis of radio-frequency/millimeter-wave (RF/MM) integrated circuits (ICs). In this period, we developed a structured shooting algorithm that fully exploits the parallelism in periodic steady-state (PSS) analysis of RF and MM ICs. Utilizing the periodic structure of the state matrix in RF/MM-IC simulation, a cyclic-block-structured shooting-Newton method has been parallelized and mapped onto recent GPU platforms. We developed the formulation of this parallel cyclic-block-structured shooting-Newton algorithm, called the “periodic Arnoldi shooting” method, together with its parallel implementation on GPU. Results from several industrial examples show that the structured parallel shooting-Newton method on a Tesla GPU achieves speedups of more than 20X over the state-of-the-art implicit GMRES method on the CPU at the same accuracy. The initial results were published in [C2].
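
For reference, the generic shooting-Newton formulation behind PSS analysis (the cyclic-block structure exploited in [C2] is a refinement of this and is not shown here) seeks an initial state whose trajectory repeats after one period T and applies Newton's method to that boundary condition:

    F(x_0) = \phi(x_0, T) - x_0 = 0, \qquad
    \left( \frac{\partial \phi(x_0^{(k)}, T)}{\partial x_0} - I \right) \Delta x_0^{(k)} = -F\big(x_0^{(k)}\big), \qquad
    x_0^{(k+1)} = x_0^{(k)} + \Delta x_0^{(k)},

where \phi(x_0, T) is the state transition over one period; its Jacobian \partial\phi/\partial x_0 (the monodromy-type sensitivity matrix) is the large structured operator whose Krylov-subspace solves are mapped onto the GPU.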

Parallel symbolic analysis techniques for analog and mixed-signal circuits

Graph-based symbolic analysis is a viable tool for calculating the behavior or the characteristics of an analog circuit in terms of symbolic parameters. The introduction of the determinant decision diagram (DDD) based symbolic analysis technique (by the PI and his PhD advisor) allows exact symbolic analysis of much larger analog circuits than all other existing approaches. In this period, the PI’s group proposed a new parallel analysis method for large analog circuits using the DDD-based graph technique. Once a circuit’s small-signal characteristics are represented by DDDs, evaluating the DDDs yields exact numerical values. We developed efficient parallel DDD evaluation techniques on general-purpose GPU (GPGPU) computing platforms to exploit the parallelism of DDD structures, proposed two parallelization algorithms, and compared their performance. Initial results show that the new evaluation algorithms achieve about one to two orders of magnitude speedup over serial CPU-based evaluation on some large analog circuits [C3, B1]. A statistical variational analysis of analog circuits based on the DDD structure has also been proposed recently [C5].

GPU-accelerated envelope-following method for switching power converter simulation

We proposed a new envelope-following parallel transient analysis method for general switching power converters. The method first exploits the parallelism in the envelope-following algorithm and parallelizes the Newton update solve, the most computationally expensive part, on GPU platforms to boost simulation performance. To further speed up the iterative GMRES solve of the Newton update equation, we apply the matrix-free Krylov basis generation technique previously used for RF simulation. Finally, the method applies the more robust Gear-2 integration to compute the sensitivity matrix instead of traditional integration methods. Experimental results from several integrated on-chip power converters show that the proposed GPU envelope-following algorithm is about 10X faster than its CPU counterpart and 100X faster than traditional envelope-following methods, while maintaining similar accuracy [C4].
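
The matrix-free Krylov idea mentioned above can be summarized by the standard finite-difference approximation of a Jacobian-vector product (shown here as a generic illustration, not the exact scheme of [C4]):

    J(x)\, v \;\approx\; \frac{F(x + \epsilon v) - F(x)}{\epsilon},

so each GMRES iteration needs only one extra evaluation of the circuit equations F instead of an explicitly assembled sensitivity matrix, which keeps the GPU-parallelized Newton update solve cheap in both memory and computation.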

A GPU-accelerated generalized minimal residual (GMRES) iterative solver has been developed (see software download below) and has been used for thermal analysis with advanced cooling techniques [C6].

 

Invited Presentations by Dr. Sheldon Tan

  • College of Engineering Seminar (CUDA Talk), UC Riverside, CA “Parallel computing method for on-chip power grid analysis based on the multi-core computing platforms”, June 2, 2009.
  • International Workshop on Emerging Circuits and Systems (IWECS’11), Hangzhou, Zhejiang Province, China, “Graph-based Parallel and Statistical Analysis of Large Analog Circuits Based on GPU Platforms”, August 4, 2011.
  • The University of Hong Kong, Department of Electrical and Electronic Engineering, Hong Kong, China, “Graph-based Parallel and Statistical Analysis of Analog Circuits Based on GPU Platforms”, Hong Kong, Aug. 23, 2011.
  • Shanghai Jiaotong University, School of Microelectronics, Shanghai, China, “Graph-based Parallel and Statistical Analysis of Analog Circuits on GPU Platforms”, April 26, 2012
  • International Workshop on Emerging Circuits and Systems (IWECS’12), Shanghai Jiao Tong University, Shanghai, China, “Parallel Computing and Simulation for VLSI systems”, Aug. 9, 2012.
  • INAOE (National Institute of Astrophysics, Optics and Electronics), Department of Electrical Engineering, Puebla, Mexico, “Fast GPU-accelerated sparse matrix-vector multiplication (SpMV)”, May 3, 2013.

Software Download

  • GLU Solver --- GPU-enabled parallel LU factorization solver package

Relevant Publications

Book or book chapters

  • B1. Sheldon X.-D. Tan, Xue-Xin Liu, Eric Mlinar, and Esteban Tlelo-Cuautle, “Parallel symbolic analysis of large analog circuits on GPU platforms”, Chapter 6 in "VLSI Design", Esteban Tlelo-Cuautle and Sheldon X.-D. Tan (Editors), INTECH (www.intechweb.org), ISBN 978-953-307-884-7, January 2012.
  • B3. X.-X. Liu, S. X.-D. Tan, H. Wang, and H. Yu, “GPU-accelerated envelope-following method”, Chapter 17 in “Designing Scientific Applications on GPU”, Raphael Couturier (Editor), CRC Press /Taylor & Francis Group, Nov. 2013. ISBN 9781466571648
  • B4. Guoyong Shi, Sheldon X.-D. Tan, Esteban Tlelo-Cuautle, “Advanced Symbolic Analysis for VLSI Systems -- Methods and Applications”, Springer Publisher, 2014, ISBN 978-1-4939-1103-5

 

Journal publications

  • J1. X. Liu, S. X.-D. Tan, H. Yu, “A GPU-accelerated parallel shooting algorithm for analysis of radio frequency and microwave integrated circuits”, IEEE Transactions on Very Large Scale Integrated Systems (TVLSI), vol. 23, no. 3, March 2015. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6777551
  • J2. X. Liu, K. Zhai, Z. Liu, K. He, S. X.-D. Tan, and W. Yu, “Parallel thermal analysis of 3D integrated circuits with liquid cooling on CPU-GPU platforms”, IEEE Transactions on Very Large Scale Integrated Systems (TVLSI), vol. 23, no. 3, pp. 575-579, March 2015.
  • J3. K. He, S. X.-D. Tan, H. Wang and G. Shi, “GPU-accelerated parallel sparse LU factorization method for fast circuit analysis”, IEEE Transactions on Very Large Scale Integrated Systems (TVLSI), (in press).

Conference publications

  • C1. J. Shi, Y. Cai, W. Hou, L. Ma, S. X.-D. Tan, P.-H. Ho and X. Wang, “GPU friendly fast Poisson solver for structured power grid network analysis”, Proc. IEEE/ACM Design Automation Conference (DAC’09), pp.178--183, San Francisco, CA, 2009. (Best Paper Award Nomination (7 out of 682 submissions, 1%))
  • C2. X. Liu, H. Yu, J. Relles, S. X.-D. Tan, “A structured parallel periodic Arnoldi shooting algorithm for RF-PSS analysis based on GPU platforms”, Proc. Asia South Pacific Design Automation Conference (ASP-DAC’11), pp. 13-18, Yokohama, Japan, Jan. 2011.
  • C3. J. Lu, Z. Hao, S. X.-D. Tan, “Graph-based parallel analysis of large analog circuits based on GPU platforms”, ACM International Workshop on Timing Issues in the Specification and Synthesis of Digital Systems (TAU Workshop), April 2011.
  • C4. X. Liu, S. X.-D. Tan, H. Wang and H. Yu, “A GPU-accelerated envelope-following method for switching power converter simulation”, Proc. Design, Automation and Test in Europe (DATE'12), Dresden, Germany, March 2012.
  • C5. X. Liu, S. X.-D. Tan, and H. Wang, “Parallel statistical analysis of analog Circuits by GPU-accelerated Graph-based Approach”, Proc. Design, Automation and Test in Europe (DATE'12), Dresden, Germany, March 2012.
  • C6. X. Liu, Z. Liu, S. X.-D. Tan, J. Gordon, “Full-chip thermal analysis of 3D ICs with liquid cooling by GPU-accelerated GMRES method”, Proc. Int. Symposium on Quality Electronic Design (ISQED’12), San Jose, CA, March 2012.
  • C7. X. Liu, S. X.-D. Tan, Z. Liu, H. Wang, T. Xu, “Transient analysis of large linear dynamic networks on hybrid GPU-multicore platforms”, 10th IEEE International NEWCAS Conference, Montreal, Canada, pp. 173-176, June, 2012.
  • C8. X. Liu, H. Wang, and S. X.-D. Tan, “Parallel power grid analysis using preconditioned GMRES solvers on CPU-GPU platforms”, Proc. IEEE/ACM International Conf. on Computer-Aided Design (ICCAD’13), pp.561-568, San Jose, CA, Nov. 2013.
  • C9. K. He, S. X.-D. Tan, E. Tlelo-Cuautle, H. Wang and H. Tang, “A new segmentation-based GPU-accelerated sparse matrix-vector multiplication”, Proc. Int. Midwest Symposium on Circuits and Systems (MWSCAS’14), College Station, TX, August, 2014.
  • C10. Y. Zhu, S. X.-D. Tan, “GPU-accelerated parallel Monte Carlo analysis of analog circuits by hierarchical graph-based solver”, Proc. Asia South Pacific Design Automation Conference (ASP-DAC’15), Chiba, Japan, Jan. 2015.