Memory clean up to placer #1096

Closed

Conversation

litghost
Collaborator

Description

This PR is 3 patch sets together:

They are combined with a set of patches designed to reduce heap thrashing between the start of rr graph construction and the point where the placer runs.

Related Issue

#1079
#1081
#1084
#1087

Motivation and Context

The changes made after #1085 are all aimed at lowering the number of allocations made inside hot loops, by hoisting each allocation to the root of its loop nest. This prevents allocation patterns that cause heap fragmentation.

How Has This Been Tested?

  • CI is green
  • Nightly QoR is good
  • Weekly QoR is good

Types of changes

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation
  • I have updated the documentation accordingly
  • I have added tests to cover my changes
  • All new and existing tests passed

@probot-autolabeler probot-autolabeler bot added the labels lang-cpp (C/C++ code), libarchfpga (Library for handling FPGA Architecture descriptions), libvtrutil, tests, and VPR (VPR FPGA Placement & Routing Tool) on Jan 30, 2020
@litghost litghost requested a review from kmurray January 30, 2020 21:05
@litghost
Collaborator Author

Initial results from stratixiv_arch.timing.xml/directrf_stratixiv_arch_timing.blif:

baseline:

# Loading Architecture Description took 1.09 seconds (max_rss 72.8 MiB, delta_rss +60.3 MiB)
# Building complex block graph took 10.34 seconds (max_rss 614.2 MiB, delta_rss +541.5 MiB)
# Load circuit took 79.28 seconds (max_rss 5094.1 MiB, delta_rss +4479.8 MiB)
# Clean circuit took 3.36 seconds (max_rss 5094.1 MiB, delta_rss +0.0 MiB)
# Compress circuit took 5.32 seconds (max_rss 5094.1 MiB, delta_rss +0.0 MiB)
# Verify circuit took 3.89 seconds (max_rss 5094.1 MiB, delta_rss +0.0 MiB)
# Build Timing Graph took 21.95 seconds (max_rss 5094.1 MiB, delta_rss +0.0 MiB)
# Load Timing Constraints took 0.07 seconds (max_rss 5094.1 MiB, delta_rss +0.0 MiB)
# Packing took 1452.81 seconds (max_rss 14395.1 MiB, delta_rss +9301.0 MiB)
Finished loading packed FPGA netlist file (took 133.974 seconds).
# Load Packing took 141.15 seconds (max_rss 14395.1 MiB, delta_rss +0.0 MiB)
## Build Device Grid took 2.53 seconds (max_rss 14395.1 MiB, delta_rss +0.0 MiB)
## Build routing resource graph took 332.31 seconds (max_rss 15703.3 MiB, delta_rss +1308.2 MiB)
# Create Device took 336.40 seconds (max_rss 15703.3 MiB, delta_rss +1308.2 MiB)
### Computing router lookahead map took 992.65 seconds (max_rss 15703.3 MiB, delta_rss +0.0 MiB)
### Computing delta delays took 4240.56 seconds (max_rss 29025.0 MiB, delta_rss +13321.8 MiB)
## Computing placement delta delay look-up took 5234.54 seconds (max_rss 29025.0 MiB, delta_rss +13321.8 MiB)
# Placement took 41168.18 seconds (max_rss 31270.2 MiB, delta_rss +15566.9 MiB)
# Routing took 13430.25 seconds (max_rss 34871.0 MiB, delta_rss +571.3 MiB)
Timing analysis took 102.721 seconds (73.9237 STA, 28.797 slack) (188 full updates: 172 setup, 0 hold, 16 combined).
The entire flow of VPR took 56739.72 seconds (max_rss 36061.0 MiB)

litghost:memory_clean_up_to_placer:

# Loading Architecture Description took 1.21 seconds (max_rss 74.2 MiB, delta_rss +61.6 MiB)
# Building complex block graph took 9.94 seconds (max_rss 614.3 MiB, delta_rss +540.1 MiB)
# Load circuit took 81.11 seconds (max_rss 5099.6 MiB, delta_rss +4485.3 MiB)
# Clean circuit took 3.19 seconds (max_rss 5099.6 MiB, delta_rss +0.0 MiB)
# Compress circuit took 5.22 seconds (max_rss 5099.6 MiB, delta_rss +0.0 MiB)
# Verify circuit took 3.27 seconds (max_rss 5099.6 MiB, delta_rss +0.0 MiB)
# Build Timing Graph took 20.42 seconds (max_rss 5099.6 MiB, delta_rss +0.0 MiB)
# Load Timing Constraints took 0.07 seconds (max_rss 5099.6 MiB, delta_rss +0.0 MiB)
# Packing took 1476.49 seconds (max_rss 14398.7 MiB, delta_rss +9299.1 MiB)
Finished loading packed FPGA netlist file (took 139.889 seconds).
# Load Packing took 148.74 seconds (max_rss 14398.7 MiB, delta_rss +0.0 MiB)
## Build Device Grid took 2.54 seconds (max_rss 14398.7 MiB, delta_rss +0.0 MiB)
## Build routing resource graph took 278.05 seconds (max_rss 17705.7 MiB, delta_rss +3307.0 MiB)
# Create Device took 281.66 seconds (max_rss 17705.7 MiB, delta_rss +3307.0 MiB)
### Computing router lookahead map took 867.78 seconds (max_rss 17705.7 MiB, delta_rss +0.0 MiB)
### Computing delta delays took 4129.70 seconds (max_rss 17705.7 MiB, delta_rss +0.0 MiB)
## Computing placement delta delay look-up took 4998.85 seconds (max_rss 17705.7 MiB, delta_rss +0.0 MiB)
# Placement took 41101.84 seconds (max_rss 17705.7 MiB, delta_rss +0.0 MiB)
# Routing took 15465.16 seconds (max_rss 20087.5 MiB, delta_rss +585.2 MiB)
Timing analysis took 101.113 seconds (72.6615 STA, 28.4519 slack) (188 full updates: 172 setup, 0 hold, 16 combined).
The entire flow of VPR took 58693.73 seconds (max_rss 21421.1 MiB)

@litghost
Collaborator Author

Overall, CPU-wise most phases are the same, with a ~15% CPU hit to the router (13430.25 s to 15465.16 s). @kmurray Please review what is here already, as I suspect I can recover most of the CPU time.

Memory wins are ~40% of max_rss.
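
As a sanity check on the two summary figures, here is a minimal sketch that recomputes them from the "Routing took" and "The entire flow of VPR took" lines of the two logs above. The helper name percent_change and the constant names are made up for this example and are not part of VPR.

```cpp
#include <cassert>

// Relative change from `before` to `after`, in percent.
// Positive means an increase, negative means a reduction.
double percent_change(double before, double after) {
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}

// Figures copied from the baseline and branch logs above.
constexpr double kBaselineRoutingSec = 13430.25;
constexpr double kBranchRoutingSec = 15465.16;
constexpr double kBaselineMaxRssMiB = 36061.0;
constexpr double kBranchMaxRssMiB = 21421.1;
```

Calling percent_change(kBaselineRoutingSec, kBranchRoutingSec) gives a router slowdown slightly above 15%, and percent_change(kBaselineMaxRssMiB, kBranchMaxRssMiB) gives a reduction in overall max_rss slightly above 40%, which matches the numbers quoted.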

@litghost litghost force-pushed the memory_clean_up_to_placer branch from 42b115d to 05bf83a Compare January 30, 2020 23:10
This was referenced Feb 3, 2020
@litghost litghost force-pushed the memory_clean_up_to_placer branch 3 times, most recently from 9199d13 to bfc5e11 Compare February 5, 2020 19:54
This should have a negligible performance impact, but it enables future
changes to modify how rr nodes and rr edges are stored.

Signed-off-by: Keith Rothman <[email protected]>
This changes edge storage from a separately allocated array of structs
per node to a struct of arrays holding all edge data.

Several algorithms that were previously written as per-node, per-edge
loops, but were really just iteration over all edges, are now part of
rr_node_storage.

Signed-off-by: Keith Rothman <[email protected]>
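
For readers unfamiliar with the layout change described in the commit message above, here is a minimal sketch of struct-of-arrays edge storage. It is an illustration only: the class EdgeStorage, its members, and its methods are hypothetical and are not the actual rr_node_storage interface. It assumes edges are added sorted by source node.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical struct-of-arrays edge store: three parallel arrays hold all
// edges of the graph, instead of one heap-allocated edge array per node.
class EdgeStorage {
  public:
    explicit EdgeStorage(std::size_t num_nodes) : first_edge_(num_nodes + 1, 0) {}

    // Append an edge. In this sketch edges must arrive sorted by source node.
    void add_edge(std::uint32_t src, std::uint32_t sink, std::uint16_t sw) {
        assert(src + 1 < first_edge_.size());
        assert(edge_src_.empty() || edge_src_.back() <= src);
        edge_src_.push_back(src);
        edge_sink_.push_back(sink);
        edge_switch_.push_back(sw);
    }

    // Build per-node [first, last) edge ranges with a counting pass and a
    // prefix sum. Call once after all edges have been added.
    void finalize() {
        std::fill(first_edge_.begin(), first_edge_.end(), 0);
        for (std::uint32_t src : edge_src_) ++first_edge_[src + 1];
        for (std::size_t i = 1; i < first_edge_.size(); ++i) {
            first_edge_[i] += first_edge_[i - 1];
        }
    }

    std::size_t num_edges(std::uint32_t node) const {
        return first_edge_[node + 1] - first_edge_[node];
    }

    std::uint32_t edge_sink(std::uint32_t node, std::size_t i) const {
        return edge_sink_[first_edge_[node] + i];
    }

    // An edge-wide pass: one linear scan, with no per-node outer loop.
    std::size_t count_edges_with_switch(std::uint16_t sw) const {
        return static_cast<std::size_t>(
            std::count(edge_switch_.begin(), edge_switch_.end(), sw));
    }

  private:
    std::vector<std::size_t> first_edge_;     // size num_nodes + 1
    std::vector<std::uint32_t> edge_src_;     // source node of each edge
    std::vector<std::uint32_t> edge_sink_;    // sink node of each edge
    std::vector<std::uint16_t> edge_switch_;  // switch id of each edge
};
```

With this layout the number of heap allocations no longer scales with the node count, and a pass such as count_edges_with_switch touches only the one array it needs.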
This enables 16-byte alignment (4 nodes per cache line).

Signed-off-by: Keith Rothman <[email protected]>
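
The "4 nodes per cache line" figure follows from a 16-byte node and a 64-byte cache line. A minimal sketch of how such a constraint can be checked at compile time is shown below. The struct PackedNode and its fields are hypothetical and are not the real rr node layout, and the 64-byte cache line is an assumption that holds on typical x86-64 parts.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical 16-byte node record. The field names are illustrative only.
struct alignas(16) PackedNode {
    std::int16_t xlow, ylow, xhigh, yhigh;  // 8 bytes: bounding coordinates
    std::int16_t ptc_num;                   // 2 bytes
    std::int16_t cost_index;                // 2 bytes
    std::uint16_t rc_index;                 // 2 bytes
    std::uint8_t type;                      // 1 byte
    std::uint8_t direction;                 // 1 byte
};

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLineBytes = 64;
constexpr std::size_t kNodesPerCacheLine = kCacheLineBytes / sizeof(PackedNode);

static_assert(sizeof(PackedNode) == 16, "node must be exactly 16 bytes");
static_assert(alignof(PackedNode) == 16, "node must be 16-byte aligned");
static_assert(kNodesPerCacheLine == 4, "expected 4 nodes per cache line");
```

Because the size evenly divides the cache line and the alignment matches the size, no node in a contiguous array straddles two cache lines.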
Signed-off-by: Keith Rothman <[email protected]>
Signed-off-by: Keith Rothman <[email protected]>
Signed-off-by: Keith Rothman <[email protected]>
Signed-off-by: Keith Rothman <[email protected]>
Actually owning the string data is unneeded, as all of the data outlasts
the formula data structure.

Signed-off-by: Keith Rothman <[email protected]>
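
As an illustration of the idea in the commit message above, here is a minimal sketch of a lookup table that refers to string data without owning it. It uses std::string_view from C++17 purely for illustration; the actual patch may use a different non-owning type. The names FormulaVar and lookup are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical non-owning variable entry: `name` points at characters owned
// elsewhere, so building the table copies no string data. This is only safe
// because the owner outlives every FormulaVar that refers to it.
struct FormulaVar {
    std::string_view name;
    int value;
};

// Return the value bound to `name`, or `fallback` if it is not present.
int lookup(const std::vector<FormulaVar>& vars, std::string_view name, int fallback) {
    assert(!name.empty());
    for (const FormulaVar& v : vars) {
        if (v.name == name) return v.value;
    }
    return fallback;
}
```

The trade-off is that lifetime becomes the caller's responsibility: if the owning strings were destroyed first, the views would dangle.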
Signed-off-by: Keith Rothman <[email protected]>
The refactored functions:
 - get_rr_node_indices
 - label_wire_muxes
 - label_incoming_wires
all are invoked inside of nested loops, resulting in a large (~10M+)
number of repeated allocations and deallocations.  The heap may be able
to avoid fragmentation, but small code changes could result in fragmentation
where none existed before. The change is to hoist the storage container
(std::vector) outside of the loops so that once a high watermark is
reached, no further allocations will result.

Signed-off-by: Keith Rothman <[email protected]>
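
The hoisting pattern described in the commit message above can be sketched as follows. This is a simplified illustration, not the real get_rr_node_indices signature: the function names, arguments, and the values produced are all made up.

```cpp
#include <cassert>
#include <vector>

// Hypothetical callee: fills a caller-provided scratch vector instead of
// returning a fresh one. clear() drops the elements but keeps the capacity,
// so once the vector has grown to its high watermark, later calls do not
// touch the heap.
void collect_indices(int x, int y, std::vector<int>* out) {
    assert(out != nullptr);
    out->clear();
    const int count = (x + y) % 8 + 1;  // stand-in for a data-dependent size
    for (int i = 0; i < count; ++i) {
        out->push_back(x * 1000 + y * 10 + i);
    }
}

// Caller: the scratch vector is declared once, outside the nested loops.
long sum_over_grid(int width, int height) {
    std::vector<int> scratch;  // hoisted storage, reused by every iteration
    long total = 0;
    for (int x = 0; x < width; ++x) {
        for (int y = 0; y < height; ++y) {
            collect_indices(x, y, &scratch);
            for (int v : scratch) total += v;
        }
    }
    return total;
}
```

Had collect_indices returned a std::vector<int> by value, each of the width * height iterations would have paid for at least one allocation and one deallocation.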
Signed-off-by: Keith Rothman <[email protected]>
Signed-off-by: Keith Rothman <[email protected]>
Signed-off-by: Keith Rothman <[email protected]>
@litghost litghost force-pushed the memory_clean_up_to_placer branch from bfc5e11 to 9aa4137 Compare February 5, 2020 20:54
@litghost litghost closed this Mar 6, 2020
@litghost litghost deleted the memory_clean_up_to_placer branch March 6, 2020 14:45