Memory clean up to placer #1096

litghost · 2020-01-30T20:44:05Z

Description

This PR is 3 patch sets together:

Move call location of ClockRRGraphBuilder and use alloc_and_load_edges. #1081
Proxy rr node #1084
Initial refactoring of edge storage #1085

These are combined with a set of patches designed to reduce heap thrashing between the start of rr graph construction and the point where the placer runs.

Related Issue

#1079
#1081
#1084
#1087

Motivation and Context

The changes made after #1085 are all aimed at lowering the number of allocations made inside hot loops, by hoisting each allocation to the root of its loop nest. This prevents allocation patterns that cause heap fragmentation.

How Has This Been Tested?

CI is green
Nightly QoR is good
Weekly QoR is good

Types of changes

Bug fix (change which fixes an issue)
New feature (change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My change requires a change to the documentation
I have updated the documentation accordingly
I have added tests to cover my changes
All new and existing tests passed

litghost · 2020-01-30T21:10:47Z

Initial results from stratixiv_arch.timing.xml/directrf_stratixiv_arch_timing.blif:

baseline:

# Loading Architecture Description took 1.09 seconds (max_rss 72.8 MiB, delta_rss +60.3 MiB)
# Building complex block graph took 10.34 seconds (max_rss 614.2 MiB, delta_rss +541.5 MiB)
# Load circuit took 79.28 seconds (max_rss 5094.1 MiB, delta_rss +4479.8 MiB)
# Clean circuit took 3.36 seconds (max_rss 5094.1 MiB, delta_rss +0.0 MiB)
# Compress circuit took 5.32 seconds (max_rss 5094.1 MiB, delta_rss +0.0 MiB)
# Verify circuit took 3.89 seconds (max_rss 5094.1 MiB, delta_rss +0.0 MiB)
# Build Timing Graph took 21.95 seconds (max_rss 5094.1 MiB, delta_rss +0.0 MiB)
# Load Timing Constraints took 0.07 seconds (max_rss 5094.1 MiB, delta_rss +0.0 MiB)
# Packing took 1452.81 seconds (max_rss 14395.1 MiB, delta_rss +9301.0 MiB)
Finished loading packed FPGA netlist file (took 133.974 seconds).
# Load Packing took 141.15 seconds (max_rss 14395.1 MiB, delta_rss +0.0 MiB)
## Build Device Grid took 2.53 seconds (max_rss 14395.1 MiB, delta_rss +0.0 MiB)
## Build routing resource graph took 332.31 seconds (max_rss 15703.3 MiB, delta_rss +1308.2 MiB)
# Create Device took 336.40 seconds (max_rss 15703.3 MiB, delta_rss +1308.2 MiB)
### Computing router lookahead map took 992.65 seconds (max_rss 15703.3 MiB, delta_rss +0.0 MiB)
### Computing delta delays took 4240.56 seconds (max_rss 29025.0 MiB, delta_rss +13321.8 MiB)
## Computing placement delta delay look-up took 5234.54 seconds (max_rss 29025.0 MiB, delta_rss +13321.8 MiB)
# Placement took 41168.18 seconds (max_rss 31270.2 MiB, delta_rss +15566.9 MiB)
# Routing took 13430.25 seconds (max_rss 34871.0 MiB, delta_rss +571.3 MiB)
Timing analysis took 102.721 seconds (73.9237 STA, 28.797 slack) (188 full updates: 172 setup, 0 hold, 16 combined).
The entire flow of VPR took 56739.72 seconds (max_rss 36061.0 MiB)

litghost:memory_clean_up_to_placer:

# Loading Architecture Description took 1.21 seconds (max_rss 74.2 MiB, delta_rss +61.6 MiB)
# Building complex block graph took 9.94 seconds (max_rss 614.3 MiB, delta_rss +540.1 MiB)
# Load circuit took 81.11 seconds (max_rss 5099.6 MiB, delta_rss +4485.3 MiB)
# Clean circuit took 3.19 seconds (max_rss 5099.6 MiB, delta_rss +0.0 MiB)
# Compress circuit took 5.22 seconds (max_rss 5099.6 MiB, delta_rss +0.0 MiB)
# Verify circuit took 3.27 seconds (max_rss 5099.6 MiB, delta_rss +0.0 MiB)
# Build Timing Graph took 20.42 seconds (max_rss 5099.6 MiB, delta_rss +0.0 MiB)
# Load Timing Constraints took 0.07 seconds (max_rss 5099.6 MiB, delta_rss +0.0 MiB)
# Packing took 1476.49 seconds (max_rss 14398.7 MiB, delta_rss +9299.1 MiB)
Finished loading packed FPGA netlist file (took 139.889 seconds).
# Load Packing took 148.74 seconds (max_rss 14398.7 MiB, delta_rss +0.0 MiB)
## Build Device Grid took 2.54 seconds (max_rss 14398.7 MiB, delta_rss +0.0 MiB)
## Build routing resource graph took 278.05 seconds (max_rss 17705.7 MiB, delta_rss +3307.0 MiB)
# Create Device took 281.66 seconds (max_rss 17705.7 MiB, delta_rss +3307.0 MiB)
### Computing router lookahead map took 867.78 seconds (max_rss 17705.7 MiB, delta_rss +0.0 MiB)
### Computing delta delays took 4129.70 seconds (max_rss 17705.7 MiB, delta_rss +0.0 MiB)
## Computing placement delta delay look-up took 4998.85 seconds (max_rss 17705.7 MiB, delta_rss +0.0 MiB)
# Placement took 41101.84 seconds (max_rss 17705.7 MiB, delta_rss +0.0 MiB)
# Routing took 15465.16 seconds (max_rss 20087.5 MiB, delta_rss +585.2 MiB)
Timing analysis took 101.113 seconds (72.6615 STA, 28.4519 slack) (188 full updates: 172 setup, 0 hold, 16 combined).
The entire flow of VPR took 58693.73 seconds (max_rss 21421.1 MiB)

litghost · 2020-01-30T21:14:41Z

Overall, most phases are the same CPU-wise, with a 15% CPU hit to the router. @kmurray Please review what is here already, as I suspect I can recover most of the CPU time.

Memory wins are ~40% of max_rss.
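
For reference, the two headline figures can be re-derived from the logs above. The sketch below is only a worked check: the helper name `percent_change` and the constant names are made up for illustration, and the numbers are copied from the "entire flow" and "Routing" lines of the two runs.

```cpp
#include <cassert>

// Relative change from `before` to `after`, in percent.
// Negative means a reduction, positive means an increase.
double percent_change(double before, double after) {
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}

// Figures copied from the two logs in this comment thread.
constexpr double kBaselineMaxRssMiB = 36061.0;   // baseline, entire flow
constexpr double kBranchMaxRssMiB = 21421.1;     // memory_clean_up_to_placer, entire flow
constexpr double kBaselineRouteSec = 13430.25;   // baseline, routing
constexpr double kBranchRouteSec = 15465.16;     // memory_clean_up_to_placer, routing
```

With these inputs, `percent_change(kBaselineMaxRssMiB, kBranchMaxRssMiB)` comes out at roughly -40.6 and `percent_change(kBaselineRouteSec, kBranchRouteSec)` at roughly +15.2, which matches the ~40% and 15% quoted here.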

This changes edge storage from an allocated array of structs per node to a struct of arrays holding all edge data. Several algorithms that were previously written as per-node, per-edge loops, but were really just iterations over all edges, are now part of rr_node_storage. Signed-off-by: Keith Rothman <[email protected]>
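
As a rough illustration of the layout change described in this commit message, here is a minimal struct-of-arrays sketch. It is not the PR's code: `EdgeStorage`, its members, and the counting-sort partition step are all hypothetical, chosen only to show one flat array per edge field with each node owning a contiguous range of edges.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical struct-of-arrays edge store: one contiguous array per edge
// field, shared by all nodes, instead of one heap-allocated edge array per node.
class EdgeStorage {
  public:
    // Record an edge. Edges may arrive in any order before partition().
    void add_edge(std::uint32_t src, std::uint32_t dst, std::uint16_t sw) {
        assert(!partitioned_);
        src_.push_back(src);
        dst_.push_back(dst);
        switch_.push_back(sw);
    }

    // Group edges by source node with a stable counting sort, so that node n
    // owns the edge range [first_edge_[n], first_edge_[n + 1]).
    void partition(std::size_t num_nodes) {
        std::vector<std::size_t> first(num_nodes + 1, 0);
        for (std::uint32_t s : src_) {
            assert(s < num_nodes);
            ++first[s + 1];
        }
        for (std::size_t n = 0; n < num_nodes; ++n) first[n + 1] += first[n];

        std::vector<std::size_t> cursor(first.begin(), first.end() - 1);
        std::vector<std::uint32_t> src(src_.size()), dst(dst_.size());
        std::vector<std::uint16_t> sw(switch_.size());
        for (std::size_t e = 0; e < src_.size(); ++e) {
            std::size_t slot = cursor[src_[e]]++;
            src[slot] = src_[e];
            dst[slot] = dst_[e];
            sw[slot] = switch_[e];
        }
        src_.swap(src);
        dst_.swap(dst);
        switch_.swap(sw);
        first_edge_.swap(first);
        partitioned_ = true;
    }

    std::size_t num_edges(std::size_t node) const {
        assert(partitioned_ && node + 1 < first_edge_.size());
        return first_edge_[node + 1] - first_edge_[node];
    }

    std::uint32_t edge_sink(std::size_t node, std::size_t i) const {
        assert(i < num_edges(node));
        return dst_[first_edge_[node] + i];
    }

    // A whole-graph pass becomes a flat loop over one array,
    // with no per-node indirection.
    std::size_t count_edges_using_switch(std::uint16_t sw) const {
        std::size_t count = 0;
        for (std::uint16_t s : switch_) count += (s == sw) ? 1 : 0;
        return count;
    }

  private:
    std::vector<std::uint32_t> src_, dst_;
    std::vector<std::uint16_t> switch_;
    std::vector<std::size_t> first_edge_;
    bool partitioned_ = false;
};
```

In this sketch the number of heap blocks is fixed by the number of fields rather than growing with the number of nodes, which is the property the commit message is after.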

The refactored functions:
- get_rr_node_indices
- label_wire_muxes
- label_incoming_wires

are all invoked inside nested loops, resulting in a large number (~10M+) of repeated allocations and deallocations. The heap may be able to avoid fragmentation, but small code changes could introduce fragmentation where none existed before. The change is to hoist the storage container (std::vector) outside of the loops so that once a high-water mark is reached, no further allocations occur. Signed-off-by: Keith Rothman <[email protected]>
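
The hoisting pattern described in this commit message can be sketched as follows. The function names `collect_indices` and `count_reallocations` are hypothetical stand-ins, not the refactored VPR functions. The sketch relies on the standard guarantee that `std::vector::clear()` leaves capacity unchanged.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a helper called from a hot loop: it fills a
// caller-owned vector instead of returning a freshly allocated one.
void collect_indices(int x, int y, std::vector<int>* out) {
    out->clear();  // size becomes 0, capacity is kept
    for (int i = 0; i <= (x + y) % 4; ++i) {
        out->push_back(x * 100 + y * 10 + i);
    }
}

// Walks a width x height grid and counts how many calls caused the scratch
// buffer to grow. After the high-water mark is reached the count stops rising.
std::size_t count_reallocations(int width, int height) {
    std::vector<int> scratch;  // hoisted: constructed once, outside both loops
    std::size_t reallocations = 0;
    for (int x = 0; x < width; ++x) {
        for (int y = 0; y < height; ++y) {
            std::size_t before = scratch.capacity();
            collect_indices(x, y, &scratch);
            if (scratch.capacity() != before) ++reallocations;
        }
    }
    return reallocations;
}
```

For a 100 x 100 grid the helper runs 10,000 times, but the buffer grows on at most four of those calls, because it never holds more than four elements. Declaring the vector inside the inner loop would instead allocate and free on every iteration.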

probot-autolabeler bot added the lang-cpp (C/C++ code), libarchfpga (Library for handling FPGA Architecture descriptions), libvtrutil, tests, and VPR (VPR FPGA Placement & Routing Tool) labels Jan 30, 2020

litghost requested a review from kmurray January 30, 2020 21:05

litghost force-pushed the memory_clean_up_to_placer branch from 42b115d to 05bf83a Compare January 30, 2020 23:10

kmurray mentioned this pull request Jan 31, 2020

Move call location of ClockRRGraphBuilder and use alloc_and_load_edges. #1081

Merged

7 tasks

This was referenced Feb 3, 2020

Proxy rr node #1084

Closed

Initial refactoring of edge storage #1085

Closed

litghost force-pushed the memory_clean_up_to_placer branch 3 times, most recently from 9199d13 to bfc5e11 Compare February 5, 2020 19:54

litghost added 15 commits February 5, 2020 12:53

Move rr node storage behind an object.

5f83798

Signed-off-by: Keith Rothman <[email protected]>

Convert t_rr_node to a fly-weight object.

291f0ea

This should have a negligible performance impact, but it enables future changes to how rr nodes and rr edges are stored. Signed-off-by: Keith Rothman <[email protected]>
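
A minimal sketch of the fly-weight idea, assuming a backing store that owns all node data in flat arrays. `NodeStorage`, `NodeRef`, and their fields are hypothetical names for illustration and do not reproduce the actual t_rr_node interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical backing store: owns every per-node field in its own flat array.
struct NodeStorage {
    std::vector<std::int16_t> xlow;
    std::vector<std::int16_t> ylow;
    std::vector<float> capacitance;

    // Appends one node and returns its index.
    std::size_t add_node(std::int16_t x, std::int16_t y, float c) {
        xlow.push_back(x);
        ylow.push_back(y);
        capacitance.push_back(c);
        return xlow.size() - 1;
    }
};

// Fly-weight view of one node: only a storage pointer and an index.
// It owns no data, so it is cheap to create, copy, and discard.
class NodeRef {
  public:
    NodeRef(NodeStorage* storage, std::size_t id)
        : storage_(storage), id_(id) {
        assert(storage != nullptr && id < storage->xlow.size());
    }

    std::int16_t xlow() const { return storage_->xlow[id_]; }
    std::int16_t ylow() const { return storage_->ylow[id_]; }
    float capacitance() const { return storage_->capacitance[id_]; }
    void set_capacitance(float c) { storage_->capacitance[id_] = c; }

  private:
    NodeStorage* storage_;
    std::size_t id_;
};
```

Callers keep using node-style accessors while the storage layout behind them stays free to change, which is the flexibility this commit message refers to.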

Add support for custom allocator to vtr::vector.

0b15b11

Signed-off-by: Keith Rothman <[email protected]>

Split node ptc data away from core storage.

359f142

This enables 16-byte alignment (4 nodes per cache line). Signed-off-by: Keith Rothman <[email protected]>
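
To illustrate the "4 nodes per cache line" arithmetic, here is a sketch with an invented field layout. The field names and widths are assumptions for illustration only and are not the real t_rr_node layout. A 64-byte cache line is also an assumption, though it is common on current x86-64 and many ARM cores.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical hot per-node record, sized and aligned to exactly 16 bytes.
struct alignas(16) NodeCore {
    std::int16_t xlow, ylow, xhigh, yhigh;  // 8 bytes of bounding box
    std::int16_t cost_index;                // 2 bytes
    std::int16_t rc_index;                  // 2 bytes
    std::uint16_t capacity;                 // 2 bytes
    std::uint8_t type;                      // 1 byte
    std::uint8_t direction;                 // 1 byte
};

// Hypothetical cold data moved out of the core record into a parallel array,
// so that it does not push the core record past 16 bytes.
struct NodePtc {
    std::int16_t ptc_num;
};

constexpr std::size_t kCacheLineBytes = 64;  // assumed cache line size

static_assert(sizeof(NodeCore) == 16, "core record must be 16 bytes");
static_assert(alignof(NodeCore) == 16, "core record must be 16-byte aligned");
static_assert(kCacheLineBytes / sizeof(NodeCore) == 4,
              "four core records fit in one assumed cache line");
```

Because the record is both 16 bytes wide and 16-byte aligned, no record in a contiguous array straddles a 64-byte line boundary, provided the array itself starts on a 16-byte boundary.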

Rename t_rr_node_storage to t_rr_graph_storage.

5c1b331

Signed-off-by: Keith Rothman <[email protected]>

Add comment around state flags.

db9eed4

Signed-off-by: Keith Rothman <[email protected]>

Add missing flag check.

eb13bd0

Signed-off-by: Keith Rothman <[email protected]>

Add comments around edge sorting.

058192e

Signed-off-by: Keith Rothman <[email protected]>

Used function form of size_t().

98a6d3e

Signed-off-by: Keith Rothman <[email protected]>

Integrate schema based reader with edge refactoring.

e96a3ba

Signed-off-by: Keith Rothman <[email protected]>

Remove sources of garbage during initialization.

9f407fc

Signed-off-by: Keith Rothman <[email protected]>

Assorted small fixes to avoid copies/allocations.

7bbbda0

Signed-off-by: Keith Rothman <[email protected]>

Changes to avoid allocations.

a31d500

Signed-off-by: Keith Rothman <[email protected]>

Use vtr::string_view instead of copying strings.

0d79929

Signed-off-by: Keith Rothman <[email protected]>

litghost added 13 commits February 5, 2020 12:54

Use string view for formula data instead of std::string.

c91a4f6

Actually owning the string data is unneeded, as all of the data outlasts the formula data structure. Signed-off-by: Keith Rothman <[email protected]>
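
A small sketch of the non-owning approach, using C++17 `std::string_view` as a stand-in for the project's `vtr::string_view`. The function `split_views` is hypothetical. The lifetime requirement in the comment is the same condition this commit message relies on.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Splits `text` on `sep` into views that point into the buffer behind `text`.
// No characters are copied. The caller must keep that buffer alive for as
// long as any returned view is in use.
std::vector<std::string_view> split_views(std::string_view text, char sep) {
    std::vector<std::string_view> parts;
    std::size_t start = 0;
    while (true) {
        std::size_t end = text.find(sep, start);
        if (end == std::string_view::npos) {
            parts.push_back(text.substr(start));  // final piece, may be empty
            break;
        }
        parts.push_back(text.substr(start, end - start));
        start = end + 1;
    }
    return parts;
}
```

Each returned piece is just a pointer and a length into the original buffer, so splitting allocates only the vector of views and never a per-token string.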

Disable heap profiler.

5783450

Signed-off-by: Keith Rothman <[email protected]>

Refactor count_rr_switches to avoid allocations.

8c87d17

Signed-off-by: Keith Rothman <[email protected]>

Hoist t_routing_cost_map out of inner loop to avoid repeated allocations.

a3ae48a

Signed-off-by: Keith Rothman <[email protected]>

Refactor build_switchblocks to avoid thrashing heap.

acf52a3

Signed-off-by: Keith Rothman <[email protected]>

Avoid reallocating the heap if not needed.

8fcbad5

Signed-off-by: Keith Rothman <[email protected]>

Avoid repeatedly allocating modified_rr_node_inf array.

1e128c5

Signed-off-by: Keith Rothman <[email protected]>

Avoid repeatedly allocating structures in the map lookahead.

c6e7861

Signed-off-by: Keith Rothman <[email protected]>

Avoid copying equivalent_tiles in is_tile_compatible.

2ea538d

Signed-off-by: Keith Rothman <[email protected]>

Remove heap profiler logic.

8a0b7ca

Signed-off-by: Keith Rothman <[email protected]>

Run make format.

26eafd1

Signed-off-by: Keith Rothman <[email protected]>

Fix GCC 5/6 warning.

9aa4137

Signed-off-by: Keith Rothman <[email protected]>

litghost force-pushed the memory_clean_up_to_placer branch from bfc5e11 to 9aa4137 Compare February 5, 2020 20:54

litghost mentioned this pull request Feb 6, 2020

Add vtr_reg_weekly_no_he that drops the high effort Titan tests and gaussian blur #1118

Closed

7 tasks

litghost closed this Mar 6, 2020

litghost deleted the memory_clean_up_to_placer branch March 6, 2020 14:45
