Deep match amb child memoization #2310
base: feat/error-tree-support
Conversation
// There will only be 3 matches if deep matches are memoized, 6 if they are not.
    return count == 3;
}
pls add a test where we match on the amb node itself (0 | it + 1 | /amb(alts) <- ambTree), because it's a corner case.
private void pushAmb(ITree amb) {
    if (debug) System.err.println("pushAmb: " + amb);
    spine.push(amb);
    for (IValue alt : TreeAdapter.getAlternatives(amb)) {
why not hash on the entire amb cluster? then there are 50% (or more) fewer nodes to cache.
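For concreteness, a rough sketch of that suggestion against the method above. The visitedAmbs field (shown as a plain java.util.Set<IValue> to keep the sketch simple) and the loop body are assumptions, not code from this PR:

// Sketch only: remember whole amb clusters rather than every individual alternative.
// `visitedAmbs` is a hypothetical field, e.g. a java.util.Set<IValue>.
private void pushAmb(ITree amb) {
    if (debug) System.err.println("pushAmb: " + amb);
    if (!visitedAmbs.add(amb)) {
        return; // this exact cluster was already expanded elsewhere in the tree
    }
    spine.push(amb);
    for (IValue alt : TreeAdapter.getAlternatives(amb)) {
        pushConcreteSyntaxNode((ITree) alt); // assumption about the original loop body
    }
}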
@@ -123,7 +147,12 @@ private void pushConcreteSyntaxNode(ITree tree){
if (TreeAdapter.isAmb(tree)) {
    // only recurse
    for (IValue alt : TreeAdapter.getAlternatives(tree)) {
        pushConcreteSyntaxNode((ITree) alt);
        if (!visitedAmbChildren.contains(alt)) {
same here, we could factor "visitedAmbs.contains(tree)" for the entire cluster and not recurse into the children if we did this already before.
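The same cluster-level guard, sketched for the amb branch of pushConcreteSyntaxNode; again, visitedAmbs is a hypothetical set, not part of this PR:

// Sketch only: skip the whole cluster if it was expanded before.
if (TreeAdapter.isAmb(tree)) {
    if (visitedAmbs.add(tree)) {   // hypothetical cluster-level set; add() is false on a repeat
        // only recurse
        for (IValue alt : TreeAdapter.getAlternatives(tree)) {
            pushConcreteSyntaxNode((ITree) alt);
        }
    }
}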
@@ -35,6 +36,7 @@
public class DescendantReader implements Iterator<IValue> {

    Stack<Object> spine = new Stack<Object>();
    private Set.Transient<IValue> visitedAmbChildren = Set.Transient.of();
nice!
Why would a normal HashSet not be enough? We're not sharing these sets, and we're not doing any immutable updates on them.
The hash set has a huge footprint of empty array cells compared to this trie set. We have a very sparse set here, so the trie set is the optimal data structure.
Also, this is a mutable trie set.
I get that this is the transient set; I was just wondering why it's used (I see the value when it's used to eventually make an immutable version).
If it's only for the empty case, we could special-case it to start with null and initialize it the first time we encounter an amb node?
Anyway, I was just wondering why this set was used in this specific case, while the interpreter is mostly full of standard JDK containers (or vallang, of course).
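For illustration only, a minimal sketch of that null-initialization idea using a plain JDK HashSet; the field and helper names are hypothetical, and the replies below explain why the transient trie set makes this trick unnecessary:

// Sketch of the lazy-initialization alternative: pay nothing until the first amb node shows up.
private java.util.Set<IValue> visitedAmbChildren = null;  // hypothetical replacement field

private boolean alreadyVisited(IValue alt) {
    return visitedAmbChildren != null && visitedAmbChildren.contains(alt);
}

private void markVisited(IValue alt) {
    if (visitedAmbChildren == null) {
        visitedAmbChildren = new java.util.HashSet<>();  // allocated only on the first amb cluster
    }
    visitedAmbChildren.add(alt);
}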
It's simply a conscious choice for the most optimal hash set implementation; we need a hash set, but we don't need to pay for all the zeroes. The empty hash-trie set is already one or more orders of magnitude (10 to 100 times) smaller than its array-based counterpart.
So the trick with a null table is not even necessary if we use this transient set. The chances are high that we encounter only one amb node, which is exactly when the trie set outshines the table implementation.
I think this works, but it could be optimized by caching the entire cluster instead of each individual alternative within the cluster. Wdyt @PieterOlivier ?
I remember seeing cases where multiple ambiguity clusters had children that were mostly identical, but not all of them. In that case, caching the children allows more memoization. I do not know how common this is. Maybe we should focus on what is most natural for the user? The "cache the ambiguity cluster itself" solution probably wins here. I will make some measurements with the current solution and then implement memoization at the amb node level and measure again.
This PR implements amb child node memoization for deep (descendant) matches.
This is needed because trees with error nodes often have nested ambiguity nodes with a lot of identical nodes.
By using memoization during deep matches, identical nodes are only visited once.
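To make the effect concrete, here is a small, self-contained Java toy (not the interpreter's actual traversal, and with made-up names): two ambiguity alternatives share the same child subtree, and a visited set ensures the shared subtree is only walked once.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy illustration: two "ambiguity alternatives" share the same child subtree,
// so without memoization the shared subtree is visited twice.
class MemoDemo {
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        Node(String label) { this.label = label; }
    }

    static int countVisits(Node node, Set<Node> visited) {
        if (visited != null && !visited.add(node)) {
            return 0; // already visited: skip the whole subtree
        }
        int visits = 1;
        for (Node child : node.children) {
            visits += countVisits(child, visited);
        }
        return visits;
    }

    public static void main(String[] args) {
        Node shared = new Node("shared-subtree");
        Node alt1 = new Node("alt1");
        Node alt2 = new Node("alt2");
        alt1.children.add(shared);
        alt2.children.add(shared);
        Node amb = new Node("amb");
        amb.children.add(alt1);
        amb.children.add(alt2);

        System.out.println("without memoization: " + countVisits(amb, null));            // 5 visits
        System.out.println("with memoization:    " + countVisits(amb, new HashSet<>())); // 4 visits
    }
}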