
Commit 3de0293

Merge pull request #36 from harikb/master
Add the lightning talk - Cardinality Estimates Using HLL in Go - Tuesday ~5:15pm
2 parents 91afd23 + b137bf3 commit 3de0293

File tree

2 files changed (+199, -0 lines)
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
# Cardinality Estimates Using HLL in Go

This presentation on using HyperLogLog++ from Go was given as a lightning talk at GopherCon 2016, on Tuesday, July 12, 2016.

Slides viewable at http://go-talks.appspot.com/github.com/gophercon/2016-talks/HariBhaskaran-CardinalityEstimatesUsingHLL/main.slide

--

Hari Bhaskaran

[@yetanotherfella](https://twitter.com/yetanotherfella)

https://www.linkedin.com/in/haribhaskaran

Lines changed: 188 additions & 0 deletions
@@ -0,0 +1,188 @@
Cardinality Estimates Using HLL in Go
Large dataset, HyperLogLog++, and Go

Hari Bhaskaran
Adobe

@yetanotherfella
https://www.linkedin.com/in/haribhaskaran

* Goal

- Answer questions like "How many unique visitors did my site have yesterday?"
- How many unique visitors did my site have last week?
- From data stored in a database or in plain-text TSV files. The solution applies to both, but this talk is limited to command-line analysis of plain-text files.

* Accurate, but slow solutions

- select count(distinct(key)) from events where date between X and Y;
- cut -f .. file | sort -u | wc -l
- .... | awk '{ c[$1]++ } END { for (i in c) print i, c[i] }'

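For contrast, a minimal exact-count baseline in Go (my sketch, not from the talk): one set per group, reading TSV on stdin. Memory grows with the number of distinct values, which is exactly the cost HLL avoids.

  package main

  import (
      "bufio"
      "fmt"
      "os"
      "strings"
  )

  func main() {
      // Exact distinct count per group (column 1 of a TSV), read from stdin.
      // Memory grows linearly with the number of distinct (col1, col4) pairs.
      seen := make(map[string]struct{})
      counts := make(map[string]int)
      sc := bufio.NewScanner(os.Stdin)
      for sc.Scan() {
          f := strings.Split(sc.Text(), "\t")
          if len(f) < 4 {
              continue
          }
          key := f[0] + "\t" + f[3]
          if _, ok := seen[key]; !ok {
              seen[key] = struct{}{}
              counts[f[0]]++
          }
      }
      for g, c := range counts {
          fmt.Println(g, c)
      }
  }
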
* Sample dataset

  $ find logs -type f | \
      parallel -X 'wc -l {} | grep total' | \
      awk '{a+=$1} END {print a}'

  92634675      # Number of lines, 92 million

  $ find logs -type f | wc -l

  4392          # Number of files

.caption O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
.caption ;login: The USENIX Magazine, February 2011:42-47.

* Sample dataset (cont'd)

  $ cat $(find logs -type f | sort -R | head -n 1) | cut -f1,4 | head -n 2

  T8    VsmEKQAAAB77S@xU
  T8    Vy5XnQAABfrFEtEt

  $ find logs -type f | \
      parallel -P20 -L200 'LC_ALL=C cut -f1,4 {} | wc -c' | \
      awk '{a+=$1} END {print a}'

  1817322784    # Only ~1.8 GB is relevant data (the columns to be distinct-counted)

  $ find logs -type f | parallel -P20 -L200 'wc -c {} | grep total' | \
      awk '{a+=$1} END {print a}'

  78185624553   # Total data size is ~78 GB

.caption O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
.caption ;login: The USENIX Magazine, February 2011:42-47.

* Let us calculate the uniques

  $ time (find logs -type f | sort -R | \
      parallel -P20 -L200 'LC_ALL=C cut -f1,4 {}' | \
      LC_ALL=C sort -u | \
      awk '{ c[$1]++ } END { for (i in c) print i, c[i] }')

  T11 1841
  T1 1
  T2 2250490
  T3 2593651
  T4 4267297
  T5 30833645
  T6 1348228
  b7027a8ada48bfa673aae37a9b9f96 1   # <-- blame Java for this :P
  T7 4675100
  T8 5126977
  T9 857155
  T10 16

  real    1m50.093s
  user    5m49.232s
  sys     0m30.670s

* HyperLogLog++

- LogLog -> HyperLogLog -> HyperLogLog++
- The HLL used here is HyperLogLog++, a set of refinements to the earlier HyperLogLog algorithm by researchers at Google (Stefan Heule, Marc Nunkesser, Alexander Hall)
- HLL supports unions, which lets us break the estimation problem into smaller chunks that can even be run in parallel (see the sketch below)

MapReduce style

- Still scans all data
- Avoids a gigantic sort

Incremental Update style

- Save daily uniques
- Generate weekly/monthly uniques from them

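To make the union idea concrete, here is a minimal sketch of the incremental-update style (mine, not from the talk), using the lytics/hll API that appears later in these slides; the murmur3 import path and the two-day setup are my assumptions.

  package main

  import (
      "fmt"

      "github.com/lytics/hll"        // the HLL++ package used in this talk
      "github.com/spaolacci/murmur3" // assumed import path for murmur3
  )

  func main() {
      // One sketch per day: hash each visitor ID and add it.
      day1 := hll.NewHll(14, 25)
      day2 := hll.NewHll(14, 25)
      day1.Add(murmur3.Sum64([]byte("visitor-a")))
      day1.Add(murmur3.Sum64([]byte("visitor-b")))
      day2.Add(murmur3.Sum64([]byte("visitor-b"))) // repeat visitor across days
      day2.Add(murmur3.Sum64([]byte("visitor-c")))

      // Weekly uniques = union of the saved daily sketches,
      // with no rescan of the raw logs.
      week := hll.NewHll(14, 25)
      week.Combine(day1)
      week.Combine(day2)
      fmt.Println(week.Cardinality()) // ~3
  }
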
* My HLL-based CLI utility 'dice'

  $ dice -h
  Usage of dice (Version: 20160618.122920):
    -d, --delimiter string   Delimiter character.
    -i, --hll-in             Input has HLL JSON data instead of the column value
    -o, --hll-out            Output HLL JSON data instead of the cardinality
    -u, --uniq-col int       One-based column number to distinct. Zero implies the whole line

Typical usage involves invoking dice with -o or -i, depending on whether we are in the Map phase or the Reduce phase.

.caption A less relevant sibling utility was first named 'slice'. Then this had to be 'dice'.

* Let us calculate the uniques using HLL

  $ time (find logs -type f | sort -R | \
      parallel -P20 -L200 'LC_ALL=C cut -f1,4 {} | ~/bin/dice --hll-out -u2' | \
      ~/bin/dice --hll-in -u2)

  T1 1
  T3 2607555
  T9 864695
  T4 4276099
  T6 1358564
  T8 5116414
  T5 31155583
  T10 16
  T7 4667669
  T2 2236266
  T11 1734
  b7027a8ada48bfa673aae37a9b9f96 1   # <- I still blame Java :P

  real    0m22.295s
  user    4m45.871s
  sys     0m25.819s

* Why is it so fast?

- It avoided the sort
- It used more cores of the machine, since chunks were processed in parallel

* How does HLL do this magic?

In a random stream of integers, ~50% of the numbers (in binary) start with "1", 25% start with "01", and 12.5% start with "001". So if you observe a random stream and the rarest prefix you have seen is "001", there is a good chance the stream has a cardinality of about 8 (2^3). HLL works on a hash of the input to create that randomness.

This is oversimplified for a 6-minute talk. In reality HLL does a lot more than this: it splits the 64-bit hash space into many sub-streams, calculates the estimate for each sub-stream, and takes a (harmonic) mean across them.

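Here is a tiny illustration of that intuition (my sketch, not from the talk), using Go's standard FNV-1a hash and math/bits to count leading zero bits:

  package main

  import (
      "fmt"
      "hash/fnv"
      "math/bits"
  )

  // leadingZeros hashes s and counts the leading zero bits of the 64-bit
  // hash; this "rarest prefix seen" is the signal HLL keeps per sub-stream.
  func leadingZeros(s string) int {
      h := fnv.New64a()
      h.Write([]byte(s))
      return bits.LeadingZeros64(h.Sum64())
  }

  func main() {
      maxLZ := 0
      for i := 0; i < 1000; i++ {
          if lz := leadingZeros(fmt.Sprintf("visitor-%d", i)); lz > maxLZ {
              maxLZ = lz
          }
      }
      // Crude single-stream estimate: 2^(maxLZ+1). Real HLL averages this
      // signal across thousands of sub-streams to cut the variance.
      fmt.Printf("max leading zeros = %d, estimate ~ %d\n", maxLZ, 1<<(maxLZ+1))
  }
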
* HLL Related Code

There seem to be many HLL++ Go implementations. I have used
https://github.com/lytics/hll

  // Look up (or create) the HLL for this group key.
  h, ok := uniqs[groupKey]
  if !ok {
      h = hll.NewHll(14, 25)
      uniqs[groupKey] = h
  }
  if hllIn {
      // Reduce phase: the column holds a serialized HLL; merge it in.
      h2 := hll.NewHll(14, 25)
      err := h2.UnmarshalJSON(line[keyStart:keyEnd])
      if err != nil {
          log.Printf("Unable to unmarshal JSON-HLL: %s (key = %v)", err, groupKey)
          continue
      }
      h.Combine(h2)
  } else {
      // Map phase: the column holds a raw value; hash it and add it.
      h.Add(murmur3.Sum64(line[keyStart:keyEnd]))
  }

* HLL Related Code (cont'd)

  if hllOut {
      // Map phase: emit the serialized HLL for a later Reduce phase.
      b, err := v.MarshalJSON()
      if err != nil {
          log.Fatalf("Unable to marshal HLL to JSON: %s", err)
      }
      writer.Write(b)
  } else {
      // Final output: emit the estimated distinct count.
      writer.WriteString(fmt.Sprintf("%d", v.Cardinality()))
  }

* Intermediate results

  T7 {"M":"gWAwDAAIAAACAAAFAQEBAgUKHAQBBAAAAgYA....DXUFeywCAAAABAAAAgMAAAA=","p":14,"pp":25}

The intermediate result may appear larger than the original content of that
column: this HLL intermediate JSON result is a long string of about 10k bytes.
However, note that we only create one such line per run of the job (200 files
in this case) per unique value of column 1.

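To see why that size is effectively constant, here is a small experiment (mine, not from the talk) with the same lytics/hll API; the murmur3 import path is an assumption.

  package main

  import (
      "fmt"

      "github.com/lytics/hll"        // the HLL++ package used in this talk
      "github.com/spaolacci/murmur3" // assumed import path for murmur3
  )

  func main() {
      h := hll.NewHll(14, 25)
      for i := 0; i < 1000000; i++ {
          h.Add(murmur3.Sum64([]byte(fmt.Sprintf("visitor-%d", i))))
      }
      b, err := h.MarshalJSON()
      if err != nil {
          panic(err)
      }
      // The marshaled sketch stays around the same ~10 KB size however
      // many values are added; only the estimate changes.
      fmt.Printf("marshaled = %d bytes, estimate = %d\n", len(b), h.Cardinality())
  }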
