Cardinality Estimates Using HLL in Go
Large dataset, HyperLogLog++, and Go

Hari Bhaskaran
Adobe

@yetanotherfella
https://www.linkedin.com/in/haribhaskaran


* Goal

- Answer questions like 'How many unique visitors did my site have yesterday?'
- How many unique visitors did my site have last week?
- The data may live in a database or in plain text TSV files. The solution applies to both, but this talk sticks to command-line analysis of plain text files

* Accurate, but slow solutions

- select count(distinct(key)) from events where date between X and Y;
- cut -f .. file | sort -u | wc -l
- .... | awk '{ c[$1]++} END{for (i in c) print i, c[i]}'
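
For comparison, the exact approach in Go is a per-group set: precise, but memory grows with the number of distinct values. A minimal sketch (the stdin/TSV handling here is my assumption, not the talk's actual tooling):

    // Exact distinct counts per group, reading TSV on stdin.
    // Memory grows with the number of distinct values seen.
    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    func main() {
        seen := map[string]map[string]struct{}{} // group -> set of values
        sc := bufio.NewScanner(os.Stdin)
        for sc.Scan() {
            f := strings.SplitN(sc.Text(), "\t", 2)
            if len(f) < 2 {
                continue
            }
            if seen[f[0]] == nil {
                seen[f[0]] = map[string]struct{}{}
            }
            seen[f[0]][f[1]] = struct{}{}
        }
        for g, vals := range seen {
            fmt.Println(g, len(vals))
        }
    }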

* Sample dataset

    $ find logs -type f | \
        parallel -X 'wc -l {} | grep total' | \
        awk '{a+=$1}END{print a}'

    92634675 # Number of lines, 92 Million

    $ find logs -type f | wc -l

    4392 # Number of files

.caption O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
.caption ;login: The USENIX Magazine, February 2011:42-47.

* Sample dataset (cont'd)

    $ cat $(find logs -type f | sort -R | head -n 1) | cut -f1,4 | head -n 2
    T8 VsmEKQAAAB77S@xU
    T8 Vy5XnQAABfrFEtEt

    $ find logs -type f | \
        parallel -P20 -L200 'LC_ALL=C cut -f1,4 {} | wc -c' | \
        awk '{a+=$1}END{print a}'

    1817322784 # Only 1.8G is relevant data (to be de-duplicated)

    $ find logs -type f | parallel -P20 -L200 'wc -c {} | grep total' | \
        awk '{a+=$1}END{print a}'

    78185624553 # Total data size is ~78G

.caption O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
.caption ;login: The USENIX Magazine, February 2011:42-47.

* Let us calculate the uniques

    $ time (find logs -type f | sort -R | \
        parallel -P20 -L200 'LC_ALL=C cut -f1,4 {}' | \
        LC_ALL=C sort -u | \
        awk '{ c[$1]++} END{for (i in c) print i, c[i]}')

    T11 1841
    T1 1
    T2 2250490
    T3 2593651
    T4 4267297
    T5 30833645
    T6 1348228
    b7027a8ada48bfa673aae37a9b9f96 1 # <-- blame Java for this :P
    T7 4675100
    T8 5126977
    T9 857155
    T10 16

    real 1m50.093s
    user 5m49.232s
    sys 0m30.670s

* HyperLogLog++

- LogLog -> HyperLogLog -> HyperLogLog++
- The HLL used here is HyperLogLog++, a set of refinements to the earlier HyperLogLog algorithm by researchers at Google (Stefan Heule, Marc Nunkesser, Alexander Hall)
- HLL supports unions, which let us break the estimation problem into smaller chunks that can even be run in parallel

MapReduce style

- Still scans all data
- Avoids a gigantic sort

Incremental update style

- Save daily uniques
- Generate weekly/monthly uniques from them (sketched below)
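
A sketch of the incremental-update style in Go, using the lytics/hll library adopted later in this talk (the daily-job setup and the *hll.Hll type name are illustrative assumptions, not real code from dice):

    // Merge seven persisted daily sketches into a weekly estimate
    // without re-reading any raw logs.
    days := make([]*hll.Hll, 7)
    for i := range days {
        days[i] = hll.NewHll(14, 25) // same precision as dice uses
        // ... in reality, filled via Add() by a daily job, then persisted ...
    }

    week := hll.NewHll(14, 25)
    for _, d := range days {
        week.Combine(d) // union of sketches == sketch of the union
    }
    fmt.Println(week.Cardinality()) // estimated weekly uniques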

* My HLL based CLI utility 'dice'

    $ dice -h
    Usage of dice (Version: 20160618.122920):
      -d, --delimiter string   Delimiter character.
      -i, --hll-in             Input has HLL json-data instead of column value
      -o, --hll-out            Output HLL json-data instead of cardinality
      -u, --uniq-col int       One-based column number to distinct. Zero implies whole line

Typical usage involves invoking dice with -i or -o, depending on whether we are in the map phase or the reduce phase.
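
The help output above matches the style of the spf13/pflag package, so the flag wiring plausibly looks like this (an assumption on my part, not the actual dice source):

    import flag "github.com/spf13/pflag"

    var (
        delimiter = flag.StringP("delimiter", "d", "\t", "Delimiter character.")
        hllIn     = flag.BoolP("hll-in", "i", false, "Input has HLL json-data instead of column value")
        hllOut    = flag.BoolP("hll-out", "o", false, "Output HLL json-data instead of cardinality")
        uniqCol   = flag.IntP("uniq-col", "u", 0, "One-based column number to distinct. Zero implies whole line")
    )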

.caption A less relevant sibling utility was first named 'slice'. Then this had to be 'dice'

* Let us calculate the uniques using HLL

    $ time (find logs -type f | sort -R | \
        parallel -P20 -L200 'LC_ALL=C cut -f1,4 {} | ~/bin/dice --hll-out -u2' | \
        ~/bin/dice --hll-in -u2)

    T1 1
    T3 2607555
    T9 864695
    T4 4276099
    T6 1358564
    T8 5116414
    T5 31155583
    T10 16
    T7 4667669
    T2 2236266
    T11 1734
    b7027a8ada48bfa673aae37a9b9f96 1 # <- I still blame Java :P

    real 0m22.295s
    user 4m45.871s
    sys 0m25.819s

* Why is it so fast?

- It avoided the sort
- It used more cores of the machine, since chunks were processed in parallel

* How does HLL do this magic?

In a random stream of integers, ~50% of the numbers (in binary) start with "1", 25% start with "01", and 12.5% start with "001". This means that if you observe a random stream and see a value starting with "001", there is a good chance the stream has a cardinality of about 8 (2^3). HLL works on a hash of each input to create this randomness.

This is oversimplified for a 6-minute talk. In reality HLL does a lot more: it splits the 64-bit hash space into many sub-streams, calculates the estimate for each sub-stream, and takes a (harmonic) mean across them.
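
A toy version of that idea in Go, using a murmur3 hash as the dice code does later (import path assumed to be spaolacci/murmur3). This is only the "raw" textbook estimator, not real HLL++ (no bias correction, no sparse encoding):

    // Toy HLL: top p bits of the hash pick a register, the rest
    // contributes its leading-zero rank; a harmonic mean of the
    // registers yields the estimate.
    package main

    import (
        "fmt"
        "math"
        "math/bits"

        "github.com/spaolacci/murmur3"
    )

    const p = 14 // 2^14 sub-streams (registers)
    const m = 1 << p

    func main() {
        registers := make([]uint8, m)
        for i := 0; i < 1000000; i++ {
            h := murmur3.Sum64([]byte(fmt.Sprintf("visitor-%d", i)))
            idx := h >> (64 - p)                         // top p bits pick the register
            rank := uint8(bits.LeadingZeros64(h<<p)) + 1 // rank of first 1-bit in the rest
            if rank > registers[idx] {
                registers[idx] = rank
            }
        }
        sum := 0.0
        for _, r := range registers {
            sum += math.Pow(2, -float64(r))
        }
        alpha := 0.7213 / (1 + 1.079/float64(m)) // bias constant for large m
        fmt.Printf("estimate: %.0f\n", alpha*float64(m)*float64(m)/sum)
    }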

* HLL Related Code

There seem to be many HLL++ Go implementations. I have used
https://github.com/lytics/hll

    // One HLL sketch per group key (column 1 in these examples).
    h, ok := uniqs[groupKey]
    if !ok {
        h = hll.NewHll(14, 25)
        uniqs[groupKey] = h
    }
    if hllIn {
        // Reduce phase: the column holds a serialized HLL; merge it in.
        h2 := hll.NewHll(14, 25)
        err := h2.UnmarshalJSON(line[keyStart:keyEnd])
        if err != nil {
            log.Printf("Unable to unmarshal JSON-HLL: %s (key = %v)", err, groupKey)
            continue
        }
        h.Combine(h2)
    } else {
        // Map phase: hash the raw column value into the sketch.
        h.Add(murmur3.Sum64(line[keyStart:keyEnd]))
    }

* HLL Related Code (cont'd)

    if hllOut {
        // Map phase: emit the serialized sketch for a later reduce.
        b, err := v.MarshalJSON()
        if err != nil {
            log.Fatalf("Unable to marshal HLL to JSON: %s", err)
        }
        writer.Write(b)
    } else {
        // Final output: print the estimated cardinality.
        writer.WriteString(fmt.Sprintf("%d", v.Cardinality()))
    }

* Intermediate results

    T7 {"M":"gWAwDAAIAAACAAAFAQEBAgUKHAQBBAAAAgYA....DXUFeywCAAAABAAAAgMAAAA=","p":14,"pp":25}

The intermediate result may appear larger than the original content in that
column: the HLL intermediate JSON is a long string of about 10k bytes.
However, note that we only create one such line per run of the job (200 files
in this case) per unique value of column 1.
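
A quick way to check that at-rest size, assuming the same lytics/hll API used above:

    // Rough size check of the serialized sketch (illustrative).
    h := hll.NewHll(14, 25)
    for i := 0; i < 100000; i++ {
        h.Add(murmur3.Sum64([]byte(fmt.Sprintf("v%d", i))))
    }
    b, err := h.MarshalJSON()
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(len(b)) // on the order of 10k bytes once the sketch is dense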