docs/src/index.md: 32 additions & 34 deletions
@@ -8,8 +8,8 @@ end
 ```

 # CIGARStrings.jl
-CIGARStrings.jl provide functionality for parsing and working with Concise Idiosyncratic Gapped Alignment Report - or CIGAR - strings.
-CIGARs were popularized by the [SAM format](https://en.wikipedia.org/wiki/SAM_(file_format)), and are a compact runlength encoding notation to represent pairwise alignments.
+CIGARStrings.jl provides functionality for parsing and working with Concise Idiosyncratic Gapped Alignment Report (CIGAR) strings.
+CIGARs were popularized by the [SAM format](https://en.wikipedia.org/wiki/SAM_(file_format)), and are a compact run-length encoding notation used to represent pairwise alignments.
 They can be found in the SAM, BAM, PAF, and GFA formats.

 For example, the following pairwise alignment of a query to a reference:
@@ -19,17 +19,17 @@ For example, the following pairwise alignment of a query to a reference:
 |||| || | |
 R: TAGAACCATA--TGC
 ```
-Can be represented by the CIGAR `5M3D2M2I3M`, representing:
+can be represented by the CIGAR `5M3D2M2I3M`, representing:
 1. 5 matches/mismatches
 2. Then, 3 deletions
 3. Then, 2 matches/mismatches
 4. Then, 2 insertions
 5. Finally, 3 matches/mismatches.

-A CIGAR strings is always written in terms of the _query_, and not the reference.
+A CIGAR string is always written in terms of the _query_, not the reference.

 ## Individual alignment operations
-One run of identical alignment operations, e.g. "5 matches/mismatches" are represented
+One run of identical alignment operations, e.g. "5 matches/mismatches," is represented
 by a single `CIGARElement`.
 Conceptually, a `CIGARElement` is an alignment operation (represented by a `CIGAROp`) and a length:

@@ -41,16 +41,16 @@ CIGAROp
 ## CIGARs
 A CIGAR string is represented by an `AbstractCIGAR`, which currently has two subtypes: `CIGAR` and `BAMCIGAR`.
 These types differ in their memory layout: The former stores the CIGAR as its ASCII representation (as used in the SAM format), and the latter stores it in a binary format (as used in the BAM format).
-Both typs store its underlying data as an `ImmutableMemoryView{UInt8}`.
+Both types store their underlying data as an `ImmutableMemoryView{UInt8}`.

 ```@docs
 AbstractCIGAR
 ```

-The API for these two types are almost interchangeable, so examples below will use `CIGAR`, since its plaintext representation makes examples easier.
+The API for these two types is almost interchangeable, so examples below use `CIGAR`, since its plaintext representation makes examples easier to read.
 See [BAMCIGAR section](@ref bamcigar) for a list of all differences between the two types.

-CIGAR strings are validated upon construction
+CIGAR strings are validated upon construction.

 ```jldoctest
 julia> CIGAR("2M1D3M")
@@ -64,7 +64,7 @@ ERROR: Error around byte 4: Invalid operation. Possible values are "MIDNSHP=X".
 Since CIGAR strings occur in various bioinformatics file formats, it is expected
 that users of CIGARStrings.jl will construct `CIGAR`s from a view into a buffer storing a chunk of the file.

-This is zero-copy, and will not to allocate on Julia 1.14 and forward.
+This is zero-copy, and does not allocate on Julia 1.14 and later.
 For example:

 ```jldoctest
@@ -80,7 +80,7 @@ CIGAR("15M9D18M")
 CIGAR
 ```

-`CIGAR`s are iterable, and returns its`CIGARElement`s, in order:
+`CIGAR`s are iterable, and return their`CIGARElement`s in order:

 ```jldoctest
 julia> collect(CIGAR("2M1D3M"))
@@ -129,8 +129,6 @@ alignment length is 15.
 R: TAGAACCATA--TGC
 ```

-We always have `aln_length(c) ≥ max(query_length(c), ref_length(c))`
-
 ```jldoctest
 julia> c = CIGAR("5M3D2M2I3M");

@@ -144,19 +142,19 @@ julia> aln_length(c)
 15
 ```

-Since the CIGAR operation `M` (`OP_M`) is ambiguous to whether is represents matches,
+Since the CIGAR operation `M` (`OP_M`) is ambiguous about whether it represents matches,
 mismatches, or a combination of these, the function [`count_matches`](@ref) can be used to
 count the number of matches in a CIGAR given the number of mismatches.

-The number of mismatches are typically output by mappers, making this information
-handily accessible:
+Mismatch counts are typically output by mappers, making this information
+readily accessible.

-The alignment identity (number of matches, not mismatches divided by alignment length)
+Alignment identity (number of matches, excluding mismatches, divided by alignment length)
 can be obtained with [`aln_identity`](@ref).
-Like [`count_matches`](@ref), this takes the number of mismatches as an argument:
+Like [`count_matches`](@ref), this takes the number of mismatches as an argument.

 ## Comparing CIGARs
-When comparing `CIGAR`s using `==`, it will check if the `CIGAR`s are literally identical, in the
+When comparing `CIGAR`s using `==`, Julia checks whether the `CIGAR`s are literally identical, in the
 sense that they are composed of the same bytes:

 ```jldoctest compare
@@ -176,7 +174,7 @@ However, in the above example, since the CIGAR operation `M` signifies a match o
 CIGARs are indeed compatible, since `10M` is also a valid CIGAR annotation for the same alignment
 as `4=1X5=`.

-This notion of compatibility tested with `is_compatible`:
+This notion of compatibility can be tested with `is_compatible`:

 ```@docs
 is_compatible
@@ -204,10 +202,10 @@ are also written in this alignment.
 We can see that query position 6 aligns to reference position 9, which is also
 alignment position 9.

-These position translation can be obtained using the function [`pos_to_pos`](@ref),
+These position translations can be obtained using the function [`pos_to_pos`](@ref),
 specifying the source and destination coordinate systems [`query`](@ref), [`ref`](@ref)
 or [`aln`](@ref).
-When passed an integer, this function returns `Translation` object that contains two properties: `.pos` and `.kind`.
+When passed an integer, this function returns a `Translation` object with two properties: `.pos` and `.kind`.

 When a position translation has a straightforward answer, the `.kind` property is
 `CIGARStrings.pos`, and the `.pos` field is the corresponding position:
The CIGAR format is redundant, in that the same alignment can be written in multiple different ways. In particular:

-* The `P` and `H` operations means nothing w.r.t the query and reference.
-`P` is only used to pad w.r.t a third sequence, and `H` signifies that part of
+* The `P` and `H` operations mean nothing with respect to the query and reference.
+`P` is only used to pad with respect to a third sequence, and `H` signifies that part of
 the true query is missing from the input query sequence.
 * The `=` and `X` operations are usually redundant with `M`, since the information of matches/mismatches is not given by the alignment itself, but can be determined from the input sequences given the alignment.
-* Consecutive runs of the same operation is allowed, such as `1M1M`, but is better written `2M`
+* Consecutive runs of the same operation are allowed, such as `1M1M`, but are better written as `2M`.

-This package provides the functions [`normalize`](@ref), [`normalize!`](@ref) and [`unsafe_normalize`](@ref) which creates new cigars written in the canonical form.
-In the canonical form, each of the points above are addressed: `H`is converted to `S`, `P` is removed, `=` and `X`is converted to `M`, and consecutive identical operations are merged.
+This package provides the functions [`normalize`](@ref), [`normalize!`](@ref), and [`unsafe_normalize`](@ref), which create new CIGARs written in canonical form.
+In canonical form, each of the points above is addressed: `H`and `P` is removed, `=` and `X`are converted to `M`, and consecutive identical operations are merged.

-Note that the normalized form of a cigar corresponds to the _same_ pairwise alignment.
+Note that the normalized form of a CIGAR corresponds to the _same_ pairwise alignment.
 Therefore, it is guaranteed that if `is_compatible(a, b)`, then `normalize(a) == normalize(b)` (though not the other way around).
-It is also guaranteed that the result of position translation is identical for a cigar and its normalized version.
+It is also guaranteed that the result of position translation is identical for a CIGAR and its normalized version.

 ## Errors and error recovery
-CIGARStrings.jl allows you to parse a poential CIGAR string without throwing an exception if the data is invalid, using the function [`CIGARStrings.try_parse`](@ref).
+CIGARStrings.jl allows you to parse a potential CIGAR string without throwing an exception if the data is invalid, using the function [`CIGARStrings.try_parse`](@ref).

 ```@docs
 CIGARStrings.CIGARError
@@ -282,14 +280,14 @@ However, in order to make zero-copy CIGARs possible, the `BAMCIGAR` type is back
 CIGARStrings.BAMCIGAR
 ```

-A `BAMCIGAR` can be constructed from its binary representation, using any type which implements `MemoryViews.MemoryView`:
+A `BAMCIGAR` can be constructed from its binary representation using any type that implements `MemoryViews.MemoryView`:

 ```jldoctest
 julia> BAMCIGAR("\x54\4\0\0\x70\4\0\0")
 BAMCIGAR(CIGAR("69S71M"))
 ```

-This is not zero-cost: Like`CIGAR` the type contains some metadata and is validated upon construction.
+This is not zero-cost: like`CIGAR`, the type contains some metadata and is validated upon construction.

 Like `CIGAR`, the `try_parse` function can be used: