Releases: iipc/jwarc

v0.34.0

16 Jan 08:20
@ato ato

Run the jar: java -jar jwarc-0.34.0.jar
Or the container: docker run -it --rm iipc/jwarc:0.34.0
Or add a Maven dependency:

<dependency>
  <groupId>org.netpreserve</groupId>
  <artifactId>jwarc</artifactId>
  <version>0.34.0</version>
</dependency>

New features

  • Added view command: interactive TUI for exploring WARC files
    • view captures, WARC and HTTP headers
    • filter captures by type, status, method or url
    • save payload to a file, open in browser or external editor

Fixed

  • HttpParser: lenient mode now accepts "0" as a status code for compatibility with Browsertrix WARCs

v0.33.0

23 Dec 15:19
@ato ato

Run the jar: java -jar jwarc-0.33.0.jar
Or the container: docker run -it --rm iipc/jwarc:0.33.0
Or add a Maven dependency:

<dependency>
  <groupId>org.netpreserve</groupId>
  <artifactId>jwarc</artifactId>
  <version>0.33.0</version>
</dependency>

New features

  • CdxRecord: surt(), format(), values() and toString()
  • CdxWriter
    • CDXJ output support
    • sort option
  • HttpMessage: Content-Encoding: zstd support
  • HttpRequest: Transfer-Encoding: chunked support
  • WarcReader: Zstandard-compressed WARC file support
  • WarcServer: resource record support

Fixed

  • URIs.toNormalizedSurt(): Improved compatibility with Python surt.
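
For readers unfamiliar with SURT keys (the sort-friendly URL form used as the first field of CDX and CDXJ lines), the core idea can be sketched with the standard library alone. This is a simplified illustration, not jwarc's URIs.toNormalizedSurt(): the class name SimpleSurt and the method toSurt are hypothetical, and the normalization rules that Python surt applies (such as dropping "www." or default ports) are deliberately omitted.

```java
import java.net.URI;
import java.util.Locale;

class SimpleSurt {
    /** Reverses the host labels and appends port, path and query (no further normalization). */
    static String toSurt(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        StringBuilder sb = new StringBuilder();
        // "www.example.org" becomes "org,example,www"
        for (int i = labels.length - 1; i >= 0; i--) {
            sb.append(labels[i]);
            if (i > 0) sb.append(',');
        }
        if (uri.getPort() != -1) {
            sb.append(':').append(uri.getPort());
        }
        sb.append(')');
        String path = uri.getRawPath();
        sb.append(path == null || path.isEmpty() ? "/" : path);
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toSurt("http://Example.org/a/b?x=1")); // org,example)/a/b?x=1
    }
}
```

Because the most significant host label comes first, sorting these keys groups captures from the same domain and its subdomains together.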

v0.32.0

16 Jul 10:24
@ato ato

New features

  • HeaderValidator with WARC/1.1 standard ruleset
  • ExtractTool: can now extract sequential concurrent records (--concurrent option)
  • DedupeTool
    • In-memory cache for cross-URL digest-based deduplication (--cache-size option)
    • Now prints deduplication statistics (--dry-run and --quiet options)
    • Multi-threaded deduplication (--threads option)
  • ValidateTool
    • Multi-threaded validation (--threads option)
  • ParsingException message is now annotated with the source filename and record offset when available
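
As an illustration of the bounded in-memory cache behind cross-URL digest-based deduplication, here is a minimal LRU sketch built on LinkedHashMap. It is an assumption about the general technique, not DedupeTool's actual code: the class DigestCache, the method lookupOrRemember and the string "record reference" are all hypothetical names chosen for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DigestCache {
    private final Map<String, String> map;

    /** maxEntries plays the role of a cache-size limit (hypothetical parameter). */
    DigestCache(int maxEntries) {
        // accessOrder=true keeps the most recently used digest at the tail
        this.map = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries; // evict the least recently used entry
            }
        };
    }

    /**
     * Returns the reference of an earlier record with the same payload digest,
     * or null if the digest is new (in which case it is remembered).
     */
    String lookupOrRemember(String digest, String recordRef) {
        String existing = map.get(digest);
        if (existing == null) {
            map.put(digest, recordRef);
        }
        return existing;
    }

    int size() {
        return map.size();
    }
}
```

A non-null result means the payload was already seen, possibly under a different URL, so the record could be written as a revisit instead of a full copy.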

Bugs fixed

  • RFC5952 canonical form is now used for IPv6 addresses in WARC-IP-Address
  • HttpParser in lenient mode now:
    • accepts responses missing version number
    • ignores header lines missing the : separator
    • ignores folded status lines
  • WarcParser: treats alexa/dat ARC records as not HTTP type
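
To show what RFC 5952 canonical form means in practice, the following sketch converts an IPv6 literal to lowercase hex without leading zeros and compresses the longest run of two or more zero groups (the first such run on a tie). It is a standalone illustration, not jwarc's implementation; the class name Ipv6Canonicalizer is hypothetical, and IPv4-mapped addresses and zone identifiers are out of scope.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

class Ipv6Canonicalizer {
    static String canonicalize(String literal) {
        byte[] bytes;
        try {
            // For an IP literal this only parses the text; no DNS lookup is made.
            bytes = InetAddress.getByName(literal).getAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("invalid address: " + literal, e);
        }
        if (bytes.length != 16) {
            throw new IllegalArgumentException("not an IPv6 address: " + literal);
        }
        int[] groups = new int[8];
        for (int i = 0; i < 8; i++) {
            groups[i] = ((bytes[2 * i] & 0xff) << 8) | (bytes[2 * i + 1] & 0xff);
        }
        // Find the longest run of zero groups; the first run wins a tie.
        int bestStart = -1, bestLen = 0;
        for (int i = 0; i < 8; ) {
            if (groups[i] != 0) { i++; continue; }
            int j = i;
            while (j < 8 && groups[j] == 0) j++;
            if (j - i > bestLen) { bestStart = i; bestLen = j - i; }
            i = j;
        }
        if (bestLen < 2) bestStart = -1; // a single zero group is not shortened
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 8; i++) {
            if (i == bestStart) {
                sb.append("::");
                i += bestLen - 1; // skip the rest of the compressed run
                continue;
            }
            if (sb.length() > 0 && sb.charAt(sb.length() - 1) != ':') sb.append(':');
            sb.append(Integer.toHexString(groups[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(canonicalize("2001:0DB8:0:0:0:0:0:1")); // 2001:db8::1
    }
}
```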

v0.31.1: Release 0.31.1

20 Nov 04:11
@ato ato

Bugs fixed

  • Fixed URIs.parseLeniently() returning a different value to new URI() if the path was empty or the input contained percent encoded characters #90 #91
  • Replaced some internal usages of record.targetURI() with record.target() to reduce the chance of runtime exceptions and preserve the exact original value

v0.31.0: Release 0.31.0

14 Nov 01:59
@ato ato

New features

  • Added optional support for brotli content encoding #88
  • Added HttpMessage.bodyDecoded() #88
  • WarcTool: Added dedupe subcommand
  • DedupeTool: Added --verbose option and silenced default logging

Bug fixes

  • GunzipChannel: Fixed incorrect record length calculation when gzip footer aligns with the end of the buffer
  • ValidateTool: Fixed digest validation #87
  • DedupeTool: Used matchType=exact to properly handle CDX queries for URLs ending with *
  • DedupeTool: Fixed record copying when transferTo copies fewer bytes than requested
  • DedupeTool: Prevented appending of an empty gzip member when no records were deduplicated
  • DedupeTool: Fixed exception when input files are in the current working directory
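
The transferTo fix above reflects a general property of FileChannel.transferTo: it may copy fewer bytes than requested, so callers have to loop. A minimal sketch of such a loop follows. The class FullTransfer and the method copyFully are hypothetical names, the code is not taken from DedupeTool, and it assumes a blocking destination channel.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

class FullTransfer {
    /** Copies exactly count bytes starting at position, retrying after short transfers. */
    static void copyFully(FileChannel src, long position, long count, WritableByteChannel dst)
            throws IOException {
        long remaining = count;
        while (remaining > 0) {
            // transferTo returns the number of bytes actually moved, which may be
            // anywhere from 0 to remaining.
            long n = src.transferTo(position, remaining, dst);
            if (n == 0 && position >= src.size()) {
                throw new EOFException("source ended with " + remaining + " bytes left to copy");
            }
            position += n;
            remaining -= n;
        }
    }
}
```

Calling transferTo once and ignoring the return value silently truncates the copied record whenever the destination accepts only part of the data.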

v0.30.0: Release 0.30.0

28 Jun 07:36
@ato ato

New features

  • WarcReader and WarcParser gained a lenient parsing mode which:
    • permits ASCII control characters in header field names and values
    • allows lines to end with LF instead of CRLF
    • permits multi-digit WARC minor versions like "0.18"
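
To make two of these relaxations concrete, here is a small sketch of how a strict and a lenient check of the WARC version line might differ. The regular expressions are assumptions made for illustration (the class VersionLine is hypothetical), and jwarc's parser is not necessarily regex-based.

```java
import java.util.regex.Pattern;

class VersionLine {
    // Strict (assumed): single-digit minor version, line terminated by CRLF.
    private static final Pattern STRICT = Pattern.compile("WARC/[0-9]\\.[0-9]\r\n");
    // Lenient: multi-digit minor version allowed, CR before LF optional.
    private static final Pattern LENIENT = Pattern.compile("WARC/[0-9]\\.[0-9]+\r?\n");

    static boolean accepts(String line, boolean lenient) {
        return (lenient ? LENIENT : STRICT).matcher(line).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts("WARC/0.18\n", false)); // false
        System.out.println(accepts("WARC/0.18\n", true));  // true
    }
}
```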

v0.29.0: Release 0.29.0

14 Feb 04:43
@ato ato

New features

  • Added MediaType.parseLeniently() and .isValid()

Changes

  • Message.contentType() and other methods that internally call it now use the lenient MediaType parser instead of throwing IllegalArgumentException #83

v0.28.6: Release 0.28.6

09 Feb 07:15
@ato ato

Bugs fixed

  • Improved compatibility with ARC variants (version-block length off by one, v2 version-block, spurious linefeeds) #82
  • WarcParser: Context in parse error messages now uses the buffer position instead of the parser (file) position

v0.28.5: Release 0.28.5

13 Dec 05:34
@ato ato

Bugs fixed

  • Fixed a ClosedChannelException, caused by reuse of an empty MessageBody, when reading a WarcRevisit body after closing a previous one. #80

v0.28.4: Release 0.28.4

13 Dec 05:33
@ato ato

Bugs fixed

  • CDX formatting now percent encodes spaces, newlines and null characters in all string fields. This is non-standard but at least prevents us outputting invalid CDX lines.
  • CdxRequestEncoder now handles requests with an invalid content-type header
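
CDX lines are space-separated with one record per line, which is why a raw space or newline inside a field corrupts the output. The percent-encoding described above can be sketched as follows. This is a standalone illustration with a hypothetical class name (CdxFieldEncoder), not the code jwarc uses, and it escapes only the three characters named in the note.

```java
class CdxFieldEncoder {
    /** Percent-encodes the characters that would break a space-delimited CDX line. */
    static String encode(String field) {
        StringBuilder sb = new StringBuilder(field.length());
        for (int i = 0; i < field.length(); i++) {
            char c = field.charAt(i);
            switch (c) {
                case ' ':  sb.append("%20"); break; // field separator
                case '\n': sb.append("%0A"); break; // record separator
                case '\0': sb.append("%00"); break; // null character
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("text/html; charset=utf-8")); // text/html;%20charset=utf-8
    }
}
```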