Grammar-Based Parser

Versionator uses a formal EBNF grammar to parse version strings. This page explains the approach, its benefits, and trade-offs.

Why a Formal Grammar?

Most version parsers use ad-hoc string manipulation or regular expressions. Versionator takes a different approach: we define a formal grammar based on the SemVer 2.0.0 specification and generate a parser from it.

The Problem with Ad-Hoc Parsing

Traditional regex-based version parsing has issues:

// Common regex approach - fragile and hard to maintain
var semverRegex = regexp.MustCompile(
    `^v?(\d+)\.(\d+)\.(\d+)(-[0-9A-Za-z-.]+)?(\+[0-9A-Za-z-.]+)?$`)

Problems:

Hard to extend - Adding support for new formats requires regex surgery
Difficult to validate - Edge cases (leading zeros, empty identifiers) need separate checks
Poor error messages - Regex failures give no context about what went wrong
Specification drift - Easy to deviate from the SemVer spec without noticing

The Grammar Approach

Instead, we define the grammar formally:

version = [ prefix ], version-core ;
prefix = "v" | "V" ;

version-core = semver | partial-version ;

semver = major, ".", minor, ".", patch, [ pre-release ], [ build-metadata ] ;

pre-release = "-", pre-release-identifier, { ".", pre-release-identifier } ;
pre-release-identifier = alphanumeric-identifier | numeric-identifier ;

numeric-identifier = "0" | positive-digit, { digit } ;

This grammar is:

Readable - Maps directly to the SemVer specification
Verifiable - Can be reviewed against the spec
Self-documenting - The grammar IS the documentation

Implementation

Versionator uses Participle, a parser generator for Go that builds parsers from struct tags.

Lexer

The lexer tokenizes input into meaningful units:

var VersionLexer = lexer.MustSimple([]lexer.SimpleRule{
    {Name: "Prefix", Pattern: `[vV]`},
    {Name: "Number", Pattern: `[0-9]+`},
    {Name: "Dot", Pattern: `\.`},
    {Name: "Dash", Pattern: `-`},
    {Name: "Plus", Pattern: `\+`},
    {Name: "Ident", Pattern: `[a-zA-Z][a-zA-Z0-9-]*`},
    {Name: "Mixed", Pattern: `[0-9]+[a-zA-Z][a-zA-Z0-9-]*`},
})

AST

The grammar maps to Go structs:

type Version struct {
    Prefix        string         `parser:"@Prefix?"`
    Core          *VersionCore   `parser:"@@"`
    PreRelease    []*Identifier  `parser:"('-' @@ ('.' @@)* )?"`
    BuildMetadata []*Identifier  `parser:"('+' @@ ('.' @@)* )?"`
    Raw           string
}

type VersionCore struct {
    Major    *int `parser:"@Number"`
    Minor    *int `parser:"('.' @Number)?"`
    Patch    *int `parser:"('.' @Number)?"`
    Revision *int `parser:"('.' @Number)?"`
}

Validation

After parsing, semantic validation ensures SemVer compliance:

No leading zeros in numeric identifiers (except 0 itself)
Pre-release identifiers contain only [0-9A-Za-z-]
Build metadata identifiers contain only [0-9A-Za-z-]
Only v or V prefixes allowed

Benefits

1. Correctness by Construction

The grammar enforces structure that regex can't:

2.3-alpha.1      ✓ Valid
2.3-alpha..1     ✗ Empty identifier (caught by grammar)
2.3-alpha.01     ✗ Leading zero (caught by validation)

2. Better Error Messages

Grammar-based parsing knows exactly where parsing failed:

parse error: 1:6: unexpected token "."
              ^
         1.2..3

3. Easy Extension

Adding support for new formats requires adding grammar rules, not debugging regex:

(* Add 4-component .NET assembly versions *)
assembly-version = major, ".", minor, ".", build, ".", revision ;

4. Testability

Grammar rules can be tested in isolation:

func TestParse_PreRelease(t *testing.T) {
    tests := []struct{
        input    string
        expected []string
    }{
        {"1.0.0-alpha", []string{"alpha"}},
        {"1.0.0-alpha.1", []string{"alpha", "1"}},
        {"1.0.0-alpha.beta.1", []string{"alpha", "beta", "1"}},
    }
    // ...
}

5. Documentation as Code

The grammar file serves as executable documentation:

docs/grammar/version.ebnf - Complete formal grammar
docs/grammar/railroad.html - Visual railroad diagrams

Trade-offs

Complexity

A grammar-based parser is more complex than a simple regex:

Approach	Lines of Code	Dependencies
Regex	~50	None
Grammar	~300	participle

Build-Time Cost

The parser is constructed at init time, adding ~5ms to startup. This is negligible for CLI usage but measurable.

Learning Curve

Contributors need to understand:

EBNF grammar notation
Participle's struct tag syntax
Lexer token definitions

When This Matters

The grammar approach pays off when:

Parsing complex formats - SemVer has many edge cases
Strict compliance needed - We claim SemVer 2.0.0 compliance
Multiple formats supported - Go pseudo-versions, assembly versions, partial versions
Good error messages matter - CLI users need actionable feedback

Philosophy: The Joy of Excess

Let's be honest: a regex would do.

var semverRegex = regexp.MustCompile(
    `^v?(\d+)\.(\d+)\.(\d+)(-[0-9A-Za-z-.]+)?(\+[0-9A-Za-z-.]+)?$`)

That's 80 characters. It would parse 99% of version strings correctly. Ship it.

Instead, versionator has a 370-line formal EBNF grammar covering SemVer, Go pseudo-versions, .NET assembly versions, NuGet, npm, CalVer, and template variables. Complete with ISO/IEC 14977 notation, proper CC BY 3.0 licensing attribution for the SemVer spec, and railroad diagram generation.

This is, by any reasonable measure, excessive.

And that's exactly the point.

Versionator is a hobby project solving a real problem. The problem could be solved with less. But hobby projects aren't about less. They're about:

Doing it right - Not "right enough," but right. The SemVer spec is prose; a formal grammar removes ambiguity.
Learning by building - Writing an EBNF grammar teaches you things a regex never will.
Joy in craft - There's satisfaction in a grammar that could serve as a community reference, even if it only serves one CLI tool.
Permissible digressions - In professional work, you ship the regex. In passion projects, you write the grammar, generate the railroad diagrams, and document the precedence rules.

The grammar exists because it was fun to write, correct to use, and excessive in exactly the right way for a project where the journey matters as much as the destination.

If you're evaluating versionator for production use: the grammar works, it's well-tested, and it handles edge cases gracefully. If you're evaluating versionator as a codebase to learn from: welcome, fellow enthusiast.

Grammar Reference

The complete grammar is in docs/grammar/version.ebnf.

Key rules:

Rule	Description
`version`	Top-level: optional prefix + version-core
`prefix`	Only `v` or `V` per SemVer convention
`semver`	Full Major.Minor.Patch with optional pre-release/metadata
`partial-version`	Major only or Major.Minor (defaults missing to 0)
`pre-release`	Dash-prefixed, dot-separated identifiers
`build-metadata`	Plus-prefixed, dot-separated identifiers
`numeric-identifier`	No leading zeros (except "0")
`alphanumeric-identifier`	Contains at least one letter

Community Use

The EBNF grammar in docs/grammar/version.ebnf is available for the broader SemVer community.

As Implemented by the Community

This grammar represents SemVer as implemented by the community, not just the formal specification. We've tried to capture:

SemVer 2.0.0 core - The official specification
Common extensions - Go module pseudo-versions, npm conventions
Practical variations - Partial versions (1.2), optional prefix
Adjacent ecosystems - Microsoft Assembly versions (4-component)

We explicitly chose not to make unilateral modifications to the SemVer core. Where the community has established patterns (like the v prefix), we document and support them. Where ecosystems have their own conventions (like .NET's 4-component versions), we include them as separate grammar rules rather than mixing them with SemVer.

Why This Matters

The SemVer specification uses prose descriptions that leave room for interpretation. A formal grammar removes ambiguity and provides a reference that can be:

Ported to other languages - The EBNF notation is parser-generator agnostic
Used as a test oracle - Validate your parser against the grammar
Extended for your needs - Add custom rules while maintaining SemVer compliance
Referenced in discussions - Point to specific grammar rules when debating edge cases

Contributing

If you find the grammar useful, consider adopting it in your project. If you find bugs, ambiguities, or community conventions we've missed, we welcome issues and pull requests.

Licensing

The SemVer 2.0.0 specification by Tom Preston-Werner is licensed under Creative Commons Attribution 3.0 (CC BY 3.0).

Our EBNF grammar is a formalization of that specification. The grammar file includes:

SemVer rules - Derived from the CC BY 3.0 specification (attribution preserved)
Extensions - Go pseudo-versions, Assembly versions, partial versions (BSD 3-Clause)
Original formalization work - The EBNF encoding itself (BSD 3-Clause)

Important: Our release of this grammar under BSD 3-Clause does not remove or supersede the CC BY 3.0 attribution requirements of the underlying SemVer specification. When using the grammar, you must retain attribution to Tom Preston-Werner and the SemVer project as required by CC BY 3.0.

Why a Formal Grammar?​

The Problem with Ad-Hoc Parsing​

The Grammar Approach​

Implementation​

Lexer​

AST​

Validation​

Benefits​

1. Correctness by Construction​

2. Better Error Messages​

3. Easy Extension​

4. Testability​

5. Documentation as Code​

Trade-offs​

Complexity​

Build-Time Cost​

Learning Curve​

When This Matters​

Philosophy: The Joy of Excess​

Grammar Reference​

Community Use​

As Implemented by the Community​

Why This Matters​

Contributing​

Licensing​

See Also​