Implementing a URL-matching regular expression with RFC 3986 in mind [JavaScript]

2023/11/03

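Goal

In a Node.js validation routine I sometimes need to check whether a string is a well-formed URL. Pulling in a library just for that felt like overkill, so I worked out a regular expression myself. The requirements are as follows.

The protocol (http or https) is required
A trailing slash is allowed
Subdomains (hostnames) are allowed
Arbitrarily nested subdirectories are allowed
A TLD is required
Query parameters, a hash, and a port number are each allowed
Unreserved and reserved characters are both allowed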

About special characters

pchar = unreserved / pct-encoded / sub-delims / ":" / "@"

query = *( pchar / "/" / "?" )

fragment = *( pchar / "/" / "?" )

pct-encoded = "%" HEXDIG HEXDIG

unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
           / "*" / "+" / "," / ";" / "="

RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax

The only special characters allowed are the ones listed above, following RFC 3986.
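As a rough illustration of how the grammar above turns into a character class, the sketch below assembles one from the individual sets. The constant and function names are hypothetical, chosen only for this example. Note one assumption carried over from the pattern used in this article: "[" and "]" are gen-delims in RFC 3986, but they are left out of the class here because the article's regular expression does not include them.

```typescript
// unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
// The hyphen is escaped so it is not read as a range inside the class.
const UNRESERVED = "a-zA-Z0-9\\-._~";
// pct-encoded starts with "%"; the two hex digits are already covered by UNRESERVED.
const PCT = "%";
// gen-delims without "[" and "]" (assumption: mirrors the article's pattern).
const GEN_DELIMS_USED = ":/?#@";
// sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
const SUB_DELIMS = "!$&'()*+,;=";

// Character class containing every allowed character.
const ALLOWED_CLASS = `[${UNRESERVED}${PCT}${GEN_DELIMS_USED}${SUB_DELIMS}]`;
const ONLY_ALLOWED = new RegExp(`^${ALLOWED_CLASS}*$`);

// Hypothetical helper: true when every character of s is in the allowed set.
function hasOnlyAllowedChars(s: string): boolean {
  return ONLY_ALLOWED.test(s);
}
```

With this helper, a string such as "/path?x=1&y=%20#top" is accepted, while "/><" and "[::1]" are rejected.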

The regular expression (conclusion)

/^https?:\/\/(www\.)?[a-zA-Z0-9:?#/@\-._~%!$&'()*+,;=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([a-zA-Z0-9:?#/@\-._~%!$&'()*+,;=]*)$/
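One way to read the pattern is to split it into named pieces and put them back together. The sketch below does that with the RegExp constructor. The piece names (SCHEME, HOST, TLD, and so on) are my own labels for illustration, not terms from the article, and the HOST label is only an approximation, since that part uses the same broad character class as the rest of the URL.

```typescript
// Allowed characters: alphanumerics plus the RFC 3986 symbols used in the pattern.
const CHARS = "[a-zA-Z0-9:?#/@\\-._~%!$&'()*+,;=]";

const SCHEME = "https?:\\/\\/";             // "http://" or "https://"
const WWW = "(www\\.)?";                    // optional "www." prefix
const HOST = `${CHARS}{1,256}`;             // 1 to 256 allowed characters before the last qualifying dot
const TLD = "\\.[a-zA-Z0-9()]{1,6}\\b";     // a dot, then 1 to 6 characters ending at a word boundary
const REST = `(${CHARS}*)`;                 // port, path, query, and hash

// Rebuilt pattern; intended to behave like the literal shown above.
const URL_PATTERN_COMPOSED = new RegExp(`^${SCHEME}${WWW}${HOST}${TLD}${REST}$`);
```

Splitting the pattern this way makes it easier to see which piece enforces which requirement, for example that the TLD piece is what rejects "https://example".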

Verifying with unit tests

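// Matches http(s) URLs made up only of the RFC 3986 characters listed earlier.
const URL_PATTERN =
  /^https?:\/\/(www\.)?[a-zA-Z0-9:?#/@\-._~%!$&'()*+,;=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([a-zA-Z0-9:?#/@\-._~%!$&'()*+,;=]*)$/;

// Returns true when the whole string matches URL_PATTERN.
function isUrl(str: string): boolean {
  return URL_PATTERN.test(str);
}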

I wrapped the pattern in a function like the one above and verified it with unit tests.

describe('isUrl', () => {
  describe('valid cases', () => {
    it.each<string>([
      'http://example.com',
      'https://example.com',
      'https://example.com/',
      'https://example.jp',
    ])('root path_success', (str: string) => assertIsValid(str));

    it.each<string>([
      'https://www.example.com',
      'https://example.example.com',
      'https://example.example.example.com',
    ])('subdomain_success', (str: string) => assertIsValid(str));

    it.each<string>([
      'https://example.com/example',
      'https://example.com/example/example',
      'https://example.com/example/example/example',
    ])('subdirectory_success', (str: string) => assertIsValid(str));

    it.each<string>([
      'https://example.com?example=example',
      'https://example.com/?example=example',
      'https://example.com?example1=example&example2=example',
      'https://example.com#example',
      'https://example.com:3000',
    ])('with query parameters, hash, or port_success', (str: string) =>
      assertIsValid(str),
    );

    it.each<string>([
      'https://example.com/example-example',
      'https://example.com/example.example',
      'https://example.com/example_example',
      'https://example.com/example~example',
    ])('contains unreserved characters_success', (str: string) => assertIsValid(str));

    it.each<string>([
      'https://example.com/123',
      "https://example.com/:?#/@%!$&'()*+,;=",
    ])('contains digits and reserved characters_success', (str: string) => assertIsValid(str));
  });

  describe('invalid cases', () => {
    it.each<string>([
      'example.com',
      'htt://example.com',
      'https//example.com',
      'https:example.com',
      'https:/example.com',
      'ftp://example.com',
    ])('invalid protocol_error', (str: string) => assertIsInvalid(str));

    it.each<string>(['https://example', 'https://example.'])(
      'no TLD_error',
      (str: string) => assertIsInvalid(str),
    );

    it.each<string>(['https://example.com/><', 'https://example.com/😃'])(
      'contains disallowed special characters_error',
      (str: string) => assertIsInvalid(str),
    );

    it.each<string>([
      '',
      'abc',
      '123',
      "<script>alert('hoge')</script>",
      "javascript:alert('hoge')",
    ])('invalid URL_error', (str: string) => {
      assertIsInvalid(str);
    });
  });

  function assertIsValid(str: string): void {
    expect(isUrl(str)).toBe(true);
  }

  function assertIsInvalid(str: string): void {
    expect(isUrl(str)).toBe(false);
  }
});

All of the tests pass, so the pattern appears to meet the requirements listed at the start.
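The tests only cover the requirements listed at the start, so it can be useful to probe a few inputs outside that list before relying on the pattern. The sketch below is one way to do so. The probe URLs are hypothetical examples of my own, not cases from the original test suite, and the pattern and function are repeated only to keep the snippet self-contained.

```typescript
// Same pattern and function as in the article.
const URL_PATTERN =
  /^https?:\/\/(www\.)?[a-zA-Z0-9:?#/@\-._~%!$&'()*+,;=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([a-zA-Z0-9:?#/@\-._~%!$&'()*+,;=]*)$/;

function isUrl(str: string): boolean {
  return URL_PATTERN.test(str);
}

// Hypothetical probe inputs (assumptions for illustration only).
const probes: string[] = [
  'https://localhost:3000',      // no dot at all, so the "TLD required" rule rejects it
  'https://example.photography', // TLD longer than the {1,6} limit
  'https://a/b.c',               // the only dot is in the path, yet it satisfies the dot check
];

// Pair each probe with the result of isUrl.
const results: Array<[string, boolean]> = probes.map(
  (url): [string, boolean] => [url, isUrl(url)],
);
// results:
//   ['https://localhost:3000', false]
//   ['https://example.photography', false]
//   ['https://a/b.c', true]
```

Whether these outcomes are acceptable depends on where the validation is used. For input that must accept long TLDs or reject dot-less hosts more strictly, the host and TLD parts of the pattern would likely need tightening.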