Arthur de Jong

Open Source / Free Software developer

summaryrefslogtreecommitdiffstats
path: root/CONTRIBUTING.md
blob: f6ba7a3fe16a4627acb8f390dda74ee686872141 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
Contributing to python-stdnum
=============================

This document describes general guidelines for contributing new formats or
other enhancement to python-stdnum.


Adding number formats
---------------------

Basically any number or code that has some validation mechanism available or
some common formatting is eligible for inclusion into this library. If the
only specification of the number is "it consists of 6 digits" implementing
validation may not be that useful.

Contributions of new formats or requests to implement validation for a format
should include the following:

* The format name and short description.
* References to (official) sources that describe the format.
* A one or two paragraph description containing more details of the number
  (e.g. purpose and issuer and possibly format information that might be
  useful to end users).
* If available, a link to an (official) validation service for the number,
  reference implementations or similar sources that allow validating the
  correctness of the implementation.
* A set of around 20 to 100 "real" valid numbers for testing (more is better
  during development but only around 100 will be retained for regression
  testing).
* If the validation depends on some (online) list of formats, structures or
  parts of the identifier (e.g. a list of region codes that are part of the
  number) a way to easily update the registry information should be
  available.


Code contributions
------------------

Improvements to python-stdnum are most welcome. Integrating contributions
will be done on a best-effort basis and can be made easier if the following
are considered:

* Ideally contributions are made as GitHub pull requests, but contributions
  by email (privately or through the python-stdnum-users mailing list) can
  also be considered.
* Submitted contributions will often be reformatted and sometimes
  restructured for consistency with other parts.
* Contributions will be acknowledged in the release notes.
* Contributions should add or update a copyright statement if you feel the
  contribution is significant.
* All contribution should be made with compatible applicable copyright.
* It is not needed to modify the NEWS, README.md or files under docs for new
  formats; these files will be updated on release.
* Marking valid numbers as invalid should be avoided and are much worse than
  marking invalid numbers as valid. Since the primary use case for
  python-stdnum is to validate entered data having an implementation that
  results in "computer says no" should be avoided.
* Number format implementations should include links to sources of
  information: generally useful links (e.g. more details about the number
  itself) should be in the module docstring, if it relates more to the
  implementation (e.g. pointer to reference implementation, online API
  documentation or similar) a comment in the code is better
* Country-specific numbers and codes go in a country or region package (e.g.
  stdnum.eu.vat or stdnum.nl.bsn) while global numbers go in the toplevel
  name space (e.g. stdnum.isbn).
* All code should be well tested and achieve 100% code coverage.
* Existing code structure conventions (e.g. see README for interface) should
  be followed.
* Git commit messages should follow the usual 7 rules.
* Declarative or functional constructs are preferred over an iterative
  approach, e.g.::

      s = sum(int(c) for c in number)

  over::

      s = 0
      for c in number:
          s += int(c)


Testing
-------

Tests can be run with `tox`. Some basic code style tests can be run with `tox
-e flake8` and most other targets run the test suite with various supported
Python interpreters.

Module implementations have a couple of smaller test cases that also serve as
basic documentation of the happy flow.

More extensive tests are available, per module, in the tests directory. These
tests (also doctests) cover more corner cases and should include a set of
valid numbers that demonstrate that the module works correctly for real
numbers.

The normal tests should never require online sources for execution. All
functions that deal with online lookups (e.g. the EU VIES service for VAT
validation) should only be tested using conditional unittests.


Finding test numbers
--------------------

Some company numbers are commonly published on a company's website contact
page (e.g. VAT or other registration numbers, bank account numbers). Doing a
web search limited to a country and some key words generally turn up a lot of
pages with this information.

Another approach is to search for spreadsheet-type documents with some
keywords that match the number. This sometimes turns up lists of companies
(also occasionally works for personal identifiers).

For information that is displayed on ID cards or passports it is sometimes
useful to do an image search.

For dealing with numbers that point to individuals it is important to:

* Only keep the data that is needed to test the implementation.
* Ensure that no actual other data relation to a person or other personal
  information is kept or can be inferred from the kept data.
* The presence of a number in the test set should not provide any information
  about the person (other than that there is a person with the number or
  information that is present in the number itself).

Sometimes numbers are part of a data leak. If this data is used to pick a few
sample numbers from the selection should be random and the leak should not be
identifiable from the picked numbers. For example, if the leaked numbers
pertain only to people with a certain medical condition, membership of some
organisation or other specific property the leaked data should not be used.


Reverse engineering
-------------------

Sometimes a number format clearly has a check digit but the algorithm is not
publicly documented. It is sometimes possible to reverse engineer the used
check digit algorithm from a large set of numbers.

For example, given numbers that, apart from the check digit, only differ in
one digit will often expose the weights used. This works reasonably well if
the algorithm uses modulo 11 is over a weighted sums over the digits.

See https://github.com/arthurdejong/python-stdnum/pull/203#issuecomment-623188812


Registries
----------

Some numbers or parts of numbers use validation base on a registry of known
good prefixes, ranges or formats. It is only useful to fully base validation
on these registries if the update frequency to these registries is very low.

If there is a registry that is used (a list of known values, ranges or
otherwise) the downloaded information should be stored in a data file (see
the stdnum.numdb module). Only the minimal amount of data should be kept (for
validation or identification).

The data files should be able to be created and updated using a script in the
`update` directory.