Contributing¶

To build woohoo pDNS (and the documentation), three additional dependencies are required:

pip-tools
nose
sphinx-rtd-theme

Caution

woohoo pDNS has pinned its dependencies! This means that the exact version is specified in dev-requirements.txt for all dependencies. This might have undesired side effects when installing in a non-empty environment where one of the packages woohoo pDNS depends on is already installed.

You can install the development dependencies with the following pip command in your development environment:

$ pip install -r dev-requirements.txt
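Before installing into a non-empty environment, it can help to see which pinned packages are already present in a different version. The following is only a rough sketch, not part of woohoo pDNS: the helper names parse_pins and find_conflicts are made up for illustration, and the parser only understands plain "name==version" lines.

```python
from importlib import metadata


def parse_pins(lines):
    """Return {name: version} for every 'name==version' line (simplified parser)."""
    pins = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()   # drop comments
        line = line.rstrip("\\").strip()       # drop line continuations
        if "==" not in line or line.startswith("-"):
            continue                           # skip blanks and option lines
        requirement = line.split(";", 1)[0].split()[0]  # drop environment markers
        name, version = requirement.split("==", 1)
        pins[name.split("[", 1)[0].lower()] = version   # drop extras like pkg[extra]
    return pins


def find_conflicts(pins, installed_version=metadata.version):
    """Return {name: (installed, pinned)} for packages installed in another version."""
    conflicts = {}
    for name, pinned in pins.items():
        try:
            installed = installed_version(name)
        except metadata.PackageNotFoundError:
            continue                           # not installed, so nothing to clash with
        if installed != pinned:
            conflicts[name] = (installed, pinned)
    return conflicts


if __name__ == "__main__":
    with open("dev-requirements.txt") as handle:
        for name, (installed, pinned) in find_conflicts(parse_pins(handle)).items():
            print(f"{name}: installed {installed}, pinned {pinned}")
```

An empty result means none of the pinned packages is installed in a diverging version; a fresh virtual environment avoids the question altogether.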

To build the documentation after cloning the repository, run the following command in the woohoo_pdns/docs directory:

$ make html

Note: do not run:

$ sphinx-quickstart

To run the tests, issue the following command:

$ python setup.py test

And to see test coverage:

$ pytest [--cov-report html] --cov=woohoo_pdns woohoo_pdns/tests/

Managing dependencies¶

Following the advice of people with (much) more experience in that field (namely Vincent Driessen and Hynek Schlawack), woohoo pDNS pins its dependencies.

The tool used is pip-tools (in Hash-Checking Mode for the runtime dependencies). The following sections describe how it is used.

Runtime dependencies¶

Dependencies required to run woohoo pDNS are listed in the install_requires variable in setup.py:

setup(
    <snip>
    install_requires = [
        "alembic",
        "flask",
        <snap>
    ]
)

If you want to add a new runtime dependency for woohoo pDNS, this is the place to do so.

Build dependencies¶

Dependencies required to develop woohoo pDNS are listed in the dev-requirements.in file:

pip-tools
...

Using pip-tools for woohoo pDNS¶

To generate a requirements.txt file (i.e. a file listing the runtime dependencies), run the following command (you have pip-tools installed, right?):

$ pip-compile --generate-hashes

This will overwrite the current requirements.txt file with the most recent version available on PyPI for every package and will also add new dependencies.

To check if there are newer versions of dependencies available in PyPI, use the following command:

$ pip-compile --upgrade --generate-hashes

This will overwrite the current requirements.txt file with the most recent version available on PyPI for every package. It will not add new dependencies though.

Note: pip-compile has a --dry-run command line switch that only shows what would happen without changing any file.

To generate the dev-requirements.txt file (i.e. a file listing the build dependencies), run the following command:

$ pip-compile --allow-unsafe --output-file=dev-requirements.txt dev-requirements.in

This will overwrite the current dev-requirements.txt file with the most recent version available on PyPI for every package and will also add new dependencies.

To check if there are newer versions of build dependencies available in PyPI, use the following command:

$ pip-compile --upgrade --allow-unsafe --output-file=dev-requirements.txt dev-requirements.in

This will overwrite the current dev-requirements.txt file with the most recent version available on PyPI for every package. It will not add new dependencies though.

Implementing an Importer¶

I consider the need for a custom woohoo_pdns.load.Importer the most likely scenario for extending woohoo pDNS. Therefore, this process is documented extensively here.

Overview¶

While the Importer is the workhorse of the data loading, it relies on another component called woohoo_pdns.load.Source to provide one record that should be loaded at a time.

There are currently two types of sources implemented: both read from files, but one just reads one line after the other (skipping empty lines) while the other expects to read YAML documents and therefore keeps reading until the YAML document separator (---) is encountered (or the file ends).

The former is woohoo_pdns.load.SingleLineFileSource while the latter is woohoo_pdns.load.YamlFileSource. Because both sources read data from files on disk, they are both subclasses of woohoo_pdns.load.FileSource.
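To make the difference between the two reading strategies concrete, here is a small stand-alone sketch. It is not the actual implementation of SingleLineFileSource or YamlFileSource; the function names are made up, and the records are returned as plain text without any YAML parsing.

```python
import io


def iter_single_lines(handle):
    """Yield every non-empty line as one record."""
    for line in handle:
        line = line.rstrip("\n")
        if line.strip():
            yield line


def iter_yaml_documents(handle):
    """Yield the text of each YAML document, split on the '---' separator."""
    current = []
    for line in handle:
        if line.strip() == "---":
            text = "".join(current)
            if text.strip():          # ignore empty documents
                yield text
            current = []
        else:
            current.append(line)
    text = "".join(current)
    if text.strip():                  # last document ends with the file
        yield text


if __name__ == "__main__":
    print(list(iter_single_lines(io.StringIO("first\n\nsecond\n"))))
    print(list(iter_yaml_documents(io.StringIO("---\na: 1\n---\nb: 2\n"))))
```

In both cases the caller receives exactly one record at a time, which is what the Importer relies on.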

Source¶

A custom/new source is only required if the existing sources do not cover your needs. Otherwise, just writing an Importer is enough.

Requirements¶

If a new Source is implemented, it should subclass woohoo_pdns.load.Source.

Sources must be context managers (i.e. usable in a with statement) and must have a method called woohoo_pdns.load.Source.get_next_record() that does not take any argument and returns a string. That string should be something the Importer can then work with.

In addition, they must implement the woohoo_pdns.load.Source.state property which allows the Importer to retrieve and restore the source’s state between batches of data loading.
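The requirements above could look roughly like this. The class is a stand-alone illustration that deliberately does not import woohoo pDNS; a real source should subclass woohoo_pdns.load.Source. The class name, the dictionary used for the state, and returning None at the end of the data are all assumptions made for this sketch.

```python
class InMemorySource:
    """Hypothetical source that serves records from a list instead of a file."""

    def __init__(self, records):
        self._records = list(records)
        self._position = 0

    def __enter__(self):
        # Nothing to open here; a file based source would open its file.
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Nothing to close; returning False lets exceptions propagate.
        return False

    def get_next_record(self):
        """Return the next record as a string, or None when exhausted (assumption)."""
        if self._position >= len(self._records):
            return None
        record = self._records[self._position]
        self._position += 1
        return record

    @property
    def state(self):
        """Everything needed to continue where the previous batch stopped."""
        return {"position": self._position}

    @state.setter
    def state(self, value):
        self._position = value["position"]


if __name__ == "__main__":
    data = ["record 1", "record 2", "record 3"]
    with InMemorySource(data) as source:
        print(source.get_next_record())   # first batch reads one record
        saved = source.state
    with InMemorySource(data) as source:
        source.state = saved              # next batch resumes after it
        print(source.get_next_record())
```

Saving the state after one batch and restoring it in a fresh instance is the kind of round trip the state property is meant to support.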

The Importer subclass¶

Importers must be subclasses of woohoo_pdns.load.Importer. There are two important methods that every Importer must provide: _tokenise_record and _parse_tokenised_record.

The first one is called for every ‘raw’ record (i.e. whatever is returned by the Source’s woohoo_pdns.load.Source.get_next_record()) and must return a list of woohoo_pdns.util.record_data named tuples. This method can filter the record by returning an empty list.

The return value is a list because (depending on the source) a single ‘raw’ entry can lead to multiple records (e.g. when a query has multiple responses).

The second method is called for every entry in the list returned by _tokenise_record. It is mainly meant to ‘polish’ the entries, for example by parsing dates, etc.
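The division of labour between the two methods can be sketched as follows. This is a stand-alone illustration under several assumptions: the named tuple is a stand-in for woohoo_pdns.util.record_data (whose real fields may differ), the comma separated input format is invented, and the single-argument method signatures are guessed. A real importer would subclass woohoo_pdns.load.Importer.

```python
from collections import namedtuple
from datetime import datetime, timezone

# Hypothetical stand-in for woohoo_pdns.util.record_data; real field names may differ.
RecordData = namedtuple("RecordData", ["rrname", "rdata", "first_seen"])


class CsvLikeImporter:
    """Illustrative importer for lines such as 'name,answer1|answer2,epoch'."""

    def _tokenise_record(self, raw):
        """Split one raw record into zero or more tokenised records."""
        parts = raw.split(",")
        if len(parts) != 3:
            return []                      # filter out malformed records
        rrname, answers, seen = parts
        # One raw record with several answers yields several entries.
        return [RecordData(rrname, answer, seen)
                for answer in answers.split("|") if answer]

    def _parse_tokenised_record(self, record):
        """Polish one tokenised record: normalise the name, parse the date."""
        seen = datetime.fromtimestamp(int(record.first_seen), tz=timezone.utc)
        return record._replace(rrname=record.rrname.lower().rstrip("."),
                               first_seen=seen)


if __name__ == "__main__":
    importer = CsvLikeImporter()
    for token in importer._tokenise_record("Example.COM.,192.0.2.1|192.0.2.2,0"):
        print(importer._parse_tokenised_record(token))
```

Keeping the splitting and the polishing separate means the first step stays cheap and can discard records before any parsing work is done.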

Why this complexity?¶

The main reason for the differentiation between the two steps in loading data is that the second might depend on information that is only available after at least one record was read from the source (per batch).

Imagine for example that the exact format of the dates (timestamps) is unknown but consistent within one batch.

In a situation like this, _tokenise_record would probably not be concerned with the date format. But _parse_tokenised_record would have to (re-)determine the format for every single record, which would be inefficient.

That is why there are two more methods that can be implemented in an Importer: _inspect_raw_record and _inspect_tokenised_record.

These methods are called for the first record of every batch. In the _inspect_tokenised_record method, the Importer could establish the timestamp format, which could then be used for all remaining records of the batch.

Similarly, _inspect_raw_record could be used to perform an operation on the first raw record of a batch, if required.
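The timestamp example could be sketched like this. Again, this is a stand-alone illustration and not woohoo pDNS code: plain dictionaries stand in for the tokenised records, the candidate formats are invented, and load_batch is a hypothetical driver that mimics the "inspect the first record, then parse all records" order described above.

```python
from datetime import datetime

# Invented candidate formats; a real importer would list the formats it expects.
CANDIDATE_FORMATS = ("%Y-%m-%d %H:%M:%S", "%d.%m.%Y %H:%M:%S")


class BatchImporter:
    """Illustrative importer that detects the date format once per batch."""

    def __init__(self):
        self._timestamp_format = None

    def _inspect_tokenised_record(self, record):
        """Called for the first record of a batch: find the matching format."""
        for fmt in CANDIDATE_FORMATS:
            try:
                datetime.strptime(record["first_seen"], fmt)
            except ValueError:
                continue
            self._timestamp_format = fmt
            return
        raise ValueError("unknown timestamp format: " + record["first_seen"])

    def _parse_tokenised_record(self, record):
        """Called for every record: reuse the format found during inspection."""
        parsed = dict(record)
        parsed["first_seen"] = datetime.strptime(record["first_seen"],
                                                 self._timestamp_format)
        return parsed

    def load_batch(self, tokenised_records):
        """Hypothetical driver; the real batch handling lives in the Importer."""
        results = []
        for index, record in enumerate(tokenised_records):
            if index == 0:
                self._inspect_tokenised_record(record)
            results.append(self._parse_tokenised_record(record))
        return results


if __name__ == "__main__":
    batch = [{"first_seen": "24.12.2019 18:00:00"},
             {"first_seen": "25.12.2019 06:30:00"}]
    for entry in BatchImporter().load_batch(batch):
        print(entry["first_seen"].isoformat())
```

The format detection runs once per batch, while every record is parsed with a single strptime call.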