Contributing¶
To build woohoo pDNS (and the documentation), three additional dependencies exist:
pip-tools
nose
sphinx-rtd-theme
Caution
woohoo pDNS has pinned its dependencies! This means that the exact version is specified in dev-requirements.txt for all dependencies.
This might have undesired side effects when installing in a non-empty
environment where one of the packages woohoo pDNS depends on is already
installed.
You can easily install them using the following pip command in your development environment:
$ pip install -r dev-requirements.txt
To build the documentation after cloning the repository, run the following command in the woohoo_pdns/docs directory:
$ make html
Note: do not run:
$ sphinx-quickstart
To run the tests, issue the following command:
$ python setup.py test
And to see test coverage:
$ pytest [--cov-report html] --cov=woohoo_pdns woohoo_pdns/tests/
Managing dependencies¶
Following the advice of people with (much) more experience in that field (namely Vincent Driessen and Hynek Schlawack), woohoo pDNS pins its dependencies.
The tool used is pip-tools; the runtime dependencies are additionally pinned in Hash-Checking Mode. Here is how.
Runtime dependencies¶
Dependencies required to run woohoo pDNS are listed in the install_requires variable in setup.py:
setup(
    <snip>
    install_requires=[
        "alembic",
        "flask",
        <snap>
    ]
)
If you want to add a new runtime dependency for woohoo pDNS, this is the place to do so.
Build dependencies¶
Dependencies required to develop woohoo pDNS are listed in the dev-requirements.in file:
pip-tools
...
Using pip-tools for woohoo pDNS¶
To generate the requirements.txt file (i.e. the file listing the runtime dependencies), run the following command (you have pip-tools installed, right?):
$ pip-compile --generate-hashes
This will overwrite the current requirements.txt file and pin any newly added dependencies. Packages already pinned in an existing requirements.txt keep their versions as long as those versions still satisfy the constraints.
To check if there are newer versions of dependencies available in PyPI, use the following command:
$ pip-compile --upgrade --generate-hashes
This will overwrite the current requirements.txt file with the most recent version available on PyPI for every package. It will not add new dependencies though.
Note: pip-compile has a --dry-run command line switch.
To generate the dev-requirements.txt file (i.e. a file listing the build dependencies), run the following command:
$ pip-compile --allow-unsafe --output-file=dev-requirements.txt dev-requirements.in
This will overwrite the current dev-requirements.txt file and pin any newly added dependencies. Packages already pinned in an existing dev-requirements.txt keep their versions as long as those versions still satisfy the constraints.
To check if there are newer versions of build dependencies available in PyPI, use the following command:
$ pip-compile --upgrade --allow-unsafe --output-file=dev-requirements.txt dev-requirements.in
This will overwrite the current dev-requirements.txt file with the most recent version available on PyPI for every package. It will not add new dependencies though.
Implementing an Importer¶
I consider the need for a custom woohoo_pdns.load.Importer the most likely scenario for extending woohoo pDNS. Therefore this process is documented extensively here.
Overview¶
While the Importer is the workhorse of the data loading, it relies on another component called woohoo_pdns.load.Source to provide the records that should be loaded, one at a time.
There are currently two types of sources implemented: both read from files, but one just reads one line after the other (skipping empty lines) while the other expects to read YAML documents and therefore keeps reading until the YAML document separator (---) is encountered (or the file ends).
The former is woohoo_pdns.load.SingleLineFileSource while the latter is woohoo_pdns.load.YamlFileSource. Because both sources read data from files on disk, they are both subclasses of woohoo_pdns.load.FileSource.
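To make the difference between the two reading strategies concrete, here is a minimal, self-contained sketch. It does not use the woohoo pDNS classes; the function names iter_single_lines and iter_yaml_documents are made up for this illustration, and the real sources may handle edge cases differently.

```python
import io


def iter_single_lines(handle):
    """Yield one record per line, skipping empty lines (hypothetical helper)."""
    for line in handle:
        line = line.strip()
        if line:
            yield line


def iter_yaml_documents(handle):
    """Yield the raw text of one YAML document at a time (hypothetical helper).

    Lines are collected until the document separator (---) or the end of
    the file is reached.
    """
    buffer = []
    for line in handle:
        if line.rstrip() == "---":
            if buffer:
                yield "".join(buffer)
            buffer = []
        else:
            buffer.append(line)
    if buffer:  # the file ended without a trailing separator
        yield "".join(buffer)


# In-memory "files" stand in for files on disk.
single_records = list(iter_single_lines(io.StringIO("first\n\nsecond\n")))
yaml_records = list(iter_yaml_documents(io.StringIO("a: 1\nb: 2\n---\na: 3\n")))
```

In this sketch the YAML variant only splits the text into documents; parsing each document is left to whatever consumes the record.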
Source¶
A custom/new source is only required if the existing sources do not cover your needs. Otherwise, just writing an Importer is enough.
Requirements¶
If a new Source is implemented, it should subclass woohoo_pdns.load.Source.
Sources must be context managers (i.e. they must be usable in a with statement) and must have a method called woohoo_pdns.load.Source.get_next_record() that does not take any argument and returns a string. That string should be something the Importer can then work with.
In addition, they must implement the woohoo_pdns.load.Source.state property which allows the Importer to retrieve and restore the source’s state between batches of data loading.
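The following sketch shows the shape such a source could take. It is deliberately standalone: ListSource is a hypothetical class that does not subclass woohoo_pdns.load.Source, and the constructor arguments, the dictionary used as state and the choice to return None once the records are exhausted are all assumptions made for this example. A real source would subclass woohoo_pdns.load.Source and follow its actual interface.

```python
class ListSource:
    """Hypothetical source serving records from an in-memory list."""

    def __init__(self, records):
        self._records = list(records)
        self._position = 0

    def __enter__(self):
        # Nothing to open here; a file-based source would open its file.
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Nothing to close; returning False lets exceptions propagate.
        return False

    def get_next_record(self):
        """Return the next record as a string (None when exhausted: an assumption)."""
        if self._position >= len(self._records):
            return None
        record = self._records[self._position]
        self._position += 1
        return record

    @property
    def state(self):
        """Everything needed to resume reading later (assumed to be a dict)."""
        return {"position": self._position}

    @state.setter
    def state(self, value):
        self._position = value["position"]


with ListSource(["a", "b", "c"]) as source:
    first = source.get_next_record()
    saved = source.state             # remember where we are
    second = source.get_next_record()
    source.state = saved             # restore the remembered position
    replayed = source.get_next_record()
```

Restoring the saved state makes the source hand out the same record again, which is the behaviour an Importer would rely on when resuming between batches.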
The Importer subclass¶
Importers must be subclasses of woohoo_pdns.load.Importer. There are two important methods that every Importer must provide: _tokenise_record and _parse_tokenised_record.
The first one is called for every ‘raw’ record (i.e. whatever is returned by the Source’s woohoo_pdns.load.Source.get_next_record()) and must return a list of woohoo_pdns.util.record_data named tuples. This method can filter the record by returning an empty list.
The return value is a list because (depending on the source) a single ‘raw’ entry can lead to multiple records (e.g. when a query has multiple responses).
The second method is called for every entry in the list returned by _tokenise_record. It is mainly meant to ‘polish’ the entries, for example by parsing dates, etc.
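As an illustration of this division of labour, here is a standalone sketch. It makes several assumptions: the raw record is a JSON line with the invented keys query, type, answers and ts; RecordData is a stand-in named tuple whose fields are made up and need not match woohoo_pdns.util.record_data; and the method signatures are guesses. A real importer would subclass woohoo_pdns.load.Importer.

```python
import json
from collections import namedtuple
from datetime import datetime, timezone

# Stand-in for woohoo_pdns.util.record_data; the field names are assumptions.
RecordData = namedtuple("RecordData", ["rrname", "rrtype", "rdata", "first_seen"])


class JsonLineImporter:
    """Hypothetical importer for JSON lines (not a real woohoo pDNS class)."""

    def _tokenise_record(self, raw_record):
        """Split one raw record into zero or more tokenised records."""
        entry = json.loads(raw_record)
        if entry.get("type") not in ("A", "AAAA"):
            return []  # an empty list filters the record out
        # One query with several answers leads to several records.
        return [
            RecordData(entry["query"], entry["type"], answer, entry["ts"])
            for answer in entry["answers"]
        ]

    def _parse_tokenised_record(self, record):
        """Polish one tokenised record: normalise the name, parse the date."""
        first_seen = datetime.fromtimestamp(int(record.first_seen), tz=timezone.utc)
        return record._replace(
            rrname=record.rrname.rstrip(".").lower(),
            first_seen=first_seen,
        )


importer = JsonLineImporter()
raw = '{"query": "Example.COM.", "type": "A", "answers": ["192.0.2.1", "192.0.2.2"], "ts": "0"}'
records = [importer._parse_tokenised_record(r) for r in importer._tokenise_record(raw)]
skipped = importer._tokenise_record(
    '{"query": "example.com.", "type": "TXT", "answers": ["x"], "ts": "0"}'
)
```

Here the single raw entry produces two records (one per answer), while the TXT entry is filtered out by the empty list.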
Why this complexity?¶
The main reason for the differentiation between the two steps in loading data is that the second might depend on information that is only available after at least one record was read from the source (per batch).
Imagine for example that the exact format of the dates (timestamps) is unknown but consistent within one batch.
In a situation like this, _tokenise_record would probably not be concerned with the date format. But _parse_tokenised_record would have to (re-)determine the format for every single record, which would be inefficient.
That is why there are two more methods that can be implemented in an Importer: _inspect_raw_record and _inspect_tokenised_record.
These methods are called for the first record of every batch. In the _inspect_tokenised_record method, the Importer could establish the timestamp format which could then be used for all remaining records of the batch.
Similarly, _inspect_raw_record could be used to perform an operation on the first raw record of a batch, if required.
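The timestamp scenario can be sketched as follows. This is a standalone illustration under stated assumptions, not the woohoo pDNS implementation: records are plain dictionaries with a timestamp key, CANDIDATE_FORMATS is an invented list of formats, and load_batch is a hypothetical driver that mimics calling the inspect method once per batch.

```python
from datetime import datetime

# Formats the importer is prepared to recognise (invented for this example).
CANDIDATE_FORMATS = ("%Y-%m-%d %H:%M:%S", "%d/%m/%Y %H:%M", "%Y-%m-%dT%H:%M:%SZ")


class BatchTimestampImporter:
    """Hypothetical importer that detects the timestamp format once per batch."""

    def __init__(self):
        self._timestamp_format = None

    def _inspect_tokenised_record(self, record):
        """Establish the timestamp format from the first record of a batch."""
        for candidate in CANDIDATE_FORMATS:
            try:
                datetime.strptime(record["timestamp"], candidate)
            except ValueError:
                continue
            self._timestamp_format = candidate
            return
        raise ValueError("unknown timestamp format: %r" % record["timestamp"])

    def _parse_tokenised_record(self, record):
        """Parse the timestamp using the format established for this batch."""
        parsed = dict(record)
        parsed["timestamp"] = datetime.strptime(
            record["timestamp"], self._timestamp_format
        )
        return parsed

    def load_batch(self, records):
        """Hypothetical driver: inspect the first record, then parse all of them."""
        results = []
        for index, record in enumerate(records):
            if index == 0:
                self._inspect_tokenised_record(record)
            results.append(self._parse_tokenised_record(record))
        return results


importer = BatchTimestampImporter()
batch = [{"timestamp": "01/02/2020 10:30"}, {"timestamp": "15/02/2020 11:45"}]
parsed_batch = importer.load_batch(batch)
```

The format detection loop runs only for the first record; every later record of the batch goes straight to a single strptime call with the established format.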