woohoo_pdns package

Submodules

woohoo_pdns.load module

class woohoo_pdns.load.DNSLogFileImporter(source_name, **kwargs)[source]

Bases: woohoo_pdns.load.FileImporter

Importer capable of reading a different source file format (JSON based).

__init__(source_name, **kwargs)[source]

To correctly initialise a file source a config dict must be supplied (see ‘cfg’ argument documentation).

Parameters
  • source_name (str) – Either the name of a file to be read or the name of a directory to scan for files to load.

  • cfg (dict) – A config dictionary that contains the following two keys:

      • file_pattern (str): The glob pattern to use when reading files from a directory.

      • rename (bool): Whether or not files should be renamed (by appending ‘.1’) after they are read.

__module__ = 'woohoo_pdns.load'
_parse_tokenised_record(tokenised_rec)[source]

Convert Unix timestamps into aware datetime objects and convert string-type rrtype values into their integer-based counterparts.

Parameters

tokenised_rec (record_data) – The record_data as tokenised by _tokenise_record()

Returns

A single-element list containing a record_data named tuple.
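
As a rough illustration of this conversion, the sketch below turns a Unix timestamp string into an aware datetime and maps an rrtype name to its numeric value. The record_data tuple defined here, the RRTYPE_IDS table and parse_tokenised() are stand-ins invented for the example, and UTC is assumed as the timezone; the actual code in woohoo_pdns.load may differ.

```python
from collections import namedtuple
from datetime import datetime, timezone

# Hypothetical stand-in for record_data; field names follow the entry.* list in these docs.
record_data = namedtuple(
    "record_data",
    ["first_seen", "last_seen", "rrtype", "rrname", "hitcount", "rdata"],
)

# Small excerpt of the DNS type registry (assumption: the real code uses a fuller table).
RRTYPE_IDS = {"A": 1, "NS": 2, "CNAME": 5, "PTR": 12, "TXT": 16, "AAAA": 28}


def parse_tokenised(rec):
    """Return a single-element list with timestamps and rrtype converted."""
    first = datetime.fromtimestamp(int(rec.first_seen), tz=timezone.utc)
    last = datetime.fromtimestamp(int(rec.last_seen), tz=timezone.utc)
    return [
        rec._replace(
            first_seen=first,
            last_seen=last,
            rrtype=RRTYPE_IDS[rec.rrtype.upper()],
            hitcount=int(rec.hitcount),
        )
    ]


tokens = record_data(
    "1562845812", "1562845812", "PTR",
    "24.227.156.213.in-addr.arpa.", "1", "mx2.mammut.ch.",
)
parsed = parse_tokenised(tokens)[0]
```

An unknown rrtype name raises KeyError in this sketch; how the real importer treats such records is not specified here.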

_tokenise_record(rec)[source]

Split a line into tokens:

{"rrclass": "IN", "ttl": 3600, "timestamp": "1562845812", "rrtype": "PTR", "rrname": "24.227.156.213.in-addr.arpa.", "rdata": "mx2.mammut.ch.", "sensor": 37690}

becomes:

tokens[0] = "1562845812"                     # first_seen
tokens[1] = "1562845812"                     # last_seen
tokens[2] = "PTR"                            # DNS type
tokens[3] = "24.227.156.213.in-addr.arpa."   # rrname
tokens[4] = "1"                              # hitcount
tokens[5] = "mx2.mammut.ch."                 # rdata

respectively:

entry.first_seen
entry.last_seen
entry.rrtype
entry.rrname
entry.hitcount
entry.rdata
Parameters

rec (str) – A record returned from the source object.

Returns

A single-entry list containing a record_data named tuple.
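
The mapping from the JSON line to the six tokens shown above could look roughly like this. It is a minimal sketch: the record_data tuple and tokenise_json_line() are hypothetical names, and returning None for unparsable input is an assumption borrowed from the base class description.

```python
import json
from collections import namedtuple

# Hypothetical stand-in for record_data; field names follow the entry.* list above.
record_data = namedtuple(
    "record_data",
    ["first_seen", "last_seen", "rrtype", "rrname", "hitcount", "rdata"],
)


def tokenise_json_line(rec):
    """Split one JSON log line into a single-entry list of record_data."""
    try:
        obj = json.loads(rec)
        timestamp = str(obj["timestamp"])
        # One log line is one sighting, so first_seen == last_seen and hitcount is "1".
        return [
            record_data(timestamp, timestamp, obj["rrtype"],
                        obj["rrname"], "1", obj["rdata"])
        ]
    except (json.JSONDecodeError, KeyError):
        # Assumption: unparsable or incomplete lines are skipped.
        return None


line = (
    '{"rrclass": "IN", "ttl": 3600, "timestamp": "1562845812", '
    '"rrtype": "PTR", "rrname": "24.227.156.213.in-addr.arpa.", '
    '"rdata": "mx2.mammut.ch.", "sensor": 37690}'
)
entry = tokenise_json_line(line)[0]
```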

class woohoo_pdns.load.DNSTapFileImporter(source_name, **kwargs)[source]

Bases: woohoo_pdns.load.FileImporter

An importer capable of reading YAML based dnstap log files.

__init__(source_name, **kwargs)[source]

To correctly initialise a file source a config dict must be supplied (see ‘cfg’ argument documentation).

Parameters
  • source_name (str) – Either the name of a file to be read or the name of a directory to scan for files to load.

  • cfg (dict) – A config dictionary that contains the following two keys:

      • file_pattern (str): The glob pattern to use when reading files from a directory.

      • rename (bool): Whether or not files should be renamed (by appending ‘.1’) after they are read.

__module__ = 'woohoo_pdns.load'
_parse_tokenised_record(tokenised_rec)[source]

Loop through all answers in the record and turn the datetimes into aware objects (using the default timezone).

Parameters

tokenised_rec (record_data) – The record_data as tokenised by _tokenise_record()

Returns

A list of record_data named tuples.

_tokenise_record(rec)[source]

Extract from YAML document:

type: MESSAGE
identity: dns.host.example.com
version: BIND 9.11.3-RedHat-9.11.3-6.el7.centos
message:
  type: RESOLVER_RESPONSE
  message_size: 89b
  socket_family: INET6
  socket_protocol: UDP
  query_address: 203.0.113.56
  response_address: 203.0.113.53
  query_port: 49824
  response_port: 53
  response_message_data:
    opcode: QUERY
    status: NOERROR
    id:  44174
    flags: qr aa
    QUESTION: 1
    ANSWER: 2
    AUTHORITY: 0
    ADDITIONAL: 0
    QUESTION_SECTION:
      - clients6.google.com. IN AAAA
    ANSWER_SECTION:
      - clients6.google.com. 300 IN CNAME clients.l.google.com.
      - clients.l.google.com. 300 IN AAAA 2a00:1450:4002:807::200e
  response_message: |
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id:  44174
    ;; flags: qr aa    ; QUESTION: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
    ;; QUESTION SECTION:
    ;clients6.google.com.       IN  AAAA

    ;; ANSWER SECTION:
    clients6.google.com.    300 IN  CNAME   clients.l.google.com.
    clients.l.google.com.   300 IN  AAAA    2a00:1450:4002:807::200e

becomes:

tokens[0] = "2018-06-18T19:22:56Z"   # first_seen
tokens[1] = "2018-06-18T19:22:56Z"   # last_seen
tokens[2] = "CNAME"                  # DNS type
tokens[3] = "clients6.google.com."   # rrname
tokens[4] = "1"                      # hitcount
tokens[5] = "clients.l.google.com."  # rdata

respectively:

entry.first_seen
entry.last_seen
entry.rrtype
entry.rrname
entry.hitcount
entry.rdata
Parameters

rec (str) – A record (YAML document as string) returned from the source object.

Returns

A list of record_data named tuples.
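
To make the "one record, several entries" idea concrete, the sketch below splits the lines of an ANSWER_SECTION into tokens. It assumes the YAML document has already been parsed elsewhere and that the sighting time is passed in separately (the excerpt above does not show where it comes from); record_data and tokenise_answers() are hypothetical names.

```python
from collections import namedtuple

# Hypothetical stand-in for record_data; field names follow the entry.* list above.
record_data = namedtuple(
    "record_data",
    ["first_seen", "last_seen", "rrtype", "rrname", "hitcount", "rdata"],
)


def tokenise_answers(answers, seen_at):
    """Turn ANSWER_SECTION lines into one record_data entry per answer."""
    entries = []
    for answer in answers:
        # Expected layout: <rrname> <ttl> <class> <rrtype> <rdata ...>
        parts = answer.split(None, 4)
        if len(parts) != 5:
            continue  # skip lines that do not match the expected layout
        rrname, _ttl, _rrclass, rrtype, rdata = parts
        entries.append(record_data(seen_at, seen_at, rrtype, rrname, "1", rdata))
    return entries


answers = [
    "clients6.google.com. 300 IN CNAME clients.l.google.com.",
    "clients.l.google.com. 300 IN AAAA 2a00:1450:4002:807::200e",
]
entries = tokenise_answers(answers, "2018-06-18T19:22:56Z")
```

Limiting the split to four separators keeps rdata that contains spaces (for example TXT records) in one piece.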

class woohoo_pdns.load.FileImporter(source_name, cfg=None, **kwargs)[source]

Bases: woohoo_pdns.load.Importer

An abstract class to handle loading data from files.

The ‘source_name’ can be a filename or a directory name on disk. If it is a file name, that file will be read. If it is a directory, all files matching the glob pattern in cfg[“file_pattern”] will be read. An exception will be thrown if the file (or directory) does not exist.

Note

Errors are written to an error file with ‘_err’ in its name. If ‘source_name’ is a file, the error file is named after that file; if ‘source_name’ is a directory, the error file is created in the parent directory of the directory that files are loaded from.

Throws:

FileNotFoundException if ‘source_name’ does not exist.

__init__(source_name, cfg=None, **kwargs)[source]

To correctly initialise a file source a config dict must be supplied (see ‘cfg’ argument documentation).

Parameters
  • source_name (str) – Either the name of a file to be read or the name of a directory to scan for files to load.

  • cfg (dict) – A config dictionary that contains the following two keys:

      • file_pattern (str): The glob pattern to use when reading files from a directory.

      • rename (bool): Whether or not files should be renamed (by appending ‘.1’) after they are read.

__module__ = 'woohoo_pdns.load'
_parse_tokenised_record(tokenised_rec)[source]

After a record was tokenised, it is passed to this method for parsing (e.g. turn a unix timestamp into a datetime, or similar).

Note

This method must be implemented by the concrete subclasses of Importer.

Parameters

tokenised_rec (record_data) – The record as tokenised by _tokenise_record().

Returns

A record_data named tuple representing the final record to load. The importer also works if this method returns None (i.e. nothing is loaded into the database and the loading process continues), but the record is still considered ‘loaded’ by the statistics.

_tokenise_record(rec)[source]

After a raw record is read from the source object, this method is called and passed the raw record to split it into the parts required for a pDNS record.

Note

This method must be implemented by the concrete subclasses of Importer.

Parameters
  • rec (str) – The record as it was returned by the source object. This string must now be split into the (different) parts of a record_data named tuple.

Returns

A record_data named tuple that represents the record to load or None if the record could not be parsed (or should be ignored).

class woohoo_pdns.load.FileSource(config, **kwargs)[source]

Bases: woohoo_pdns.load.Source

A source that reads data from files on disk.

This source can either read a single file or scan a directory for files that match a glob pattern and process all matching files from the given directory. If the filename passed in is a file, this file will be processed. If filename is a directory, the glob pattern in file_pattern will be used to find files to process in that directory.

If the optional configuration option rename is set to true (the default), RENAME_APPENDIX will be appended to the current file name after processing.

Note

The config dictionary (config) must contain the following keys:

  • filename

And the following keys are optional in the config dictionary:

  • file_pattern

  • rename
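
The file-or-directory behaviour described for FileSource can be sketched as follows. This is an illustration only: files_to_process() is a hypothetical helper, the default pattern "*" is an assumption, and renaming after processing is left out.

```python
import tempfile
from pathlib import Path


def files_to_process(filename, file_pattern="*"):
    """Return the files a source would read for the given file or directory."""
    path = Path(filename)
    if path.is_file():
        return [path]
    if path.is_dir():
        # Only regular files matching the glob pattern, in a stable order.
        return sorted(p for p in path.glob(file_pattern) if p.is_file())
    raise FileNotFoundError(filename)


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.log", "b.log", "notes.txt"):
        Path(tmp, name).write_text("data\n")
    from_dir = [p.name for p in files_to_process(tmp, "*.log")]
    from_file = [p.name for p in files_to_process(Path(tmp, "a.log"))]
```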

RENAME_APPENDIX = '1'

If files should be renamed after processing, this is what is appended to the current filename.

__enter__()[source]
__exit__(exc_type, exc_val, exc_tb)[source]
__init__(config, **kwargs)[source]
Parameters
  • config (dict) – A dictionary that can hold data the source requires to configure itself.

  • kwargs (kwargs) – These are mainly to make this a “cooperative class” according to super() considered super.

__module__ = 'woohoo_pdns.load'
__str__()[source]

Return str(self).

_open_next_file()[source]

Try to open the next file to process.

First, the currently open file will be closed and renamed, if requested. After this, the next file in the list is opened (if any).

Raises

IndexError

Returns

Nothing.

get_next_record()[source]

This method is called by the importer whenever it is ready to load the next record. What is returned will be passed into Importer._tokenise_record().

Note

Subclasses must implement this method as it is not implemented here.

Returns

A raw record from the source as string.

property state

Return a dictionary that describes the current state of the source.

The setter of this property expects a dictionary that was created by this getter and then restores the state of the source to what it was when state was retrieved.

Returns

A dictionary containing file_list (the list of all files pending processing, excluding the current file), file_name (the name of the file currently being processed) and the offset (index) into the file currently being processed (as retrieved by tell()).

class woohoo_pdns.load.Importer(source_name, data_timezone='UTC', strict=False, **kwargs)[source]

Bases: object

Importers are used to import new data into the pDNS database.

This is the super class for all importers. Different importers can import data from different sources. If no importer for a specific source is available, woohoo pDNS tries to make it simple to write a new importer for that particular source (format).

The main method of an importer is load_batch(). This method reads up to ‘batch_size’ records from the source, processes them into a list of record_data named tuples, adds some statistics and returns it.

To access the source data it uses a Source object. This object’s job is to provide a single source record at a time to the importer. This can mean reading one or several lines from a file or a record from a Kafka topic or whatever produces a source record. The importer then processes this record (possibly into multiple entries, for example if the source record contained a single query that produced multiple answers).

This base class handles the fetching of records from the source (up to a maximum of batch_size), calling the respective hooks (_inspect_raw_record(), _inspect_tokenised_record(), _tokenise_record() and _parse_tokenised_record()) which implement the actual logic for the importer (i.e. these are the methods that must be overridden in the child classes), minimal cleansing of the data and handling errors (including writing an error logfile).

IGNORE_TYPES = [0]

DNS types that we want to ignore completely (0 for example does not exist)

ILLEGAL_CHARS = ['/', '\\', '&', ':']

If any of these characters is present in rrname the record will not be loaded as these characters are not expected in rrname (they can, however, be present in rdata, for example in TXT records).

__init__(source_name, data_timezone='UTC', strict=False, **kwargs)[source]

Constructor for an importer.

Parameters
  • source_name (str) – A name that is passed to the source; can be a file name or directory name for a FileSource or, for a hypothetical KafkaSource, it could be the name of the Kafka topic to use.

  • data_timezone (str) – The name of the timezone that should be used if the source data does not provide the timezone for the dates and times (first_seen, last_seen).

  • strict (bool) – If set to true, the importer will throw an exception if something ‘odd’ is encountered in the source data. If it is set to false, the importer will write an entry in the error log and continue loading data.

  • kwargs (kwargs) – These are mainly to make this a “cooperative class” according to super() considered super.

__module__ = 'woohoo_pdns.load'
__weakref__

list of weak references to the object (if defined)

_inspect_raw_record(raw_record)[source]

For the first record of every batch this method will be called and the raw record is passed to it. This can be used if ‘something’ must be determined from source data (e.g. the datetime format).

Note

This is a NOP in Importer and meant to be overridden by subclasses if required.

Parameters

raw_record (str) – the record as it was returned from the source object.

Returns

Nothing.

_inspect_tokenised_record(tokenised_rec)[source]

For every record that was successfully tokenised (i.e. split into the required parts), this method will be called. It can be used, for example, to decide on further processing.

Note

This is a NOP in Importer and meant to be overridden by subclasses if required.

Parameters

tokenised_rec (record_data) – The record as it was tokenised by _tokenise_record().

Returns

Nothing.

_is_valid(entry)[source]

Check if the given entry is considered to be valid.

Entries with an empty rrname or rdata field are considered invalid, for example.

Parameters

entry (record_data) – the entry to check for validity.

Returns

True if the entry passed validation, False otherwise.
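
A validity check along these lines might look like the sketch below. Only the empty rrname/rdata rule is stated for _is_valid(); applying ILLEGAL_CHARS and IGNORE_TYPES in the same function is an assumption made for illustration, and record_data and is_valid() are hypothetical stand-ins.

```python
from collections import namedtuple

# Hypothetical stand-in for record_data.
record_data = namedtuple(
    "record_data",
    ["first_seen", "last_seen", "rrtype", "rrname", "hitcount", "rdata"],
)

# Values copied from the class attributes documented above.
ILLEGAL_CHARS = ["/", "\\", "&", ":"]
IGNORE_TYPES = [0]


def is_valid(entry):
    """Return True if the entry looks loadable, False otherwise."""
    if not entry.rrname or not entry.rdata:
        return False
    if entry.rrtype in IGNORE_TYPES:
        return False
    # Illegal characters are only checked in rrname; rdata may contain them.
    return not any(char in entry.rrname for char in ILLEGAL_CHARS)


good = record_data(None, None, 16, "example.com", 1, "v=spf1 include:example.net")
empty_rdata = good._replace(rdata="")
bad_name = good._replace(rrname="example.com/path")
type_zero = good._replace(rrtype=0)
```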

_parse_tokenised_record(tokenised_rec)[source]

After a record was tokenised, it is passed to this method for parsing (e.g. turn a unix timestamp into a datetime, or similar).

Note

This method must be implemented by the concrete subclasses of Importer.

Parameters

tokenised_rec (record_data) – The record as tokenised by _tokenise_record().

Returns

A record_data named tuple representing the final record to load. The importer also works if this method returns None (i.e. nothing is loaded into the database and the loading process continues), but the record is still considered ‘loaded’ by the statistics.

_tokenise_record(rec)[source]

After a raw record is read from the source object, this method is called and passed the raw record to split it into the parts required for a pDNS record.

Note

This method must be implemented by the concrete subclasses of Importer.

Parameters
  • rec (str) – The record as it was returned by the source object. This string must now be split into the (different) parts of a record_data named tuple.

Returns

A record_data named tuple that represents the record to load or None if the record could not be parsed (or should be ignored).

property has_more_data

Indicates whether the importer is (potentially) able to produce more data. This mainly means that the source can fetch at least one more record; it does not include any validity checks of that data.

Returns

True if there is more source data available, false otherwise.

load_batch(batch_size, max_failed_inarow=0)[source]

The workhorse method of Importer.

The source object (self.source) will be initialised with its config (self.src_config) and for subsequent iterations the source’s state will be restored (to what was returned by Source.state in the last iteration). Then, records will be loaded until either no more data is available or ‘batch_size’ records are ready for loading into the database.

For the first record in every batch, _inspect_raw_record() and _inspect_tokenised_record() will be called. For every record _tokenise_record() and _parse_tokenised_record() are called.

_tokenise_record() is meant to be the place where filtering of source records can occur (return None).

Parameters
  • batch_size (int) – The maximum number of records to process at once.

  • max_failed_inarow (int) – The maximum number of records that fail to import in a row before aborting the processing of this batch.

Returns

A load_batch_result named tuple. This contains some statistics and a list of record_data named tuples.
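
The control flow described for load_batch() can be sketched in a simplified, standalone form. Everything here is an assumption made for illustration: the function takes plain callables instead of using a Source object and the hook methods, max_failed_inarow is left out, and the meaning given to the converted, loaded and ignored counters is a guess rather than the library's definition.

```python
from collections import namedtuple

# Field names taken from the load_batch_result documentation.
load_batch_result = namedtuple(
    "load_batch_result", ["converted", "loaded", "ignored", "records"]
)


def load_batch(raw_records, tokenise, parse, batch_size):
    """Collect up to batch_size entries from an iterator of raw records."""
    converted = ignored = 0
    records = []
    while len(records) < batch_size:
        raw = next(raw_records, None)
        if raw is None:  # source exhausted
            break
        tokenised = tokenise(raw)
        if tokenised is None:  # filtered out or unparsable
            ignored += 1
            continue
        converted += 1
        for rec in tokenised:
            entry = parse(rec)
            if entry is not None:
                records.append(entry)
    return load_batch_result(converted, len(records), ignored, records)


lines = iter(["a|1", "", "b|2", "c|3"])
result = load_batch(
    lines,
    tokenise=lambda raw: [raw.split("|")] if raw else None,
    parse=lambda tokens: (tokens[0], int(tokens[1])),
    batch_size=2,
)
```

Because the iterator is only advanced while the batch has room, the record "c|3" stays in the source for the next batch.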

class load_batch_result(converted, loaded, ignored, records)

Bases: tuple

A named tuple that is used to pass back some statistics as well as a list of record_data

__getnewargs__()

Return self as a plain tuple. Used by copy and pickle.

__module__ = 'woohoo_pdns.load'
static __new__(_cls, converted, loaded, ignored, records)

Create new instance of load_batch_result(converted, loaded, ignored, records)

__repr__()

Return a nicely formatted representation string

__slots__ = ()
_asdict()

Return a new OrderedDict which maps field names to their values.

_fields = ('converted', 'loaded', 'ignored', 'records')
_fields_defaults = {}
classmethod _make(iterable)

Make a new load_batch_result object from a sequence or iterable

_replace(**kwds)

Return a new load_batch_result object replacing specified fields with new values

property converted

Alias for field number 0

property ignored

Alias for field number 2

property loaded

Alias for field number 1

property records

Alias for field number 3

class woohoo_pdns.load.SilkFileImporter(source_name, **kwargs)[source]

Bases: woohoo_pdns.load.FileImporter

Importer to read files produced by the SiLK security suite.

Note

This is a subclass of FileImporter as it reads data from files on disk. There are many ways to get the files, for example with the ‘rwsender’ program included in the SiLK suite.

__init__(source_name, **kwargs)[source]

To correctly initialise a file source a config dict must be supplied (see ‘cfg’ argument documentation).

Parameters
  • source_name (str) – Either the name of a file to be read or the name of a directory to scan for files to load.

  • cfg (dict) – A config dictionary that contains the following two keys:

      • file_pattern (str): The glob pattern to use when reading files from a directory.

      • rename (bool): Whether or not files should be renamed (by appending ‘.1’) after they are read.

__module__ = 'woohoo_pdns.load'
_inspect_tokenised_record(tokenised_rec)[source]

Sometimes, the time in the input has millisecond resolution (for the whole source file). If so, adjust the parsing format to account for this.

Parameters

tokenised_rec (record_data) – The record as tokenised by _tokenise_record().

Returns

Nothing.
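
One way to choose between the two timestamp layouts is to look for a fractional part in the first_seen token. The sketch below assumes strptime-style format strings; the constant names and pick_time_format() are hypothetical and not taken from the library.

```python
from datetime import datetime

# Hypothetical format strings for the two resolutions.
FORMAT_SECONDS = "%Y-%m-%d %H:%M:%S"
FORMAT_MILLIS = "%Y-%m-%d %H:%M:%S.%f"


def pick_time_format(first_seen):
    """Pick the parsing format depending on whether a fractional part is present."""
    return FORMAT_MILLIS if "." in first_seen else FORMAT_SECONDS


fmt = pick_time_format("2019-05-13 18:12:44.374")
parsed = datetime.strptime("2019-05-13 18:12:44.374", fmt)
```

The resulting datetime is still naive; attaching the data timezone would be a separate step.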

_parse_tokenised_record(tokenised_rec)[source]

Mainly convert the date and time (strings) into aware datetime objects.

Parameters

tokenised_rec (record_data) – The record_data as tokenised by _tokenise_record()

Returns

A single-element list containing a record_data named tuple.

_tokenise_record(rec)[source]

Split a line into tokens:

2019-05-13 18:12:44.374|2019-05-13 18:12:44.374|28|gateway.fe.apple-dns.net|1|2a01:b740:0a41:0603::0010

becomes:

tokens[0] = "2019-05-13 18:12:44.374"    # first_seen
tokens[1] = "2019-05-13 18:12:44.374"    # last_seen
tokens[2] = "28"                         # DNS type
tokens[3] = "gateway.fe.apple-dns.net"   # rrname
tokens[4] = "1"                          # hitcount
tokens[5] = "2a01:b740:0a41:0603::0010"  # rdata

respectively:

entry.first_seen
entry.last_seen
entry.rrtype
entry.rrname
entry.hitcount
entry.rdata
Parameters

rec (str) – A record returned from the source object.

Returns

A single-entry list containing a record_data named tuple.
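
The split shown above amounts to cutting the line at each pipe character. A minimal sketch, with record_data and tokenise_silk_line() as hypothetical names and None returned for lines that do not have exactly six fields (an assumption):

```python
from collections import namedtuple

# Hypothetical stand-in for record_data; field names follow the entry.* list above.
record_data = namedtuple(
    "record_data",
    ["first_seen", "last_seen", "rrtype", "rrname", "hitcount", "rdata"],
)


def tokenise_silk_line(rec):
    """Split one pipe-separated line into a single-entry list of record_data."""
    tokens = rec.strip().split("|")
    if len(tokens) != 6:
        return None  # assumption: malformed lines are skipped
    return [record_data(*tokens)]


line = (
    "2019-05-13 18:12:44.374|2019-05-13 18:12:44.374|28|"
    "gateway.fe.apple-dns.net|1|2a01:b740:0a41:0603::0010"
)
entry = tokenise_silk_line(line)[0]
```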

class woohoo_pdns.load.SingleLineFileSource(config, **kwargs)[source]

Bases: woohoo_pdns.load.FileSource

A file source that reads a single line from a file at a time.

__module__ = 'woohoo_pdns.load'
get_next_record()[source]

Read a single line from a source file (skipping empty lines).

If no line is left, try to open the next file (if available) and read a line from there.

Returns

see FileSource.get_next_record().

class woohoo_pdns.load.Source(config, **kwargs)[source]

Bases: object

Source object(s) abstract the logic of fetching a ‘single record’ from a source.

For files, this can mean reading one or several lines (e.g. a YAML document), for other sources (e.g. an imaginary Kafka source) this could mean querying a service or calling an API or …

__enter__()[source]
__exit__(exc_type, exc_val, exc_tb)[source]
__init__(config, **kwargs)[source]
Parameters
  • config (dict) – A dictionary that can hold data the source requires to configure itself.

  • kwargs (kwargs) – These are mainly to make this a “cooperative class” according to super() considered super.

__module__ = 'woohoo_pdns.load'
__weakref__

list of weak references to the object (if defined)

get_next_record()[source]

This method is called by the importer whenever it is ready to load the next record. What is returned will be passed into Importer._tokenise_record().

Note

Subclasses must implement this method as it is not implemented here.

Returns

A raw record from the source as string.

property state

A source can have ‘state’ which allows it to resume at the correct next record after a batch of data was processed.

Note

Importers will request state from the source when a batch is about to be finished and will pass whatever the source provided back to the source before starting the next batch.

For a source reading from a file this can for example mean to return the value of tell() and then seek() to this position when state is passed in again.
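
The tell()/seek() idea can be tried out with an in-memory file. The LineSource class below is a hypothetical, stripped-down source written for this example; it is not the library's FileSource, and its state dictionary has only one key.

```python
import io


class LineSource:
    """Toy source that reads lines and can save and restore its position."""

    def __init__(self, handle):
        self._handle = handle

    def get_next_record(self):
        line = self._handle.readline()
        return line.rstrip("\n") if line else None

    @property
    def state(self):
        # Remember where the next read would start.
        return {"offset": self._handle.tell()}

    @state.setter
    def state(self, value):
        # Jump back to a previously saved position.
        self._handle.seek(value["offset"])


source = LineSource(io.StringIO("first\nsecond\nthird\n"))
first = source.get_next_record()
saved = source.state          # taken at the end of a "batch"
second = source.get_next_record()
source.state = saved          # restored before the next "batch"
resumed = source.get_next_record()
```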

exception woohoo_pdns.load.WoohooImportError[source]

Bases: Exception

__module__ = 'woohoo_pdns.load'
__weakref__

list of weak references to the object (if defined)

class woohoo_pdns.load.YamlFileSource(config, **kwargs)[source]

Bases: woohoo_pdns.load.FileSource

Read a YAML document from a file on disk.

__module__ = 'woohoo_pdns.load'
get_next_record()[source]

This method is called by the importer whenever it is ready to load the next record. What is returned will be passed into Importer._tokenise_record().

Note

Subclasses must implement this method as it is not implemented here.

Returns

A raw record from the source as string.

woohoo_pdns.meta module

class woohoo_pdns.meta.LookupDict(name=None)[source]

Bases: dict

Dictionary lookup object.

TODO: understand this… https://github.com/kennethreitz/requests/blob/master/requests/structures.py

__getitem__(key)[source]

x.__getitem__(y) <==> x[y]

__init__(name=None)[source]

Initialize self. See help(type(self)) for accurate signature.

__module__ = 'woohoo_pdns.meta'
__repr__()[source]

Return repr(self).

__weakref__

list of weak references to the object (if defined)

get(key, default=None)[source]

Return the value for key if key is in the dictionary, else default.

woohoo_pdns.meta._init()[source]

woohoo_pdns.pdns module

class woohoo_pdns.pdns.Database(db_url)[source]

Bases: object

The Database object is the interface to the database holding pDNS records.

This object is designed as a context manager; it can be used in a with statement.

__enter__()[source]
__exit__(exc_type, exc_value, traceback)[source]
__init__(db_url)[source]

Initialise the connection to the database.

Parameters

db_url (string) – The URL to the database, e.g. postgresql+psycopg2://user:password@hostname/database_name

__module__ = 'woohoo_pdns.pdns'
__weakref__

list of weak references to the object (if defined)

_query_for_ip(q)[source]

Query the “rdata” for an IP address.

Parameters

q (str) – the IP address (as a string) to search for.

Returns

A list of Record objects for records found (can be empty)

_query_for_name(q, rdata)[source]

Query the “rrname” or the “rdata” in the DB.

Note

This is for string queries only (no IP address queries).

Parameters
  • q (str) – The search term, can contain “*” as a wildcard.

  • rdata (bool) – If True and the query is a text query, search the right hand side instead of the left hand side.

Returns

A list of Record objects for records found (can be empty)

add_record(rrtype, rrname, rdata, first_seen=None, last_seen=None, num_hits=1)[source]

Add a (new) record to the database.

If a record with that rrtype, rrname, rdata already exists in the database, the hitcount is increased by num_hits, first_seen or last_seen are updated if necessary and the existing object is returned. Otherwise a new object will be created and returned (with hitcount 1, first_seen = last_seen = sighted_at (or “now” if sighted_at is not provided)).

Parameters
  • rrtype (int) – the id for the DNS record type (e.g. 1 for A, 28 for AAAA, etc. See https://en.wikipedia.org/wiki/List_of_DNS_record_types)

  • rrname (string) – the “left hand side” of the record; a trailing dot will be removed

  • rdata (string) – the “right hand side” of the record; a trailing dot will be removed

  • first_seen (datetime) – the date and time of the first (oldest) sighting; if omitted and also no last_seen is provided “now” will be used

  • last_seen (datetime) – the date and time of the most recent sighting; if omitted and also no first_seen is provided “now” will be used

  • num_hits (int) – the number of times this record was seen (will be added to an existing record’s hitcount)

Returns

A Record object representing this record.
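
The update-or-create behaviour described for add_record() can be modelled with a plain dictionary instead of a database. This is a sketch under several assumptions: the store, the strip_dot() helper and the dictionary layout are invented for the example, a new record starts with hitcount num_hits, and a missing first_seen or last_seen falls back to the other value.

```python
from datetime import datetime, timezone


def strip_dot(name):
    """Remove a single trailing dot, if present."""
    return name[:-1] if name.endswith(".") else name


def add_record(store, rrtype, rrname, rdata,
               first_seen=None, last_seen=None, num_hits=1):
    """Insert a record or merge a new sighting into an existing one."""
    if first_seen is None and last_seen is None:
        first_seen = last_seen = datetime.now(timezone.utc)
    first_seen = first_seen or last_seen  # assumption: reuse the other value
    last_seen = last_seen or first_seen
    key = (rrtype, strip_dot(rrname), strip_dot(rdata))
    record = store.get(key)
    if record is None:
        record = {"first_seen": first_seen, "last_seen": last_seen,
                  "hitcount": num_hits}
        store[key] = record
    else:
        record["hitcount"] += num_hits
        record["first_seen"] = min(record["first_seen"], first_seen)
        record["last_seen"] = max(record["last_seen"], last_seen)
    return record


store = {}
early = datetime(2019, 1, 1, tzinfo=timezone.utc)
late = datetime(2019, 6, 1, tzinfo=timezone.utc)
add_record(store, 1, "example.com.", "192.0.2.1", early, early)
merged = add_record(store, 1, "example.com", "192.0.2.1", late, late, num_hits=3)
```

Both calls address the same key because the trailing dot is removed before the lookup.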

close()[source]

Close the connection to the database. It is important to call this method after you are done. Will be called automagically when used with the context manager.

property count

The total number of pDNS records in the database.

find_record(rrtype, rrname, rdata=None)[source]

Search for a record (by type and left hand side, optionally also right hand side).

Parameters
Returns

The Record object representing the record.

Raises

NoResultFound

load(source_name, batch_size=10000, cfg=None, data_timezone='UTC', strict=False, loader='woohoo_pdns.load.SilkFileImporter')[source]

Load data into the database.

The actual work is done by the class referenced in the “loader” argument.

Parameters
  • source_name (str) – The directory or filename or other reference to the source (e.g. a Kafka topic name) where data should be loaded from.

  • batch_size (int) – For more efficient loading into the database, records are inserted/updated in batches; this defines the maximum number of records to process at once.

  • cfg (dict) – A dictionary with config items that will be passed to the constructor of the Importer.

  • strict (bool) – If true, abort loading if “errors” are detected in the input. If false, try to “fix” the error(s) and/or to continue loading remaining data. Default is False.

  • data_timezone (timezone string) – If source data without a timezone specification is found, assume the timezone is this.

  • loader (Importer) – Defines what class is used for the actual loading of data.

property most_recent

The most recent record in the database, i.e. the one with the most recent “last_seen” datetime.

query(q, rdata=False)[source]

Issue a query against the database.

When

Parameters
  • q (str) – the query, can be an IP address (v4 or v6) or text.

  • rdata (bool) – text queries look for matches on the “left hand side” (rrname) unless this option is set which makes the query search for matches on the “right hand side”. Use it for example to search for domains that share a common name server (NS record). For IP address queries, this is ignored; defaults to False.

Returns

A list of records that matches the query term.

Throws:

MissingEntry if no match is found for the query.
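
How a query string might be routed to an IP lookup or a text lookup can be sketched with the standard library. Both helpers are hypothetical: whether the library uses the ipaddress module, or translates “*” into an SQL LIKE wildcard, is an assumption.

```python
import ipaddress


def classify_query(q):
    """Return "ip" for IPv4/IPv6 addresses and "name" for everything else."""
    try:
        ipaddress.ip_address(q)
    except ValueError:
        return "name"
    return "ip"


def wildcard_to_like(q):
    """Translate the "*" wildcard into the "%" wildcard of SQL LIKE patterns."""
    return q.replace("*", "%")


kinds = [
    classify_query("203.0.113.53"),
    classify_query("2a00:1450:4002:807::200e"),
    classify_query("*.example.com"),
]
pattern = wildcard_to_like("*.example.com")
```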

property records

A list of all pDNS records in the database.

exception woohoo_pdns.pdns.InvalidEntry[source]

Bases: ValueError

When SQLAlchemy fails to commit a record to the database, this exception is raised.

The details produced by SQLAlchemy will be included in the exception’s description.

__module__ = 'woohoo_pdns.pdns'
__weakref__

list of weak references to the object (if defined)

exception woohoo_pdns.pdns.MissingEntry[source]

Bases: ValueError

When a query does not yield any result, this exception is raised.

__module__ = 'woohoo_pdns.pdns'
__weakref__

list of weak references to the object (if defined)

class woohoo_pdns.pdns.Record(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

Database representation of a record in the pDNS system.

A record can be of any DNS type (A, AAAA, TXT, PTR, …) and has a “left side” (rrname) and a “right side” (rdata). More information about “left hand side” and “right hand side” is available on the Farsight website for example.

first_seen

The date and time (incl. timezone) when a record was first seen by this pDNS system.

Type

DateTime

last_seen

The date and time (incl. timezone) when a record was last seen by this pDNS system (i.e. the most recent “sighting”).

Type

DateTime

rrtype

The type of the record (A, AAAA, TXT, …) according to the official list of DNS types.

Type

int

hitcount

The number of times this record was “sighted” by this pDNS system.

Type

int

__init__(**kwargs)

The init method is just setting up a logger for the class.

The kwargs just make it a “cooperative class” in the sense of Raymond Hettinger’s article “Python’s super() considered super!”.

__mapper__ = <Mapper at 0x7f13d26b2dd8; Record>
__module__ = 'woohoo_pdns.pdns'
__repr__()[source]

Return repr(self).

__table__ = Table('record', MetaData(bind=None), Column('first_seen', DateTime(timezone=True), table=<record>, nullable=False), Column('last_seen', DateTime(timezone=True), table=<record>, nullable=False), Column('rrtype', Integer(), table=<record>, primary_key=True, nullable=False), Column('_rrname', String(length=270), table=<record>, primary_key=True, nullable=False), Column('hitcount', Integer(), table=<record>, nullable=False, default=ColumnDefault(1)), Column('_rdata', String(length=300), table=<record>, primary_key=True, nullable=False), schema=None)
__tablename__ = 'record'
_rdata
_rrname
_sa_class_manager = {'_rdata': <sqlalchemy.orm.attributes.InstrumentedAttribute object>, '_rrname': <sqlalchemy.orm.attributes.InstrumentedAttribute object>, 'first_seen': <sqlalchemy.orm.attributes.InstrumentedAttribute object>, 'hitcount': <sqlalchemy.orm.attributes.InstrumentedAttribute object>, 'last_seen': <sqlalchemy.orm.attributes.InstrumentedAttribute object>, 'rrtype': <sqlalchemy.orm.attributes.InstrumentedAttribute object>}
ensure_aware_dt()[source]

When reconstructing a Record from the database, ensure that the datetimes (first_seen and last_seen) are “aware” objects (i.e. have a timezone).

This is mainly an issue when using sqlite (e.g. for testing) as sqlite does not store timezone information. In case the timezone information is missing, UTC is assumed and added.
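
A minimal sketch of the idea, using only the standard library. The function name ensure_aware is an assumption for illustration; the real method works on the Record’s own first_seen and last_seen attributes.

```python
from datetime import datetime, timezone


def ensure_aware(dt):
    """Return dt unchanged if it is aware, otherwise attach UTC."""
    if dt.tzinfo is None or dt.tzinfo.utcoffset(dt) is None:
        # replace() only labels the value as UTC; it does not shift the time.
        return dt.replace(tzinfo=timezone.utc)
    return dt


naive = datetime(2019, 7, 11, 11, 50, 12)
aware = ensure_aware(naive)
```

Because the clock time is not shifted, this is only correct if the stored values really were UTC, which is the assumption stated above.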

first_seen
hitcount
last_seen
property rdata

The “rdata”, i.e. the “right hand side” of the record (cf. class attribute documentation).

property rrname

The “rrname”, i.e. the “left hand side” of the record (cf. class attribute documentation).

Note

When setting this property, the value will be sanitized by woohoo_pdns.util.sanitise_input(); this means that a trailing dot will be removed unless the value is just a dot.

rrtype
to_dict()[source]

Convert the record object to a dictionary representation that is suitable for SQLAlchemy bulk operations.

to_jsonable()[source]

Convert the record object to a JSON-friendly dictionary representation.

Note

This dict is compatible with the Passive DNS - Common Output Format.
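
For orientation, a dictionary using the field names of the Passive DNS Common Output Format could be built as sketched below. The helper to_cof and the RRTYPE_NAMES excerpt are assumptions for illustration; which optional fields the real to_jsonable() emits is not stated here.

```python
import json
from datetime import datetime, timezone

# Small excerpt of the IANA DNS type registry (number -> mnemonic).
RRTYPE_NAMES = {1: "A", 2: "NS", 5: "CNAME", 12: "PTR", 16: "TXT", 28: "AAAA"}


def to_cof(first_seen, last_seen, rrtype, rrname, hitcount, rdata):
    """Build a dict that uses Common Output Format field names."""
    return {
        "rrname": rrname,
        "rrtype": RRTYPE_NAMES.get(rrtype, str(rrtype)),
        "rdata": rdata,
        # The format expresses times as UNIX timestamps in seconds.
        "time_first": int(first_seen.timestamp()),
        "time_last": int(last_seen.timestamp()),
        "count": hitcount,
    }


seen = datetime.fromtimestamp(1562845812, tz=timezone.utc)
entry = to_cof(seen, seen, 12, "24.227.156.213.in-addr.arpa", 1, "mx2.mammut.ch")
encoded = json.dumps(entry)
```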

update(rec)[source]

Update a record with (potentially) new information from a different record.

This means updating (adding) the hitcount as well as updating first_seen and/or last_seen if required.

Parameters

rec (Record) – The record to take the new information from.

woohoo_pdns.util module

class woohoo_pdns.util.LoaderCache[source]

Bases: object

This class implements the cache used when loading entries into the database.

Because pDNS databases have to ingest high volumes of data with high redundancy (never seen before entries are comparatively rare) it can be expected that caching substantially improves performance.

The cache internally holds values in dictionaries with a key derived from the actual data. To add records to the cache the named tuple ‘record_data’ should be used.

Note

When adding a record to the cache, four modes are available: cache_only, new, updated and auto. For a description of the modes, see the documentation of the “modes” named tuple.

MODES = cache_modes(cache_only=1, new=2, updated=4, auto=8)
__contains__(item)[source]

Checks if the cache contains the entry represented by the named tuple (or dict) passed in.

Parameters

item (record_data) – The record which should be checked for presence in the cache.

Returns

True if the item is in the cache, False otherwise.

__init__()[source]

Initialize self. See help(type(self)) for accurate signature.

__module__ = 'woohoo_pdns.util'
__weakref__

list of weak references to the object (if defined)

_add_to_cache_only(item)[source]
_add_to_new(item)[source]
_add_to_update(item)[source]
static _dictionise(item)[source]

Convert ‘item’ (a record_data named tuple) into a dictionary { key: item }.

Parameters

item (record_data) – the item to ‘convert’ into a dictionary

Returns

dict with one key (item.key, if it was set) and item as its value

static _tupelise(item_key, item_value)[source]

Convert ‘item_key, item_value’ (value of type dictionary) into a record_data named tuple.

Parameters
  • item_key (str) – the key that is used for the ‘value dict’

  • item_value (dict) – a dictionary that holds the relevant data to create a record_data named tuple

Returns

The item that results from ‘converting’ the dictionary.

Return type

record_data
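
The round trip between the two helpers can be sketched as follows. The field layout of record_data, the key format in make_key, and the choice to store the values as a plain dict are all assumptions made for illustration; the real implementation may differ on each point.

```python
from collections import namedtuple

# Assumed field layout; the real record_data tuple may differ.
record_data = namedtuple(
    "record_data",
    ["first_seen", "last_seen", "rrtype", "rrname", "hitcount", "rdata", "key"],
)


def make_key(item):
    """Hypothetical key built from the columns of the table's primary key."""
    return item.key or f"{item.rrtype}|{item.rrname}|{item.rdata}"


def dictionise(item):
    """Return {key: values}, with the values held in a plain dict."""
    key = make_key(item)
    values = item._asdict()
    values.pop("key")  # the key lives outside the value dict
    return {key: values}


def tupelise(item_key, item_value):
    """Rebuild a record_data tuple from a key and its value dict."""
    return record_data(key=item_key, **item_value)


item = record_data(
    1562845812, 1562845812, 12,
    "24.227.156.213.in-addr.arpa", 1, "mx2.mammut.ch", None,
)
as_dict = dictionise(item)
```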

add(item, mode=8)[source]

Add a new item to the cache.

Parameters
  • item (record_data) – The representation of the item to add to the cache.

  • mode (mode) – What mode to use (see documentation of mode for details) if the record is not yet in the cache.

Returns

Nothing.

clear()[source]

Clear the cache (i.e. remove all entries).

Returns

Nothing.

get_new_entries(for_bulk=False)[source]

Return the list of items that are considered not to be present in the pDNS database yet.

Note

The main reason for differentiating between new and updated entries (with respect to the pDNS database, not the cache) is to allow bulk operations in SQLAlchemy; it must be known if ‘INSERT’ or ‘UPDATE’ statements should be used.

Parameters

for_bulk (bool) –

If true, a list of dictionaries will be returned (suitable for SQLAlchemy bulk operations); if false, a list of record_data named tuples will be returned. For more information about SQLAlchemy bulk operations, see the SQLAlchemy documentation on bulk operations.

Returns

A list of either named tuples or dictionaries (see ‘for_bulk’ argument).

get_to_update(for_bulk=False)[source]

The same as get_new_entries() but for entries considered to already be present in the pDNS database (not necessarily the cache).

Parameters

for_bulk (bool) – see argument with the same name documented for get_new_entries()

Returns

A list of either named tuples or dictionaries (see ‘for_bulk’ argument).

static merge(existing, new)[source]

Merge two cached items: add the existing item’s hitcount to the new item’s hitcount, set the new item’s first_seen to the earlier of the two first_seen values, and set its last_seen to the later of the two last_seen values.

Parameters
  • existing (dict) – An item present in the cache

  • new (dict) – An item that should be updated with the info already present in the cache.

Returns

Nothing, the new item will be updated in place.
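
A minimal sketch of this merge, assuming both items are dictionaries with the keys hitcount, first_seen and last_seen and that the time values are comparable (here plain integer timestamps):

```python
def merge(existing, new):
    """Fold the cached item's counter and time range into new, in place."""
    new["hitcount"] += existing["hitcount"]
    new["first_seen"] = min(new["first_seen"], existing["first_seen"])
    new["last_seen"] = max(new["last_seen"], existing["last_seen"])


cached = {"first_seen": 100, "last_seen": 200, "hitcount": 5}
incoming = {"first_seen": 150, "last_seen": 300, "hitcount": 1}
merge(cached, incoming)
```

After the call, incoming covers the whole observed time range and carries the combined count, while cached is left untouched.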

modes

The mode is relevant when adding records to the cache that are not already present in the cache.

  • cache_only – should only be used to (pre)populate the cache. This is mainly useful if the ‘auto’ mode should be used later on.

  • auto – assumes that the cache already holds all relevant entries; therefore, when adding an entry it will be cached as ‘new’ if it was not present in the cache before and as ‘updated’ if it already was in the cache.

  • new – the entry will be considered to be new (i.e. it will be returned by get_new_entries()).

  • updated – the entry will be considered to already be known by the pDNS database (but not necessarily the cache), i.e. it will be returned by get_to_update().

alias of cache_modes

rollover()[source]

Should be called after the pDNS database is updated with the currently cached entries (i.e. after the bulk operations are done for the lists returned by get_new_entries() and get_to_update()).

This will ‘move’ all cached entries into the ‘cache_only’ status, indicating that they are ‘known’ but not ‘dirty’ (in a cache’s way of using that word).

Returns

Nothing.
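
The interplay of modes, the two “dirty” lists and rollover() can be sketched with a toy cache that only tracks hit counts per key. ToyCache is not the real LoaderCache; in particular, moving a ‘cache_only’ entry to ‘updated’ when it is seen again is an assumption about the intended behaviour.

```python
from collections import namedtuple

cache_modes = namedtuple("cache_modes", ["cache_only", "new", "updated", "auto"])
MODES = cache_modes(cache_only=1, new=2, updated=4, auto=8)


class ToyCache:
    """Tracks hit counts per key in three buckets; not the real LoaderCache."""

    def __init__(self):
        self.cache_only, self.new, self.updated = {}, {}, {}

    def __contains__(self, key):
        return any(key in b for b in (self.cache_only, self.new, self.updated))

    def add(self, key, hits=1, mode=MODES.auto):
        if key in self.new:
            self.new[key] += hits  # still unknown to the database
        elif key in self.updated:
            self.updated[key] += hits
        elif key in self.cache_only:
            # Assumption: a clean entry that is seen again becomes dirty.
            self.updated[key] = self.cache_only.pop(key) + hits
        elif mode == MODES.cache_only:
            self.cache_only[key] = hits
        elif mode == MODES.updated:
            self.updated[key] = hits
        else:  # MODES.new, or MODES.auto on a cache miss
            self.new[key] = hits

    def rollover(self):
        """Mark everything as known and clean after the database write."""
        self.cache_only.update(self.new)
        self.cache_only.update(self.updated)
        self.new.clear()
        self.updated.clear()


cache = ToyCache()
cache.add("a.example", mode=MODES.cache_only)  # preload from the database
cache.add("a.example")  # known entry seen again
cache.add("b.example")  # never seen before
to_insert, to_update = dict(cache.new), dict(cache.updated)
cache.rollover()
```

In this sketch to_insert and to_update play the role of the lists returned by get_new_entries() and get_to_update(); once they have been written to the database, rollover() leaves every entry cached but clean.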

woohoo_pdns.util.record_data

A named tuple holding the values of a single entry in the pDNS database.

Note

first_seen and last_seen can be either datetimes or timestamps (integers), but the choice must be applied consistently. The ‘key’ field can be left empty; it will then be auto-populated by the cache in a consistent way. If it is non-empty the passed in key is kept and it is the caller’s responsibility to guarantee uniqueness of the key(s).

alias of woohoo_pdns.util.pdns_entry_tokens

woohoo_pdns.util.record_to_nt(rec)[source]

Takes a dictionary style cache entry and returns a corresponding named tuple.

Note

The “key” is not set because the other functions in this module do not require it (DRY).

woohoo_pdns.util.sanitise_input(str_in)[source]

DNS entries technically end in a dot but for pDNS purposes the dot is mainly cruft, so we remove it.

Note

If the input string to this function is just a dot, it is kept. While a single dot might be ‘surprising’ it is still better than an empty string.
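
The documented rule can be sketched in a few lines. The name strip_trailing_dot is hypothetical; whether the real sanitise_input() also handles whitespace or several trailing dots is not covered by this documentation.

```python
def strip_trailing_dot(name):
    """Remove one trailing dot, but keep a lone "." (the DNS root)."""
    if name != "." and name.endswith("."):
        return name[:-1]
    return name


cleaned = strip_trailing_dot("mx2.mammut.ch.")
```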

Module contents