This document is intended for anyone who wishes to build a reader for our proxy detection database in a currently unsupported language or who wishes to better understand how our data is structured. If you would like us to support your language feel free to contact us. Our flat file database is broken down into 4 sections (in order): headers, tree, records and record data. Each section has a specific purpose within the format.
The format was designed so that it can easily be updated or extended to meet a wide variety of needs while dramatically shrinking the size of the data on disk. Readers written against this format should try to be flexible, allowing for previously undefined columns, new bitmasks, or changes to existing bitmasks.
Our database makes frequent use of bitmasks. A bitmask is a byte in which each individual bit represents a true or a false. Thus each byte can contain 8 "binary options". As an example 00000001 would be 7 "false" values and 1 "true" value.
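As an illustration of the bitmask idea, here is a minimal Python sketch that splits one byte into its 8 true/false values. The helper name `unpack_bitmask` is hypothetical, and numbering the bits from the least significant end is an assumption made for this example.

```python
def unpack_bitmask(byte: int) -> list:
    """Return the 8 flags stored in one byte.

    Assumption: index 0 is the least significant bit. The bit
    numbering used by the database itself may differ.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("a bitmask byte must be in the range 0..255")
    return [bool((byte >> position) & 1) for position in range(8)]


flags = unpack_bitmask(0b00000001)
# One "true" value and seven "false" values, as in the example above.
print(flags.count(True), flags.count(False))
```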
Records
The documentation will frequently reference "records". A record is a blob of data about one or more IP addresses. Each record consists of 1 or 3 bytes of bitmasks followed by a variable number of columns as defined in the header section.
Columns
The documentation will frequently reference "columns". Columns are defined in the header section and a variable number of these exist in each record. Each column represents one piece of data in the file such as Country or ISP. The order, number and defined data type of columns defined in the header section exactly matches the order, number and data type found in each record. If the header defines a column that column will appear in the same spot on every record and will always be the same data type (string, int, etc.).
Pointers
The documentation will frequently reference "pointers". A pointer in the context of the database is an unsigned 32 bit integer that tells the reader to jump to another section in the file. The number corresponds to the number of bytes from the start of the file to the point the reader should jump to.
Enum
The documentation will reference "enums". An enum is a numerical map in which a number represents a string or piece of data, e.g. 1 = IP, 2 = Email, and so on.
The header section of the file describes the data found in the file and aids the reader in knowing where and how to look for records in the file. The header section of the database file consists of a few different components. The first 11 bytes of every DB file contain data required for the reader to function:
Byte 0 | Bitmask designating the Block Type of this file (IPv4/IPv6) and if the individual records start with 1 or 3 bytes of bitmasks. |
---|---|
Byte 1 | Unsigned one byte integer representing the format version this file was intended for. |
---|---|
The current file version is 1. Readers should refuse to read files that are not made for the version they were written for. Although we will try to maintain backward compatibility going forward, there is no guarantee that any version of the format will be compatible with any other version. |
Byte 2 - 4 | Unsigned three byte integer containing the total number of bytes in the header section. (Header Size) |
---|---|
These bytes, when converted to an integer, are used to tell the reader how many bytes to read to get the full header section. This is almost always 11 + (24 * number of columns). |
Byte 5 - 6 | Unsigned two byte integer containing the total number of bytes in each record. |
---|---|
These bytes are used to tell the reader how many bytes to read to get the full record length. Each record is identically sized. |
Byte 7 - 10 | Unsigned four byte integer containing the total number of bytes in the entire file. |
---|---|
This is used to tell your reader if the next position read will be outside the bounds of the file. |
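The 11 fixed header bytes described above could be decoded along these lines. This is a sketch rather than a reference implementation: the names `FixedHeader` and `parse_fixed_header` are hypothetical, and the little-endian default is an assumption, since this section does not state the byte order of multi-byte integers.

```python
from typing import NamedTuple

FIXED_HEADER_BYTES = 11
SUPPORTED_VERSION = 1


class FixedHeader(NamedTuple):
    block_type: int   # byte 0, raw bitmask (left undecoded here)
    version: int      # byte 1
    header_size: int  # bytes 2-4
    record_size: int  # bytes 5-6
    total_bytes: int  # bytes 7-10


def parse_fixed_header(data: bytes, byteorder: str = "little") -> FixedHeader:
    """Decode the first 11 bytes of a database file.

    Assumption: multi-byte integers are little-endian unless the
    caller passes byteorder="big".
    """
    if len(data) < FIXED_HEADER_BYTES:
        raise ValueError("need at least 11 bytes of header data")
    header = FixedHeader(
        block_type=data[0],
        version=data[1],
        header_size=int.from_bytes(data[2:5], byteorder),
        record_size=int.from_bytes(data[5:7], byteorder),
        total_bytes=int.from_bytes(data[7:11], byteorder),
    )
    # Readers should refuse files written for a different format version.
    if header.version != SUPPORTED_VERSION:
        raise ValueError(f"unsupported format version {header.version}")
    return header
```

A reader would typically call this once when opening the file and keep the result in memory.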
Byte 11+ | Column descriptions and column data type pairs. |
---|---|
The remainder of the header section contains a variable number of Column Pairs. Each Column Pair is 24 bytes in length. You can calculate the total number of column pairs by taking the Header Size from bytes 2 - 4, subtracting 11 (the number of guaranteed header bytes), then dividing by 24, i.e. (Header Size - 11) / 24. As an example, if the Header Size is 131, there would be 5 columns for every record in the file. Each column pair is broken down as follows:
The tree starts immediately after the last header byte. For performance reasons it is recommended to cache the Header Size and Column Pairs in memory to avoid re-reading the header on each lookup. This way your reader can jump straight to the first tree byte to start a lookup.
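The column-pair arithmetic can be sketched as follows, assuming the 11 guaranteed header bytes and 24-byte column pairs described above. The function name `column_count` is hypothetical. Because the header size is only "almost always" an exact multiple, this sketch ignores any leftover bytes instead of rejecting the file.

```python
FIXED_HEADER_BYTES = 11  # guaranteed bytes at the start of the header
COLUMN_PAIR_BYTES = 24   # size of one column pair


def column_count(header_size: int) -> int:
    """Return how many column pairs fit in a header of the given size."""
    if header_size < FIXED_HEADER_BYTES:
        raise ValueError("header size is smaller than the fixed header")
    # Floor division: leftover bytes (if any) are ignored in this sketch.
    return (header_size - FIXED_HEADER_BYTES) // COLUMN_PAIR_BYTES


# A header holding five column pairs is 11 + 24 * 5 = 131 bytes long.
print(column_count(131))
```

Since the tree begins right after the header, the cached Header Size also serves as the byte offset of the first tree byte.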
The tree is a binary tree. Your code will make a series of left or right decisions to locate the corresponding record for a given IP. The tree consists of 5 bytes of header data followed by a series of 8 byte branches. The tree's format is detailed below:
Byte 0 | Bitmask designating that this block type is a tree. |
---|---|
This byte is mostly just used for formatting reasons and can typically be ignored unless you're writing a database validator. |
Byte 1 - 4 | Unsigned four byte integer containing the total number of bytes in the tree section. (Tree Bytes) |
---|---|
Used for calculating if the next pointer is in the tree section. |
Byte 5 - 12 | First tree pointer pair. |
---|---|
The tree section from this point onward can be broken up into 8 byte segments or "nodes". Each node consists of two 4 byte unsigned integers, and each integer is a pointer to another position in the file. If the pointer is less than the highest possible byte in the tree section (Tree Start + Tree Bytes), the next position is another binary tree node. If the pointer is greater than the highest possible byte in the tree section, the next position is either a Record (if it is less than the total bytes in the file) or an invalid IP (if it is higher than the total bytes in the file). A pointer of 0 also marks an invalid IP. You can traverse this binary tree by converting any IP address into its big endian binary format. Iterate over each bit in the IP from left to right. If the bit equals 0, follow the LEFT pointer (first 4 bytes) and move to the next bit. If the bit equals 1, follow the RIGHT pointer (last 4 bytes) and move to the next bit. Remember that each pointer is the total number of bytes from the start of the file to the next read position.
As an example, the IPv4 sample file contains the IP address 8.8.0.0, which is 00001000000010000000000000000000 in binary. Thus when traversing the tree the first 4 decisions will use the LEFT pointer (the first 4 bytes) of each node, then the 5th decision will use the RIGHT pointer (the last 4 bytes) of its node. This process is repeated until your next pointer equals zero or exceeds the total (Tree Start + Tree Bytes). If you reach a pointer that points to zero, the exact IP isn't in the file. If this file has been designated as a "BlacklistFile" by the first header byte, you should error and abort the lookup. If the file is not a blacklist file, you should move the pointer back up the tree to the nearest "1", use the zero branch and assume all forward decisions are "1"s until you either land at a record or hit another 0. For a code example of this process please see the Golang Flat File DB Reader implementation. |
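The basic walk described above might look like this in Python. It is a simplified sketch, not the Golang reader's logic: the names `ip_bits` and `find_record` are hypothetical, little-endian pointers are an assumption, the length of the buffer stands in for the total file size from the header, and the back-tracking step for non-blacklist files is deliberately left out (the function returns `None` as soon as it meets a zero pointer).

```python
import ipaddress

TREE_HEADER_BYTES = 5  # 1 bitmask byte + 4 bytes of Tree Bytes
NODE_BYTES = 8         # LEFT pointer (4 bytes) + RIGHT pointer (4 bytes)


def ip_bits(address: str):
    """Yield the bits of an IPv4 or IPv6 address, most significant first."""
    ip = ipaddress.ip_address(address)
    value = int(ip)
    for shift in range(ip.max_prefixlen - 1, -1, -1):
        yield (value >> shift) & 1


def find_record(data: bytes, tree_start: int, address: str,
                byteorder: str = "little"):
    """Return the byte offset of the record for address, or None.

    Assumptions: pointers are little-endian by default, and len(data)
    equals the total file size stored in the header.
    """
    tree_bytes = int.from_bytes(data[tree_start + 1:tree_start + 5], byteorder)
    tree_end = tree_start + tree_bytes
    position = tree_start + TREE_HEADER_BYTES
    for bit in ip_bits(address):
        if position + NODE_BYTES > len(data):
            return None  # node would run past the end of the file
        offset = position + (4 if bit else 0)  # RIGHT pointer when bit is 1
        pointer = int.from_bytes(data[offset:offset + 4], byteorder)
        if pointer == 0 or pointer >= len(data):
            return None  # exact IP not present, or pointer out of bounds
        if pointer >= tree_end:
            return pointer  # past the tree section: this is a record
        position = pointer  # still inside the tree: follow to next node
    return None
```

For 8.8.0.0 the generator yields 0, 0, 0, 0, 1 as its first five bits, matching the four LEFT decisions and one RIGHT decision in the example above.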
Each record starts with 1 or 3 bytes of bitmasks for various options. If Block Type bit 7 is true in the file's headers then the record will start with 3 bytes of bitmasks as detailed below:
Byte 0 | Bitmask |
---|---|
This byte contains data about the usage of this IP address. |
Byte 1 | Bitmask |
---|---|
This byte contains data about the usage of this IP address. |
Byte 2 | Bitmask |
---|---|
This byte contains data about the type of IP address and its recent abuse. |
If Block Type bit 7 is false in the file's headers then the first and only byte of bitmasks will be as follows:
Byte 0 | Bitmask |
---|---|
This byte contains data about the type of IP address and its recent abuse. |
After the bitmasks come the Columns defined in the header section, in exactly the same order as in the header. Use the Data Type value from the column's header to determine the length of each set of data as described below:
# | Bitmask |
---|---|
0 - 1 | Reserved for future use. |
2 | Reserved for tree block use. |
3 | This column is a string pointer. This will be represented on the record as an unsigned 4 byte integer. Follow the pointer to a piece of string data. See String Data below. |
4 | This column is a small integer. Small integers are one byte unsigned integers stored directly on the record. |
5 | This column is an integer. Integers are four byte unsigned integers stored directly on the record. |
6 | This column is a float. Floats are four byte IEEE 754 single-precision values: the four stored bytes, sign bit included, are the float's binary representation. Floats are stored directly on the record. |
7 | Reserved for future use. |
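The widths in the table above translate into a small decoding helper. This is a sketch under stated assumptions: the kind labels and the name `decode_column` are hypothetical, the values are assumed to be little-endian, and working out which bit of the Data Type byte is set is left to the caller.

```python
import struct

# Bit number from the table above mapped to a hypothetical label.
KIND_BY_BIT = {3: "string_pointer", 4: "small_int", 5: "int", 6: "float"}

# struct format character for each kind: I = 4-byte unsigned,
# B = 1-byte unsigned, f = 4-byte IEEE 754 float.
_FORMATS = {"string_pointer": "I", "small_int": "B", "int": "I", "float": "f"}


def decode_column(record: bytes, offset: int, kind: str,
                  little_endian: bool = True):
    """Decode one column and return (value, offset of the next column).

    Assumption: values are little-endian unless little_endian=False.
    A "string_pointer" value is an offset into the file, not the
    string itself.
    """
    fmt = ("<" if little_endian else ">") + _FORMATS[kind]
    (value,) = struct.unpack_from(fmt, record, offset)
    return value, offset + struct.calcsize(fmt)
```

Calling it once per header column, feeding each returned offset into the next call, walks a record from the end of its bitmasks to its last column.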
At the moment there is only one type of record data not directly stored on the record: String Data. Changes to the format in the future would potentially allow for up to 4 other kinds of data to be stored in this section of the file.
The first byte of each String is an unsigned 8-bit integer. This integer specifies the length of the String in question (n). The following n bytes are the string itself.
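Following a string pointer could be sketched like this. The name `read_string` is hypothetical, and UTF-8 is an assumed encoding, since the text encoding is not stated here.

```python
def read_string(data: bytes, pointer: int, encoding: str = "utf-8") -> str:
    """Read a length-prefixed string starting at the given byte offset.

    Assumption: the string bytes are UTF-8 encoded.
    """
    if not 0 <= pointer < len(data):
        raise ValueError("string pointer is outside the file")
    length = data[pointer]  # first byte: length n of the string
    start = pointer + 1
    end = start + length
    if end > len(data):
        raise ValueError("string runs past the end of the file")
    return data[start:end].decode(encoding)
```

Because the length prefix is a single byte, a stored string can be at most 255 bytes long.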
Most reader implementations should support a default set of columns and expected value types. For reference they're listed below:
Name | Type | Description |
---|---|---|
Country | String Data | Two character country code of IP address or "N/A" if unknown. Technically this could be stored on the record for a two byte savings, however it's possible users may want a full string in the future so this has been left out. |
City | String Data | City of IP address if available or "N/A" if unknown. |
Region | String Data | Region (state) of IP address if available or "N/A" if unknown. |
ISP | String Data | ISP if one is known. Otherwise "N/A". |
Organization | String Data | Organization if one is known. Can be parent company or sub company of the listed ISP. Otherwise "N/A". |
Timezone | String Data | Timezone of IP address if available or "N/A" if unknown. |
ASN | Integer Data | Autonomous System Number if one is known. Null if nonexistent. Stored on the record. |
Zero Fraud Score | Integer Data | The "strictness" = 0 fraud score for this IP address. (See proxy documentation for details about strictness.) Stored on the record. |
One Fraud Score | Integer Data | The "strictness" = 1 fraud score for this IP address. (See proxy documentation for details about strictness.) Stored on the record. |
Two Fraud Score | Integer Data | The "strictness" = 2 fraud score for this IP address. (See proxy documentation for details about strictness.) Stored on the record. |
Latitude | Float Data | Latitude of IP address if available or "N/A" if unknown. Stored on the record. |
Longitude | Float Data | Longitude of IP address if available or "N/A" if unknown. Stored on the record. |
A sample database file is available for testing purposes here. The file is gzipped so please decompress it before use. The file contains an IPv4 and an IPv6 database. The IPv4 sample file contains the 8.8.0.0 IP address. The IPv6 sample file contains the 2001:4860:4860::8844 IP address. The sample file has a SHA256 of d2e4e0e68c84ca69cdd51af22ba43be8fb29d1a20b2e02cfe66210d4bd9aedad.
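One way to check a download against the checksum above is sketched below. The function name `sha256_of_file` is hypothetical, and whether the published SHA256 refers to the gzipped download or to the decompressed file is not stated here, so you may need to try both.

```python
import hashlib

EXPECTED_SHA256 = (
    "d2e4e0e68c84ca69cdd51af22ba43be8fb29d1a20b2e02cfe66210d4bd9aedad"
)


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing `sha256_of_file(path)` with `EXPECTED_SHA256` before parsing helps rule out a corrupted or truncated download.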