Get company registration data for the State of Delaware.
Some people think it would be good for data about company registrations to be freely distributed and free of charge. The State of Delaware apparently doesn’t think so; Delaware blocks computers from accessing the General Information Name Search site if they make more than a few hundred requests (I’m not sure of the exact number) in a short time.
In order to download all of the data, we are thus using a swarm of computers from different IP addresses, each making very few requests to the site. You can help!
You just need to install a program and let it keep running. It will periodically contact a central server for directions, and it will query Delaware’s General Information Name Search accordingly. It is very careful to avoid being blocked, but if it detects that you are on a new IP address, it will take advantage of that.
Installing
The installation process involves running things in a terminal. Remind me to put some directions here about how to do that on Mac and Windows.
If you already have Python and Pip installed, you can just do this.
sudo pip install delaware
If you have Python but not Pip, you can download a standalone package (I still have to make this) and run the setup like so.
tar xzf delaware.tar.gz
cd delaware
sudo python setup.py install
If you don’t have Python installed, follow these directions.
If you are on any operating system other than Windows, you probably already have Python installed.
Add a note about Enthought, Continuum, &c.
Running
Once you’ve installed the program, type this into a terminal.
deleworker
It’ll ask you a few questions the first time you run it, but you can totally ignore it after that.
If errors come up
If the program stops running, please send the error message to _@thomaslevine.com. Also, please save the ~/.delaware directory, as it contains files that can be helpful for figuring out what went wrong.
How it works
I went with a worker-manager architecture, but maybe I should have gone with something less classist? Peer-to-peer connections are annoying because of port blocking of various sorts, but that would be nice because then I wouldn’t need to be responsible. Well anyway, here’s how it works.
Asking for directions
The worker contacts the manager asking for a job. It provides the following information.
- Username
Chosen by the user
- Password-like thing
Hash of a salted installation ID, which is created when the program is first run
- IP address (implicitly)
The manager is able to determine the IP address from which the request came.
The username is there so that the person can be recognized for her efforts.
The password-like thing is there to trace provenance of the data. This is mainly here in case someone fakes the data, so that I can figure out which data not to trust. It could also be helpful for debugging issues specific to certain systems.
The IP address is used for determining whether the rate limit is close to being reached. The manager directs workers not to query the Delaware site if they are approaching the rate limit. The IP address is wholly separate from the username and installation ID, as the same IP address can be shared by multiple devices associated with the same user and by devices associated with multiple users.
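To make that concrete, here’s a rough sketch of what the directions request could look like on the worker’s side. The manager URL, endpoint, and field names are made up for illustration, and the salt handling is only a sketch; they are not the actual wire format.

# Sketch of a directions request. The manager URL, endpoint, and field
# names are placeholders; only the general shape follows the description above.
import hashlib
import os

import requests

MANAGER_URL = 'http://manager.example.com'    # placeholder
CONFIG_DIR = os.path.expanduser('~/.delaware')

def password_like_thing(salt='some-salt'):
    'Hash of a salted installation ID, which was created on the first run.'
    with open(os.path.join(CONFIG_DIR, 'installation_id')) as fp:
        installation_id = fp.read().strip()
    return hashlib.sha256((salt + installation_id).encode('utf-8')).hexdigest()

def ask_for_directions(username):
    'The manager sees the IP address implicitly, from the connection itself.'
    return requests.get(MANAGER_URL + '/directions', params={
        'username': username,
        'token': password_like_thing(),
    })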
Receiving work orders
In response to the above directions request, the worker will receive either a status code of 429 (too many requests) or a status code of 200. The manager decides which one based on how many requests have come from this IP address recently.
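On the manager’s side, that decision could be as simple as the sketch below; the threshold and window are guesses, since (as the to-do list says) I don’t know the actual rate limit yet.

import time

RATE_LIMIT = 200         # requests per window; a guess, since the real limit is unknown
WINDOW_SECONDS = 3600    # also a guess

def should_return_429(recent_request_times, now=None):
    'Decide between 429 and 200 from the timestamps of recent requests from one IP address.'
    now = time.time() if now is None else now
    recent = [t for t in recent_request_times if now - t < WINDOW_SECONDS]
    return len(recent) >= RATE_LIMIT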
If the manager provides a status code of 200, it also provides the following information.
- File number
The company to look up
- An IP address
This will be passed back to the manager for rate limiting purposes.
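So a 200 response might carry a body along these lines, shown here as the parsed JSON; the field names are a guess at the format, not the real thing.

# A hypothetical work order from the manager, once parsed.
work_order = {
    'file_number': 2345678,         # the company to look up
    'ip_address': '203.0.113.42',   # the worker's own address, as seen by the manager
}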
The IP address is the worker’s own IP address, but it needed to contact the manager to figure that out.
The file number is chosen randomly (with uniform weights) from the file numbers with the fewest responses so far.
For example, all file numbers (0 to 8 million) are possible when we start because there have been zero responses so far. Soon, some file numbers will be selected, so there will be some file numbers with zero responses and some with one response. Once all file numbers have been chosen at least once, the manager will begin repeating file numbers. By repeating file numbers, we check for consistency between different responses (in case someone is trying to fake data), and we continue to update the data (in case companies change).
I chose this approach so that we can be intelligent about which file numbers we query without assigning jobs to particular workers.
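Here’s a minimal sketch of that selection, pretending that the manager keeps the response counts in a plain dictionary (it presumably keeps them somewhere more durable).

import random

def choose_file_number(response_counts):
    '''Pick uniformly at random among the file numbers with the fewest
    responses so far; response_counts maps file number -> number of responses.'''
    fewest = min(response_counts.values())
    candidates = [n for n, count in response_counts.items() if count == fewest]
    return random.choice(candidates)

# Tiny example: file number 1 has two responses, 2 and 3 have one each,
# so the next job is 2 or 3, each with probability 1/2.
choose_file_number({1: 2, 2: 1, 3: 1})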
Querying the website
Once the bot has been directed to look up a particular file number, it queries the Delaware corporations site accordingly. It goes to the starting page for the General Information Name Search (called home in the code). It enters the file number and receives a list of up to one company. (This page is called a search in the code.) It then goes to this maybe-company page (called result in the code).
At every step, the bot
- minimally parses the web page so that it may advance to the next step,
- sends information about the HTTP response to the manager, and
- pauses randomly for a time on the order of a second to avoid looking so obviously like a bot.
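The pause in the last step can be as simple as a jittered sleep; the exact bounds here are a guess.

import random
import time

def polite_pause():
    'Sleep for a time on the order of a second so the requests look less mechanical.'
    time.sleep(random.uniform(0.5, 2.0))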
When it sends the response information to the manager, it includes the following.
- “Before” IP address
The previous IP address that the manager told the worker
- Current IP address (implicitly)
The IP address that the manager currently detects from the worker
- Simplified HTTP response from Delaware
This is the main information that we are looking for.
- Whether the request appeared successful
Based on a rough parse, the worker says whether the request was successful. The manager uses this for selecting file numbers for job assignments (in the first step of the process).
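Put together, the report sent after each step could look something like this sketch. The endpoint and field names are made up, and the “simplified HTTP response” here is just the status code, final URL, and body of a requests response object.

import requests

MANAGER_URL = 'http://manager.example.com'    # same placeholder as before

def report_response(before_ip, delaware_response, looked_successful):
    '''Send a simplified version of one HTTP response from Delaware to the
    manager; the manager sees the current IP address implicitly.'''
    requests.post(MANAGER_URL + '/responses', json={
        'before_ip': before_ip,    # the IP address that the manager told the worker last time
        'response': {
            'status_code': delaware_response.status_code,
            'url': delaware_response.url,
            'body': delaware_response.text,
        },
        'success': looked_successful,    # based on the worker's rough parse
    })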
Saving information on the manager
XXX FIX THIS SECTION XXX
When the manager receives a response, it first needs to determine an additional piece of information. The worker has provided the “before” IP address; the manager now determines the “after” IP address.
Having determined this, it writes the following stuff to a simple log file.
- username
- installation ID
- before IP address
- after IP address
- serialized request
It also saves the IP address(es) in an IP address table. We maintain this table so we can avoid exceeding thresholds for IP blocking. If the before and after IP addresses are different, we conservatively count the request as having come from both addresses.
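For example, the conservative counting could work like this, with a plain dictionary standing in for the IP address table.

from collections import defaultdict

# Maps IP address -> number of recent requests counted against it.
ip_request_counts = defaultdict(int)

def count_request(before_ip, after_ip):
    '''If the before and after addresses differ, count the request against
    both, so we stay safely under the rate limit for each address.'''
    for ip in {before_ip, after_ip}:
        ip_request_counts[ip] += 1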
Finally, it parses the file number from the response and updates the sampling weights for the file number selection.
A separate process comes along later, reads the log files, and parses more information out of the responses. The involved parsing is moved to a separate task for two main reasons. First, this reduces the load on the manager. Second, we can reuse the separate task for loading backups; we don’t need to write a separate thing for that.
Waiting
The worker waits a random time on the order of seconds before repeating the above process. This way, the bots may look a bit less like bots and thus be harder to block.
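So the worker’s main loop is roughly the sketch below. ask_for_directions is the helper sketched earlier, look_up_and_report is a made-up stand-in for the home, search, and result steps, and the wait bounds are a guess.

import random
import time

def run_worker(username):
    'Roughly the main loop: ask for a job, maybe do it, wait a bit, repeat.'
    while True:
        response = ask_for_directions(username)    # sketched earlier
        if response.status_code == 200:
            work_order = response.json()
            look_up_and_report(work_order)    # hypothetical stand-in for the
                                              # home -> search -> result steps
        # Whether or not we got work, wait a random few seconds before asking again.
        time.sleep(random.uniform(2, 10))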
Questions you might have
- Why not just in-browser Javascript?
We can’t make cross-domain requests, so we’d have to inject something into the Delaware page, and that’s annoying, especially for this site.
- Doesn’t OpenCorporates already have it?
OpenCorporates doesn’t have it.
- Have people done similar things with this sort of distributed API?
Probably
- Why Python rather than something that people with Windows can run?
Because it’s easier
- Has anyone tried talking to Delaware?
Dunno
- How many companies?
Dunno, but less than 600,000
Other references
To do
- In order to avoid faking of data, enforce that the worker only complete work that it has been ordered to do. This could happen through some form of encryption or just by looking for strange patterns in the server logs.
- The rate-limit query on the database isn’t working. Fix it.
- Figure out what the actual rate limit is.