Downloading files for a jurisdiction will return two types of files: raw and cleaned. Raw files contain the text in its original format as collected from that jurisdiction, prior to any cleaning or other processing. This could be, for instance, the text scraped from a webpage, a text file, a Word document, or a PDF.
Cleaned files contain the final text that QuantGov uses for its word counts, restriction counts, and other analysis. These are always plain text files, converted from the original raw format and then cleaned. Cleaning strips out unnecessary spaces, special characters, and punctuation, with the exception of periods, commas, exclamation marks, and question marks. We clean these files to make them as standardized as possible and to avoid introducing noise into text classification and other types of analysis.
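As a rough illustration of the kind of cleaning described above, the following Python sketch keeps letters, digits, and the four retained punctuation marks, and collapses extra whitespace. It is not QuantGov's actual cleaning code: the function name `clean_text` is hypothetical, and treating only ASCII letters and digits as ordinary characters is an assumption made for this example.

```python
import re


def clean_text(raw: str) -> str:
    """Approximate the cleaning step (hypothetical helper, not QuantGov's code)."""
    # Replace anything that is not an ASCII letter, digit, whitespace,
    # period, comma, exclamation mark, or question mark with a space.
    kept = re.sub(r"[^A-Za-z0-9\s.,!?]", " ", raw)
    # Collapse runs of whitespace into single spaces and trim the ends.
    return re.sub(r"\s+", " ", kept).strip()


if __name__ == "__main__":
    print(clean_text("A  licensee;\tshall comply?"))
```

Replacing a removed character with a space, rather than deleting it outright, keeps words that were separated only by punctuation from running together.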
You may also download the document-level metadata for each jurisdiction. The metadata provides our word counts, restriction counts, and additional information for each file included in that jurisdiction. It downloads as a .csv file and includes the following information:
- documentName identifies the exact file in the file download for that jurisdiction that the row is associated with.
- documentTitle refers to the title of the regulation itself.
- documentReference provides the location of the document in the organizational scheme of that jurisdiction, as represented on the website where the documents were collected. For instance, Title 20, Part X, Chapter 1 means you can find this document in that particular chapter, part, and title of the jurisdiction's organizational scheme for its regulations.
- collectedDate identifies the date on which these documents were collected.
- Total Restrictions represents the sum of all restrictive term counts.
- Total Words represents the total count of all words in the document.
- Terms - May Not: count of the "may not" restrictive term.
- Terms - Required: count of the "required" restrictive term.
- Terms - Must: count of the "must" restrictive term.
- Terms - Shall: count of the "shall" restrictive term.
- Terms - Prohibited: count of the "prohibited" restrictive term.
- jurisdictionName identifies the name of the jurisdiction.
- agencyName identifies the agency, office, or department that this document is associated with.
- sourceCitation identifies the citation information for this document's associated RegData project.
- sourceName identifies the RegData data series that this document is included in.
- sourceOrganization identifies the organization responsible for the RegData data series of this document.
- sourceUrl provides a URL for the organization responsible for this data series, as well as a URL for the domain from which the text in this series was collected.
- documentationUrl provides a URL for the RegData User's Guide, where you can find additional information on these data.
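
Because Total Restrictions is described above as the sum of the five restrictive term counts, one quick sanity check on a downloaded metadata file is to confirm that relationship row by row. The sketch below uses only Python's standard `csv` module. It assumes the CSV header names match the field names listed above exactly and that the counts are stored as whole numbers; the function name and the sample rows are made up for illustration and are not real QuantGov data.

```python
import csv
import io

# Column names taken from the field list above; the exact header
# spelling in a real download is an assumption.
TERM_COLUMNS = [
    "Terms - May Not",
    "Terms - Required",
    "Terms - Must",
    "Terms - Shall",
    "Terms - Prohibited",
]


def check_restriction_totals(csv_file):
    """Return documentName values whose term counts do not sum to Total Restrictions."""
    mismatches = []
    for row in csv.DictReader(csv_file):
        term_sum = sum(int(row[col]) for col in TERM_COLUMNS)
        if term_sum != int(row["Total Restrictions"]):
            mismatches.append(row["documentName"])
    return mismatches


# Made-up sample rows standing in for a downloaded metadata file.
SAMPLE = (
    "documentName,Total Restrictions,Terms - May Not,Terms - Required,"
    "Terms - Must,Terms - Shall,Terms - Prohibited\n"
    "doc_a.txt,10,1,2,3,4,0\n"
    "doc_b.txt,5,1,1,1,1,0\n"
)

if __name__ == "__main__":
    print(check_restriction_totals(io.StringIO(SAMPLE)))
```

To run the same check on a real download, open the file with `open(path, newline="", encoding="utf-8")` and pass the file object to the function in place of the `io.StringIO` sample.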