Methodology
Data Sources
Every dataset ingested by Water Utility Report, with source URLs, update cadences, data formats, and license status. Updated when ingestion pipelines change.
8 Data Sources
7/8 Public Domain
4 Refresh Cadences
5 Prohibited Sources
EPA SDWIS — Safe Drinking Water Information System
Update Cadence: Quarterly
Format: CSV / API
License: Public Domain (federal government data)
Fields Ingested
Primary source for utility identity and compliance data. All utility pages use PWSID as the canonical system identifier.
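Because every utility page is keyed on PWSID, identifiers are worth normalizing before they are used as keys. The sketch below is illustrative only, not the site's actual code. It assumes a PWSID is a two-character primacy-agency prefix (a state postal code, or a two-digit EPA region code for tribal systems) followed by seven digits. The function names and the page-key format are hypothetical.

```python
import re

# Assumed PWSID shape: two-character prefix (state postal code such as "CA",
# or a two-digit EPA region code) followed by seven digits. Verify this
# pattern against the SDWIS documentation before relying on it.
_PWSID_RE = re.compile(r"^(?:[A-Z]{2}|\d{2})\d{7}$")


def normalize_pwsid(raw: str) -> str:
    """Return an upper-cased, whitespace-free PWSID or raise ValueError."""
    candidate = re.sub(r"\s+", "", raw).upper()
    if not _PWSID_RE.match(candidate):
        raise ValueError(f"not a valid PWSID: {raw!r}")
    return candidate


def utility_page_key(raw_pwsid: str) -> str:
    """Hypothetical page key: one utility page per canonical PWSID."""
    return f"utility/{normalize_pwsid(raw_pwsid)}"
```

Normalizing once at ingestion means " ca1910067 " and "CA1910067" resolve to the same page rather than creating duplicates.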
EPA ECHO — Enforcement and Compliance History Online
Update Cadence: Weekly (ECHO refreshes frequently)
Format: CSV / API
License: Public Domain (federal government data)
Fields Ingested
Used for violation severity classification and health-based vs. monitoring violation distinction.
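The health-based versus monitoring distinction can be sketched as a lookup on a violation category code. This is a minimal illustration, not the pipeline's actual logic. The code sets are an assumption (MCL, MRDL, and TT treated as health-based; MR, MON, and RPT as monitoring or reporting) and should be checked against the ECHO data dictionary. The `Violation` type and `classify` function are hypothetical.

```python
from dataclasses import dataclass

# Assumed category codes; confirm against the ECHO/SDWIS data dictionary.
HEALTH_BASED = {"MCL", "MRDL", "TT"}
MONITORING = {"MR", "MON", "RPT"}


@dataclass(frozen=True)
class Violation:
    pwsid: str
    category_code: str


def classify(violation: Violation) -> str:
    """Map a violation to 'health_based', 'monitoring', or 'other'."""
    code = violation.category_code.strip().upper()
    if code in HEALTH_BASED:
        return "health_based"
    if code in MONITORING:
        return "monitoring"
    # Unknown codes are kept visible instead of being silently dropped.
    return "other"
```

Returning an explicit "other" bucket keeps unrecognized codes reviewable instead of misfiling them under either class.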
Consumer Confidence Reports (CCRs)
Update Cadence: Annual (utilities publish by July 1 each year)
Format: PDF / HTML (utility-published)
License: Public record (annual reports required by SDWA)
Fields Ingested
Primary source for contaminant level data shown on utility pages. Ingestion requires PDF parsing; confidence score reflects parse quality.
EPA Water Quality Portal
Update Cadence: Continuous (portal aggregates from federal + state agencies)
Format: API / CSV
License: Public Domain (federal/state government data)
Fields Ingested
Used as supplemental source for contaminant data where CCR parsing yields low confidence. Cross-referenced against CCR data.
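The fallback and cross-reference described above could look roughly like the sketch below. It is an illustration under stated assumptions. The 0.8 confidence cutoff and the 10% agreement tolerance are hypothetical values, not figures from this page, and `Reading`, `select_reading`, and `sources_agree` are made-up names.

```python
from dataclasses import dataclass
from typing import Optional

CONFIDENCE_THRESHOLD = 0.8  # hypothetical cutoff for "low confidence"
RELATIVE_TOLERANCE = 0.10   # hypothetical tolerance for cross-referencing


@dataclass(frozen=True)
class Reading:
    contaminant: str
    value: float            # both sources assumed to use the same unit
    source: str             # "CCR" or "WQP"
    confidence: float = 1.0  # parse confidence; 1.0 for non-parsed sources


def select_reading(ccr: Optional[Reading],
                   wqp: Optional[Reading]) -> Optional[Reading]:
    """Prefer the CCR value; fall back to WQP when CCR confidence is low."""
    if ccr is not None and ccr.confidence >= CONFIDENCE_THRESHOLD:
        return ccr
    if wqp is not None:
        return wqp
    return ccr  # a low-confidence CCR value, or None if nothing exists


def sources_agree(ccr: Reading, wqp: Reading) -> bool:
    """Cross-reference: True when the two values differ by <= tolerance."""
    baseline = max(abs(ccr.value), abs(wqp.value))
    if baseline == 0:
        return True
    return abs(ccr.value - wqp.value) / baseline <= RELATIVE_TOLERANCE
```

A disagreement flagged by `sources_agree` would be a natural trigger for manual review, though how the site handles mismatches is not stated here.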
State Drinking Water Program Datasets
Update Cadence: Varies by state, from annual to real-time
Format: Varies by state (CSV, API, GIS)
License: Public records (verified per state terms before ingestion)
Fields Ingested
Not all states publish granular datasets; California, Texas, and Florida have robust open datasets. Each state's terms of use are verified before its data is ingested.
EPA and CDC Health Guidance Documents
Update Cadence: As published (regulatory updates)
Format: HTML / PDF
License: Public Domain (federal government publications)
Fields Ingested
Source for all regulatory limits, MCLG values, and health-effect language on contaminant pages. Changes tracked with versioning.
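One way to track changes with versioning is an append-only history per contaminant, where a new version is recorded only when a value actually changes. The sketch below is a guess at the shape of such a record, not the site's schema. Every class, field, and method name is hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class LimitVersion:
    version: int
    mcl: Optional[float]   # enforceable limit, if one exists
    mclg: Optional[float]  # health goal, if one exists
    unit: str
    effective: date


@dataclass
class ContaminantLimits:
    contaminant: str
    history: List[LimitVersion] = field(default_factory=list)

    def current(self) -> Optional[LimitVersion]:
        """Latest recorded version, or None if nothing is recorded yet."""
        return self.history[-1] if self.history else None

    def record(self, mcl: Optional[float], mclg: Optional[float],
               unit: str, effective: date) -> bool:
        """Append a new version only if a value changed; return True if so."""
        latest = self.current()
        if latest and (latest.mcl, latest.mclg, latest.unit) == (mcl, mclg, unit):
            return False
        self.history.append(
            LimitVersion(len(self.history) + 1, mcl, mclg, unit, effective)
        )
        return True
```

Keeping superseded versions lets a contaminant page show which limit applied on a given date.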
U.S. Census Bureau — TIGER/Line Shapefiles
Update Cadence: Annual (TIGER updates; ACS population annual)
Format: Shapefile / GeoJSON
License: Public Domain (federal government data)
Fields Ingested
Used for ZIP→utility matching (spatial join of ZCTA polygons with utility service area boundaries). Match confidence reflects overlap percentage.
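The overlap percentage behind the match confidence can be illustrated with a small standard-library sketch. A production spatial join would more likely use a GIS library; this version uses Sutherland-Hodgman clipping, which is only valid when the service-area polygon is convex. It also assumes coordinates are already in a planar, equal-area projection. All function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def _signed_area(poly: List[Point]) -> float:
    """Shoelace formula; positive for counter-clockwise vertex order."""
    pairs = zip(poly, poly[1:] + poly[:1])
    return sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in pairs) / 2.0


def area(poly: List[Point]) -> float:
    return abs(_signed_area(poly))


def clip(subject: List[Point], clipper: List[Point]) -> List[Point]:
    """Clip `subject` to a convex, counter-clockwise `clipper` polygon."""
    def inside(p: Point, a: Point, b: Point) -> bool:
        # True when p lies left of, or on, the directed edge a->b.
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def intersect(p: Point, q: Point, a: Point, b: Point) -> Point:
        # Point where segment p-q crosses the infinite line through a-b.
        dx1, dy1 = q[0] - p[0], q[1] - p[1]
        dx2, dy2 = b[0] - a[0], b[1] - a[1]
        denom = dx1 * dy2 - dy1 * dx2
        t = ((a[0] - p[0]) * dy2 - (a[1] - p[1]) * dx2) / denom
        return (p[0] + t * dx1, p[1] + t * dy1)

    output = list(subject)
    for a, b in zip(clipper, clipper[1:] + clipper[:1]):
        if not output:
            break
        current, output = output, []
        for p, q in zip(current, current[1:] + current[:1]):
            if inside(q, a, b):
                if not inside(p, a, b):
                    output.append(intersect(p, q, a, b))
                output.append(q)
            elif inside(p, a, b):
                output.append(intersect(p, q, a, b))
    return output


def overlap_percent(zcta: List[Point], service_area: List[Point]) -> float:
    """Percent of the ZCTA polygon's area that falls inside the service area."""
    zcta_area = area(zcta)
    if zcta_area == 0:
        return 0.0
    if _signed_area(service_area) < 0:  # clip() expects counter-clockwise
        service_area = service_area[::-1]
    return 100.0 * area(clip(zcta, service_area)) / zcta_area
```

For a ZCTA square half covered by a service area, `overlap_percent` returns 50.0. A ZIP that overlaps several utilities would get one percentage per utility.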
NELAP / State Lab Certification Databases
Update Cadence: Manually verified (labs are not in a single machine-readable federal dataset)
Format: HTML (state program pages)
License: Public record (state-published certification lists)
Fields Ingested
Lab entries are manually verified against state certification pages, and each lab is flagged for re-verification once a year. This is not a comprehensive directory.
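The annual re-verification flag reduces to a date comparison. The sketch below assumes "annually" means 365 days since the last manual check. The `LabEntry` fields and the function name are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

REVERIFY_AFTER = timedelta(days=365)  # assumption: "annually" read as 365 days


@dataclass(frozen=True)
class LabEntry:
    name: str
    state: str
    last_verified: date  # date of the last manual check


def needs_reverification(lab: LabEntry, today: Optional[date] = None) -> bool:
    """True once a full year has passed since the last manual verification."""
    today = today or date.today()
    return today - lab.last_verified >= REVERIFY_AFTER
```

Passing `today` explicitly keeps the check deterministic in tests and in scheduled batch runs.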