The steps here reference primary sources and tool documentation so you can verify rules and implement workflows with confidence.
What are itemized contributions and why they matter
Itemized contributions are individual receipts that meet federal reporting rules and must be reported with specific contributor details, which makes them the primary source for campaign finance analysis. The federal Schedule A instructions describe when a contribution must be itemized and which fields are required for each receipt, including name, address, date, amount, employer, occupation, and contributor type, all of which affect how analysts treat donor records (see the FEC Schedule A instructions).
Accurate filtering of itemized contributions matters because analysts and journalists rely on these public records to report fundraising patterns, identify major donors, and check compliance. The FEC publishes record-level receipts that make this work possible, and careful filtering reduces errors that arise from misapplied thresholds or missing fields (see the FEC receipts data page).
To use these records correctly, start with the itemization threshold most commonly applied: contributions are typically itemized once a contributor's aggregate for the election reaches the $200 reporting trigger, subject to Schedule A rules. That threshold is a reporting trigger rather than a policy judgment, so follow the Schedule A guidance when deciding which rows to include (see the FEC Schedule A instructions).
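As a rough illustration of how that trigger plays out in data, the sketch below flags contributors whose aggregate reaches $200. The column names and the simplified per-contributor grouping are assumptions; real Schedule A aggregation is per election and should follow the FEC instructions.

```python
import pandas as pd

# Hypothetical receipts; column names are illustrative, not exact FEC headers.
receipts = pd.DataFrame({
    "contributor": ["SMITH, JANE", "SMITH, JANE", "DOE, JOHN"],
    "amount": [150.00, 75.00, 50.00],
})

# The Schedule A trigger applies to a contributor's aggregate for the
# election, not to any single transaction.
totals = receipts.groupby("contributor")["amount"].sum()
itemize = totals[totals >= 200]  # simplified $200 aggregate trigger
```

Here SMITH, JANE aggregates to $225 across two receipts and is flagged, while DOE, JOHN stays below the trigger even though no single rule applies to the individual $150 transaction.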
Download the raw FEC receipts export or the article checklist before you begin; keeping the raw file untouched makes later reviews and audits simpler.
When you prepare to analyze itemized contributions, preserve the original export as part of an audit trail and work on a copy for filtering and deduplication. That practice lets you show exactly what changed during cleaning and supports transparent reporting.
Where to get itemized receipts: FEC exports and the API
The FEC receipts data page is the canonical source for record-level Schedule A exports and provides filters for date range, contributor name, amount, and committee, enabling targeted downloads for analysis (see the FEC receipts data page; more background is on the FEC's about receipts data page).
The FEC also offers an API that supports reproducible, programmatic queries and exports across the 2024 through 2026 datasets, which is useful when you need repeatable filters or scheduled updates for ongoing projects (see the OpenFEC API documentation).
Before you clean a dataset, consult the FEC data documentation to confirm field names, date formats, and any committee-specific conventions so that your filters and joins use the correct column names and types.
Most FEC record-level exports are CSV files that can be opened in Excel or Google Sheets or read into a scripting environment; check for UTF-8 encoding and consistent delimiters before importing to avoid corrupted fields or misaligned columns.
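A minimal sketch of a defensive import, assuming pandas and illustrative column names; a real workflow would read the downloaded file with pd.read_csv(path, dtype=str, encoding="utf-8") and run the same checks.

```python
import io
import pandas as pd

# In-memory stand-in for a downloaded export; quoted fields keep embedded
# commas (e.g. "DOE, JOHN") from misaligning columns.
raw = io.StringIO(
    'contributor_name,contribution_receipt_amount,contribution_receipt_date\n'
    '"DOE, JOHN",250.00,2024-03-15\n'
    '"SMITH, JANE",500.00,2024-04-01\n'
)

# Read everything as strings first so stray characters surface during an
# explicit conversion step rather than being coerced silently.
df = pd.read_csv(raw, dtype=str)

# Quick structural checks before any cleaning.
assert not df.columns.duplicated().any(), "duplicate column headers"
assert df.notna().all().all(), "missing or misaligned cells"
```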
Preparing your raw export: file formats and required fields
Verify that the core Schedule A fields are present and consistently populated: contributor name, address components, report date, transaction amount, employer, occupation, and committee ID. If fields are missing or appear under alternate headings, update your import mapping or consult the documentation to reconcile names (see the FEC Schedule A instructions).
Use deterministic filters and exact deduplication first, preserve the raw export, flag fuzzy-match candidates for human review, and keep an audit trail with change counts and rationale.
Save an untouched copy of the raw export and work on a separate file for cleaning. Label files clearly with the original filename and a timestamp so you can always recover the original dataset if needed.
Do basic sanity checks on types and ranges: dates should import as dates, amounts as numeric values, and text fields should not include stray delimiters or embedded newlines. Quick normalization at this stage reduces noise in matching steps later on.
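One way to run those checks in pandas, where errors="coerce" turns unparseable values into NaT/NaN that can then be flagged for review (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "contribution_receipt_date": ["2024-03-15", "2024-04-01", "not-a-date"],
    "contribution_receipt_amount": ["250.00", "1,000.00", "abc"],
})

# Coerce types; invalid values become NaT/NaN instead of passing silently.
df["contribution_receipt_date"] = pd.to_datetime(
    df["contribution_receipt_date"], errors="coerce")
df["contribution_receipt_amount"] = pd.to_numeric(
    df["contribution_receipt_amount"].str.replace(",", ""), errors="coerce")

# Flag rows that failed coercion for review rather than dropping them.
bad_rows = df[df.isna().any(axis=1)]
```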
Deterministic filtering: date ranges, amounts, and committees
Begin cleaning with conservative, deterministic filters that do not alter values: limit the dataset by the date range relevant to your analysis, by committee ID, and by the minimum or maximum amounts that match your research question. These filters reduce file size while preserving the original records.
In spreadsheets, apply column filters to restrict rows by report date, committee, and amount. In a scripted workflow, encode the same filters as a reproducible step in your script so the same subset can be regenerated later.
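In a script, those deterministic filters can be expressed as a single boolean mask; the committee ID, date window, and amount floor below are placeholders for your own research question.

```python
import pandas as pd

df = pd.DataFrame({
    "committee_id": ["C001", "C001", "C002"],
    "contribution_receipt_date": pd.to_datetime(
        ["2024-01-10", "2024-06-01", "2024-02-20"]),
    "contribution_receipt_amount": [250.0, 50.0, 500.0],
})

# Deterministic, reproducible subset: committee, date window, amount floor.
mask = (
    (df["committee_id"] == "C001")
    & (df["contribution_receipt_date"] >= "2024-01-01")
    & (df["contribution_receipt_date"] <= "2024-12-31")
    & (df["contribution_receipt_amount"] >= 200)
)
subset = df[mask]
```

Because the mask only selects rows, the original values are untouched and the same subset can be regenerated by rerunning the script.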
Export any filtered subsets with descriptive filenames and timestamps. Keeping named subsets helps with audits and lets you re-run downstream steps without reapplying the same filters manually.
Exact duplicate detection and removal in spreadsheets
For small to medium exports, exact duplicate detection is a fast way to remove identical rows. Excel's Remove Duplicates command works on selected columns and is appropriate when the same receipt appears more than once with identical values in key fields (see Microsoft Support).
Decide which columns to include in the exact-match check. For many workflows you might use contributor name, address, date, and amount; avoid including transaction IDs if those differ across exports but the other fields match, or you could lose true duplicates.
In Google Sheets, functions like UNIQUE combined with FILTER let you extract distinct rows or preserve duplicates in a separate sheet for review. Always save the rows you remove to a retained file so you can show what was dropped and why (see Google Sheets help).
Exact deduplication is sufficient when exported rows are identical copies. It is a low-risk operation provided you back up the raw export and store removed rows separately.
Programmatic deduplication and reproducible workflows with pandas
For larger datasets, scripting with pandas gives reproducible, auditable deduplication. pandas.DataFrame.drop_duplicates supports key-based exact matching across chosen column sets, which is preferable for large exports where manual checks are impractical (see the pandas drop_duplicates and duplicated documentation).
Typical steps in a pandas workflow are: read the CSV with explicit dtypes, normalize text fields, apply deterministic filters, run drop_duplicates on the chosen keys, and write both the cleaned file and a companion log that records the counts of rows removed and the keys used.
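The dedup-and-log step of that workflow might look like the following sketch, with illustrative column names; the dropped rows are kept for the audit trail rather than discarded.

```python
import pandas as pd

df = pd.DataFrame({
    "contributor_name": ["DOE, JOHN", "DOE, JOHN", "SMITH, JANE"],
    "contribution_receipt_date": ["2024-03-15", "2024-03-15", "2024-04-01"],
    "contribution_receipt_amount": [250.0, 250.0, 500.0],
})

keys = ["contributor_name", "contribution_receipt_date",
        "contribution_receipt_amount"]

# Capture the rows that will be dropped before dropping them, so the
# removal is documented and reversible.
dupes = df[df.duplicated(subset=keys, keep="first")]
clean = df.drop_duplicates(subset=keys, keep="first")

log = {"rows_before": len(df), "rows_after": len(clean),
       "duplicates_removed": len(dupes), "keys": keys}
```

Writing `dupes` and `log` out alongside the cleaned file gives reviewers the counts and the exact keys used.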
Store scripts and exports under version control and include the script version in your export filenames. This approach lets collaborators rerun the exact same cleaning steps and verify results.
Fuzzy matching and clustering for non-exact duplicates
Fuzzy matching helps surface likely duplicates that are not exact matches, such as name variants or address formatting differences. Token-based similarity and ratio metrics are common ways to score candidate pairs, but they are heuristics that require human review to avoid incorrect merges.
RapidFuzz offers fast similarity scoring for name and address comparisons, while OpenRefine provides clustering tools for batching candidate merges; both are widely used to propose merges that a human reviewer then approves or rejects (see the RapidFuzz and OpenRefine documentation).
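To show the scoring idea without a third-party dependency, the sketch below uses Python's stdlib difflib as a stand-in for RapidFuzz's token-sort ratio (RapidFuzz is much faster at scale); the 0.8 cutoff is an illustrative assumption, not a recommended threshold.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Token-sorted ratio in [0, 1]; sorting tokens first makes
    'John A. Doe' and 'Doe, John A.' score as near-identical."""
    norm = lambda s: " ".join(sorted(s.lower().split()))
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

names = ["John A. Doe", "Doe, John A.", "Jane Smith"]

# Score every pair; pairs above the cutoff go to human review, not auto-merge.
candidates = [
    (a, b, similarity(a, b))
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if similarity(a, b) >= 0.8
]
```

The high-scoring pair here is only a merge *candidate*; a reviewer still decides whether the two rows are the same donor.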
Quick steps to vet fuzzy-match candidates before merging
Review the highest-scoring pairs first, then work down toward your threshold with sample-based review. High-recall settings surface more possible duplicates but increase false positives, so use conservative thresholds, document them, and keep an audit trail of accepted merges.
Keeping an audit trail: logging filters, merges, and export summaries
An audit trail should record the original file name, the filters applied, the exact-duplicate criteria used, the fuzzy-match rules and thresholds, and counts of records changed or removed. This makes the cleaned output reproducible and defensible.
A simple change summary can list counts by filter, the number of exact duplicates removed, fuzzy-match candidates reviewed, merges accepted, and rows deleted. Save both the cleaned CSV and a short methodology note alongside it to explain your choices to readers or colleagues.
Preserve the raw export and a line-by-line log of merges or deletions. That log should include the original row identifiers and the rationale for each change so an independent reviewer can follow your decisions.
Decision criteria: when to merge, when to keep duplicates
Choose conservative merges when donor attribution matters. Precision-focused rules reduce false positives but can leave duplicates in the cleaned file, while recall-focused rules merge more likely duplicates but increase the risk of incorrect joins.
Committee and state-level aggregation rules can affect whether small-dollar donors are grouped or left separate, so check committee guidance and the Schedule A instructions before applying bulk merges to avoid misreporting contribution counts (see the FEC Schedule A instructions).
Document each merge decision and use human review for borderline fuzzy matches. When in doubt, keep separate rows and note the reason for conservative handling in your change summary.
Typical errors and pitfalls to avoid
Common mistakes include merging on partial keys, failing to normalize address formats, and not backing up raw data before making changes. These errors can remove legitimate contributions or misattribute donors.
To reduce that risk, normalize dates, trim whitespace, standardize casing, and expand common abbreviations before matching. These quick normalization steps improve both exact and fuzzy-match accuracy.
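A small normalization helper along those lines might look like this; the abbreviation table is a tiny illustrative subset, and a real one would be larger and tuned to your data.

```python
import re

# Illustrative subset; expand with the abbreviations in your own data.
ABBREV = {"st": "street", "ave": "avenue", "rd": "road"}

def normalize_address(addr: str) -> str:
    """Trim, lowercase, strip punctuation, collapse whitespace, and
    expand common abbreviations so variants compare equal."""
    addr = addr.strip().lower()
    addr = re.sub(r"[.,]", "", addr)
    addr = re.sub(r"\s+", " ", addr)
    return " ".join(ABBREV.get(tok, tok) for tok in addr.split())
```

Run the same helper before both exact and fuzzy matching so the two passes see identical strings.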
Keep removed rows in a separate file so you can restore them if needed, and always include a short note explaining why rows were dropped or merged.
Practical example 1: spreadsheet workflow from raw export to cleaned file
Follow these numbered actions in a spreadsheet: import the CSV with UTF-8 encoding, verify fields, apply deterministic filters for date range and committee, run exact duplicate removal, save removed rows to a separate file, and export the cleaned file with a timestamped filename. These steps create a repeatable record of the cleaning process (see Google Sheets help).
Check which columns to include in exact deduplication; a common choice is contributor name, normalized address, date, and amount. Exclude columns that are expected to vary between exports, like transaction ID, unless you have a reason to require an exact match.
In your change log, record the step, the filter or function used, and the number of rows affected. For example: Filter by committee ID X, rows before 12,345, after 3,210; exact duplicates removed 145; fuzzy candidates flagged 78. That short summary helps readers understand the scale of changes.
Practical example 2: a reproducible pandas script outline
A high-level script plan is: read the CSV with specified dtypes, run normalization functions on name and address fields, apply deterministic filters, run drop_duplicates on the chosen keys, compute fuzzy-match scores for candidate pairs, write a candidate review file, apply approved merges, and export the cleaned CSV plus a change log.
Use pandas.DataFrame.drop_duplicates for the exact deduplication step and keep a copy of the dropped rows in a separate file keyed the same way. Export both the cleaned file and a companion log that records the script version and counts of rows changed for reproducibility (see the pandas documentation).
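The export step might look like the sketch below; the filenames, version tag, and counts are hypothetical, and a real script would write next to the cleaned data rather than into a temporary directory.

```python
import json
import os
import tempfile
import pandas as pd

# Stand-ins for the cleaned frame and the change log built earlier.
clean = pd.DataFrame({"contributor_name": ["DOE, JOHN"], "amount": [250.0]})
log = {
    "script_version": "v1.2.0",           # hypothetical version tag
    "exact_duplicates_removed": 145,      # illustrative count
    "dedupe_keys": ["contributor_name", "amount"],
}

out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "receipts_clean_20240601.csv")
log_path = os.path.join(out_dir, "receipts_clean_20240601.log.json")

# The cleaned CSV and its companion log share a basename so they travel
# together through version control and reviews.
clean.to_csv(csv_path, index=False)
with open(log_path, "w") as f:
    json.dump(log, f, indent=2)
```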
When working with a developer, provide the pseudo-steps and expected filenames so the script can be integrated into version control and scheduled runs. That way, cleaned exports are consistent and traceable.
Exporting, reporting, and sharing cleaned itemized contributions
Share cleaned outputs as CSV files alongside the original raw export and a short methodology note that lists the filters used, duplicate criteria, fuzzy-match thresholds, and references to the Schedule A instructions or the FEC receipts data page.
Include a change summary with counts for each cleaning step, and consider redacting or omitting sensitive fields when publishing for a general audience, consistent with privacy expectations and the purpose of the release.
Label all files clearly with timestamps and script versions where relevant so users can understand which version of the dataset they are viewing.
Next steps and resources
Quick checklist: download the raw receipts, verify field names, apply deterministic filters, run exact dedupe, review fuzzy candidates, and document all changes. These actions make your cleaned itemized contributions dataset reproducible and defensible (see the FEC Schedule A instructions).
Consult the FEC receipts data documentation for updates to field definitions, and the pandas, RapidFuzz, or OpenRefine documentation for implementation guidance when you need scripting or fuzzy clustering support.
Keep a short methodology note with every public release so other researchers can reproduce your filters and understand the decisions behind merges.
Federal itemization is typically required when a contributor’s receipts meet or exceed the applicable aggregate threshold per election; check the FEC Schedule A instructions for the exact reporting trigger and required fields.
Exact deduplication is safe when rows are identical across chosen key fields, but it is important to back up the raw export and store removed rows separately in case further review is needed.
Use fuzzy matching to find likely non-exact duplicates such as name or address variants, but apply conservative thresholds and human review to avoid incorrect merges.
If you need a template or a starter script, use the checklists and pseudo-steps in this article as a guide and adapt them to your committee or project rules.
References
- https://www.fec.gov/resources/cms-content/documents/schedule_a_instructions.pdf
- https://www.fec.gov/data/receipts/
- https://www.fec.gov/campaign-finance-data/about-campaign-finance-data/about-receipts-data/
- https://api.open.fec.gov/developers/
- https://support.microsoft.com/office/remove-duplicate-rows-2fdf3d4a-7c2c-4b32-a7c6-7f3d9f1a6b9a
- https://michaelcarbonara.com/contact/
- https://support.google.com/docs/answer/3540681
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
- https://maxbachmann.github.io/RapidFuzz/
- https://openrefine.org/manual/cluster.html
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html
- https://michaelcarbonara.com/
- https://michaelcarbonara.com/news/

