ChatGPT Prompts for Data Cleaning Workflows
20 copy-paste ChatGPT prompts for data cleaning: deduplication, missing values, outliers, standardization, type conversion. The unsexy work that determines whether analysis is right or wrong.
Standardization
4 prompts
Standardization Plan
1/20
[Paste sample data]. Standardization plan: text case (all lower / proper / upper), trim whitespace, format dates uniformly, standardize abbreviations. Output: rules per column, transformation steps.
Plans data standardization.
Pro tip: Standardization first; analysis after. "USA" / "U.S.A." / "United States" = 3 different values to your computer. Standardize to one canonical form.
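The canonical-form idea above can be sketched in Python; the column helpers and the country map are illustrative assumptions, not a fixed recipe:

```python
# Minimal standardization pass: trim, lowercase, then map known
# variants to one canonical form. The map is illustrative only.
CANONICAL_COUNTRY = {
    "usa": "United States",
    "u.s.a.": "United States",
    "united states": "United States",
}

def standardize_text(value: str) -> str:
    """Trim whitespace and lowercase: the first two rules of any plan."""
    return value.strip().lower()

def standardize_country(value: str) -> str:
    """Map known variants to one canonical form; pass unknowns through."""
    key = standardize_text(value)
    return CANONICAL_COUNTRY.get(key, value.strip())

rows = ["USA", " U.S.A. ", "united states"]
print([standardize_country(r) for r in rows])  # three identical values
```

After this pass, "USA" / "U.S.A." / "United States" collapse into a single value, so counts and joins behave.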
Phone + Email Format
2/20
Standardize phone + email columns. Phone: strip spaces/dashes, country code consistency. Email: lowercase, trim. Output: formulas / Power Query steps. Common cleaning task.
Standardizes contact data.
Pro tip: Phone numbers in 10 formats = same person counted as 10. Standardize to E.164 (+15551234567). Emails lowercase. Then dedupe = one record per actual person.
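A rough Python sketch of the E.164 + lowercase-email idea; it assumes 10-digit US numbers and a default country code, so treat it as a starting point (a dedicated library such as `phonenumbers` handles the general case):

```python
import re

def standardize_phone(raw: str, default_country: str = "+1") -> str:
    """Strip everything but digits, then prefix a country code.
    Assumes 10-digit national numbers; this is a sketch, not full E.164."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 10:
        return default_country + digits
    if len(digits) == 11 and digits.startswith("1"):
        return "+" + digits
    return raw  # flag for manual review instead of guessing

def standardize_email(raw: str) -> str:
    return raw.strip().lower()

print(standardize_phone("(555) 123-4567"))           # +15551234567
print(standardize_email("  Jane.Doe@Example.COM "))  # jane.doe@example.com
```

Run both before dedupe so one person in ten formats becomes one record.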
Date Standardization
3/20
[Paste mixed dates]. Standardize to single format. Output: detection logic (which format each row), DATEVALUE, parsing for ambiguous (1/2/2024 = Jan 2 or Feb 1?), error handling.
Standardizes dates.
Pro tip: Mixed date formats = #1 data quality issue. Especially MM/DD/YYYY vs DD/MM/YYYY. Sometimes ambiguous (1/2/24 could be either). Source-tracking + careful parsing.
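The try-formats-in-order approach can be sketched in Python with the standard library; the format list (and its priority order) is an assumption you should set per data source, since the order is exactly what disambiguates 1/2/2024:

```python
from datetime import datetime

# Formats tried in priority order. For ambiguous strings like 1/2/2024,
# the order IS the disambiguation policy, so document it per source.
FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y"]

def parse_date(raw: str):
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None  # unparseable: route to an error queue, don't silently drop

mixed = ["2024-01-02", "1/2/2024", "02-Jan-2024", "not a date"]
print([parse_date(d) for d in mixed])
```

A DD-first source needs "%d/%m/%Y" ahead of "%m/%d/%Y"; mixing both in one column is unrecoverable without source tracking.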
Address Cleaning
4/20
Clean address column. Output: parse into components (street, city, state, zip), standardize abbreviations (St → Street or vice versa), validate via API (Google, USPS), handle international.
Cleans addresses.
Pro tip: Addresses are notoriously messy. APIs (Google Maps, USPS) standardize + validate. Manual cleaning = errors. For volume, API integration is worth it.
Deduplication
4 prompts
Duplicate Detection Strategy
5/20
Detect duplicates in [data]. Output: exact match (single column) easy, fuzzy match (across columns), normalize first then dedupe, when to merge vs keep separate. Strategy depends on data.
Plans duplicate detection.
Pro tip: Detect duplicates AFTER standardization, not before. "Smith" / "smith " / "SMITH" = duplicates after trim+lowercase. Order matters.
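Normalize-then-dedupe can be sketched in a few lines of Python, keeping the first original spelling seen for each normalized key:

```python
def normalize(name: str) -> str:
    """Collapse internal whitespace, trim, lowercase."""
    return " ".join(name.split()).lower()

names = ["Smith", "smith ", "SMITH", "Jones"]

# Dedupe on the normalized key, keep the first original spelling seen.
seen, unique = set(), []
for n in names:
    key = normalize(n)
    if key not in seen:
        seen.add(key)
        unique.append(n)

print(unique)  # ['Smith', 'Jones']
```

Run the same dedupe before normalizing and all four rows survive, which is exactly why order matters.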
Fuzzy Matching
6/20
Fuzzy duplicate detection. Output: methods (Levenshtein distance, soundex, Power Query Fuzzy Match, OpenRefine), tools per scale, tradeoff (precision vs recall). For: customer names, addresses.
Builds fuzzy matching.
Pro tip: Customer names: "Smith Inc" / "Smith Inc." / "Smith Incorporated" = same company. Fuzzy matching catches these. Power Query has fuzzy match; for serious work, OpenRefine is free + powerful.
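For a no-install starting point, Python's standard-library `difflib.SequenceMatcher` gives a 0-1 similarity ratio; it is a stand-in for a true Levenshtein distance, not a replacement for a dedicated matching tool:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive 0.0-1.0 similarity ratio (stdlib, not Levenshtein)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

pairs = [("Smith Inc", "Smith Inc."), ("Smith Inc", "Jones LLC")]
for a, b in pairs:
    print(f"{a} ~ {b}: {similarity(a, b):.2f}")
```

Pick a threshold deliberately: higher = more precision, fewer false merges; lower = more recall, more review work.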
Merge Decision Logic
7/20
Merging duplicates: which value wins. Output: most-recent (by date), most-complete (fewest blanks), source priority (CRM > spreadsheet), manual review queue. Don't auto-merge blindly.
Plans merge logic.
Pro tip: Auto-merging without rules = lost data. "Most-recent" wins for: contact info. "Most-complete" for: profiles. Source-priority for: cross-system. Decide rules; document.
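The "most-recent" and "most-complete" rules can be sketched as small Python functions over duplicate records; the field names (`updated_at`, `phone`, `email`) are illustrative:

```python
# Merge rules as functions over a list of duplicate records (dicts).
# Field names below are hypothetical.

def most_recent(records, date_field="updated_at"):
    """Winner is the record with the latest date (ISO strings sort correctly)."""
    return max(records, key=lambda r: r[date_field])

def most_complete(records):
    """Winner is the record with the fewest blank/None fields."""
    return max(records, key=lambda r: sum(v not in (None, "") for v in r.values()))

dupes = [
    {"email": "a@x.com", "phone": None, "updated_at": "2024-03-01"},
    {"email": "a@x.com", "phone": "+15551234567", "updated_at": "2024-01-15"},
]
print(most_recent(dupes)["updated_at"])  # 2024-03-01
print(most_complete(dupes)["phone"])     # +15551234567
```

Note the two rules pick different winners here, which is why the rule must be chosen per field type and documented.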
Match Confidence Scoring
8/20
Score match confidence (0-100). Output: criteria (exact email = 100, fuzzy name + same company = 80, etc.), threshold for auto-merge, manual review queue. Confidence-driven merging.
Scores match confidence.
Pro tip: Auto-merge >90% confident; queue 60-90% for review; ignore <60%. Tiered processing = efficient + accurate. Eliminates "merge everything" or "merge nothing" extremes.
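The tiered routing can be sketched in Python; the scoring rubric and weights are assumptions to replace with your own criteria:

```python
def score_match(a: dict, b: dict) -> int:
    """Toy 0-100 scoring rubric. Weights and fields are assumptions."""
    if a.get("email") and a.get("email") == b.get("email"):
        return 100  # exact email match: as good as it gets
    score = 0
    if a.get("company") == b.get("company"):
        score += 40
    if a.get("name", "").lower() == b.get("name", "").lower():
        score += 40
    return score

def route(score: int) -> str:
    """Tiered processing: auto-merge, review queue, or ignore."""
    if score > 90:
        return "auto-merge"
    if score >= 60:
        return "manual review"
    return "ignore"

print(route(score_match({"email": "a@x.com"}, {"email": "a@x.com"})))  # auto-merge
```

The thresholds (90/60) come straight from the tip above; tune them against a labeled sample of known duplicates.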
Missing Values
4 prompts
Missing Value Strategy
9/20
Handle missing values. Output: by column, decision (delete row, fill with default, calculate, leave blank, flag for follow-up). Different fields = different strategies.
Strategizes missing values.
Pro tip: Missing values not all equal. Missing income = leave blank (sensitive). Missing zip = lookup from address. Missing required = ask source. Per-column strategy.
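A per-column strategy table keeps the decisions explicit and auditable. This Python sketch uses hypothetical column names and implements only two of the actions (the lookup action is left as a placeholder):

```python
# One rule per column; column names and rules are illustrative.
STRATEGY = {
    "income": "leave_blank",          # sensitive: do not impute
    "zip": "lookup_from_address",     # placeholder: needs an address lookup
    "email": "flag_for_followup",     # required: ask the source
    "category": "fill_default",
}

def handle_missing(row: dict, defaults: dict) -> dict:
    out = dict(row)
    for col, action in STRATEGY.items():
        if out.get(col) in (None, ""):
            if action == "fill_default":
                out[col] = defaults.get(col)
            elif action == "flag_for_followup":
                out[f"{col}_missing"] = True
            # "leave_blank" and "lookup_from_address" fall through untouched
    return out

row = {"income": None, "zip": "", "email": "", "category": None}
print(handle_missing(row, {"category": "uncategorized"}))
```

Because the table is data, it doubles as documentation of your per-column decisions.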
Imputation Methods
10/20
Impute missing values for [column]. Methods: mean/median/mode, regression-based, k-nearest-neighbors, predictive model. Output: when each appropriate, biases introduced. Imputation = trade-off.
Imputes missing values.
Pro tip: Imputation introduces bias. Mean = pulls toward center. Median = ignores extremes. Predictive = best but complex. Document what you imputed; flag if material to analysis.
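Mean/median imputation plus the flag column the tip calls for, sketched with the standard library (pandas `fillna` does the same at scale):

```python
from statistics import mean, median

def impute(values, method="median"):
    """Fill None with the mean or median of observed values, and return
    a parallel was-imputed flag column so the imputation is documented."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if method == "mean" else median(observed)
    filled = [fill if v is None else v for v in values]
    flags = [v is None for v in values]
    return filled, flags

vals = [10, None, 30, 50, None]
filled, flags = impute(vals)
print(filled)  # [10, 30, 30, 50, 30]
print(flags)   # [False, True, False, False, True]
```

Keeping the flag column is what makes the analysis reproducible: you can always recompute with and without the imputed rows.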
Forward/Backward Fill
11/20
Forward / backward fill for time series. Output: when appropriate (slow-changing values, sticky states), tools (Excel formula, Pandas ffill/bfill), alternatives. Time-series specific.
Fills time-series gaps.
Pro tip: Forward fill: take last known value forward. Useful for: stock prices, status codes. Inappropriate for: continuous data (interpolate instead). Match method to data type.
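Forward fill in plain Python, equivalent to pandas `Series.ffill()`; note that leading gaps stay empty because there is no prior value to carry:

```python
def forward_fill(values):
    """Carry the last known value forward; leading gaps stay None."""
    last = None
    out = []
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out

prices = [None, 100.0, None, None, 102.5, None]
print(forward_fill(prices))  # [None, 100.0, 100.0, 100.0, 102.5, 102.5]
```

Backward fill is the mirror image (iterate in reverse); for continuous measurements, interpolate between known points instead of holding flat.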
Missing Data Documentation
12/20
Document missing data handling. Output: which fields had missing, treatment per field, % imputed vs original, impact on analysis, transparency for audit. Documentation = trust.
Documents missing data handling.
Pro tip: Undocumented imputation = analysis can't be reproduced + audited. Document: what was missing, how filled, why. Future-you debugging = thanks past-you.
Outliers + Validation
4 prompts
Outlier Detection
13/20
[Data]. Detect outliers. Methods: standard deviation (>3σ), IQR (>1.5x IQR), percentile (top/bottom 1%), visual (boxplot). Output: identification + investigation. Real or error?
Detects outliers.
Pro tip: Outliers can be: data error, real but extreme, or signal (interesting). Don't reflexively remove. Investigate; sometimes the outlier IS the story.
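The IQR and standard-deviation methods sketched with Python's `statistics` module. One caveat worth seeing: a single extreme value inflates σ so much it can mask itself, which is a practical argument for the IQR rule:

```python
from statistics import mean, stdev, quantiles

def outliers_iqr(values, k=1.5):
    """Tukey's rule: flag values beyond k * IQR from the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

def outliers_sigma(values, k=3):
    """Flag values more than k standard deviations from the mean.
    Caution: an extreme value inflates the stdev and can mask itself."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > k * s]

data = [10, 12, 11, 13, 12, 11, 10, 200]
print(outliers_iqr(data))    # [200]
print(outliers_sigma(data))  # may miss it: 200 inflates the stdev
```

Flag first, then investigate: the 200 might be a typo for 20, or a genuinely interesting record.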
Range Validation
14/20
Range validation: ages should be 0-120, percentages 0-100, dates within reasonable. Output: flag out-of-range, action (correct, exclude, query source).
Validates value ranges.
Pro tip: Range errors are common. "Age = 200" = data entry error. Validation catches these at cleaning time; otherwise wrong values pollute analysis.
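A rules table of per-column bounds, checked in one pass; the column names and limits are illustrative:

```python
# Allowed (lo, hi) bounds per column; bounds are illustrative.
RANGES = {"age": (0, 120), "discount_pct": (0, 100)}

def out_of_range(rows):
    """Return (row_index, column, value) for every bound violation."""
    found = []
    for i, row in enumerate(rows):
        for col, (lo, hi) in RANGES.items():
            v = row.get(col)
            if v is not None and not (lo <= v <= hi):
                found.append((i, col, v))
    return found

rows = [{"age": 34, "discount_pct": 15}, {"age": 200, "discount_pct": 110}]
print(out_of_range(rows))  # [(1, 'age', 200), (1, 'discount_pct', 110)]
```

Flag, don't auto-fix: whether to correct, exclude, or query the source is a per-field decision.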
Cross-Field Validation
15/20
Cross-field validation. Examples: order_date < ship_date, child_age < parent_age, total = sum of components. Output: rules + violations. Logical consistency.
Cross-validates fields.
Pro tip: Single-field validation = basic. Cross-field = sophisticated. "Birth date 2020 + role manager" = inconsistency. Logical rules across columns = quality.
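Cross-field rules express cleanly as named predicates over a whole row. This Python sketch uses hypothetical field names matching the examples above:

```python
from datetime import date

# Each rule: (name, predicate over a row). Field names are illustrative.
RULES = [
    ("ship_after_order", lambda r: r["order_date"] <= r["ship_date"]),
    ("total_is_sum", lambda r: abs(r["total"] - sum(r["items"])) < 0.01),
]

def violations(row):
    """Names of every rule this row breaks."""
    return [name for name, ok in RULES if not ok(row)]

row = {"order_date": date(2024, 3, 5), "ship_date": date(2024, 3, 1),
       "total": 50.0, "items": [20.0, 30.0]}
print(violations(row))  # ['ship_after_order']
```

The tolerance on the sum check matters for floats; for money, compare in integer cents instead.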
Source-of-Truth Reconciliation
16/20
Reconcile data across [source A] vs [source B]. Output: identify discrepancies, decide source of truth per field, document differences, automate ongoing check. Multi-source = where data quality dies.
Reconciles multi-source data.
Pro tip: Two sources = two truths. Common in orgs: CRM vs ERP vs Marketing tool. Decide source-of-truth per field. Automate the check; manual reconciliation doesn't scale.
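Per-field source-of-truth reconciliation can be sketched in Python; the source assignments (CRM wins contact fields, ERP wins billing) and field names are illustrative assumptions:

```python
# Source-of-truth per field. Assignments are illustrative: CRM wins
# for contact fields, ERP wins for billing.
TRUTH = {"email": "crm", "phone": "crm", "balance": "erp"}

def reconcile(crm_rec, erp_rec):
    """Merge per the truth table; log every field where sources disagree."""
    sources = {"crm": crm_rec, "erp": erp_rec}
    merged, discrepancies = {}, []
    for field, winner in TRUTH.items():
        values = {name: rec.get(field) for name, rec in sources.items()}
        if len(set(values.values())) > 1:
            discrepancies.append((field, values))
        merged[field] = values[winner]
    return merged, discrepancies

crm = {"email": "a@x.com", "phone": "+15551234567", "balance": 90.0}
erp = {"email": "a@x.com", "phone": "+15550000000", "balance": 120.0}
merged, diffs = reconcile(crm, erp)
print(merged["phone"], merged["balance"])  # +15551234567 120.0
```

Schedule this as the "automated ongoing check": the discrepancy log is the report, and the truth table is the documented decision.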