ChatGPT Prompts for Data Cleaning Workflows
20 copy-paste ChatGPT prompts for data cleaning: deduplication, missing values, outliers, standardization, type conversion. The unsexy work that determines whether analysis is right or wrong.
Standardization
4 prompts
Standardization Plan
1/20
[Paste sample data]. Standardization plan: text case (all lower / proper / upper), trim whitespace, format dates uniformly, standardize abbreviations. Output: rules per column, transformation steps.
Plans data standardization.
Pro tip: Standardization first; analysis after. "USA" / "U.S.A." / "United States" = 3 different values to your computer. Standardize to one canonical form.
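The canonical-form idea above can be sketched in Python; the column helpers and the country map are illustrative assumptions, not a fixed recipe:

```python
# Minimal standardization pass: trim, lowercase, then map known
# variants to one canonical form. The map is illustrative only.
CANONICAL_COUNTRY = {
    "usa": "United States",
    "u.s.a.": "United States",
    "united states": "United States",
}

def standardize_text(value: str) -> str:
    """Trim whitespace and lowercase: the first two rules of any plan."""
    return value.strip().lower()

def standardize_country(value: str) -> str:
    """Map known variants to one canonical form; pass unknowns through."""
    key = standardize_text(value)
    return CANONICAL_COUNTRY.get(key, value.strip())

rows = ["USA", " U.S.A. ", "united states"]
print([standardize_country(r) for r in rows])  # three identical values
```

After this pass, "USA" / "U.S.A." / "United States" collapse into a single value, so counts and joins behave.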
Phone + Email Format
2/20
Standardize phone + email columns. Phone: strip spaces/dashes, country code consistency. Email: lowercase, trim. Output: formulas / Power Query steps. Common cleaning task.
Standardizes contact data.
Pro tip: Phone numbers in 10 formats = same person counted as 10. Standardize to E.164 (+15551234567). Emails lowercase. Then dedupe = one record per actual person.
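A rough Python sketch of the E.164 + lowercase-email idea; it assumes 10-digit US numbers and a default country code, so treat it as a starting point (a dedicated library such as `phonenumbers` handles the general case):

```python
import re

def standardize_phone(raw: str, default_country: str = "+1") -> str:
    """Strip everything but digits, then prefix a country code.
    Assumes 10-digit national numbers; this is a sketch, not full E.164."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 10:
        return default_country + digits
    if len(digits) == 11 and digits.startswith("1"):
        return "+" + digits
    return raw  # flag for manual review instead of guessing

def standardize_email(raw: str) -> str:
    return raw.strip().lower()

print(standardize_phone("(555) 123-4567"))           # +15551234567
print(standardize_email("  Jane.Doe@Example.COM "))  # jane.doe@example.com
```

Run both before dedupe so one person in ten formats becomes one record.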
Date Standardization
3/20
[Paste mixed dates]. Standardize to single format. Output: detection logic (which format each row), DATEVALUE, parsing for ambiguous (1/2/2024 = Jan 2 or Feb 1?), error handling.
Standardizes dates.
Pro tip: Mixed date formats = #1 data quality issue. Especially MM/DD/YYYY vs DD/MM/YYYY. Sometimes ambiguous (1/2/24 could be either). Source-tracking + careful parsing.
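The try-formats-in-order approach can be sketched in Python with the standard library; the format list (and its priority order) is an assumption you should set per data source, since the order is exactly what disambiguates 1/2/2024:

```python
from datetime import datetime

# Formats tried in priority order. For ambiguous strings like 1/2/2024,
# the order IS the disambiguation policy, so document it per source.
FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y"]

def parse_date(raw: str):
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None  # unparseable: route to an error queue, don't silently drop

mixed = ["2024-01-02", "1/2/2024", "02-Jan-2024", "not a date"]
print([parse_date(d) for d in mixed])
```

A DD-first source needs "%d/%m/%Y" ahead of "%m/%d/%Y"; mixing both in one column is unrecoverable without source tracking.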
Address Cleaning
4/20
Clean address column. Output: parse into components (street, city, state, zip), standardize abbreviations (St → Street or vice versa), validate via API (Google, USPS), handle international.
Cleans addresses.
Pro tip: Addresses are notoriously messy. APIs (Google Maps, USPS) standardize + validate. Manual cleaning = errors. For volume, API integration is worth it.
Deduplication
4 prompts
Duplicate Detection Strategy
5/20
Detect duplicates in [data]. Output: exact match (single column) easy, fuzzy match (across columns), normalize first then dedupe, when to merge vs keep separate. Strategy depends on data.
Plans duplicate detection.
Pro tip: Detect duplicates AFTER standardization, not before. "Smith" / "smith " / "SMITH" = duplicates after trim+lowercase. Order matters.
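Normalize-then-dedupe can be sketched in a few lines of Python, keeping the first original spelling seen for each normalized key:

```python
def normalize(name: str) -> str:
    """Collapse internal whitespace, trim, lowercase."""
    return " ".join(name.split()).lower()

names = ["Smith", "smith ", "SMITH", "Jones"]

# Dedupe on the normalized key, keep the first original spelling seen.
seen, unique = set(), []
for n in names:
    key = normalize(n)
    if key not in seen:
        seen.add(key)
        unique.append(n)

print(unique)  # ['Smith', 'Jones']
```

Run the same dedupe before normalizing and all four rows survive, which is exactly why order matters.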
Fuzzy Matching
6/20
Fuzzy duplicate detection. Output: methods (Levenshtein distance, soundex, Power Query Fuzzy Match, OpenRefine), tools per scale, tradeoff (precision vs recall). For: customer names, addresses.
Builds fuzzy matching.
Pro tip: Customer names: "Smith Inc" / "Smith Inc." / "Smith Incorporated" = same company. Fuzzy matching catches these. Power Query has fuzzy match; for serious work, OpenRefine is free + powerful.
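For a no-install starting point, Python's standard-library `difflib.SequenceMatcher` gives a 0-1 similarity ratio; it is a stand-in for a true Levenshtein distance, not a replacement for a dedicated matching tool:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive 0.0-1.0 similarity ratio (stdlib, not Levenshtein)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

pairs = [("Smith Inc", "Smith Inc."), ("Smith Inc", "Jones LLC")]
for a, b in pairs:
    print(f"{a} ~ {b}: {similarity(a, b):.2f}")
```

Pick a threshold deliberately: higher = more precision, fewer false merges; lower = more recall, more review work.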
Merge Decision Logic
7/20
Merging duplicates: which value wins. Output: most-recent (by date), most-complete (fewest blanks), source priority (CRM > spreadsheet), manual review queue. Don't auto-merge blindly.
Plans merge logic.
Pro tip: Auto-merging without rules = lost data. "Most-recent" wins for: contact info. "Most-complete" for: profiles. Source-priority for: cross-system. Decide rules; document.
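The "most-recent" and "most-complete" rules can be sketched as small Python functions over duplicate records; the field names (`updated_at`, `phone`, `email`) are illustrative:

```python
# Merge rules as functions over a list of duplicate records (dicts).
# Field names below are hypothetical.

def most_recent(records, date_field="updated_at"):
    """Winner is the record with the latest date (ISO strings sort correctly)."""
    return max(records, key=lambda r: r[date_field])

def most_complete(records):
    """Winner is the record with the fewest blank/None fields."""
    return max(records, key=lambda r: sum(v not in (None, "") for v in r.values()))

dupes = [
    {"email": "a@x.com", "phone": None, "updated_at": "2024-03-01"},
    {"email": "a@x.com", "phone": "+15551234567", "updated_at": "2024-01-15"},
]
print(most_recent(dupes)["updated_at"])  # 2024-03-01
print(most_complete(dupes)["phone"])     # +15551234567
```

Note the two rules pick different winners here, which is why the rule must be chosen per field type and documented.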
Match Confidence Scoring
8/20
Score match confidence (0-100). Output: criteria (exact email = 100, fuzzy name + same company = 80, etc.), threshold for auto-merge, manual review queue. Confidence-driven merging.
Scores match confidence.
Pro tip: Auto-merge >90% confident; queue 60-90% for review; ignore <60%. Tiered processing = efficient + accurate. Eliminates "merge everything" or "merge nothing" extremes.
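The tiered routing can be sketched in Python; the scoring rubric and weights are assumptions to replace with your own criteria:

```python
def score_match(a: dict, b: dict) -> int:
    """Toy 0-100 scoring rubric. Weights and fields are assumptions."""
    if a.get("email") and a.get("email") == b.get("email"):
        return 100  # exact email match: as good as it gets
    score = 0
    if a.get("company") == b.get("company"):
        score += 40
    if a.get("name", "").lower() == b.get("name", "").lower():
        score += 40
    return score

def route(score: int) -> str:
    """Tiered processing: auto-merge, review queue, or ignore."""
    if score > 90:
        return "auto-merge"
    if score >= 60:
        return "manual review"
    return "ignore"

print(route(score_match({"email": "a@x.com"}, {"email": "a@x.com"})))  # auto-merge
```

The thresholds (90/60) come straight from the tip above; tune them against a labeled sample of known duplicates.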
Missing Values
4 prompts
Missing Value Strategy
9/20
Handle missing values. Output: by column, decision (delete row, fill with default, calculate, leave blank, flag for follow-up). Different fields = different strategies.
Strategizes missing values.
Pro tip: Missing values not all equal. Missing income = leave blank (sensitive). Missing zip = lookup from address. Missing required = ask source. Per-column strategy.
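A per-column strategy table keeps the decisions explicit and auditable. This Python sketch uses hypothetical column names and implements only two of the actions (the lookup action is left as a placeholder):

```python
# One rule per column; column names and rules are illustrative.
STRATEGY = {
    "income": "leave_blank",          # sensitive: do not impute
    "zip": "lookup_from_address",     # placeholder: needs an address lookup
    "email": "flag_for_followup",     # required: ask the source
    "category": "fill_default",
}

def handle_missing(row: dict, defaults: dict) -> dict:
    out = dict(row)
    for col, action in STRATEGY.items():
        if out.get(col) in (None, ""):
            if action == "fill_default":
                out[col] = defaults.get(col)
            elif action == "flag_for_followup":
                out[f"{col}_missing"] = True
            # "leave_blank" and "lookup_from_address" fall through untouched
    return out

row = {"income": None, "zip": "", "email": "", "category": None}
print(handle_missing(row, {"category": "uncategorized"}))
```

Because the table is data, it doubles as documentation of your per-column decisions.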
Imputation Methods
10/20
Impute missing values for [column]. Methods: mean/median/mode, regression-based, k-nearest-neighbors, predictive model. Output: when each appropriate, biases introduced. Imputation = trade-off.
Imputes missing values.
Pro tip: Imputation introduces bias. Mean = pulls toward center. Median = ignores extremes. Predictive = best but complex. Document what you imputed; flag if material to analysis.
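Mean/median imputation plus the flag column the tip calls for, sketched with the standard library (pandas `fillna` does the same at scale):

```python
from statistics import mean, median

def impute(values, method="median"):
    """Fill None with the mean or median of observed values, and return
    a parallel was-imputed flag column so the imputation is documented."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if method == "mean" else median(observed)
    filled = [fill if v is None else v for v in values]
    flags = [v is None for v in values]
    return filled, flags

vals = [10, None, 30, 50, None]
filled, flags = impute(vals)
print(filled)  # [10, 30, 30, 50, 30]
print(flags)   # [False, True, False, False, True]
```

Keeping the flag column is what makes the analysis reproducible: you can always recompute with and without the imputed rows.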
Forward/Backward Fill
11/20
Forward / backward fill for time series. Output: when appropriate (slow-changing values, sticky states), tools (Excel formula, Pandas ffill/bfill), alternatives. Time-series specific.
Fills time-series gaps.
Pro tip: Forward fill: take last known value forward. Useful for: stock prices, status codes. Inappropriate for: continuous data (interpolate instead). Match method to data type.
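Forward fill in plain Python, equivalent to pandas `Series.ffill()`; note that leading gaps stay empty because there is no prior value to carry:

```python
def forward_fill(values):
    """Carry the last known value forward; leading gaps stay None."""
    last = None
    out = []
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out

prices = [None, 100.0, None, None, 102.5, None]
print(forward_fill(prices))  # [None, 100.0, 100.0, 100.0, 102.5, 102.5]
```

Backward fill is the mirror image (iterate in reverse); for continuous measurements, interpolate between known points instead of holding flat.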
Missing Data Documentation
12/20
Document missing data handling. Output: which fields had missing, treatment per field, % imputed vs original, impact on analysis, transparency for audit. Documentation = trust.
Documents missing data handling.
Pro tip: Undocumented imputation = analysis can't be reproduced + audited. Document: what was missing, how filled, why. Future-you debugging = thanks past-you.
Outliers + Validation
4 prompts
Outlier Detection
13/20
[Data]. Detect outliers. Methods: standard deviation (>3σ), IQR (>1.5x IQR), percentile (top/bottom 1%), visual (boxplot). Output: identification + investigation. Real or error?
Detects outliers.
Pro tip: Outliers can be: data error, real but extreme, or signal (interesting). Don't reflexively remove. Investigate; sometimes the outlier IS the story.
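The IQR and standard-deviation methods sketched with Python's `statistics` module. One caveat worth seeing: a single extreme value inflates σ so much it can mask itself, which is a practical argument for the IQR rule:

```python
from statistics import mean, stdev, quantiles

def outliers_iqr(values, k=1.5):
    """Tukey's rule: flag values beyond k * IQR from the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

def outliers_sigma(values, k=3):
    """Flag values more than k standard deviations from the mean.
    Caution: an extreme value inflates the stdev and can mask itself."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > k * s]

data = [10, 12, 11, 13, 12, 11, 10, 200]
print(outliers_iqr(data))    # [200]
print(outliers_sigma(data))  # may miss it: 200 inflates the stdev
```

Flag first, then investigate: the 200 might be a typo for 20, or a genuinely interesting record.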
Range Validation
14/20
Range validation: ages should be 0-120, percentages 0-100, dates within reasonable. Output: flag out-of-range, action (correct, exclude, query source).
Validates value ranges.
Pro tip: Range errors are common. "Age = 200" = data entry error. Validation catches these at cleaning time; otherwise wrong values pollute analysis.
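A rules table of per-column bounds, checked in one pass; the column names and limits are illustrative:

```python
# Allowed (lo, hi) bounds per column; bounds are illustrative.
RANGES = {"age": (0, 120), "discount_pct": (0, 100)}

def out_of_range(rows):
    """Return (row_index, column, value) for every bound violation."""
    found = []
    for i, row in enumerate(rows):
        for col, (lo, hi) in RANGES.items():
            v = row.get(col)
            if v is not None and not (lo <= v <= hi):
                found.append((i, col, v))
    return found

rows = [{"age": 34, "discount_pct": 15}, {"age": 200, "discount_pct": 110}]
print(out_of_range(rows))  # [(1, 'age', 200), (1, 'discount_pct', 110)]
```

Flag, don't auto-fix: whether to correct, exclude, or query the source is a per-field decision.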
Cross-Field Validation
15/20
Cross-field validation. Examples: order_date < ship_date, child_age < parent_age, total = sum of components. Output: rules + violations. Logical consistency.
Cross-validates fields.
Pro tip: Single-field validation = basic. Cross-field = sophisticated. "Birth date 2020 + role manager" = inconsistency. Logical rules across columns = quality.
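Cross-field rules express cleanly as named predicates over a whole row. This Python sketch uses hypothetical field names matching the examples above:

```python
from datetime import date

# Each rule: (name, predicate over a row). Field names are illustrative.
RULES = [
    ("ship_after_order", lambda r: r["order_date"] <= r["ship_date"]),
    ("total_is_sum", lambda r: abs(r["total"] - sum(r["items"])) < 0.01),
]

def violations(row):
    """Names of every rule this row breaks."""
    return [name for name, ok in RULES if not ok(row)]

row = {"order_date": date(2024, 3, 5), "ship_date": date(2024, 3, 1),
       "total": 50.0, "items": [20.0, 30.0]}
print(violations(row))  # ['ship_after_order']
```

The tolerance on the sum check matters for floats; for money, compare in integer cents instead.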
Source-of-Truth Reconciliation
16/20
Reconcile data across [source A] vs [source B]. Output: identify discrepancies, decide source of truth per field, document differences, automate ongoing check. Multi-source = where data quality dies.
Reconciles multi-source data.
Pro tip: Two sources = two truths. Common in orgs: CRM vs ERP vs Marketing tool. Decide source-of-truth per field. Automate the check; manual reconciliation doesn't scale.
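Per-field source-of-truth reconciliation can be sketched in Python; the source assignments (CRM wins contact fields, ERP wins billing) and field names are illustrative assumptions:

```python
# Source-of-truth per field. Assignments are illustrative: CRM wins
# for contact fields, ERP wins for billing.
TRUTH = {"email": "crm", "phone": "crm", "balance": "erp"}

def reconcile(crm_rec, erp_rec):
    """Merge per the truth table; log every field where sources disagree."""
    sources = {"crm": crm_rec, "erp": erp_rec}
    merged, discrepancies = {}, []
    for field, winner in TRUTH.items():
        values = {name: rec.get(field) for name, rec in sources.items()}
        if len(set(values.values())) > 1:
            discrepancies.append((field, values))
        merged[field] = values[winner]
    return merged, discrepancies

crm = {"email": "a@x.com", "phone": "+15551234567", "balance": 90.0}
erp = {"email": "a@x.com", "phone": "+15550000000", "balance": 120.0}
merged, diffs = reconcile(crm, erp)
print(merged["phone"], merged["balance"])  # +15551234567 120.0
```

Schedule this as the "automated ongoing check": the discrepancy log is the report, and the truth table is the documented decision.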