Ship Better Data Science Projects with AI
38 ready-to-use ChatGPT prompts for every stage of the data science pipeline — from raw data wrangling to stakeholder-ready reports.
Data Cleaning & Preparation
5 prompts
Missing Value Strategy Advisor
1/38
I have a dataset with [number] rows and [number] columns for a [project goal] project. The following columns have missing values: [column 1] ([X]% missing), [column 2] ([X]% missing), [column 3] ([X]% missing). The data types are: [column 1] is [categorical/numerical/datetime], [column 2] is [categorical/numerical/datetime], [column 3] is [categorical/numerical/datetime]. For each column, recommend the best imputation strategy (mean, median, mode, forward fill, KNN imputation, regression imputation, or deletion). Explain why that method is appropriate given the data type and missingness percentage. Warn me about any patterns of missingness that could introduce bias. Provide the pandas or scikit-learn code for each recommended approach.
Provides tailored missing value strategies for each column based on data type, missingness pattern, and project context.
Pro tip: Always check whether data is missing at random or systematically. If a column is missing 60% of values, imputing it may introduce more noise than signal — consider dropping it.
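The imputation code this prompt asks for might look like the following sketch — column names and values here are invented — with median for a numeric column (robust to skew) and mode for a categorical one:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset; "age" and "segment" are made-up names.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],        # numeric with ~25% missing -> median
    "segment": ["a", "b", None, "b"],          # categorical -> mode
})

df["age"] = df["age"].fillna(df["age"].median())               # median resists outliers
df["segment"] = df["segment"].fillna(df["segment"].mode()[0])  # most frequent category
```

For numeric columns, scikit-learn's SimpleImputer or KNNImputer does the same job inside a pipeline, which keeps the imputation from leaking test-set statistics into training.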
Outlier Detection and Treatment Plan
2/38
I am working with a dataset for [project goal]. Here are summary statistics for my key numerical columns: [Paste df.describe() output or list columns with min, max, mean, std]. For each column: (1) identify likely outliers using IQR, Z-score, and domain logic, (2) recommend whether to cap, remove, transform, or keep each outlier with reasoning, (3) explain how each outlier treatment affects downstream modeling, and (4) provide Python code using pandas and scipy to implement each recommendation. Flag any values that look like data entry errors versus genuine extreme observations.
Creates a systematic outlier treatment plan that distinguishes between errors, extremes, and genuinely informative data points.
Pro tip: Never blindly remove outliers. In fraud detection, churn prediction, and healthcare data, outliers are often the signal you are looking for.
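A minimal version of the IQR rule from step (1), on made-up numbers:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])            # 95 is the suspicious value
q1, q3 = s.quantile(0.25), s.quantile(0.75)
lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)  # the classic 1.5*IQR fence
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # [95]
```

The Z-score and domain-logic checks complement this: the IQR fence alone flags entry errors and genuine extremes alike, which is exactly why the prompt asks the model to distinguish them.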
Data Type Optimization Script
3/38
I have a pandas DataFrame with [number] rows. Here are the current dtypes: [Paste df.dtypes output]. And the first 5 rows: [Paste df.head() output]. Generate a complete Python script that: (1) converts object columns that are actually categorical to the category dtype with the correct category order, (2) downcasts numerical columns to the smallest appropriate type (int64 to int8/int16/int32, float64 to float32 where precision allows), (3) parses any string columns that contain dates into datetime, (4) strips whitespace and normalizes casing in text columns, and (5) prints a memory usage comparison before and after. Explain the expected memory savings.
Optimizes DataFrame memory usage and corrects data types, often reducing memory consumption by 50-80 percent.
Pro tip: Run this script before any heavy computation. On a 10GB dataset, proper dtypes can bring memory down to 2-3GB, which is the difference between crashing and finishing.
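A cut-down sketch of the downcasting logic (steps 1, 2, and 5), with invented columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "count": np.array([1, 2, 3, 250], dtype="int64"),   # values fit in uint8
    "flag": ["yes", "no", "yes", "yes"],                # low-cardinality string
})
before = df.memory_usage(deep=True).sum()

df["count"] = pd.to_numeric(df["count"], downcast="unsigned")  # int64 -> uint8 here
df["flag"] = df["flag"].astype("category")                     # strings -> codes
after = df.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
```

On four rows the savings are trivial; on millions of rows the same two lines are where the 50-80 percent reduction comes from.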
Duplicate Record Detection
4/38
I need to find and handle duplicate or near-duplicate records in a dataset of [describe data — customer records, transactions, product listings, etc.]. The columns are: [list all columns]. Some duplicates may be exact matches, but others may be near-duplicates where [describe fuzzy match scenario — slightly different spellings, extra whitespace, phone number formats, etc.]. Write a Python script that: (1) identifies exact duplicates and counts them, (2) uses fuzzy matching on [columns] to find near-duplicates with a similarity threshold I can adjust, (3) groups potential duplicates together for review, (4) suggests a merge strategy for confirmed duplicates (which record to keep, how to combine conflicting fields), and (5) logs all deduplication decisions for an audit trail.
Builds a deduplication pipeline that catches both exact and fuzzy duplicates with a reviewable audit trail.
Pro tip: Use the rapidfuzz library (the maintained successor to fuzzywuzzy) for string matching. Set your threshold conservatively at first (90+ similarity) and lower it only after reviewing the near-duplicates it misses.
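To illustrate the near-duplicate idea without any third-party dependency, here is the same pattern using only the standard library's difflib (rapidfuzz's ratio works analogously and is much faster on large data); the company names are invented:

```python
from difflib import SequenceMatcher

names = ["Acme Corp", "ACME Corp.", "Globex Inc"]

def similarity(a: str, b: str) -> float:
    # Normalize casing and whitespace before comparing.
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

threshold = 0.90  # conservative starting point, as the tip suggests
pairs = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if similarity(a, b) >= threshold
]
print(pairs)
```

Comparing every pair is O(n²), so on large datasets you would block on a cheap key (e.g. first letter, zip code) before running the fuzzy comparison within blocks.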
Feature Engineering from Raw Columns
5/38
I have a dataset with these columns: [list columns with brief descriptions]. The target variable is [target] and the prediction task is [classification/regression]. Suggest 15-20 new engineered features I could create from these existing columns. For each feature, provide: (1) the feature name and formula, (2) the business intuition for why it would be predictive, (3) the pandas code to create it, and (4) a note on whether it introduces data leakage risk. Group the suggestions by type: ratio features, aggregation features, time-based features, interaction features, and binned features.
Generates a comprehensive set of engineered features with business rationale and leakage warnings.
Pro tip: Feature engineering is where domain knowledge beats algorithms. The best features come from understanding the business, not from automated feature generation tools.
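As one concrete instance of a ratio feature (the first group above) — column names invented, with a zero-division guard:

```python
import pandas as pd

df = pd.DataFrame({
    "total_spend": [120.0, 300.0, 0.0],
    "num_orders": [4, 10, 0],
})

# Average order value; rows with zero orders get 0 instead of a division error.
df["avg_order_value"] = (
    df["total_spend"] / df["num_orders"].where(df["num_orders"] > 0)
).fillna(0.0)
print(df["avg_order_value"].tolist())  # [30.0, 30.0, 0.0]
```

Note the feature uses only information available at prediction time, so it carries no leakage risk — the check step (4) asks for.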
Exploratory Data Analysis
5 prompts
Complete EDA Report Generator
6/38
I have a dataset about [topic] with [number] rows and these columns: [list columns with types]. The goal of the analysis is [describe goal]. Write a complete Python EDA script using pandas, matplotlib, and seaborn that: (1) generates summary statistics with df.describe() and df.info(), (2) creates distribution plots for all numerical columns with skewness and kurtosis annotations, (3) produces a correlation heatmap with the top 10 strongest correlations listed, (4) creates bar charts for all categorical variables showing value counts, (5) generates bivariate plots between the target variable and the top 5 most correlated features, (6) identifies and visualizes any obvious data quality issues, and (7) prints a text summary of the 5 most important findings. Use a clean plotting style with proper labels and titles.
Produces a publication-ready EDA script that covers univariate, bivariate, and data quality analysis in one run.
Pro tip: Run the EDA before touching any model. Thirty minutes of EDA saves hours of debugging mysterious model behavior caused by data issues you would have caught visually.
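A fragment of steps (1)-(2) — describe() extended with a skewness column — sketched on invented data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "spend": rng.lognormal(3, 1, 500),   # deliberately right-skewed
    "visits": rng.poisson(5, 500),
})

summary = df.describe().T                # one row per column
summary["skew"] = df.skew()              # flag skewed columns at a glance
print(summary[["mean", "std", "skew"]].round(2))
```

A skew column in the summary table immediately tells you which distribution plots deserve a closer look.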
Correlation Deep Dive
7/38
Here is the correlation matrix from my dataset: [Paste correlation matrix or describe key correlations]. Target variable: [target]. Analyze these correlations and: (1) identify the top 10 features most correlated with the target and explain the likely causal relationship versus spurious correlation for each, (2) flag multicollinearity — which feature pairs have correlation above 0.8 and which one should I drop from each pair, (3) suggest features that have low linear correlation but might have nonlinear relationships worth exploring, (4) recommend specific interaction terms based on the correlation structure, and (5) provide Python code to create a filtered correlation heatmap showing only statistically significant correlations (p < 0.05).
Interprets correlation matrices beyond simple numbers, identifying genuine signals, multicollinearity traps, and hidden nonlinear relationships.
Pro tip: High correlation with the target does not mean causation. Check whether the correlated feature is available at prediction time — using future data is the most common leakage mistake.
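Step (2) — flagging pairs above 0.8 — can be sketched on synthetic data like this:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "x_copy": 2 * x + rng.normal(scale=0.01, size=200),  # near-duplicate of x
    "noise": rng.normal(size=200),
})

corr = df.corr().abs()
cols = corr.columns
high_pairs = [
    (cols[i], cols[j])                      # upper triangle: each pair once
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if corr.iloc[i, j] > 0.8
]
print(high_pairs)  # [('x', 'x_copy')]
```

Which member of each flagged pair to drop is a judgment call — keep the one that is cheaper to compute or easier to explain.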
Segment Comparison Analysis
8/38
I want to compare [segment A] vs [segment B] in my dataset. The segmentation column is [column name] and it has values [list values]. The metrics I care about are: [list 4-6 numerical columns]. Write a Python script that: (1) calculates mean, median, and standard deviation for each metric by segment, (2) runs appropriate statistical tests (t-test for normal distributions, Mann-Whitney U for skewed) to determine if differences are significant, (3) calculates effect sizes (Cohen's d) so I know if differences are practically meaningful, (4) creates side-by-side violin plots for each metric, and (5) generates a summary table with p-values, effect sizes, and a plain-English interpretation of each comparison.
Produces statistically rigorous segment comparisons with both significance testing and practical effect size assessment.
Pro tip: Statistical significance is not the same as practical significance. A p-value of 0.001 on a 0.2 percent difference in conversion rate is technically significant but probably not worth acting on.
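Steps (2)-(3) in miniature, on simulated segments with a true difference of half a standard deviation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
seg_a = rng.normal(loc=100, scale=10, size=300)
seg_b = rng.normal(loc=105, scale=10, size=300)

t_stat, p_value = stats.ttest_ind(seg_a, seg_b)

# Cohen's d with a pooled standard deviation; ~0.5 is a "medium" effect.
pooled_sd = np.sqrt((seg_a.var(ddof=1) + seg_b.var(ddof=1)) / 2)
cohens_d = (seg_b.mean() - seg_a.mean()) / pooled_sd
print(p_value < 0.05, round(cohens_d, 2))
```

Reporting both numbers side by side is the point: the p-value says the difference is real, Cohen's d says whether it is big enough to matter.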
Time Series Pattern Discovery
9/38
I have time series data with a [daily/weekly/monthly] frequency spanning [time period]. The columns are: [date column] and [list value columns]. Write a Python script using pandas and statsmodels that: (1) plots the raw time series with a rolling average overlay, (2) performs seasonal decomposition to separate trend, seasonality, and residual components, (3) runs an Augmented Dickey-Fuller test for stationarity, (4) creates autocorrelation (ACF) and partial autocorrelation (PACF) plots, (5) identifies any structural breaks or change points, and (6) summarizes the key patterns in plain English: is there a trend, what is the seasonal period, are there anomalous windows. Use [target column] as the primary series.
Decomposes time series data into interpretable components and tests for stationarity, seasonality, and structural changes.
Pro tip: Always visualize time series data before modeling. A 30-second plot often reveals trends, seasonality, or outliers that no summary statistic can capture.
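Before reaching for statsmodels' seasonal_decompose, the rolling-average overlay from step (1) alone is often revealing; a pandas-only sketch on a synthetic weekly series (trend plus a yearly cycle):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2023-01-01", periods=104, freq="W")
t = np.arange(104)
series = pd.Series(10 + 0.5 * t + 5 * np.sin(2 * np.pi * t / 52), index=idx)

# A centered 52-week window averages the yearly cycle away, leaving the trend.
trend = series.rolling(window=52, center=True).mean()
detrended = series - trend
print(round(detrended.dropna().abs().max(), 1))
```

What is left after subtracting the rolling mean is roughly the seasonal component — the same intuition seasonal decomposition formalizes.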
Distribution Analysis for Feature Selection
10/38
I am building a [classification/regression] model with [number] features. Before modeling, I need to understand the distribution of each feature. Here are my columns: [list numerical columns]. Write a Python script that: (1) plots histograms with KDE overlays for each feature, (2) tests for normality using the Shapiro-Wilk test, (3) identifies heavily skewed features (skew > 1 or < -1) and suggests transformations (log, square root, Box-Cox), (4) applies the recommended transformations and plots before/after comparisons, (5) flags zero-variance and near-zero-variance features for removal, and (6) creates a final summary table listing each feature with its distribution type, recommended transformation, and keep/drop recommendation.
Assesses the distribution of every feature and recommends transformations to improve model performance.
Pro tip: Tree-based models (Random Forest, XGBoost) do not care about distributions, but linear models and neural networks do. Match your preprocessing to your algorithm.
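The skew check and log fix from steps (2)-(4), sketched on simulated income-like data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
income = rng.lognormal(mean=10, sigma=1, size=1000)   # heavily right-skewed

skew_before = stats.skew(income)
skew_after = stats.skew(np.log1p(income))             # log transform tames the tail
print(round(skew_before, 1), round(skew_after, 2))
```

Box-Cox (scipy.stats.boxcox) generalizes this when a plain log is not enough; note it requires strictly positive values.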
Statistical Analysis
5 prompts
Hypothesis Test Selector
11/38
I need to test the following hypothesis: [describe your hypothesis in plain English, e.g., "customers who received the new email subject line had a higher open rate than those who received the old one"]. My data: [describe data — sample sizes, data types, number of groups]. Walk me through: (1) the correct null and alternative hypotheses, (2) which statistical test to use and why (consider sample size, data distribution, number of groups, paired vs independent, categorical vs continuous), (3) the assumptions of that test and how to check each one, (4) the complete Python code using scipy.stats to run the test, (5) how to interpret the p-value and confidence interval in plain English, and (6) what to do if the assumptions are violated — provide the nonparametric alternative and its code.
Selects the right statistical test for your specific scenario and walks you through assumptions, execution, and interpretation.
Pro tip: The most common mistake is choosing a test before checking assumptions. Always verify normality and variance equality before defaulting to a t-test.
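The check-assumptions-then-choose flow from steps (3) and (6), sketched with scipy on simulated skewed data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group_a = rng.exponential(scale=2.0, size=80)   # clearly non-normal
group_b = rng.exponential(scale=3.0, size=80)

# Verify normality before reaching for a t-test.
normal_a = stats.shapiro(group_a).pvalue > 0.05
normal_b = stats.shapiro(group_b).pvalue > 0.05

if normal_a and normal_b:
    result = stats.ttest_ind(group_a, group_b)
else:
    result = stats.mannwhitneyu(group_a, group_b)   # nonparametric fallback
print(type(result).__name__, result.pvalue)
```

Here the Shapiro-Wilk test rejects normality, so the code falls through to Mann-Whitney U — exactly the mistake-avoiding order the tip describes.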
A/B Test Sample Size Calculator
12/38
I am designing an A/B test for [describe what you are testing]. Baseline metric: [current conversion rate or mean]. Minimum detectable effect: [the smallest improvement worth detecting, e.g., 2 percentage points or 5 percent relative lift]. Significance level: [typically 0.05]. Power: [typically 0.80]. Write Python code using statsmodels that: (1) calculates the required sample size per variant, (2) estimates how long the test needs to run given [daily traffic/events], (3) calculates the sample size for different MDE values so I can see the tradeoff between sensitivity and test duration, (4) adjusts for multiple variants if I want to test [number] variants instead of 2, and (5) plots a power curve showing how detection probability changes with sample size. Explain each parameter in plain English.
Calculates proper A/B test sample sizes with power analysis to avoid underpowered experiments that waste time.
Pro tip: Running an A/B test without a power calculation is like driving without a destination. You will stop at the wrong time and draw the wrong conclusion.
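The core calculation in step (1) has a closed form for a two-proportion test; here it is using scipy only for the normal quantiles (statsmodels provides equivalent solvers, and this textbook formula is enough to sanity-check them):

```python
import math
from scipy.stats import norm

def n_per_variant(p_base, mde_abs, alpha=0.05, power=0.80):
    """Normal-approximation sample size for a two-proportion test."""
    p_new = p_base + mde_abs
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)          # z_{alpha/2} + z_{beta}
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return math.ceil(z ** 2 * variance / mde_abs ** 2)

# 10% baseline conversion, detect an absolute lift of 2 percentage points.
n = n_per_variant(0.10, 0.02)
print(n)
```

Step (2) is then just division: at, say, 500 visitors per variant per day, this n translates to roughly an eight-day test.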
Regression Diagnostics Suite
13/38
I have fitted a [linear/logistic/multiple] regression model to predict [target] using [list features]. Here are the model results: [Paste model summary or describe coefficients and R-squared]. Write a Python script that performs a complete diagnostic check: (1) residual vs fitted values plot to check for homoscedasticity, (2) Q-Q plot to check normality of residuals, (3) Cook's distance plot to identify influential observations, (4) VIF calculation for each predictor to check multicollinearity, (5) Breusch-Pagan test for heteroscedasticity, (6) Durbin-Watson test for autocorrelation, and (7) a summary report stating which assumptions are met and which are violated, with specific remediation steps for each violation.
Runs the complete suite of regression diagnostics and provides actionable fixes for any violated assumptions.
Pro tip: Never trust a regression model until you have checked its residuals. A high R-squared means nothing if the residuals show patterns — the model is missing something systematic.
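Step (4), the VIF check, reduces to one auxiliary regression per predictor (statsmodels ships this as variance_inflation_factor); a numpy-only sketch on synthetic collinear data:

```python
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.normal(size=500)
x2 = 0.95 * x1 + rng.normal(scale=0.3, size=500)   # collinear with x1
x3 = rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R^2) from regressing column j on the remaining columns."""
    y = X[:, j]
    A = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r2 = 1 - ((y - A @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return 1 / (1 - r2)

print([round(vif(X, j), 1) for j in range(3)])  # x1 and x2 inflated, x3 near 1
```

A common rule of thumb treats VIF above 5-10 as a multicollinearity problem worth fixing.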
Bayesian vs Frequentist Analysis Comparison
14/38
I have data from an experiment: [describe experiment — groups, sample sizes, outcomes]. Run both a frequentist and Bayesian analysis of this data. For the frequentist approach: run the appropriate hypothesis test, report the p-value and confidence interval. For the Bayesian approach: use PyMC or scipy to compute the posterior distribution with a [weakly informative/uninformative] prior, report the credible interval and the probability that [hypothesis]. Then: (1) compare the two approaches and explain where they agree and disagree, (2) explain which approach is more appropriate for my specific use case and why, (3) visualize the posterior distribution with the credible interval marked, and (4) explain how the results would change with a more informative prior.
Runs parallel frequentist and Bayesian analyses so you can compare conclusions and choose the framework that fits your decision context.
Pro tip: Use Bayesian analysis when you have meaningful prior information or when stakeholders need to understand probability of outcomes directly rather than p-values.
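For conversion-style outcomes the Bayesian side needs no MCMC at all — the Beta-Binomial posterior is exact. A sketch with invented counts and a flat prior:

```python
import numpy as np
from scipy.stats import beta

# Hypothetical results: 30/1000 conversions in A, 45/1000 in B, Beta(1, 1) prior.
post_a = beta(1 + 30, 1 + 970)
post_b = beta(1 + 45, 1 + 955)

ci_b = post_b.interval(0.95)                      # 95% credible interval for B's rate

rng = np.random.default_rng(0)
draws_a = post_a.rvs(20_000, random_state=rng)
draws_b = post_b.rvs(20_000, random_state=rng)
prob_b_better = (draws_b > draws_a).mean()        # P(rate_B > rate_A | data)
print(round(prob_b_better, 3))
```

A weakly informative prior such as Beta(2, 50) would pull both posteriors toward low rates, but with a thousand observations per arm the data dominates and the conclusion barely moves — which is exactly the sensitivity check step (4) asks for.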
Multi-Group Comparison with Post-Hoc Tests
15/38
I need to compare a metric across [number] groups. The groups are: [list group names]. The metric is: [describe metric]. Sample sizes per group: [list sizes]. Run a complete multi-group analysis: (1) check assumptions — normality per group (Shapiro-Wilk) and homogeneity of variance (Levene test), (2) if assumptions met, run one-way ANOVA; if not, run Kruskal-Wallis, (3) if the omnibus test is significant, run post-hoc pairwise comparisons with Bonferroni or Tukey HSD correction, (4) calculate effect sizes (eta-squared for ANOVA), (5) create a box plot with statistical annotations showing which pairs are significantly different, and (6) write a plain-English summary suitable for a non-technical stakeholder report.
Performs rigorous multi-group comparison with appropriate omnibus tests, post-hoc corrections, and stakeholder-ready summaries.
Pro tip: Always correct for multiple comparisons. Testing 6 pairs without correction gives you a 26 percent chance of a false positive, even when there is no real difference.
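Steps (2)-(3) in miniature — an omnibus ANOVA followed by Bonferroni-corrected pairwise t-tests on simulated groups (Tukey HSD, available as scipy.stats.tukey_hsd in recent scipy versions, is the less conservative alternative):

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
groups = {
    "a": rng.normal(50, 5, 60),
    "b": rng.normal(50, 5, 60),
    "c": rng.normal(58, 5, 60),   # the genuinely different group
}

f_stat, p_omnibus = stats.f_oneway(*groups.values())

significant = []
if p_omnibus < 0.05:
    pairs = list(combinations(groups, 2))
    alpha = 0.05 / len(pairs)     # Bonferroni: divide alpha by number of pairs
    significant = [
        pair for pair in pairs
        if stats.ttest_ind(groups[pair[0]], groups[pair[1]]).pvalue < alpha
    ]
print(significant)
```

Only the pairs involving group "c" should survive the correction, matching how the data was generated.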
Machine Learning Model Selection
5 prompts
Model Selection Advisor
16/38
I am building a [classification/regression] model. Here is my situation: dataset size is [number] rows and [number] features. The target variable is [describe — continuous, binary, multiclass]. Class balance: [balanced/imbalanced — give ratio]. Feature types: [number] numerical, [number] categorical, [number] text. Latency requirement: [real-time / batch]. Interpretability requirement: [must explain predictions / black box is fine]. Recommend the top 3 algorithms I should try, ranked by expected fit for this scenario. For each algorithm: (1) explain why it suits my constraints, (2) list the key hyperparameters to tune and reasonable ranges, (3) flag potential pitfalls for my specific data characteristics, and (4) provide the scikit-learn or XGBoost code to train a baseline version with sensible defaults.
Recommends the best ML algorithms for your specific constraints with ready-to-run baseline code.
Pro tip: Start with the simplest model that could work. If logistic regression gets you 85 percent accuracy and XGBoost gets 87 percent, the complexity is rarely worth the 2 percent gain.
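Before comparing algorithms at all, it helps to know the floor. A majority-class baseline (what scikit-learn's DummyClassifier automates) takes three lines — the labels here are invented:

```python
from collections import Counter

# Hypothetical labels: 85 negatives, 15 positives.
y_true = [0] * 85 + [1] * 15

majority = Counter(y_true).most_common(1)[0][0]           # always predict this class
baseline_acc = sum(1 for y in y_true if y == majority) / len(y_true)
print(majority, baseline_acc)  # 0 0.85
```

If your fancy model cannot clearly beat this number, no amount of hyperparameter tuning will save it.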
Hyperparameter Tuning Strategy
17/38
I am tuning a [model name] for [task description]. Current performance: [metric] = [value]. The hyperparameters available are: [list hyperparameters]. My compute budget allows [number] total training runs. Design a tuning strategy: (1) recommend which hyperparameters to tune first based on typical impact (rank by importance), (2) suggest the search method (grid search, random search, Bayesian optimization) and justify the choice for my compute budget, (3) define the search space for each hyperparameter with specific ranges, (4) provide the complete Python code using Optuna or scikit-learn for the search, (5) explain early stopping criteria to avoid wasting compute on bad combinations, and (6) describe how to analyze the tuning results to understand which hyperparameters matter most.
Creates a compute-efficient hyperparameter tuning plan that prioritizes the parameters with the highest impact.
Pro tip: Random search with 60 iterations almost always outperforms grid search with 60 iterations. Grid search wastes compute exploring unimportant parameter combinations.
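The random-search idea from the tip is just independent sampling from the search space (Optuna's samplers do this more intelligently, but the skeleton is the same); parameter names and ranges here are invented:

```python
import random

random.seed(0)
# Hypothetical search space for a gradient-boosted model.
space = {
    "learning_rate": [0.01, 0.03, 0.1, 0.3],
    "max_depth": [3, 5, 7],
    "n_estimators": [100, 300, 1000],
}

def random_configs(space, n):
    """Sample n independent configurations instead of walking a fixed grid."""
    return [{k: random.choice(v) for k, v in space.items()} for _ in range(n)]

trials = random_configs(space, 10)
print(trials[0])
```

With the same 10-run budget, a grid would only ever visit a handful of values per parameter; random sampling spreads the budget across the whole space.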
Cross-Validation Strategy Designer
18/38
I need to evaluate my model properly. My data: [describe — size, time component, groups/clusters, class balance]. The model is a [model type] predicting [target]. Design the right cross-validation strategy: (1) recommend the CV method (k-fold, stratified k-fold, group k-fold, time series split, nested CV) and explain why it fits my data structure, (2) explain what would go wrong if I used simple k-fold instead, (3) specify the number of folds and any shuffle settings, (4) provide complete Python code implementing the CV with proper metric reporting (mean, std, per-fold results), (5) include a visualization of the CV splits so I can verify they look correct, and (6) explain how to use the CV results to estimate real-world performance with appropriate confidence bounds.
Designs a cross-validation strategy matched to your data structure to avoid inflated performance estimates.
Pro tip: If your data has a time component, always use time series split. Random k-fold on time series data leaks future information and makes your model look better than it is.
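The split pattern the tip insists on can be sketched in a few lines (scikit-learn's TimeSeriesSplit implements the same expanding-window idea):

```python
def time_series_splits(n_samples, n_folds):
    """Expanding window: always train on everything before the test window."""
    fold = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        yield list(range(fold * k)), list(range(fold * k, fold * k + fold))

for train_idx, test_idx in time_series_splits(12, 3):
    print(f"train size {len(train_idx):2d} -> test {test_idx}")
```

Every training index precedes every test index, so no future information leaks backward — the property random k-fold destroys.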
Class Imbalance Handling Playbook
19/38
I have a binary classification problem with severe class imbalance. Positive class: [number] samples ([percentage]%). Negative class: [number] samples ([percentage]%). Model: [current model]. Current metrics: precision = [X], recall = [X], F1 = [X], AUC = [X]. Create a comprehensive plan: (1) recommend 3 resampling techniques (SMOTE, ADASYN, random undersampling, etc.) and explain which fits my situation best, (2) suggest 3 algorithmic approaches (class weights, threshold tuning, cost-sensitive learning), (3) provide Python code implementing the top recommendation with proper cross-validation (resample inside the fold, not before), (4) explain which evaluation metrics to use and which to ignore for imbalanced data, (5) show how to find the optimal classification threshold using precision-recall curves, and (6) warn about common mistakes that make imbalanced problems look solved when they are not.
Provides a multi-pronged strategy for handling class imbalance with proper evaluation methodology.
Pro tip: Never resample before splitting your data. Apply SMOTE or undersampling inside each cross-validation fold to avoid leaking synthetic data into your test set.
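The ordering the tip insists on — split first, then resample only the training side — can be shown with plain random oversampling standing in for SMOTE (all numbers invented):

```python
import random

random.seed(0)
rows = [(i, 0) for i in range(90)] + [(i, 1) for i in range(90, 100)]  # 10% positives
random.shuffle(rows)

cut = int(len(rows) * 0.8)
train, test = rows[:cut], rows[cut:]            # 1. split FIRST

minority = [r for r in train if r[1] == 1]      # 2. resample the training set only
majority = [r for r in train if r[1] == 0]
train_balanced = majority + minority * (len(majority) // max(len(minority), 1))

# The test set keeps the real-world imbalance, so evaluation stays honest.
print(len(train_balanced), len(test))
```

With SMOTE the same rule applies inside every CV fold — imbalanced-learn's Pipeline exists precisely to enforce it.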
Model Comparison Framework
20/38
I have trained [number] different models for [task]: [list model names]. I need to decide which one to deploy. Provide a rigorous model comparison framework: (1) list the metrics I should compare (accuracy, precision, recall, F1, AUC, log loss, calibration — explain which matter most for [business context]), (2) write Python code to run all models through the same cross-validation with a paired comparison, (3) run McNemar's test or a corrected paired t-test to determine if performance differences are statistically significant, (4) create a radar chart comparing all models across multiple metrics, (5) add a complexity comparison — training time, inference time, model size, interpretability score, (6) produce a final recommendation table with a weighted scoring system based on my priorities: [rank importance of accuracy, speed, interpretability, maintainability].
Compares models rigorously across performance, complexity, and business requirements to avoid choosing based on a single metric.
Pro tip: The best model on paper is not always the best model in production. A model that is 1 percent less accurate but 10x faster and fully interpretable is often the right choice.
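The weighted scoring table from step (6) is a one-liner once the scores are normalized; everything here — models, scores, weights — is invented for illustration:

```python
# Invented scores on a 0-1 scale for two candidate models.
models = {
    "logreg":  {"accuracy": 0.85, "speed": 0.95, "interpretability": 1.00},
    "xgboost": {"accuracy": 0.90, "speed": 0.60, "interpretability": 0.40},
}
weights = {"accuracy": 0.5, "speed": 0.3, "interpretability": 0.2}  # sum to 1

scores = {
    name: sum(weights[k] * v for k, v in metrics.items())
    for name, metrics in models.items()
}
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))  # logreg 0.91
```

With these weights the slightly less accurate but faster, interpretable model wins — the scenario the tip describes, made explicit.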
Data Visualization
5 prompts
Executive Dashboard Design
21/38
I need to create a data dashboard for [audience — executives, product team, clients]. The key metrics are: [list 5-8 KPIs with brief descriptions]. The data updates [daily/weekly/monthly]. Design the dashboard: (1) recommend the layout — which metrics get top placement and why, (2) specify the chart type for each metric (line, bar, gauge, number card, etc.) with justification, (3) define the color scheme and visual hierarchy, (4) include comparison context for each metric (vs last period, vs target, vs benchmark), (5) suggest 2-3 drill-down views for power users, and (6) provide complete Python code using Plotly Dash or Streamlit to build a working prototype. The dashboard should tell a story, not just display numbers.
Designs a stakeholder-appropriate dashboard with proper visual hierarchy, context, and a working prototype.
Pro tip: The best dashboards answer a question before the viewer asks it. Lead with the metric they care about most, and put comparison context right next to it.
Publication-Quality Statistical Plots
22/38
I need to create [number] plots for a [report/presentation/paper]. The data shows: [describe the key finding each plot should communicate]. For each plot: (1) recommend the best chart type to communicate that specific finding, (2) provide complete matplotlib or seaborn code with publication-quality formatting — proper font sizes (12pt+ for labels), high DPI (300), clean white background, minimal gridlines, (3) include informative titles and axis labels that state the finding, not just the variable names, (4) add annotations highlighting the key takeaway, (5) use a colorblind-friendly palette, and (6) export as both PNG and SVG. Each plot should be understandable without reading surrounding text.
Creates presentation-ready plots with proper formatting, annotations, and accessibility standards.
Pro tip: Title your plots with the insight, not the variables. "Revenue grew 40 percent after campaign launch" is a better title than "Revenue over Time."
Interactive Exploration Notebook
23/38
I have a dataset about [topic] with columns: [list columns]. Create a Jupyter notebook with interactive visualizations using Plotly that lets me explore the data dynamically: (1) an interactive scatter plot with dropdown selectors for x-axis, y-axis, and color-by columns, (2) a filterable histogram with a slider for bin size, (3) an interactive correlation explorer where clicking a cell in the heatmap shows the underlying scatter plot, (4) a parallel coordinates plot for comparing [3-5 numerical features] across [categorical grouping], and (5) a time series viewer with range slider if there is a date column. Include ipywidgets for any controls. Add markdown cells explaining what to look for in each visualization.
Builds an interactive data exploration notebook that lets you discover patterns through visual manipulation.
Pro tip: Interactive notebooks are great for exploration but terrible for sharing findings. Once you find something interesting, create a static, annotated version for your report.
Geospatial Data Visualization
24/38
I have data with geographic components: [describe — lat/long coordinates, country names, state/province, zip codes, city names]. The metric I want to visualize is [describe metric]. Total rows: [number]. Create a Python script using Folium or Plotly that: (1) creates a choropleth map colored by [metric] at the [country/state/county] level, (2) adds interactive tooltips showing the location name and metric value on hover, (3) includes a heatmap layer for point data if I have lat/long coordinates, (4) adds a legend with meaningful color breakpoints (not arbitrary quantiles), (5) implements clustering for dense point data to avoid overplotting, and (6) exports both an interactive HTML version and a static image for reports. Use a sequential color scale appropriate for [metric type].
Creates interactive geographic visualizations with choropleth maps, heatmaps, and clustering for spatial data patterns.
Pro tip: Choose your color breakpoints based on meaningful thresholds, not just equal quantiles. The difference between 10 and 50 matters more than the difference between 50 and 55.
Model Performance Visualization Suite
25/38
I have a trained [classification/regression] model. Predictions and actuals: [Describe or paste sample of y_true, y_pred, y_prob]. Create a complete model performance visualization suite: (1) confusion matrix with counts and percentages, (2) ROC curve with AUC score annotated (for classification), (3) precision-recall curve with average precision (for classification), (4) calibration plot comparing predicted probabilities to actual frequencies, (5) residual plots — predicted vs actual, residual distribution, residual vs each feature (for regression), (6) feature importance plot — bar chart of top 20 features, and (7) SHAP summary plot showing feature impact on individual predictions. Use a consistent visual style across all plots and arrange them in a 2x4 grid for a single-page overview.
Generates a comprehensive visual audit of model performance covering discrimination, calibration, residuals, and feature importance.
Pro tip: The calibration plot is the most underused diagnostic. A model with great AUC but poor calibration gives you misleading probability estimates, which matters when probabilities drive decisions.
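The calibration plot from step (4) is just binned averages of predicted probability versus observed frequency; a numpy sketch on synthetic, perfectly calibrated scores:

```python
import numpy as np

rng = np.random.default_rng(9)
y_prob = rng.uniform(size=5000)                          # model scores
y_true = (rng.uniform(size=5000) < y_prob).astype(int)   # calibrated by construction

edges = np.linspace(0, 1, 11)
bin_ids = np.digitize(y_prob, edges[1:-1])               # 10 equal-width bins
gaps = []
for b in range(10):
    in_bin = bin_ids == b
    # Gap between mean predicted probability and observed positive rate.
    gaps.append(abs(y_prob[in_bin].mean() - y_true[in_bin].mean()))
print([round(g, 3) for g in gaps])                       # all near zero = calibrated
```

For a real model, large gaps in any bin are the visual signature of miscalibration; scikit-learn's calibration_curve computes the same quantities.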
Reporting & Communication
5 prompts
Technical Findings to Executive Summary
26/38
I completed a data science analysis and need to present findings to [audience — C-suite, product manager, non-technical client]. Here are my technical findings: [Paste your technical findings, including model performance, key statistics, and conclusions]. Rewrite this as a one-page executive summary that: (1) starts with the business question and the bottom-line answer in the first sentence, (2) translates all statistical results into business impact (dollars, percentages, customer counts — not p-values or R-squared), (3) includes exactly 3 key findings with one supporting data point each, (4) states the recommended action clearly, (5) acknowledges limitations in one sentence without undermining confidence, and (6) ends with a specific next step and timeline. Use no jargon. A smart executive with no data science background should understand every word.
Transforms technical analysis into a decision-focused executive summary that drives action.
Pro tip: Executives do not want to know what you did or how you did it. They want to know what you found, what it means for the business, and what they should do about it.
Data Science Project Documentation
27/38
I need to document a data science project so my team can understand, reproduce, and extend it. Project: [describe project]. Model: [describe model]. Data sources: [list]. Write a documentation template that covers: (1) project overview — problem statement, stakeholders, success criteria, (2) data dictionary — table with every column, its type, description, source, and any transformations applied, (3) methodology — model selection rationale, feature engineering decisions, validation approach, (4) results — key metrics with confidence intervals, comparison to baseline, (5) deployment notes — how to retrain, input/output formats, latency expectations, (6) known limitations and failure modes, (7) monitoring plan — what metrics to track post-deployment and alert thresholds, and (8) changelog — version history with what changed and why.
Creates comprehensive project documentation that enables reproducibility and smooth handoffs between team members.
Pro tip: Document your failures too. Knowing which approaches did not work and why saves the next person from repeating your experiments.
Stakeholder Q&A Prep Sheet
28/38
I am presenting a data science project to [stakeholder group]. My analysis shows [summarize key finding]. The recommended action is [action]. Anticipate the 10 toughest questions stakeholders will ask, grouped into: (1) methodology challenges — "Why did you use this approach?", "How confident are you?", (2) data quality concerns — "Is the data reliable?", "What about missing data?", (3) business impact questions — "What is the ROI?", "How does this compare to the status quo?", (4) implementation concerns — "How long to implement?", "What are the risks?", and (5) political/organizational challenges — "Does this affect my team?", "Why should we change?" For each question, provide a concise, confident answer with one supporting data point. Include a "bridge" phrase to redirect hostile questions back to the key finding.
Prepares you for the hardest questions stakeholders will ask by anticipating objections and drafting data-backed responses.
Pro tip: The question you fear most is the one you should prepare for first. If you cannot answer it convincingly, your recommendation has a gap.
Automated Report Generator Script
29/38
I need to create a recurring [daily/weekly/monthly] data report that runs automatically. The report should cover: [list 4-6 metrics or sections]. Data sources: [describe — SQL database, CSV files, API]. Audience: [who reads it]. Write a complete Python script that: (1) connects to [data source] and pulls the required data, (2) calculates all metrics with period-over-period comparisons, (3) generates visualizations using matplotlib, (4) formats everything into a clean HTML email or PDF using Jinja2 templates, (5) highlights any metrics that breach alert thresholds in red, (6) includes an automatically generated commentary section for significant changes, and (7) sends the report via email using smtplib or saves to a shared folder. Include error handling and logging so I know when the report fails.
Builds a self-running report pipeline that generates, formats, and delivers recurring data reports without manual intervention.
Pro tip: Build the report manually first, then automate. Automating a bad report just means you deliver bad analysis faster.
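The metric-calculation and alert-highlighting steps of such a script can be sketched in miniature. This is a minimal sketch with hard-coded sample metrics (the metric names, values, and thresholds are invented for illustration) and stdlib `string.Template` standing in for Jinja2; a real version would pull data from your source and send the result via smtplib:

```python
import logging
from string import Template

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("report")

# Hypothetical metrics: name -> (current, previous, alert_threshold)
METRICS = {
    "revenue": (120_000, 110_000, 100_000),
    "signups": (850, 900, 1_000),  # below threshold -> flagged red
}

ROW = Template('<tr><td>$name</td>'
               '<td style="color:$color">$value ($delta%)</td></tr>')

def build_report(metrics):
    """Render metrics with period-over-period deltas; breaches in red."""
    rows = []
    for name, (cur, prev, threshold) in metrics.items():
        delta = round(100 * (cur - prev) / prev, 1)  # period-over-period %
        color = "red" if cur < threshold else "black"
        if color == "red":
            log.warning("metric %s breached threshold %s", name, threshold)
        rows.append(ROW.substitute(name=name, color=color,
                                   value=cur, delta=delta))
    return "<table>" + "".join(rows) + "</table>"

html = build_report(METRICS)
```

The same structure extends naturally: swap the dict for a database query, the template for a full Jinja2 layout, and the return value for an email body.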
Data Storytelling Narrative Builder
30/38I have the following data insights from my analysis: [List 5-7 findings with their supporting metrics] Help me weave these into a compelling data story for [audience]. Structure it as: (1) the hook — start with the most surprising or high-stakes finding to grab attention, (2) the context — what was happening before, what question we set out to answer, (3) the journey — walk through 3-4 findings in a logical sequence where each builds on the previous one, (4) the climax — the key insight that changes how we should think or act, (5) the resolution — the specific recommendation and expected outcome, and (6) the call to action — what needs to happen next and by when. For each section, suggest the supporting chart or visual. Write it as a presentation script I can deliver in [10/15/20] minutes.
Transforms a list of disconnected findings into a narrative arc that persuades and motivates action.
Pro tip: Data does not speak for itself. The same data can tell a story of crisis or opportunity depending on how you frame it. Choose the frame that drives the action you need.
Python & SQL Code Generation
3 promptsComplex SQL Query Builder
31/38I need to write a SQL query against this schema: [Describe tables, columns, primary/foreign keys, or paste CREATE TABLE statements] The business question is: [describe what you need to answer, e.g., "What is the monthly revenue per customer segment for the last 12 months, including customers who made no purchases"]. Write the SQL query with: (1) proper JOIN types and explain why each is LEFT/INNER/FULL, (2) window functions where they simplify the logic, (3) CTEs for readability if the query is complex, (4) handling of NULL values and edge cases, (5) performance considerations — suggest indexes if appropriate, and (6) comments explaining each major section. Test the logic by describing what each CTE or subquery produces. Optimize for readability first, then performance.
Generates production-quality SQL with proper joins, window functions, NULL handling, and explanatory comments.
Pro tip: Always start with a CTE that selects your base population, then join everything to it. This makes the query readable and prevents accidental row duplication from bad joins.
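The base-population pattern from this tip can be sketched against a throwaway in-memory SQLite schema (the `customers` and `orders` tables and their values are invented for illustration): the CTE fixes the population first, and the LEFT JOIN keeps customers with no purchases in the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, segment TEXT);
CREATE TABLE orders (customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'smb'), (2, 'smb'), (3, 'enterprise');
INSERT INTO orders VALUES (1, 100.0), (1, 50.0), (3, 900.0);
""")

query = """
WITH base AS (                      -- base population: every customer
    SELECT id, segment FROM customers
)
SELECT b.segment,
       COALESCE(SUM(o.amount), 0) AS revenue   -- NULL-safe for no-order rows
FROM base b
LEFT JOIN orders o ON o.customer_id = b.id     -- LEFT keeps customer 2
GROUP BY b.segment
ORDER BY b.segment;
"""
rows = conn.execute(query).fetchall()
# rows -> [('enterprise', 900.0), ('smb', 150.0)]
```

Because every table joins *to* the base CTE rather than to each other, adding another fact table cannot silently drop customers from the population.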
Pandas Pipeline Optimizer
32/38Here is my current pandas code that works but is slow or messy: [Paste your pandas code] The DataFrame has [number] rows. Optimize this code: (1) identify operations that can be vectorized instead of using apply or loops, (2) replace any iterrows with vectorized alternatives, (3) chain operations using method chaining for readability, (4) use categorical dtypes for string columns with low cardinality, (5) suggest where .query() or .eval() could improve readability, (6) benchmark the original vs optimized version with estimated speedup, and (7) add type hints and docstrings. Explain each optimization and why it is faster. Preserve the exact same output — do not change the logic, only the implementation.
Refactors slow pandas code into optimized, idiomatic pipelines with benchmarked performance improvements.
Pro tip: Replacing a single iterrows loop with a vectorized operation can speed up your code 100-1000x. Always vectorize first, then consider other optimizations.
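The iterrows-versus-vectorized contrast looks like this in a toy example (the `price` and `qty` columns are invented; assumes pandas is available). Both functions produce identical output, but the vectorized version performs one columnar multiply at C speed instead of a Python-level loop:

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0],
                   "qty":   [1, 2, 3]})

def total_iterrows(df):
    """Slow: materializes a Series per row and loops in Python."""
    out = []
    for _, row in df.iterrows():
        out.append(row["price"] * row["qty"])
    return pd.Series(out, index=df.index)

def total_vectorized(df):
    """Fast: one vectorized multiply over whole columns."""
    return df["price"] * df["qty"]
```

On a three-row frame the difference is invisible; on millions of rows the loop version is the bottleneck, which is why the optimization prompt asks for iterrows to be replaced first.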
Data Pipeline Error Handling
33/38I am building a data pipeline in Python that: [describe pipeline — reads from source, transforms, loads to destination]. Write a robust version with proper error handling: (1) add try/except blocks with specific exception types (not bare except), (2) implement retry logic with exponential backoff for API calls and database connections, (3) add data validation checkpoints — row counts, schema checks, value range checks — after each step, (4) implement logging using the logging module with appropriate levels (INFO for progress, WARNING for recoverable issues, ERROR for failures), (5) add a dead-letter queue or error table for rows that fail validation, (6) implement idempotency so the pipeline can be safely rerun without duplicating data, and (7) send an alert (email, Slack, or log) on failure with the error details and the step that failed.
Adds production-grade error handling, validation, retry logic, and alerting to a data pipeline.
Pro tip: Every pipeline fails eventually. The question is whether it fails loudly with a clear error message or silently produces wrong data for weeks before anyone notices.
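The retry-with-exponential-backoff step from the prompt above can be sketched as follows. This is a minimal sketch assuming a hypothetical flaky fetch function; it retries only on `ConnectionError` (specific exceptions, never bare `except`) and logs at the levels the prompt prescribes:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def with_retries(fn, attempts=4, base_delay=0.01):
    """Call fn(), retrying on ConnectionError with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except ConnectionError as exc:      # specific, not bare except
            if attempt == attempts:
                log.error("giving up after %d attempts: %s", attempt, exc)
                raise
            delay = base_delay * 2 ** (attempt - 1)  # 0.01s, 0.02s, 0.04s...
            log.warning("attempt %d failed (%s); retrying in %.2fs",
                        attempt, exc, delay)
            time.sleep(delay)

# Hypothetical flaky source: fails twice, then succeeds.
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return [{"id": 1}, {"id": 2}]

rows = with_retries(flaky_fetch)
# rows -> [{'id': 1}, {'id': 2}] after two retries
```

The same wrapper applies unchanged to database connections; validation checkpoints and idempotent loads layer on top of it.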
Prompts are the starting line. Tutorials are the finish.
A growing library of 300+ hands-on tutorials on ChatGPT, Claude, Midjourney, and 50+ AI tools. New tutorials added every week.
14-day free trial. Cancel anytime.