Policy Viewer
Non-Standard Datasets
Scenario 2
Description
The header is missing in the dataset, but the names of the columns of interest can be deduced from the column values.
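Before committing to a reload, the deduction step can be sketched roughly as below. This is a minimal sketch only; the file contents are a hypothetical stand-in for the real CSV:

```python
import io
import pandas as pd

# Hypothetical stand-in for a headerless CSV file
raw = io.StringIO("January,Brakes,Done,250\nFebruary,Tires,Done,400\n")

# Peek at the first rows without assuming a header row exists
preview = pd.read_csv(raw, header=None, nrows=5)

# With header=None the columns are integer-indexed (0, 1, 2, ...),
# so column 0 holds month names and column 3 holds costs
print(preview)
print(preview.columns.tolist())
```

Reading with `nrows` keeps the inspection cheap on large files; the full reload then happens only once the column roles are clear.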
Golden Example Thought
Upon reviewing the first few rows, it seems that the dataset lacks proper headers, as the first row appears to be part of the data rather than containing meaningful labels. However, based on the column values, I can deduce that the column at index 0 (January) represents the Month, and the column at index 3 (250) contains Cost information. These columns are clearly distinguishable, and no other columns have similar values that could lead to confusion. To prepare the dataframe for analysis, I'll reload the dataset using the 'header=None' option, treating the first row as data instead of headers. Then, I'll rename the columns at indices 0 and 3 to Month and Cost, respectively (assuming a 0-based index). Finally, I'll display the first five rows along with the column names and data types, as reloading the dataset might alter the data types.
Golden Example RTU
Note: It looks like the dataset's missing some headers, so I'll figure out the column names from the values themselves.
Golden Example Code
# Load the csv file without the header
df = pd.read_csv('Truck Repairs for 2023 - Sheet1 - Dr. Victoria Gamble.csv', header=None)

# Rename the columns
df.rename(columns={0: 'Month', 3: 'Cost'}, inplace=True)

# Display the first 5 rows
print(df.head().to_markdown(index=False, numalign="left", stralign="left"))

# Print the column names and their data types
print(df.info())
Scenario 5
Description
The dataset contains metadata or additional data in the first row(s) that are not part of the header, with the actual headers located in a subsequent row.
Golden Example Thought
The first five rows suggest that the dataset includes metadata before the actual header, as row 2 contains meaningful column names like Product and Noon that are relevant to the user request and accurately describe the data below. To address this, I'll reload the dataset using the same parameters as before, adding the 'header' parameter set to 1 (accounting for the 0-based index). I'll then display the first 5 rows along with the columns and their types, as the reload might alter the column types.
Golden Example RTU
Note: It looks like the dataset wasn't loaded with the correct headers due to extra information at the top. I'll reload it with the proper headers and continue with the analysis.
Golden Example Code
# Read the CSV file into a DataFrame with the correct header starting from row 2
df = pd.read_csv("product_sales_-_ps pivot table.csv", encoding='latin-1', header=1)

# Display the first 5 rows
print(df.head().to_markdown(index=False, numalign="left", stralign="left"))

# Print the column names and their data types
print(df.info())
Categorical Sampling
Categorical Sampling
Description
Categorical sampling is necessary when at least one relevant column is of the object data type, is categorical in nature, and is required to answer the query.
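The unification the thought refers to can be sketched as below; the column name and values here are hypothetical:

```python
import pandas as pd

# Hypothetical payment-method column with inconsistently written entries
df = pd.DataFrame({'Payment Method': ['Credit Card', 'credit card ', 'PayPal', 'paypal', 'Cash']})

# Normalizing case and surrounding whitespace collapses identical entries
df['Payment Method'] = df['Payment Method'].str.strip().str.lower()

print(sorted(df['Payment Method'].unique()))
```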
Golden Example Thought
In order to <Query's Requirement(s)>, I'll first inspect the values in the <column name(s)> column. This will help me unify any identical entries that may be represented inconsistently.
Golden Example RTU
I'll start by looking into the payment methods used by your clients.
Golden Example Code
# Get all unique values from `Z`
unique_values = df['Z'].unique()

# Check the number of unique values in `Z`
if len(unique_values) > 50:
    # If there are too many unique values, sample the top 50
    top_occurring_values = df['Z'].value_counts().head(50).index.tolist()
    print(top_occurring_values)
else:
    # Otherwise print all unique values in `Z`
    print(unique_values)
Conversions
2A Conversion
Description
This policy applies when df.info() indicates that the relevant column is of object type, and df.head() reveals non-numeric characters in the first five rows.
Golden Example Thought
To find <Query's Requirement(s)>, I'll use the <column name(s)> columns. From the output of df.info(), it can be seen that the <column name(s)> column(s) is/are of object type. From the output of df.head(), it is clear that <column name(s)> can be cleaned by removing the <non-numeric characters> character(s) and then converting it/them to numeric. Next, I'll <Query's Requirement(s)>.
Golden Example RTU
I'll start by organizing the Total column and proceed with the analysis.
Golden Example Code
# Remove '€' and ',' from the `Total` column
df['Total'] = df['Total'].astype(str).str.replace('€', '', regex=False).str.replace(',', '', regex=False)

# Convert it to numeric
df['Total'] = pd.to_numeric(df['Total'])

# Aggregate on `Cost Code`, sum the `Total` column, and pick the top value by Budget
expensive_cost_code = df.groupby('Cost Code')['Total'].sum().reset_index(name='Budget').sort_values(by='Budget', ascending=False).iloc[0]

# Display the output
print(expensive_cost_code)
3A Conversion
Description
The relevant columns need to be converted to numeric based on df.info(), as they are of object type. However, df.head() does not show any visible non-numeric characters in those columns.
Golden Example Thought
In order to calculate the percentage of GMV across different brands, I'll use the GMV and Type columns. I first need to convert the GMV column to numeric, as the output of df.info() shows it is currently of object type. From the output of df.head(), GMV does not appear to have any non-numeric characters, so I'll sample and print 20 non-numeric values in that column to understand how to clean and convert it to numeric.
Golden Example RTU
I'll start by looking into the GMV column to prepare it for analysis.
Golden Example Code
import numpy as np

# Get all unique non-numeric values from `GMV`
non_numeric_values = df[pd.to_numeric(df['GMV'], errors='coerce').isna()]['GMV'].unique()

if len(non_numeric_values) > 20:
    # Sample 20 of them if there are too many unique values
    print(f"Non Numeric values in GMV {np.random.choice(non_numeric_values, 20, replace=False)}")
else:
    # Otherwise print all unique non-numeric values from `GMV`
    print(f"Non Numeric values in GMV {non_numeric_values}")
3A Conversion following failed 2A
Description
Scenario 2A was required, as per df.info() and df.head(), but the model runs into a ValueError during numerical conversion. The model then needs to do 3A sampling and conversion.
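For context, the failed 2A-style conversion that triggers this fallback might look like the sketch below; the column contents are hypothetical:

```python
import pandas as pd

# Hypothetical column containing entries like '1 141 dollars'
df = pd.DataFrame({'net_worth': ['1 141 dollars', '2 500 dollars']})

# A plain 2A-style conversion raises ValueError on such entries
try:
    df['net_worth'] = pd.to_numeric(df['net_worth'])
except ValueError as err:
    print(f"Conversion failed: {err}")
```

After this failure, the 3A sampling step below is used to see which non-numeric patterns actually occur before choosing a cleaning strategy.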
Golden Example Thought
A ValueError occurred during the conversion of the net_worth column because this column contains non-numeric values, such as '1 141 dollars', which interferes with the numerical conversion. To address this issue, I'll sample 20 non-numeric values to determine a suitable method for cleaning and converting it to numeric type.
Golden Example RTU
It looks like there are non-numeric values in the net_worth and Roth_IRA columns. I'll review their values before moving forward.
Golden Example Code
import numpy as np

# Get all unique non-numeric values from `net_worth` column
non_numeric_net_worth_value = df[pd.to_numeric(df['net_worth'], errors='coerce').isna()]['net_worth'].unique()
if len(non_numeric_net_worth_value) > 20:
    # Sample 20 of them if there are too many unique values
    print(np.random.choice(non_numeric_net_worth_value, 20, replace=False))
else:
    # Otherwise print all unique non-numeric values from `net_worth`
    print(non_numeric_net_worth_value)
Datetime Conversion
Description
This policy applies when a date-like column is relevant to the analysis.
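When the date column may contain malformed entries, one hedged variant of the conversion, sketched with hypothetical data, passes errors='coerce' so unparsable values become NaT and can be counted before the analysis continues:

```python
import pandas as pd

# Hypothetical date column; the last entry is malformed
df = pd.DataFrame({'Date': ['2023-01-15', '2023-02-20', 'not a date']})

# errors='coerce' turns unparsable entries into NaT instead of raising
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

# Count entries that failed to parse so they can be reviewed
print(df['Date'].isna().sum())
```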
Golden Example Thought
In order to display a comparison of the monthly sales of different products, I'll first convert the Date column to a datetime format. Then I'll extract the month from the Date column and create a new column labeled Month. <Remaining analysis>
Golden Example RTU
I'll organize the data then visualize the comparison of the monthly sales of different products. <DEV:NOT-RECOMMENDED-TO-USE-THIS>
Golden Example Code
# Convert `Date` column to datetime
df['Date'] = pd.to_datetime(df['Date'])

# Extract month from `Date` and save it to new column `Month`
df['Month'] = df['Date'].dt.month_name()
ModuleNotFound
ModuleNotFound
Description
A ModuleNotFoundError occurs in the response when the model attempts to use a Python module that is not installed in its environment. NOTE: Here, the golden example code is part of the RTU.
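One way to check availability up front, instead of catching the error after the fact, is importlib.util.find_spec; a minimal sketch (the module name is taken from the example below):

```python
import importlib.util

# find_spec returns None when a top-level module is not installed
if importlib.util.find_spec('xgboost') is None:
    print("xgboost is not available; provide the code for the user to run locally")
else:
    print("xgboost is available")
```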
Golden Example Thought
It appears that I do not have access to the module 'xgboost'. Therefore, I'll provide the code to prepare the data, train a logistic regression model, a Random Forest classifier, and an XGBoost classifier on the training data, and finally evaluate and compare their performance on the test data.
Golden Example RTU
I cannot address this query with the libraries currently available to me. However, you can find the code below and run it in your local environment.
Golden Example Code
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# Separate features and target variable
X = df.drop('Stage', axis=1)
y = df['Stage']

# Identify categorical and numerical columns
categorical_cols = X.select_dtypes(include=['object']).columns
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns

# One-hot encode categorical columns
encoder = OneHotEncoder(drop='first', sparse_output=False)
X_encoded = encoder.fit_transform(X[categorical_cols])
encoded_df = pd.DataFrame(X_encoded, columns=encoder.get_feature_names_out(categorical_cols))

# Standardize numerical columns
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X[numerical_cols])
scaled_df = pd.DataFrame(X_scaled, columns=numerical_cols)

# Combine encoded categorical columns and scaled numerical columns
X_processed = pd.concat([encoded_df, scaled_df], axis=1)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.2, random_state=42)

# Create and train models
models = {
    'Logistic Regression': LogisticRegression(),
    'Random Forest': RandomForestClassifier(),
    'XGBoost': XGBClassifier(),
}

results = {}
for name, model in models.items():
    # Adjust labels for XGBoost
    if name == 'XGBoost':
        # Encode labels to start from 0 as expected by XGBoost
        encoder = LabelEncoder()
        y_train_encoded = encoder.fit_transform(y_train)
        y_test_encoded = encoder.transform(y_test)

        # Fit the model using encoded labels
        model.fit(X_train, y_train_encoded)
        y_pred = model.predict(X_test)
        # Decode predictions to original labels for accuracy calculation
        y_pred_original = encoder.inverse_transform(y_pred)
        accuracy = accuracy_score(y_test, y_pred_original)
    else:
        # Fit the model using original labels
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)

    results[name] = accuracy
    print(f'{name} Accuracy: {accuracy:.4f}')

# Identify the best performing model
best_model = max(results, key=lambda name: results[name])
print(f'Best Performing Model: {best_model}')
Tie-breaking issue
Does not select multiple equally good responses when it should
Description
[Query Example] Which pokemon has the highest HP? [Thought] Note: We don't have an exact policy for this yet. Essentially, the thought must find the maximum value, then filter the dataset to every row holding that maximum value, so that no tied row is missed.
Golden Example Thought
To identify the pokemon(s) with the highest HP, I'll analyze the pokemon_hp column. First, I'll find the maximum HP in this dataset, then I'll filter for all pokemon that share that maximum HP.
Golden Example RTU
I'll look into the data to find the pokemon(s) with the highest HP.
Golden Example Code
# Find the maximum `pokemon_hp`
max_hp = df['pokemon_hp'].max()

# Filter the data to every row with the maximum `pokemon_hp`
filtered_df = df[df['pokemon_hp'] == max_hp]

# Print the pokemon(s) with the maximum `pokemon_hp`
print(filtered_df.to_markdown(index=False, numalign="left", stralign="left"))
Visualization
6L
Description
Sorting by List on User Request Error
User requests that the plot be sorted through an ordered list of levels. Query Example: "Make a bar plot of total sales across US, Canada, and Mexico. Sort in the following order: Mexico, Canada, US."
Golden Example Thought
As requested, I'll create a bar chart to visualize the median calories burned for Yoga, HIIT, Cardio, and Strength, with the bars arranged in that order. Visualize the data:
- Create a list `order` with elements 'Yoga', 'HIIT', 'Cardio', and 'Strength'.
- Create a bar chart using the `Workout_Type` column on the x-axis and the `Calories_Burned` column on the y-axis.
- Set x-axis title as 'Workout Type' and y-axis title as 'Median Calories Burned'.
- Use the `order` list to control the order of the bars.
- Add tooltips for columns `Workout_Type` and `Calories_Burned`.
- Set the title of the plot to 'Median Calories Burned by Workout Type'.
- Make the plot interactive enabling zooming and panning.
- Save the plot as a JSON file.
Golden Example RTU
I'll create a bar chart to visualize the median calories burned by workout type.
Golden Example Code
import altair as alt

# Create a list `order` with elements 'Yoga', 'HIIT', 'Cardio', and 'Strength'.
order = ['Yoga', 'HIIT', 'Cardio', 'Strength']
# Create a bar chart using the `Workout_Type` column on the x-axis and the `Calories_Burned` column on the y-axis.
chart = alt.Chart(median_calories_burned).mark_bar().encode(
    # Set x-axis title as 'Workout Type' and use the `order` list to control the order of the bars.
    x=alt.X('Workout_Type', sort=order, title='Workout Type'),
    # Set y-axis title as 'Median Calories Burned'.
    y=alt.Y('Calories_Burned', title='Median Calories Burned'),
    # Add tooltips for columns `Workout_Type` and `Calories_Burned`.
    tooltip=['Workout_Type', 'Calories_Burned']
).properties(
    # Set the title of the plot to 'Median Calories Burned by Workout Type'.
    title='Median Calories Burned by Workout Type'
).interactive()  # Make the plot interactive enabling zooming and panning.
# Save the chart to a JSON file
chart.save('median_calories_burned_by_workout_type_bar_chart.json')
23L
Description
Non-tooltip Charts
The model should always add the relevant variables in the tooltip.
Golden Example Thought
As requested, I'll now create a pie chart using the aggregated data to visualize the distribution of marital status. Visualize the data:
- Create a pie chart using the `marital.status` and `count` columns.
- Use `count` as the theta and `marital.status` for the color.
- Set the title of the plot to 'Distribution of Marital Status'.
- Order the pie slices by `percentage` in descending order.
- Add `percentage` labels to the pie chart.
- Add tooltips for the `marital.status` and `count` columns and `percentage` with the format '.1%'.
- Save the plot to a JSON file.
Golden Example RTU
Now, I'll create a pie chart to show the distribution of marital status.
Golden Example Code
import altair as alt

# Create the base chart
base = alt.Chart(df_plot).encode(theta=alt.Theta("count", stack=True))
# Specify the outer radius of the arcs and encode color based on the `marital.status` column
pie = base.mark_arc(outerRadius=120).encode(
    color=alt.Color("marital.status"),
    # Order the arcs by `percentage` in descending order
    order=alt.Order("percentage", sort="descending"),
    tooltip=["marital.status", "count", alt.Tooltip("percentage", format=".1%")]
)
# Add text to show the percentage on the chart
text = base.mark_text(radius=140).encode(
    text=alt.Text("percentage", format=".1%"),
    order=alt.Order("percentage", sort="descending"),
    color=alt.value("black")  # Set the color of the labels to black
)
# Combine the pie chart and text
chart = pie + text
# Add title to the chart
chart = chart.properties(title='Distribution of Marital Status')
# Save the chart in a JSON file
chart.save('marital_status_pie_chart.json')
24L
Description
Non-interactive Charts
The model should always make the chart interactive using the .interactive() method to allow for zooming and panning.
Golden Example Thought
As requested, I'll create a stacked bar chart using the Company, TypeName, and Number of Laptops columns to visualize the total number of laptops sold by each company, grouped by laptop type. Visualize the data:
- Create a stacked bar chart using the columns `Company`, `TypeName`, and `Number of Laptops`.
- Map the `Company` column to the x-axis and add title 'Company'.
- Map the `Number of Laptops` column to the y-axis and add title 'Number of Laptops'.
- Map the `TypeName` column to color.
- Add tooltips for columns `Company`, `TypeName`, and `Number of Laptops`.
- Add plot title as 'Total Laptops Sold by Company and Type'.
- Make the plot interactive enabling zooming and panning.
- Save the chart as a JSON file.
Golden Example RTU
Now, I'll create a stacked bar chart to visualize the number of laptops sold by company and type.
Golden Example Code
import altair as alt

# Create a stacked bar chart
chart = alt.Chart(df_plot).mark_bar().encode(
    # Set the x-axis to display the companies
    x=alt.X('Company', title='Company'),
    # Set the y-axis to display the number of laptops sold
    y=alt.Y('Number of Laptops', title='Number of Laptops'),
    # Use different colors for each laptop type
    color='TypeName',
    # Add tooltips to show details when hovering over each bar
    tooltip=['Company', 'TypeName', 'Number of Laptops']
).properties(
    # Set the chart title
    title='Total Laptops Sold by Company and Type'
).interactive()  # Make the chart interactive for zooming and panning
# Save the chart as a JSON file
chart.save('total_laptops_sold_by_company_and_type_stacked_bar_chart.json')