Policy Viewer

Non-Standard Datasets

Scenario 2

Description

The header is missing in the dataset, but the names of the columns of interest can be deduced from the column values.

Golden Example Thought

Upon reviewing the first few rows, it seems that the dataset lacks proper headers, as the first row appears to be part of the data rather than containing meaningful labels. However, based on the column values, I can deduce that the column at index 0 (January) represents the Month, and the column at index 3 (250) contains Cost information. These columns are clearly distinguishable, and no other columns have similar values that could lead to confusion. To prepare the dataframe for analysis, I'll reload the dataset using the 'header=None' option, treating the first row as data instead of headers. Then, I'll rename the columns at indices 0 and 3 to Month and Cost, respectively (assuming a 0-based index). Finally, I'll display the first five rows along with the column names and data types, as reloading the dataset might alter the data types.

Golden Example RTU

Note: It looks like the dataset's missing some headers, so I'll figure out the column names from the values themselves.

Golden Example Code

# Load the csv file without the header
df = pd.read_csv('Truck Repairs for 2023 - Sheet1 - Dr. Victoria Gamble.csv', header=None)

# Rename the columns
df.rename(columns={0: 'Month', 3: 'Cost'}, inplace=True)

# Display the first 5 rows
print(df.head().to_markdown(index=False, numalign="left", stralign="left"))

# Print the column names and their data types
df.info()

Scenario 5

Description

The dataset contains metadata or additional data in the first row(s) that are not part of the header, with the actual headers located in a subsequent row.

Golden Example Thought

The first five rows suggest that the dataset includes metadata before the actual header, as row 2 contains meaningful column names like Product and Noon that are relevant to the user request and accurately describe the data below. To address this, I'll reload the dataset using the same parameters as before, adding the 'header' parameter set to 1 (accounting for the 0-based index). I'll then display the first 5 rows along with the columns and their types, as the reload might alter the column types.

Golden Example RTU

Note: It looks like the dataset wasn't loaded with the correct headers due to extra information at the top. I'll reload it with the proper headers and continue with the analysis.

Golden Example Code

# Read the CSV file into a DataFrame with the correct header starting from row 2
df = pd.read_csv("product_sales_-_ps pivot table.csv", encoding='latin-1', header=1)

# Display the first 5 rows
print(df.head().to_markdown(index=False, numalign="left", stralign="left"))

# Print the column names and their data types
df.info()

Categorical Sampling

Categorical Sampling

Description

Categorical sampling is necessary when at least one relevant column is of the object data type, is categorical in nature, and is required to answer the query.

Golden Example Thought

In order to <Query's Requirement(s)>, I'll first inspect the values in the <column name(s)> column. This will help me unify any identical entries that may be represented inconsistently.

Golden Example RTU

I'll start by looking into the payment methods used by your clients.

Golden Example Code

# Get all unique values from `Z`
unique_values = df['Z'].unique()

# Check the number of unique values in `Z`
if len(unique_values) > 50:
    # If there are too many unique values, sample the top 50
    top_occurring_values = df['Z'].value_counts().head(50).index.tolist()
    print(top_occurring_values)
else:
    # Otherwise print all unique values in `Z`
    print(unique_values)
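Once the unique values are printed, the natural follow-up is the unification step the thought mentions: merging entries that are represented inconsistently. A minimal sketch of that step, using hypothetical payment-method values (the dataframe contents and canonical labels are illustrative, not part of the policy):

```python
import pandas as pd

# Hypothetical payment-method values with inconsistent casing and whitespace
df = pd.DataFrame({'Z': ['Credit Card', 'credit card ', 'PayPal', 'paypal', 'Cash']})

# Normalize case and surrounding whitespace, then map to canonical labels
df['Z'] = df['Z'].str.strip().str.lower().map({
    'credit card': 'Credit Card',
    'paypal': 'PayPal',
    'cash': 'Cash',
})

print(df['Z'].unique())
```

The mapping is built from whatever the sampling step actually revealed; nothing should be mapped that was not observed in the printed values.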

Conversions

2A Conversion

Description

This policy applies when df.info() indicates that the relevant column is of object type, and df.head() reveals non-numeric characters in the first five rows.

Golden Example Thought

To find <Query's Requirement(s)>, I'll use the <column name(s)> column(s). From the output of df.info(), it can be seen that the <column name(s)> column(s) is/are of object type. From the output of df.head(), it is clear that <column name(s)> can be cleaned by removing the <non-numeric character(s)>, and then converting it/them to numeric. Next, I'll <Query's Requirement(s)>.

Golden Example RTU

I'll start by organizing the Total column and proceed with the analysis.

Golden Example Code

# Remove '€' and ',' from the `Total` column
df['Total'] = df['Total'].astype(str).str.replace('€', '', regex=False).str.replace(',', '', regex=False)

# Convert it to numeric
df['Total'] = pd.to_numeric(df['Total'])

# Aggregate on `Cost Code`, sum the `Total` column, and pick the top value by Budget
expensive_cost_code = df.groupby('Cost Code')['Total'].sum().reset_index(name='Budget').sort_values(by='Budget', ascending=False).iloc[0]

# Display the output
print(expensive_cost_code)

3A Conversion

Description

The relevant columns need to be converted to numeric based on df.info(), as they are of object type. However, df.head() does not show any visible non-numeric characters in those columns.

Golden Example Thought

In order to calculate the percentage of GMV across different brands, I'll use the GMV and Type columns. I first need to convert the GMV column to numeric, as the output of df.info() shows it is currently of object type. From the output of df.head(), GMV does not appear to have any non-numeric characters, so I'll sample and print 20 non-numeric values in that column to understand how to clean and convert it to numeric.

Golden Example RTU

I'll start by looking into the GMV column to prepare it for analysis.

Golden Example Code

import numpy as np

# Get all unique non-numeric values from `GMV`
non_numeric_values = df[pd.to_numeric(df['GMV'], errors='coerce').isna()]['GMV'].unique()

if len(non_numeric_values) > 20:
    # Sample 20 of them if there are too many unique values
    print(f"Non Numeric values in GMV {np.random.choice(non_numeric_values, 20, replace=False)}")
else:
    # Otherwise print all unique non-numeric values from `GMV`
    print(f"Non Numeric values in GMV {non_numeric_values}")

3A Conversion following failed 2A

Description

Scenario 2A was required, as per df.info() and df.head(), but the model runs into a ValueError during numerical conversion. The model then needs to do 3A sampling and conversion.

Golden Example Thought

A ValueError occurred during the conversion of the net_worth column because this column contains non-numeric values, such as '1 141 dollars', which interferes with the numerical conversion. To address this issue, I'll sample 20 non-numeric values to determine a suitable method for cleaning and converting it to numeric type.

Golden Example RTU

It looks like there are non-numeric values in the net_worth and Roth_IRA columns. I'll review their values before moving forward.

Golden Example Code

import numpy as np

# Get all unique non-numeric values from `net_worth` column
non_numeric_net_worth_value = df[pd.to_numeric(df['net_worth'], errors='coerce').isna()]['net_worth'].unique()

if len(non_numeric_net_worth_value) > 20:
    # Sample 20 of them if there are too many unique values
    print(np.random.choice(non_numeric_net_worth_value, 20, replace=False))
else:
    # Otherwise print all unique non-numeric values from `net_worth`
    print(non_numeric_net_worth_value)
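After inspecting the sampled values, the cleaning step can target exactly the patterns that were found. A minimal sketch, assuming the sample revealed values like '1 141 dollars' from the thought above, where the separator spaces and the ' dollars' suffix are the only issues (the dataframe contents are illustrative):

```python
import pandas as pd

# Hypothetical `net_worth` values matching the sampled pattern
df = pd.DataFrame({'net_worth': ['1 141 dollars', '250 dollars', '12 500 dollars']})

# Strip the ' dollars' suffix and the thousands-separator spaces, then convert
df['net_worth'] = (
    df['net_worth']
    .str.replace(' dollars', '', regex=False)
    .str.replace(' ', '', regex=False)
)
df['net_worth'] = pd.to_numeric(df['net_worth'])

print(df['net_worth'].tolist())
```

The replacements should mirror only the patterns observed in the sample; any pattern not seen there would first need its own sampling pass.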

Datetime Conversion

Description

This policy applies when a date-like column is relevant to the analysis.

Golden Example Thought

In order to display a comparison of the monthly sales of different products, I'll first convert the Date column to a datetime format. Then I'll extract the month from the Date column and create a new column labeled Month. < Remaining-analysis>

Golden Example RTU

I'll organize the data then visualize the comparison of the monthly sales of different products. <DEV:NOT-RECOMMENDED-TO-USE-THIS>

Golden Example Code

# Convert `Date` column to datetime
df['Date'] = pd.to_datetime(df['Date'])

# Extract month from `Date` and save it to new column `Month`
df['Month'] = df['Date'].dt.month_name()
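The <Remaining-analysis> placeholder above would typically aggregate on the newly derived Month column. A minimal sketch of such a follow-up for the monthly-sales comparison the thought describes, using hypothetical Product and Sales columns (the data and column names are illustrative):

```python
import pandas as pd

# Hypothetical sales data with a date-like column
df = pd.DataFrame({
    'Date': ['2023-01-15', '2023-01-20', '2023-02-10'],
    'Product': ['A', 'B', 'A'],
    'Sales': [100, 150, 200],
})

# Convert `Date` to datetime and derive `Month`
df['Date'] = pd.to_datetime(df['Date'])
df['Month'] = df['Date'].dt.month_name()

# Compare monthly sales per product
monthly_sales = df.groupby(['Month', 'Product'])['Sales'].sum().reset_index()
print(monthly_sales.to_string(index=False))
```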

ModuleNotFound

ModuleNotFound

Description

A ModuleNotFoundError occurs in the response when the model attempts to use a Python module that is not installed in its environment. NOTE: Here, golden example code is part of RTU.

Golden Example Thought

It appears that I do not have access to the 'xgboost' module. Therefore, I'll provide the code to prepare the data, train logistic regression, Random Forest, and XGBoost classifiers on the training data, and finally evaluate and compare their performance on the test data.

Golden Example RTU

I cannot address this query with the libraries currently available to me. However, you can find the code below and run it in your local environment.

Golden Example Code

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# Separate features and target variable
X = df.drop('Stage', axis=1)
y = df['Stage']

# Identify categorical and numerical columns
categorical_cols = X.select_dtypes(include=['object']).columns
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns

# One-hot encode categorical columns
encoder = OneHotEncoder(drop='first', sparse_output=False)
X_encoded = encoder.fit_transform(X[categorical_cols])
encoded_df = pd.DataFrame(X_encoded, columns=encoder.get_feature_names_out(categorical_cols))

# Standardize numerical columns
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X[numerical_cols])
scaled_df = pd.DataFrame(X_scaled, columns=numerical_cols)

# Combine encoded categorical columns and scaled numerical columns
X_processed = pd.concat([encoded_df, scaled_df], axis=1)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.2, random_state=42)

# Create and train models
models = {
    'Logistic Regression': LogisticRegression(),
    'Random Forest': RandomForestClassifier(),
    'XGBoost': XGBClassifier(),
}

results = {}
for name, model in models.items():
    # Adjust labels for XGBoost
    if name == 'XGBoost':
        # Encode labels to start from 0 as expected by XGBoost
        label_encoder = LabelEncoder()
        y_train_encoded = label_encoder.fit_transform(y_train)
        y_test_encoded = label_encoder.transform(y_test)

        # Fit the model using encoded labels
        model.fit(X_train, y_train_encoded)
        y_pred = model.predict(X_test)
        # Decode predictions to original labels for accuracy calculation
        y_pred_original = label_encoder.inverse_transform(y_pred)
        accuracy = accuracy_score(y_test, y_pred_original)
    else:
        # Fit the model using original labels
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)

    results[name] = accuracy
    print(f'{name} Accuracy: {accuracy:.4f}')

# Identify the best performing model
best_model = max(results, key=lambda name: results[name])
print(f'Best Performing Model: {best_model}')
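One way to detect this situation up front, rather than after a ModuleNotFoundError, is to probe the environment for the module before deciding whether to execute code or emit it as part of the RTU. A minimal sketch using only the standard library (this check is an assumption, not part of the policy above):

```python
import importlib.util

def module_available(name: str) -> bool:
    """Return True if `name` can be imported in the current environment."""
    return importlib.util.find_spec(name) is not None

# `sys` ships with Python, so it should always be found;
# a made-up module name should not be
print(module_available('sys'))
print(module_available('nonexistent_module_xyz'))
```

If the probe fails for a required module, the model would follow the policy above: state the limitation in the RTU and include the full code for the user to run locally.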

Tie-breaking issue

Does not select multiple equally good responses when it should

Description

Query example: "Which pokemon has the highest HP?" Note: we don't have an exact policy for this yet. The thought must find the maximum value, then filter the dataset to keep every row holding that maximum value, so that no tied row is missed.

Golden Example Thought

To identify the pokemon(s) with the highest HP, I'll analyze the pokemon_hp column. First, I'll find the maximum HP in this dataset, then I'll filter all pokemons with that maximum HP.

Golden Example RTU

I'll look into the pokemons to find the one(s) with the highest HP.

Golden Example Code

# Find the maximum `Quantity_Sold`
maximum_quantity_sold_liters = avg_quantity_sorted['Quantity_Sold'].max()

# Filter the data with the maximum `Quantity_Sold`
filtered_df = avg_quantity_sorted[avg_quantity_sorted['Quantity_Sold'] == maximum_quantity_sold_liters]

# Print the data with the maximum `Quantity_Sold`
print(filtered_df[['Fuel_Type', 'Quantity_Sold']].to_markdown(index=False, numalign="left", stralign="left"))
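The golden code above uses a fuel-sales dataframe, while the thought refers to the Pokémon query. The same max-then-filter pattern applied to that query would look like this (the dataframe and its values are hypothetical, constructed with a tie at the maximum to show why filtering matters):

```python
import pandas as pd

# Hypothetical Pokémon data with a tie at the maximum HP
df = pd.DataFrame({
    'name': ['Blissey', 'Chansey', 'Snorlax', 'Wobbuffet'],
    'pokemon_hp': [255, 250, 160, 255],
})

# Find the maximum HP
max_hp = df['pokemon_hp'].max()

# Filter every row with the maximum HP so no tied pokemon is missed
top_pokemon = df[df['pokemon_hp'] == max_hp]
print(top_pokemon.to_string(index=False))
```

Using idxmax() or head(1) instead would silently drop tied rows, which is exactly the failure this policy guards against.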

Visualization

6L

Description

Sorting by List on User Request Error. The user requests that the plot be sorted through an ordered list of levels. Query example: "Make a bar plot of total sales across US, Canada, and Mexico. Sort in the following order: Mexico, Canada, US."

Golden Example Thought

As requested, I'll create a bar chart to visualize the median calories burned for Yoga, HIIT, Cardio, and Strength, with the bars arranged in that order. Visualize the data:

  1. Create a list order with elements 'Yoga', 'HIIT', 'Cardio', and 'Strength'.
  2. Create a bar chart using the Workout_Type column on the x-axis and the Calories_Burned column on the y-axis.
  3. Set x-axis title as 'Workout Type' and y-axis title as 'Median Calories Burned'.
  4. Use the order list to control the order of the bars.
  5. Add tooltips for columns Workout_Type and Calories_Burned.
  6. Set the title of the plot to 'Median Calories Burned by Workout Type'.
  7. Make the plot interactive enabling zooming and panning.
  8. Save the plot as a JSON file.

Golden Example RTU

I'll create a bar chart to visualize the median calories burned by workout type.

Golden Example Code

import altair as alt

# Create a list `order` with elements 'Yoga', 'HIIT', 'Cardio', and 'Strength'
order = ['Yoga', 'HIIT', 'Cardio', 'Strength']

# Create a bar chart using the `Workout_Type` column on the x-axis and the `Calories_Burned` column on the y-axis
chart = alt.Chart(median_calories_burned).mark_bar().encode(
    # Set x-axis title as 'Workout Type' and use the `order` list to control the order of the bars
    x=alt.X('Workout_Type', sort=order, title='Workout Type'),
    # Set y-axis title as 'Median Calories Burned'
    y=alt.Y('Calories_Burned', title='Median Calories Burned'),
    # Add tooltips for columns `Workout_Type` and `Calories_Burned`
    tooltip=['Workout_Type', 'Calories_Burned']
).properties(
    # Set the title of the plot to 'Median Calories Burned by Workout Type'
    title='Median Calories Burned by Workout Type'
).interactive() # Make the plot interactive enabling zooming and panning

# Save the chart to a JSON file
chart.save('median_calories_burned_by_workout_type_bar_chart.json')

23L

Description

Non-tooltip Charts. The model should always add the relevant variables in the tooltip.

Golden Example Thought

As requested, I'll now create a pie chart using the aggregated data to visualize the distribution of marital status. Visualize the data:

  1. Create a pie chart using the marital.status and count columns.
  2. Use count as the theta and marital.status for the color.
  3. Set the title of the plot to 'Distribution of Marital Status'.
  4. Order the pie slices by percentage in descending order.
  5. Add percentage labels to the pie chart.
  6. Add tooltips for the marital.status and count columns and percentage with the format '.1%'.
  7. Save the plot to a JSON file.

Golden Example RTU

Now, I'll create a pie chart to show the distribution of marital status.

Golden Example Code

import altair as alt

# Create the base chart
base = alt.Chart(df_plot).encode(theta=alt.Theta("count", stack=True))

# Specify the outer radius of the arcs and encode color based on the `marital.status` column
pie = base.mark_arc(outerRadius=120).encode(
  color=alt.Color("marital.status"),
  # Order the arcs by `percentage` in descending order
  order=alt.Order("percentage", sort="descending"),
  tooltip=["marital.status", "count", alt.Tooltip("percentage", format=".1%")]
)

# Add text to show the percentage on the chart
text = base.mark_text(radius=140).encode(
  text=alt.Text("percentage", format=".1%"),
  order=alt.Order("percentage", sort="descending"),
  color=alt.value("black") # Set the color of the labels to black
)

# Combine the pie chart and text
chart = pie + text

# Add title to the chart
chart = chart.properties(title='Distribution of Marital Status')

# Save the chart in a JSON file
chart.save('marital_status_pie_chart.json')

24L

Description

Non-interactive Charts. The model should always make the chart interactive using the .interactive() method to allow for zooming and panning.

Golden Example Thought

As requested, I'll create a stacked bar chart using the Company, TypeName, and Number of Laptops columns to visualize the total number of laptops sold by each company, grouped by laptop type. Visualize the data:

  1. Create a stacked bar chart using the columns Company, TypeName, and Number of Laptops.
  2. Map the Company column to the x-axis and add title 'Company'.
  3. Map the Number of Laptops column to the y-axis and add title 'Number of Laptops'.
  4. Map the TypeName column to color.
  5. Add tooltips for columns Company, TypeName, and Number of Laptops.
  6. Add plot title as 'Total Laptops Sold by Company and Type'.
  7. Make the plot interactive enabling zooming and panning.
  8. Save the chart as a JSON file.

Golden Example RTU

Now, I'll create a stacked bar chart to visualize the number of laptops sold by company and type.

Golden Example Code

import altair as alt

# Create a stacked bar chart
chart = alt.Chart(df_plot).mark_bar().encode(
  # Set the x-axis to display the companies
  x=alt.X('Company', title='Company'),
  # Set the y-axis to display the number of laptops sold
  y=alt.Y('Number of Laptops', title='Number of Laptops'),
  # Use different colors for each laptop type
  color='TypeName',
  # Add tooltips to show details when hovering over each bar
  tooltip=['Company', 'TypeName', 'Number of Laptops']
).properties(
  # Set the chart title
  title='Total Laptops Sold by Company and Type'
).interactive() # Make the chart interactive for zooming and panning

# Save the chart as a JSON file
chart.save('total_laptops_sold_by_company_and_type_stacked_bar_chart.json')