Building a Personalized AI Trading Consultant with GPT-4 (2024)

Introduction

In recent years, integrating artificial intelligence (AI) in stock trading has revolutionized investors’ decisions. With the emergence of Large Language Models (LLMs) such as GPT-3 and GPT-4, a paradigm shift has occurred, making complex market analyses and insights more accessible to individual investors and traders. This transformative technology leverages vast amounts of data and sophisticated algorithms to provide a depth of market understanding that was once the exclusive domain of institutional investors. This article focuses on developing a personalized AI trading consultant using LLMs, designed to match individual investment profiles based on risk appetite, investment timeframe, budget, and desired returns, empowering retail investors with personalized, strategic investment advice.

Building a Personalized AI Trading Consultant with GPT-4 (1)

Stock trading consultants powered by Large Language Models (LLMs) like GPT-3 and GPT-4 have revolutionized financial advisory services. They can leverage AI to analyze historical stock data and current financial news, providing personalized investment advice that aligns with an investor’s unique portfolio and financial objectives. We will attempt to build a consultant to predict market behavior and trends, offering tailored recommendations based on individual risk tolerance, investment duration, available capital, and desired returns.

Learning Objectives

By the end of this article, readers will be able to:

Gain insight into how AI and LLMs like GPT-3 transform stock market analysis and trading.
Recognize the ability of AI-driven tools to provide personalized investment advice based on individual risk profiles and investment goals.
Learn how AI utilizes historical and real-time data to formulate investment strategies and predictions.
Understand how AI in stock trading makes sophisticated investment strategies accessible to a broader audience, including retail investors.
Discover how to leverage AI-driven tools for informed decision-making in personal investment and stock trading.
Concept of a Stock Trading Consultant Using LLMs

This article was published as a part of the Data Science Blogathon.

About the Dataset
Data Preparation
Exploratory Data Analysis (EDA)
Feature Engineering
Integration with GPT-4 API
Challenges
Frequently Asked Questions

About the Dataset

The dataset for this project, sourced from the New York Stock Exchange and available on Kaggle, comprises four CSV files spanning seven years. It includes ‘fundamentals.csv’ with essential financial metrics, ‘prices.csv’ and ‘prices-split-adjusted.csv’ providing historical stock prices and adjustments for stock splits, and ‘securities.csv’ offering additional company information like sector classification and headquarters. Collectively, these files provide a comprehensive view of company performance and stock market dynamics.

Data Preparation

Implementing the stock trading consultant with Large Language Models (LLMs) like GPT-4 starts with crucial data preparation. This process includes critical tasks: data cleaning, normalization, and categorization, using the provided datasets: fundamentals.csv, prices.csv, prices-split-adjusted.csv, and securities.csv.

Step 1: Data Cleaning

In the ‘Fundamental Dataset,’ we address missing values in ‘For Year,’ ‘Earnings Per Share,’ and ‘Estimated Shares Outstanding’ (173, 219, and 219 missing values) using median imputation.
We convert the ‘Period Ending’ column to datetime format, making numerical fields analysis-ready.

import pandas as pd# Loading the datasetsfundamentals = pd.read_csv('/content/fundamentals.csv')prices = pd.read_csv('/content/prices.csv')prices_split_adjusted = pd.read_csv('/content/prices-split-adjusted.csv')securities = pd.read_csv('/content/securities.csv')# Handling missing values and data type conversions in the Fundamentals datasetfundamentals_info = fundamentals.info()fundamentals_missing_values = fundamentals.isnull().sum()# Formatting date columns in all datasetsfundamentals['Period Ending'] = pd.to_datetime(fundamentals['Period Ending'])prices['date'] = pd.to_datetime(prices['date'])prices_split_adjusted['date'] = pd.to_datetime(prices_split_adjusted['date'])# Displaying information about missing values in the Fundamentals datasetfundamentals_missing_values

Building a Personalized AI Trading Consultant with GPT-4 (2)

# Dropping the unnecessary 'Unnamed: 0' columnfundamentals.drop(columns=['Unnamed: 0'], inplace=True)# Imputing missing values in 'Earnings Per Share' and 'Estimated Shares Outstanding' with the medianfor column in ['Earnings Per Share', 'Estimated Shares Outstanding']: median_value = fundamentals[column].median() fundamentals[column].fillna(median_value, inplace=True)# Re-checking for missing values after imputationfundamentals_missing_values_post_imputation = fundamentals.isnull().sum()fundamentals_missing_values_post_imputation

Building a Personalized AI Trading Consultant with GPT-4 (3)

The ‘date’ columns are already consistent in datetime format for the prices and prices-split-adjusted datasets. We verify data consistency, especially regarding stock splits.

# Checking for consistency between the # Prices and Prices Split Adjusted datasets# We will compare a sample of data for the # same ticker symbols and dates in both datasets# Selecting a sample of ticker symbolssample_tickers = prices['symbol'].unique()[:5]# Creating a comparison DataFrame for each ticker symbol in the samplecomparison_data = {}for ticker in sample_tickers: prices_data = prices[prices['symbol'] == ticker] prices_split_data = prices_split_adjusted [prices_split_adjusted['symbol'] == ticker] merged_data = pd.merge(prices_data, prices_split_data, on='date', how='inner', suffixes=('_raw', '_split')) comparison_data[ticker] = merged_data# Displaying the comparison for the first ticker symbol as an examplecomparison_data[sample_tickers[0]].head()

Building a Personalized AI Trading Consultant with GPT-4 (4)

A comparison of prices.csv and prices-split-adjusted.csv for a sample ticker symbol (WLTW) reveals differences in open, close, low, and high stock prices due to stock split adjustments. Volume columns are consistent, indicating accurate trading volume data.

Step 2: Normalization of Prices

We use the prices-split-adjusted.csv dataset for the stock trading consultant as it offers a consistent view of stock prices over time, accounting for stock splits.

Step 3: Data Integration

The final data preparation step involves integrating these datasets. We merge fundamentals.csv, prices-split-adjusted.csv, and securities.csv, creating a comprehensive data frame for analysis. Given their large size, we integrate the most relevant columns based on the ticker symbol and date fields to match financials with stock prices and company information.

# Selecting relevant columns from each dataset for integrationfundamentals_columns = ['Ticker Symbol', 'Period Ending', 'Earnings Per Share', 'Total Revenue']prices_columns = ['symbol', 'date', 'open', 'close', 'low', 'high', 'volume']securities_columns = ['Ticker symbol', 'GICS Sector', 'GICS Sub Industry']# Renaming columns for consistencyfundamentals_renamed = fundamentals[fundamentals_columns].rename(columns={'Ticker Symbol': 'symbol', 'Period Ending': 'date'})prices_split_adjusted_renamed = prices_split_adjusted[prices_columns].rename(columns={'open': 'open_price', 'close': 'close_price', 'low': 'low_price', 'high': 'high_price', 'volume': 'trade_volume'})securities_renamed = securities[securities_columns].rename(columns={'Ticker symbol': 'symbol'})# Merging datasetsmerged_data = pd.merge(pd.merge(fundamentals_renamed, prices_split_adjusted_renamed, on=['symbol', 'date']), securities_renamed, on='symbol')# Displaying the first few rows of the integrated datasetmerged_data.head()

Building a Personalized AI Trading Consultant with GPT-4 (5)

The resultant dataset includes key metrics: earnings per share, total revenue, open/close/low/high stock prices, trading volume, and sector information for each ticker symbol.

Exploratory Data Analysis (EDA)

Next, we will conduct EDA to understand the distribution and relationships in the dataset, which is crucial for feature selection and model training.

import matplotlib.pyplot as pltimport seaborn as sns# Exploratory Data Analysis (EDA)# Summary statistics for numerical columnsnumerical_summary = merged_data.describe()# Correlation matrix to understand relationships between # different numerical featurescorrelation_matrix = merged_data.corr()# Visualizing the correlation matrix using a heatmapplt.figure(figsize=(12, 8))sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')plt.title('Correlation Matrix of Numerical Features')plt.show()correlation_matrix

Building a Personalized AI Trading Consultant with GPT-4 (6)

Building a Personalized AI Trading Consultant with GPT-4 (7)

The EDA provides valuable insights into our integrated dataset:

Feature Engineering

With these analytical insights, we move forward to enhance our dataset through feature engineering:

We’re introducing predictive financial ratios:
- PE_Ratio: This ratio, representing Price to Earnings, is derived by dividing the closing stock price by the Earnings Per Share.
- Price_Change: This reflects the variance in stock price, calculated by subtracting the opening price from the closing price.
- Average_Price: This metric averages the day’s opening, closing, low, and high stock prices.
To address anomalies in the data, the Interquartile Range (IQR) method will identify and mitigate outliers within our numerical fields.
Normalization of pivotal numerical features, including Earnings Per Share and Total Revenue, will be executed using the MinMaxScaler, ensuring a standardized scale for model input.
The ‘GICS Sector’ category will undergo one-hot encoding to convert sector classifications into a binary format compatible with algorithmic learning processes.
The culmination of this process yields a dataset enriched with 103 columns, amalgamating the original data, the newly engineered features, and the one-hot encoded sectors.

from sklearn.preprocessing import MinMaxScaler# Renaming columns for consistencyfundamentals_renamed = fundamentals.rename(columns={'Ticker Symbol': 'symbol', 'Period Ending': 'date'})prices_split_adjusted_renamed = prices_split_adjusted.rename(columns={'symbol': 'symbol', 'date': 'date', 'open': 'open_price', 'close': 'close_price', 'low': 'low_price', 'high': 'high_price', 'volume': 'trade_volume'})securities_renamed = securities.rename(columns={'Ticker symbol': 'symbol'})# Merging datasetsmerged_data = pd.merge(pd.merge(fundamentals_renamed, prices_split_adjusted_renamed, on=['symbol', 'date']), securities_renamed, on='symbol')# Creating new featuresmerged_data['PE_Ratio'] = merged_data['close_price'] / merged_data['Earnings Per Share']merged_data['Price_Change'] = merged_data['close_price'] - merged_data['open_price']merged_data['Average_Price'] = (merged_data['open_price'] + merged_data['close_price'] + merged_data['low_price'] + merged_data['high_price']) / 4# Handling Outliers: Using the IQR method to identify and handle outliers in numerical columnsQ1 = merged_data.quantile(0.25)Q3 = merged_data.quantile(0.75)IQR = Q3 - Q1merged_data = merged_data[~((merged_data.isin([Q1 - 1.5 * IQR, Q3 + 1.5 * IQR])).any(axis=1))]# Feature Scaling: Normalizing the numerical featuresnumerical_features = ['Earnings Per Share', 'Total Revenue', 'open_price', 'close_price', 'low_price', 'high_price', 'trade_volume', 'PE_Ratio', 'Price_Change', 'Average_Price']scaler = MinMaxScaler()merged_data[numerical_features] = scaler.fit_transform(merged_data[numerical_features])# Encoding Categorical Variables: One-hot encoding for 'GICS Sector'merged_data_encoded = pd.get_dummies(merged_data, columns=['GICS Sector'])# Displaying a sample of the preprocessed datasetmerged_data_encoded.head()

Building a Personalized AI Trading Consultant with GPT-4 (8)

Model Training & Testing

Building a Personalized AI Trading Consultant with GPT-4 (9)

For our stock price prediction project, we must choose a machine learning model that excels in handling regression tasks, as we are dealing with predicting continuous stock price values. Given our dataset’s diverse and complex nature, our model needs to capture intricate patterns within the data adeptly.

Model Selection: Chosen for its versatility and robustness, the Random Forest Regressor is ideal for handling our dataset’s complexity and variety of features. It excels in regression tasks, is less prone to overfitting, and can take non-linear relationships.
Data Splitting: The dataset is split into an 80/20 ratio for training and testing. This ensures a comprehensive training phase while retaining a significant dataset for validation.
Handling Missing Values: Missing values are addressed using the SimpleImputer’s median filling strategy from sklearn.impute, ensuring data completeness and consistency across the dataset.
Training Process: The model is trained on the imputed training data, reflecting real-world scenarios with missing data points.
Performance Evaluation: After training, the model’s predictive accuracy is assessed using the imputed testing set, giving insights into its real-world applicability.

The following code demonstrates the steps involved in this process,

from sklearn.model_selection import train_test_splitfrom sklearn.ensemble import RandomForestRegressorfrom sklearn.metrics import mean_squared_error, r2_scorefrom sklearn.impute import SimpleImputer# Assuming 'close_price' as the target variable for predictionX = merged_data_encoded.drop(['close_price', 'symbol', 'date'], axis=1) # dropping non-numeric and target variabley = merged_data_encoded['close_price']# Checking for non-numeric columns in the datasetnon_numeric_columns = X.select_dtypes(include=['object']).columns# If there are non-numeric columns, we'll remove them from the datasetif len(non_numeric_columns) > 0: X = X.drop(non_numeric_columns, axis=1)# Splitting the dataset into training and testing setsX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)# Initializing the Random Forest Regressorrandom_forest_model = RandomForestRegressor(n_estimators=100, random_state=42)# Creating an imputer object with a median filling strategyimputer = SimpleImputer(strategy='median')# Applying the imputer to the training and testing setsX_train_imputed = imputer.fit_transform(X_train)X_test_imputed = imputer.transform(X_test)# Training the modelrandom_forest_model.fit(X_train_imputed, y_train)# Predicting on the test sety_pred = random_forest_model.predict(X_test_imputed)# Evaluating the modelmse = mean_squared_error(y_test, y_pred)r2 = r2_score(y_test, y_pred)mse, r2

Building a Personalized AI Trading Consultant with GPT-4 (10)

Model Performance

The output of our Random Forest Regressor model indicates the following:

Mean Squared Error (MSE): The low MSE value of 8.592×10−5 suggests that our model’s predictions are very close to the actual values, indicating high accuracy in predicting stock prices.
R-squared (R²): An R² value of approximately 0.96 implies that the model can explain about 96% of the variability in the stock prices, which is exceptionally high for stock market predictions.

Integration with GPT-4 API

After training the Random Forest Regressor model and enabling it for predictions, we will integrate it seamlessly with the GPT-4 API. This integration facilitates the model to analyze and predict stock prices and communicate these insights effectively to the users. The GPT-4 API, with its advanced natural language processing capabilities, can interpret complex financial data and present it in a user-friendly way.

How does the Integration Work?

Here’s a detailed explanation of how the integration works:

User Query Processing: The function get_model_predictions processes the user’s query to extract relevant information, such as the ticker symbol. Since we do not have the latest data, we will utilize the summary of the particular stock in question and generate test data.
Model Prediction and Scaling:The Random Forest model predicts the stock price from the test data and scales it back to its original value using the previously defined scaling method.
Preparing the Prompt for GPT-4: The query_gpt4_with_context function combines the user’s query, model predictions, and additional context, including price trends, fundamentals, and securities information for the specified stock. This prompt guides GPT-4 in delivering a tailored financial consultation based on the user’s query and the model’s analysis.
GPT-4 Query and Response: The prompt generates a tailored response based on the data and the user’s financial profile.

import osfrom openai import OpenAIfrom sklearn.impute import SimpleImputeros.environ["OPENAI_API_KEY"] ='YOUR API KEY'client = OpenAI()imputer = SimpleImputer(strategy='median')# Function to get model predictions based on user querydef get_model_predictions(user_query): ticker_symbol = user_query[0].split()[-1].strip().upper() # Applying imputter to the data and using the model to make # predictions imputed_test_data = imputer.fit_transform(test_data) predicted_scaled_value = random_forest_model.predict(imputed_test_data)[0] confidence = 0.9 #Assuming 90% confidence in our predictions # Creaing a placeholder array with the same shape as the # original scaled data placeholder_array = np.zeros((1, len(numerical_features))) # Inserting the predicted scaled value at the correct position placeholder_array[0][3] = predicted_scaled_value # Performing the inverse transformation predicted_original_value = scaler.inverse_transform(placeholder_array) # Extracting the scaled-back value for 'close_price' predicted_stock_price = predicted_original_value[0][3] return { "predicted_stock_price": predicted_stock_price, "confidence": confidence }# Function to query GPT-4 with model contextdef query_gpt4_with_context(model_context,additional_data_context, user_query): prompt = f"{additional_data_context}\n\n {model_context}\n\n{user_query}\n\nYou are a financial advisor, an expert stock market consultant. Study the predictions, the data provided and the client's profile to provide consultation related to the stock, to the user based on the above information. Also, focus your advise to the given stock only." response = client.chat.completions.create( model="gpt-4", messages=[{"role": "system", "content": prompt}] ) return response.choices[0].message.content.strip()

Now, let’s test the efficacy of our stock consultant with a couple of scenarios:

Test Case 1

I am a 25-year-old single male with a high-risk tolerance. I seek at least 15% annual growth in all my stock investments. Two years ago, I bought 100 shares of ABBV at $40 per share. Should I sell my ABBV shares? What is the likely profit in dollars and percentage from this sale?

Let’s feed this query into our model and see what output we get.

user_query = ["I am 25-year-old single male. Given my status, I have a high risk tolerance and look for at least 15% growth per year in all my stock investments. I bought 100 shares of ABBV at $40 per share two years ago. Should I look to sell my shares of company ABBV","What is my likely profit in $ and % from this sale?"]# Generating a random row of data for the queried stock # based on its summary statisticsticker_symbol = user_query[0].split()[-1].strip().upper()df1 = merged_data_encoded[merged_data_encoded['symbol'] == ticker_symbol]df1 = df1.drop(['close_price'], axis=1)test_data = df1.describe().loc[['mean', 'std']].Ttest_data['random_value'] = np.random.randn(len(test_data)) * test_data['std'] + test_data['mean']# Selecting only the random values to form a DataFrametest_data = pd.DataFrame(test_data['random_value']).transpose()model_predictions = get_model_predictions(user_query)# Generating model contextmodel_context = f"The current predicted stock price of {ticker_symbol} is ${model_predictions['predicted_stock_price']} with a confidence level of {model_predictions['confidence']*100}%."# Generating additional data contextadditional_data_context = prices[prices['symbol']==ticker_symbol],fundamentals[fundamentals['Ticker Symbol']==ticker_symbol],securities[securities['Ticker symbol']==ticker_symbol]gpt4_response = query_gpt4_with_context(model_context,additional_data_context, user_query)print(f"GPT-4 Response: {gpt4_response}")

Building a Personalized AI Trading Consultant with GPT-4 (11)

Test Case 2

I am 40 year old married female. Given my status, I have a low risk tolerance and look for atleast 10% growth per year in all my stock investments. I bought 100 shares of ALXN at $100 per share two years ago. Should I look to sell my shares of company ALXN? What is my likely profit in $ and % from this sale?

user_query = ["I am 40 year old married female. Given my status, I have a low risk tolerance and look for atleast 10% growth per year in all my stock investments. I bought 100 shares of ALXN at $100 per share two years ago. Should I look to sell my shares of company ALXN?","What is my likely profit in $ and % from this sale?"]# Generating a random row of data for the queried stock # based on its summary statisticsticker_symbol = user_query[0].split()[-1].strip().upper()df1 = merged_data_encoded[merged_data_encoded['symbol'] == ticker_symbol]df1 = df1.drop(['close_price'], axis=1)test_data = df1.describe().loc[['mean', 'std']].Ttest_data['random_value'] = np.random.randn(len(test_data)) * test_data['std'] + test_data['mean']# Selecting only the random values to form a DataFrametest_data = pd.DataFrame(test_data['random_value']).transpose()model_predictions = get_model_predictions(user_query)# Generating model contextmodel_context = f"The current predicted stock price of {ticker_symbol} is ${model_predictions['predicted_stock_price']} with a confidence level of {model_predictions['confidence']*100}%."# Generating additional data contextadditional_data_context = prices[prices['symbol']==ticker_symbol],fundamentals[fundamentals['Ticker Symbol']==ticker_symbol],securities[securities['Ticker symbol']==ticker_symbol]gpt4_response = query_gpt4_with_context(model_context,additional_data_context, user_query)print(f"GPT-4 Response: {gpt4_response}")

Building a Personalized AI Trading Consultant with GPT-4 (12)

Challenges

One of the biggest challenges in implementing a project like this is ensuring the accuracy and timeliness of financial data is crucial. Inaccurate or outdated data can lead to misguided predictions and recommendations.
Numerous unpredictable factors, including geopolitical events, economic changes, and company-specific news influence stock markets. These elements can make AI predictions less reliable.
AI models, despite their advanced capabilities, may struggle to fully grasp intricate financial terminologies and concepts, potentially impacting the quality of investment advice.
Financial advising is heavily regulated. Ensuring that AI-driven recommendations comply with legal standards and ethical guidelines is a significant challenge.

Conclusion

Our exploration of AI in stock trading shows that models like GPT-3 and GPT-4 redefine the landscape, assimilating vast data, applying sophisticated analysis, and offering precise, personalized insights. The stock trading consultant development signifies a leap toward accessible, informed trading for everyone.

Key Takeaways

The integration of AI into stock trading is not futuristic—it’s here, reshaping how we interact with the stock market.
AI-driven models like GPT-3 and GPT-4 provide personalized strategies, aligning with individual risk profiles, and financial goals.
AI harnesses both historical and real-time data to predict market trends and inform investment strategies.
Sophisticated investment strategies are no longer just for institutional investors; they are now accessible to retail investors, thanks to AI.
AI empowers investors to make informed decisions, providing a strategic advantage in the volatile realm of stock investment.

Frequently Asked Questions

Q1. What are Large Language Models (LLMs) and how do they apply to stock trading?

A. LLMs are AI models that process and generate text. They analyze financial reports, news, and market data in stock trading to provide investment insights and predictions.