Final Project: Crude Oil Price Projection

Content:

- Topic & Purpose
- Questions
- Data sources
- Technologies/Tools
- Results
- Summary

★ Topic & Purpose

✓ Topic: Crude Oil Price Projection

✓ Reason why we selected this topic

For this project, we intend to develop a robust machine learning model that delivers a one-year projection of the price of West Texas Intermediate (WTI) crude oil, one of the most well-known and widely produced blends of crude in the United States. To do so, we will examine historical and current data from the U.S. Energy Information Administration (EIA), a federal agency that tracks production, sales, and spot and futures prices of WTI crude. The price projection will be a function of oil production, refinery utilization and capacity, and sales.

★ Questions

✓ Questions we hope to answer with the data

Where will the data be sourced from?
How will the data be transformed and stored?
Which machine learning models and libraries should be used?
How will the results be visualized?
Which model produces the most accurate projections for WTI?

★ Data sources

✓ Description of their source of data

Data gathered from the U.S. Energy Information Administration (EIA):
- Crude Oil Production dataset
  
  The volume of crude oil produced from oil reservoirs during given periods of time. The amount of such production for a given period is measured as volumes delivered from lease storage tanks (i.e., the point of custody transfer) to pipelines, trucks, or other media for transport to refineries or terminals, with adjustments for (1) net differences between opening and closing lease inventories, and (2) basic sediment and water (BS&W).
- Product Supplied dataset
  
  Approximately represents consumption of petroleum products because it measures the disappearance of these products from primary sources, i.e., refineries, natural gas-processing plants, blending plants, pipelines, and bulk terminals. In general, Product Supplied of each product in any given period is computed as follows:

  Product Supplied = field production + refinery production + imports + unaccounted-for crude oil (+ net receipts when calculated on a PAD District basis) - stock change - crude oil losses - refinery inputs - exports
- Refinery Utilization and Capacity dataset
  
  Ratio of the total amount of crude oil, unfinished oils, and natural gas plant liquids run through crude oil distillation units to the operable capacity of these units.
- NYMEX Futures Prices dataset
  
  Futures prices from the New York Mercantile Exchange (NYMEX): the price quoted for delivering a specified quantity of a commodity at a specified time and place in the future.
- WTI Crude SPOT Prices Historical dataset
  
  Spot prices for West Texas Intermediate (WTI): the price for a one-time open market transaction for immediate delivery of a specific quantity of product at a specific location, where the commodity is purchased "on the spot" at current market rates.
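
The Product Supplied formula and the refinery utilization ratio described above can be written as two small functions. This is a minimal sketch rather than the project's actual ETL code: the function names, parameter names, and the sample figures are hypothetical, and all inputs are assumed to share the same unit and period (for example, thousand barrels per day for one month).

```python
def product_supplied(field_production, refinery_production, imports,
                     unaccounted_crude, stock_change, crude_losses,
                     refinery_inputs, exports, net_receipts=0.0):
    """Product supplied for one period, following the definition above.

    net_receipts only applies when working on a PAD District basis,
    so it defaults to zero for national totals.
    """
    supply = (field_production + refinery_production + imports
              + unaccounted_crude + net_receipts)
    disposition = stock_change + crude_losses + refinery_inputs + exports
    return supply - disposition


def refinery_utilization(gross_inputs, operable_capacity):
    """Ratio of distillation-unit inputs to operable capacity (0.9 = 90%)."""
    if operable_capacity <= 0:
        raise ValueError("operable_capacity must be positive")
    return gross_inputs / operable_capacity


# Illustrative numbers only; these are not EIA figures.
demo_supplied = product_supplied(
    field_production=100, refinery_production=500, imports=200,
    unaccounted_crude=5, stock_change=10, crude_losses=1,
    refinery_inputs=550, exports=150,
)
demo_utilization = refinery_utilization(gross_inputs=900, operable_capacity=1000)
print(demo_supplied, demo_utilization)
```

Keeping the supply terms and the disposition terms in separate sums makes it easier to check each sign against the definition.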

★ Technologies we used

✓ Technologies, languages, tools, and algorithms used throughout the project

Data Cleaning
- Jupyter Notebook 6.4.6
  - Python (ETL process)
    - Pandas library
- Quick DBD (Create ERDs)
- PostgreSQL (Join/Merge datasets)
Database Storage
- pgAdmin 4
  - PostgreSQL
- Google Cloud Platform (GCP)
Machine Learning
- Jupyter Notebook or Google Colaboratory (Google Colab Notebook)
  - Python
    - Prophet
    - LSTM (Keras/TensorFlow)
    - ARIMA
    - Scikit-learn library
      - Random Forest Regressor algorithm
Dashboard
- Tableau Public 2021.3.3
- Visual Studio Code 1.63.2
  - HTML/CSS
- GitHub Pages
Slides
- Google Slides

★ Results

✓ Result of analysis

[Image: Results_small]

★ Summary

✓ Summary

In conclusion, our results indicate that Prophet produces the most accurate forecasts of WTI Crude Oil futures, as measured by MAPE (mean absolute percentage error, where lower is better).
While further improvements can certainly be made to all of these models, Prophet offers an intuitive, user-friendly forecasting platform that boasts impressive accuracy, given its simplicity.
Random Forest Regressors, particularly when tuned to account for only the most important variables, also serve as a powerful forecasting tool when applied to WTI pricing.
ARIMA struggles to cope with the volatility of recent crude pricing data, as seen in the forecast graph. Although its MAPE is relatively acceptable, its trend line tracks prices poorly during periods of high volatility.
LSTM models have high potential for accurate modeling of financial data, but our model's hyperparameters would need extensive tuning to improve its accuracy.
- Might also benefit from a larger dataset
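
Since the model comparison above rests on MAPE, here is a minimal sketch of how that metric is computed and how several forecasts could be ranked by it. The `mape` and `rank_models` helpers and the sample prices are hypothetical, not the project's results; a lower MAPE means a smaller average percentage error.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be non-empty and equal in length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


def rank_models(actual, forecasts_by_model):
    """Return (model_name, mape) pairs sorted from most to least accurate."""
    scores = {name: mape(actual, f) for name, f in forecasts_by_model.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Made-up prices for illustration only.
actual_prices = [50, 100, 80]
candidate_forecasts = {
    "model_a": [55, 90, 80],
    "model_b": [60, 120, 100],
}
for name, score in rank_models(actual_prices, candidate_forecasts):
    print(f"{name}: {score:.2f}%")
```

Because MAPE divides by the actual value, it cannot be used on a series that contains zeros, which is why the sketch rejects such input.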

✓ Recommendation for future analysis

Use of weekly or daily data
- Possible increase in model accuracy, at the cost of:
  - Increased runtime
  - Increased potential for model overfitting
Inclusion of pricing data for historically correlated asset classes (USD, gold, etc.)
- Caution: in recent years, some of these correlations have shifted
Inclusion of global oil demand figures, not just domestic product supplied

✓ Anything the team would have done differently

For data:

There are several influences on oil prices
- We could have explored additional data types, such as
  - Weather patterns
  - Natural disasters
  - Political Instability
  - Military Conflicts
  - Correlated financial instruments

For Machine Learning:

Further exploration of different model types
- VARMA (Vector AutoRegressive Moving Average), fitted on differenced data
  - Multivariate counterpart of ARIMA
- LSTM encoder-decoder Model (interprets data sequence-by-sequence)
Extrapolation of projections beyond the constraints of the time series
- Considerable scarcity of publicly-available examples/recommendations of how this is accomplished, especially in the case of multivariate, multi-step models
Deeper understanding of hyperparameter tuning for more complex models
- Would assist in the improvement of model accuracy
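
One common way to extrapolate past the end of a series is recursive multi-step forecasting: predict one step ahead, append the prediction to the history, and repeat. The sketch below illustrates the idea with a univariate AR(1) model fitted by least squares. It is a stand-in chosen for brevity, not one of the models used in this project, and the function names and sample series are hypothetical. In the multivariate case, future values of the other input variables would also have to be forecast or supplied.

```python
def fit_ar1(series):
    """Least-squares fit of y[t] = c + phi * y[t-1]; returns (c, phi)."""
    x = series[:-1]   # lagged values
    y = series[1:]    # values one step later
    n = len(x)
    if n < 2:
        raise ValueError("need at least three observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    if var_x == 0:
        raise ValueError("series is constant; slope is undefined")
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    phi = cov_xy / var_x
    c = mean_y - phi * mean_x
    return c, phi


def recursive_forecast(series, steps):
    """Forecast `steps` periods ahead, feeding each prediction back in."""
    c, phi = fit_ar1(series)
    last_value = series[-1]
    forecasts = []
    for _ in range(steps):
        last_value = c + phi * last_value  # the prediction becomes the next input
        forecasts.append(last_value)
    return forecasts


# Synthetic series for illustration only.
history = [100, 60, 40, 30, 25, 22.5]
print(recursive_forecast(history, steps=3))
```

Errors compound under this scheme because each step builds on the previous prediction, so long horizons deserve extra scrutiny.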

For Dashboard:

Explore more datasets from various sources and include more categorical data for calculation and visualization.
