Large language models (e.g., GPT-4) seem poised to revolutionize the business world. It’s only a matter of time before many professions are transformed in some way by AI, as GPT can already generate functional code, review and draft legal documents, give tax advice, and turn hand-sketched diagrams into fully functioning websites. Among the roles most likely to be affected by GPT are those involving sales, marketing, customer support, and media, although it’s hard to imagine a domain that won’t be affected in some way. While certain tasks will always demand a human touch, the focus of many roles is likely to shift toward these key human endeavors and away from those that can be automated. With all this in mind, it’s pertinent to ask what challenges organizations are likely to encounter as they begin to invest in advanced AI and which roadblocks developers are likely to run up against as they work to incorporate GPT APIs into software products. While it is still too early to anticipate every hurdle teams using GPT will face, our understanding of AI and large language models suggests at least a few that will be particularly prominent.
Cost and latency
The most advanced GPT-4 model is currently only accessible via OpenAI’s API. This has both benefits and drawbacks. One drawback is that businesses are beholden to OpenAI’s cost structure and backend systems. If the API experiences downtime or returns responses with high latency, this can lead to frustrating experiences for users of downstream products. Currently, OpenAI does not provide SLAs for uptime or latency, so businesses should expect significant variance in both. See Chip Huyen’s analysis for some statistics. What this means for businesses is that they should have monitoring in place that can detect failed API calls or high-latency responses, along with fallback systems that can take over in instances of failure.
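As a rough illustration, the retry-and-fallback pattern described above might look like the following sketch. The `primary` and `fallback` callables are stand-ins for your actual OpenAI API call and a cheaper backup (a cached answer, a smaller model, or a graceful error message); the latency log is what a monitoring system would ingest. Function and parameter names are hypothetical.

```python
import time

def call_with_fallback(primary, fallback, max_retries=2, latency_budget_s=5.0, log=None):
    """Call `primary`; on failures or over-budget latency, retry, then fall back.

    `primary` and `fallback` are stand-ins for an OpenAI API call and a
    cheaper/cached backup; `log` collects per-attempt latency and error
    records for a monitoring system to consume.
    """
    log = log if log is not None else []
    for attempt in range(max_retries + 1):
        start = time.monotonic()
        try:
            result = primary()
            elapsed = time.monotonic() - start
            log.append({"source": "primary", "attempt": attempt, "latency_s": elapsed})
            if elapsed <= latency_budget_s:
                return result
            # Response was too slow: record it and retry rather than surface it.
        except Exception as exc:
            log.append({"source": "primary", "attempt": attempt, "error": str(exc)})
    # All primary attempts failed or exceeded the latency budget: degrade gracefully.
    result = fallback()
    log.append({"source": "fallback"})
    return result
```

In production you would also want timeouts on the request itself and exponential backoff between retries, but the core idea is that every call is measured and every failure has a planned degradation path.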
While OpenAI’s pricing structure is transparent, it is based on token-level usage. This can make it quite difficult to estimate long-term costs because the length in tokens of prompts and the responses returned by the API can be highly variable. This is likely to result in some limitations for downstream products, where user inputs often control the prompts issued to the API. Most businesses will see large fluctuations in their usage of the model as they acquire new customers and see spikes or lulls in demand. Thus, it can be difficult to determine how to charge customers for services that involve GPT in the backend. However, companies can get creative with their pricing strategies to help mitigate these effects. To prevent costs from ballooning, companies might need to cap their product’s usage in some way, for example by offering subscription pricing tiers based on how heavily a customer’s use of the product drives queries to the OpenAI API.

Businesses and developers will also need to become skilled at prompt engineering in order to get the most bang for their buck. Prompt tuning is likely to be one of the largest expenditures for companies using the API, as the quality of the model’s responses is known to vary widely based on the wording and structure of the prompt. Monitoring expenditures and tying them back to specific prompts and responses is key to effective cost management. Being able to identify costly prompts gives organizations insight into which use cases are causing cost surges, so they can either reduce usage on the backend or adjust their product pricing structures accordingly.
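A minimal sketch of the cost attribution described above might look like this. The per-1K-token prices here are illustrative placeholders (check OpenAI’s current pricing page), and the idea of logging a `template_id` per request is an assumed convention for tying spend back to specific prompts.

```python
# Illustrative per-1K-token prices only; consult OpenAI's pricing page for real rates.
PRICE_PER_1K = {"prompt": 0.03, "completion": 0.06}

def request_cost(prompt_tokens, completion_tokens, prices=PRICE_PER_1K):
    """Estimate the cost of a single API call from its token counts."""
    return (prompt_tokens / 1000) * prices["prompt"] + \
           (completion_tokens / 1000) * prices["completion"]

def costliest_prompts(usage_log, top_n=3):
    """Rank prompt templates by total spend to surface cost hot spots.

    `usage_log` is a list of (template_id, prompt_tokens, completion_tokens)
    records, one per API call.
    """
    totals = {}
    for template_id, p_tokens, c_tokens in usage_log:
        totals[template_id] = totals.get(template_id, 0.0) + request_cost(p_tokens, c_tokens)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```

With a log like this in place, a spike in spend can be traced to the prompt template responsible rather than showing up only as an opaque monthly invoice.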
One of the most exciting potential applications of GPT-4 lies in chaining together different tasks: for example, using the model to create a shopping list in Instacart, book flights via services such as Kayak or Expedia, or access structured knowledge databases via calls to Wolfram Alpha. This functionality is currently provided via OpenAI plugins and third-party tools such as LangChain. While the potential here is huge, it also introduces multiple points of failure into an AI system. Without proper monitoring in place, it would be difficult to tell whether an error originates with the GPT-4 API, the downstream service, or the glue between them. Monitoring therefore needs to span the full chain, tying downstream KPIs back to each step in the series of business processes.
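One simple way to get that attribution is to run each step of a chain through a wrapper that records which named stage succeeded or failed. This is a hedged sketch, not LangChain’s actual API; the stage names and payload shape are hypothetical.

```python
def run_chain(stages, payload):
    """Run named stages in order, recording which one fails.

    `stages` is a list of (name, fn) pairs; each fn takes the current
    payload and returns the next one. Returns (result, trace), where the
    trace lets a failure be attributed to the model call, a downstream
    service, or the glue between them.
    """
    trace = []
    for name, fn in stages:
        try:
            payload = fn(payload)
            trace.append({"stage": name, "status": "ok"})
        except Exception as exc:
            trace.append({"stage": name, "status": "error", "detail": str(exc)})
            return None, trace  # stop the chain; later stages never ran
    return payload, trace
```

Feeding these traces into the same monitoring system that tracks business KPIs makes it possible to say, for instance, that a drop in completed bookings traces back to a third-party flight API rather than to the model itself.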
Bias and fairness
While OpenAI has not released the exact details of its training procedures, it’s safe to assume that GPT was trained on unfathomably large, internet-scale datasets. Much of this data is likely to represent both the best and worst of humanity, from lengthy discourses on freedom and liberty to racist diatribes and sexist screeds. While we can hope that the good data of humanity outweighs the bad, research on large language models has shown that they are sometimes liable to generate biased outputs and occasionally make prejudicial decisions if not properly monitored and calibrated. Luckily, OpenAI already provides APIs through which teams can fine-tune GPT on their own internal datasets and in doing so mitigate some bias, assuming their own data has been carefully vetted against standards of fairness and equity. This is not sufficient, however, and teams should put careful monitoring procedures in place to ensure that users interacting with their systems are not being presented with biased outputs. Furthermore, monitoring should ensure that any biased outputs do not propagate to downstream business decisions that are made with respect to customers. For example, one should ensure that a GPT-based credit screening product is not blanket-denying applications from users of a certain demographic or requiring them to undergo additional steps that are not otherwise justified by objective criteria such as income and employment history.
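For the credit-screening example above, one concrete monitoring signal is to compare per-group approval rates and flag large gaps. The sketch below uses the common "four-fifths" heuristic as an alerting threshold; the data shape and threshold are illustrative assumptions, and a real fairness audit would go well beyond a single ratio.

```python
def approval_rates(decisions):
    """Per-group approval rate from (group, approved) records."""
    counts = {}
    for group, approved in decisions:
        total, yes = counts.get(group, (0, 0))
        counts[group] = (total + 1, yes + (1 if approved else 0))
    return {g: yes / total for g, (total, yes) in counts.items()}

def disparate_impact_alert(decisions, threshold=0.8):
    """Flag groups whose approval rate falls below `threshold` times the
    best-performing group's rate (the common 'four-fifths' heuristic).
    """
    rates = approval_rates(decisions)
    best = max(rates.values())
    return [g for g, r in rates.items() if best > 0 and r < threshold * best]
```

An alert from a check like this is a prompt for human investigation, not a verdict: the gap may be explained by legitimate criteria, but it should never go unnoticed.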
One challenge of ChatGPT relative to other AI models is its proprietary nature. If you were to train your own language model in-house, you’d have an understanding of the datasets the model was trained on, the model’s strengths and weaknesses, and the ability to tweak the model when it did not perform as expected. While using a proprietary model such as ChatGPT does bring its share of benefits – a predictable pricing structure, no need to manage model training and hosting infrastructure, and access to a more powerful model than could be trained in-house – it also brings about complications that businesses need to consider when using ChatGPT’s outputs within their products and processes. Because ChatGPT is a black-box neural network, it’s often impossible to predict how it might respond to a given query. For example, ChatGPT has been known to fail at basic math and to hallucinate by making plausible, but provably false, statements. In effect, when you incorporate ChatGPT into your products, you are taking on responsibility for everything it says, both good and bad. While the majority of the model’s outputs are likely to be reasonable, it’s important that you have monitoring procedures in place that can detect when the model makes incorrect or harmful statements.
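Because the model is a black box, one practical defense is to independently verify the parts of its output that can be checked mechanically before surfacing them. As a narrow, hedged example of that idea, the sketch below re-checks simple arithmetic claims found in a response; real guardrails would cover many more claim types, and the regex and function names here are illustrative.

```python
import re

# Matches simple claims of the form "a + b = c", "a - b = c", or "a * b = c".
ARITHMETIC = re.compile(r"(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)")
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}

def arithmetic_errors(text):
    """Return the arithmetic claims in `text` that don't check out.

    A narrow guardrail: since models can confidently assert false sums,
    verify any it makes before showing the response to a user.
    """
    errors = []
    for a, op, b, claimed in ARITHMETIC.findall(text):
        if OPS[op](int(a), int(b)) != int(claimed):
            errors.append(f"{a} {op} {b} = {claimed}")
    return errors
```

The broader pattern is the point: wherever a claim in a model response can be validated against ground truth (a calculation, a database lookup, a cited source), that check should run before the claim reaches a customer.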
Finally, it’s important to note that ultimately OpenAI has complete control over the GPT-4 API. While unlikely, it can revoke access for organizations whose use cases it doesn’t approve of. Many have voiced concern about the strict controls around GPT-4, which some worry will be used to prioritize certain uses over others. GPT-4 has also demonstrated political bias and can be used as a tool of influence. Beyond this, there are concerns about the integrity of the data used to train the model. Because GPT-4 is an LLM, it can regurgitate chunks of text from its training set, much of it likely pulled from copyrighted sources. This can put business owners in a legal bind if strict adherence to copyright regulations, GDPR, and other privacy standards is their goal. Furthermore, data-usage terms deserve careful attention: OpenAI’s policies on whether prompts submitted to the API may be retained or used to improve the model have shifted over time, so teams should review the current terms of service before sending sensitive data. For certain use cases where customer data must be kept private, such as in the healthcare industry, reliance on the API may be inadvisable altogether.
General AI challenges
Because ChatGPT’s performance is so impressive on general reasoning tasks, it can be tempting to believe that it doesn’t suffer from the sorts of issues that plague less powerful AI models. However, this is not the case. ChatGPT can still experience deleterious effects, like concept drift and data drift, as its user base and prediction environment change. The model will need to be periodically fine-tuned on a business’s most recent data in order to mitigate this. Furthermore, issues in prediction performance will often appear first in small, seemingly negligible slices and segments of users. In order to stay on top of model decay and remedy it before it propagates to downstream business processes, it is important to have monitoring procedures in place that can detect it early.
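A per-segment comparison against a baseline window is one simple way to catch decay in those small slices before aggregate metrics move. This sketch assumes you can label each request with a segment (e.g., language or customer tier) and score some fraction of responses as correct or not; the thresholds are illustrative.

```python
def segment_decay_alerts(baseline, current, min_samples=50, tolerance=0.05):
    """Compare per-segment error rates against a baseline window.

    `baseline` and `current` each map segment -> (errors, total). Segments
    whose error rate has risen by more than `tolerance` (with enough samples
    to trust the estimate) are returned, so decay confined to a small user
    slice is caught before it reaches downstream business processes.
    """
    alerts = []
    for segment, (errs, total) in current.items():
        if total < min_samples or segment not in baseline:
            continue  # too few samples, or a brand-new segment with no baseline
        b_errs, b_total = baseline[segment]
        if errs / total - b_errs / b_total > tolerance:
            alerts.append(segment)
    return alerts
```

Aggregate accuracy can look flat while one segment quietly degrades; slicing first is what makes the drift visible.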
Business leaders and software developers should not be afraid of ChatGPT and other generative AI models, as they bring with them seemingly limitless possibilities and the ability to drastically drive down costs and improve productivity. However, they should be aware of the unique set of challenges these models introduce and how they might come to bear in a business context where customer experience is paramount. As such, it’s important to proactively develop a plan to address these challenges before they come to a head. In particular, implementing holistic ML monitoring capabilities is vital to ensuring that generative AI has a positive impact. If you have any questions about how monitoring your GPT models can impact your business, get in touch with us.