Best Practices for Choosing an LLM for AI Applications

It is important to understand how large language models (LLMs) work within applications, because not all LLMs are created equal. They come in different types: some are big, others small; each has its own training data, and some can do things that others cannot. There is a lot to choose from, and your choice affects how well your app functions, the quality of its output, and even the price you will pay.

1. The most promising LLMs available today
2. How to compare them using key criteria
3. The trade-offs between size and performance

The most promising LLMs on the market

Many new large language models (LLMs) have been released recently. Over the past year, different groups have invested heavily in building and polishing these models, each with distinctive capabilities. Some are enormous and more capable than anything seen before; others are smaller but excel at specific tasks.

In this piece, we will look at some of the best LLMs available in 2025. We will discuss where they came from, what is known about them, and how they work. We will see how they perform on various benchmarks and examine their strengths and weaknesses. We will also discuss what these models are used for, as well as the risks they may pose in the future.

Proprietary models

Proprietary LLMs are developed by private companies that do not open-source their code, and they are typically offered as paid services. They tend to provide more frequent support and updates, as well as stronger safety alignment, than open-source models. Thanks to their greater complexity and larger training datasets, they usually outperform open-source models. The catch is that they are a "black box": their inner workings are hidden from developers. Let us now look at three of the most common proprietary LLMs, as of August 2023.

GPT-4

OpenAI released GPT-4 in March 2023, and it has since been followed by a newer version, GPT-4 Turbo. At the time of writing, it is one of the best-performing models available. CEO Sam Altman has even stated that OpenAI is already working on GPT-5.

GPT-4 belongs to the family of generative pretrained transformer (GPT) models, a design introduced by OpenAI that uses decoder-only transformers. The following diagram illustrates the basic setup.

generative pretrained transformer (GPT) models

The diagram above shows that the decoder-only model retains some essential ingredients of the transformer architecture covered in Chapter 1, namely positional embeddings, multi-head attention, and feed-forward layers. However, this model has only a decoder. It learns to predict the next token in a sequence given all previous tokens. There is no separate encoder to summarize incoming information, as in models that use both an encoder and a decoder; instead, the decoder tracks everything in a hidden state, which it updates incrementally as text is generated.
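The "attend only to previous tokens" rule can be made concrete with a causal attention mask: each position may look at itself and earlier positions, never at the future. Below is a minimal sketch with toy dimensions (these are illustrative values, not GPT-4's actual configuration):

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Lower-triangular mask: position i may attend to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_attention_weights(scores: np.ndarray) -> np.ndarray:
    """Softmax over attention scores with future positions masked out."""
    mask = causal_mask(scores.shape[0])
    scores = np.where(mask, scores, -np.inf)  # future tokens get zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # uniform scores over a 4-token sequence
w = masked_attention_weights(scores)
print(w[0])  # the first token can only attend to itself: [1. 0. 0. 0.]
```

Note how the last row attends uniformly to all four tokens, while the first row can only see itself; this is exactly what lets a decoder-only model be trained on next-token prediction over whole sequences in parallel.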

So what has improved in GPT-4 over previous models? GPT-4 was trained on publicly available datasets, as well as data from web pages (and other relevant materials) that OpenAI had gathered permission to use; however, OpenAI has not disclosed the exact contents of the training set. It was also trained with reinforcement learning from human feedback (RLHF), so that the model learns to produce outputs better suited to what users actually want.

GPT-4 performs well on logical inference and analysis tasks. It was benchmarked against state-of-the-art systems on tests such as MMLU, discussed in Chapter 1. GPT-4 outperformed earlier models across all conditioning prompts, including MMLU tests in languages other than English.

The following figures illustrate how well GPT-4 did on MMLU:

GPT-4 3-shot accuracy on MMLU across languages

Exam results (ordered by GPT-3.5 performance)

 

While GPT-4 is still far from perfect, it achieves significant gains on the TruthfulQA benchmark, which tests how accurately a model can distinguish facts from falsehoods (we discussed TruthfulQA in Chapter 1, in the model evaluation section).

To see what this means in practice, here is a chart comparing GPT-4's performance on the TruthfulQA benchmark with that of GPT-3.5 (the model adopted in OpenAI's ChatGPT) and Anthropic-LM (we will cover this model in the following sections).

Accuracy on adversarial questions (TruthfulQA mc1)

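The mc1 score in the chart above is plain accuracy on single-answer multiple-choice questions: the model is correct when it assigns the highest score to the one true answer. A minimal scorer might look like this (the model scores below are made up for illustration):

```python
def mc1_accuracy(items):
    """items: list of (scores_per_choice, index_of_true_answer).

    A question counts as correct when the highest-scoring choice
    is the true answer.
    """
    correct = sum(
        1 for scores, truth in items
        if max(range(len(scores)), key=scores.__getitem__) == truth
    )
    return correct / len(items)

# Made-up log-probabilities for three questions with three choices each.
questions = [
    ([-1.2, -0.3, -2.5], 1),  # model picks choice 1, truth is 1: correct
    ([-0.1, -1.0, -0.9], 2),  # model picks choice 0, truth is 2: wrong
    ([-2.0, -0.7, -0.4], 2),  # model picks choice 2, truth is 2: correct
]
print(mc1_accuracy(questions))  # 2 of 3 correct
```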

 

GPT-4 took about a year to develop, and OpenAI went above and beyond to improve its security and alignment with user queries. From the beginning, they assembled a team of over 50 experts, including specialists in AI alignment risks, privacy, and cybersecurity. Their goal was to understand the severity of the risks posed by such a high-stakes model and, most importantly, what could be done to prevent those risks from materializing.

Gemini 1.5

Google launched Gemini in December 2023, presenting it as its most capable generative AI model (source: Google). Like GPT-4, Gemini is multimodal, which means it can work with different types of data, including text, images, audio, and code. Gemini 1.5 is based on a Transformer architecture that uses a mixture-of-experts (MoE) approach.

It comes in different sizes (Ultra, Pro, and Nano) for various computing needs, from large data centers down to mobile devices. Developers can also integrate its features into their apps using the API for the different versions of Gemini. Gemini 1.5 performs better on text, image, and audio tasks compared to Gemini 1.0, as shown below.


 

 Gemini 1.5 Pro compared to Gemini 1.0 Pro and Ultra on different benchmarks

Note that Gemini 1.5 Pro outperforms Gemini 1.0 Ultra (which is remarkably bigger) on many benchmarks across various domains. As of today, Gemini Pro can be tried for free via a web app at gemini.google.com, while Gemini Ultra is available via a premium subscription with a monthly fee. Gemini Nano, which is tailored for mobile devices, can run on capable Android devices via the Google AI Edge SDK for Android. Note that, as of April 2024, this SDK is still in early access preview, and you can apply for the early access program via a Google form. Finally, Gemini Pro and Ultra can also be consumed by developers via the REST API from Google AI Studio.

Claude 2

Claude 2 is a large language model created by Anthropic, a company founded by former OpenAI people that focuses on AI safety and alignment. It was released in July 2023.

The model is transformer-based and was trained on public data from around the web combined with various proprietary datasets. Training used a combination of supervised methods, unsupervised learning, reinforcement learning from human feedback (RLHF), and a technique called constitutional AI (CAI). CAI is distinctive to Claude 2: Anthropic went to great lengths to ensure the model complies with a set of safety principles.

CAI is a procedure Anthropic introduced in their December 2022 paper "Constitutional AI: Harmlessness from AI Feedback." CAI aims to increase the model's safety and alignment with human values: to prevent harmful or unethical outputs, to avoid assisting people with illegal behavior, and to create an AI system that is helpful, honest, and harmless. To achieve this, CAI relies on a set of guiding principles (a "constitution") for the model's behavior, rather than on human feedback or data alone. These principles are drawn from many sources, including human rights declarations, trust-and-safety best practices, principles proposed by other AI research organizations, and Anthropic's own research.
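At a high level, the supervised phase of CAI asks the model to critique and revise its own outputs against the constitution. The sketch below shows that loop with stub functions standing in for real model calls; all names and the example principles here are hypothetical, not Anthropic's actual implementation:

```python
# Hypothetical sketch of CAI's supervised critique-and-revision loop.
# The stub functions stand in for real model calls.

CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that is most honest.",
]

def generate(prompt):
    # Stub: a real system would sample from the base model.
    return f"draft answer to: {prompt}"

def critique(response, principle):
    # Stub: a real system would ask the model to critique `response`
    # against `principle` and return the critique text.
    return f"critique of '{response}' under '{principle}'"

def revise(response, critique_text):
    # Stub: a real system would ask the model to rewrite the response
    # so that it addresses the critique.
    return response + " [revised]"

def cai_supervised_step(prompt):
    """Generate a draft, then critique and revise it once per principle."""
    response = generate(prompt)
    for principle in CONSTITUTION:
        c = critique(response, principle)
        response = revise(response, c)
    # The (prompt, final response) pairs become fine-tuning data.
    return response

print(cai_supervised_step("How should I respond to an angry customer?"))
```

The key design choice is that the feedback signal comes from the principles themselves rather than from per-example human labels, which is what makes the approach scale.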

 

 

 

Claude’s training process according to the CAI technique

 

Claude 2 is also special for its context length, which extends up to 100,000 tokens. This allows a user to submit extensive writing, such as technical documentation or even an entire book, in a single prompt. It can also generate longer outputs than most LLMs. Additionally, Claude 2 does well with code, scoring 71.2% on the HumanEval benchmark.
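HumanEval scores like the 71.2% above are pass@k metrics: the probability that at least one of k sampled completions passes the unit tests. The benchmark's authors give an unbiased estimator for it, sketched below (the sample counts in the example are illustrative, not Claude's actual figures):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c of them correct.

    Returns the probability that at least one of k randomly chosen
    samples (out of the n) passes the tests.
    """
    if n - c < k:
        return 1.0  # too few failures to fill all k slots with failures
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative values: 10 samples per problem, 7 passed the tests.
print(pass_at_k(n=10, c=7, k=1))  # fraction expected to pass on one try
```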

Claude 2 is an intriguing model and a formidable rival to GPT-4. It can be accessed via the REST API or through the Anthropic beta chat experience, which, as of August 2023, is limited to users in the United States and the United Kingdom.

The following comparison table shows the main differences between the three models, GPT-4 vs. Gemini vs. Claude 2:

| | GPT-4 | Gemini | Claude 2 |
|---|---|---|---|
| Company or institution | OpenAI | Google | Anthropic |
| First release | March 2023 | December 2023 | July 2023 |
| Architecture | Transformer-based, decoder only | Transformer-based | Transformer-based |
| Sizes and variants | Parameters not officially specified; two context-length variants: GPT-4 8K tokens and GPT-4 32K tokens | Three sizes, from smallest to largest: Nano, Pro, and Ultra | Not officially specified |
| How to use | REST API at OpenAI developer platform; OpenAI Playground at https://platform.openai.com/playground | REST API at Google AI Studio; Gemini at https://gemini.google.com/ | REST API after completing the form at https://www.anthropic.com/claude |

 

Open-source models

A key benefit of open-source models is that developers can see and access the entire source code. For LLMs, this means:

  • You have full visibility and control: you can modify the architecture in your local copy without depending on the model's owners.
  • Besides the fine-tuning offered by proprietary models, you can also train the model from scratch.
  • You do not have to pay for usage, unlike proprietary models, which are mostly based on a pay-per-use pricing model.
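The pricing point above can be made concrete with a quick back-of-the-envelope comparison between pay-per-use API pricing and a fixed self-hosting cost. The prices below are hypothetical placeholders, not real vendor rates:

```python
def monthly_api_cost(tokens_per_month: float, price_per_1k_tokens: float) -> float:
    """Pay-per-use cost of a proprietary API for a month of traffic."""
    return tokens_per_month / 1000 * price_per_1k_tokens

def breakeven_tokens(hosting_cost_per_month: float, price_per_1k_tokens: float) -> float:
    """Monthly token volume above which self-hosting an open-source
    model becomes cheaper than the pay-per-use API."""
    return hosting_cost_per_month / price_per_1k_tokens * 1000

# Hypothetical numbers: $0.01 per 1K tokens vs. a $500/month GPU server.
print(monthly_api_cost(20_000_000, 0.01))  # API cost at 20M tokens/month
print(breakeven_tokens(500, 0.01))         # tokens/month where costs match
```

Below the break-even volume the API is cheaper; above it, self-hosting wins, which is why high-traffic applications often gravitate toward open-source models.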

In this book, we will benchmark open-source models using the Hugging Face Open LLM Leaderboard. This project assesses LLM performance on a series of natural-language understanding benchmarks and is hosted on Hugging Face Spaces, a platform for running machine-learning applications.

LLaMA-2

Large Language Model Meta AI 2 (LLaMA-2) is a new set of models created by Meta, released to the public on July 18, 2023. Unlike its initial version, which was restricted to researchers, LLaMA-2 is open-source and available for free.

LLaMA-2 models come in three sizes: 7, 13, and 70 billion parameters. All versions have been trained on 2 trillion tokens and have a context length of 4,096 tokens. On top of that, all model sizes come with a "chat" version, called LLaMA-2-chat, which is more versatile for general-purpose conversational scenarios than the base model LLaMA-2.

The fine-tuning process of LLaMA-2-chat involved two stages:

  • Supervised fine-tuning: The model was first tuned on public instruction datasets and over a million human annotations, making it safer and more suitable for chat. Its outputs are guided by a specific set of prompts, and the loss function encourages both diversity and relevance.
  • Reinforcement learning from human feedback (RLHF): As with GPT-4, human feedback on the model's outputs is collected and then used to further improve and optimize the model.
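The RLHF stage typically begins by training a reward model on human preference pairs. A common choice is a Bradley-Terry-style loss, sketched here in plain Python with scalar rewards standing in for a real reward network (this is a generic sketch, not Meta's exact formulation):

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).

    The loss is small when the reward model scores the human-preferred
    answer higher than the rejected one, and large otherwise.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The reward model agrees with the human label -> low loss.
print(preference_loss(2.0, -1.0))
# The reward model disagrees -> high loss.
print(preference_loss(-1.0, 2.0))
```

Once trained, the reward model provides the feedback signal used to optimize the chat model's policy.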

The following diagram shows how the training process for LLaMA-2-chat works:

To access the model, you need to submit a request on Meta's website (the form is available at https://ai.meta.com/resources/models-and-libraries/llama-downloads/). Once a request is submitted, you will receive an email with the GitHub repository, where you will be able to download the following assets:

  • Model code
  • Model weights
  • README (User Guide)
  • Responsible Use Guide
  • License
  • Acceptable Use Policy
  • Model Card

Falcon LLM

Falcon LLM is part of a new trend in language model development: creating lighter models with fewer parameters while prioritizing the quality of the training data. In contrast to a large model like GPT-4, which needs a lot of computational horsepower and weeks of training time, Falcon LLM is much more efficient.

Falcon LLM was developed by Abu Dhabi's Technology Innovation Institute (TII) and released in May 2023. It is an autoregressive, decoder-only transformer model with 40 billion parameters, trained on over 1 trillion tokens. It is also published in a lighter 7-billion-parameter version, and both sizes come in an "Instruct" variant specifically fine-tuned to follow commands given by the user.

Since launch, Falcon LLM has sat near the top of the Open LLM Leaderboard, just behind some versions of LLaMA. This raises the question: how does a model with "just" 40 billion parameters perform so well?

The key lies in the training dataset. Falcon was built with carefully curated data: TII developed specialized tools and a unique data pipeline to harvest relevant content from the web. The pipeline applies a multitude of filters and deduplication mechanisms so that only high-quality content is kept. The resulting dataset, named RefinedWeb, has been released by TII under the Apache 2.0 license.
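A toy version of such a filter-and-deduplicate pipeline is sketched below; the thresholds and quality rules here are illustrative inventions, not RefinedWeb's actual ones:

```python
import hashlib

def quality_filter(doc: str, min_words: int = 5) -> bool:
    """Illustrative quality rule: keep documents with enough words
    and a reasonable share of alphabetic characters."""
    words = doc.split()
    if len(words) < min_words:
        return False
    alpha = sum(ch.isalpha() for ch in doc)
    return alpha / max(len(doc), 1) > 0.6

def deduplicate(docs):
    """Exact deduplication via content hashing, preserving order."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick brown fox jumps over the lazy dog.",  # near-identical copy
    "buy now!!! 1234567890 $$$",                     # low-quality spam
    "Large language models learn from web-scale text corpora.",
]
cleaned = [d for d in deduplicate(corpus) if quality_filter(d)]
print(len(cleaned))  # the copy and the spam are dropped: 2 documents remain
```

Real web-scale pipelines use fuzzy deduplication (e.g., MinHash) and many more filters, but the shape of the computation is the same.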

Despite using only a fraction of the training compute of models like GPT-3 and PaLM-62B (roughly 75%-80% of their compute budgets), Falcon achieves performance comparable to or better than these earlier large-scale language models, thanks to its focus on superior data quality as well as various optimizations.

Mistral

The last entry on our list is the Mistral series, developed by Mistral AI, a company founded in April 2023 by former Meta Platforms and Google DeepMind scientists. Based in France, Mistral AI has rapidly established itself as a major player in the field, attracting significant investment and quickly becoming one to watch for its open-source LLMs that emphasize transparency and easy adoption.

Mistral-7B-v0.1 is one of them: a decoder-only transformer with 7.3 billion parameters, optimized for generative text tasks. It features novel architectural choices such as grouped-query attention (GQA) and sliding-window attention (SWA), which allow it to beat strong baselines on many benchmarks.
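To see what grouped-query attention means in practice, the sketch below shares each key/value head across a group of query heads, which shrinks the KV cache without changing the output shape. The dimensions are toy values, not Mistral's actual configuration:

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """Toy grouped-query attention (no causal mask, for brevity).

    q: (n_q_heads, seq, d) query heads
    k, v: (n_kv_heads, seq, d) shared key/value heads
    Each group of n_q_heads // n_kv_heads query heads reuses the
    same K/V head, shrinking the KV cache by that factor.
    """
    n_q_heads, seq, d = q.shape
    group = n_q_heads // n_kv_heads
    # Repeat each KV head so every query head has a partner.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4, 16))  # 8 query heads
k = rng.normal(size=(2, 4, 16))  # only 2 KV heads (4x smaller KV cache)
v = rng.normal(size=(2, 4, 16))
out = grouped_query_attention(q, k, v, n_kv_heads=2)
print(out.shape)  # same shape as full multi-head attention: (8, 4, 16)
```

Sliding-window attention is complementary: it additionally restricts each position to attend only within a fixed-size window of recent tokens, bounding memory as sequences grow.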

There is also a fine-tuned variant of the model, called Mistral-7B-Instruct, tailored for general-purpose conversational use. This variant beats all other 7B-parameter LLMs on MT-Bench, an evaluation framework that uses an LLM as a judge. As with many other transformer-based models, Mistral can be accessed and downloaded through the Hugging Face Hub.

The following table summarizes the main differences between the three models, LLaMA-2 vs. Falcon LLM vs. Mistral:

| | LLaMA-2 | Falcon LLM | Mistral |
|---|---|---|---|
| Company or institution | Meta | Technology Innovation Institute (TII) | Mistral AI |
| First release | July 2023 | May 2023 | September 2023 |
| Architecture | Autoregressive transformer, decoder-only | Autoregressive transformer, decoder-only | Transformer, decoder-only |
| Sizes and variants | Three sizes: 7B, 13B, and 70B, alongside the fine-tuned version (chat) | Two sizes: 7B and 40B, alongside the fine-tuned version (instruct) | 7B size alongside the fine-tuned version (instruct) |
| Licenses | A custom commercial license, available at https://ai.meta.com/resources/models-and-libraries/llama-downloads/ | Commercial Apache 2.0 license | Commercial Apache 2.0 license |
| How to use | Submit the request form at https://ai.meta.com/resources/models-and-libraries/llama-downloads/ and download from the GitHub repo; also available on the Hugging Face Hub | Download or use the Hugging Face Hub Inference API/Endpoint | Download or use the Hugging Face Hub Inference API/Endpoint or Azure AI Studio |

 

Summary

This chapter distinguished between proprietary and open-source models and explained the strengths and weaknesses of each kind. It took a deep dive into the architecture and technical aspects of prominent models such as GPT-4, Gemini, Claude 2, LLaMA-2, Falcon LLM, and Mistral. It also provided guidance to help developers choose the right LLM for building an AI-powered application, which is essential for making such applications relevant to specific industry scenarios. In the next chapter, we move on to practical work with LLMs in applications.
