Lecture 7 - Building AI models in the wild

A practical look at AI models

This lecture was about some practical use cases.

An interesting case with profile photos

The task was to verify whether a user's photo matches their profile picture. The model's performance degraded over time.

What can be done?

  • Retraining - made little to no impact.

What was the cause? The new iPhone camera had a much higher resolution, and that broke the model. It's important to feed the model input data in the same format and size it was trained on. I think this also applies to LLMs: in most cases they are trained on short conversations, so they struggle with huge contexts. Usually we don't throw hundreds of pages into a single message; mostly we work with a few pages at most, for example when analyzing a PDF file.
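As a minimal sketch of that input-normalization point: resize every incoming photo to the resolution the model was trained on before running inference. The `TRAIN_SIZE` value and the `preprocess` helper are hypothetical, not details from the lecture.

```python
from PIL import Image

# Assumed training-time input size; replace with whatever the model actually expects.
TRAIN_SIZE = (224, 224)

def preprocess(photo_path: str) -> Image.Image:
    """Normalize an incoming photo to the training resolution so a newer,
    higher-resolution camera doesn't silently shift the input distribution."""
    img = Image.open(photo_path).convert("RGB")
    # Downscale (or upscale) to the training resolution before inference.
    return img.resize(TRAIN_SIZE)
```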

Recommendation engine

The model's offline performance didn't match its online performance.

What happened?

The embeddings were generated by the model we were trying to outperform, so the new system showed the same effectiveness across all data sets. New data also didn't match the context that embedding model had seen.

What can be done?

  • Create separate sets of embeddings from different models. Each model in the set is more specialized, so together they should cover more cases (see the sketch below).
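A minimal sketch of that idea: combine embeddings from several encoders into one vector. The `encoders` interface and the L2 normalization are assumptions for illustration, not details from the lecture.

```python
import numpy as np

def combined_embedding(item, encoders):
    """Concatenate L2-normalized embeddings from several encoders so the
    recommender can draw on multiple, more specialized representations.
    `encoders` is a list of callables mapping an item to a 1-D vector."""
    parts = []
    for encode in encoders:
        vec = np.asarray(encode(item), dtype=np.float32)
        vec = vec / (np.linalg.norm(vec) + 1e-12)  # keep scales comparable
        parts.append(vec)
    return np.concatenate(parts)
```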

LLMs

An interesting fact from the lecture concerns how common a task is in the training data.

In internet text, rot-13 is about 60 times more common than rot-2. GPT can't do rot-2 properly because it is a statistical model and hasn't seen it often enough. Even though the concept is the same (shift each letter two positions in one direction), it can't pick up the right mapping. This happens because it doesn't detect concepts, only patterns: the rot-13 pattern is far more common than the others, so that's the only one it extracted. The general concept of shifting letters by rot-N is not in the model.
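To make the concept concrete, a generic rot-N is only a few lines of code; the difficulty for GPT is statistical, not algorithmic.

```python
def rot_n(text: str, n: int) -> str:
    """Shift each letter n positions in the alphabet (rot-13, rot-2, ...).
    The algorithm is identical for every n; only the training-data frequency
    makes rot-13 'easier' for a statistical model."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr((ord(ch) - base + n) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

print(rot_n("hello", 13))  # -> "uryyb"
print(rot_n("hello", 2))   # -> "jgnnq"
```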

Essay score

The problem is asking for the score at the start of the response. The model computes the probability of each token based on the preceding text. If we ask it to give a score first, it will return, e.g., 2/10, and the rest of the generated text, the opinion, will be written to match 2/10, because most reviews on the internet put the justification after the score. It sticks to 2/10, so only about four characters end up determining the rest of the output.
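A small illustration of that ordering effect with two hypothetical prompt templates (not from the lecture): in the first, every later token is conditioned on the already-emitted score; in the second, the score is conditioned on the model's own critique.

```python
# Score-first: the 2/10 (or whatever comes out) anchors the rest of the text.
score_first = (
    "Rate this essay from 1 to 10, then justify your rating.\n\n{essay}"
)

# Reason-first: the analysis is generated first, and the final score is
# conditioned on it instead of the other way around.
reason_first = (
    "Analyze the strengths and weaknesses of this essay step by step, "
    "and only then give a final rating from 1 to 10.\n\n{essay}"
)
```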

Hallucinations

An interesting fact about hallucinations: the model can detect them by itself. If you ask it to generate something fake and then validate it, it will know. Fabricated text tends to have very low token probabilities throughout the whole generation. So the model can score itself based on its internal probabilities and in this way detect whether the generated output contains fake information, e.g. we can simply ask it to rate how likely the content is to be fake.
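A minimal sketch of scoring text by its own token probabilities, using GPT-2 via Hugging Face transformers; the model choice and the "low average log-probability" heuristic are assumptions, the lecture only described the general idea.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def mean_logprob(text: str) -> float:
    """Average log-probability the model assigns to the text.
    Consistently low values are one rough signal that the content may be
    made up rather than grounded in what the model has seen."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Score each token given the tokens before it.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()

print(mean_logprob("Paris is the capital of France."))
print(mean_logprob("Paris is the capital of a small moon base."))
```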

Lecture: https://youtu.be/ZAGiinWiFsE?si=2rFVQgMPNbi-pnS0