Have you considered training a shallow MLP autoencoder, perhaps with tied weights between the encoder and the decoder, to reduce the dimensionality of the embeddings? Another (IMO better) approach off the top of my head would be semi-supervised contrastive learning with labelled similar and dissimilar video pairs, like in this notebook[1] from OpenAI.
[1]: https://github.com/openai/openai-cookbook/blob/main/examples...
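To make the first suggestion concrete, here's a minimal sketch of a tied-weight autoencoder in plain NumPy (assumptions on my part: linear activations, full-batch gradient descent, MSE loss — in practice you'd likely use PyTorch and a nonlinearity). The point of tying is that the decoder reuses the encoder matrix transposed, so only one weight matrix is learned:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_tied_autoencoder(X, k, lr=0.05, epochs=300):
    """Train encode: Z = X W + b1, decode: Xhat = Z W^T + b2."""
    n, d = X.shape
    W = rng.normal(0.0, 0.1, (d, k))   # single weight matrix, shared by both halves
    b1 = np.zeros(k)                   # encoder bias
    b2 = np.zeros(d)                   # decoder bias
    for _ in range(epochs):
        Z = X @ W + b1                 # encode: (n, k)
        Xhat = Z @ W.T + b2            # decode with the transpose: (n, d)
        E = 2.0 * (Xhat - X) / X.size  # gradient of mean squared error w.r.t. Xhat
        dZ = E @ W                     # backprop through the decoder
        # W accumulates gradient from both the encode and the decode pass
        W -= lr * (X.T @ dZ + E.T @ Z)
        b1 -= lr * dZ.sum(axis=0)
        b2 -= lr * E.sum(axis=0)
    return W, b1, b2

def encode(X, W, b1):
    return X @ W + b1

# Toy usage: 200 synthetic 16-d "embeddings" with rank-4 structure, compressed to 4-d.
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 16))
W, b1, b2 = train_tied_autoencoder(X, k=4)
Z = encode(X, W, b1)                   # the reduced-dimensionality embeddings
```

With linear activations this recovers the PCA subspace; swapping in a nonlinearity on Z gives the usual shallow nonlinear variant.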