If they are delivering a compressed file and an uncompressed file, that already disqualifies the test. The only valid way to do an ABX comparison is to round-trip one version through the encoder and decode it back to the original format, so that both samples reach the player in the same format. Anything else introduces uncontrolled dependencies (e.g. on a particular device's decoder implementation) and side channels that unblind the experiment (like loading time).
This is a common mistake, e.g. when comparing 48k vs 96k files. What you need to do is take a 96k original, downsample it to 48k, upsample it again to 96k (both using very high quality algorithms), then compare the result to the original 96k file. Otherwise you're relying on the resampling algorithm in your playback software or hardware, and I guarantee that's a compromise between quality and performance, and not valid for a scientific test.
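To make the round trip concrete, here is a minimal sketch in plain Python of the 96k -> 48k -> 96k procedure. It is an illustration under my own assumptions, not a production resampler: the names (`lowpass_taps`, `fir`, `round_trip`), the 255-tap Blackman-windowed sinc filter, and the 0.235 cutoff (about 22.6 kHz) are all choices I made for the example. For a real test you would use a dedicated high-quality resampler, but the structure is the same: low-pass, drop every other sample, zero-stuff, low-pass again.

```python
import math

def lowpass_taps(cutoff, num_taps):
    """Windowed-sinc low-pass FIR. cutoff is a fraction of the sample rate (0 to 0.5)."""
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        ideal = 2 * cutoff if x == 0 else math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        phase = 2 * math.pi * n / (num_taps - 1)
        window = 0.42 - 0.5 * math.cos(phase) + 0.08 * math.cos(2 * phase)  # Blackman
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]  # normalize to unity gain at DC

def fir(signal, taps):
    """Apply a symmetric FIR filter, keeping the output time-aligned with the input."""
    half = len(taps) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, t in enumerate(taps):
            j = i + half - k
            if 0 <= j < n:  # samples outside the signal are treated as zero
                acc += t * signal[j]
        out.append(acc)
    return out

def round_trip(x96, taps):
    """96k -> 48k -> 96k: the result is in the same format as the original."""
    x48 = fir(x96, taps)[::2]              # anti-alias filter, then keep every 2nd sample
    stuffed = [0.0] * len(x96)
    stuffed[::2] = [2.0 * s for s in x48]  # zero-stuff; gain of 2 restores the level
    return fir(stuffed, taps)              # anti-image filter

RATE = 96_000
TAPS = lowpass_taps(cutoff=0.235, num_taps=255)  # assumed filter design, see above

def tone(freq_hz, n=1600):
    return [math.sin(2 * math.pi * freq_hz * i / RATE) for i in range(n)]

def peak_mid(samples, edge=300):
    """Peak absolute value, ignoring the filter's start-up and tail regions."""
    return max(abs(s) for s in samples[edge:-edge])

in_band = tone(1_000)
err_1k = peak_mid([a - b for a, b in zip(round_trip(in_band, TAPS), in_band)])
leak_30k = peak_mid(round_trip(tone(30_000), TAPS))
print(f"1 kHz residual: {err_1k:.2e}, 30 kHz leak-through: {leak_30k:.2e}")
```

The 1 kHz tone should come back essentially unchanged, while the 30 kHz tone (which a 48k file cannot represent) should be almost entirely removed. Both the original and the round-tripped signal are 96k, so the ABX player handles them identically and the only difference left is what the 48k stage took out.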