Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

OK this was a bit funny:

  Top HW failure modes: 
  * GPU falling off the bus
I honestly thought "do they mean GPUs falling off a bus entering the data center" and then realized its actually the connectivity, as they mention in the next line

  GPUs falling off: In this case, GPUs are not detected by the host on PCIe.


The bits on the bus go round and round!

There is a lot of interesting yet unpublished work on 'data center' scale compute complexes. It was a rabbit hole I fell into several times while at Google.


They do publish some of that, or at least they used to. In particular "The Datacenter as a Computer" [0] was a very interesting read.

[0] https://research.google/pubs/the-datacenter-as-a-computer-an...



Back when EVGA was still selling GPUs ...


And offering warranty. And not doing stealth total component change under same sku.


They did it to finance their street racing habit, I'm sure. :P


Speaking for myself (and I guess anyone else dealing with pcie riser hell in on-prem deep learning setups), its nice to see the massive orgs dealing with pretty much the same exact pain points as not-so-massive orgs.


I was imagining that some sys admin has to walk to the server, take out the GPU, blow against the PCI-E pins like a game cartridge, and put it back to try again.


More to do with bent pins, material obstruction, or something as trivial as cable management (eg: bundles of qsfp weighing down the ports that are press-fitted not soldered).


"GPU has fallen off the bus" is an actual error message nvidia.ko prints to dmesg in this case :p


Brings a whole new meaning to bus factor


I’ve never met a GPU that could survive getting hit by a bus.


> GPU falling off the bus

I'm wondering if we could prompt llama3 with the above statement. What kind of response would it give?


With temperature set to 1, it recognizes the joke, but proceeds to explain what the "bus" is in computer terms, picks a problem this prompt could mean, and explains how to solve it. In ~20 tries it always gave me something along the lines of:

The infamous "GPU falling off the bus" issue!

This problem typically occurs when a graphics processing unit (GPU) is not properly seated or connected to its expansion slot, such as PCIe, on a motherboard.

Here are some troubleshooting steps to help resolve the issue:

(numbered list of steps or options follows)

Tested on Llama 3 Instruct 7B Q8_0, because that one fits entirely on my GPU.


+1, interesting findings! I like how it was able to infer the meaning from such a short phrase in a limited context.


It's actually a very common phrase on forums, I think because it's an actual error that Linux will report: https://askubuntu.com/questions/868321/gpu-has-fallen-off-th.... I've also never heard of it, but it seems like it must appear a lot in the training data and probably about 0 times is referring to a bus on the road.


In my testing, both Llama 3 and its abliterated (uncensored) variant from[0] almost always remarked more or less directly that they see the joke in the phrase, so either they've seen the other meaning in training, or inferred it.

--

[0] - https://news.ycombinator.com/item?id=40665721


Oh I agree it probably inferred the joke. I was actually more surprised that it knew the real meaning of the phrase because I as a human did not, until I looked it up and saw how common it is.


Please use the word ablated instead. That article's title is not using a real word. I'm assuming it's the author's English issue, since they called the model "helpfull" instead of "helpful".


Oops. I actually originally wrote "ablated", then changed it to be consistent with the title.


To be specific, the system prompt used was (default in LM Studio config for Llama 3 V2):

You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.

And then the query was:

GPU falling off the bus

And yes, I imagine it read that query as ending with an implied "pls help!".


like they did to our dear Anton in Silicon Valley


A GPU falling off the bus would be one mega flop


Haha… It is a a ‘Tera Flop’, as they are falling in the ground…


The audience in the back goes clap clap clap, chapeau bas.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: