Hi everyone,

I have been experiencing some weird problems lately, starting with the OS becoming unresponsive “randomly”. After reinstalling multiple times (different filesystems, XFS and BTRFS; different NVMe slots with different NVMe drives; same results), I have narrowed it down to heavy I/O on the NVMe drive. Most of the time I can’t even pull up dmesg and have to force a shutdown, as ZSH gives an Input/output error no matter the command. A couple of times I was lucky enough for the system to stay somewhat responsive, so that I could pull up dmesg.

It gives a “controller is down, resetting” message, which I’ve seen on the ArchWiki for some older Kingston and Samsung NVMe drives. The wiki gives kernel parameters to try, but they didn’t help much (they pretty much disable ASPM on PCIe).
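For anyone following along: the parameters usually suggested for this message are `nvme_core.default_ps_max_latency_us=0` and `pcie_aspm=off`. Below is a minimal Python sketch, assuming those two are the ones in question, that checks whether they actually made it onto the running kernel's command line. The helper name `missing_params` is made up for illustration.

```python
# Assumed workaround parameters; adjust to whatever you actually added
# to your bootloader configuration.
WANTED = {
    "nvme_core.default_ps_max_latency_us": "0",
    "pcie_aspm": "off",
}


def missing_params(cmdline, wanted=WANTED):
    """Return the wanted key=value pairs that are absent from a kernel command line."""
    present = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep:  # skip bare flags such as "rw" or "quiet"
            present[key] = value
    return [f"{k}={v}" for k, v in wanted.items() if present.get(k) != v]
```

On the affected machine this could be called as `missing_params(open("/proc/cmdline").read())`; an empty list means both parameters are active.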

What did help a bit was reverting a recent BIOS upgrade on my MSI Z490 Tomahawk: under heavy I/O the system no longer crashes immediately but remounts the filesystem read-only instead. The issue still persists, though. I have additionally run memtest86 for 8 passes, no issues there.

I have tried running the LTS kernel, but this didn’t help. The strange thing is, this error does not happen on Windows 11.

Has anyone experienced this before and can give some pointers on what to try next? I’m at my wits’ end here.

EDIT: When this issue first appeared, I assumed the Kioxia drive was defective, and the manufacturer replaced it. The issue still happens with the replacement drive, as well as with the Samsung drive. I thus assume that neither drive is defective (smartctl also seems to think so).

Here are hardware and software details:

  • Arch with the latest Zen kernel, 6.7.4; it happened with older kernels too (tried regular, LTS and Zen)
  • BTRFS on LUKS
  • i9-10850K
  • MSI Z490 Tomahawk
  • G.Skill 3200 MHz RAM, 32GB, DDR4
  • Samsung 970 Evo 1TB & Kioxia Exceria G2 1TB (tested both drives, in both slots each, over multiple installs)
  • Vega 56 GPU
  • Be quiet Straight Power 11 750W PSU
2 points

This is probably a strange question, but did you memtest86 your memory already?

3 points

Did you even read the post?

2 points

I’m sorry, did I overlook the memory tested part? I don’t see it?
Edit: I see it now, sorry

5 points

I did, yes. One of the first ideas I got.

1 point

I see it now, dammit sorry

3 points

The only thing I can think of is to try the drives in a different system and see how they behave (same OS and configuration).

If they behave the same then that rules out everything except the drives themselves and the OS.

Considering how you mentioned the behavior is better in Windows, it sounds like a software issue, but you never know until you try.

1 point

Unfortunately I have no other system at hand at the moment that can take NVMe drives :( I could try using Windows for a couple of days to see whether the issue is really Linux-related, but I am trying to avoid that lol

1 point

Maybe phone a friend?

1 point

Maybe even a PCIe passthrough to a VM could do the trick if you’re desperate lol (with Linux living on a separate drive)

Orrrr maybe even try FreeBSD… (or macOS, but eww gross don’t test that)

2 points

The other way to look at it is to stick the drives into a USB enclosure. That gets you away from the PC’s 3.3 V rail. If you then hang the drive enclosure off of a powered hub/dock, you are definitely way outside of the PC’s power supply problems.

Here’s one that I have, hopefully it’s still made halfway good. https://www.amazon.com/gp/product/B08G14NBCS/

1 point

Not a bad idea actually, totally didn’t think about that.

2 points

Which of the drives does this happen with? Or does it happen with both?

1 point

Happens with both drives. I have tried each possible permutation (Samsung in slot 1 and 2, Kioxia in slot 1 and 2, and even installing only one drive at a time).

1 point

Are both drives fully encrypted with LUKS? Is trim enabled in both crypttab and fstab?

3 points

Both drives were encrypted: the Samsung as root drive (encrypted except for the EFI partition), and the Kioxia fully encrypted, mounted via crypttab with a key file residing on the encrypted Samsung partition for automatic unlock. As I have been reinstalling quite often, I couldn’t be bothered to set up encryption for the second drive again, so it stays unused atm. Trim is enabled via a kernel parameter, but no longer in the fstab directly (I’m running BTRFS now, and from what I’ve gathered, passing the ssd option to BTRFS is enough to enable trim; verified with lsblk --discard).
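As a cross-check of the `lsblk --discard` result, here is a rough Python sketch that reads the same information from sysfs. It assumes the usual `/sys/block/<device>/queue/discard_max_bytes` layout; the function name and the `sysfs_root` parameter are my own, added so it can be tried against a fake directory tree.

```python
from pathlib import Path


def discard_supported(device, sysfs_root="/sys/block"):
    """True if the block layer reports a non-zero discard limit for the device."""
    limit_file = Path(sysfs_root) / device / "queue" / "discard_max_bytes"
    # A value of 0 means discard (trim) requests will not be passed down.
    return int(limit_file.read_text().strip()) > 0
```

Checking both the raw drive (for example `nvme0n1`) and the opened LUKS mapping (for example `dm-0`) is worthwhile, since the mapping can report 0 even when the drive itself supports trim. Both device names are examples, not taken from the post.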

4 points

Boot a live ISO with the flags recommended in the kernel message and do some tests on the bare drives. That way you won’t have the filesystem and subsequently the rest of the system giving out on you while you’re debugging.
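One possible bare-drive test is a plain sequential read of the whole device, which generates sustained I/O without writing anything. The sketch below is only an illustration of that idea, not a recommended tool; `read_stress` and its parameters are hypothetical names. Reads go through the page cache here, so it is gentler than a direct-I/O benchmark.

```python
from typing import Optional, Tuple


def read_stress(path: str, block_size: int = 1 << 20,
                max_bytes: Optional[int] = None) -> Tuple[int, Optional[str]]:
    """Sequentially read from path; return (bytes_read, error message or None)."""
    total = 0
    try:
        # Opened read-only and unbuffered, so nothing on the device is modified.
        with open(path, "rb", buffering=0) as dev:
            while max_bytes is None or total < max_bytes:
                want = block_size if max_bytes is None else min(block_size, max_bytes - total)
                chunk = dev.read(want)
                if not chunk:  # end of device or file
                    break
                total += len(chunk)
    except OSError as exc:
        # An I/O error shows up here, together with the offset reached so far.
        return total, f"{exc.__class__.__name__}: {exc}"
    return total, None
```

From the live ISO it would be run as root against the raw device, for example `read_stress("/dev/nvme0n1")` (device name assumed), while watching dmesg in a second terminal.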

1 point

Which tests are you referring to exactly? I have read about badblocks, for example, and that it is not much use for SSDs in general: bad blocks are remapped automatically in the drive’s controller, so they remain invisible to the OS. SMART values look great for both drives, about 20TBW on the Samsung drive, and a lot less on the Kioxia drive.
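On the TBW figure: NVMe drives report "Data Units Written" in the SMART log, where one unit stands for 1000 blocks of 512 bytes. A small Python sketch of the conversion follows; the input value in the usage note is a made-up number chosen to land on 20 TB, not a reading from either drive.

```python
# One NVMe "data unit" is 1000 blocks of 512 bytes.
BYTES_PER_DATA_UNIT = 512 * 1000


def data_units_to_tb(data_units: int) -> float:
    """Convert the NVMe 'Data Units Written' counter to decimal terabytes."""
    return data_units * BYTES_PER_DATA_UNIT / 1e12
```

For example, `data_units_to_tb(39_062_500)` gives 20.0.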

5 points

SSDs not getting enough power? I’d test with a different PSU. I had a problem with my SSD failing, and changing the PSU fixed it. Apparently the 3.3 VDC rail is routed on the motherboard straight to M.2/PCIe devices without any conversion.

1 point

Unfortunately I don’t have a spare PSU, but I might try to measure the 3.3 volt rail with a multimeter (don’t own an oscilloscope unfortunately) while under load and see what happens
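For judging the multimeter reading: the ATX specification allows ±5% on the 3.3 V rail, which works out to roughly 3.135 V to 3.465 V. A tiny Python sketch of that check, with the function name and defaults being my own choices:

```python
def within_tolerance(measured_v: float, nominal_v: float = 3.3,
                     tolerance: float = 0.05) -> bool:
    """True if a measured voltage lies inside nominal ± tolerance (as a fraction)."""
    low = nominal_v * (1 - tolerance)
    high = nominal_v * (1 + tolerance)
    return low <= measured_v <= high
```

A multimeter only shows an averaged value, so a passing reading does not rule out short dips under load.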

0 points

You could also try one of those USB-to-M.2 adapters and see if that works.

1 point

Do you have a spare setup where you can boot from that same SSD? Literally any laptop would work plug and play, and that would rule out the possibility of it being the OP’s motherboard.

1 point

Yeah, an oscilloscope would be handy for hunting spikes; it’s a bit harder with a standard multimeter. Are you sure you don’t know anyone with a spare PSU to borrow?

2 points

What brand is Kioxia?

4 points

It’s the former Toshiba Memory, spun off from Toshiba, so not some unknown shit. Very respected brand, I’d say.

3 points

Yeah, I did a bit of reading. I was about to blame the no-name Chinese storage! There’s so much garbage NVMe stuff floating around lately.

1 point

there is so much garbage flash in general lately
