Should TrueNAS use hardware RAID, or should ZFS manage the disks directly?
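ZFS should manage the disks directly. It needs direct visibility of each disk, its SMART data, and its error states, so an HBA or a controller in JBOD mode is usually preferred over hardware RAID. Once the disks are presented directly, design the vdevs around performance, capacity, and rebuild windows.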
1. Conclusion and scope
Prepare the client and server versions, domain membership, DNS and gateway settings, network location, full error text, event timestamps, and recent changes. The reserved example domain corp.example is used throughout; no customer domain, IP address, account, or device identifier is included.
This issue falls under Backup, NAS and business continuity. Logs and configuration can often be collected remotely first. Bulk permission changes, switch-path work, production cutovers, and recovery drills should use a controlled implementation window.
2. Symptoms and environment
- Capture the complete error text, event-log timestamp, and failed action rather than relying on a verbal description.
- Record the affected scope, first occurrence, reproducibility, and whether the result changes on another subnet.
- A successful backup job only means the job completed without a reported error; it does not prove restore-point integrity, application consistency, repository health, or bootability.
3. Troubleshooting sequence
- With TrueNAS and ZFS, present each disk directly to the operating system through an HBA or a controller in JBOD mode. ZFS then sees individual disks, their SMART data, and their real error state; hardware RAID virtual disks hide that information and hide the redundancy from ZFS.
- Choose mirror or RAIDZ vdevs based on capacity, IOPS, rebuild window, and fault-tolerance requirements. Plan the layout up front, because pool topology cannot be changed as freely as a conventional RAID array.
- Before production, record SMART baselines, serial numbers, slot mapping, and replacement procedure so an alert identifies the physical drive, not merely a device name.
- Snapshots depend on the original storage and are suitable for short-term rollback; independent backups must cross devices or failure domains and be recovery-tested.
- Change one variable at a time and export the current configuration before making changes.
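The mirror-versus-RAIDZ choice in the list above can be put into rough numbers. The following is a minimal sketch, not a sizing tool: the function name `vdev_usable` is hypothetical, and the arithmetic ignores ZFS metadata, padding, and reserved space, so real usable capacity will be somewhat lower.

```shell
#!/bin/sh
# Rough usable capacity and tolerated disk failures for ONE vdev.
# Assumption: ignores ZFS metadata, padding, and reserved space.
# usage: vdev_usable <mirror|raidz1|raidz2|raidz3> <disk_count> <disk_size>
vdev_usable() {
    layout=$1
    disks=$2
    size=$3
    case $layout in
        mirror)
            # Every disk holds a full copy, so all but one can fail.
            if [ "$disks" -lt 2 ]; then
                echo "a mirror needs at least 2 disks" >&2
                return 1
            fi
            parity=$((disks - 1))
            ;;
        raidz1) parity=1 ;;
        raidz2) parity=2 ;;
        raidz3) parity=3 ;;
        *)
            echo "unknown layout: $layout" >&2
            return 1
            ;;
    esac
    if [ "$disks" -le "$parity" ]; then
        echo "too few disks for $layout" >&2
        return 1
    fi
    # Usable space is roughly (data disks) x (disk size).
    echo "usable=$(( (disks - parity) * size )) tolerated_failures=$parity"
}

# Example: six 8 TB disks in RAIDZ2
#   vdev_usable raidz2 6 8    # prints: usable=32 tolerated_failures=2
```

The result covers capacity and fault tolerance only. IOPS and rebuild window still have to be judged separately, since they depend on the number of vdevs and on disk size rather than on this arithmetic.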
zpool status
zpool list
smartctl -a /dev/sdX
Replace server names, domains, and paths with values verified for your environment. Do not copy real IP addresses, domains, or accounts from an unrelated environment.
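To support the serial-number and slot-mapping baseline mentioned above, the serial can be pulled out of smartctl output with a small filter. This is a sketch under stated assumptions: the helper name `serial_from_smartctl` is hypothetical, and it assumes the output contains a "Serial Number:" line (matched case-insensitively, since the capitalisation differs between drive types).

```shell
#!/bin/sh
# Reads `smartctl -i` (or `smartctl -a`) output on stdin and prints the
# drive serial number, or nothing if no such line is present.
serial_from_smartctl() {
    awk -F: 'tolower($1) == "serial number" {
        sub(/^[ \t]+/, "", $2)   # trim leading whitespace
        sub(/[ \t]+$/, "", $2)   # trim trailing whitespace
        print $2
        exit
    }'
}

# Example (needs smartmontools and sufficient privileges; /dev/sdX is a
# placeholder, as in the commands above):
#   smartctl -i /dev/sdX | serial_from_smartctl
```

Recording the printed serial next to the physical slot label gives the mapping needed to pull the right drive when an alert names only a device.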
4. Safe remediation and rollout
Start with read-only queries, configuration exports, and one-system validation. Once the root cause is confirmed, define the target scope, change window, and rollback method. Include recovery testing in monthly or quarterly operations, rotating full-machine, file, database, and critical-application tests while recording recovery time.
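The read-only first step can be sketched as a small collection script that saves the pool state before anything is changed. The function name `collect_readonly`, the output directory name, and the file names are illustrative assumptions; the two commands are the read-only ones used in this article.

```shell
#!/bin/sh
# Save read-only pool diagnostics into a directory for later comparison.
# Assumption: directory and file names are illustrative only.
# usage: collect_readonly [output_dir]
collect_readonly() {
    outdir=${1:-./zfs-baseline-$(date +%Y%m%d-%H%M%S)}
    mkdir -p "$outdir" || return 1
    for cmd in "zpool status" "zpool list"; do
        name=$(echo "$cmd" | tr ' ' '_')
        if command -v "${cmd%% *}" >/dev/null 2>&1; then
            # Any error text is captured in the file instead of aborting.
            $cmd > "$outdir/$name.txt" 2>&1 || true
        else
            echo "${cmd%% *} not found on this host" > "$outdir/$name.txt"
        fi
    done
    # Print the location so the caller can archive or compare it.
    echo "$outdir"
}

# Example:
#   collect_readonly /tmp/before-change
```

Running it once before the change window and once after gives two directories that can be compared with diff during validation.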
5. Validation, rollback and common mistakes
Do not stop when the service works once. Revalidate with the user workflow, logs, a restart or fresh sign-in, another network location where relevant, and the next policy or backup cycle.
Validation and rollback checks
- Test full-machine, file, database, and application recovery separately and record RTO, RPO, credentials, network isolation, and acceptance results.
- Check repository capacity, file-system health, integrity checks, retention chains, synthetic operations, and immutable or offline copies.
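Where the repository is a ZFS pool, the capacity check above can be scripted. The sketch below assumes the input comes from `zpool list -H -o name,capacity`, which prints one tab-separated line per pool with a percentage such as 42%. The function name `pools_over_capacity` is hypothetical, and the 80 percent default is a common rule of thumb rather than a value from this article.

```shell
#!/bin/sh
# Reads tab-separated "name<TAB>NN%" lines on stdin and prints every pool
# at or above the given fill threshold (default 80, an assumed value).
# usage: zpool list -H -o name,capacity | pools_over_capacity [limit]
pools_over_capacity() {
    limit=${1:-80}
    awk -F'\t' -v limit="$limit" '{
        pct = $2
        sub(/%$/, "", pct)              # strip the trailing percent sign
        if (pct + 0 >= limit + 0) print $1 " " pct "%"
    }'
}

# Example:
#   zpool list -H -o name,capacity | pools_over_capacity 80
```

Empty output means no pool reached the threshold. Capacity is only one of the checks in the list; integrity, retention chains, and offline copies still need their own verification.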
Common mistakes to avoid
- Treating a successful job or an existing snapshot as proof of recoverability.
- Running recovery tests on the production network and causing identity conflicts.
- Keeping every copy on the same appliance without an independent or offline copy.
Need an assessment based on your actual environment?
Send the exact error, screenshots, operating system and application versions, a high-level network diagram, the affected scope, and the steps already attempted. We will first determine whether the issue is suitable for remote troubleshooting or requires an on-site change window, then confirm scope and pricing.
