Longhorn: Failed Upgrade from v0.8.1 to v1.0.0 Caused by PVs Created Before v0.6.2

This post is a mirror of https://forums.rancher.com/t/failed-upgrade-from-v0-8-1-to-v1-0-0-caused-by-pv-created-before-v0-6-2/17586

  1. I scaled down all pods that have PVCs to 0.
  2. I upgraded the Longhorn chart from v0.8.1 to v1.0.0.
  3. The pods in longhorn-system were upgraded successfully.
  4. I upgraded every volume's engine image successfully and deleted the old engine image (v0.8.1).
  5. I scaled the pods up again. Most volumes mounted successfully, but 2 pods failed to mount.
  6. I checked the longhorn-manager logs and found this entry when I scaled up the pods:

Skipping VolumeAttachment csi-5a855e5c301b037f5ca2d9661295f07e0aef8fd94bdd8bbc65e608d910d66759 for attacher io.rancher.longhorn
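
If you want to check for the same message, here is a minimal sketch of how I would grep for it, assuming the default longhorn-system namespace and the app=longhorn-manager pod label:

kubectl logs -n longhorn-system -l app=longhorn-manager --tail=500 | grep "Skipping VolumeAttachment"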

I remembered io.rancher.longhorn from the release notes, so I checked https://github.com/longhorn/longhorn/releases/tag/v1.0.0 again.

I went to https://longhorn.io/docs/0.8.1/deploy/upgrade/longhorn-manager/#migrate-pvs-and-pvcs-for-the-volumes-launched-in-v062-or-older

I executed the migration steps at https://longhorn.io/docs/0.8.1/deploy/upgrade/longhorn-manager/#migration-steps. Unfortunately, the migration failed (because I executed it after I had already upgraded to v1.0.0):

bash migrate.sh pvc-f476893d-0309-11ea-9a5c-ce8d7549db3b
FATA[2020-06-01T11:53:41Z] Error migrate PVs and PVCs for the volumes: Failed to migrate PV and PVC for the volume pvc-f476893d-0309-11ea-9a5c-ce8d7549db3b: failed to delete then recreate PV/PVC, users need to manually check the current PVC/PV then recreate them if needed: failed to wait for the old PV deletion complete
command terminated with exit code 1

When I checked the PV, it was stuck in the Terminating state:

pvc-f476893d-0309-11ea-9a5c-ce8d7549db3b 5Gi RWO Retain Terminating mine/data-mariadb-0 longhorn 204d
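
A PV stuck like this has its deletion timestamp set, but a finalizer that never clears keeps it alive. Here is a sketch of how to confirm that (the PV name is from my case; substitute your own):

kubectl get pv pvc-f476893d-0309-11ea-9a5c-ce8d7549db3b -o jsonpath='{.metadata.deletionTimestamp}{"  "}{.metadata.finalizers}{"\n"}'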

I remembered a similar issue happening to me when I tried to delete a PV a long time ago. I solved it back then by removing the finalizer while editing the PV (PLEASE DON'T DO THIS IF YOU STILL WANT YOUR PV DATA), and yes, my PV was gone, and my data with it. OK, no problem.

I checked another PV with a more recent creation time and compared it with my oldest one. There was a difference in the finalizers section: external-attacher/io-rancher-longhorn vs. external-attacher/driver-longhorn-io.
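
To see that difference side by side, something like this should work (the PV names are placeholders for your oldest PV and a newer one):

kubectl get pv <old-pv-name> <new-pv-name> -o custom-columns=NAME:.metadata.name,FINALIZERS:.metadata.finalizers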

I figured that external-attacher/io-rancher-longhorn no longer exists in v1.0.0, and that this is what left my PV stuck in the Terminating state. So I changed external-attacher/io-rancher-longhorn to external-attacher/driver-longhorn-io, re-ran the migration script from https://longhorn.io/docs/0.8.1/deploy/upgrade/longhorn-manager/#migration-steps, and the PV was migrated successfully. When I scaled up the pods, they could also mount the migrated volume.
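
For the finalizer change itself, I would use kubectl patch rather than editing by hand. A minimal sketch, assuming the stale finalizer is the only entry in metadata.finalizers (check first; adjust the index if there are others):

kubectl patch pv <pv-name> --type=json -p='[{"op": "replace", "path": "/metadata/finalizers/0", "value": "external-attacher/driver-longhorn-io"}]'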

Based on this result, I think the migration went well, but please proceed at your own risk and always back up your data before upgrading. I was lucky that I still had my data backed up.

In short, if you have already upgraded from v0.8.1 to v1.0.0 but still have PVs that need to be migrated, you can do this (at your own risk; a consolidated sketch of all three steps follows the list):

  1. Execute this to list the volumes whose PVs still need to be migrated (it prints the CSI volumeHandle of every PV still using the old io.rancher.longhorn driver):
    1. kubectl get pv --output=jsonpath="{.items[?(@.spec.csi.driver==\"io.rancher.longhorn\")].spec.csi.volumeHandle}"
  2. Edit the finalizer in each of those PVs from external-attacher/io-rancher-longhorn to external-attacher/driver-longhorn-io.
  3. Run the migration script (step 3 of https://longhorn.io/docs/0.8.1/deploy/upgrade/longhorn-manager/#migration-steps) for each volume:
    1. curl -s https://raw.githubusercontent.com/longhorn/longhorn/v0.8.1/scripts/migrate-for-pre-070-volumes.sh | bash -s -- <volume-name>
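
Here is the consolidated sketch mentioned above, tying the three steps together in one loop. It assumes the PV name equals the CSI volumeHandle (true for dynamically provisioned Longhorn PVs named pvc-<uid>, like mine) and that the stale finalizer is the only entry in metadata.finalizers:

#!/usr/bin/env bash
set -euo pipefail

# Step 1: list every volume still served by the old io.rancher.longhorn driver.
for vol in $(kubectl get pv --output=jsonpath="{.items[?(@.spec.csi.driver==\"io.rancher.longhorn\")].spec.csi.volumeHandle}"); do
  echo "Migrating ${vol}..."
  # Step 2: swap the stale finalizer so the old PV can finish deleting.
  kubectl patch pv "${vol}" --type=json -p='[{"op": "replace", "path": "/metadata/finalizers/0", "value": "external-attacher/driver-longhorn-io"}]'
  # Step 3: run the official migration script for this volume.
  curl -s https://raw.githubusercontent.com/longhorn/longhorn/v0.8.1/scripts/migrate-for-pre-070-volumes.sh | bash -s -- "${vol}"
done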

Again, do this at your own risk; I don't know whether these are the right steps. If you can still follow https://longhorn.io/docs/0.8.1/deploy/upgrade/longhorn-manager/#migration-failure-handling instead, please do it that way.

I will try to answer any questions in this thread as best I can, but I'm not a Longhorn expert.

I hope this can help someone.

If any of you can reproduce or validate this and explain, technically, exactly why this scenario happened, it would be a big help.
Thank you.
