From: Max Reitz
Subject: Re: [Qemu-devel] qcow2 corruption repair can not proceed due to bad snapshot
Date: Mon, 23 Nov 2015 18:57:07 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0

On 20.11.2015 20:33, Brian Taber wrote:
> I recently ran across an issue (completely my own fault) that others
> have encountered with varying details/success in fixing. I had a VM
> stuck in shutdown (Windows asking/waiting to kill a program) that I
> thought was already down when I created a snapshot on the 3 disks
> attached to the VM. After running the snapshot command I went back to
> the machine and instead of just turning off (which would have been
> better), I let the shutdown complete.
>
> Needless to say all 3 images had corruption to varying degrees. The
> first disk, system disk, was the worst. The other 2 had databases and
> were repairable via the "qemu-img check -r all image.img" command (with
> a bunch of messages/warnings). I suspect the limited activity on
> shutdown helped save them. The system disk would not perform a check,
> it encountered:
>
> qemu-img: Could not open 'image.img': Could not read snapshots: File too
> large
>
> Searching online for this error turns up different repair methods. I
> did not want to use an older version of qemu-img as suggested in some
> posts, so I pulled the latest source and compiled it, but I got the same
> error when trying to check or convert the image. I dug into the qcow2
> code and silenced that particular error by modifying block/qcow2.c at
> about line 1136 to ignore a return value of 27 (EFBIG) and set res to 0
> (probably a really bad thing to do; I only did this to get to the
> checks). With that change the check actually ran, and the repair run
> fixed the image to the point where the checks came back OK.
> Unfortunately the image was still broken: trying to list snapshots or
> use the image returned the "File too large" error again.
>
> Ultimately I was able to repair the system disk by converting the image
> to raw as suggested in other posts now that it was repaired and was able
> to start the machine again right where it left off (or at least it
> appears so). Disk checks within the machine return OK. One thing I am
> unsure of is how safe the qemu images are with regard to snapshots. I
> dare not try to do anything with them as they are, so I will convert
> all of them to raw and then back into qemu images.
They are safe, but you may only have one program writing to an image
file at a time. Therefore, if you want to do snapshots of a live VM, you
have to do that through the respective qemu instance (e.g. using the QMP
command blockdev-snapshot-internal-sync).
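
As a rough illustration of going through the running qemu instance, here is a minimal Python sketch that sends blockdev-snapshot-internal-sync over a QMP Unix socket. The socket path, device name and snapshot name are placeholders, and it assumes qemu was started with a QMP socket (e.g. -qmp unix:/path/to/qmp.sock,server,nowait); treat it as a sketch, not a tested tool.

```python
import json
import socket


def build_snapshot_command(device: str, name: str) -> bytes:
    """Encode a blockdev-snapshot-internal-sync request as one QMP line."""
    request = {
        "execute": "blockdev-snapshot-internal-sync",
        "arguments": {"device": device, "name": name},
    }
    return json.dumps(request).encode("utf-8") + b"\r\n"


def _read_reply(reader) -> dict:
    """Return the next command reply, skipping asynchronous QMP events."""
    while True:
        message = json.loads(reader.readline())
        if "return" in message or "error" in message:
            return message


def take_internal_snapshot(qmp_socket_path: str, device: str, name: str) -> dict:
    """Ask the qemu instance that owns the image to create the snapshot."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(qmp_socket_path)
        reader = sock.makefile("r", encoding="utf-8")
        reader.readline()  # greeting banner that qemu sends on connect
        # QMP only accepts other commands after capabilities negotiation.
        sock.sendall(json.dumps({"execute": "qmp_capabilities"}).encode() + b"\r\n")
        _read_reply(reader)
        sock.sendall(build_snapshot_command(device, name))
        return _read_reply(reader)
```

Calling take_internal_snapshot("/path/to/qmp.sock", "drive0", "before-upgrade") would return qemu's reply as a dict; a reply containing "error" means the snapshot was not taken.
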
> Even though this is entirely due to creating a snapshot while the disk
> is in use, some thoughts:
>
> - if a user is trying to run a repair it should not error about
> snapshots and proceed with checks/repairs and allow convert if possible.

I don't think this should be done silently. If qcow2 encounters errors
during the repair process, those are errors which generally mean “Trying
to repair this image may or will damage it further”. Therefore, at least
there should be a flag the user has to set to tell the qcow2 driver to
ignore errors as far as possible.

Another way to do it would be a runtime option for qcow2 to ignore the
snapshot table (because apparently most of the people who ran a qemu-img
snapshot operation on an in-use image noticed that something went wrong
because loading the snapshot table fails). qemu-img convert could set
that option automatically so you can convert a qcow2 file with an
invalid snapshot table to raw (ignoring the snapshots).

> - if possible, before actually doing a snapshot, check if the file is in
> use to avoid this situation altogether

Yes, this has been proposed a couple of times and is something we will
have to do sooner or later, since so many people make the mistake of
using qemu-img on a qcow2 file that is in use by a VM (knowingly or by
mistake).

I don't know the current status of this. Some people proposed a
qcow2-specific flag in the file, but the obvious problem is that this
flag will be a nuisance if some process accessing the qcow2 file
crashes. This would be solvable by either abusing qemu-img amend for
removing that flag, or by adding a new option to qemu-img check which
allows you to override that flag if you are sure that no process is
accessing the file anymore.

Other people suggested using flock(), but that would be a Unix-specific
solution.
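
To make the flock() idea concrete, here is a minimal Python sketch of taking an exclusive advisory lock before touching an image. The function name is made up for illustration, and this is not how qemu or qemu-img currently behave; it only shows the mechanism. The lock is advisory, so it protects only against other programs that also ask for it.

```python
import fcntl
import os


def open_image_exclusively(path: str) -> int:
    """Open path read-write and take a non-blocking exclusive flock().

    Raises BlockingIOError if another open file description already
    holds a lock on the same file.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        raise
    return fd
```

The kernel drops the lock when the descriptor is closed, including when the holder crashes, which is the main practical difference from a flag stored inside the file.
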
I'm personally leaning towards the qcow2 flag. Having a way to reset it
using some qemu-img subcommand should suffice for the rare and
not-to-be-expected (;-)) case of qemu crashing.
> I would submit a patch, but I do not know enough about the possible
> repercussions of ignoring an error and repairing/converting.
Nobody knows; it depends on what exactly in the image is broken. Note
that repairing a qcow2 image basically only means repairing the refcount
information. The snapshot table will still be broken, even if you get
qemu-img check -r all to run.
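
To look at the snapshot table pointer without opening the image through the qcow2 driver, something like the following Python sketch can read the two relevant header fields. The field offsets (nb_snapshots at byte 60, snapshots_offset at byte 64, both big-endian) come from the qcow2 specification; the function name and the "offset lies inside the file" plausibility check are my own assumptions and are not the checks qemu performs.

```python
import os
import struct

QCOW2_MAGIC = b"QFI\xfb"


def read_snapshot_table_location(path: str) -> dict:
    """Read nb_snapshots and snapshots_offset from a qcow2 header (read-only)."""
    with open(path, "rb") as image:
        header = image.read(72)
        file_size = os.fstat(image.fileno()).st_size
    if len(header) < 72 or header[:4] != QCOW2_MAGIC:
        raise ValueError("not a qcow2 image")
    (version,) = struct.unpack_from(">I", header, 4)
    # 4-byte snapshot count followed by the 8-byte table offset.
    nb_snapshots, snapshots_offset = struct.unpack_from(">IQ", header, 60)
    return {
        "version": version,
        "nb_snapshots": nb_snapshots,
        "snapshots_offset": snapshots_offset,
        # Crude plausibility check only (an assumption, see above).
        "table_inside_file": nb_snapshots == 0 or snapshots_offset < file_size,
    }
```

This only reads the file, but the values are meaningful only if no qemu process is writing to the image at the same time.
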
I think it would be better to focus on allowing even terminally-broken
qcow2 images to be converted to raw (or a fresh qcow2 image), saving as
much data as possible.
Max