Azure migration with Site Recovery

2016-08-27 azure, recovery

A few days ago, I was working on a project to move an environment to the Azure cloud.

After some research we decided to give Azure Site Recovery a try: since it can replicate our VMs while they stay online, the migration would cause almost no downtime. We created two brand-new VMs on-premises, one Windows and one Linux, then configured the vault and the VM replication for those test VMs.
It worked like a charm, with almost no downtime, so we agreed it would be the best way to move a bunch of VMs.

Everything was working great until we hit a specific Windows VM.

Symptoms

We configured the replication and waited a long time for it to reach 100% sync over a 20 Mbit link shared with other traffic. Then we scheduled downtime to perform the failover. Since the other VMs had made us confident, we didn’t perform a ‘Test failover’ (bad idea!).
After performing a ‘Planned Failover’, we waited a long time for the VM to come up, but it never did.

Then I enabled Boot Diagnostics to get a screenshot and found an unbootable machine.

Unbootable BIOS

Diagnosing

What the heck? What does this VM have that the other ones don’t?
My first guess was a misconfiguration: I might have selected the wrong OS disk during setup.

Well, this VM had been created by a P2V conversion with SCVMM in the past, and that process left it with a really weird disk configuration.

Disk configuration

Site Recovery asks you which disk is the ‘OS Disk’ during configuration. I selected C_2013-02-18T174100.vhd since it holds the OS (C:\Windows), but e827c18b-00ce-11e1831b...vhd holds the Windows boot loader.

Judging by the error message, I guessed I should have selected e827c18b-00ce-11e1831b...vhd as the ‘OS Disk’, so I created a new VM with the disks swapped so that Azure would find the boot loader that appeared to be missing.

It partially worked. At least the error message changed.

Unbootable Boot Loader

Now the boot process started, but the boot loader was unable to find the OS itself.

That raised a new question: why was the boot loader unable to find the OS?
It looked like finding out would take a long time, so we aborted the migration for this machine and restarted it on-premises.

Once the machine was back online on-premises, I went over the whole BCD configuration and it looked fine.
So I started to wonder what could make the OS disk unreachable, and what the “disk configuration” of an Azure VM looks like from the hardware abstraction layer (HAL) point of view.
To get more information I looked at another migrated VM that had also gone through a P2V process in the past and worked. That VM also has a bunch of disks (VHDs) attached to it.

Checking this machine in Windows Device Manager gave me a really important clue. Check this out.

Disks by connection

It’s important to notice that this machine shows only two devices attached to the IDE controller. One is the ‘OS Disk’ and the other is the ‘Temporary Storage’ that Azure provides to every VM.
All other disks are connected to the ‘Microsoft Hyper-V SCSI Controller’. This controller is a synthetic device! It can only be accessed once the VMBUS is up and running, and the VMBUS comes online only after its driver has been loaded.

We ran into a chicken-and-egg problem. The boot loader couldn’t find the ‘OS Disk’ because it was attached to a device that becomes available only after the OS has loaded.

Fixing it

So, to make this machine bootable, the boot loader and the OS must be in the same VHD.
To achieve this I used an Ubuntu VM with the two disks from the Windows machine attached, plus a new disk large enough to hold both.

(parted) print devices
/dev/sda (31.5GB)  
/dev/sdb (107GB)
/dev/sdc (115MB) << Disk with the System Partition
/dev/sdd (146GB) << Disk with the OS Partition
/dev/sde (161GB) << Target disk where both things will be merged
/dev/sr0 (1180kB)

Merging both disks

Checking size of the System Partition

(parted) select /dev/sdc
Using /dev/sdc
(parted) unit s
(parted) print
Model: Msft Virtual Disk (scsi)
Disk /dev/sdc: 225280s
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:

Number  Start  End      Size     Type     File system  Flags
 1      63s    204862s  204800s  primary  ntfs         boot

Checking size of the OS Partition

(parted) select /dev/sdd
Using /dev/sdd
(parted) print
Model: Msft Virtual Disk (scsi)
Disk /dev/sdd: 285282304s
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:

Number  Start  End         Size        Type     File system  Flags
 1      63s    285265982s  285265920s  primary  ntfs         boot

Creating the partition table on the target disk

(parted) select /dev/sde
Using /dev/sde
(parted) mklabel msdos
(parted) mkpart primary ntfs 63s 204862s
Warning: The resulting partition is not properly aligned for best performance.
Ignore/Cancel? i
(parted) mkpart primary ntfs 204863s 285470783s
Warning: The resulting partition is not properly aligned for best performance.
Ignore/Cancel? i
(parted) set 1 boot on
(parted) quit
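
The start and end sectors passed to mkpart follow from the partition sizes that parted reported for the two source disks. As a minimal sketch of that arithmetic (plan_layout is a hypothetical helper written for this post, not something parted provides), assuming the partitions are placed back to back starting at sector 63 like the originals:

```python
SECTOR_BYTES = 512  # logical sector size reported by parted for these disks


def plan_layout(first_start, sizes):
    """Place partitions back to back; return inclusive (start, end) sector pairs.

    Hypothetical helper for illustration only.
    """
    layout = []
    start = first_start
    for size in sizes:
        end = start + size - 1  # parted's End column is inclusive
        layout.append((start, end))
        start = end + 1
    return layout


# Sizes in sectors from the two `print` outputs: System Partition, then OS Partition.
layout = plan_layout(63, [204800, 285265920])
for number, (start, end) in enumerate(layout, start=1):
    size_bytes = (end - start + 1) * SECTOR_BYTES
    print(f"partition {number}: {start}s to {end}s ({size_bytes} bytes)")
```

The second partition in the parted session ends at 285470783s, one sector past the minimum computed here. That is harmless: the target partition only has to be at least as large as the source. Starting at sector 63 mirrors the source layout, which is why parted prints the alignment warning.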

Copying data to the target disk

root@RECOVERY1:~# dd if=/dev/sdc1 of=/dev/sde1 bs=512M
0+1 records in
0+1 records out
104857600 bytes (105 MB) copied, 5.37418 s, 19.5 MB/s

root@RECOVERY1:~# dd if=/dev/sdd1 of=/dev/sde2 bs=512M
272+1 records in
272+1 records out
146056151040 bytes (146 GB) copied, 11130.1 s, 13.1 MB/s
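
The byte counts printed by dd match the sector counts from parted (204800 × 512 and 285265920 × 512), which is a quick sanity check that each partition was copied in full. To check the data itself, something like the sketch below could compare the copied range on source and target. It was not part of the original procedure; same_prefix is a hypothetical helper, and reading the device paths from the session above would need root on the recovery machine.

```python
SYSTEM_BYTES = 204800 * 512      # 104857600, the size of the first dd copy
OS_BYTES = 285265920 * 512       # 146056151040, the size of the second dd copy


def same_prefix(path_a, path_b, length, chunk_size=4 * 1024 * 1024):
    """Return True if the first `length` bytes of both files are identical.

    Works on regular files and block devices; reads in chunks so that
    large partitions do not have to fit in memory.
    """
    remaining = length
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while remaining > 0:
            want = min(chunk_size, remaining)
            block_a = a.read(want)
            block_b = b.read(want)
            # A short read means one side is smaller than `length`.
            if len(block_a) != want or block_a != block_b:
                return False
            remaining -= want
    return True


# Example calls (assumed device names, taken from the session above):
# same_prefix("/dev/sdc1", "/dev/sde1", SYSTEM_BYTES)
# same_prefix("/dev/sdd1", "/dev/sde2", OS_BYTES)
```

Only the first `length` bytes are compared because the second target partition is one sector larger than its source.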

Adjusting the boot loader

Now we have to adjust the boot loader configuration so that it looks for the OS on the same disk.

To do that we have to use BCDEdit on a Windows machine.

After spinning up a Windows Server and attaching the new disk, Disk Management looks like this:

Disk Management

Reconfiguring the boot loader

C:\Users\jrocha>bcdedit /store E:\boot\bcd /enum /v

Windows Boot Manager
--------------------
identifier              {9dea862c-5cdd-4e70-acc1-f32b344d4795}
device                  unknown
description             Windows Boot Manager
locale                  en-US
inherit                 {7ea2e1ac-2e61-4728-aaa3-896d9d0a9f0e}
integrityservices       Enable
default                 {ee1dc098-0100-11e1-accb-b7522c70ccb8}
resumeobject            {ee1dc097-0100-11e1-accb-b7522c70ccb8}
displayorder            {ee1dc098-0100-11e1-accb-b7522c70ccb8}
toolsdisplayorder       {b2721d73-1db4-4c62-bf78-c548a880142d}
timeout                 30

Windows Boot Loader
-------------------
identifier              {ee1dc098-0100-11e1-accb-b7522c70ccb8}
device                  unknown
path                    \Windows\system32\winload.exe
description             Windows Server 2008 R2
locale                  en-US
inherit                 {6efb52bf-1766-41db-a6b3-0ee5eff72bd7}
recoverysequence        {ee1dc099-0100-11e1-accb-b7522c70ccb8}
integrityservices       Enable
recoveryenabled         No
osdevice                unknown
systemroot              \Windows
resumeobject            {ee1dc097-0100-11e1-accb-b7522c70ccb8}
nx                      OptOut
bootstatuspolicy        IgnoreAllFailures
detecthal               Yes

Pay attention to the device: unknown entries in both sections of the BCD store, and to osdevice: unknown in the boot loader section.
They render the machine unbootable. To fix them we have to run this:

bcdedit /store E:\boot\bcd /set {bootmgr} device partition=E:
bcdedit /store E:\boot\bcd /set {bootmgr} integrityservices enable
bcdedit /store E:\boot\bcd /set {default} device partition=F:
bcdedit /store E:\boot\bcd /set {default} integrityservices enable
bcdedit /store E:\boot\bcd /set {default} recoveryenabled Off
bcdedit /store E:\boot\bcd /set {default} osdevice partition=F:
bcdedit /store E:\boot\bcd /set {default} bootstatuspolicy IgnoreAllFailures
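
Only the store path and the two drive letters vary between machines, so the same sequence could be generated instead of typed. A minimal sketch, assuming the settings listed above are the ones you want: bcd_fix_commands is a hypothetical helper that only builds the command strings and does not run bcdedit.

```python
def bcd_fix_commands(store, system_drive, os_drive):
    """Build the bcdedit command lines used above for an offline BCD store.

    Hypothetical helper: it returns strings and executes nothing.
    """
    settings = [
        ("{bootmgr}", "device", f"partition={system_drive}"),
        ("{bootmgr}", "integrityservices", "enable"),
        ("{default}", "device", f"partition={os_drive}"),
        ("{default}", "integrityservices", "enable"),
        ("{default}", "recoveryenabled", "Off"),
        ("{default}", "osdevice", f"partition={os_drive}"),
        ("{default}", "bootstatuspolicy", "IgnoreAllFailures"),
    ]
    return [
        f"bcdedit /store {store} /set {entry} {name} {value}"
        for entry, name, value in settings
    ]


# E: and F: are the letters the helper Windows machine assigned to the
# System Partition and the OS Partition of the merged disk.
for line in bcd_fix_commands(r"E:\boot\bcd", "E:", "F:"):
    print(line)
```

The lines are meant for cmd.exe, as in the session above. In PowerShell the braces around {bootmgr} and {default} would most likely need quoting.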

Now it should be fine. Just checking:

C:\Users\jrocha>bcdedit /store E:\boot\bcd /enum /v

Windows Boot Manager
--------------------
identifier              {9dea862c-5cdd-4e70-acc1-f32b344d4795}
device                  partition=E:
description             Windows Boot Manager
locale                  en-US
inherit                 {7ea2e1ac-2e61-4728-aaa3-896d9d0a9f0e}
integrityservices       Enable
default                 {ee1dc098-0100-11e1-accb-b7522c70ccb8}
resumeobject            {ee1dc097-0100-11e1-accb-b7522c70ccb8}
displayorder            {ee1dc098-0100-11e1-accb-b7522c70ccb8}
toolsdisplayorder       {b2721d73-1db4-4c62-bf78-c548a880142d}
timeout                 30

Windows Boot Loader
-------------------
identifier              {ee1dc098-0100-11e1-accb-b7522c70ccb8}
device                  partition=F:
path                    \Windows\system32\winload.exe
description             Windows Server 2008 R2
locale                  en-US
inherit                 {6efb52bf-1766-41db-a6b3-0ee5eff72bd7}
recoverysequence        {ee1dc099-0100-11e1-accb-b7522c70ccb8}
integrityservices       Enable
recoveryenabled         No
osdevice                partition=F:
systemroot              \Windows
resumeobject            {ee1dc097-0100-11e1-accb-b7522c70ccb8}
nx                      OptOut
bootstatuspolicy        IgnoreAllFailures
detecthal               Yes
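
Eyeballing the listing works for one machine. For several, the check could be scripted against saved bcdedit /enum output. A minimal sketch, assuming the output has the same shape as the listings above (a title line, a line of dashes, then name/value pairs); unknown_devices is a hypothetical helper and the sample text is abridged from the first listing.

```python
def unknown_devices(enum_output):
    """Return (section, field) pairs whose device or osdevice is 'unknown'."""
    problems = []
    section = None
    previous = ""
    for raw in enum_output.splitlines():
        line = raw.strip()
        if line and set(line) == {"-"}:
            section = previous  # the section title sits just above the dashes
        else:
            parts = line.split(None, 1)
            if (
                len(parts) == 2
                and parts[0] in ("device", "osdevice")
                and parts[1] == "unknown"
            ):
                problems.append((section, parts[0]))
        previous = line
    return problems


before = """\
Windows Boot Manager
--------------------
identifier              {9dea862c-5cdd-4e70-acc1-f32b344d4795}
device                  unknown

Windows Boot Loader
-------------------
identifier              {ee1dc098-0100-11e1-accb-b7522c70ccb8}
device                  unknown
osdevice                unknown
"""

for section, field in unknown_devices(before):
    print(f"{section}: {field} is unknown")
```

An empty result for the store after the fix means none of the three entries is left pointing at an unknown device.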

Conclusion

It seems that Azure VMs are set up to run with just one VHD attached to the IDE controller.
All other disks are connected via the VMBUS and show up as SCSI disks.

I guess access to a data disk should be faster than access over an emulated IDE channel,
so I think we can take this as a heads-up to avoid heavy I/O on the OS disk of an Azure VM.

I don’t know whether the Azure Site Recovery team plans to fix this or considers it behavior by design,
but if you have a non-standard disk configuration and you are planning to move that VM to Azure using this great tool, be careful!

Jorge Rocha Gualtieri

Software Engineer