[Crash-utility] kdump format may be updated

Tue Oct 24 12:02:21 UTC 2006

On Mon, 2006-10-23 at 10:46 -0400, Dave Anderson wrote:
> Magnus Damm wrote: 
> > Isn't the Xen hypervisor mapped like va = pa + offset? 

> I don't know.  If I look at the xen-syms file, the VirtAddr 
> and PhysAddr in the PT_LOAD segment are both 0xff100000, 
> which doesn't make sense to me: 
> 
> Program Headers: 
>   Type   Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align 
>   LOAD   0x001000 0xff100000 0xff100000 0x92658 0x92658 RWE 0x1000 

Yes, I agree that this looks strange. But I don't find it surprising...

> But anyway, I would presume that the crash_notes contents 
> in the xen binary contains registers that only make sense 
> with respect to xen, and even if you put the xen cr3 value 
> there, it doesn't have any relationship to the cr3 value 
> required for translating dom0 linux kernel virtual addresses. 

The idea is that the crash_notes contents in the Xen hypervisor space
contains registers indexed by physical cpu number.

It is possible to locate the crashing physical cpu by looking up a
global variable among the hypervisor symbols, and from there it should
be possible to backtrack and find the domain pseudo-phys to virt mapping
table. I say "should" because it is probably pretty hairy.

> In your patch here: 
> 
>   [Xen-devel] [PATCH 03/04] Kexec / Kdump: x86_32 specific code 
> 
> the dom0 cr3 is located by the find_dom0_cr3() function, which 
> has to walk the xen binary's "domain_list" 
> 
> +void find_dom0_cr3(void) 
> +{ 
> +       struct domain *d; 
> +       struct vcpu   *v; 
> +       uint32_t *buf; 
> +       uint32_t cr3; 
> +       Elf_Note note; 
> + 
> +       /* Don't need to grab domlist_lock as we are the only thing running */ 
> + 
> +       /* No need to traverse domain_list, as dom0 is always first */ 
> +       d = domain_list; 
> +       BUG_ON(d->domain_id); 
> + 
> +       for_each_vcpu ( d, v ) { 
> +               if ( test_bit(_VCPUF_down, &v->vcpu_flags) ) 
> +                       continue; 
> +               buf = (uint32_t *)per_cpu(crash_notes, v->processor); 
> +               if (!buf) /* XXX: Can this ever occur? */ 
> +                       continue; 
> + 
> +               memcpy(&note, buf, sizeof(Elf_Note)); 
> +               buf += (sizeof(Elf_Note) +3)/4 + (note.namesz + 3)/4 + 
> +                       (note.descsz + 3)/4; 
> + 
> +               /* XXX: This probably doesn't take into account shadow mode, 
> +                * but that might not be a problem */ 
> +               cr3 = pagetable_get_pfn(v->arch.guest_table); 
> + 
> +               buf = append_elf_note(buf, "Xen Domanin-0 CR3", 
> +                       NT_XEN_DOM0_CR3, &cr3, 4); 
> +               final_note(buf); 
> + 
> +               printk("domain:%i vcpu:%u processor:%u cr3:%08x\n", 
> +                      d->domain_id, v->vcpu_id, v->processor, cr3); 
> +       } 
> +} 
> + 
> 
> So when you ask, can I "look up the crash_notes symbol and find the 
> saved registers there?", I'm presuming that you are talking about 
> the crash_notes variable above?  But if you're planning on removing 
> the code above, why would there be any crash_notes variable remaining?

We _do_ need to save the registers, so the functionality is not going
away. Don't you worry about that. =) It is more of a problem how
references to crash notes are passed around from the hypervisor down to
dom0 user space and that we are currently tightly coupled with the
crash_note format used by the kernel.
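
For reference, the note arithmetic that find_dom0_cr3() uses to step
over an existing note follows the usual ELF note layout: a three-word
header, then the name and the descriptor, each padded to a 4-byte
boundary. A minimal sketch of that layout, assuming 32-bit words as in
the x86_32 patch; put_note() and skip_note() are made-up names for
illustration, not functions from the patch:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Standard ELF note header: three 32-bit words. */
struct elf_note_hdr {
    uint32_t namesz;   /* length of name, including the trailing NUL */
    uint32_t descsz;   /* length of descriptor in bytes */
    uint32_t type;
};

/* Round a byte count up to whole 32-bit words. */
static size_t note_words(size_t bytes)
{
    return (bytes + 3) / 4;
}

/* Hypothetical helper: write one note at buf and return the position
 * just after it. desc must point to descsz valid bytes. */
static uint32_t *put_note(uint32_t *buf, const char *name, uint32_t type,
                          const void *desc, uint32_t descsz)
{
    struct elf_note_hdr hdr;

    hdr.namesz = (uint32_t)strlen(name) + 1;
    hdr.descsz = descsz;
    hdr.type = type;

    memcpy(buf, &hdr, sizeof(hdr));
    buf += note_words(sizeof(hdr));

    /* Zero the padded area first, then copy the payload into it. */
    memset(buf, 0, note_words(hdr.namesz) * 4);
    memcpy(buf, name, hdr.namesz);
    buf += note_words(hdr.namesz);

    memset(buf, 0, note_words(descsz) * 4);
    memcpy(buf, desc, descsz);
    return buf + note_words(descsz);
}

/* Hypothetical helper: skip one note, using the same arithmetic as the
 * "buf += ..." statement in find_dom0_cr3(). */
static uint32_t *skip_note(uint32_t *buf)
{
    struct elf_note_hdr hdr;

    memcpy(&hdr, buf, sizeof(hdr));
    return buf + note_words(sizeof(hdr)) + note_words(hdr.namesz)
               + note_words(hdr.descsz);
}
```

Any tool that parses these notes, whether it finds them through the ELF
header or through a symbol, ends up depending on this layout, which is
the coupling mentioned above.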

Our internal interfaces are not particularly clean at the moment. We
have code that keeps the crash_notes in the hypervisor, but passes the
physical addresses (or machine addresses in xen lingo) for the notes all
the way down to kexec-tools in dom0 user space. These addresses are then
used to create the ELF headers. dom0 only knows about VCPU:s, but
because we are creating a system-wide crash dump we want to use physical
cpus. So down in user space we then need to create a mapping between
physical cpu:s and VCPU:s. And can we be sure that dom0 has all cpus
available as VCPU:s?

Do you see how things are starting to get complex?

I'd like to rip out all this code and avoid creating ELF note program
headers in kexec-tools at all in the Xen case. This doesn't mean that
the crash notes go away - just that the ELF header reference is ripped
out. This simplifies things a lot. It does however unfortunately put
stress on the tool which then needs to find the registers using symbol
lookup.

> In any case, you can see that what is needed is the per-domain cr3 
> value, which is embedded in the xen-binary data. 
> 
> Or, as I mentioned before, alternatively, the 
> pfn_to_mfn_frame_list_list value from the per-domain 
> arch_shared_info also would work: 
> 
> struct arch_shared_info { 
>     unsigned long max_pfn;      /* max pfn that appears in table */ 
>     /* Frame containing list of mfns containing list of mfns containing p2m. */ 
>     xen_pfn_t     pfn_to_mfn_frame_list_list; 
>     unsigned long nmi_reason; 
>     uint64_t pad[32]; 
> }; 
> 
> But again, there's no easy way for the crash utility to dig 
> them out of a completely foreign binary. 

No, but that's because your tool is missing knowledge about the binary,
right? =) Is there any easy way out... No! =) Or maybe there is?
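
Once the pfn_to_mfn_frame_list_list mfn is known, the table itself looks
straightforward to walk. A rough sketch of the three-level lookup,
reading the comment in arch_shared_info literally (one frame of mfns,
each naming a frame of mfns, each naming a frame of p2m entries). The
assumptions here are mine: 4096-byte frames, 32-bit entries as on
x86_32, and a flat in-memory array standing in for machine memory. A
real tool would read frames from the dumpfile instead. frame_at() and
p2m_lookup() are made-up names:

```c
#include <stddef.h>
#include <stdint.h>

#define ENTRIES_PER_FRAME 1024u        /* assumed: 4096-byte frame / 4-byte entry */
#define INVALID_MFN       0xffffffffu  /* sentinel used by this sketch only */

/* Return the contents of machine frame mfn inside a flat memory image,
 * or NULL if the frame lies outside the image. */
static const uint32_t *frame_at(const uint32_t *mem, size_t nframes,
                                uint32_t mfn)
{
    if (mfn >= nframes)
        return NULL;
    return mem + (size_t)mfn * ENTRIES_PER_FRAME;
}

/* Translate a pseudo-physical frame number to a machine frame number by
 * walking frame_list_list -> frame list -> p2m frame. */
static uint32_t p2m_lookup(const uint32_t *mem, size_t nframes,
                           uint32_t fll_mfn, uint32_t max_pfn, uint32_t pfn)
{
    uint32_t p2m_page = pfn / ENTRIES_PER_FRAME;      /* which p2m frame */
    uint32_t fl_index = p2m_page / ENTRIES_PER_FRAME; /* which frame list */
    const uint32_t *fll, *fl, *p2m;

    if (pfn >= max_pfn || fl_index >= ENTRIES_PER_FRAME)
        return INVALID_MFN;

    fll = frame_at(mem, nframes, fll_mfn);
    if (!fll)
        return INVALID_MFN;

    fl = frame_at(mem, nframes, fll[fl_index]);
    if (!fl)
        return INVALID_MFN;

    p2m = frame_at(mem, nframes, fl[p2m_page % ENTRIES_PER_FRAME]);
    if (!p2m)
        return INVALID_MFN;

    return p2m[pfn % ENTRIES_PER_FRAME];
}
```

The hard part is not this walk but finding fll_mfn and max_pfn in the
first place, which is where knowledge of the hypervisor structures comes
in.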

I hope we can find a good balance between your code and ours. Maybe a
relatively fair balance could be that we provide per-physical cpu
pointers to some virtual to physical mapping tables which should be easy
to parse for your tool, but in return your tool doesn't depend on
finding register information using the note program headers in the ELF
header...

[snip]

> > > So, given that the dom0 pseudo-physical address needs to be 
> > > translated into a machine address, I need to be able to find my 
> > > way to the phys_to_machine_mapping array.  From that point on, it 
> > > becomes a matter of searching the array for the desired 
> > > pseudo-physical address, getting the associated machine address, 
> > > and then using the PT_LOAD segments of the ELF header to find the 
> > > memory. 
> > 
> > Yes, this makes sense. 
> > 
> > > To find the phys_to_machine_mapping array, there are two keys 
> > > to Pandora's box: 
> > > 
> > > (1) the dom0 cr3 value -- which in a writable page table kernel, 
> > >     will contain an mfn value.  With that starting point, a page 
> > >     table walk can be initiated for the "phys_to_machine_mapping" 
> > >     virtual address. 
> > > 
> > > (2) alternatively, given the dom0 pfn_to_mfn_frame_list_list mfn, 
> > >     I also have a starting point in order to reconstruct the 
> > >     phys_to_machine_mapping array. 
> > > 
> > > Either one works.  I preferred #2 because it would presumably 
> > > work for both writable and shadow page table kernels.  But, I've 
> > > never done any work with shadow page table kernels (Red Hat is 
> > > going with writable...), so I don't know what the ramifications 
> > > are for those kernels. 
> > 
> > I would go with #2 too, although I must admit that my knowledge 
> > about the Xen internals is a bit limited. How do you locate this 
> > one? Through a global symbol in hypervisor space? 
> >  
> > 
> 
> 
> Your knowledge is a hell of a lot less limited than mine...  ;-) 
> 
> Anyway, yes, it appears that the "domain_list" global is the key 
> that leads to either the per-domain pfn_to_mfn_frame_list_list or 
> the per-domain cr3 value. 
> 
> With respect to the per-domain pfn_to_mfn_frame_list_list location, 
> in the xen sources, there's the following "shared_info" structure, 
> that contains the data that's shared between the xen binary and 
> each domain.  And this structure contains the "arch_shared_info" 
> structure I showed above -- which contains the per-domain 
> pfn_to_mfn_frame_list_list value: 

[snip]

> 
> The best "common-code" example I see in the xen sources accesses the 
> shared_info structure above like so: 
> 
> void evtchn_set_pending(struct vcpu *v, int port) 
> { 
>     struct domain *d = v->domain; 
>     shared_info_t *s = d->shared_info; 
> 
> So looking at your currently existing find_dom0_cr3() function, 
> the vcpu pointer can be pulled the same way from the domain_list: 
> 
> +void find_dom0_cr3(void) 
> +{ 
> +       struct domain *d; 
> +       struct vcpu   *v; 
> +       uint32_t *buf; 
> +       uint32_t cr3; 
> +       Elf_Note note; 
> + 
> +       /* Don't need to grab domlist_lock as we are the only thing running */ 
> + 
> +       /* No need to traverse domain_list, as dom0 is always first */ 
> +       d = domain_list; 
> +       BUG_ON(d->domain_id); 
> + 
> +       for_each_vcpu ( d, v ) { 

That's good, isn't it? If I've understood things right, it's possible to
locate the data you need using the domain_list symbol?
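
The last step Dave describes earlier, going from a machine address to a
position in the dumpfile via the PT_LOAD segments, might look roughly
like the sketch below. The struct and function names are made up for
illustration, and only the three program header fields that matter here
(file offset, physical address, file size) are modelled:

```c
#include <stddef.h>
#include <stdint.h>

/* The parts of a PT_LOAD program header needed for the lookup. */
struct load_segment {
    uint64_t file_offset;  /* p_offset */
    uint64_t paddr;        /* p_paddr  */
    uint64_t filesz;       /* p_filesz */
};

/* Find the file offset holding machine address maddr.
 * Returns 0 and stores the offset on success, -1 if no segment
 * covers the address. */
static int maddr_to_file_offset(const struct load_segment *segs,
                                size_t nsegs, uint64_t maddr,
                                uint64_t *offset)
{
    size_t i;

    for (i = 0; i < nsegs; i++) {
        const struct load_segment *s = &segs[i];

        /* Written as a subtraction to avoid overflow in paddr + filesz. */
        if (maddr >= s->paddr && maddr - s->paddr < s->filesz) {
            *offset = s->file_offset + (maddr - s->paddr);
            return 0;
        }
    }
    return -1;
}
```

With the xen-syms segment quoted at the top of this mail (offset
0x001000, PhysAddr 0xff100000, FileSiz 0x92658), address 0xff100010
would map to file offset 0x1010.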

[snip]

> > > > > The crash utility is wholly based upon the internal structure 
> > > > > of the Linux kernel. 
> > > > 
> > > > So why can't you just require that Xen dumps need to be cut 
> > > > out with dom0 cut? 
> > > > 
> > > > 
> > > Well, to answer your question with a question: 
> > > 
> > >  Why should it be required if it could be so easily avoided? 
> > > 
> > > As Henry David Thoreau said, "Simplicity, simplicity, 
> > > simplicity..." 
> > 
> > I'm glad to hear that. That means that we both want simplicity. =) 
> > 
> > I think we should save registers that are not saved today, and I 
> > would be happy to add that data to our crash notes in the 
> > hypervisor. 
> > 
> > But can you locate the crash notes without any reference from the 
> > ELF header? 
> >  
> > 
> As I explained above, I don't see how?  I can't maneuver around 
> following multiple data structure linkages -- using data structures 
> that crash knows nothing about.  (Not to mention knowing how exactly 
> the xen virtual-to-physical translation works...) 

Yeah, I agree that navigating around those structures seems rather
painful. But OTOH, if you want to know things that only the internals
can tell you, you need to be able to parse them, right? But maybe you
only want to cover the "simple" dom0 case. (Simple yeah right)

> But I'm still confused about that -- why would the "crash_notes" 
> exist in the xen sources/binary if you're not going to put them 
> in the ELF header of a resultant xen dumpfile?  What exactly 
> is going to be put into the dumpfile's ELF header? 

Four words: As little as possible. =)

Thanks for all the comments so far!

/ magnus