[Crash-utility] help debug number of CPU detect failure

Wed Mar 4 22:49:01 UTC 2020

> Hello List,
>
> I have two ELF coredumps from two different HyperV VMs, generated by this
> tool (https://github.com/Azure/azure-linux-utils/tree/master/vm2core).
>
> Crash works with one of these coredumps but does not work with the other.
>
> I've placed the output generated by the crash tool here:
>
> Not ok with crash:
> ./crash/crash /usr/lib/debug/boot/vmlinux-4.15.0-88-generic vm1_numa_4gb_5cpu.coredump --kaslr 600000 -m phys_base=4355784704 -d8
>  https://raw.githubusercontent.com/santoshx/temp/master/notok_with_crash.txt
>
> Ok with crash:
>  ./crash/crash /usr/lib/debug/boot/vmlinux-4.15.0-88-generic vm1_nonuma_4gb_5cpu.coredump --kaslr 3c00000 -m phys_base=2344615936 -d8
>  https://raw.githubusercontent.com/santoshx/temp/master/ok_with_crash.txt
>
>
> The problem I see is that in the non-working case, crash fails to detect the
> correct cpu_possible_mask:
>
> Relevant part of $ diff ok_with_crash.txt notok_with_crash.txt:
>
> <   cpu_active_mask: cpus: 0 1 2 3 4
> < FREEBUF(0)
> < <readmem: ffffffff86039f40, KVADDR, "pv_init_ops", 8, (ROE), 7ffe01722870>
> < <read_kdump: addr: ffffffff86039f40 paddr: 91c39f40 cnt: 8>
> < read_netdump: addr: ffffffff86039f40 paddr: 91c39f40 cnt: 8 offset: 91c3a760
> ---
>> <readmem: ffffffff826f2b60, KVADDR, "possible", 1024, (ROE), 5638a35a2280>
>> <read_kdump: addr: ffffffff826f2b60 paddr: 1060f2b60 cnt: 1024>
>> read_netdump: addr: ffffffff826f2b60 paddr: 1060f2b60 cnt: 1024 offset: fe0f3380
>> cpu_possible_mask: cpus: 3 4 5 6 8 13 14 18 20 21 22 26 28 29 30 33 36
>> 37 38 48 49 52 53 54 56 59 60 61 62 64 65 68 69 70 72 73 74 75 76 78 82
>> 83 85 86 90 91 93 94 96 99 101 102 104 105 108 109 110 114 116 117 118
>> 123 124 125 126 128 133 134 138 140 141 142 146 148 149 150 153 156 157
>> 158 168 169 172 173 174 176 179 180 181 182 184 185 188 189 190 192 193
>> 194 195 196 198 200 202 205 206 211 212 213 214 216 219 221 222 226 228
>> 229 230 232 233 234 235 236 238 242 243 245 246 248 251 253 254 256 257
>> 260 261 262 266 268 269 270 275 276 277 278 280 285 286 290 292 293 294
>> 298 300 301 302 305 308 309 310 320 321 324 325 326 328 331 332 333 334
>> 336 337 340 341 342 344 345 346 347 348 350 352 354 357 358 361 362 363
>> 365 366 370 372 373 374 376 378 381 382 385 388 389 390 392 393 394 395
>> 396 398 402 403 405 406 408 411 413 414 416 417 420 421 422 426 428 429
>> 430 435 436 437 438 440 445 446 450 452 453 454 458 460 461 462 465 468
>> 469 470 480 481 484 485 486 488 491 492 493 494 496 497 500 501 502 504
>> 505 506 507 508 510 514 515 517 518 520 523 525 526 528 529 532 533 534
>> 538 540 541 542 547 548 549
>
> I'm trying to find where the problem is: in the crash tool, or in the tool
> that generated the ELF coredumps?

I suspect that the --kaslr offset and/or the phys_base value that you
have used is incorrect.
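For reference, here is a minimal sketch of the arithmetic involved, assuming
the standard x86_64 kernel-text mapping: an address at or above
__START_KERNEL_map (0xffffffff80000000) is translated by subtracting that
base and adding phys_base.  The helper name kernel_text_vtop is hypothetical.
Plugging in the addresses from your two -d8 logs reproduces the paddr values
that crash printed, so the translation itself is purely mechanical: if the
phys_base value (or the --kaslr offset that produced the virtual address) is
wrong, crash simply reads the wrong physical memory.

```python
# Hedged sketch of x86_64 kernel-text virtual-to-physical translation,
# assuming the standard __START_KERNEL_map base.
START_KERNEL_MAP = 0xFFFFFFFF80000000


def kernel_text_vtop(vaddr: int, phys_base: int) -> int:
    """Hypothetical helper: translate a kernel-text vaddr to a paddr."""
    if vaddr < START_KERNEL_MAP:
        raise ValueError("not a kernel-text virtual address")
    # Offset into the kernel image, relocated to where it was loaded.
    return vaddr - START_KERNEL_MAP + phys_base


# Addresses and phys_base values taken from the two -d8 logs.
notok = kernel_text_vtop(0xFFFFFFFF826F2B60, 4355784704)  # "possible"
ok = kernel_text_vtop(0xFFFFFFFF86039F40, 2344615936)     # "pv_init_ops"
print(hex(notok))  # 0x1060f2b60
print(hex(ok))     # 0x91c39f40
```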

It appears that the read of cpu_possible_mask is using either the wrong
virtual address or the wrong physical address, and as a result crash is
interpreting bogus data.  In fact, the full output txt file shows that
everything it reads is garbage, e.g., the cpu masks, the utsname data
structure, the linux_banner string, etc.
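To see why garbage memory produces a CPU list like the one in your diff: a
cpumask is a bitmap in which bit n set means CPU n is present.  The sketch
below uses a hypothetical helper, cpumask_to_cpus, and assumes the
little-endian x86_64 layout (bit n lives in bit n % 8 of byte n // 8).  A sane
5-CPU mask is just 0x1f, whereas the two bytes 0x78 0x61 (inferred here from
the start of the bad list, "3 4 5 6 8 13 14") decode to scattered CPU
numbers.  Reading all 1024 bytes from an unrelated location yields hundreds of
such bits.

```python
from typing import List


def cpumask_to_cpus(raw: bytes) -> List[int]:
    """Hypothetical helper: list the CPUs whose bits are set in a raw mask.

    Assumes little-endian layout: bit n is bit (n % 8) of byte (n // 8).
    """
    cpus = []
    for byte_index, byte in enumerate(raw):
        for bit in range(8):
            if byte & (1 << bit):
                cpus.append(byte_index * 8 + bit)
    return cpus


# A valid mask for 5 CPUs in a 1024-byte buffer.
print(cpumask_to_cpus(bytes([0x1F]) + bytes(1023)))  # [0, 1, 2, 3, 4]
# Bytes inferred from the beginning of the bogus cpu_possible_mask.
print(cpumask_to_cpus(bytes([0x78, 0x61])))  # [3, 4, 5, 6, 8, 13, 14]
```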

Dave