AIX 5L Internals Student Guide Version 20001015

IBM Web Server Knowledge Channel

Student Guide


Trademarks

IBM® is a registered trademark of International Business Machines Corporation. UNIX is a registered trademark in the United States, other countries, or both and is licensed exclusively through X/Open Company Limited.

July 2000 Edition

The information contained in this document has not been submitted to any formal IBM test and is distributed on an “as is” basis without any warranty either express or implied. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer’s ability to evaluate and integrate them into the customer’s operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk.

© Copyright International Business Machines Corporation 2000. All rights reserved. This document may not be reproduced in whole or in part without the prior written permission from IBM. Information in this course is subject to change without notice.

Web Server Knowledge Channel Technical Education


Contents

Kernel Overview
  Kernel Overview  1-2
  Kernel states  1-8
  Kernel exercise  1-10
  Kernel Limits  1-12
  Kernel Limits Exercises  1-16
  64-bit Kernel base enablement  1-17
  64-bit Kernel stack  1-24
  CPU big- and little-endian  1-26
  Multi Processor dependent designs  1-28
  Command and Utility compatibility for 32-bit and 64-bit kernels  1-29
  Exceptions  1-30
  Interrupts  1-33
  Interrupt handling in AIX  1-35
  Handling CPU state information at interrupt  1-36
  Handling CPU state information at interrupt  1-37

IA-64 Hardware Overview
  IA-64 Hardware Overview  2-2
  IA-64 formats  2-3
  IA-64 memory  2-5
  IA-64 Instructions  2-8
  IA-64 Registers  2-12
  IA-64 Operations  2-19

Power Hardware Overview
  Power Hardware Overview  3-2
  Power CPU Overview  3-8
  64 bit CPU Overview  3-15

SMP Hardware Overview
  SMP Hardware Overview  4-2

Configuring System Dumps on AIX 5L
  About This Lesson  5-3
  System Dump Facility in AIX5L  5-5
  Configuring for System Dumps  5-7
  Obtaining a Crash Dump  5-16
  Dump Status and completion codes  5-17
  dumpcheck utility  5-19
  Verify the dump  5-21
  Packaging the dump  5-26

Introduction to Dump Analysis Tools
  About This Lesson  6-2
  System Dump Analysis Tools  6-6
  dump components  6-7
  Dump creation process  6-8
  Component dump routines  6-9
  bosdebug command  6-10
  Memory Overlay Detection System  6-11
  System Hang Detection  6-14
  truss command  6-20
  KDB kernel debugger  6-23
  kdb command  6-25
  KDB miscellaneous sub commands  6-26
  KDB dump/display/decode sub commands  6-29
  KDB modify memory sub commands  6-33
  KDB trace sub commands  6-36
  KDB break point and step sub commands  6-38
  KDB name list/symbol sub commands  6-42
  KDB watch break point sub commands  6-43
  KDB machine status sub commands  6-45
  KDB kernel extension loader sub commands  6-47
  KDB address translation sub commands  6-49
  KDB process/thread sub commands  6-50
  KDB Kernel stack sub commands  6-58
  KDB LVM sub commands  6-60
  KDB SCSI sub commands  6-62
  KDB memory allocator sub commands  6-65
  KDB file system sub commands  6-69
  KDB system table sub commands  6-72
  KDB network sub commands  6-77
  KDB VMM sub commands  6-80
  KDB SMP sub commands  6-86
  KDB data and instruction block address translation sub commands  6-87
  KDB bat/brat sub commands  6-89
  IADB kernel debugger  6-90
  iadb command  6-92
  IADB break point and step sub commands  6-93
  IADB dump/display/decode sub commands  6-96
  IADB modify memory sub commands  6-100
  IADB name list/symbol sub commands  6-105
  IADB watch break point sub commands  6-106
  IADB machine status sub commands  6-108
  IADB kernel extension loader sub commands  6-110
  IADB address translation sub commands  6-111
  IADB process/thread sub commands  6-112
  IADB LVM sub commands  6-114
  IADB SCSI sub commands  6-115
  IADB memory allocator sub commands  6-116
  IADB file system sub commands  6-117
  IADB system table sub commands  6-118
  IADB network sub commands  6-119
  IADB VMM sub commands  6-120
  IADB SMP sub commands  6-122
  IADB block address translation sub commands  6-123
  IADB bat/brat sub commands  6-124
  IADB miscellaneous sub commands  6-125
  Exercise  6-127

Process Management
  Process Management Fundamentals  7-2
  Process operations fork() system call  7-3
  Process operations exec() system call  7-8
  Process operations exec() system call  7-10
  Process operations exit system call  7-12
  Process operations, wait() system call  7-13
  Kernel Processes  7-16
  Thread Fundamentals  7-17
  AIX Thread  7-19
  Thread Concepts  7-21
  Threads Models  7-22
  Thread states  7-25
  Thread Management  7-27
  Process swapping  7-28
  Thread Scheduling  7-29
  The Dispatcher  7-33
  AIX run queues  7-36
  Process and Threads data structures  7-39
  Process and Threads data structures addresses  7-43
  What is new in AIX 5  7-48
  Signals  7-50
  Signal handling  7-51
  Signals  7-53
  Exercises  7-57

Memory Management
  Overview of Virtual Memory Management  8-2
  Memory Management Definitions  8-3
  Demand Paging  8-5
  Memory Objects  8-7
  Memory Object types  8-8
  Page Mapping  8-9
  Page Not In Hardware Frame Table  8-11
  Page on Paging Space  8-13
  Loading Pages From The Filesystem  8-17
  Filesystem I/O  8-18
  Free Memory and Page Replacement  8-19
  vmtune  8-21
  Fatal Memory Exceptions  8-22
  Memory Objects (Segments)  8-23
  Shared Memory segments  8-25
  shmat Memory Services  8-26
  Memory Mapped Files  8-28

IA-64 Virtual Memory Manager
  IA-64 Addressing Introduction  9-2
  Regions  9-3
  Region Registers  9-4
  Address Translation  9-5
  Single vs. Multiple Address Space  9-7
  AIX 5L Region Usage  9-8
  Memory Protection  9-10
  LP64 Address Space  9-14
  ILP32 Address Space  9-15
  Exercise  9-16

LVM
  Logical Volume Manager overview  11-3
  Data Integrity and LVM Mirroring  11-12
  LVM Striping  11-15
  LVM Performance  11-17
  Physical disk layout Power  11-21
  VGSA structure  11-30
  Physical disk layout IA-64  11-31
  LVM Passive Mirror Write Consistency  11-36
  AIX 5 LVM Hot Spare Disk in a Volume group  11-40
  LVM Hot spot management  11-42
  LVM split mirror AIX 4.3.3  11-45
  LVM Variable logical track group (LTG)  11-46
  LVM command overview  11-47
  LVM Problem Determination  11-48
  Trace LVM commands with the trace command  11-51
  LVM Library calls  11-56
  logical volume device driver LVMDD  11-57
  Disk Device Calls  11-58
  Disk low level Device Calls such as SCSI calls  11-60
  Exercises  11-61

Enhanced Journaled File System
  J2 - Enhanced Journaled File System  12-2
  Aggregate  12-3
  Allocation Groups  12-7
  Filesets  12-8
  Extents  12-10
  Binary Trees of Extents  12-12
  inodes  12-15
  File Data Storage  12-19
  fsdb Utility  12-23
  Exercise 1 - fsdb  12-24
  Directory  12-27
  Directory Examples  12-31
  Exercise 2 - Directories  12-35

Logical and Virtual File Systems
  General File System Interface  13-2
  Logical File System  13-4
  User File Descriptor Table  13-6
  System File Table  13-7
  Virtual File System  13-9
  Vnode/vfs interface  13-10
  Vnodes  13-11
  vfs and vmount  13-12
  File and Filesystem Operations  13-14
  gfs  13-15
  vnodeops  13-16
  vfsops  13-17
  The Gnode  13-18
  Exercise 1  13-20
  Lab Exercise 1  13-21
  Lab Exercise 2  13-26

AIX 5L boot
  What is boot  14-2
  Various Types of boot  14-3
  Systems types and Kernel images  14-5
  RAMFS and prototype files  14-6
  Boot Image Creation  14-8
  AIX 5L Distributions  14-10
  Checkpoint  14-11
  Instructor Notes  14-12
  The Power Boot Mechanism  14-13
  Power boot disk layout  14-14
  AIX 5L Power boot record  14-16
  Instructor Notes  14-20
  Power boot images structures  14-21
  RSPC boot image hints header  14-22
  CHRP Boot image ELF structure  14-24
  CHRP boot image ELF structure - Continued  14-25
  CHRP boot image ELF structure - Continued  14-26
  Exercise  14-27
  Instructor Notes  14-28
  Power ROS and Softros  14-30
  IPLCB on Power  14-31
  Checkpoint  14-33
  Instructor Notes  14-34
  The IA-64 Boot Mechanism  14-35
  IA-64 boot disk layout  14-37
  EFI boot manager and boot maintenance manager overview  14-39
  EFI Shell Overview  14-40
  IA-64 Boot Loader  14-43
  IA-64 Initial Program Load Control Block  14-44
  Checkpoint  14-45
  Instructor Notes  14-46
  Hard Disk Boot process (rc.boot Phase I)  14-47
  Hard Disk Boot process (rc.boot Phase II)  14-48
  Hard Disk Boot process (rc.boot Phase III)  14-49
  CDROM Boot process (rc.boot Phases I, II and III)  14-50
  Tape Boot process (rc.boot Phases I, II and III)  14-51
  Network Boot process (rc.boot Phases I, II and III)  14-52
  Common Boot process (rc.boot Phase III)  14-53
  Network boot $RC_CONFIG files  14-54
  The init process  14-56
  ODM Structure and usage  14-57
  boot and installation logging facilities  14-63
  Debugging boot problems using KDB  14-65
  Debugging boot problems using IADB  14-67
  Packaging Changes  14-69
  Checkpoint  14-71
  Instructor Notes  14-72

/proc Filesystem Support
  /proc Filesystem Support  15-2
  Types of Files  15-4
  The as File  15-5
  The ctl File  15-6
  The status File  15-7
  The psinfo file  15-10
  The map File  15-11
  The cred File  15-13
  The sigact File  15-14
  lwp/lwpctl file  15-15
  The lwp/lwpstatus File  15-16
  The lwp/lwpsinfo File  15-19
  Control Messages  15-20
  PCSTOP, PCDSTOP, and PCWSTOP  15-21
  PCRUN  15-23
  PCSTRACE  15-25
  PCCSIG  15-26
  PCKILL, PCUNKILL  15-27
  PCSHOLD  15-28
  PCSFAULT  15-29
  Directories  15-34
  Code Example  15-35

Unit 1. Kernel Overview

This overview describes the concepts used in the AIX 5L kernel.

What You Should Be Able to Do

After completing this unit, you should be able to:
• Identify major components of the kernel.
• Identify the major differences between AIX 5L and previous versions of AIX.
• Determine what kernel to use.
• Determine what the kernel limits are.
• Find out if a thread is in user or kernel mode.
• Define the kernel address layout.
• Describe the steps the kernel takes in handling an interrupt or exception.


Kernel Overview

Introduction

Up until AIX 5L, the kernel was a 32-bit kernel for the Power architecture only. AIX Version 4.3 introduced 64-bit application enablement on Power, which meant there was still a 32-bit kernel, but a 64-bit environment was available through a kernel extension which did the appropriate remapping of system calls. Now AIX 5L features both a 32-bit and a 64-bit kernel on Power systems, and a 64-bit kernel on the IA-64 architecture. This overview describes the concepts used in the kernel in general and in the 64-bit kernel specifically.

Kernel description

The kernel is the base program of the computer. It is an intermediary between the applications and the computer hardware. There is no need for applications to have specific knowledge of any kind of hardware. Processes, that is, programs in execution or running programs, just ask for a generic task to complete (like ‘give me this file’) and the kernel will go out and get it. The kernel is the first and most important program on the computer. It can access things other programs cannot. It can create and destroy processes, and it controls the way programs run. Resource usage is balanced by the kernel in order to keep everybody happy.

Functions of the kernel

The kernel provides the system with the following functions:
• Create, manage and delete processes.
• Schedule and balance resources.
• Provide access to devices.
• Handle asynchronous events.

The kernel manages resources so they can be shared simultaneously among many processes and users. Resources can be physical, like the CPU, the memory or an adapter, or virtual, like a lock or a slot in the process table.

Uniprocessor support

The 64-bit kernel is aimed at the high-end server environment and multiprocessor hardware. As a result, it is optimized strictly for the multiprocessor environment and no separate uniprocessor version is provided.

64-bit vs. 32-bit kernel

The primary purpose of the 64-bit AIX kernel is to address the fundamental need for workload scalability. This is achieved through a kernel address space which is large enough to support increases in software resources. The demands placed on the system software by customer applications will soon outstrip the existing AIX 32-bit kernel because of the 32-bit kernel’s limited address space. At 4GB, this address space is simply too small to efficiently and/or effectively handle the amount of software resources needed to support the projected 2001 workloads and hardware. In fact, a number of software resource pools within the 32-bit kernel are already under pressure from today’s application workloads.

32-bit kernel lifetime

Customers have made and will continue to make significant investments in 32-bit RS/6000 hardware systems and need system software that protects this investment. Thus, AIX also offers a 32-bit kernel. The RS/6000 software plan is to eventually drop support for the 32-bit kernel. However, support will not be withdrawn before 2002, and not before the initial 64-bit kernel release. This process is driven by end-of-life plans for 32-bit hardware systems, as well as the fact that customers require a bridge period under which both the 32-bit and 64-bit kernels are available for 64-bit hardware systems and offer the same basic functionality. This period is needed to ease migration to the 64-bit kernel.

Compatibility

Customers need system software that protects their investment in existing applications and provides binary and source compatibility. AIX 5L will therefore maintain support for existing 32-bit applications.

Kernels supported by hardware platform

The table below shows which kernels are supported on different systems. In general, a 64-bit kernel and application can only run on 64-bit hardware, but 64-bit hardware can execute 32- and 64-bit kernels and applications.

                32-bit Power          64-bit Power          Intel IA-64
32-bit Kernel   32-bit applications   32-bit applications   Not supported
64-bit Kernel   Not supported on      32-bit and 64-bit     32-bit and 64-bit
                32-bit CPUs           applications          applications

Currently, there are four different CPU types in the RS/6000 systems; only the PowerPC 604e CPU is 32-bit:

CPU            Type
PowerPC 604e   32-bit
Power3-II      64-bit
RS64 II        64-bit
RS64 III       64-bit

Binary compatibility and limitations

The 64-bit kernel offers binary compatibility to existing applications for both 32-bit and 64-bit applications. However, it does not extend to the minority of applications that are built non-shared or have intimate knowledge of internal details, such as programs accessing /dev/kmem or /dev/mem. This is consistent with the general AIX policy for these two classes of applications. In addition, binary compatibility will not be provided to applications that are dependent on existing kernel extensions that are not ported to the 64-bit kernel environment. Only 64-bit kernel extensions will be supported. This direction is taken to avoid the significant cost of providing 32-bit kernel extension support under the 64-bit kernel, and is consistent with the directions taken by other UNIX vendors such as SUN, HP, DEC and SGI. On the plus side, this direction also forces kernel extensions to migrate to the more scalable and strategic 64-bit environment (to better face the next century).


Compatibility for kernel extensions

There is no change to the compatibility provided for 32-bit kernel extensions under the 32-bit kernel. 64-bit kernel extensions will not be supported under the 32-bit kernel.

Compatibility for system calls

One important aspect of binary compatibility involves the required functional behavior of system call APIs when supplied invalid user addresses. Under today’s 32-bit kernel, this behavior differs in many ways for 32-bit and 64-bit applications. For 32-bit applications, APIs return errors (that is, the EFAULT errno) to the application if presented with an invalid address. This behavior is due to the fact that all user space accesses made under an API are performed inside the kernel, under the protection of kernel exception handling. For 64-bit applications, an invalid user address will cause a signal (SIGSEGV) to be sent to the application. This occurs because structure reshaping is done in supporting API libraries, and it is the user mode library routine that accesses the invalid user (structure) address.

Today’s kernel behavior is preserved by the 64-bit kernel for 32-bit applications but not for 64-bit applications. This is because the behavior for 64-bit applications under the 32-bit kernel will be changed and made consistent with that now provided for 32-bit applications. This is done for a number of reasons. First, it is difficult to fully preserve the present behavior for 64-bit applications. Reshaping is not required for these applications under the 64-bit kernel, so there will be no library accesses. Signals could be sent as part of kernel exception handling, but it would be hard to produce the same signal context as is seen under the 32-bit kernel. Next, the functional behaviors of 32-bit and 64-bit applications should only differ in places where there are fundamental application differences, like address space layout. Introducing different behaviors in other places only complicates matters for application writers. Finally, both the errno and signal behaviors are allowable under the standards, but the errno behavior offers a more friendly application programming model.

In order to provide a consistent behavior across kernels and applications, all structure reshaping is performed inside both kernels for both application types.
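To make the difference concrete, here is a minimal user-space sketch (the program and the invalid address are illustrative, not from the course material). For a 32-bit application, a system call handed an invalid buffer address fails with EFAULT; for a 64-bit application under the 32-bit kernel, the same mistake can instead surface as a SIGSEGV raised in the reshaping library code:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        char *bad = (char *)1;          /* deliberately invalid address */
        ssize_t rc = read(0, bad, 16);  /* ask the kernel to write there */

        if (rc == -1)
                printf("read failed: %s\n", strerror(errno));
        /* 32-bit application: read() returns -1 with errno == EFAULT.
         * 64-bit application under the 32-bit kernel: the process may
         * instead receive SIGSEGV, as described above. */
        return 0;
}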


Source compatibility

Source code compatibility is preserved for applications and 32-bit kernel extensions. Consistent with general AIX policy, this extends to makefiles (build mechanisms), but not to the small set of applications that rely upon shipped header file contents that are provided only for use by the kernel. Programs accessing /dev/mem or /dev/kmem serve as an example of such applications.

32-bit vs. 64-bit kernel performance on Power

The 64-bit kernel is intended to increase scalability of the RS/6000 product family and is optimized for running 64-bit applications on the upcoming Gigaprocessor systems (Power4, which will be announced in 2001). The performance of 64-bit applications running on the 64-bit kernel on Gigaprocessor-based systems is better than if the same application was running on the same hardware with the 32-bit kernel. This is because the 64-bit kernel allows 64-bit applications to be supported without requiring system call parameters to be remapped or reshaped. The 64-bit kernel may also be compiler-optimized specifically for the Gigaprocessor system, whereas the 32-bit kernel may be optimized to a more general platform.

32-bit application Performance on 32-bit and 64-bit kernels

The 64-bit kernel will also be optimized for 32-bit applications (to the extent possible). This is because 32-bit applications now dominate the application space and will continue to do so for some time. In fact, performance tradeoffs involving 32-bit versus 64-bit applications should be made in favor of 32-bit applications. However, 32-bit applications on the 64-bit kernel will typically perform worse than on the 32-bit kernel, because call parameter reshaping is required for 32-bit applications on the 64-bit kernel.

64-bit application and 64-bit kernel performance on non-Gigaprocessor systems

The performance of 64-bit applications under the 64-bit kernel on non-Gigaprocessor systems may be less than that of the same applications on the same hardware under the 32-bit kernel. This is due to the fact that the non-Gigaprocessor systems are intended as a bridge to Gigaprocessor systems and lack some of the support that is needed for optimal 64-bit kernel performance. In addition, efforts should be made to optimize 64-bit kernel performance for non-Gigaprocessor systems, but performance trade-offs are made in favor of the Gigaprocessor.


32-bit and 64-bit kernel extension performance on Gigaprocessor systems

The performance of 64-bit kernel extensions on Gigaprocessor systems should be the same or better than their 32-bit counterparts on the same hardware. However, the performance of the 64-bit kernel extension on non-Gigaprocessor machines may be less than 32-bit kernel extensions on the same hardware. This flows from the fact that 64-bit kernels are optimized for Gigaprocessor systems.

Kernel characteristics

Since the kernel is a program itself, it behaves almost like any other program. Its features are:
• Preemptable
• Pageable
• Segmented
• 64-bit
• Dynamically loadable

Preemptable means that the kernel can be in the middle of a system call and be interrupted by a more important task. The preemption causes a context switch to another thread inside the kernel. Some parts of the kernel are pageable, which means they are not needed in memory all the time, and can be paged out to paging space. Both the 32-bit kernel and the 64-bit kernel implement virtual address translation by using segments. In previous versions of AIX, segment registers were used to map segments to thread contexts. Now segment tables are used. The kernel can be dynamically extended with extra functionality.


Kernel states

Kernel system diagram (Power)

[Figure: the kernel system diagram. User programs and libraries sit at the user level and enter the kernel through the system call interface (or through a trap). At the kernel level, the file subsystem (with its buffer cache and character/block device drivers) and the process control subsystem (inter-process communication, scheduler, memory management) sit above the hardware control layer, which drives the hardware level.]

Roughly, there are three distinct layers:
• The user level
• The kernel level
• The hardware level

This diagram shows how the kernel is the interface between the user level and the hardware. Applications live at the user level, and they can only access hardware, like a disk or printer, through the kernel.

Process execution modes

Processes can run in two different execution modes: kernel mode and user mode. These modes are also referred to as Supervisor State and Problem State.

User mode protection domain

A process running in user mode can only affect its own execution environment and runs in the processor’s unprivileged state. In user mode, a process has read/write access to the user data in the process private segment and to the shared library data segment. It also has access to shared memory segments using the shared memory functions. A process in user mode has read access to the user text and shared library text segments. User mode processes can still use kernel functions by means of a system call. Access to functions that directly or indirectly invoke system calls is typically provided by programming libraries which give access to operating system functions.
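The crossing from user mode into kernel mode can be watched with the truss command covered later in this course. The tiny program below is an illustrative sketch; every libc call that needs a kernel service ends up issuing a system call:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* getpid() is a thin library wrapper: the thread enters kernel
         * mode (supervisor state) for the duration of the system call
         * and returns to user mode with the result */
        printf("pid: %d\n", (int)getpid());
        return 0;
}

Running it under "truss ./a.out" lists each system call made on the process's behalf, including the write() hiding behind printf().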

Kernel mode protection domain

Code running in this mode has read/write access to global kernel space and access to kernel data in the process private segment when running within the process context. Code in interrupt handlers, the base kernel and kernel extensions runs in kernel mode. If a program running in kernel mode needs to access user data, a kernel service is used to do so. Programs running in kernel mode can use kernel services, can access global system data, are exempt from all security restraints, and run in the processor privileged state.

In short:

User mode or problem state:
• User programs and applications run in this mode.
• Kernel data and global structures are protected from access/modification.

Kernel mode or supervisor state:
• Kernel and kernel extensions run in this mode.
• Can access or modify anything.
• Certain instructions are limited to supervisor state only.

The kernel state is part of the thread state, so this information is typically kept in the thread’s Machine State area (MST).


Kernel exercise

Exercise: figuring out thread state on Power

Look at the value of the Machine State Register (MSR) for the thread of interest:

# echo "mst" | kdb | grep msr
iar : 0000000000009444   msr : A0000000000010B2   cr : 31384935

From /usr/include/sys/machine.h:

#define MSR_PR  0x4000  /* Problem state */

This means that if bit 15 from the MSR is set, the thread is running in user mode, that is, when the fourth nibble from the right is either 4,5,6,7 or C,D,E,F.
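The same test can be scripted; this small sketch just applies the MSR_PR mask to the msr value captured above (the hard-coded value is the one from the kdb example):

#include <stdio.h>

#define MSR_PR 0x4000ULL   /* Problem state bit, from <sys/machine.h> */

int main(void)
{
        unsigned long long msr = 0xA0000000000010B2ULL; /* from kdb */

        printf("thread was in %s mode\n",
               (msr & MSR_PR) ? "user (problem)" : "kernel (supervisor)");
        return 0;
}

For the value shown, MSR_PR is clear (the fourth nibble from the right is 1), so that thread was executing in kernel mode.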

Exercise: figuring out thread state on IA-64

Look at the value of the Interrupt Processor State Register (IPSR) for the thread of interest. On an interrupt, if PSR.ic (Interrupt Collection) is 1, the IPSR receives the value of the PSR. The IPSR, IIP and IFS are used to restore the processor state on a Return From Interrupt (rfi). The IPSR has the same format as the PSR. IPSR.ri is set to 0 after any interruption from the IA-32 instruction set.

# iadb
(0)> ut -t
*ut_save:    0x0003ff002ff3b400
*ut_rsesave: 0x0003ff002ff3bf50
System call state:
ut_psr: 0x00001053080ee030
... more stuff ...
(0)> mst 0x0003ff002ff3b400
mst at address 0003FF002FF3B400
prev      : 0000000000000000   intpri    : INTBASE
stackfix  : 0000000000000000   backt     :
kjmpbuf   : 0000000000000000   emulator  : NO
excbranch : E000000000020A80   excp_type : EXTINT(10)
ipsr      : 00001010080AE030   isr       : 0000000000000000
iip       : E00000000000B970   ifa       : E000009729F4F22A
iipa      : E00000000000B960   ifs       : 8000000000000716
iim       : 00000000000000F4   fpsr      : 0009804C0270033F
fpowner   : LOW/HIGH           fpeu      : YES
... tons of more stuff ...
(0)> q

From /usr/include/sys/machine.h:

#define PSR_PK  15

00001010080AE030 (hex) = 100000001000000001000000010101110000000110000 (binary)

Bit 15 is set, which means that the thread has the Protection Key set, and hence is in problem state.


Kernel Limits

Most of the settings in the kernel are dynamic and do not need to be tuned. Their maximum values are chosen in such a way that they should never be reached during normal system usage. Some limits chosen as a maximum could technically be even higher. The following tables list kernel system limits as of AIX 5L Version 5.0.

Semaphores                                32-bit kernel   64-bit kernel
Maximum number of semaphore IDs           131072          131072
Maximum semaphores per semaphore ID       65535           65535
Maximum operations per semop call         1024            1024
Maximum undo entries per process          1024            1024
Size in bytes of undo structure           8208            8216
Semaphore maximum value                   32767           32767
Adjust on exit maximum value              16384           16384

Message Queues                            32-bit kernel   64-bit kernel
Maximum message size                      4 MB            4 MB
Maximum bytes on queue                    4 MB            4 MB
Maximum number of message queue IDs       131072          131072
Maximum messages per queue ID             524288          524288


Shared Memory                             32-bit kernel   64-bit kernel
Maximum region size                       2 GB            2 GB
Minimum segment size                      1               1
Maximum number of shared memory IDs       131072          131072
Maximum number of segments per process    11              268435465
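As a quick illustration of where the per-process segment limit bites, the sketch below creates and attaches one System V shared memory segment (the key and size are arbitrary). In a 32-bit process, each attachment occupies a segment of the address space, which is why so few are available per process:

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
        int   id;
        void *p;

        id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
        p  = shmat(id, NULL, 0);        /* kernel picks the segment */

        printf("shmid %d attached at %p\n", id, p);

        shmdt(p);                       /* detach ...               */
        shmctl(id, IPC_RMID, NULL);     /* ... and remove the ID    */
        return 0;
}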

There are a couple of kernel parameters which affect the availability of semaphores (semaem, semmap, semmni, semmns, semmnu, semume). Please check them by inspecting the running system, and keep in mind that other applications could also affect the availability of semaphores.

LVM                                                 32-bit kernel   64-bit kernel
Maximum number of VGs                               255             4095
Maximum number of PPs per hdisk                     1016            1016
Maximum number of LVs                               256             512
Maximum number of major numbers (see note 1)        65535           1073741823
Maximum number of VMM-mapped devices (see note 2)   1024            1024
Maximum number of disks per VG                      32              128


Filesystems                                JFS       JFS2
Maximum file system size (see note 3)      1 TB      32 PB
Maximum file size (see note 4)             64 GB     32 PB
Maximum size of log device                 256 MB    32 PB
Maximum number of file system inodes       2^24      Unlimited
Maximum number of file system fragments    2^28      N/A
Maximum number of hard links               32767     32767

Miscellaneous                              32-bit kernel   64-bit kernel
Maximum number of processes per system     131072          131072
Maximum number of threads per system       262143          262143
Maximum number of open files per system    32767           Unlimited (resource bound)
Maximum number of open files per process   32767           32767
Maximum number of threads per process      32767           32767
Maximum number of processes per user       131072          131072
Maximum physical memory size               4 GB            1 TB
Minimum physical memory size               32 MB           256 MB
Maximum value for the wall                 1 GB            4 GB

Notes:
1. Each volume group takes one major number; some are reserved for the OS and for other device drivers. Run "lvlstmajor" to see the range of free major numbers; rootvg always uses 10.
2. VMM-mapped devices are mounted JFS/CDRFS file systems, open JFS log devices, and paging spaces. Of 512, 16 are pre-reserved for paging spaces. These devices are indexed through the kernel's Paging Device Table (PDT), which is a fixed-size array.
3. To achieve 1 TB, the file system must be created with nbpi=65536 or higher and frag=4096.
4. To achieve around 64 GB files, the file system must be created with the -a bf=true flag AND the application must support files greater than 2 GB.


Kernel Limits Exercises

Checking kernel values

The purpose of this exercise is to find actual limits and settings in a running kernel. From the file /usr/include/sys/msginfo, we obtain the structure msginfo, which holds four integers. To list its content in the running kernel, we use kdb on Power and iadb on the IA-64 platform. On both systems, we display 16 bytes, equal to four integers.

/*
 * Message information structure.
 */
struct msginfo {
        int msgmax,     /* max message size                    */
            msgmnb,     /* max # bytes on queue                */
            msgmni,     /* # of message queue identifiers      */
            msgmnm;     /* max # messages per queue identifier */
};

Power

# kdb
(0)> d msginfo
msginfo+000000: 0040 0000 0040 0000 0002 0000 0008 0000
                msgmax    msgmnb    msgmni    msgmnm

IA-64

# iadb
> d msginfo 4 4
e00000000415cfb0: 00400000 00400000 00020000 00080000
                  msgmax   msgmnb   msgmni   msgmnm


64-bit Kernel base enablement

Several components of base enablement support are provided to make it possible for kernel subsystems and kernel extensions to run in 64-bit mode and use a large address space.

State management support

Support is provided for saving and restoring 64-bit kernel context, including full 64-bit GPR contents. This support also extends to the area of kernel exception handling where setjmpx() and longjmpx() must deal with 64-bit kernel context. In addition, state management is extended to include the 64-bit kernel address space as part of the kernel context.

Temporary attachment

The 64-bit kernel provides kernel subsystems and kernel extensions with the capability to change the contents of the kernel address space. This includes the capability to change segments within the address space temporarily for a specific thread of execution and is consistent with the segmented virtual memory architecture of the hardware and the legacy 32-bit kernel programming model. A total of four concurrent temporary attachments will be supported under a single thread of execution. This limitation is consistent with the limitation imposed by the 32-bit kernel and is made to restrict the amount of kernel state that must be saved and restored at context switch.
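A minimal kernel-extension sketch of the temporary attachment pattern, built around the vm_att()/vm_det() kernel services. The prototypes shown here are simplified assumptions, and the segment handle is assumed to have been created earlier by the subsystem:

    #include <sys/types.h>   /* caddr_t; vmhandle_t comes from the VMM headers */

    /* simplified prototypes, assumed here for illustration */
    extern caddr_t vm_att(vmhandle_t vmh, caddr_t offset);
    extern void    vm_det(caddr_t addr);

    int copy_from_segment(vmhandle_t seg, caddr_t off, char *dst, int len)
    {
        caddr_t base = vm_att(seg, off);   /* segment data now addressable */
        char   *src  = (char *)base;
        int     i;

        for (i = 0; i < len; i++)          /* valid only while attached */
            dst[i] = src[i];

        vm_det(base);                      /* give one of the four attach
                                              slots back to the thread */
        return 0;
    }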

Global attachment

While the temporary attachment model is maintained, the 64-bit kernel also provides a model under which subsystem data is placed within the global kernel address space and made visible to all kernel code for the entire life of its usefulness, rather than temporarily attaching segments as needed and in the context of a single thread. This global attachment model does more than allow the 64-bit kernel to provide sufficient space for subsystems to place their data in the global kernel heap. Rather, it includes the capability to place subsystem segments within the global address space. This capability is needed for two reasons:
• Different memory characteristics
• Data organized around segments


Some subsystems require virtual memory characteristics that are different from those of the kernel heap. For the most part, these characteristics are defined at the segment level and typically must be reflected by segment types that are different from those used for the kernel heap. Also, some subsystems organize their data around segments and require sizes and alignments that are inappropriate for the kernel heap.

The global attachment model is of importance for a number of reasons. First, it is more scalable than the temporary attachment model. This is particularly true for subsystems that require large portions of their data to be accessible at the same time for a single operation. As the volume of this data increases to meet workload or hardware requirements, the temporary attachment model proves impractical for these subsystems, as increasing numbers of segments must be attached and detached. An example of such a subsystem is the VMM, where page fault resolution and virtual memory kernel services require access to all page frames and segment descriptors.

The global attachment model is also of value in cases where only a small number of subsystem segments are involved. Segments are attached to the global kernel address space only once, typically at subsystem initialization, and are accessible from then on without requiring individual subsystem operations to incur the path length cost of segment attachment.

This is not to say that the global attachment model is without its own path length costs; specifically, use of this model may result in more segment lookaside buffer (SLB) reloads. This is because it provides no opportunity to prime the SLB table with virtual segment IDs (VSIDs) for soon-to-be-accessed segments. Rather, it relies upon the caching nature of the SLB table and updates SLBs with new VSIDs only when satisfying reload faults. This differs from the temporary attachment model, where VSIDs are placed in the SLB as part of segment attachment.


Finally, this model simplifies the general kernel programming model. Subsystems are not required to deal with the complexity of segments, segment offsets, or segment attachments in accessing their data. Rather, data accesses are made simply and naturally using addresses within the flat kernel address space.

The specific subsystem segments that will be placed in the kernel address space under the global attachment model include:

• Kernel Heap: Although traditionally part of the global address space, the kernel heap segments will be placed in this space through global attachment.
• File System Segments: The global segments used to hold the file and inode tables will be provided through global attachment.
• mbuf Segments: The mbuf pool has long been a part of global space, and this will continue under the 64-bit kernel.
• VMM Segments: These segments are privately attached in the legacy 32-bit kernel and hold the software page frame table, segment control blocks, paging device table, file system lockwords, external page tables, and address space map entries.
• Process and Thread Tables: Global attachment is used for the segments required for the globally addressable process and thread tables.

All segments added to the global kernel address space through global attachment will be strictly read/write for the kernel and no-access for users. In addition, unaligned accesses to these segments will not be supported and will result in a protection exception.

Data isolation

While placing subsystem data in the global kernel address space provides significant benefits, it eliminates the data isolation that is provided by the temporary attachment model. Under that model, data is typically made accessible only while running subsystem code and is not generally exposed to other subsystems. Unrelated interrupt handlers may gain accessibility to data by interrupting subsystem code; however, this exposure is more limited than that which occurs by placing data in global space, where all kernel code has accessibility. Isolation is critical for some classes of subsystem data. As a result, not all subsystem data should be placed in the global kernel address space. In particular, file systems should continue to use temporary attachments to provide isolation for user data.

Kernel address space layout

The kernel address space layout preserves the existing 32-bit and 64-bit user address space layouts that are found under the legacy 32-bit kernel. In addition, a common global kernel and per-process user address space is provided. This is required for a number of performance reasons:

• Efficient transition between kernel and user mode
• Preservation of SLBs
• Reduced complexity
• A single per-process segment table

To begin, a common address space improves the efficiency of transitions between kernel and user mode, since there is no need to switch address spaces. Next, it preserves SLBs. This is because the segments within the user and kernel address space are common, so there is no need to use separate SLBs or perform SLB invalidation at user/kernel transitions. Also, a common address space reduces the complexity and path length associated with kernel access to user space. There is no longer a need for the kernel to gain addressability to segments from a separate user address space in performing accesses, or to serialize accesses against changes in the user address space. Rather, user segments are already in place and properly serialized in the common address space. Finally, the common address space supports the efficiency of a single per-process segment table.


Temporary attachments are not included as part of the common address space. This is for a number of reasons. First, data isolation would be impacted for temporary attachments if they were placed in the common address space. This is because the attached data would be accessible in the kernel by all threads of a process rather than only by the thread that performed the temporary attachment. Second, it would be inefficient for the common address space to include temporary attachments. This is due to the fact that changes to the common address space would have to be serialized among all threads of a process.

I/O space mapping

The 64-bit kernel supports I/O space at locations below and above 4 GB within the hardware system memory map. Under the 64-bit kernel, I/O space is virtually mapped through the page translation hardware and made accessible through segments on all supported hardware system implementations. In the legacy 32-bit kernel on current hardware systems, I/O space virtual access is achieved through block address translation (BAT) registers, but this capability is not provided by the Gigaprocessor hardware.

Performance when accessing I/O addresses

The capability to place portions of I/O space within the global kernel address space must be provided to allow temporary attachment overhead to be avoided. This capability is built upon the global attachment model. Along with services to support this, other services are provided that allow portions of I/O space to be temporarily attached. However, these services form an I/O space temporary attachment model that is slightly different from the one now found under the 32-bit kernel. Specifically, I/O space mappings must be created prior to any temporary attachments and destroyed once all temporary attachments are complete. These mapping operations are performed by individual device drivers through new services and typically occur at the time of device configuration and deconfiguration. Compare this to the existing model under the 32-bit kernel, where no separate mapping operations are present.

I/O mapping in 64-bit kernel mode

The mapping operations are provided under the 64-bit kernel model for a number of reasons. The first is performance. While the 32-bit kernel model does not require I/O space to be mapped before it is attached, it does require each temporary attachment to perform some level of mapping. Under the 64-bit kernel model, each device driver maps its portion of I/O space once at initialization time and incurs no additional mapping overhead in performing temporary attachments.

Next, the presence of the mapping operations provides efficient use of system resources. I/O space is mapped in virtual memory through the page table and segments under the 64-bit kernel, and these system resources are only consumed for portions of I/O space that are actually in use. In the absence of mapping operations, the 64-bit kernel itself would have to map all of I/O space into virtual memory and possibly waste resources for unused portions. In addition to potentially wasting resources, charging the kernel with the responsibility of mapping I/O space would lead to arbitrary layouts of I/O space in virtual memory and would not support data isolation.

Finally, the interfaces for performing temporary attachments are simplified, as no I/O mapping information must be specified. This implies new interfaces for attaching and detaching from I/O space. The new I/O space temporary attachment model and its supporting services are provided not only under the 64-bit kernel but under the 32-bit kernel as well. This is required to ease the migration of 32-bit device drivers to the 64-bit kernel environment and to make it simpler to maintain 32-bit and 64-bit versions of a single device driver.

Rather than placing their respective portions of I/O space in the global kernel address space, most device drivers should continue to access I/O space through temporary attachments. This is because a large proportion of these accesses occur under interrupts and would more than likely miss the SLB table if the accesses were performed using the global attachment model. While the temporary attachment model adds overhead to I/O space accesses, it typically avoids the SLB miss performance penalty by priming the SLB table. A sketch of this map-once, attach-per-access pattern follows.
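An illustrative-only sketch of the map-once, attach-per-access model described above. The service names io_map(), io_unmap(), io_att(), and io_det() are stand-in assumptions, not the documented interfaces, and the types are simplified:

    #include <sys/types.h>

    /* hypothetical prototypes, for illustration only */
    extern long           io_map(unsigned long long bus_addr, unsigned long len);
    extern void           io_unmap(long handle);
    extern volatile char *io_att(long handle);
    extern void           io_det(volatile char *addr);

    static long adapter_io;                  /* kept for the life of the device */

    int cfg_device(unsigned long long bus_addr)
    {
        adapter_io = io_map(bus_addr, 4096); /* once, at device configuration */
        return (adapter_io != 0) ? 0 : -1;
    }

    void poke_register(int offset, char val)
    {
        volatile char *regs = io_att(adapter_io);  /* temporary attach */
        regs[offset] = val;                        /* access the device register */
        io_det(regs);                              /* detach immediately */
    }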


LP64 C language data model

The 64-bit kernel uses the LP64 (Long Pointer 64-bit) C language data model. This data model was chosen for a number of reasons. First, the LP64 data model is also used by 64-bit AIX applications, and this allows the 64-bit kernel to support these applications in a straightforward manner. Of the prevailing 64-bit data models, including ILP64 and LLP64, the LP64 data model is most consistent with the ILP32 data model used by 32-bit applications. This consistency simplifies 32-bit application support under the 64-bit kernel and allows 32-bit and 64-bit applications to be supported in fairly common ways. Next, LP64 has been chosen as the data model for the 64-bit kernel implementations provided by key UNIX vendors, including SGI, Sun, and HP. Use of a common data model simplifies matters for ISVs and enables AIX to use industry-wide solutions to some problems. Finally, with LP64 the 64-bit kernel requires no new compiler functionality and can use the existing 64-bit mode compiler.
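The practical difference between the data models is easy to see. Compiling the following in 32-bit (ILP32) and 64-bit (LP64) compiler modes gives different sizes for long and pointer types:

    #include <stdio.h>

    int main(void)
    {
        printf("int=%lu long=%lu ptr=%lu long long=%lu\n",
               (unsigned long)sizeof(int), (unsigned long)sizeof(long),
               (unsigned long)sizeof(void *), (unsigned long)sizeof(long long));
        /* ILP32: int=4 long=4 ptr=4 long long=8
           LP64:  int=4 long=8 ptr=8 long long=8 */
        return 0;
    }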

Register conventions

The register conventions used in the 64-bit kernel environment are the same as those used in the 64-bit application environment. This means that general purpose register 13 will be reserved for operating system use.

64-bit Kernel stack

Kernel stack

64-bit code has greater stack requirements than 32-bit code, for two reasons. First, the amount of stack space required to hold subroutine linkage information increases for 64-bit code, since this information is made up of register and pointer values, and these values are larger 64-bit quantities. Second, long and pointer values are 64-bit quantities for 64-bit code and consume more space when maintained as stack variables. The larger stack requirements of 64-bit code also mean that stack-related sizes under the 64-bit kernel are increased over those of the 32-bit kernel. In fact, most existing stack sizes will double.

Minimum stack size

Under the 64-bit kernel, the components of the common subroutine linkage, such as the link register and TOC pointer, are 64-bit quantities. As a result, the minimum stack frame size is 112 bytes (a 48-byte linkage area plus a 64-byte parameter save area for eight doubleword register arguments).

Process context stack size

Consistent with the 32-bit kernel, the kernel stacks for use in process context are 96 KB in size. This size should prove to be sufficient for the 64-bit kernel, since it has been found to be twice what is actually needed for the 32-bit kernel.

Interrupt stack size

The interrupt stack will be 8 KB in size under the 64-bit kernel. This size is clearly warranted, since some interrupt handlers find the 4 KB interrupt stack size of the 32-bit kernel to be insufficient.

Dynamic resource pools

To allow scalability, resource pools are allocated dynamically from the kernel heap and through separately created segments intended for this purpose. This means that some existing resource pools, like the shared memory, message queue, and semaphore ID pools, are relocated from the kernel BSS.

Kernel heap

The kernel heap is the home of most kernel data structures, and it is sufficiently large to allow subsystems to scale fixed resource pools while at the same time providing adequate space for dynamically allocated resources. To provide this, the kernel heap is expanded to encompass a larger number of segments and placed above 4 GB within the global kernel address space to accommodate its larger size.

While the kernel heap is extended and moved above 4 GB, the interfaces provided for allocation and freeing from this heap are the same as those provided under the 32-bit kernel. The use of these interfaces is pervasive, so common interfaces ease the 64-bit kernel porting effort for kernel subsystems and kernel extensions and make it simpler to support both kernels.

The kernel heap is now expanded to 16 segments, for a total of about 4 GB of allocatable space. This is more than eight times larger than the space available under the 32-bit kernel. Allocation requests are limited in size only by the amount of available heap space, rather than by some arbitrary limit. This means that the segments that make up the kernel heap are laid out contiguously within the address space, and requests for more than a segment's worth of data are granted if sufficient free space is available. It also means that a request can be satisfied with space that crosses segment boundaries.

A separate global heap reserved for the loader is provided in segment zero (that is, the kernel segment). This heap is used to hold the system call table and svc_instructions code for 32-bit applications and must be placed in segment zero, because it is the only global segment that is mapped into the 32-bit user address space. This heap is also used to hold the system call table for 64-bit applications and loader sections for kernel extensions. This data is located in the loader heap because it must be readable in user mode; this type of access is not supported for the kernel heap.
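A hedged sketch of the common allocation interface, assuming the traditional xmalloc()/xmfree() kernel services (the second argument is the alignment as a power-of-two exponent) and the kernel_heap descriptor; the header location is an assumption and varies by release:

    #include <sys/types.h>
    #include <sys/malloc.h>      /* assumed location of xmalloc/kernel_heap */

    void heap_example(void)
    {
        /* 8 KB from the kernel heap, 2^4 = 16-byte aligned */
        caddr_t p = xmalloc(8192, 4, kernel_heap);

        if (p != NULL) {
            /* ... use the buffer; the same call works on both kernels ... */
            xmfree(p, kernel_heap);
        }
    }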

CPU big- and little-endian

Memory view for big- and little-endian systems

Although both Power and IA-64 architectures support big-endian and little-endian implementations, the byte order of AIX 5L on IA-64 and AIX 5L on PowerPC differs: AIX 5L for IA-64 is little-endian, and AIX 5L for PowerPC is big-endian.

Logically, in multi-digit numbers, the leftmost digits are more significant and the rightmost less. For example, in the four-digit number 8472, the 4 is more significant than the 7. Now, when you look at system memory, you can look at it in two ways. The example shows a 100-byte memory seen both ways. Try to write the number 1234567890 at addresses 0-9 in both figures. What is the digit in the byte at address two?

[Figure: the same 100-byte memory drawn two ways: once with addresses ascending from 0 at the top (rows 00, 10, ... 90) and once with addresses descending from 99 at the top (rows 99, 89, ... 09)]

Register and memory byte order

Computers address memory in bytes while manipulating data in words (of multiple bytes). When a word is placed in memory, starting from the lowest address, there are only two options: either place the least significant byte first (known as little-endian) or place the most significant byte first (known as big-endian).

register (bit 63 .. 0):          a b c d e f g h

big-endian memory:      address  0 1 2 3 4 5 6 7
                                 a b c d e f g h

little-endian memory:   address  0 1 2 3 4 5 6 7
                                 h g f e d c b a

In the register layout shown in the figure above, “a” is the most significant byte, and “h” is the least significant byte. The figure also shows the byte order in memory. On big-endian systems, the most significant byte will be placed at the lowest memory address. On little-endian systems, the least significant byte will be placed at the lowest memory address. Power, PowerPC, most RISC-based computers, IBM 370 computers, and Internet protocol (IP) are some examples of things that use the big-endian data layout. Intel processors, Compaq Alpha processors, and some networking hardware are examples of things that use the little-endian data layout.
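A small user-space check (an illustration added here, not part of the original text) that reports which layout the running machine uses by examining the first byte of a known 32-bit value:

    #include <stdio.h>

    int main(void)
    {
        unsigned int word = 0x0a0b0c0d;
        unsigned char *first = (unsigned char *)&word;   /* lowest address */

        if (*first == 0x0a)
            printf("big-endian (for example, AIX 5L on PowerPC)\n");
        else if (*first == 0x0d)
            printf("little-endian (for example, AIX 5L on IA-64)\n");
        return 0;
    }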

Multi Processor dependent designs

Kernel lock

The kernel lock is not supported under the 64-bit kernel. This lock was originally provided to allow subsystems to deal with the pre-emptive nature of the AIX kernel on uniprocessor hardware, and it was later used as a means of ensuring correctness for non-MP-safe subsystems on MP hardware. At a minimum, all 64-bit kernel subsystems and kernel extensions must be MP-safe, with most required to be MP-efficient to meet performance requirements. As a result, the kernel lock is no longer required.

Device funneling

Under the 64-bit kernel, no support will be provided for device funneling. This means that all device drivers must be MP-safe and identify themselves as such when registering devices and interrupt handlers. Device funneling was originally provided under the 32-bit kernel so that non-MP-safe device drivers could run correctly on multiprocessor hardware with no change. However, all device drivers must change to some extent under the 64-bit kernel, and this provides the opportunity to simplify the 64-bit kernel by not providing device funneling support and instead requiring additional changes from the set of device drivers that are not MP-safe. Of the existing IBM Austin-owned device drivers, only the X.25 and graphics device drivers are not MP-safe. However, this is of no concern, since X.25 will not be provided under the 64-bit kernel and the (new) graphics drivers that will be provided in the time frame of the 64-bit kernel will be MP-safe.

Command and Utility compatibility for 32-bit and 64-bit kernels

Commands and utilities

A number of AIX-supplied commands and utilities deal directly with kernel details and require different implementations under the different kernels. Commands based upon /dev/kmem or /dev/mem serve as an example. While two different implementations may be required, the AIX-supplied commands and utilities must use a common binary. This is required to support a common system base and means that a single binary front-end must be used, but it does not dictate that only a single binary be used. In fact, two binaries make sense in cases where kernel data structures are used (like vmstat) and these data structures have different sizes or formats under 32-bit and 64-bit compilations. Rather than duplicating data structures for a single binary, both a 32-bit and a 64-bit binary version are provided; one of these serves as a front-end and executes the other when the bit-ness of the kernel does not match its own. This implementation ensures that there is one common command interface for both 32-bit and 64-bit kernel utilities. A sketch of the front-end pattern follows.
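A hedged sketch of the front-end pattern: the binary compares its own bit-ness with the kernel's and re-executes its sibling on a mismatch. The sysconf() name _SC_AIX_KERNEL_BITMODE and the sibling path are assumptions for illustration:

    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        /* 32 or 64, depending on the running kernel (assumed sysconf name) */
        long kbits = sysconf(_SC_AIX_KERNEL_BITMODE);

        (void)argc;
        if (kbits == 64) {
            execv("/usr/sbin/vmstat64", argv);  /* hypothetical 64-bit sibling */
            perror("execv");                    /* reached only on failure */
            return 1;
        }
        /* ... 32-bit implementation continues here ... */
        return 0;
    }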

Exceptions

Exceptions and interrupts distinction

The distinction between the terms "exception" and "interrupt" is often blurred. The bulk of AIX documentation refers to both classes generically as "interrupts," while the hardware documentation (like the PowerPC 60x User’s Manuals) makes the distinction. We will try to keep the terms separate.

Definition of exceptions

Exceptions are synchronous events that are normally caused by the process doing something illegal. An exception is a condition caused by a process attempting to perform an action that is not allowed, such as writing to a memory location not owned by the process, or trying to execute illegal operations. For illegal operations, the kernel traps the offending action and delivers a signal to the process causing the exception (or crashes, if the process was in kernel mode). Exceptions can also be caused by a page fault. A page fault is a reference to a virtual memory location for which the associated real data is not in physical memory.

Determine the action taken on an exception

The result of an exception is either to send a signal to the process or to crash the machine. The decision is based upon what kind of exception occurred and whether the process was executing in user mode or kernel mode:

• Exceptions are caused within the context of a process.
• A process may NOT decide how to react to the exception.
• Exception handlers are kernel code and run without regard to the process, except to cleanly handle the exception generated by the process.
• Some exceptions result in the death of the process.
• Some exception types can be found in <sys/m_except.h>.

A process can decide how to respond to the signal generated by the exception in certain cases. For example, a process can decide to catch the signal for SIGILL, which is delivered when a process in user mode executes an illegal instruction (see the sketch below).

An exception is also a mechanism to change to supervisor state as a result of:

• Program errors
• Unusual conditions
• Program requests
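A minimal user-space sketch of catching such a signal: the process installs a SIGILL handler and exits cleanly instead of being killed (raise() stands in for executing an actual illegal instruction):

    #include <signal.h>
    #include <unistd.h>

    static void on_ill(int sig)
    {
        (void)sig;
        /* only async-signal-safe work belongs here */
        _exit(42);
    }

    int main(void)
    {
        struct sigaction sa;

        sa.sa_handler = on_ill;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sigaction(SIGILL, &sa, NULL);

        raise(SIGILL);    /* stand-in for an illegal instruction */
        return 0;         /* never reached */
    }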

Branching to exception handlers

After an exception, the system switches to supervisor state and branches to an exception handler routine. The branch address is found from the content of a specific memory location called a "vector." Examples of exception vectors:

• System reset
• Machine check
• Data storage interrupt (DSI)
• Instruction storage interrupt (ISI)
• Alignment
• Program (invalid instruction or trap instruction)
• Floating-point unavailable
• Decrementer
• System call

System reset exception

The system reset exception is used when a system reset is initiated by the system administrator. This generally causes a "soft" reboot of the system.

Machine check exception

The machine check exception is generated when a hardware machine check occurs. This generally indicates either a hardware bus error or bad real address access. If a machine check occurs with the ME bit off, then a machine checkstop occurs. Generally, a machine check exception causes a kernel crash dump to be generated. A machine checkstop causes no kernel crash dump to be generated, though a checkstop record is generated.

Data storage exception

Data storage interrupt (DSI) and instruction storage interrupt (ISI) exceptions are caused by the hardware not being able to find a translation for an instruction fetch or a load/store operation. These generally result in a page fault.


Alignment exception

Alignment exceptions are generated when an instruction generates an unaligned memory operation that cannot be completed by the hardware. Which unaligned operations cannot be handled by the hardware is processor dependent. This exception generally results in AIX performing the unaligned operation with special-purpose code.

Invalid instruction exception

The program exception is generated when an illegal instruction or a trap instruction is executed. This is generally caused by a debugger breakpoint in a process being hit. This exception generally results in a call to an application or kernel debugger.

Floating point unavailable exception

The floating point unavailable exception is caused when a thread executes a floating point instruction when floating point operations are not allowed. This generally indicates that a thread has not executed any floating point instructions yet or that another thread’s floating point data is currently in the processor’s floating point registers. AIX does not save a thread’s floating point register values until it first uses the floating point registers. On UP systems, AIX does not save off floating point registers for the currently running thread when another thread is dispatched. Often, no other thread will use the floating point registers before the thread is again dispatched. This saves AIX having to save and restore the floating point registers on every thread dispatch.

Decrementer exception

The decrementer exception is caused when the decrementer register has reached the value zero. This indicates that a timer operation has completed.

System call exception

The system call exception occurs whenever a thread executes a system call.

Interrupts

Description of interrupts

Interrupts are asynchronous events that may be generated by the system or a device, and they "interrupt" the execution of the current process. Interrupts usually occur when a process is running and some asynchronous event occurs, such as a disk I/O completion or a clock tick. The event usually has nothing to do with the currently running process. The kernel immediately preempts the current running process to handle the interrupt: the state of the machine is saved and the interrupt is handled. The user process has no knowledge that the interrupt occurred.

Interrupts are one of the major reasons that AIX cannot be a hard real-time system. No guarantee can be made as to how long some action may take, as it may be interrupted any number of times along the way.

Interrupts are caused outside the context of a process. In general, a process may NOT decide how to react to the interrupt. Interrupt handlers are kernel code and run without regard to the process, unless the nature of the interrupt is to update some process-related structure, statistics, and so on.

Interrupt levels

Each interrupt has a level and an associated priority. The level is a value that is used to differentiate between interrupts; the priority ranks the importance of each one. Devices with interrupt facilities, such as adapter cards, have an interrupt level associated with them. When the system receives an interrupt with that level, AIX knows that it was caused by the device at that level. In AIX, devices may share interrupt levels, such that more than one adapter may use the same level.

Controlling Interrupts

A kernel process can disable some or all types of interrupts for short periods. The interrupted process will safely return to continue execution. Some interrupt types can be found in

Most interrupts are not concerned with which process is getting interrupted. The major counter-example is the clock interrupt, which is used to update the run-time statistics for the currently running process.


Critical sections

A critical section is a code section that must be executed without any break: for example, when data is examined and then changed based on its value. A process disables interrupts across a critical section to ensure that the section is executed without breaks. A sketch of this pattern in kernel code follows.
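A hedged kernel-side sketch of such a critical section, using the classic i_disable()/i_enable() kernel services (INTMAX requests the most favored priority; on MP systems a lock such as disable_lock() is also needed for correctness):

    #include <sys/types.h>
    #include <sys/intr.h>     /* i_disable, i_enable, INTMAX (assumed header) */

    extern int shared_counter; /* example data also touched at interrupt time */

    void bump_counter(void)
    {
        int old_priority = i_disable(INTMAX);  /* enter critical section */

        shared_counter++;                      /* no interrupts break in here */

        i_enable(old_priority);                /* restore previous priority */
    }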

Out of order instruction sets and Interrupts

On modern processors, such as Power and IA-64, many instructions are being executed at one time. When a hardware interrupt occurs, instructions are executed to completion and any following instructions are terminated with no effect on the processor registers or memory; results from out of order instructions are discarded. This is what is meant by "interrupts are guaranteed to occur between the execution of instructions." The processor makes sure that the effect of its operations are equivalent to an interrupt occurring between the execution of instructions.

Interrupt handling in AIX

Interrupt handling

When an interrupt is received, AIX performs several steps to handle the interrupt properly:

• Saves the current state of the machine.
• Determines the real handler for the interrupt.
• Calls that handler to "service" the interrupt.
• Restores the machine state if and when the handler completes.

Interrupt priorities

Interrupt priorities have no relationship to process and thread scheduling priorities. AIX associates a priority with each type of interrupt. A lower priority number means a more favored interrupt. Interrupt processing can itself be interrupted, but only by a more favored (lower priority number) interrupt. Interrupt routines usually allow themselves to be interrupted by more favored interrupts, but refuse to take less favored ones; however, interrupt routines and other programs running in kernel mode can manually raise or lower their interrupt priority. This is called "disabling or enabling interrupts." The reason for this is that a high-priority disk handler must complete before new data arrives, and it does not want to be interrupted by less favored interrupts.

Handling CPU state information at interrupt

Saving and restoring machine state

AIX maintains a set of machine state save (mstsave) areas. Each processor has a pointer to the mstsave area it should use when the next interrupt occurs. This pointer is called the current save area, or csa, pointer. When state needs to be saved, AIX:

• Saves almost all registers into the mstsave pointed to by this processor's csa.
• Gets the next available mstsave area from this processor's pool.
• Links the just-saved mstsave to the new mstsave.
• Updates this processor's csa to point to the new area.

When an interrupt handler returns, AIX must restore the machine state that was in effect when the interrupt occurred. AIX does this by:

• Reloading registers from the processor's previous mstsave area.
• Setting the processor's csa pointer to the (now unused) previous mstsave area.
• If returning to base interrupt level, generally rerunning the dispatcher to determine which thread to resume, since the interrupt might have made another thread runnable.

mstsave area description

Because the mstsave (machine state) areas are linked together, the mstsave areas provide an interrupt history stack.

csa --> mstsave  (unused; the next interrupt goes here)
          | prev
          v
        mstsave  (high priority interrupt)
          | prev
          v
        mstsave  (low priority interrupt)
          | prev
          v
        mstsave  (base interrupt level)

Whenever AIX receives an interrupt that is of higher priority than what it is currently doing, it must save the state of the machine into an mstsave area. The csa (Current Save Area) pointer points to an unused mstsave area that AIX can use if another, higher-priority interrupt comes in. This area may contain stale data from being used for a previously-handled interrupt, but its prev pointer always points to the previous mstsave area (or is null if there aren’t any more in use at that time). These areas are linked together from most-recently to least-recently used, so this means that they go from higher to lower interrupt priority. At the end of the mstsave chain is the mstsave area for the base interrupt level. This mstsave area contains the state of the machine when it was last doing something other than interrupt processing (that is, the machine state when the oldest interrupt that we are currently processing came in).

Size limitation on mstsave area and interrupt stack

The stack used by an interrupt handler is kept in the same page as the mstsave area. This limits the stack to 4 KB on the 32-bit kernel and 8 KB on the 64-bit kernel, minus the size of the mstsave area. Using this area for the stack ensures that the stack is pinned, which is required for interrupt handlers.


Saving base level machine state

The thread's base level state save area is in the initial thread's uthread block. The initial thread's ublock is in the process's ublock. In the 32-bit kernel, there is also the user64 area, which is used to save the 64-bit user registers for 64-bit processes.

[Figure: the process ublock contains the user area and the initial thread's uthread block, which holds the base level mst save area; the user64 area (32-bit kernel only) is associated with the ublock]

The user64 area is only used when the process is a 64-bit process on a 32-bit kernel. If the user64 area is being used, it is initialized and pinned. The area is created when a process calls exec() for a 64-bit executable. It is destroyed when a 64-bit process exits or calls exec() for a 32-bit executable. The portion of the base level state save area that contains the 32-bit registers is unused for 64-bit processes.

On a 32-bit kernel, only the base level state save (MST) area needs to have a 64-bit register state save area (user64) associated with it. Since all interrupt handlers run in 32-bit kernel mode, all state save areas other than the base level one need only save 32-bit state (even on 64-bit hardware). On a 64-bit kernel, all MST areas are 64-bit.


Unit 2. IA-64 Hardware Overview

This unit describes the IA-64 hardware architecture.

What You Should Be Able to Do

• List the registers available to programs
• Describe how EPIC improves performance

IA-64 Hardware Overview

Introduction to IA-64

IA-64 is Intel's 64-bit architecture, based on the Explicitly Parallel Instruction Computing (EPIC) design philosophy. These are the IA-64 goals:

• Overcome the limitations of today's architectures.
• Provide world class floating point performance.
• Support large memory needs with 64-bit addressability.
• Protect existing investments with IA-32 compatibility.
• Support growing high-end application workloads for e-business, enterprise, and technical computing.

Performance

IA-64 increases performance by using available compile-time information to reduce current performance limiters, thereby moving some of the performance burden from the microarchitecture to the compiler. This enables designing simpler processors, which are more likely to achieve higher frequencies. To achieve improved performance, IA-64 code:

• Increases instruction level parallelism (ILP)
• Improves branch handling
• Hides memory latencies
• Supports modular code

IA-64 increases ILP by providing more architectural resources: large register files and a 3-instruction-wide word. The architecture also enables the compiler/assembly writer to explicitly indicate parallelism. Branch handling is improved by providing the means to minimize branches in the code, increasing the branch prediction rate for the remaining branches, and providing specific support for typical branches. Memory latency is reduced by allowing the compiler to schedule loads earlier in the code and by enabling memory hierarchy cache management. IA-64 supports the current compiler trend to produce modular code by providing specific hardware support for function calls and returns.

IA-64 formats

Data types

The following data types are supported:

• Integer: 1, 2, 4, and 8 bytes
• Floating-point: single, double, and double-extended formats
• Pointers: 8 bytes

Integer data types

63

31

15

7

0

Floating-point Data Types

79

63

31

0

The basic IA-64 data type is 8 bytes. Apart from a few exceptions, all integer operations are on 64-bit data, and registers are always written as 64 bits. Therefore, 1-, 2-, and 4-byte operands loaded from memory are zero-extended to 64 bits.
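In C terms, the zero-extension rule matches unsigned conversion semantics, as this small illustration shows:

    #include <stdio.h>

    int main(void)
    {
        unsigned int w = 0xFFFFFFFFu;    /* what an ld4 fetches        */
        unsigned long long r = w;        /* zero-extended into 64 bits */

        printf("r = 0x%016llx\n", r);    /* prints 0x00000000ffffffff  */
        return 0;
    }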

Instruction format

A typical IA-64 instruction is a three-operand instruction with the following syntax:

[(qp)] mnemonic[.comp1][.comp2] dests = srcs

(qp)            A qualifying predicate: a predicate register indicating whether or not the instruction is executed. When the value of the register is true (1), the instruction is executed. When the value of the register is false (0), the instruction is executed as a NOP. Instructions that are not explicitly preceded by a predicate assume the first predicate register, p0, which is always true. Some instructions cannot be predicated.

mnemonic        A unique name identifying the instruction.

comp1, comp2    Some instructions may include one or more completers. Completers indicate optional variations on the basic mnemonic.

dests, srcs     Most IA-64 instructions have at least two source operands and a destination operand. Source operands are used as input; typically they are registers or immediates. The destination operand(s) is typically a register to which the result is written.

Some examples of different IA-64 instructions:

Simple instruction:            add r1 = r2, r3
Predicated instruction:        (p4) add r1 = r2, r3
Instruction with immediate:    add r1 = r2, r3, 1
Instruction with completer:    cmp.eq p3 = r2, r4

IA-64 memory

Memory organization

IA-64 defines a single, uniform, linear address space of 2^64 bytes, which is divided into 8 regions of size 2^61. A single space means that both data and instructions share the same memory range. Uniform means that there are no address regions with predefined functionality. Linear means that the address space contains no segments; all 2^64 bytes are consecutive.

All code is stored in little-endian byte order in memory. Data is typically stored in little-endian byte order as well, but IA-64 also provides support for big-endian code and operating systems.

Moving data between registers and memory is performed strictly through the load (ld) and store (st) instructions. IA-64 supports loads and stores of all data types. Because registers are written as 64 bits, loads are zero-extended. Stores always write the exact number of bytes for the required format. The size of the memory location is specified in the opcode as a number:

• st1/ld1 = byte (8 bits)
• st2/ld2 = halfword (16 bits)
• st4/ld4 = word (32 bits)
• st8/ld8 = doubleword (64 bits)

Examples:

// Loads 32 bits from address 4 + r30 into r31;
// the high 32 bits are cleared on a 64-bit processor
add r31 = 4, r30
ld4 r31 = [r31]

// Stores 64 bits from r3 to address r29 - 8
add r24 = -8, r29
st8 [r24] = r3

// Loads 8 bits from address 0x27 + r1 into r3
add r2 = 0x27, r1
ld1 r3 = [r2]

Region Usage

On IA-64, the 64-bit linear address space consists of 8 regions of size 2^61, with the upper 3 bits of the address selecting a virtual region, a physical region register, and an associated region identifier. The region identifier (RID), much like the POWER segment identifier (SID), participates in the hardware address translation, such that in order to share the same address translation, the same RID must be used. The sharing semantic (private, globally shared, shared-by-some) is determined by whether or not multiple processes utilize the same RID. For example, a process's private storage resides within a region whose RID is mapped only by that process. Therefore, address space usage is in large part determined by assigning the desired sharing semantics to each of the 8 virtual regions and mapping the appropriate objects into the regions with those semantics.

There are two important properties associated with this region usage. First, the mapping of objects to regions is many-to-one; that is, multiple objects map into a single region. Second, mapping the same object to different regions results in aliases. This is a distinct difference from the POWER architecture, where an object (a.k.a. SID) is addressed the same regardless of the virtual address used. Aliases simply create additional address translations on IA-64, and thus a likelihood of decreased performance, so their use should be minimized.

Another significant departure from AIX is that the majority of the 64-bit address space is managed using Single Address Space (SAS) semantics. This is necessary to achieve the desired degree of sharing of address translations for shared objects: to achieve a single translation for an object, all accesses must be made through a common global address. Such a semantic is possible by virtue of the IA-64 protection keys, which provide additional access control beyond address translations. So, a process that maps a region only has accessibility to those objects within that region for which it has the appropriate protection key. Note that AIX manages some parts of the process address space as SAS; for example, the shared library text segment contains mappings whose addresses are common across all processes. The AIX use of the SAS style of management is minimal because the POWER architecture provides for sharing on a segment basis regardless of the virtual address used to map the segment. To achieve the same degree of sharing on IA-64, a shared object must be mapped at a global address.


In addition to the sharing semantics, there are additional properties that influence the location of objects within regions. First, to preserve the flat address space with a logical boundary between user and kernel space, it is useful to place user and kernel objects at opposite ends of the address space whenever feasible. Next, the IA-64 architecture provides for multiple page sizes and a preferred page size per region, so objects with similar page size requirements are most naturally colocated within the same region. Finally, certain object types, such as executable text, have properties and uses which mandate that they be isolated to a separate region. Given these general guidelines, the following table shows the selected region usage; subsequent sections describe each region use in greater detail. These selections provide 4 regions dedicated to user space and 3 to the kernel for the initial release.

VRN   Style     Name      Example Uses
0     MAS       Private   process data, stack, heap, mmap, ILP32 shared library
1     SAS/MAS   Text      private text, ILP32 main text, u-block, kernel thread stacks/msts
2     SAS                 LP64 shared library text, LP64 main text
3     SAS                 LP64 shmat
4     n/a                 LP64 shmat w/ large superpage (reserved)
5     SAS       Temp      kernel temporary attach, global buffer pool
6     SAS       Kernel2   kernel global w/ large page size
7     SAS       Kernel    kernel global

Table: Virtual Region Usage


IA-64 Instructions

Instruction level parallelism (ILP)

IA-64 enables improving instruction level parallelism (ILP) by:

• Enabling the compiler/assembly writer to explicitly indicate parallelism.
• Providing a three-instruction-wide word, called a bundle, that facilitates parallel processing of instructions.
• Providing a large number of registers, enabling the use of different registers for different variables and avoiding register contention.

Parallel Instruction Processing

IA-64 instructions are bound in instruction groups. An instruction group is a set of instructions which have no read-after-write (RAW) or write-after-write (WAW) dependencies between them and may execute in parallel. In any given clock cycle, the processor executes as many instructions from one instruction group as it can, according to its resources. An instruction group must contain at least one instruction; the number of instructions in an instruction group is not limited. Instruction groups are indicated in the code by cycle breaks (;;) placed in the code by the assembly writer or compiler. An instruction group may also be ended dynamically during run time by a taken branch.

Instruction groups reduce the need to optimize the code for each new microarchitecture. Processors with additional resources will take advantage of the existing ILP in the instruction group.

Instruction groups and bundles

Instruction groups are composed of 41-bit instructions contained in bundles. Each bundle contains three instructions and a template field, which are set during code generation by a compiler or the assembler. The code generation process ensures instruction group assignment without RAW or WAW dependency violations within the instruction group. The template field maps each instruction to an execution unit. This allows the processor to dispatch all three instructions in parallel. Bundles are aligned at 16-byte boundaries.

Bundle structure (128 bits):

127                87 86                46 45                 5 4         0
+--------------------+--------------------+--------------------+----------+
| instruction slot 2 | instruction slot 1 | instruction slot 0 | template |
+--------------------+--------------------+--------------------+----------+

Template

The set of templates defines the combinations of functional units that can be invoked by executing a single bundle. This in turn lets the compiler schedule the functional units in an order that avoids contention. The template can also indicate a stop. The 24 available templates are listed below, where:

M - memory function
I - integer function
F - floating-point function
B - branch function
L - function involving a long immediate
"s" - indicates a stop

MII    MIIs    MIsI   MIsIs   MLX*   MLXs*
MMI    MMIs    MsMI   MsMIs   MFI    MFIs
MMF    MMFs    MIB    MIBs    MBB    MBBs
BBB    BBBs    MMB    MMBs    MFB    MFBs

* L+X is an extended type that is dispatched to the I-unit.

The template field can end the instruction group either at the end of the bundle or in the middle of the bundle.


Instruction set


A basic IA-64 instruction has the following syntax:

    [qp] mnemonic[.comp] dest = srcs

where:

qp        Specifies a qualifying predicate register. The value of the qualifying predicate determines whether the results of the instruction are committed in hardware or discarded. When the value of the predicate register is true (1), the instruction executes, its results are committed, and any exceptions that occur are handled as usual. When the value is false (0), the results are not committed and no exceptions are raised. Most IA-64 instructions can be accompanied by a qualifying predicate.

mnemonic  Specifies a name that uniquely identifies an IA-64 instruction.

comp      Specifies one or more instruction completers. Completers indicate optional variations on a base instruction mnemonic. Completers follow the mnemonic and are separated by periods.

dest      Represents the destination operand(s), which is typically the result value(s) produced by an instruction.

srcs      Represents the source operands. Most IA-64 instructions have at least two source operands.
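For instance, a predicated instruction with a completer might look like this (a sketch; the predicate and register numbers are arbitrary):

    (p3)  ld8.nt1  r6 = [r5]    // qp = (p3), mnemonic = ld8, completer = .nt1
                                // (a non-temporal cache hint); the load executes
                                // and commits only when p3 is 1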


Branch instructions


All instructions beginning with "br." are branches. The IA-64 architecture provides three branch types:
• Relative direct branches, using a 21-bit displacement that is added to the instruction pointer of the bundle containing the branch.
• Long branches, which go to an explicit address using a 60-bit displacement from the current instruction pointer.
• Indirect branches, using 64-bit addresses held in the branch registers.

IA-64 allows multiple branches to be evaluated in parallel; the first branch whose predicate is true is taken. Extended mnemonics are defined by the assembler to cover most combinations: br.cond, br.call, br.ia, br.ret, br.cloop, br.ctop, br.cexit. Branch prediction hints can be provided as part of a branch instruction, or with separate branch predict instructions (brp).
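As an illustrative sketch (the label, hint completers, and register numbers are chosen arbitrarily), a conditional branch and a direct call:

         cmp.ne  p6, p0 = r4, r0 ;;    // p6 = (r4 != 0)
    (p6) br.cond.dptk  skip            // relative branch, "dynamic taken" hint
         br.call.sptk  b0 = func       // direct call; return address saved in b0
    skip: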


IA-64 Registers

IA-64 provides several register files that are visible to the programmer:
• 128 General registers
• 128 Floating-point registers
• 64 Predicate registers
• 8 Branch registers
• 128 Application registers
• Instruction Pointer (IP) register

Registers are referred to by a mnemonic denoting the register type and a number. For example, general register 32 is named r32.

[Figure: the IA-64 register files: general registers gr0-gr127 (64 bits plus NaT bit), floating-point registers fr0-fr127 (82 bits), predicate registers pr0-pr63 (1 bit), branch registers br0-br7 (64 bits), application registers ar0-ar127 (64 bits), and the 64-bit instruction pointer.]


General registers

IA-64 provides 128 64-bit general purpose registers for all integer and multimedia computation.

[Figure: the general register file gr0-gr127; each register is 64 bits wide with an associated NaT bit.]

• Register gr0 is a read-only register and is always zero (0).
• 32 registers (gr0-gr31) are static and global to the process.
• 96 registers (gr32-gr127) are stacked. These registers are used for argument passing and the local register stack frame. A portion of these registers can also be used for software pipelining.

Each register has an associated NaT bit, indicating whether the value stored in the register is valid.


Floating-point registers

IA-64 provides 128 82-bit floating-point registers for floating-point computations. All floating-point registers are globally accessible within the process. There are:

• 32 static floating-point registers (fr0-fr31)
• 96 rotating floating-point registers (fr32-fr127), for software pipelining

The first two registers (fr0 and fr1) are read-only:
• fr0 is read as +0.0
• fr1 is read as +1.0

Each register contains three fields:
• 64-bit significand field
• 17-bit exponent field
• 1-bit sign field


Predicate registers


64 one-bit predicate registers control the execution of instructions. When the value of a predicate register is true (1), the instruction is executed. The predicate registers enable:

• validating or invalidating instructions
• eliminating branches in if/then/else logic blocks

There are:
• 16 static predicate registers (pr0-pr15)
• 48 rotating predicate registers (pr16-pr63) for controlling software pipelining

An instruction that is not explicitly preceded by a predicate defaults to the first predicate register, pr0, which is read-only and always true (1).

Whenever a program reaches a branch condition, such as an if-then-else construct, the outcome of the condition determines which path executes. Branch prediction has been the traditional solution: the processor tries to predict which path will be taken and executes it in advance. If the prediction is wrong, a performance penalty is paid, because the speculatively executed path must be discarded and the other path executed instead. With predication, IA-64 can instead execute both paths in parallel, using the predicate registers to discard the results of the path not taken, avoiding the misprediction penalty for such branches.
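A minimal if-conversion sketch (arbitrary registers): the branch in "if (r4 == r5) r6 = r7 + r8; else r6 = r7 - r8;" disappears entirely:

         cmp.eq  p1, p2 = r4, r5 ;;    // p1 = (r4 == r5), p2 = the complement
    (p1) add     r6 = r7, r8           // "then" arm, committed only when p1 is 1
    (p2) sub     r6 = r7, r8           // "else" arm, committed only when p2 is 1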


Branch registers

Eight 64-bit branch registers (br0-br7) are used to specify the branch target addresses for indirect branches. The branch registers streamline call/return branching.

IA-64 improves branch handling by:
• providing the means to minimize branches in the code through the use of qualifying predicates
• providing support for special branch instructions

A qualifying predicate is a predicate register indicating whether or not the instruction is executed. When the value of the register is true (1), the instruction is executed. When the value is false (0), the instruction executes as a NOP. Instructions not explicitly preceded by a predicate assume the first predicate register, p0, which is always true.

Predication enables you to convert a control dependency into a data dependency, thus eliminating branches in the code. An instruction is control dependent if it depends on a branch instruction to execute. Instructions are data dependent if the first produces a result that the second uses, or if the second instruction is data dependent on the first through a third instruction. Dependent instructions cannot be executed in parallel, and you cannot change their execution sequence.


Application registers

128 special-purpose registers (ar0-ar127) are used for various functions. Some of the more commonly used application registers have assembler aliases. For example, ar66 is used as the Epilogue Counter (EC) and is called ar.ec.

    ar0-ar7   KR0-KR7    kernel registers
    ar16      RSC        register stack configuration
    ar17      BSP        backing store pointer
    ar18      BSPSTORE   backing store pointer for memory stores
    ar19      RNAT       RSE NaT collection
    ar32      CCV        compare and exchange compare value
    ar36      UNAT       user NaT collection
    ar40      FPSR       floating-point status register
    ar44      ITC        interval time counter
    ar64      PFS        previous function state
    ar65      LC         loop count
    ar66      EC         epilogue count

Instruction pointer (IP)

The 64-bit instruction pointer holds the address of the bundle of the currently executing instruction. The IP cannot be directly read or written; it advances as instructions are executed, and branch instructions set it to a new value. The IP is always 16-byte aligned.


Register validity

Speculative memory access creates a need to delay exception handling. This is enabled by propagating exception conditions.

Each general register has a corresponding NaT (Not a Thing) bit. The NaT bits propagate the validity or invalidity of a speculative load result. Floating-point registers use a special instance of pseudo-zero, called NaTVal. NaTVal is a floating-point register value used to propagate valid/invalid results of speculative loads of floating-point data.

When data must travel from memory to the processor, there is always a delay; this is called memory latency. In an attempt to hide this time, the processor tries to read memory ahead of need. If data has been read in advance and other data has then been written back to that exact location, the data already read becomes invalid.


IA-64 Operations

Software pipelining loops

Loop performance is traditionally improved through software techniques. However, these techniques entail significant additional code:
• Loop unrolling requires multiple copies of the original loop in the unrolled loop. The loop instructions are replicated and the end code adjusted to eliminate the branch.
• Software pipelining requires adding prolog code to fill the execution pipe and epilog code to drain it.

Software pipelining is a method that enables the processor to execute, at any given time, several instructions in various stages of the loop. IA-64 provides hardware support for software pipelining loops, eliminating the need for additional prolog and epilog code, through the use of:
• special branch instructions
• the loop count (LC) and epilogue count (EC) application registers
• rotating registers

Rotating registers are registers that are rotated by one register position on each loop iteration. The logical names of the registers are rotated in a wrap-around fashion, so that logical register X becomes logical register X+1 after one rotation. The predicate, floating-point, and general registers can be rotated.

One example of the special branch instructions is br.cloop, used for simple counted loops. The cloop branch uses the LC application register, not a qualifying predicate, to determine the branch condition: it checks whether LC is zero, and if it is not, it decrements LC and takes the branch. After the last iteration LC is zero and the branch is not taken, avoiding a branch misprediction.
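A sketch of a simple counted loop using br.cloop (the iteration count and hint completer are arbitrary; the loop body is omitted):

        mov  ar.lc = 9            // LC = 9 gives 10 iterations
    loop:
        // ... loop body ...
        br.cloop.sptk  loop ;;    // LC != 0: decrement LC and branch back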


Reduced memory access costs


As current processors increase in speed and parallelism, more scheduling opportunities are lost while memory is accessed. IA-64 eliminates many memory accesses through the use of large register files to manage work in progress, and by allowing better control of the memory hierarchy. Furthermore, the cost of the remaining memory accesses is dramatically reduced by moving load instructions earlier in the code. This hides memory latency: the time that elapses between the issue of a load instruction and the moment the result of that instruction can be used. Issuing loads early enables the processor to bring the data in on time and avoid stalling. Memory latency is hidden through the use of:
• Data speculation - executing an operation before its data dependency is resolved.
• Control speculation - executing an instruction before its control dependency is resolved.

[Figure: hiding memory latency by hoisting an early load above a dependency and checking validity at the original load point.]

The large number of registers in IA-64 enables multiple computations to be performed without having to store temporary data in memory. This reduces the number of memory accesses.


Memory access is supported through the load (ld) and store (st) instructions. All other integer, floating-point, and branch instructions use registers as operands.

IA-64 lets you hide the memory latency of the remaining load instructions by placing speculative loads ahead of coding barriers, minimizing the stall caused by memory latency and exposing more opportunities for parallelism. When you use speculative loads, error/exception detection is deferred until the final result is actually required:
• If no error/exception is detected, the latency is hidden.
• If an error/exception is detected, the memory accesses and dependent instructions must be redone by an exception handler.

IA-64 provides an advanced load instruction (ld.a) that lets you move potentially data-dependent loads earlier in the code. To verify the data speculation, a check load instruction (ld.c) must be placed at the location of the original load instruction. If the contents of the memory address have not changed since the advanced load, the speculation succeeded and the memory latency is hidden. If the contents of the memory address have been changed by a store instruction, the ld.c instruction repeats the load. Data speculation does not defer exceptions; page faults, for example, are taken immediately.

IA-64 also provides a control-speculative load instruction (ld.s), which executes the load while speculating on the outcome of the governing branch. Control-speculative loads are also referred to simply as speculative loads. To verify the load, a check instruction (chk.s) is placed at the location of the original load. IA-64 uses a NaT bit/NaTVal to track the success of the load. If the NaT bit/NaTVal indicates a deferred exception, the chk.s instruction jumps to correction code that repeats all dependent instructions. The correction code is generated by the compiler or assembly writer. If the load is successful, the speculation succeeded and the memory latency is hidden.
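A sketch of data speculation (registers are arbitrary; at compile time it is unknown whether the store aliases the load):

        ld8.a     r6 = [r8]      // advanced load, hoisted above the store
        // ... other useful work ...
        st8       [r9] = r7      // store that might write to [r8]
        ld8.c.clr r6 = [r8]      // check load: reloads only if [r8] changed
        add       r5 = r6, r4    // use of the (now verified) value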


Finally, there is a combined speculative load (ld.sa), which enables placing a load before both a control and a data barrier. Use this type of speculative load to advance a load around a procedure call. To verify the speculation, a special check instruction (chk.a) is placed at the location of the original load instruction. If the load is successful, the speculation succeeded and the memory latency is hidden. If an exception was generated, or the data was invalidated, the chk.a instruction jumps to correction code that repeats all dependent instructions. The correction code is generated by the compiler or assembly writer.

Procedure calls

The traditional use of a procedure stack in memory for procedure call management carries a large overhead. IA-64 uses the general register stack for procedure call management, eliminating the frequent memory accesses. The general register stack consists of 96 general registers, starting at r32, used to pass parameters to the called procedure and to store local variables for the currently executing procedure. This register stack structure allows:
• the caller procedure to pass parameters through registers to the called procedure
• dynamic allocation of local registers for the currently executing procedure
• allocation of a maximum of 96 logical registers for each function

    IA-32                                   IA-64
    Procedure A                             Procedure A
        call B                                  call B
        ...                                     ...
    Procedure B                             Procedure B
        save current register state             alloc (no save!)
        ...                                     ...
        restore previous register state         (no restore!)
        return                                  return


The general register stack is divided into two subsets:

• Static: The first 32 physical registers (r0-r31) are permanent registers, visible to all procedures, in which global variables are placed.

• Stacked: The other 96 physical registers (r32-r127) behave like a stack. The procedure code allocates up to 96 input, local, and output registers for a procedure frame. An integral mechanism ensures that a stack overflow or underflow never occurs.

As each procedure frame is allocated, the previous frame is hidden, and the first register in the new frame is renamed logical register r32. Using small register frames eliminates or reduces the need to save and restore registers to and from memory when allocating a new register stack frame.

When a procedure call is executed, the called procedure receives a procedure frame that contains the output registers of the caller as its input. The called procedure can resize the frame to include its own input, local, and output areas, using the alloc instruction. For each subsequent call this sequence is repeated and a new procedure frame is created. When the procedure returns, the processor unwinds the register stack: the current frame is released and the previous procedure's frame is restored. A sketch of this sequence follows.
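A sketch of a procedure prologue and epilogue using alloc (the frame sizes and register choices are arbitrary):

    foo:
        alloc  r34 = ar.pfs, 2, 2, 1, 0   // 2 in (r32-r33), 2 local (r34-r35), 1 out (r36)
        mov    r35 = b0                   // save the return address in a local
        // ... body; an outgoing argument would go in r36 ...
        mov    b0 = r35                   // restore the return address
        mov    ar.pfs = r34               // restore the previous function state
        br.ret.sptk  b0                   // return; the caller's frame is restored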


Register stack engine

Using a register stack reduces the need to perform memory saves. However, when a procedure tries to use more physical registers than remain on the stack, a register stack overflow could occur.

IA-64 uses a hardware mechanism called the Register Stack Engine (RSE), which operates transparently in the background to ensure that an overflow does not occur and that the contents of the registers are always available. The RSE is not visible to software.

When the stack fills up, the RSE saves stacked registers to memory, freeing them. The stored registers are restored from memory in the same way when necessary. Through this mechanism, the RSE presents what appears to be an unlimited number of physical registers for allocation.

[Figure: the RSE spilling and refilling stacked registers (gr32-gr127) to and from a backing store in memory.]


Floating point and multimedia


IA-64 provides high floating-point performance with full IEEE floating-point support for single, double, and double-extended formats. Special support is also provided for multimedia, or data-parallel, applications:
• integer data and SIMD computations, similar to the MMX[tm] technology
• floating-point data and SIMD-FP computations, similar to the IA-32 Streaming SIMD Extensions

These floating-point features help improve IA-64 floating-point performance:
• 128 floating-point registers.
• A multiply-and-accumulate instruction (fma), with four different floating-point registers for operands (f = a * b + c). This instruction performs a multiply and an add in the same number of cycles as one add or multiply instruction (see the sketch below).
• Load and store to and from memory. You can also load from memory into two floating-point registers.
• Data transfer between floating-point and general registers.
• Multiple status fields in the status register, enabling speculation on floating-point operations.
• Quick conversion from integer to floating-point and vice versa.
• Rotating floating-point registers.
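As a one-line sketch (arbitrary registers), a fused multiply-add:

        fma.s  f6 = f7, f8, f9 ;;    // f6 = f7 * f8 + f9, rounded to single precision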


Integer multimedia support is provided by defining a set of instructions that treat the general registers as 8x8-, 4x16-, or 2x32-bit elements, and by providing specific instructions for operating on these data elements. IA-64 multimedia support is semantically compatible with the MMX[tm] technology. Three major types of instructions are provided:

• Addition and subtraction (including 3 forms of saturating arithmetic), performed element-wise:

        a3      a2      a1      a0
    +   b3      b2      b1      b0
    ---------------------------------
        a3+b3   a2+b2   a1+b1   a0+b0

• Multiplication, left shift, and signed and unsigned right shift
• Pack and unpack, to convert between different element sizes

Floating-point multimedia is provided through a set of instructions that treat the floating-point registers as 2x32-bit elements.

IA-64 provides 128 82-bit floating-point registers, while the floating-point data type is 80 bits; intermediate computation values can contain 82 bits. This enables software divide and square-root computation, comparable to hardware functions, while taking advantage of wide machines. These fast software divides and square roots result in valid 80-bit IEEE values.

Floating-point register layout:
    bit  81:     sign
    bits 80-64:  exponent
    bits 63-0:   significand


For floating-point multimedia operations, the floating-point register is divided as shown below:

    bits 81-64:  exponent
    bits 63-32:  single-precision FP element 1
    bits 31-0:   single-precision FP element 0

IA-64 provides four separate status fields (sf0-sf3), enabling four different computational environments. Each status field contains dynamic control and status information for floating-point operations. The FPSR contains the four status fields and a traps field that enables masking the IEEE exception events and denormal operand exceptions. The register also includes 6 reserved bits, which must be 0.

Floating-point status register (64 bits):
    traps field: 6 bits    sf0, sf1, sf2, sf3: 13 bits each    reserved: 6 bits


Multimedia instructions

Multimedia instructions treat the general registers as concatenations of eight 8-bit, four 16-bit, or two 32-bit elements. They operate on each element independently and in parallel. The elements are always aligned on their natural boundaries within a general register. Most multimedia instructions are defined to operate on multiple element sizes. Three classes of multimedia instructions are defined: arithmetic, shift and data arrangement.
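For instance (a sketch with arbitrary registers), a parallel add on four 16-bit elements:

        padd2  r6 = r7, r8 ;;    // four independent 16-bit adds in one instruction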

Processor Abstraction Layer (PAL)

IA-64 firmware consists of three major components:
• Processor Abstraction Layer (PAL)
• System Abstraction Layer (SAL)
• Extensible Firmware Interface (EFI) layer

PAL provides a consistent firmware interface that abstracts processor implementation-specific features. The System Abstraction Layer (SAL) is a firmware layer that isolates the operating system and other higher-level software from implementation differences in the platform, while PAL is the firmware layer that abstracts the processor implementation.

[Figure: the IA-64 firmware stack. The operating system sits on top and interacts with the Extensible Firmware Interface (EFI) for OS boot selection and EFI procedure calls, with the System Abstraction Layer (SAL) for SAL procedure calls and access to platform resources, and with the Processor Abstraction Layer (PAL) for PAL procedure calls. Non-performance-critical hardware events (for example reset and machine checks) enter PAL directly, which transfers to SAL entry points; performance-critical hardware events such as interrupts transfer to OS entry points.]


Interrupts


Interrupts are events that occur during IA-32 or IA-64 instruction processing, causing control to pass to an interrupt handling routine. In the process, certain processor state is saved automatically by the processor. Upon completion of interrupt processing, a return from interrupt (rfi) is executed, which restores the saved processor state; execution then proceeds with the interrupted IA-32 or IA-64 instruction.

From the viewpoint of response to interrupts, the processor behaves as if it were not pipelined. That is, it behaves as if a single IA-64 instruction (along with its template), or a single IA-32 instruction, is fetched and then executed. Any interrupt conditions raised by the execution of an instruction are handled at execution time, in sequential instruction order. If there are no interrupts, the next IA-64 instruction and its template, or the next IA-32 instruction, are fetched.

Interrupt definitions

Depending on how an interrupt is serviced, interrupts are divided into IVA-based interrupts and PAL-based interrupts:
• IVA-based interrupts are serviced by the operating system. They are vectored to the Interrupt Vector Table (IVT) pointed to by CR2, the IVA control register.
• PAL-based interrupts are serviced by PAL firmware, system firmware, and possibly the operating system. They are vectored through a set of hardware entry points directly into PAL firmware.

Interrupts are divided into four types: Aborts, Interrupts, Faults, and Traps.

Aborts
The processor has detected a machine check (internal malfunction) or a processor reset. Aborts can be either synchronous or asynchronous with respect to the instruction stream. An abort may cause the processor to suspend the instruction stream at an unpredictable location with partially updated register or memory state. Aborts are PAL-based interrupts.


Machine Checks (MCA)

A processor has detected a hardware error which requires immediate action. Based on the type and severity of the error the processor may be able to recover from the error and continue execution. The PALE_CHECK entry point is entered to attempt to correct the error.

Processor Reset (RESET)

A processor has been powered-on or a reset request has been sent to it. The PALE_RESET entry point is entered to perform processor and system self-test and initialization.

External device Interrupts

An external or independent entity (e.g. an I/O device, a timer event, or another processor) requires attention. Interrupts are asynchronous with respect to the instruction stream. All previous IA-32 and IA-64 instructions appear to have completed. The current and subsequent instructions have no effect on machine state. Interrupts are divided into Initialization interrupts, Platform Management interrupts, and External interrupts. Initialization and Platform Management interrupts are PAL-based interrupts; external interrupts are IVA-based interrupts.

Initialization Interrupts (INIT)

A processor has received an initialization request. The PALE_INIT entry point is entered and the processor is placed in a known state.

Platform Management Interrupts (PMI)

A platform management request to perform functions such as platform error handling, memory scrubbing, or power management has been received by a processor. The PALE_PMI entry point is entered to service the request. Program execution may be resumed at the point of interrupt. PMIs are distinguished by unique vector numbers. Vectors 0 through 3 are available for platform firmware use and are present on every processor model. Vectors 4 and above are reserved for processor firmware use. The size of the vector space is model specific.


External Interrupts (INT)

A processor has received a request to perform a service on behalf of the operating system. Typically these requests come from I/O devices, although the requests could come from any processor in the system including itself. The External Interrupt vector is entered to handle the request. External Interrupts are distinguished by unique vector numbers in the range 0, 2, and 16 through 255. These vector numbers are used to prioritize external interrupts. Two special cases of External Interrupts are Non-Maskable Interrupts and External Controller Interrupts.

Non-Maskable Interrupts (NMI)

Non-Maskable Interrupts are used to request critical operating system services. NMIs are assigned external interrupt vector number 2.

External Controller Interrupts (ExtINT)

External Controller Interrupts are used to service Intel 8259A-compatible external interrupt controllers. ExtINTs are assigned locally within the processor to external interrupt vector number 0.

Faults

A fault occurs when the current IA-64 or IA-32 instruction requests an action that cannot or should not be carried out, or when system intervention is required before the instruction is executed. Faults are synchronous with respect to the instruction stream. The processor completes state changes that occurred in instructions prior to the faulting instruction; the faulting and subsequent instructions have no effect on machine state. Faults are IVA-based interrupts.


Traps


A trap occurs when the IA-32 or IA-64 instruction just executed requires system intervention. Traps are synchronous with respect to the instruction stream. The trapping instruction and all previous instructions are completed; subsequent instructions have no effect on machine state. Traps are IVA-based interrupts.

Interrupt classification:

    Aborts       RESET, MCA               PAL-based interrupts
    Interrupts   INIT, PMI                PAL-based interrupts
                 INT (NMI, ExtINT, ...)   IVA-based interrupts
    Faults                                IVA-based interrupts
    Traps                                 IVA-based interrupts


Interrupt programming model


When an interrupt event occurs, hardware saves the minimum processor state required to enable software to resolve the event and continue. The state saved by hardware is held in a set of interrupt resources, and together with the interrupt vector gives software enough information to either resolve the cause of the interrupt or surface the event to a higher level of the operating system. Software has complete control over the structure of the information communicated and the conventions between the low-level handlers and the high-level code. Such a scheme allows software, rather than hardware, to dictate how to best optimize performance for each of the interrupts in its environment.

The same basic mechanisms are used in all interrupts to support efficient IA-64 low-level fault handlers for events such as a TLB fault, a speculation fault, or a key miss fault. On an interrupt, the state of the processor is saved to allow an IA-64 software handler to resolve the interrupt with minimal bookkeeping or overhead. The banked general registers provide an immediate set of scratch registers to begin work; for low-level handlers (such as a TLB miss) software need not open up register space by spilling registers to either memory or control registers.

Upon an interrupt, asynchronous events such as external interrupt delivery are disabled automatically by hardware, to allow IA-64 software either to handle the interrupt immediately or to safely unload the interrupt resources and save them to memory. Software will either deal with the cause of the interrupt and rfi back to the point of the interrupt, or establish a new environment and spill processor state to memory to prepare for a call to higher-level code. Once enough state has been saved (such as the IIP, IPSR, and the interrupt resources needed to resolve the fault), the low-level code can re-enable interrupts by restoring the PSR.ic bit and then the PSR.i bit. Since there is only one set of interrupt resources, software must save any interrupt resource state the operating system may require prior to unmasking interrupts or performing an operation that may raise a synchronous interrupt (such as a memory reference that may cause a TLB miss).


PSR.ic interrupt state collection bit


The PSR.ic (interrupt state collection) bit supports an efficient nested interrupt model. Under normal circumstances the PSR.ic bit is enabled. When an interrupt event occurs, the various interrupt resources are overwritten with information pertaining to the current event. Prior to saving the current set of interrupt resources, it is often advantageous in a miss handler to perform a virtual reference to an area which may not have a translation. To prevent the current set of resources from being overwritten on a nested fault, the PSR.ic bit is cleared on any interrupt. This will suppress the writing of critical interrupt resources if another interrupt occurs while the PSR.ic bit is cleared. If a data TLB miss occurs while the PSR.ic bit is zero, then hardware will vector to the Data Nested TLB fault handler.


Unit 3. Power Hardware Overview

Objectives

The objectives for this lesson are:
• Provide an overview of the e-server p-series systems and their processors.
• List the registers available to the program and describe their internal use.


Power Hardware Overview

e-server p-series or RS/6000 introduction

This section introduces the RS/6000, giving a brief history of the products, an overview of the RS/6000 design, and a description of key RS/6000 technologies. The RS/6000 family combines the benefits of UNIX computing with IBM's leading-edge RISC technology in a broad product line: from powerful desktop workstations ideal for mechanical design, to workgroup servers for departments and small businesses, to enterprise servers for medium to large companies running ERP and server-consolidation applications, up to massively parallel RS/6000 SP systems that can handle demanding scientific and technical computing, business intelligence, and Web serving tasks. Along with AIX, IBM's award-winning UNIX operating system, and HACMP, the leading high-availability clustering solution, the RS/6000 platform provides the power to create change and the flexibility to manage it, with a wide variety of applications that provide real value.

RS/6000 History

The first RS/6000 was announced February 1990 and shipped June 1990. Since then, over 1,100,000 systems have shipped to over 132,000 customers. The next figure summarizes the history of the RS/6000 product line, classified by machine type. For each machine type, the I/O bus architecture and range of processor clock speeds are indicated. The figure shows the following: • In the past, RS/6000 I/O buses were based on the Micro Channel Architecture (MCA). Today, RS/6000 I/O buses are based on the industry-standard Peripheral Component Interface (PCI) Architecture. • Processor speed, one key element of RS/6000 system performance, has increased dramatically over time. • There have been many machine types over the entire RS/6000 history. In recent years, there has been considerable effort to reduce the complexity of the model offerings without creating gaps in the market coverage.


RS/6000 history (models spanning 1990 through 2000):

• 7011 (33 to 80 MHz) Micro Channel Workstations
• 7248 (100 to 133 MHz) PCI Workstations
• 7006 (80 to 120 MHz) Micro Channel Entry Desktops
• 7009 (80 to 120 MHz) Micro Channel Compact Servers
• 7013 (20 to 200 MHz) Micro Channel Deskside Systems
• 7012 (20 to 200 MHz) Micro Channel Desktop Systems
• 7015 (25 to 200 MHz) Micro Channel Rack Systems
• 7024 (100 to 233 MHz) PCI Deskside Systems
• 7025 (166 to 500 MHz) PCI Workgroup Servers - Deskside Systems
• 7043 (166 to 375 MHz) PCI Workstations & Workgroup Servers
• 7044 (333 to 400 MHz) PCI Workstations & Workgroup Servers
• 7046 (375 MHz) PCI Workgroup Servers - Rack Systems
• 7026 (166 to 500 MHz) PCI Workgroup Servers - Rack Systems
• 7017 (125 to 450 MHz) PCI Enterprise Servers
• SP1, SP2, SP - All Node Types

RISC CPU

The RISC CPU was the first CPU for the RS/6000 series of systems. The CPU consisted of four chips and ran at a speed of 33 MHz, with outstanding floating-point performance for its time. It was used in the 7012 and 7013 systems, models 320-380 and 520-580.

RISC II CPU

The RISC II has enhanced features over the first RISC design and runs at up to 200 MHz. The CPU was used in the 7012 and 7013 systems, models 390 and 590.


PowerPC and POWER2 CPU family

The PowerPC CPUs started as a joint effort between Motorola, Apple, and IBM. The family consists of the PowerPC, PPC601, PPC604, and PPC604e. These CPUs are very close to those produced by Motorola and used in Apple systems. Currently the PPC604e CPU is used in the models F50, B50, and 43P.

Power3 and Power3-II CPUs

The POWER3 microprocessor introduces a new generation of 64-bit processors especially designed for high-performance and visual computing applications. POWER3 processors replace the POWER2 and POWER2 Super Chips (P2SC) in high-end RS/6000 workstations and SP nodes. The RS/6000 44P 7044 Model 270 workstation features the POWER3-II microprocessor, as do the POWER3-II based SP nodes.

The POWER3 implementation of the PowerPC architecture provides significant enhancements compared to the POWER2 architecture. The SMP-capable POWER3 design allows for concurrent operation of fixed-point instructions, load/store instructions, branch instructions, and floating-point instructions. Compared to the P2SC, which reaches its design limits at a clock frequency of 160 MHz, POWER3 targets up to 600 MHz by exploiting more advanced chip manufacturing processes, such as copper technology. The first POWER3-based system, the RS/6000 43P 7043 Model 260, runs at 200 MHz, as do the POWER3 wide and thin nodes for the SP. Features of the POWER3 exceeding its predecessor (P2SC) include:
• A second load/store unit
• Improved memory access speed
• Speculative execution

[Figure: POWER3 block diagram: two floating-point units (FPU1, FPU2), three fixed-point units (FXU1-FXU3), two load/store units (LS1, LS2), and a branch/dispatch unit; register rename buffers (24 FP, 16 integer); a 2048-entry branch history table and a 256-entry branch target cache; a 32 KB, 128-way instruction cache and a 64 KB, 128-way data cache, each with a memory management unit; CPU registers (32 x 64-bit integer, 32 x 64-bit floating point); and a bus interface unit with L2 control and clock driving a 1-16 MB direct-mapped L2 cache (32 bytes @ 200 MHz = 6.4 GB/s) and the 6XX bus (16 bytes @ 100 MHz = 1.6 GB/s).]


RS64 and RS64 II CPUs


The RS64 microprocessor, based on the PowerPC Architecture, was designed for leading-edge performance in OLTP, e-business, BI, server consolidation, SAP, NotesBench, and Web serving for the commercial and server markets. It is the basis for at least four generations of RS/6000 and AS/400 enterprise server offerings.

The RS64 processor focuses on commercial performance, with emphasis on conditional branches with a zero- or one-cycle incorrect-branch-predict penalty. It contains 64 KB L1 instruction and data caches, one-cycle load support, four superscalar fixed-point pipelines, and one floating-point pipeline. An on-board bus interface unit (BIU) controls both the 32 MB L2 bus interface and the memory bus interface.

RS64 and RS64 II are defined by the following specifications:
• 125 MHz RS64 / 262 MHz RS64 II on the RS/6000 Model S70
• 262 MHz RS64 II on the RS/6000 Model S70 Advanced
• 340 MHz RS64 II on the RS/6000 Model H70
• 64 KB on-chip L1 instruction cache
• 64 KB on-chip four-way set-associative data cache
• 32 MB L2 cache
• Superscalar design with integrated integer, floating-point, and branch units
• Support for up to 64-way SMP configurations (currently 12-way)
• 128-bit data bus
• 64-bit real memory addressing
• Real memory support for up to one terabyte (2^40 bytes)
• CMOS 6S2 using a 162 mm2 die, 12.5 million transistors

[Figure: RS64 block diagram: a simple fixed-point unit, a simple/complex fixed-point unit, a floating-point unit, a load/store unit, and a branch/dispatch unit; memory management units for the instruction cache and data cache; and a bus interface unit with L2 control and clock driving a 1-32 MB L2 cache and the 6XX bus.]


RS64 III


The RS64 III processor is designed to perform applications that place heavy demands on system memory. The RS64 III architecture addresses both the need for very large working sets and low latency. Latency is measured by the number of CPU cycles that elapse before requested data or instructions can be utilized by the processor. The RS64 III processors combine IBM advanced copper chip technology with a redesign of critical timing paths on the chip to achieve greater throughput.

The L1 instruction and data caches have been doubled to 128 KB each. New circuit design techniques were used to maintain the one-cycle load-to-use latency for the L1 data cache. L2 cache performance on the RS64 III processor has been significantly improved. Each processor has an on-chip L2 cache controller and an on-chip directory of L2 cache contents. The cache is four-way set associative, which means that directory information for all four sets is accessed in parallel. Greater associativity results in more cache hits and lower latency, which improves commercial performance. Using a technique called Double Data Rate (DDR), the new 8 MB static SRAM used for L2 is capable of transferring data twice during each clock cycle. The L2 interface is 32 bytes wide and runs at 225 MHz (half processor speed) but, because of the use of DDR, it provides 14.4 GBps of throughput.

In summary, the RS64 III features include:
• 128 KB on-chip L1 instruction cache
• 128 KB on-chip L1 data cache with one-cycle load-to-use latency
• On-chip L2 cache directory that supports up to 8 MB of off-chip L2 SRAM memory
• 14.4 GBps L2 cache bandwidth
• 32-byte on-chip data buses
• 4-way superscalar design
• Five-stage deep pipeline
• The Model S80 uses the 450 MHz RS64 III 64-bit copper-chip technology
• The Model M80 uses the 500 MHz RS64 III 64-bit copper-chip technology
• The Models F80 and H80 use 450 or 500 MHz RS64 III 64-bit copper-chip technology


Power4 or Gigaprocessor Copper SOI CPU

POWER4 is a new processor initiative from IBM. It comprises two 64-bit 1 GHz five-issue superscalar cores with a three-level cache hierarchy. It has a 10 GBps main memory interface and a 45 GBps multiprocessor interface. IBM is utilizing 0.18 micron copper silicon-on-insulator technology in its manufacture. The targeted market is the enterprise server, or servers in e-business. It is currently in the design stage.

System Bus information

All current systems in the RS/6000 family are equipped with PCI buses. The PCI architecture provides an industry-standard specification and protocol that allows multiple adapters access to system resources through a set of adapter slots. Each PCI bus has a limit on the number of slots (adapters) it can support, typically from two to six. To overcome this limit, the system design can implement multiple PCI buses. Two different methods can be used to add PCI buses in a system:

• Secondary PCI bus: the simplest method to add PCI slots when designing a system is to add a secondary PCI bus. This bus is bridged onto a primary bus using a PCI-to-PCI bridge chip.
• Two or more primary PCI buses: this design requires a more sophisticated I/O interface with the system memory.


Power CPU Overview

32-bit hardware characteristics

32-bit Power and PowerPC processors all have the following features in common.

User registers:
• 32 general-purpose integer registers, each 32 bits wide (GPRs)
• 32 floating-point registers, each 64 bits wide (FPRs)
• A 32-bit Condition Register (CR)
• A 32-bit Link Register (LR)
• A 32-bit Count Register (CTR)

System registers:
• 16 Segment Registers (SRs)
• A Machine State Register (MSR)
• A Data Address Register (DAR)
• Two Save and Restore Registers (SRRs)
• 4 special-purpose (SPRG) registers (PowerPC only)

All instructions are 32 bits long.

The Data Address Register contains the memory address that caused the last memory-related exception.

The SRRs are used to save information when an interrupt occurs:
• SRR0 points to the instruction that was running when the interrupt occurred
• SRR1 contains the contents of the MSR when the interrupt occurred

The SPRGs are used for general operating system purposes requiring per-processor temporary storage. They provide fast state saves and support for multiprocessing environments.

General purpose registers

General Purpose Registers (GPRs), often just called "Rs", are used for loads, stores, and integer calculations. No memory-to-memory operations are provided; data always needs to go through registers.

Condition register

The condition register (CR) contains bits set by the results of compare instructions. It is treated as eight 4-bit fields (CR0-CR7). The bits are used to test for less-than, greater-than, equal, and overflow conditions.
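A sketch (arbitrary registers and label) of a compare that targets a specific CR field and a branch that tests it:

        cmpw  cr3,r4,r5     # compare r4 with r5, result into CR field 3
        blt   cr3,smaller   # branch to "smaller" if the LT bit of CR3 is set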


Link register

The link register (LR) is set by some branch instructions. Its contents point to the instruction to be executed immediately after the branch. It is typically used in subroutine calls to find out where to return to.

Count register

The Count Register (CTR) has two uses:
• It can be decremented, tested, and used to decide whether to take a branch, all from one branch instruction
• It can contain the target address for a branch instruction
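A minimal counted-loop sketch using the CTR (the iteration count is arbitrary):

        li     r5,10        # load immediate: 10 iterations
        mtctr  r5           # move r5 into the Count Register
    loop:
        # ... loop body ...
        bdnz   loop         # decrement CTR; branch to loop while CTR != 0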

Machine state register

The MSR controls many of the current operating characteristics of the processor. Among them are:
• Privilege level (Supervisor vs. Problem, or Kernel vs. User)
• Addressing modes (virtual vs. real)
• Interrupt enabling
• Little-endian vs. big-endian mode

Instruction set

A single instruction generally modifies only one register or one memory location; exceptions to this are "multiple" and "update" operations. The format of an instruction is:
• An opcode mnemonic
• An optional set of option bits
• 0, 1, 2, or 3 registers
• 0 or 1 memory locations, expressed as an offset added to/subtracted from a register

The first two may be combined into an "extended mnemonic". For example, the operand format 24(r3) means the address in r3 + 24. General Purpose Registers are named "r0" - "r31".

Although most instructions are the same, the mnemonics for POWER and PowerPC are often different. POWER mnemonics are generally simpler and shorter, while PowerPC mnemonics are longer but more explicit. These differences exist because PowerPC was developed with 64-bit in mind. Note: the actual opcodes generated by the assembler for these instructions are identical.


Register to register operations


These operations always have at least two registers listed, where the first is the target for the result of the instruction, and the others provide the input to the operation. Immediate operations are shown as a register with an offset; "immediate" means that a constant value is involved, built right into the instruction. Examples:

    or    r3,r4,r5      # Logically ORs r4 and r5, result into r3
    addi  r1,r1,0x48    # Adds 0x48 to r1, result into r1

Register to memory operations

Register-memory operations always have one register and one memory location; the register is always listed first. The size of the memory location is specified in the opcode:
• b = byte (8 bits)
• h = halfword (16 bits)
• w = word (32 bits)
• d = doubleword (64 bits)

All opcodes beginning with "l" are loads, and all opcodes beginning with "st" are stores.


Register to memory operation examples


Examples:

    lwz  r31,4(r30)     # Loads 32 bits from address 4+r30 into r31.
                        # High 32 bits cleared on a 64-bit processor
    std  r3,-8(r29)     # Stores 64 bits from r3 to address r29 - 8.
                        # Invalid operation on a 32-bit processor
    lbz  r0,27(r1)      # Loads 8 bits from address 27+r1 into r0.
                        # Top 24/56 bits are cleared
    sth  r3,0x56(r1)    # Stores low 16 bits from r3 to address 0x56+r1

Notice that the load instructions have a "z" in their mnemonics. The "z" stands for "zero," and is intended to make clear that these instructions clear any bits in the target register that were not actually copied from memory. In case you were wondering, there are also load instructions without the "z": lwa and lha are "algebraic" loads, meaning the value being loaded is sign-extended to fill out the rest of the register. This is used when loading a signed value: if a halfword held a negative value, lhz would make it positive, but lha would preserve the value's "negativeness."

Compare instructions

There are four variations of compare instructions, all beginning with "cmp". They compare two values:
• Register and register, or
• Register and immediate value (that is, a constant)

The result of the comparison is placed in the Condition Register (CR), where the bits that can be set are:
• LT = less than
• GT = greater than
• EQ = equal
• OV = overflow (a.k.a. carry bit)


Branch instructions

All instructions beginning with a “b” are branches. They change the address of the next instruction to be run. They have three addressing modes:
• Absolute - goes to an explicit address
• Relative - target address is an offset from the current instruction address
• Register - only two registers can contain a branch target: Count (CTR) and Link (LR)
Branches can be conditional; whether the branch is taken depends upon whether the option bit matches the specified bit in the CR. A branch instruction can specify which CR field to use, where CR0 is assumed unless otherwise specified. Extended mnemonics are defined by the assembler to cover most combinations.
The conditional branch instruction is central to any computer architecture. However, most architectures (including POWER and PowerPC) avoid putting comparisons directly into their branch instructions (to keep things simple). They provide compare instructions that set “condition bits.” These bits are what are used on branch instructions to make the actual decision.
The assembler (and crash’s disassembler) provides extended mnemonics that combine a type of branch and the condition register bit that determines whether the branch is taken. Another bit in the branch opcode determines whether the CR bit must be on or off for the branch to take place. This bit is also incorporated into the extended mnemonics (the “not” versions of the branches). For maximum flexibility, the assembler usually also allows you to specify the “not” cases as the logically-opposite case. For example, bnl (branch not less than) can also be written as bge (branch greater than or equal to). Either case is still saying, “branch if the LT bit is turned off.”
Examples
• blt 0x38c00 # Branches to address 38c00 if LT bit is on in CR0
• bge cr3,<target> # Branches if LT bit is off in CR3
• bnelr cr7 # Branches to address in LR if EQ bit is off in CR7
• blea cr2,0x3600 # Branches to absolute address 0x3600 if GT bit is off in CR2

Trap instructions

Most mnemonics beginning with a “t” are traps, and generate a program exception if the specified condition is met. There are two variations of the trap instruction:
• t or tw - compares two registers, traps if the specified comparison is true
• ti or twi - compares a register to an immediate value instead
“w” mnemonics are the PowerPC indication that these trap instructions are working on 32-bit values. As with branches, there are extended mnemonics defined to provide various traps. In this context ‘lt’, ‘gt’, ‘eq’, etc. have the same meaning as on branch mnemonics.
Examples
• tweq r3,r4 # Traps if r3 equals r4
• twnei r31,0 # Traps if r31 is not equal to 0
Trap instructions are the only instructions in this architecture that perform a comparison and take some action, all in one instruction. They do not set or use condition register bits.

Special register operations

The Special Purpose Registers (SPRs) can only be copied to or from GPRs.
• mfspr r3,8 # Copies SPR 8 into r3
• mtspr 9,r3 # Copies r3 into SPR 9
Extended mnemonics are defined to cover common SPRs:
• mflr r3 # Copies the LR (SPR 8) into r3
• mtctr r3 # Copies r3 into the CTR (SPR 9)

Interrupt vectors

Interrupt vectors are addresses of short sections of code which save the state of the processor and then branch to a handler routine. Some examples are:
• system reset - vector 0x100
• machine check - vector 0x200
• data storage interrupt (DSI) - vector 0x300
• instruction storage interrupt (ISI) - vector 0x400
• external interrupt - vector 0x500
• alignment - vector 0x600
• program (invalid instruction or trap instruction) - vector 0x700
• floating-point unavailable - vector 0x800
• decrementer - vector 0x900
• system call - vector 0xc00
There are some exceptions unique to each type of processor.
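For quick reference, the same vector offsets as a small C table (the identifier names are invented for this guide, not taken from any AIX header):

/* PowerPC interrupt vector offsets, as listed above (illustrative only) */
enum ppc_vector {
    VEC_SYSTEM_RESET   = 0x100,
    VEC_MACHINE_CHECK  = 0x200,
    VEC_DSI            = 0x300,  /* data storage interrupt */
    VEC_ISI            = 0x400,  /* instruction storage interrupt */
    VEC_EXTERNAL       = 0x500,
    VEC_ALIGNMENT      = 0x600,
    VEC_PROGRAM        = 0x700,  /* invalid instruction or trap */
    VEC_FP_UNAVAILABLE = 0x800,
    VEC_DECREMENTER    = 0x900,
    VEC_SYSTEM_CALL    = 0xc00
};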

64 bit CPU Overview

64-bit hardware characteristics

With full hardware 32-bit binary compatibility as the baseline, the features that characterize a PowerPC processor as 64-bit include:
• 64-bit general registers
• 64-bit instructions for loading and storing 64-bit data operands, and for performing 64-bit arithmetic and logical operations
• two execution modes: 32-bit and 64-bit. Whereas 32-bit processors have implicitly only one mode of operation, 32-bit execution mode on a 64-bit processor causes instructions and addressing to behave the same as on a 32-bit processor. As a separate mode, 64-bit execution mode creates a true 64-bit environment, with 64-bit addressing and instruction behavior
• 64-bit physical memory addressing facilities
• additional supervisor instructions, as needed to set up and control the execution mode
A key feature of the PowerPC 64-bit architecture is that it provides execution mode at a per-process level, helping AIX to create, at the system level, a mixed environment of concurrent 32-bit and 64-bit processes. A Machine State Register (MSR) bit controls 32-bit or 64-bit execution mode:
• Allows support for 32-bit processes on 64-bit hardware
• Used by the kernel to run in 32-bit mode
• Portions of the VMM run in 64-bit mode on 64-bit hardware (to address the large tables needed to represent large virtual memory)
• 32-bit mode on 64-bit hardware looks exactly like 32-bit hardware (ensures binary compatibility for 32-bit applications)
• 32-bit instructions use only the bottom 32 bits of registers for data or addresses
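To picture the last bullet, here is a toy C model of effective-address computation under the two execution modes (purely illustrative; it is not how the hardware is implemented):

#include <stdint.h>

/* Toy model of the visible effect of the MSR execution-mode bit on
 * address arithmetic: in 64-bit mode the full register participates;
 * in 32-bit mode only the low-order 32 bits do. */
static uint64_t effective_address(uint64_t base, uint64_t offset, int mode64)
{
    uint64_t ea = base + offset;
    return mode64 ? ea : (ea & 0xFFFFFFFFull);  /* 32-bit mode truncates */
}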

Segment table

The 64-bit virtual address space is represented with a segment table, which acts as an in-memory set-associative cache of the most recently used 256 segment number to segment ID mappings. The current segment table is pointed to by the 64-bit Address Space Register (ASR). The ASR has a valid bit indicating whether a segment table is in use; in 32-bit mode on 64-bit processors this bit indicates that the segment table is not being used.
IBM “bridge extensions” to the PowerPC 64-bit architecture allow segment register operations to work in 32-bit mode, which allows the kernel to continue to manipulate segment registers. The “bridge extensions” are used to load and store “segment registers” instead.
A Segment Lookaside Buffer (SLB) is used to cache recently used segment number to segment ID mappings. This is similar to the Translation Lookaside Buffer (TLB) for page to frame translations. The SLB is similar to the segment table but smaller and faster (on chip, not in memory).

Unit 4. SMP Hardware Overview

Objectives

The objectives for this lesson are:
• list the three types of multiprocessor design
• describe what is meant by MP safe

Symmetric multiprocessing

On uniprocessor systems, bottlenecks exist in the form of the single address and data bus, which restricts transfers to one at a time, and the single program counter, which forces instructions to be executed in strict sequence. Some performance improvement was achieved by constantly improving the speeds of these uniprocessor machines. With symmetric multiprocessing, more than one CPU works together.
There are several categories of MP systems, depending on whether the CPUs share resources or have their own resources (like memory, operating system, I/O channels, control units, files and devices), how they are connected (whether in a single machine sharing a single bus or in different machines), and whether all processors are functionally equal or some are specialized.
Types of Multiprocessors:
• Loosely-coupled MP
• Tightly-coupled MP
• Symmetric MP

Loosely coupled MP

Has different systems on a communication link, with the systems functioning independently and communicating when necessary. The separate systems can access each other’s files and may even download tasks to a lightly loaded CPU to achieve some load balancing.

Tightly coupled MP

Uses a single storage shared by the various processors and a single operating system that controls all the processors and system hardware.

Symmetric MP

All of the processors are functionally equivalent and can perform I/O and computation.

Multiprocessor organization

In order to have all CPUs work together, there must be some sort of organization. There are three ways to do that:
• Master/slave multiprocessing organization
• Separate executives organization
• Symmetric multiprocessing organization

Master slave organization

One processor is designated as the master and the others are the slaves. The master is a general purpose processor and performs input/output as well as computation. The slave processors perform only computation. The processors are considered asymmetric (not equivalent) since only the master can do I/O as well as computation. Utilization of a slave may be poor if the master does not service slave requests efficiently enough. I/O-bound jobs are another disadvantage: they may not run efficiently, since only the master does I/O.

Separate executives organization

With this organization each processor has its own operating system and responds to interrupts from users running on that processor. A process is assigned to run on a particular processor and runs to completion. It is possible for some of the processors to remain idle while other processors execute lengthy processes. Some tables are global to the entire system and access to these tables must be carefully controlled. Each processor controls its own dedicated resources, such as files and I/O devices.

Symmetric multiprocessing organization

All of the processors are functionally equivalent and can perform I/O and computation. The operating system manages a pool of identical processors, any one of which may be used to control any I/O device or reference any storage unit. Conflicts between processors attempting to access the same storage at the same time are ordinarily resolved by hardware. Multiple tables in the kernel can be accessed by different processes simultaneously. Conflicts in access to systemwide tables are ordinarily resolved by software. A process may be run at different times by any of the processors and, at any given time, several processors may execute operating system functions in kernel mode.

Multiprocessor definitions

There are two ways of identifying separate processors. You can identify them by:
• the physical CPU number
• the logical CPU number
The lowest number starts from ‘0’ on Power systems, but from ‘1’ on IA-64. The physical numbers identify all processors on the system, regardless of their state, while the logical numbers identify enabled processors only. The Object Data Manager (ODM) names for processors are based on physical numbers with the prefix /proc. The table below illustrates this naming scheme for a three-processor Power system.

ODM name    Physical number    Logical number    Processor state
/proc0      0                  0                 Enabled
/proc1      1                  -                 Disabled
/proc2      2                  1                 Enabled

Funneling

In order to run some uniprocessor device drivers unchanged (because they are not ‘thread-safe’ or ‘MP safe’), their execution has to be “funneled” through one specific processor, which is called the MP master. Funneled code runs only on the master processor; therefore, the current uniprocessor serialization is sufficient. One processor is known as the default, or master, processor, and this concept is used for funneling. It is not a master processor in the sense of master/slave processing - the term is used only to designate which processor will be the default processor. It is defined by the value of MP_MASTER in the file
Note: funneling is NOT supported by the 64-bit kernel!
Funneling has the following characteristics:
• Interrupts for a funneled device driver will be routed to the MP master CPU.
• Funneling is intended to support third-party device drivers and low-throughput device drivers.
• The base kernel will provide binary compatibility for these device drivers.
• Funneling only works if all references to the device driver are through the device switch table.

MP safe

MP safe code will run on any processor. It is modified to prevent resource clashes by adding locking code in order to serialize its execution.
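As an illustration of what “adding locking code” means, here is a minimal C sketch using the AIX simple lock kernel services; the header, lock declaration, and the guarded counter are assumptions for illustration - check the kernel services documentation for exact usage:

/* Minimal sketch of MP-safe serialization in a kernel extension,
 * using the simple lock kernel services. Declarations illustrative. */
#include <sys/types.h>
#include <sys/lock_def.h>

static Simple_lock stats_lock;          /* illustrative lock word   */
static int         packets_handled;    /* shared resource to guard */

void driver_init(void)
{
    simple_lock_init(&stats_lock);      /* once, before first use */
}

void count_packet(void)
{
    simple_lock(&stats_lock);           /* serialize all CPUs here */
    packets_handled++;
    simple_unlock(&stats_lock);
}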

MP efficient

MP efficient code is MP safe code, but it also has data locking mechanisms to serialize data access. This makes it easier to spread whatever the code does across the available CPUs. MP efficient device drivers are intended for high-throughput device drivers.

Unit 5. Configuring System Dumps on AIX 5L

This lesson describes how to configure and take system dumps on a node running the AIX5L operating system.

What You Should Be Able to Do

After completing this unit, you should be able to:
• Configure an AIX5L system to take a system dump
• Test the system dump configuration of an AIX5L system
• Verify the validity of a dump file

About This Lesson

Purpose

This lesson describes how to configure and take system dumps on a node running the AIX5L operating system.

Objectives

At the completion of this lesson, you will be able to:
• Configure an AIX5L system to take a system dump
• Test the system dump configuration of an AIX5L system
• Verify the validity of a dump file

Table of contents

This lesson covers the following topics:

• About This Lesson
• System Dump Facility in AIX5L
• Configuring for System Dumps
• Obtaining a Crash Dump
• Dump Status and completion codes
• dumpcheck utility
• Verify the dump
• Packaging the dump

Estimated length

This lesson takes approximately 1 hour to complete.

Accountability

You will be able to measure your progress with the following:
• Exercises using your lab system
• Check-point activity
• Lesson review

Reference
• Redbooks

Organization of this lesson

This lesson consists of information followed by exercises that allow you to practice what you’ve just learned. Sometimes, as the information is being presented, you are required to do something - pull down a menu, enter a response, etc. This symbol, in the left hand side-head, is an indication that an action is required.

System Dump Facility in AIX5L

Introduction

An AIX5L system can generate a system dump (or crash dump) when it encounters a severe system error, such as an exception in kernel mode that was unexpected or that the kernel cannot handle. A dump can also be initiated by the system administrator when the system has hung. When an unexpected system halt occurs, the system dump facility automatically copies selected areas of kernel data to the primary dump device. These areas include kernel segment 0 as well as other areas registered in the Master Dump Table by kernel modules or kernel extensions.
The system dump is a snapshot of the operating system state at the time of the crash or manually initiated dump. The system dump facility provides a mechanism to capture sufficient information about the AIX5L kernel for later analysis. Once the preserved image is written to disk, the system is booted and returned to production. Analysis of the dump can be done on another machine, away from the production machine, at a convenient time and location, by a skilled kernel person.

Process

The process of taking a system dump is illustrated in the following chart. The process involves two stages: in stage one, the contents of memory are copied into a temporary disk location; in stage two, AIX5L is booted and the memory image is moved to a permanent location in the /var/adm/ras directory.

[Chart: the two-stage dump process. Stage 1: the system panics and the memory dumper runs; memory is copied to the disk location specified in the SWservAt ODM object class. Stage 2: the system is booted; copycore (started in rc.boot) copies the dump into /var/adm/ras, and AIX5L returns to production.]

Configuring for System Dumps

Introduction

When the operating system is installed, parameters regarding the dump device are configured with default settings. To ensure that a system dump is taken successfully, the system dump parameters need to be configured properly. The system dump parameters are stored in system configuration objects within the SWservAt ODM object class. Objects within the SWservAt object class define where and how a system dump should be handled.

SWservAt object class

The SWservAt ODM object class is stored in the /etc/objrepos directory. Objects included within the object class are:

name            default             description
tprimary        /dev/hd6            Defines the temporary primary dump device. By default this is the primary paging space logical volume, hd6.
primary         /dev/hd6            Defines the permanent primary dump device. By default this is the primary paging space logical volume, hd6.
tsecondary      /dev/sysdumpnull    Defines the temporary secondary dump device. By default this is the device sysdumpnull.
secondary       /dev/sysdumpnull    Defines the permanent secondary dump device. By default this is the device sysdumpnull.
autocopydump    /var/adm/ras        Defines the directory the dump is copied to at system boot.
forcecopydump   TRUE                TRUE - If the copy to the copy directory fails, the system boot process will bring up a utility to copy the dump to removable media.
enable_dump     FALSE               FALSE - Disables the ability to force a sysdump using the dump key sequence or the reset button on systems without a key mode switch.
dump_compress   OFF                 OFF - Specifies that dumps will not be compressed.

Each object can be changed with the use of the sysdumpdev command.

sysdumpdev

The sysdumpdev command changes the settings of SWservAt objects. The command provides you with the ability to:
• Estimate the size of the system dump
• Select the primary and secondary dump devices
• Select the directory the dump will be copied to at boot
• Display information from the previous dump invocation
• Determine if a new system dump exists
• Display current dump settings

Dump Device selection rules

When selecting the primary or secondary dump device, the following rules must be observed:
• A mirrored paging space may be used as a dump device.
• Do not use a diskette drive as your dump device.
• If you use a paging device, only use hd6, the primary paging device.

Preparing for a system dump

To ensure that a system dump will be successfully captured, complete the following steps:

Step 1. Estimate the size of the dump. This can be done through smit by following the fast path:
# smit dump_estimate
Or, using the sysdumpdev command:
# sysdumpdev -e
(With compression turned on)
0453-041 Estimated dump size in bytes: 11744051
(With compression turned off)
0453-041 Estimated dump size in bytes: 58720256
Using the above example, the dump will require 12MB (with compression on), or 59MB (with compression turned off) of device storage. This value can change based on the activity of the system. It is best to run this command when the machine is under its heaviest workload. Size the dump device at four times the value reported by the sysdumpdev command in order to handle a system dump during peak system activity.
IA-64 Systems - Compression must be turned off to gather a valid system dump. (Errata)
DUMPSPACE requirement for this system: ______MB * 4 = ______MB
Note: On AIX5L a new utility called dumpcheck has been created to monitor the system and verify that, if a system dump occurred, the resources are properly configured for the system dump. The utility is run as a cron job, and is located in the /usr/lib/ras directory. The time when the command is scheduled to run should be adjusted to when the peak system load is expected. Any warnings will be logged in the error log.

Step 2. Create a primary dump device named dumplv. Calculate the required number of PPs for the dump device. Get the PP size of the volume group by using the lsvg command:
# lsvg rootvg
VOLUME GROUP:   rootvg            VG IDENTIFIER:  db1010a
VG STATE:       active            PP SIZE:        16 megabyte(s)
VG PERMISSION:  read/write        TOTAL PPs:      1626 (26016 megabytes)
MAX LVs:        256               FREE PPs:       1464 (23424 megabytes)
LVs:            11                USED PPs:       162 (2592 megabytes)
OPEN LVs:       8                 QUORUM:         2
TOTAL PVs:      3                 VG DESCRIPTORS: 3
STALE PVs:      0                 STALE PPs:      0
ACTIVE PVs:     3                 AUTO ON:        yes
MAX PPs per PV: 1016              MAX PVs:        32
LTG size:       128 kilobyte(s)   AUTO SYNC:      no
HOT SPARE:      no
Determine the necessary number of PPs by dividing the estimated size of the dump by the PP size. For example:
236MB (59*4) / 16MB = 14.75 (required number is 15)
Create a logical volume of the required size, for example:
# mklv -y dumplv -t sysdump rootvg 15

Step 3. Verify the size of the device /dev/dumplv. Enter the following command:
# lslv dumplv
LOGICAL VOLUME: dumplv            VOLUME GROUP:   rootvg
LV IDENTIFIER:  e59bd8            PERMISSION:     read/write
VG STATE:       active/complete   LV STATE:       opened/syncd
TYPE:           dump              WRITE VERIFY:   off
MAX LPs:        512               PP SIZE:        16 megabyte(s)
COPIES:         1                 SCHED POLICY:   parallel
LPs:            15                PPs:            15
STALE PPs:      0                 BB POLICY:      relocatable
INTER-POLICY:   minimum           RELOCATABLE:    no
INTRA-POLICY:   middle            UPPER BOUND:    32
MOUNT POINT:    N/A               LABEL:          None
MIRROR WRITE CONSISTENCY: off
EACH LP COPY ON A SEPARATE PV?: yes
In this example, the dumplv logical volume contains 15 16MB partitions, giving a total size of 240MB.

Step 4. Assign the primary dump device by using the sysdumpdev command:
# sysdumpdev -s /dev/dumplv -P
primary              /dev/dumplv
secondary            /dev/sysdumpnull
copy directory       /var/adm/ras
forced copy flag     FALSE
always allow dump    FALSE
dump compression     OFF

Step 5. Create a secondary dump device. The secondary dump device is used to back up the primary dump device. If an error occurs during a system dump to the primary dump device, the system attempts to dump to the secondary device (if it is defined). Create a logical volume of the required size, for example:
# mklv -y hd7 -t sysdump rootvg 15

Step 6. Assign the secondary dump device by using the sysdumpdev command:
# sysdumpdev -s /dev/hd7 -P
primary              /dev/dumplv
secondary            /dev/hd7
copy directory       /var/adm/ras
forced copy flag     FALSE
always allow dump    FALSE
dump compression     OFF

Step 7. Verify that the size of the filesystem containing the copy directory is large enough to handle a crash dump. Check the size of the copy directory filesystem with the following command:
# df -k /var
Filesystem   1024-blocks  Free   %Used  Iused  %Iused  Mounted on
/dev/hd9var  32768        31268  5%     143    64%     /var
In this example the /var filesystem is 32MB. To increase the size of the /var filesystem to 240MB, use the following command:
# chfs -a size=+240000 /var
Note: The default copy directory is /var/adm/ras. The rc.boot script is coded to check and mount the /var filesystem to support the copy of the system dump out of the dump device. If an alternate location is selected, modification of /sbin/rc.boot may be necessary. You will also be required to update the RAM filesystem with the bosboot command.
Portion of /sbin/rc.boot:
# Mount /var for copycore
echo "rc.boot: executing \"fsck -fp var\"" \
    >>/../tmp/boot_log
fsck -fp /var
echo "rc.boot: executing \"mount /var\"" \
    >>/../tmp/boot_log
mount /var
[ $? -ne 0 ] && loopled 0x518
# retrieve dump
echo "rc.boot: executing \"copycore\"" \
    >>/../tmp/boot_log
copycore
umount /var

Step 8. Configure the force copy flag. If paging space is being used as a dump device, the force copy flag must be set to TRUE. This will force the system boot sequence into menus that allow copying the dump to external media if the copy to the copy directory fails. This utility gives you the opportunity to save the crash to removable media if the default copy directory is full or unavailable. To set the flag to TRUE, use the following command:
# sysdumpdev -PD /var/adm/ras

Step 9. Configure the allow system dump flag. To enable the reset button or dump key sequence to force a dump with the key in the normal position, or on a machine without a key mode switch, the allow system dump flag must be set to TRUE. To set the flag TRUE, use the following command:
# sysdumpdev -KP

Step 10. Configure the compression flag. To enable compression of the system dump prior to being written to the dump device, the compression flag must be set to ON. To set the flag to ON, use the following command:
# sysdumpdev -CP
IA-64 Systems - Compression must be turned off to gather a valid system dump. (Errata):
# sysdumpdev -cP
Note: Turning the compression flag on will cause the dump to be saved in a compressed form on the primary dump device. Also, the copycore utility will generate a compressed vmcore file, vmcore.x.Z.

Step 11. Configure the system for autorestart. A useful system attribute is autorestart: if autorestart is TRUE, the system will automatically reboot after a crash. This is useful if the machine is physically distant or often unattended. To list the system attributes, use the following command:
# lsattr -El sys0
To set autorestart to TRUE, use SMIT by following the fast path:
# smit chgsys
Or use the command:
# chdev -l sys0 -a autorestart='true'

Obtaining a Crash Dump

Introduction

AIX5L has been designed to automatically collect a system crash dump following a system panic. This section discusses the operator controls and procedure that is used to obtain a system dump.

User initiated dumps

Under unattended hang conditions, or for other debugging purposes, the system administrator may use different techniques to force a dump:
• Using the sysdumpstart -p command (primary dump device) or the sysdumpstart -s command (secondary dump device).
• Starting a system dump with the Reset button by doing the following (this procedure works for all system configurations and will work in circumstances where other methods for starting a dump will not):

Step 1. Turn the machine’s mode switch to the Service position, or set Always Allow System Dump to TRUE.

Step 2. Press the Reset button. The system writes the dump information to the primary dump device.

Power PC - Press the Ctrl-Alt-1 key sequence to write the dump information to the primary dump device, or press the Ctrl-Alt-2 key sequence to write the dump information to the secondary dump device.
IA-64 - Press the Ctrl-Alt-NUMPAD1 key sequence to write the dump information to the primary device, or the Ctrl-Alt-NUMPAD2 key sequence to write the dump information to the secondary dump device.

Dump Status and completion codes

Progression status codes

A system crash will cause a number of status codes to be displayed. When a system has crashed, the LEDs will display a flashing 888. The system may display the code 0c9 for a short period of time, indicating a system dump is in progress. When the dump is complete, the dump status code will change to 0c0 if the system was able to dump successfully. If the Low-Level Debugger (LLDB) is enabled, a c20 will appear in the LEDs, and an ASCII terminal connected to the s1 or s2 serial port will show an LLDB screen. Typing quit dump will initiate a dump.
During the dump process, the following progression status codes may be seen on the LED or LCD displays:

LED code      sysdumpdev status  Description
0c0           0                  Dump successful.
0c1           -4                 I/O error during dump.
0c4           -2                 Dump device is too small. Partial dump taken.
0c5           -3                 Internal dump error. It shows only when the dump facility itself fails. This does not include the failure of dump component routines.
0c8           -1                 No dump device defined.
0c2           N/A                User-initiated dump in progress.
0c6           N/A                User-initiated dump in progress to secondary dump device.
0c9           N/A                System-initiated dump in progress.
0cc           N/A                Dump process switched to secondary dump device.
Flashing 888  N/A                System has crashed.
102           N/A                This value indicates an unexpected system halt.
nnn           N/A                This value is the cause of the system halt (reason code).
000           N/A                Unexpected system interrupt (hardware related).
2xx           N/A                Machine check.

Error log

If the dump was lost or was not saved during system boot, the error log can help determine the nature of the problem that caused the dump. To check the error log, use the errpt command.

Create a user initiated dump

Create a test dump by entering the following command:

Step 1. # sysdumpstart -p
IA-64 Systems - For a dump that is approximately 120MB in size, wait approximately 15 minutes before shutting down the machine.

Step 2. Reboot the system.

dumpcheck utility

Description

The /usr/lib/ras/dumpcheck utility is used to check the disk resources used by the system dump facility. The command logs an error if either the largest dump device is too small to receive the dump or there is insufficient space in the copy directory when the dump device is a paging space.

Requirements

In order to be effective, the dumpcheck utility must be enabled:
• To verify that dumpcheck has been enabled, use the following command:
# crontab -l | grep dumpcheck
0 15 * * * /usr/lib/ras/dumpcheck >/dev/null 2>&1
• Enable the dumpcheck utility by using the -t flag. This will create an entry in the root crontab if none exists. For example, to set the dumpcheck utility to run at 2PM:
# /usr/lib/ras/dumpcheck -t "0 14 * * *"
• dumpcheck should be run at the time the system is heavily loaded, in order to find the maximum size the dump will take. The default time is set for 3PM.

dumpcheck overview

The dumpcheck utility will do the following when enabled:
• Estimate the dump or compressed dump size using sysdumpdev -e
• Find the dump logical volumes and copy directory using sysdumpdev -l
• Estimate the primary and secondary dump device sizes
• Estimate the copy directory free space
• If the dump device is a paging space, verify that the free space in the copy directory is large enough to copy the dump
• If the dump device is a logical volume, verify that it is large enough to contain a dump
• If the dump device is a tape, exit without a message
Any time a problem is found, dumpcheck will log an entry in the error log and, if the -p flag is present, will display a message to stdout; when run from crontab, this output is mailed to the root user.

Error log entry sample

The following is an example of an error log entry created by the dumpcheck utility because of lack of space in the primary and secondary dump devices:
----------------------------------------------------
LABEL:           DMPCHK_TOOSMALL
IDENTIFIER:      E87EF1BE
Date/Time:       Tue Aug 15 09:49:41 CDT
Sequence Number: 45
Machine Id:      000714834C00
Node Id:         wcs2
Class:           O
Type:            PEND
Resource Name:   dumpcheck

Description
The largest dump device is too small.

Probable Causes
Neither dump device is large enough to accommodate a system dump at this time.

Recommended Actions
Increase the size of one or both dump devices.

Detail Data
Largest dump device
testdump
Largest dump device size in kb
8192
Current estimated dump size in kb
65536
----------------------------------------------------

Verify the dump

Description

Before submitting a dump to IBM for analysis, it is important to verify that the dump is valid and readable.

Locating the dump

To locate the dump, issue the following command:
# sysdumpdev -L
The following output shows a good dump:
0453-039 Device name:         /dev/dumplv
         Major device number: 10
         Minor device number: 2
         Size:                8837632 bytes
         Uncompressed Size:   32900935 bytes
         Date/Time:           Fri Sep 22 13:01:41 PDT 2000
         Dump status:         0
         dump completed successfully
         Dump copy filename:  /var/adm/ras/vmcore.0.Z
In this case a valid dump was safely saved by the system in the /var/adm/ras directory.
The following case shows the command output when the copy failed. Presumably the dump is available on the external media device, for example, tape.
0453-039 Device name:         /dev/dumplv
         Major device number: 10
         Minor device number: 2
         Size:                8837632 bytes
         Uncompressed Size:   32900935 bytes
         Date/Time:           Fri Sep 22 13:01:41 PDT 2000
         Dump status:         0
         dump completed successfully
0481-195 Failed to copy the dump from /dev/dumplv to /var/adm/ras.
0481-198 Allowed the customer to copy the dump to external media.

Note: A dump saved on Initial Program Load (IPL) to external media is not sufficient for analysis. Additional files are required.

Dump analysis tools

To verify the dump is valid, the dump must be examined by a kernel debugger. The kernel debugger used to validate the dump depends on the system architecture. If the system is running on Power PC, the debugger is kdb. The kernel debugger for IA-64 platforms is iadb.

Verifying the dump

The following procedure should be used to verify the dump:

Step 1. Locate the crash dump:
# sysdumpdev -L
0453-039 Device name:         /dev/dumplv
         Major device number: 10
         Minor device number: 2
         Size:                8837632 bytes
         Uncompressed Size:   32900935 bytes
         Date/Time:           Fri Sep 22 13:01:41 PDT 2000
         Dump status:         0
         dump completed successfully
         Dump copy filename:  /var/adm/ras/vmcore.0.Z

Step 2. Change directory to the dump location. In the above example:
# cd /var/adm/ras

Step 3. Decompress the vmcore file if necessary:
# uncompress vmcore.0.Z

Step 4. Start the kernel debugger.

Power PC:
# kdb /var/adm/ras/vmcore.0
The specified kernel file is a UP kernel
vmcore.1 mapped from @ 70000000 to @ 71fdba81
Preserving 880793 bytes of symbol table
First symbol __mulh
KERNEXT FUNCTION NAME CACHE (90112 bytes) allocated
KERNEXT COMMANDS SPACE (4096 bytes) allocated
Component Names:
1) dmp_minimal [5 entries]
....
Dump analysis on CHRP_UP_PCI POWER_PC POWER_604 machine with 1 cpu(s) (32-bit registers)
Processing symbol table...
.......................done
(0)>

IA-64:
# iadb /var/adm/ras/vmcore.0
symbol capture using file: /unix
iadb: Probing a live system, with memfd as :4
Current Context: cpu:0x1, thread slot: 77, process Slot: 51, ad space: 0x8e44
thrd ptr: 0xe00000972a13b000, proc ptr: e00000972a12e000
mst at:3ff002ff3b400
(1)>

Step 5. Issue the stat subcommand to verify the details of the dump. Ensure the values are consistent with the dump that was taken.

Power PC:
(0)> stat
SYSTEM_CONFIGURATION:
CHRP_UP_PCI POWER_PC POWER_604 machine with 1 cpu(s) (32-bit registers)
SYSTEM STATUS:
sysname... AIX
nodename.. kca41
release... 0
version... 5
machine... 000930134C00
nid....... 0930134C
time of crash: Thu Oct 5 10:37:57 2000
age of system: 3 min., 11 sec.
xmalloc debug: disabled

IA-64:
(1)> stat
SYSTEM_CONFIGURATION:
IA64 machine with 2 cpu(s) (64-bit registers)
SYSTEM STATUS:
sysname... AIX
nodename.. kca40
hostname.. kca40.hil.sequent.com
release... 0
version... 5
machine... 000000004C00
nid....... 0000004c
current time: Fri Oct 6 12:20:56 2000
age of system: 1 day, 1 hr., 1 min., 43 sec.
xmalloc debug: disabled

Step 6. Exit the kernel debugger.

Power PC:
(0)> q

IA-64:
(1)> q

Packaging the dump

Overview

Once a valid dump has been identified, the next step is to package the dump to be sent in for analysis.

Packaging the dump

The following procedure will automatically collect the required files pertaining to the system dump:

Step 1. Compress the vmcore file:
# compress /var/adm/ras/vmcore.0

Step 2. Gather all of the files and information regarding the dump using the following command:
# snap -Dkg
Checking space requirement for general information............ done.
Checking space requirement for kernel information............. done.
Checking space requirement for dump information............... done.
Checking for enough free space in filesystem.................. done.
********Checking and initializing directory structure
Creating /tmp/ibmsupt directory tree... done.
Creating /tmp/ibmsupt/dump directory tree... done.
Creating /tmp/ibmsupt/kernel directory tree... done.
Creating /tmp/ibmsupt/general directory tree... done.
Creating /tmp/ibmsupt/general/diagnostics directory tree... done.
Creating /tmp/ibmsupt/testcase directory tree... done.
Creating /tmp/ibmsupt/other directory tree... done.
********Finished setting up directory /tmp/ibmsupt
Gathering general system information.......................... done.
Gathering kernel system information........................... done.
Gathering dump system information............................. done.

Step 3. Copy the dump to external media. To copy the gathered files to the /dev/rmt0 tape device, issue the following command:
# snap -o /dev/rmt0
Once this command completes, the tape can be removed and sent in for analysis. Write-protect the tape and label it appropriately.

Packaging a dump stored on external media

A dump saved to external media needs to be gathered with other files to provide a dump which is readable. To gather and package the files, follow these steps:

Step 1. Create a skeleton directory to contain the dump information:
# snap -D
This will fail, stating that the dump device is no longer valid. Overcome this by restoring the dump from the media used on IPL to save the dump.

Step 2. Restore the dump from external media. For example, a dump saved to the /dev/rmt0 device is restored by the commands:
# cd /tmp/ibmsupt/dump
# tar -xvf /dev/rmt0
# mv dump_file dump

Step 3. Copy the dump to external media. To copy the gathered files to the /dev/rmt0 tape device, issue the following command:
# snap -o /dev/rmt0
Once this command completes, the tape can be removed and sent in for analysis. Write-protect the tape and label it appropriately.

Unit 6. Introduction to Dump Analysis Tools

This lesson describes the different tools that are available to debug a system dump taken from an AIX5L system.

What You Should Be Able to Do

After completing this unit, you should be able to:
• Describe available tools for system dump analysis
• Invoke the IADB/iadb and KDB/kdb kernel debuggers

About This Lesson

Purpose

This lesson describes the different tools that are available to debug a system dump taken from an AIX5L system.

Prerequisites

You should have completed the following lesson:
• Configuring System Dumps on AIX5L

Objectives

At the completion of this lesson, you will be able to:
• Describe available tools for system dump analysis
• Invoke the IADB/iadb and KDB/kdb kernel debuggers

Table of contents

This lesson covers the following topics:
• About This Lesson
• System Dump Analysis Tools
• dump components
• Dump creation process
• Component dump routines
• bosdebug command
• Memory Overlay Detection System
• System Hang Detection
• truss command
• KDB kernel debugger
• kdb command
• KDB miscellaneous sub commands
• KDB dump/display/decode sub commands
• KDB modify memory sub commands
• KDB trace sub commands
• KDB break point and step sub commands
• KDB name list/symbol sub commands

• KDB watch break point sub commands
• KDB machine status sub commands
• KDB kernel extension loader sub commands
• KDB address translation sub commands
• KDB process/thread sub commands
• KDB Kernel stack sub commands
• KDB LVM sub commands
• KDB SCSI sub commands
• KDB memory allocator sub commands
• KDB file system sub commands
• KDB system table sub commands
• KDB network sub commands
• KDB VMM sub commands
• KDB SMP sub commands
• KDB data and instruction block address translation sub commands
• KDB bat/brat sub commands
• IADB kernel debugger
• iadb command

• IADB break point and step sub commands
• IADB dump/display/decode sub commands
• IADB modify memory sub commands
• IADB name list/symbol sub commands
• IADB watch break point sub commands
• IADB machine status sub commands
• IADB kernel extension loader sub commands
• IADB address translation sub commands
• IADB process/thread sub commands
• IADB LVM sub commands
• IADB SCSI sub commands
• IADB memory allocator sub commands
• IADB file system sub commands
• IADB system table sub commands
• IADB network sub commands
• IADB VMM sub commands
• IADB SMP sub commands
• IADB block address translation sub commands
• IADB bat/brat sub commands
• IADB miscellaneous sub commands
• Exercise

Estimated length

This lesson takes approximately 1.5 hours to complete.

Accountability

You will be able to measure your progress with the following:
• Exercises using your lab system
• Check-point activity
• Lesson review

Reference
• AIX5L docs

Organization of this lesson

This lesson consists of information followed by exercises that allow you to practice what you’ve just learned. Sometimes, as the information is being presented, you are required to do something - pull down a menu, enter a response, etc. This symbol, in the left hand side-head, is an indication that an action is required.

System Dump Analysis Tools

Introduction

AIX5L introduces new debugging tools. The main change from the previous releases of AIX is that the crash command has been replaced by:
• IADB and KDB kernel debuggers for live system debugging
• iadb and kdb commands for system image analysis
In addition, the following tools/commands are available to assist you with debugging:
• bosdebug
• Memory Overlay Detection System (MODS)
• System Hang Detection
• truss

Typographic conventions

In the following sections we will use uppercase IADB and KDB when speaking about the live kernel debuggers, and lowercase iadb and kdb when speaking about the commands.

dump components

Introduction

In AIX5L, a dump image is not actually a full image of the system memory, but a set of memory areas dumped by the dump process.

The Master dump Table

A master dump table entry is a pointer to a function, provided by the kernel extension, that will be called by the kernel dump routine when a system dump occurs. These functions must return a pointer to a component dump table structure. The functions and the component dump table entries must both reside in pinned global memory. They must be registered with the kernel using the dmp_add kernel service and unregistered using the dmp_del kernel service. Kernel-specific areas are preloaded by kernel initialization.

Component dump tables

Component dump tables are structures of type struct cdt. They are returned by the registered dump functions when the dump process starts. Each one is a structure made of:
• a CDT header
• an array of CDT entries

CDT Header

The CDT header contains:
• A magic number that can be one of:
  • DMP_MAGIC_32 for a 32-bit CDT
  • DMP_MAGIC_VR for a 32-bit CDT that may contain virtual or real addresses
  • DMP_MAGIC_64 for a 64-bit CDT
• the component dump name
• the length of the component dump table

CDT entries

CDT entries in the component dump tables will be one of cdt_entry64, cdt_entry_vr or cdt_entry32, according to the DMP_MAGIC number, as defined in /usr/include/sys/dump.h.
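To make the CDT shape concrete, here is a hedged C sketch of a header and a 32-bit entry; all field and type names here are invented for illustration - the authoritative layouts are the cdt and cdt_entry definitions in /usr/include/sys/dump.h:

/* Hedged sketch - field names are illustrative; see /usr/include/sys/dump.h
 * for the real struct cdt and cdt_entry32/cdt_entry_vr/cdt_entry64 layouts. */

struct my_cdt_entry32 {            /* one dumpable memory area */
    char          name[8];         /* label shown by dump tools */
    unsigned long addr;            /* start of the data area    */
    unsigned long len;             /* length of the data area   */
};

struct my_cdt {                    /* a CDT: header + entry array */
    unsigned int  magic;           /* DMP_MAGIC_32, _VR or _64    */
    char          comp_name[8];    /* component dump name         */
    unsigned int  nentries;        /* length of the entry array   */
    struct my_cdt_entry32 entry[1];
};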

Dump creation process

Introduction

This section describes the dump process.

Process overview

The following steps are used to write a dump to the dump device:

Step 1. Interrupts are disabled.
Step 2. 0c9 or 0c2 is written to the LED display, if present.
Step 3. Header information about the dump is written to the dump device.
Step 4. The kernel steps through each entry in the master dump table, calling each component dump routine twice:
• once to indicate that the kernel is starting to dump this component (1 is passed as a parameter)
• again to say that the dump of this component is complete (2 is passed)
After the first call to a component dump routine, the kernel processes the CDT that was returned. For each CDT entry, the kernel:
• checks every page in the identified data area to see if it is in memory or paged out
• builds a bitmap indicating each page's status
• writes a header, the bitmap, and those pages which are in memory to the dump device
Step 5. Once all dump routines have been called, the kernel enters an infinite loop, displaying 0c0 or flashing 888.

Component dump routines

Description

Component Dump Routines • When called with a 1: • Make any necessary preparations for dumping • For example, they may read device-specific information from an adapter. The FDDI device driver does this • Fill in the component dump table • Most device drivers do this during their initialization • Return the address of the component dump table • When called with a 2: • Clean up after themselves • In reality, most routines either return immediately, do some debug printfs and then return, or else they ignore the parameter entirely and return the same thing every time

Note

A component dump routine may or may not do a lot of work when called with a 1. Many simply return the address of some previously-initialized CDT, but some (for example, the thread table and process table dump routines) actually build the CDT from scratch. The original rationale for the second call to each dump routine was to provide notification that the dump process had finished with that component's dump data. In practice, however, no one really cares. The routines that just return an address don't even bother to look at the parameter they were passed. The routines that build the data on the fly look for a 2 and return immediately. The most that any routine today does with this second call is to issue some debug printf call. This is generally used to debug the component dump routine itself, by verifying that the system dump facility was able to successfully process its CDT.
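A minimal sketch of this calling convention, with illustrative names only (no shipped driver is quoted here), might look as follows:

    #include <sys/dump.h>

    /*
     * Sketch of the two-call protocol described above: the kernel passes 1
     * before dumping the component and 2 once the dump of it is complete.
     */
    struct cdt *sample_dump_routine(int op)
    {
        extern struct cdt sample_cdt;   /* previously initialized, pinned */

        if (op == 1) {
            /* make any preparations, e.g. read device-specific state */
        }
        /* op == 2: dump of this component finished; nothing to clean up */
        return &sample_cdt;
    }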

bosdebug command

Introduction

The bosdebug command can be used to enable or disable the MODS feature as well as other kernel debugging parameters. Any changes made with the bosdebug command will not take effect until the system is rebooted.

bosdebug parameters

The bosdebug command accepts the following parameters:
• -I: Causes the kernel debug program to be loaded and invoked on each subsequent reboot.
• -D: Causes the kernel debug program to be loaded on each subsequent reboot.
• -M: Causes the memory overlay detection system to be enabled. Memory overlays in kernel extensions and device drivers will cause a system crash.
• -s sizelist: Causes the memory overlay detection system to promote each of the specified allocation sizes to a full page, and to allocate and hide the next subsequent page after each allocation. This causes references beyond the end of the allocated memory to cause a system crash. sizelist is a list of memory sizes separated by commas. Each size must be in the range from 16 to 2048, and must be a power of 2.
• -S: Causes the memory overlay detection system to promote all allocation sizes to the next higher multiple of the page size (4096), but does not hide subsequent pages. This improves the chances that references to freed memory will result in a crash, but it does not detect reads or writes beyond the end of allocated memory until that memory is freed.
• -n sizelist: Has the same effect as the -s option, but works instead for network memory. Each size must be in the range from 32 to 2048, and must be a power of 2. This causes the net_malloc_frag_mask variable of the 'no' command to be turned on during boot.
• -o: Turns off all debugging features of the system.
• -L: Displays the current settings for the kernel debug program and the memory overlay detection system.
• -R on | off: Sets the real-time extensions, for multiprocessor systems only.

Memory Overlay Detection System

Introduction

The Memory Overlay Detection System (MODS) helps detect memory overlay problems in the kernel, kernel extensions, and device drivers. The MODS can be enabled using the bosdebug command.

Problems detected

Some of the most difficult types of problems to debug are what are generally called "memory overlays." Memory overlays include the following:
• Writing to memory that is owned by another program or routine
• Writing past the end (or before the beginning) of declared variables or arrays
• Writing past the end (or before the beginning) of dynamically-allocated memory
• Writing to or reading from freed memory
• Freeing memory twice
• Calling memory allocation routines with incorrect parameters or under incorrect conditions
In the kernel environment (including the kernel, kernel extensions, and device drivers), memory overlay problems have been especially difficult to debug because tools for finding them have not been available. Starting with Version 4.2.1, however, the Memory Overlay Detection System (MODS) helps detect memory overlay problems in the kernel, kernel extensions, and device drivers.
Note: This feature does not detect problems in application code; it only watches kernel and kernel extension code.

When to use MODS

This feature is useful in the following circumstances:
• When developing your own kernel extensions or device drivers and wanting to test them thoroughly
• When asked to turn this feature on by the IBM technical support service to help in further diagnosing a problem that you are experiencing
Memory Overlay Detection System -- continued

How MODS works

The primary goal of the MODS feature is to produce a dump file that accurately identifies the problem. MODS works by turning on additional checking to help detect the conditions listed above. When any of these conditions is detected, your system crashes immediately and produces a dump file that points directly at the offending code. (Previously, a system dump might point to unrelated code that happened to be running later, when the invalid situation was finally detected.) If your system crashes while MODS is turned on, then MODS has most likely done its job. To make it easier to detect that this situation has occurred, the IADB/iadb and KDB/kdb commands have been extensively modified. The stat subcommand now displays both:
• whether the MODS (also called "xmalloc debug") has been turned on
• whether this crash was the result of the MODS detecting an incorrect situation
The xmalloc subcommand provides details on exactly what memory address (if any) was involved in the situation, and displays mini-tracebacks for the allocation and/or free of this memory. Similarly, the netm command displays allocation and free records for memory allocated using the net_malloc kernel service (for example, mbufs, mclusters, etc.). You can use these commands, as well as standard crash techniques, to determine exactly what went wrong.
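As an illustration of the kind of bug MODS turns into an immediate, well-located crash, consider the following hypothetical kernel-extension fragment. The xmalloc/xmfree argument conventions shown (size, log2 alignment, heap) and the header are assumptions to verify against the kernel services documentation.

    /*
     * Hypothetical fragment: with "bosdebug -M -s 32" in effect, the 32-byte
     * allocation below is promoted to a full page with a hidden page after it,
     * so the off-by-one store crashes the system immediately, pointing here.
     */
    #include <sys/malloc.h>   /* kernel_heap, xmalloc, xmfree (assumed header) */

    void demonstrate_overlay(void)
    {
        char *p = xmalloc(32, 2, kernel_heap); /* 32 bytes from the kernel heap */

        p[32] = '\0';                          /* write past the end: overlay!  */
        xmfree(p, kernel_heap);
    }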

MODS limitations

There are limitations to the Memory Overlay Detection System. Although it significantly improves your chances, MODS cannot detect all memory overlays. Also, turning MODS on has a small negative impact on overall system performance and causes somewhat more memory to be used in the kernel and the network memory heaps. If your system is running at full CPU utilization, or if you are already near the maximums for kernel memory usage, turning on the MODS may cause performance degradation and/or system hangs. Our practical experience with the MODS, however, is that the great majority of customers will be able to use it with minimal impact to their systems.
Memory Overlay Detection System -- continued

MODS and kdb

If a system crash occurs due to a MODS problem, the kdb xm sub command can display status and traces for memory overlay problems.

System Hang Detection

Introduction

System hang management allows users to run mission-critical applications continually while improving application availability. System hang detection alerts the system administrator of possible problems and then allows the administrator to log in as root or to reboot the system to resolve the problem.

System Hang Detection

All processes (more precisely, their threads) run at a priority. This priority is numerically inverted in the range 40-126: 40 is the highest priority and 126 is the lowest. The default priority for all threads is 60. The priority of a process can be lowered by any user with the nice command; anyone with root authority can also raise a process's priority. The kernel scheduler always picks the highest-priority runnable thread to put on a CPU. It is therefore possible for a sufficient number of high-priority threads to completely tie up the machine, such that low-priority threads can never run. If the running threads are at a priority higher than the default of 60, this can lock out all normal shells and logins to the point where the system appears hung. The System Hang Detection (SHD) feature provides a mechanism to detect this situation and gives the system administrator a means to recover. The feature is implemented as a daemon (shdaemon) that runs at the highest process priority. The daemon queries the kernel for the lowest-priority thread run over a specified interval. If that priority is above a configured threshold, the daemon can take one of several actions. Each of these actions can be independently enabled, and each can be configured to trigger at any priority and over any time interval. The actions and their defaults are:

System Hang Detection -- continued

Action                     Default   Default    Default Timeout   Default
                           Enabled   Priority   (Seconds)         Device
Log an error in errlog     disabled  60         120
Display a warning message  disabled  60         120               /dev/console
Give a recovery getty      enabled   60         120               /dev/tty0
Launch a command           disabled  60         120
Reboot the system          disabled  39         300
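To illustrate the condition being detected, the following user-level sketch (illustrative only, not part of AIX) shows a root-owned, CPU-bound program whose raised priority could lock out normal shells; the nice value of -20 is only an example.

    /*
     * Illustration of the hang scenario SHD watches for: a CPU-bound loop
     * running at a better-than-default priority never yields, so threads at
     * the default priority of 60 (login shells included) never run again.
     */
    #include <sys/resource.h>

    int main(void)
    {
        setpriority(PRIO_PROCESS, 0, -20);  /* raise our priority (root only) */

        for (;;)
            ;                               /* spin forever without yielding  */
    }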

System Hang Detection -- continued

shconf Script

The shconf command is invoked when System Hang Detection is enabled. shconf configures which events are surveyed and what actions are to be taken if such events occur. The user can specify the five actions described below, as well as the priority level to check, the time-out period during which no process or thread executes at a lower or equal priority, and the terminal device for the warning action and the getty action:
• Log an error in the error log file
• Display a warning message on the system console (alphanumeric console) or on a specified TTY
• Reboot the system
• Give a special getty to allow the user to log in as root and launch commands
• Launch a command
For the Launch a command and Give a special getty options, SHD launches the special getty or the specified command at the highest priority. The special getty prints a warning message specifying that it is a recovering getty running at priority 0. The table above lists the default values when SHD is enabled. Only one action is enabled per type of detection.
Note: When Launch a recovering getty on a console is enabled, the shconf script adds the -u flag to the getty line in the inittab that is associated with the console login.

System Hang Detection -- continued

Process

The shdaemon daemon is in charge of handling the detection of system hangs. It retrieves configuration information, initializes its working structures, and starts the detection timers set by the user. shdaemon is started by init with a priority of zero. The shdaemon entry is set to off or respawn in the inittab each time the shconf command disables or enables the sh_pp option.

SMIT Interface

You can manage the SHD configuration from the SMIT System Environments menu. From the System Environments menu, select Manage System Hang Detection. The options in this menu allow system administrators to enable or disable the detection mechanism.
System Hang Detection -- continued

Configuration of the SHD

The shconf command can be used to configure System Hang Detection. The following parameters may be used with shconf:
• -d: Display the System Hang Detection status.
• -R -l prio: Reset the effective values to the defaults.
• -D[O] -l prio: Display the default values (the optional O outputs the values separated by colons).
• -E[O] -l prio: Display the effective values (the optional O outputs the values separated by colons).
• -l prio [-a Attribute=Value]: Change the Attribute to the new Value.

Options

The following options can be used to customize System Hang Detection:

name        default       description
sh_pp       enable        Enable Process Priority Problem detection
pp_errlog   disable       Log an error in the error log
pp_eto      2             Detection time-out
pp_eprio    60            Process priority
pp_warning  disable       Display a warning message on a console
pp_wto      2             Detection time-out
pp_wprio    60            Process priority
pp_wterm    /dev/console  Terminal device
pp_login    enable        Launch a recovering login on a console
pp_lto      2             Detection time-out
pp_lprio    56            Process priority
pp_lterm    /dev/tty0     Terminal device
pp_cmd      disable       Launch a command
pp_cto      2             Detection time-out
pp_cprio    60            Process priority
pp_cpath    /             Script
pp_reboot   disable       Automatically reboot the system
pp_rto      5             Detection time-out
pp_rprio    39            Process priority

System Hang Detection -- continued

example

The following output represents a use of the shconf command:

# shconf -R -l prio

IADB Kernel Debugger

The IADB kernel debugger is entered when one of the following occurs:
• An application makes a call to the breakpoint() kernel service or to the breakpoint system call.
• A breakpoint previously set using IADB is reached.
• A fatal system error occurs. A dump might be generated on exit from IADB.

IADB concept

When the IADB Kernel Debugger is invoked, it is the only running program until you exit IADB or use the start sub command to start another CPU. All processes are stopped and interrupts are disabled. The IADB Kernel Debugger runs with its own Machine State Save Area (mst) and a special stack. In addition, the IADB Kernel Debugger does not run operating system routines. Though this requires that the needed kernel code be duplicated within IADB, it makes it possible to break anywhere within the kernel code. When exiting the IADB Kernel Debugger, all processes continue to run, unless the debugger was entered via a system halt.

iadb command

Introduction

The iadb command, unlike the IADB kernel debugger, allows examination of an operating system image produced on IA-64 systems. The iadb command may be used on a running system, but it does not provide all functions available with the IADB kernel debugger.

Parameters

The iadb command may be used with the following parameters:
• no parameter: iadb uses /dev/mem as the system image file and /usr/lib/boot/unix as the kernel file. In this case root permissions are required.
• -d system_image_file: iadb uses the image file provided.
• -u kernel_file: iadb uses the kernel file provided. This is required to analyze a system dump on a system that runs a different unix level.
• -i include file list (may be comma separated)
• -u user modules list for any symbol retrieval (comma-separated list)

Loading errors

If the system image file provided does not contain a valid dump, or if the kernel file does not match the system image file, the following message may be issued by the iadb command:

# iadb -u /usr/lib/boot/unix -d dump_file
**TBD**

IADB break point and step sub commands

Introduction

The following functions are covered by the breakpoint and step sub commands. The IADB sub commands in this area are br, c, sr, s/so, sb and stb; the iadb command does not provide any of them (N/A), and no matching crash/lldb sub commands are listed:
• set/list break point
• set/list local break point
• clear local break point
• clear break points
• clear all break points
• go to end of function
• go until address
• single step
• step a bundle
• step to next branch
• step on bl/blr
• step on branch

br sub command

The br sub command can be used to set and display software break points. The br sub command accepts the following options:
• None: display the currently set break points
• -a 'N': break after 'N' occurrences
• -c {expr}: break if the condition {expr} is true
• -d: deferred; set the break point when the module is loaded
• -e 'N': break every 'N' occurrences
• -t 'tid': break only if the current thread id is 'tid'
• -u 'N': break up to 'N' occurrences
• address: the break point address

c sub command

The c sub command can be used to clear some or all break points. The c sub command accepts the following parameters:
• index: index of the break point as listed in the br output
• address: address of the break point
• all: clear all break points

IADB break point and step sub commands -- continued

Examples

The following example shows the use of the br, c and s sub commands:

# ps -mF "THREAD"
    1: adds r8 = 0x1, r41 ;;
    2: cmp.eq p6, p0 = 0, r40
}
> dis kread+90
> br -a 5 -t 2A71 kread
> br
> go
    0: alloc r35 = ar.pfs, 5, 0, 5, 0
    1: adds sp = -0xA0, sp
    2: mov r36 = rp ;;
}

IADB break point and step sub commands -- continued

Examples continued

> sb
E0000000002E2230 kread()+10:
{ .mii
==> 0: adds r8 = 0x18, sp
    1: adds r40 = 0x20, sp
    2: adds r9 = 0x28, sp
}
> stb
E0000000002E2620 rdwr()+0:
{ .mii
==> 0: alloc r41 = ar.pfs, 11, 0, 6, 0
    1: adds sp = -0x50, sp
    2: mov r42 = rp ;;
}
> sr
E0000000002E22A0 kread()+80:
{ .mii
==> 0: adds r9 = 0, r8
    1: nop.i 0 ;;
    2: cmp4.eq p6, p7 = 0, r9
> c all
> br
> go
CPU0> d dbg_avail
CPU0> dp 0x1000 2 5
CPU0> dio 0x3f6 1 8
CPU0> dis kread
CPU0> dpci 0 0x58 0 0x20 4
CPU0> d enter_dbg
CPU0> m enter_dbg 4 0x43
CPU0> d enter_dbg
CPU0> dp 0x5000
CPU0> mp 0x5000 8 0x1122334455667788
CPU0> dp 0x5000
CPU0> b
CPU0> iip
CPU0> kr
CPU0> p
CPU0> r
CPU0> rr
CPU0> mio 0x408 8 0
CPU0> map (r34)
CPU0> map 0xe000000000000000
CPU0> map foo+0x100
CPU0> dbr
CPU0> dbr foo
CPU0> dbr -t foo
CPU0> cdbr 3
CPU0> cdbr 0xe000000000011cc0
CPU1> sys
CPU1> reason
E00000000008E000 waitproc()+160:
{ .mii
==> 0: alloc r35 = ar.pfs, 5, 0, 5, 0
    1: adds sp = -0xA0, sp
    2: mov r36 = rp ;;
}

IADB kernel extension loader sub commands

Introduction

The following table represents the kernel extension loader sub commands and their matching crash/lldb sub commands when available:

kernel extension loader function   crash/lldb   IADB              iadb
list loaded extension              le           kext
list loaded symbol tables                       ldsyms
remove symbol table                             unldsyms
list export tables

kext sub command

The kext sub command displays all loaded kernel extensions and their text and data load addresses.

ldsyms and unldsyms sub commands

The ldsyms and unldsyms sub commands load or unload the symbols of a kernel extension.

examples

(0)> kext
CPU0> x foo+0x4000
CPU0> x 0x20000000
CPU0> x (r1)
CPU0> cpu 1
E00000000008C7F2 waitproc_find_run_queue()+F2:
{ .mii
    0: adds r20 = 0x1, r10
    1: shr.u r19 = r11, r10 ;;
==> 2: and r21 = r17, r19
}
>CPU1>

IADB block address translation sub commands

Introduction

The following table represents the block address translation sub commands and their matching crash/lldb sub commands when available:

block address translation function   crash/lldb   IADB   iadb
display dbats
display ibats
modify dbats
modify ibats

parameters

Examples

IADB bat/brat sub commands

Introduction

The following table represents the bat/brat sub commands and their matching crash/lldb sub commands when available:

bat/brat function            crash/lldb   IADB   iadb
branch target
clear branch target
local branch target
clear local branch target

parameters

Examples

IADB miscellaneous sub commands

Introduction

The following table represents the miscellaneous sub commands and their matching crash/lldb sub commands when available:

miscellaneous function                       crash/lldb   IADB   iadb
reboot the machine
display help                                 help/?       help
run an aix command                           !
set kdbx compatibility                                    kdbx
exit                                                      go
set debugger parameters                      set          set    set
display elapsed time
enable/disable debug
calculate/convert a hexadecimal expression                       calc
calculate/convert a decimal expression

help sub command

The help sub command can be used without a parameter to display the command listing, or with a command as parameter to display help related to that command.

kdbx sub command

The kdbx sub command can be used to set the symbols needed to use kdb with the kdbx interface. The following variables are set by kdbx and modify the output of certain sub commands:
• kdbx_addrd: Display the breakpoint address instead of the symbol name
• kdbx_bindisp: Display output in binary format instead of ASCII format

go sub command

The go sub command is used to leave the debugger; this starts the dump process if the debugger was entered while the system was crashing.

IADB miscellaneous sub commands -- continued

set sub command

The set sub command can be used to set or display the following debugger parameters:
• rows=number : set the number of rows on the current display
• mltrace={on|off} : mltrace on/off; only on a DEBUG kernel
• sctrace={on|off} : verbose syscall prints on/off; only on a DEBUG kernel
• itrace={on|off} : enable/disable tracing; only on a DEBUG kernel
• umon={on|off} : enable/disable the umon performance tool
• exectrace={on|off} : verbose exec prints on/off; only on a DEBUG kernel
• excpenter={on|off} : debugger entry on exception on/off
• ldrprint={on|off} : verbose loader prints on/off; only on a DEBUG kernel
• kprintvga={on|off} : kernel prints to VGA on/off
• dbgtty={on|off} : use the debugger TTY as console on/off
• dbgmsg={on|off} : tee console and LED output to the TTY
• hotkey={on|off} : enter the debugger on key press on/off; only on a DEBUG kernel

Examples

Exercise

Introduction

In this exercise you will configure the system to enable the live debugger and invoke both the live and image debugger for your system. Complete the following steps:

Step  Action
1     Enable the Memory Overlay Detection System (MODS) using the bosdebug
      command.
2     Enable the live debugger with the bosboot command.
3     Reboot the system, and log in as root.
4     Verify MODS is enabled with the debugger:
          > stat
      xmalloc debug: ________________
5     Verify the debugger is available:
      Power PC: kdb
          > dw kdb_avail
          > q
      IA-64: iadb
          > d dbg_avail
          > go
6     Execute the following truss command:
          # truss -t kread -i ksh
      Hit the enter key. How many kread functions were executed? __________
      Enter the exit command to exit truss:
          # exit

Exercise -- continued

Step  Action
7     Change directory to /var/adm/ras.
8     Start the image debugger against the crash dump captured in the previous
      lesson.
9     Execute the following commands:
      • iadb: reason
        Why was the debugger entered? ___________________________
      • kdb: p * or iadb: pr *
        What is the process id for the errdemon? ____________________________
      • Execute the ls command: kdb: !ls or iadb: ! ls
      • iadb: sys
        What build of AIX5L was the crash dump taken on? __________________________
10    Exit the debugger: q
11    Enter the live debugger: Ctrl-Alt-NUMPAD4
12    Enter the cpu command.
      What is the status of CPU0? ________________________________
13    Exit the live debugger.

Unit 7. Process Management

Platform

This lesson is independent of platform.

Lesson Objectives

At the end of the lesson you will be able to:
• List and describe the states of a process.
• List the steps taken by the kernel to create a new process as the result of a fork() system call, and the steps taken to create a new thread of execution.
• Describe what happens when a process terminates.
• List the three thread models available in AIX 5.
• Identify the relationship between the internal structures proc, thread, user and u_thread.
• Use the kernel debugging tool to locate and examine processes, and the proc, thread, user and u_thread data structures.
• Manage process scheduling using available commands; manage processes and threads on an SMP system (to best employ cache affinity scheduling); and manage processes on a ccNUMA system (to best employ quad affinity scheduling).
• List the factors determining what action the threads of a process will take when a signal is received.
• Write a simple C program that uses the fork() system call to spawn new processes, uses the wait() system call to retrieve the exit status of a child process, creates a simple multi-threaded program by using the pthread_create() system call, and uses the exec() system call to load a new program into memory.

Process Management Fundamentals

Process definition

A process can be defined by the components that make it up. A process consists of:
• A process table entry
• A process ID (PID)
• Virtual address space
  - User area (u-area)
  - Program "text"
  - Data
  - User and kernel stacks
• Statistical information

Definition of process management

Process management consists of the tools and ability to have many processes and threads existing simultaneously in a system, sharing usage of the CPU or, on an SMP system, CPUs. Process management also includes the ability to start, stop, and force a stop of a process.

The tools and information used to manage the processes

• A process is a self-contained entity that consists of the information required to run a single program, such as a user application.
• The kernel contains a table entry for each process, called the proc entry.
• The proc entry contains the information necessary to keep track of the current state and the location of the page tables for the process.
• The proc entry resides in a slot in an array of proc entries.
• The kernel is configured with a fixed number of slots.
• All processes have a process ID, or PID.
• The PID is assigned when the process is created and provides a convenient way for users to refer to processes.
• The process contains a list of virtual memory addresses that the process is allowed to access.
• The user area (u_area) of a process contains additional information about the process when it is running.
• The kernel tracks statistical information for the process, such as the amount of time the process uses the CPU, the amount of memory the process is using, and so on. The statistical information is used by the kernel for managing its resources and for accounting purposes.

Process operations fork() system call

Process operations

Four basic operations define the lifetime of a process in the system:
• fork - Process creation
• exec - Loading of programs in a process
• exit - Death of a process
• wait - Notification of the parent process of the death of a child process

Fork new processes

The fork system call is the way to create a new process.
• All processes in the system (except the boot process) are created from other processes through the fork mechanism.
• All processes are descendants of the init process (process 1).
• A process that forks creates a child process that is nearly a duplicate of the original parent process.
• The child has a new proc entry (slot), PID, and registers.
• Statistical information is reset, and the child initially shares most of the virtual memory space with the parent process.
• The child process initially runs the same program as the parent process. The child may use the exec() call to run another program.

The fork() system call

The parent process has an entry in the process and thread tables before the fork() system call; after the fork() system call, another independent process is created with its own entries in the process and thread tables.

[Figure: AIX kernel fork() system call - the parent entry and the new child entry in the Thread Table and Process Table]

Process operations fork() system call -- continued

Inherited attributes after a fork() system call

The illustration shows what happens when the fork() system call is issued. The caller creates a child process that is almost an exact copy of itself. The child process inherits many attributes of the parent, but receives a new user block and data region. The child process inherits the following attributes from the parent process:
• Environment
• Close-on-exec flags and signal handling settings
• Set user ID mode bit and set group ID mode bit
• Profiling on and off status
• Nice value
• All attached shared libraries
• Process group ID and tty group ID
• Current directory and root directory
• File-mode creation mask and file size limit
• Attached shared memory segments and attached mapped file segments
• Debugger process ID and multiprocess flag, if the parent process has multiprocess debugging enabled (described in the ptrace subroutine)

Process operations fork() system call -- continued

Attributes not inherited from the parent process

Not all attributes are inherited from the parent. The child process differs from the parent process in the following ways:
• The child process has only one user thread: the one that called the fork subroutine, no matter how many threads the parent process had.
• The child process has a unique process ID.
• The child process ID does not match any active process group ID.
• The child process has a different parent process ID.
• The child process has its own copy of the file descriptors of the parent process. However, each file descriptor of the child process shares a common file pointer with the corresponding file descriptor of the parent process.
• All semadj values are cleared.
• Process locks, text locks, and data locks are not inherited by the child process.
• If multiprocess debugging is turned on, the trace flags are inherited from the parent; otherwise, the trace flags are reset.
• The child process utime, stime, cutime, and cstime are set to 0.
• Any pending alarms are cleared in the child process.
• The set of signals pending for the child process is initialized to the empty set.
• The child process can have its own copy of the message catalogue for the parent process.
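The shared file pointer mentioned above can be demonstrated with a few standard calls: after fork(), a read in the child advances the offset that the parent sees on the same descriptor. The file name used is only an example.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/wait.h>

    int main(void)
    {
        char c;
        int fd = open("/etc/motd", O_RDONLY);  /* any readable file will do */

        if (fork() == 0) {
            read(fd, &c, 1);                   /* child reads byte 0        */
            exit(0);
        }
        wait(NULL);                            /* let the child go first    */
        read(fd, &c, 1);                       /* parent now gets byte 1:   */
        printf("parent read '%c'\n", c);       /* the offset is shared      */
        return 0;
    }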

Process operations fork() system call -- continued

The fork() system call code example

The following code illustrates the usage of the fork() system call. After the call there are two processes, each executing its own copy of the same code. A process can determine whether it is the parent or the child from the return code.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    int statuslocation;
    pid_t proc_id, proc_id2;

    proc_id = fork();

    if (proc_id < 0) {
        printf("fork error\n");
        exit(-1);
    }
    if (proc_id > 0) {
        /* parent process: wait for the child to terminate */
        proc_id2 = wait(&statuslocation);
    }
    if (proc_id == 0) {
        /* I'm the child process */
        /* ............. */
    }
    return 0;
}

Process operations fork() system call -- continued

Listing processes with the ps command after fork()

Executing the test program creates two processes, which can be listed with the ps command. The program name in the example is fork, and that name is listed as the command for both the parent and the child. Note that the child's PPID is equal to the PID of the parent.

F      S UID   PID  PPID C PRI NI ADDR   SZ TTY   TIME CMD
240001 A   0 10346 10236 0  60 20 5b8b  496 pts/1 0:00 ksh
200001 A   0 10742 10346 0  68 24 9bb3   44 pts/1 0:00 fork
     1 A   0 10990 10742 0  68 24 dbbb   44 pts/1 0:00 fork

Processes without the parent process

In the previous example, it was shown how the PID of the calling process becomes the PPID of the child process. This example shows what happens if the parent process terminates before the child process terminates. If we rewrite the program so that the parent process terminates after fork() without waiting for the child, the system replaces the PPID with 1, which is the init process. The init process will then pick up the SIGCHLD signal so that the system can free the process table entry, even though the parent process no longer exists. This situation is shown below:

F      S UID   PID  PPID C PRI NI ADDR   SZ TTY   TIME CMD
240001 A   0 10346 10236 0  60 20 5b8b  496 pts/1 0:00 ksh
 40001 A   0 10996     1 0  68 24 8330   44 pts/1 0:00 fork
200001 A   0 11216 10346 3  61 20 dbbb  244 pts/1 0:00 ps

Zombie processes

If, for some reason, no process receives the SIGCHLD signal from the child, the empty slot remains in the process table, even though other resources are released. Such a process is called a zombie, and is listed by ps as <defunct>. The example below shows some of these zombie processes:

F      S UID   PID  PPID C PRI NI ADDR   SZ TTY   TIME CMD
200003 A   0     1     0 0  60 20 500a  704 -     0:03 init
240401 A   0  2502     1 0  60 20 d2da   40 -     0:00 uprintfd
240001 A   0  2622  2874 0  60 20 2965 5208 -     0:46 X
 40001 A   0  2874     1 0  60 20 c959  384 -     0:00 dtlogin
 50005 Z   0  3776     1 1  68 24             -   0:00 <defunct>
 40401 A   0  3890     1 0  60 20 91d2  480 -     0:00 errdemon
240001 A   0  4152     1 0  60 20 39c7   88 -     0:21 syncd
240001 A   0  4420  4648 0  60 20 4b29  220 -     0:00 writesrv
240001 A   0  4648     1 0  60 20 b1d6  308 -     0:00 srcmstr
 50005 Z   0 10072     1 0  68 24             -   0:00 <defunct>
 50005 Z   0 10454     1 0  68 24             -   0:00 <defunct>

Process operations exec() system call

Exec system call to load a new program

The exec subroutine does not create a new process; it loads a new program into the process.
• To execute a new program, a process uses the exec set of system calls to load the new program into memory and execute the program.
• Each program can successively exec other programs to load and execute in the process.

Valid program files for the exec() system call

The fork() system call creates a new process with a copy of the environment, while the exec() system call loads a new program into the current process and overlays the current program with a new one (which is called the new-process image). The new-process image file can be one of three file types:
• An executable binary file in XCOFF file format
• An executable text file that contains a shell procedure
• A file that names an executable binary file or shell procedure to be run

Inherited attributes after the exec() system call

The new-process image inherits the following attributes from the calling process image: session membership, PID, PPID, supplementary group IDs, process signal mask, and pending signals.
Process operations exec() system call -- continued

The exec() system call

The illustration shows how the process and thread tables remain unchanged after the exec() system call.

[Figure: the exec() system call - the process keeps its existing entries in the Thread Table and Process Table]

Process operations exec() system call -- continued

The exec() system call code example

The following code illustrates the usage of the execv() system call. After the call, the current process is overlaid with the new program. To illustrate the function, the output from the program is listed after the program. The program first defines two variables: the first is a pointer to the program name to be executed, and the second is a pointer to the arguments (by convention the first argument passed is the program name itself). The program source for sleeping.c is not supplied, as any program can be used for this example.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int returncode;

char *argumentp[4], arg1[50], arg2[50], arg3[50];
const char *Path = "/home/olc/prog/thread/sleeping";

int main(int argc, char **argv)
{
    strcpy(arg1, "/home/olc/prog/thread/sleeping");
    strcpy(arg2, "test param 1");
    strcpy(arg3, "test param 2");

    argumentp[0] = arg1;
    argumentp[1] = arg2;
    argumentp[2] = arg3;
    argumentp[3] = NULL;   /* execv() expects a NULL-terminated vector */

    printf("before execv\n");
    returncode = execv(Path, argumentp);
    printf("after execv\n");   /* reached only if execv() fails */
    exit(0);
}

and the program output:

before execv
I'm the sleeping process

Process operations exec() system call -- continued

The exec() system call

While the program in the example is being executed, we can examine the process status with the ps command. Notice that the program name for the example is "exec," and the program name for the called program is "sleeping." As we see in the listing from the ps command, the current program is replaced with the new one, and we never reach the print statement "after execv\n". The program prints "I'm the sleeping process," because the main program has been replaced with the program in the path variable. If we look closer at the output from the ps -l command before and after the system call, we can tell that the program name has been replaced, but the process ID and PPID remain the same. Before the exec system call takes place:

#> ps -l
F      S UID   PID  PPID C PRI NI ADDR  SZ TTY   TIME CMD
240001 A   0 10346 10236 0  60 20 5b8b 492 pts/1 0:00 ksh
200001 A   0 10696 10346 2  61 20 6bad 240 pts/1 0:00 ps
200001 A   0 10964 10346 0  68 24 4388  40 pts/1 0:00 exec

And after the exec() system call, the exec program is replaced with sleeping:

#> ps -l
F      S UID   PID  PPID C PRI NI ADDR TTY   TIME CMD
240001 A   0 10346 10236 0  60 20 5b8b pts/1 0:00 ksh
200001 A   0 10698 10346 2  61 20 a354 pts/1 0:00 ps
200001 A   0 10964 10346 0  68 24 4388 pts/1 0:00 sleeping

Process operations exit system call

Exit: what happens when a process terminates

The exit system call is executed at the end of every process. The system call cleans up and releases memory, text and data, but leaves an entry in the process table so that a return value and other status information can be passed to the parent process if needed.
• exit - termination of a process
• When a program no longer needs to run or execute other programs, it can exit.
• A program that exits causes the process to enter the zombie state.

Exiting from a program

There are basically three ways that a process can terminate: the program can reach the end of the program flow and meet an explicit exit(exit_value) statement; the program flow can end without an exit() statement (in which case the linker automatically inserts a call to the exit system call); or the running program receives a signal from an external source, such as a keyboard interrupt from the user. If the program receives an interrupt, the program path switches to the interrupt handling routine, either in the program or the system default routine, which terminates the program with an exit. When the exit() system call executes, all memory and other resources are freed, and the parameter supplied to exit() is placed in the process table as the exit value for the process. After the completion of the exit() system call, a SIGCHLD signal is issued to the parent process (the process at this stage is nothing but the process table entry). This state is called the zombie state; when the parent process reacts to the SIGCHLD signal and reads the return code from the process table, the system can remove the process table entry, clean up, and free the slot. On rare occasions the parent process cannot respond to the signal immediately, and we can see the zombie in the process table with the ps command. A zombie is listed as <defunct>.

Process operations, wait() system call

Waiting for the death of a child process

The wait system call is placed at the end of a program; normally it is placed there by the programmer as the system call wait(), but if not, the system automatically adds one. The wait call is used to notify the parent process of the death of the child process and to release the child's process slot.
• The parent process can be notified of the death of the child by waiting with a system call or catching the proper signal.
• Once the parent process acknowledges the death of a child process, the child process' slot is freed.
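A minimal sketch of the parent retrieving a child's exit value with wait() follows; WIFEXITED and WEXITSTATUS are the standard macros for decoding the status word:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void)
    {
        int status;

        if (fork() == 0)
            exit(42);                      /* child: its exit value        */

        wait(&status);                     /* parent: blocks until it dies */
        if (WIFEXITED(status))             /* terminated normally?         */
            printf("child exited with %d\n", WEXITSTATUS(status)); /* 42   */
        return 0;
    }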

Process states

Process states

In AIX, processes can be in one of five states:
• Idle
• Active
• Stopped
• Swapped
• Zombie

Idle state

When processes are being created, they are first in the idle state. This state is temporary until the fork mechanism is able to allocate all of the necessary resources for the creation and fill in the process table for a new process.

Active state

Once the new child process creation is completed, it is placed in the active state. The active state is the normal process state, and threads in the process can be running or be ready-to-run.

Stopped processes

Processes can also be stopped, or in a stopped state. Processes can be stopped by the SIGSTOP signal. Stopped processes can be restarted by the SIGCONT signal. If a process is stopped, all its threads are in the stopped state.

Swapped processes

If a process is swapped, it means that another process is running, and the process, or any of its threads, cannot run until the scheduler makes it active again.
Process states -- continued

Zombie process

When a process terminates with an exit system call, it first goes into the zombie state; such processes have most of their resources freed. However, a small part of the process remains, such as the exit value that the parent process uses to determine why the child process died. If the parent process issues a wait system call, the exit status is returned to the parent, the remaining resources of the child process are freed, and the process ceases to exist. The slot can then be used by another newly created process. If the parent process no longer exists when a child process exits, the init process frees the remaining resources held by the child. Sometimes we can see a zombie staying in the process list for a longer time; one example of this situation could be that a process exited, but the parent process is busy or waiting in the kernel and unable to read the return code.
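One common way to keep children from lingering as zombies is to reap them from a SIGCHLD handler; the following minimal sketch uses only standard calls:

    #include <signal.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    /* Collect every exited child so its process table slot is freed. */
    static void reap(int sig)
    {
        while (waitpid(-1, NULL, WNOHANG) > 0)
            ;
    }

    int main(void)
    {
        signal(SIGCHLD, reap);             /* acknowledge child deaths      */
        if (fork() == 0)
            _exit(0);                      /* child terminates immediately  */
        sleep(1);                          /* give the handler time to run  */
        return 0;
    }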

State transitions for AIX processes

The illustration shows how a process is started with a fork() system call and turns into an active process, and how an active process can change between the swapped, active and stopped states. A terminating process becomes a zombie until the entire process is removed.

[Figure: process state transitions - Idle, via fork(), becomes Active; Active moves to and from Swapped and Stopped; Active becomes Zombie, then Non-existing]

Kernel Processes

Kernel processes (kprocs)

Kernel processes:
• Are created by the kernel.
• Have a private u-area/kernel stack.
• Share "text" and data with the rest of the kernel.
• Are not affected by signals.
• Cannot use shared library object code or other user protection domain code.
• Run in the Kernel Protection Domain.

Some processes in the system are kernel processes. Kernel processes are created by the kernel itself and execute independently of user processes. Even though a kernel process shows up in the process table, through "Berkeley" ps, it is part of the kernel. The scheduler is one example of a kernel process. Kernel processes are scheduled like user processes, but tend to have higher priorities. Kernel processes can have multiple threads, as can user processes.

Thread Fundamentals

Thread definition

Like a process, a thread can be defined by separate components. A thread consists of:
• A thread table entry
• A thread ID (TID)

Processes and threads

• A process holds the address space.
• A thread holds the execution context.
• Multiple threads can run within one process.
  - One CPU can run one thread at a time; on SMP systems, threads can run truly concurrently.

Threads

• Threads allow multiple execution units to share the same address space.
• The thread is the fundamental unit of execution.
• A thread has an ID (TID), just as a process has an ID (PID).
• A thread is an independent flow of control within a process.
• In a multi-threaded process, each thread can execute different code concurrently.
• Managing threads needs fewer resources than managing processes.
• Inter-thread communication is more efficient than inter-process communication, especially because variables can be shared.

Threads share data and address space

Threads reduce the need for IPC operations, because they allow multiple execution units to share the same address space, and thereby easily share data. On the other hand, this adds complexity and risk to the programming: for example, synchronization and locking have to be controlled by the threads themselves.

Threads are the unit of execution

The thread is the fundamental unit of execution, and the scheduler and dispatcher work only with threads. Therefore, every process has at least one thread.
Thread Fundamentals -- continued

Thread IDs (TID) and Process IDs (PID)

TIDs are listed for all threads in the thread table; TIDs are always odd. PIDs are listed for all processes in the process table; PIDs are always even, except for the init process, where PID = 1. Threads represent independent flows within a process; the system does not provide synchronization, so the control must be in the thread itself. In a multi-threaded process, each thread can execute different code concurrently, controlled by the program paths. One of the main reasons for using threads is that managing threads requires fewer resources than managing processes. Inter-thread communication is more efficient than inter-process communication.

AIX Thread AIX Threads

• A thread is an independent flow of control that operates within the same address space as other independent flows of controls within a process. In other operating systems, threads are sometimes called "lightweight processes," or the meaning of the word "thread" is sometimes slightly different. • Multiple threads of control allow an application to overlap operations such as reading from a terminal or writing to a disk file. This also allows an application to service requests from multiple users at the same time. • Multiple threads of control within a single process are required by application developers to be able to provide these capabilities without the overhead of multiple processes. • Multiple threads of control within a single process allow application developers to exploit the throughput of multiprocessor (MP) hardware.

TID format

Thread IDs have the following format for 32-bit kernels:

    bits 31-24: zero
    bits 23-8 : INDEX
    bits 7-1  : COUNT
    bit 0     : always 1 (TIDs are odd)

For 64-bit kernels the TID is 64 bits wide:

    bits 63-56: zero
    bits 55-8 : INDEX
    bits 7-1  : COUNT
    bit 0     : always 1 (TIDs are odd)

• INDEX identifies the entry in the thread table corresponding to the designated TID (thread[INDEX]).
• COUNT is a generation count that is intended to avoid the rapid reallocation of TIDs. When a new TID is to be allocated, its value is calculated from the first available thread table entry. Slots are recycled.
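A minimal sketch of decoding a TID under the layout above, treating the whole low-order byte as the reported COUNT; this matches the kdb example below, where TID 0x2143 yields slot 33 and COUNT 0x43.

    #include <stdio.h>
    #include <stdint.h>

    /* Illustrative decode of the TID layout described above. */
    static void decode_tid(uint64_t tid)
    {
        uint64_t index = tid >> 8;    /* thread table slot: thread[INDEX]    */
        uint64_t count = tid & 0xff;  /* generation count; bit 0 is always 1 */

        printf("TID 0x%llx -> INDEX %llu, COUNT 0x%llx\n",
               (unsigned long long)tid,
               (unsigned long long)index,
               (unsigned long long)count);
    }

    int main(void)
    {
        decode_tid(0x2143);   /* the sendmail example: slot 33, count 0x43 */
        return 0;
    }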


TID format listed with kdb

The following is a 64-bit slot in the thread table listed with kdb. The TID is 0x002143: the INDEX is 0x21 and the COUNT is 0x43. 0x21 is 33 decimal; according to the figure, this is the slot number in the thread table, and the value is listed in the output from kdb.

    (0)> thread 33
    SLOT NAME       STATE   TID     PRI  RQ  CPUID  CL  WCHAN
    pvthread+001080
      33 sendmail   SLEEP   002143  03C  0          0

If we look in memory at address pvthread+001080 we can see the 64-bit TID structure:

    (0)> d pvthread+001080
    pvthread+001080: 0000 0000  0000 2143  0000 0000  0000 0000
    (0)>


Thread Concepts

• An application is said to be thread safe when multiple threads in a process can run the application successfully without data corruption.
• A library is thread safe when multiple threads can be running a routine in that library without data corruption (another word for this is reentrant).
• A kernel thread is a thread of control managed by the kernel.
• A user thread is a thread of control managed by the application.
• User threads are attached to kernel threads to gain access to system services.
• In a multi-threaded system such as AIX:
  - The process is the swappable entity.
  - The thread is the schedulable entity.

Thread mapping models

• User threads are mapped to kernel threads by the threads library. The way this mapping is done is called the thread model. There are three possible thread models, corresponding to three different ways to map user threads to kernel threads:
  - M:1 model
  - 1:1 model
  - M:N model

• The AIX Version 4.1 and later threads support is based on the OSF/1 libpthreads implementation. It supports what is referred to as the 1:1 model. This means that for every thread visible in an application, there is a corresponding kernel thread. Architecturally, it is possible to have an M:N libpthreads model, where "M" user threads are multiplexed on "N" kernel threads. This is supported in AIX 4.3.1 and AIX 5L.
• The mapping of user threads to kernel threads is done using virtual processors. A virtual processor (VP) is a library entity that is usually implicit. For a user thread, the virtual processor behaves as a CPU for a kernel thread. In the library, the virtual processor is a kernel thread or a structure bound to a kernel thread.
• The libpthreads implementation is provided so that application developers can develop portable multi-threaded applications. The libpthreads.a library has been written to the POSIX 1003.4a Draft 10 specification in AIX 4.3; previous versions of AIX support the POSIX 1003.4a Draft 7 specification. libpthreads is a linkable user library that provides user-space threads services to an application. libpthreads_compat.a provides the POSIX 1003.4a Draft 7 pthreads model on AIX 4.3.
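Whether a pthread is bound to its own kernel thread (as in the 1:1 model) or left unbound on the library's pool of VPs (as in M:N) can be requested through the portable POSIX contention-scope attribute. A minimal sketch follows; the worker function is illustrative.

    #include <pthread.h>
    #include <stdio.h>

    static void *worker(void *arg)
    {
        (void)arg;
        puts("worker running");
        return NULL;
    }

    int main(void)
    {
        pthread_attr_t attr;
        pthread_t tid;

        pthread_attr_init(&attr);
        /* PTHREAD_SCOPE_SYSTEM requests a bound thread: one user thread on
         * its own kernel thread, as in the 1:1 model.  PTHREAD_SCOPE_PROCESS
         * requests an unbound thread that shares the library's pool of
         * virtual processors, as in the M:N model. */
        pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);

        pthread_create(&tid, &attr, worker, NULL);
        pthread_join(tid, NULL);
        pthread_attr_destroy(&attr);
        return 0;
    }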


Threads Models

M:1 threads model

In the M:1 model, all user threads are mapped to one kernel thread, and all user threads run on one VP. The mapping is handled by a library scheduler. All user threads programming facilities are completely handled by the library. This model can be used on any system, especially on traditional single-threaded systems.

[Figure: M:1 Threads Model. The threads library maps all user threads through a library scheduler onto a single VP, which is backed by one kernel thread.]


1:1 threads model

In the 1:1 model, each user thread is mapped to one kernel thread and each user thread runs on one VP. Most of the user threads programming facilities are directly handled by the kernel threads.

[Figure: 1:1 Threads Model. The threads library maps each user thread onto its own VP, each backed by one kernel thread.]


M:N threads model

In the M:N model, all user threads are mapped to a pool of kernel threads, and all user threads run on a pool of virtual processors. A user thread may be bound to a specific VP, as in the 1:1 model. All unbound user threads share the remaining VPs. This is the most efficient and most complex thread model; the user threads programming facilities are shared between the threads library and the kernel threads.

[Figure: M:N Threads Model. The library scheduler multiplexes the user threads onto a pool of VPs, each backed by a kernel thread.]


Thread states

In AIX, the kernel allows many threads to run at the same time, but only one thread can be executing on each CPU at a time. The thread state is kept in t_state in the thread table (for detailed information, look in the /usr/include/sys/thread.h file). Each thread can be in one of the following states:
• Idle
• Ready to run
• Running
• Sleeping
• Stopped
• Swapped
• Zombie

Idle state

When processes and threads are being created, they are first in the idle state. This state is temporary until the fork mechanism is able to allocate all of the necessary resources for the creation and fill in the thread table for a new thread.

Ready to run

Once the new thread's creation is completed, it is placed in the ready to run state. The thread waits in this state until it is dispatched. When the thread is running, it continues to run until it has used its time slice, gives up the CPU, or is preempted by a higher priority thread.

Running thread

A thread in the running state is the thread executing on the CPU. The thread's state will alternate between running and ready to run until the thread finishes execution; the thread then goes to the zombie state.

Sleeping

Whenever the thread is waiting for an event, the thread is said to be sleeping.


Stopped

A stopped thread is a thread stopped by the SIGSTOP signal. Stopped threads can be restarted by the SIGCONT signal.
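A minimal sketch of moving a process (and with it, its threads) through the stopped state from the outside, using the standard kill(2) interface; the child here simply sleeps in a loop.

    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();

        if (pid == 0) {                   /* child: idle until killed */
            for (;;)
                sleep(1);
        }
        sleep(1);
        kill(pid, SIGSTOP);               /* child's threads enter the stopped state */
        printf("stopped %d\n", (int)pid);
        kill(pid, SIGCONT);               /* back to ready to run */
        printf("continued %d\n", (int)pid);
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
        return 0;
    }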

Swapped

Though swapping takes place at the process level and all threads of a process are swapped at the same time, the thread table is updated whenever the thread is swapped.

Zombie

The zombie state is an intermediate state for the thread, lasting only until all resources owned by the thread are given up.

State transitions for AIX threads

The illustration shows the states for AIX threads. Threads typically alternate between running, ready to run, sleeping, and stopped during their lifetime.

[Figure: State transitions for AIX threads. fork() takes a thread from non-existing to being created and then to ready to run; a ready thread is dispatched to running; a running thread may move to sleeping, stopped, or swapped and back to ready to run; on exit the thread becomes a zombie.]


Thread Management

Thread / process relationship

• The diagram below shows how the process shares most of its data among the threads. Each thread has its own copy of the registers, its own kernel thread data, and therefore a private stack. Data can thus be passed between threads via global variables.
• A conventional single-threaded UNIX process can only harm itself (if incorrectly coded).
• All threads in a process share the same address space, so in an incorrectly coded program, one thread can damage the stack and data areas associated with other threads in that process.
• Except for such areas as explicitly shared memory segments, a process cannot directly affect other processes.
• There is some kernel data that is shared between the threads, but the kernel also maintains thread-specific data.
• Per-process data that is needed even when the process is swapped out is kept in the pvproc structure. The pvproc structure is pinned.
• Per-process data that is needed only when the process is swapped in is kept in the user structure.
• Per-thread data that is needed even when the process is swapped out is kept in the pvthread structure. The pvthread structure is pinned.
• Per-thread data that is needed only when the process is swapped in is kept in the uthread structure.

[Figure: Data placement overview. Each thread has its own registers, stack, and kernel thread data; the code, program data, BSS, and kernel process data are shared by all threads of the process.]


Process swapping


Thread Scheduling

Thread scheduling

Scheduling and dispatching is the ability to assign CPU time to the threads in the system in an efficient and fair way. The problem is to design the system to handle many simultaneous threads and at the same time remain responsive to events.

Clock ticks and time slices

The division of time among the threads on an AIX system relies on clock ticks. Every 1/100 of a second, that is, 100 times a second, the dispatcher is called and does the following:
• Increases the running tick counter for the running thread.
• Scans the run queues for the thread with the highest priority.
• Dispatches the most favored thread.
Once every second, the scheduler wakes up and recalculates the priority of all threads.

Thread priority

• AIX priority has 128 levels (0-127), called run queue levels.
• The higher the run queue level, the lower the priority.
• Priority 127 can only be used by the wait process.
• User processes can have their priority changed by -20 to +20 levels (renice).
• User processes are in the range 40-80.
• A clock tick interrupt lowers a thread's priority.
• The scheduler (swapper) raises thread priority.
The priority is based on the basic priority level, the initial nice value, the renice value, and a penalty.

[Figure: Priority composition. Base priority (default 40) + nice value (default 20) + renice value (-20 to +20) + a penalty based on runtime; the higher the resulting value, the lower the priority.]
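As a worked example using the defaults shown above: a freshly started user thread runs at priority 40 + 20 = 60. As it accumulates CPU ticks, the penalty term pushes the value upward, toward less favored levels, until other threads win the CPU. Renicing it by +20 would start it at 40 + 20 + 20 = 80, the least favored end of the normal user range.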


Thread dispatching

• The dispatcher chooses the highest priority thread to execute.
• Threads are the dispatchable unit for the AIX scheduler.
• Each thread has its own priority (0-127) and scheduling algorithm.
• There are three scheduling algorithms:
  - SCHED_RR (Round Robin)
  - SCHED_FIFO
  - SCHED_OTHER

SCHED_RR threads scheduling algorithm
  - A Round Robin scheduling mechanism in which the thread is time-sliced at fixed priority.
  - This scheme is similar to creating a fixed-priority, real-time process.
  - The thread must have root authority to be able to use this scheduling mechanism.

SCHED_FIFO threads scheduling algorithm
  - A non-preemptive scheduling scheme.
  - The thread runs at fixed priority and is not time-sliced.
  - It is allowed to run on a processor until it voluntarily relinquishes the processor by blocking or yielding.
  - A thread using SCHED_FIFO must also have root authority.
  - It is possible to create a SCHED_FIFO thread with a priority high enough that it could monopolize the processor.

SCHED_OTHER threads scheduling algorithm
  - The default AIX scheduling.
  - Priority degrades with CPU usage.
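A minimal sketch of requesting one of these algorithms for a new thread through the standard POSIX attribute calls (the priority value is illustrative; as noted above, SCHED_RR and SCHED_FIFO require root authority, so pthread_create may fail for an ordinary user):

    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static void *worker(void *arg)
    {
        (void)arg;
        puts("fixed-priority worker running");
        return NULL;
    }

    int main(void)
    {
        pthread_attr_t attr;
        struct sched_param sp;
        pthread_t tid;
        int rc;

        sp.sched_priority = 50;                        /* illustrative value */
        pthread_attr_init(&attr);
        /* Without explicit scheduling, the policy set below would be
         * ignored and the thread would inherit its creator's policy. */
        pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
        pthread_attr_setschedpolicy(&attr, SCHED_RR);  /* or SCHED_FIFO, SCHED_OTHER */
        pthread_attr_setschedparam(&attr, &sp);

        rc = pthread_create(&tid, &attr, worker, NULL);
        if (rc != 0)
            fprintf(stderr, "pthread_create failed: %d\n", rc);
        else
            pthread_join(tid, NULL);
        return 0;
    }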


Process and Thread Scheduling

Like most UNIX systems, AIX uses a multilevel round-robin model for process and thread scheduling. Processes and threads at the same priority level are linked together and placed on a run queue. AIX has 128 run queues, 0-127, each representing one of the 128 possible priorities. When a process starts running, it is given a priority based on the nice value, and the process is linked with other processes at the same level. As the process runs and consumes CPU resources, its priority decreases until it finishes, or until the priority is so low that other processes get CPU time. If a process does not run, its priority increases until it can get CPU time again. The drawing below illustrates the 128 run queue levels and six processes: three at priority 60 and three at 70.

[Figure: The 128 run queue levels, 0-127. Three processes are queued at level 60 and three at level 70; the idle process sits at level 127.]


Thread scheduling algorithm

The scheduler uses the following algorithm to calculate priorities for the running processes.
For every clock tick (1/100 sec.):
• The running thread is charged for one tick.
• The dispatcher is called, scans the run queues, and dispatches the thread with the highest priority.
The scheduler runs every second:
• It calculates a new priority for all threads.
• For each thread, the number of used ticks is set to (used ticks) * d/32, where 0 ≤ d ≤ 32 is a tunable decay factor.
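To illustrate the decay step, assuming the usual schedtune default of d = 16 (an assumption; the default value is not stated in the text above): a thread that had accumulated 100 ticks is left with 100 * 16/32 = 50 ticks after one scheduler pass, so its penalty, and with it its priority value, falls by half each second it stays off the CPU.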

4. The thread structure displayed will be the thread for the running iadb process. Look for the field labeled t_procp; this will contain a pointer to the proc structure. Examine this address. What region is this address in?

5. Look for the field labeled userp; this will contain a pointer to the thread's user area. Examine this address. What region is this address in?

6. Of the two addresses you examined, which one is in the process's private region?


Unit 10. IA-64 Linkage Convention


Unit 11. LVM

Lesson Objectives

At the end of the module the student should have gained knowledge about:
• An overview of the LVM, and the LVM components, such as:
  - Logical volume
  - Physical volume
  - Mirroring, and parameters for mirroring
  - Striping, and parameters for striping
• Physical disk layout on Power
• Physical disk layout on IA-64
• LVM physical layout, including the VGDA and VGSA
• The function of LVM Passive Mirror Write Consistency
• The function of the LVM hot spare disk
• The function of LVM hot spot management
• The function of LVM online backup (4.3.3)
• The function of the LVM variable logical track group (LTG)
• The function of each of the high-level LVM commands
• Tracing LVM commands with the trace command
• The function of the LVM library calls
• Disk device calls, briefly
• Disk low-level device calls, such as SCSI calls and SSA, briefly

Furthermore, it is an objective that the student gains experience from exercises with the content of this section. The exercises will:
• Examine the physical disk layout of a logical volume and a physical volume.
• Examine the impact of LVM Passive Mirror Write Consistency.
• Examine the function of the LVM LTG.
• Trace some LVM system activity.

Platform

This lesson is independent of platform.


References

http://w3.austin.ibm.com/:/projects/tteduc/ (Technology Transfer Home Page)


Logical Volume Manager overview

Introduction

The Logical Volume Manager (LVM) is the layer between the operating system (AIX) and the physical hard drives; the LVM provides reliable data storage (logical volumes) to the OS. The LVM makes use of the underlying physical storage, but hides the actual physical drives and drive layout. This section explains how this is done, how the data can be traced, and which parameters impact performance in different scenarios.

Physical volume

A hierarchy of structures is used to manage fixed-disk storage. Each individual fixed-disk drive, called a physical volume (PV), has a name, such as /dev/hdisk0. Every physical volume in use belongs to a volume group (VG). All of the physical volumes in a volume group are divided into physical partitions (PPs) of the same size (by default 2 MB in volume groups that include physical volumes smaller than 300 MB, 4 MB otherwise). For space-allocation purposes, each physical volume is divided into five regions (outer_edge, inner_edge, outer_middle, inner_middle, and center). The number of physical partitions in each region varies, depending on the total capacity of the disk drive. Within each volume group, one or more logical volumes (LVs) are defined.

Logical volume

Logical volumes are groups of information located on physical volumes. Data on logical volumes appears to be contiguous to the user but can be discontiguous on the physical volume. This allows file systems, paging space, and other logical volumes to be resized or relocated, span multiple physical volumes, and have their contents replicated for greater flexibility and availability in the storage of data. Each logical volume consists of one or more logical partitions (LPs). Each logical partition corresponds to at least one physical partition. If mirroring is specified for the logical volume, additional physical partitions are allocated to store the additional copies of each logical partition.


Physical disks

A disk must be designated as a physical volume and be put into an available state before AIX can assign it to a volume group. A physical volume has certain configuration and identification information written on it. This information includes a physical volume identifier and, for IA-64, partition information for the disk. When a disk becomes a physical volume, it is divided into 512-byte physical blocks. The first time you start up the system after connecting a new disk, AIX detects the disk and examines it to see if it already has a unique physical volume identifier in its boot record. If it does, the disk is designated as a physical volume, and a physical volume name (typically hdiskx, where x is a unique number on the system) is permanently associated with that disk until you undefine it.

Volume groups

The physical volume must become part of a volume group before it can be utilized by the LVM. A volume group is a collection of 1 to 32 physical volumes of varying sizes and types. A physical volume may belong to only one volume group. By default the system will allow you to define up to 256 logical volumes per volume group, but the actual number you can define depends on the total amount of physical storage defined for that volume group and the size of the logical volumes you define. There can be up to 255 volume groups per system. A VG that is created with standard physical and logical volume limits can be converted to big format, which can hold up to 128 PVs and up to 512 LVs. This operation requires that there be enough free partitions on every PV in the VG for the Volume Group Descriptor Area (VGDA) expansion.

MAXPVS: 32 (128 big VG)
MAXLVS: 255 (512 big VG)

Logical storage management limits:
• Volume groups: 255 per system
• Physical volumes: (MAXPVS / volume group factor) per volume group
• Physical partitions: (1016 x volume group factor) per physical volume; volume group factor = 1, 2, 4, 8, 16, 32, 64, 128, or 256
• Logical volumes: MAXLVS per volume group
• Logical partitions: (MAXPVS * 1016) per logical volume


Physical partitions (PP)

In the design of the LVM, each logical partition maps to one physical partition, and each physical partition maps to a number of disk sectors. The design of the LVM limits the number of physical partitions that the LVM can track per disk to 1016. In most cases, not all of the possible 1016 tracking partitions are used by a disk. The default size of each physical partition during a "mkvg" command is 4 MB, which implies that individual disks up to 4 GB can be included in a volume group. If a disk larger than 4 GB is added to a volume group (based on usage of the 4 MB physical partition size), the disk addition will fail with a warning message that the physical partition size needs to be increased. There are two instances where this limitation is enforced. The first case is when the user tries to use "mkvg" to create a volume group where the number of physical partitions on one of the disks in the volume group would exceed 1016. In this case, the user must pick from the available physical partition size ranges of 1, 2, (4), 8, 16, 32, 64, 128, and 256 megabytes and use the "-s" option of "mkvg". The second case is where a disk which violates the 1016 limitation attempts to join a pre-existing volume group with the "extendvg" command. The user can either recreate the volume group with a larger physical partition size (which will allow the new disk to work within the 1016 limitation) or create a stand-alone volume group (with a larger physical partition size) for the new disks.
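As a worked example (the disk size is chosen purely for illustration): a 9.1 GB drive at the default 4 MB partition size would need about 2275 physical partitions (9100 / 4), well over the 1016 limit, so "mkvg" would reject it. Creating the volume group with "mkvg -s 16" instead gives 16 MB partitions, about 569 per disk (9100 / 16), which fits within the limit.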


Device drivers, hierarchy, and interface to LVM devices

The figure shows the interfaces to the LVM at different layers. Starting from the top, the file system (JFS or J2) uses the LVM DD API to access LVs; the LVM DD uses the disk DD to access the physical disk, which is handled by the SCSI DD or the SSA DD, depending on the type of disk. There are also interfaces and commands to manipulate the LVM system. The high-level commands, such as mklv, are complex commands written as shell scripts. These scripts use basic LVM commands, such as lcreatelv, which are AIX binaries that perform the operations. The basic commands are written in C and use the LVM API, liblvm.a, to access the LVM.

[Figure: LVM interfaces. JFS and the high-level commands sit on top; the high-level command scripts call the basic LVM commands and liblvm.a; JFS and the library go through the LVM DD, which calls the disk DD, which in turn uses the SCSI DD or the SSA DD.]


VGDA description

The VGDA is an area at the front of each disk which contains information about the volume group, the logical volumes that reside on the volume group, and the disks that make up the volume group. For each disk in a volume group, there exists a VGDA concerning that volume group. This VGDA area is also used in quorum voting. The VGDA contains information about what other disks make up the volume group. This information is what allows the user to specify just one of the disks in the volume group when using the "importvg" command to import a volume group into an AIX system. importvg will go to that disk, read the VGDA, find out what other disks (by PVID) make up the volume group, and automatically import those disks into the system. The information about neighboring disks can sometimes be useful in data recovery. For the logical volumes that exist on a disk, the VGDA gives information about each logical volume, so any time some change is made to the status of a logical volume (creation, extension, or deletion), the VGDA on that disk and on the others in the volume group must be updated. The VGDA space, which allows for 32 disks, is a fixed size that is part of the LVM design. Large disks require more management mapping space in the VGDA, which reduces the number and size of disks that can be added to the existing volume group. When a disk is added to a volume group, not only does the new disk get a copy of the updated VGDA, but, as mentioned before, all existing drives in the volume group must be able to accept the new, updated VGDA.

VGSA description

The Volume Group Status Area (VGSA) records information on stale partitions for mirroring. The VGSA comprises 127 bytes, whose bits (127 x 8 = 1016) represent, one bit each, the up to 1016 physical partitions that reside on each disk. The bits of the VGSA are used as a quick bit-mask to determine which physical partitions, if any, have become stale. This is only important in the case of mirroring, where there exists more than one copy of the physical partition. Stale partitions are flagged by the VGSA. Unlike the VGDA, the VGSAs are specific only to the drives on which they exist. They do not contain information about the status of partitions on other drives in the same volume group. The VGSA is also used to determine which physical partitions must undergo data resyncing when mirror copy resolution is performed.


Big VGDA Volume Group design (BigVG), implemented in AIX 4.3.2

The original design of the VGDA and VGSA limits the number of disks that can be added to a volume group to 32, and the total number of logical volumes to 256 (including one reserved for LVM internal use). With the proliferation of disk arrays, the need for increased capacity in a single volume group is growing. This section describes the requirements for the new big Volume Group Descriptor Area and Volume Group Status Area, hereafter referred to as the VGDA and VGSA.

Objectives:
• Increase the maximum number of disks per VG from 32 to 128.
• Increase the maximum number of logical volumes per VG to 512.
• Provide a migration path from small VG to big VG.

Changes in commands:
• mkvg
  - The -B option is added to create big VGs.
  - -t: if the t flag (factor value) is not used, the default limit of 1016 physical partitions per physical volume is set. Using the factor value changes the limit to 1016 * factor physical partitions per disk and 64/factor disks per VG. A BigVG cannot be imported/activated on systems running pre-AIX 4.3.2 versions.
• chvg
  - The -B option is added to convert a small VG to the bigVG format. This operation expands the VGDA/VGSA and changes the total number of disks that can be added to the volume group from 1-32 to 64. Once converted, these volume groups cannot be imported/activated on systems running pre-AIX 4.3.2 versions. If both the t and B flags are specified, the factor is updated first and then the VG is converted to bigVG format (a sequential operation).
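For illustration (the volume group and disk names are hypothetical; the -B and -t flags are as described above): "mkvg -B -t 2 hdisk9" would create a big-format volume group whose disks may each hold up to 2032 partitions (1016 * 2), and "chvg -B datavg" would convert an existing small volume group to the big format, provided every PV has enough free partitions for the VGDA expansion.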


LVM Flexibility

The LVM offers great flexibility for the system administrator and users, such as:
• Real-time volume group and logical volume expansion/deletion
• Ability to customize data integrity checks
• Use of a logical volume under a file system
• Use of a logical volume as raw data storage
• User-customized logical volumes

Real-time Volume Group and Logical Volume expansion / deletion

Typical UNIX operating systems have static file systems that require the archiving, deletion, and recreation of larger file systems in order for an existing file system to expand. LVM allows the user to add disks to the system without bringing the system down and allows the real-time expansion of the file system through the use of the logical volume. All file systems exist on top of logical volumes. However, logical volumes can exist without the presence of a file system. When a file system is created, the system first creates a logical volume, then places the journaled file system (jfs) "layer" on top of that logical volume. When a file system is expanded, the logical volume associated with that file system is first "grown", then the jfs is "stretched" to match the grown logical volume.

Ability to customize data integrity checks

The user has the ability to control which levels of data integrity checks are placed in the LVM code in order to tune the system performance. The user can change the mirror write consistency check, create mirroring, and change the requirement for quorum in a volume group.

Use of Logical Volume under a file system

The logical volume is a logical to physical entity which allows the mapping of data. The jfs maps files defined in its file system in its own logical way and then translates file actions to a logical request. This logical request is sent to the LVM device driver, which converts this logical request into a physical request. When the LVM device driver sends this physical request to the disk device driver, it is further translated into another physical mapping. At this level, LVM does not care about where the data is truly located on the disk platter. But with this logical to physical abstraction, LVM provides for the easy expansion of a file system, ease in mirroring data for a file system, and the performance improvement of file access in certain LVM configurations.


Use of Logical Volumes as raw data storage

As stated before, a logical volume can exist without a jfs file system to hold data. Typically, database programs use the "raw" logical volume as a data "device" or "disk". They use LVM logical volumes (rather than the raw disk itself) because the LVM allows them to control on which disks the data resides, gives them the flexibility to add disks and "grow" the logical volume, and provides data integrity through the logical volume mirroring capability.

User customized logical volumes

The user can create logical volumes, using a map file, that will allow them to specify the exact disk(s) the logical volume will inhabit and the exact order on the disk(s) that the logical volume will be created in. This ability allows the user to tune the creation of their logical volumes for performance cases.

Write Verify LVM setting

There is a capability in the LVM to specify that an extra level of data integrity be assured every time you write data to the disk. This ability is known as write verify. This capability is given to each logical volume in a volume group. When you have write verify enabled, every write to a physical portion of a disk that is part of a logical volume causes the disk device driver to issue the Write and Verify SCSI command to the disk. This means that after each write, the disk will reread the data and do an IOCC parity check on the data to see if what the platter wrote exactly matches what the write request buffer contained. This type of extra check understandably adds more time to the completion of a write request, but it adds to the integrity of the system.
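For illustration (this assumes the write-verify flag of the chlv command; the logical volume name is hypothetical): "chlv -v y datalv" would turn write verify on for the logical volume datalv, and "chlv -v n datalv" would turn it off again.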


Quorum checking for LVM volume groups

Quorum checking is the voting that goes on between disks in a volume group to determine whether a majority of disks exists to form a quorum that will allow the disks in a volume group to become and stay activated. The LVM runs many of its commands and strategies based on having the most current copy of some data; thus, it needs a method to compare data on two or more disks and figure out which one contains the most current information. This need gives rise to the quorum. If a quorum cannot be found during a varyonvg command, the volume group will not vary on. Additionally, if a disk dies during normal operation and the loss of the disk causes volume group quorum to be lost, the volume group will notify the user that it is ceasing to allow any more disk i/o to the remaining disks, and it enforces this by performing a self varyoffvg. However, the user can turn off this quorum check and its actions by telling the LVM that it always wants to vary on or stay up regardless of the dependability of the system. Or, the user can force the varyon of a volume group that does not have quorum. At that point, the user is responsible for any strange behavior from that volume group.


Data Integrity and LVM Mirroring

Mirroring, and parameters for mirroring

When discussing mirrors in the LVM, it is easier to refer to each copy, regardless of when it was created, as a copy. The exception to this is when one discusses sequential mirroring: in sequential mirroring, there is a distinct PRIMARY copy and SECONDARY copies. However, the majority of mirrors created on AIX systems are of the parallel type. In parallel mode, there is no PRIMARY or SECONDARY mirror; all copies in a mirrored set are just referred to as copies, regardless of which one was created first. Since the user can remove any copy from any disk at any time, there can be no ordering of copies. AIX allows up to three copies of a logical volume, and the copies may be in sequential or parallel arrangements. Mirrors improve the data integrity of a system by providing more than one source of identical data. With multiple copies of a logical volume, if one copy cannot provide the data, one or two secondary copies may be accessed to provide the desired data.

Staleness of Mirrors

The idea of a mirror is to provide an alternate, physical copy of information. If one of the copies has become unavailable, usually due to disk failure, then we refer to that copy of the mirror as going "stale". Staleness is determined by the LVM device driver when a request to the disk device driver returns with a certain type of error. When this occurs, the LVM device driver notifies the VGSA of a disk that a particular physical partition on that disk is stale. This information will prevent further reads or writes from being issued to physical partitions defined as stale by the VGSA of that disk. Additionally, when the disk once again becomes available (suppose it had been turned off accidentally), the synchronization code knows exactly which physical partitions must be updated, instead of defaulting to the update of the entire disk. Certain high-level commands will display the physical partitions and their stale condition so that the user can realize which disks may be experiencing a physical failure.


Sequential Mirroring

Sequential mirroring is based on the concept of an order within mirrors. All read and write requests first go through a PRIMARY copy, which services the request. If the request is a write, the write request is propagated sequentially to the SECONDARY drives. Once the secondary drives have serviced the same write request, the LVM device driver considers the write request complete.

Parallel Mirroring

In Parallel mirroring, all copies are of equal ordering. Thus, when a read request arrives to the LVM, there is no first or favorite copy that is accessed for the read. A search is done on the request queues for the drives which contain the mirror physical partition that is required. The drive that has the fewest requests is picked as the disk drive which will service the read request. On write requests, the LVM driver will broadcast to all drives which have a copy of the physical partition that needs updating. Only when all write requests return will the write be considered complete and the write-complete message will be returned to the calling program.

[Figure: Parallel mirrored write. The write request is issued to disk 1, disk 2, and disk 3 at the same time; the write completes only when all three disks have returned their write acknowledgments.]


Mirror Write Consistency Check

Mirror Write Consistency Check (MWCC) is a method of tracking the last 62 writes to a mirrored logical volume. If the AIX system crashes, upon reboot the last 62 writes to mirrors are examined and one of the mirrors is used as a "source" to synchronize the mirrors (based on the last 62 disk locations that were written). This "source" is of importance to parallel mirrored systems. In sequentially mirrored systems, the "source" is always picked to be the primary disk; if that disk fails to respond, the next disk in the sequential ordering is picked as the "source" copy. There is a chance that the mirror picked as "source" to correct the other mirrors was not the one that received the latest write before the system crashed. Thus, a write that completed on one copy but was incomplete on another mirror would be lost. AIX does not guarantee that the absolute latest write request completed before a crash will be there after the system reboots. But AIX will guarantee that the parallel mirrors are consistent with each other. If the mirrors are consistent with each other, the user will be able to realize which writes were considered successful before the system crashed and which writes will be retried. The point here is not data accuracy, but data consistency. The use of the primary mirror copy as the source disk is the basic reason that sequential mirroring is offered. Not only is data consistency guaranteed with MWCC, but the use of the primary mirror as the source disk increases the chance that all the copies have the latest write that occurred before the mirrored system crashed.

Ability to detect and correct stale mirror copies

The Volume Group Status Area (VGSA) tracks the status of 1016 physical partitions per disk per volume group. During a read or write, if the LVM device driver detects that there was a failure in fulfilling a request, the VGSA will note the physical partition(s) that failed and mark that partition(s) "stale". When a partition is marked stale, this is logged by AIX error logging and the LVM device driver will know not to send further partition data requests to that stale partition. This saves wasted time in sending i/o requests to a partition that most likely will not respond. And when this physical problem is corrected, the VGSA will tell the mirror synchronization code which partitions need to be updated to have the mirrors contain the same data.


LVM Striping

Striping and parameters for striping

Disk striping is the concept of spreading sequential data across more than one disk to improve disk i/o. The theory is that if you have data that is close together, and if you can divide the request into more than one disk i/o, you will reduce the time it takes to get the entire piece of data. This must be done so that it is transparent to the user: the user does not know which pieces of the data reside on which disk, and does not see the data until all the disk i/o has completed (in the case of a read) and the data has been reassembled. Since the LVM has the concept of a logical to physical mapping already built into its design, the concept of disk striping is an easy evolution. Striping is broken down into the "width" of a stripe and the "stripe length". The width is how many disks the sequential data should lie across. The stripe length is how many sequential bytes reside on one disk before the data jumps to another disk to continue the sequential information path.


Striping Example

We present an example to show the benefit of striping. A piece of data stored on the disk is 100 bytes, and the physical cache of the system is only 25 bytes. Thus, it takes four read requests to the same disk to complete the reading of 100 bytes. Since the data is all on the same disk, four sequential reads are required:

hdisk0: first read, bytes 0-24
hdisk0: second read, bytes 25-49
hdisk0: third read, bytes 50-74
hdisk0: fourth read, bytes 75-99

If this logical volume were created with a stripe width of 4 (how many disks) and a stripe size of 25 (how many consecutive bytes before going to the next disk), then you would see:

hdisk0: first read, bytes 0-24
hdisk1: second read, bytes 25-49
hdisk2: third read, bytes 50-74
hdisk3: fourth read, bytes 75-99

As you can see, each disk requires only one read request, and the time to gather all 100 bytes has been reduced four-fold. However, there is still the bottleneck of having the four independent data disks channel through one adapter card; this can be remedied with the expensive option of putting each disk on an independent adapter card. Note the cost of using striping: the user has now lost the use of three disks that could have been used for other volume groups.


LVM Performance

Performance with disk mirroring

Disk mirroring can improve the read performance of a system, but at a cost to the write performance. Of the two mirroring strategies, parallel and sequential, parallel is the better of the two in terms of disk i/o. In parallel mirroring, when a read request is received, the LVM device driver looks at the queued requests (read and write) and finds the disk with the least number of requests waiting to execute. This is a change from AIX 3.2, where a complex algorithm tried to approximate the disk that would be "closest" to the required data (regardless of how many jobs it had queued up). In AIX 4.1, it was decided that this complex algorithm did not significantly improve the i/o behavior of mirroring, so the complex logic was scrapped. The user can see how this new strategy of finding the shortest wait line improves the read time. And with mirroring, two independent requests to two different locations can be issued at the same time without causing disk contention, because the requests will be issued to two independent disks. However, along with the improvement to reads as a result of disk mirroring and the multiple identical sources of reads, the LVM disk driver must now perform more writes in order to complete a write request. With mirroring, all disks that make up a mirror are issued write commands, which each disk must complete, before the LVM device driver considers the write request complete.


Changeable parameters that affect LVM performance

There are a few parameters that the user can change per logical volume which affect the performance of the logical volume in terms of data access efficiency. From experience, however, many people have different views of how to achieve that efficiency, so no specific "right" recommendation can be given in these notes.

Inter-policy - This comes in two variations, min and max. The two choices tell the LVM how the user wishes the logical volume to be spread over the disks in the volume group. min tells the LVM that the logical volume should be spread over as few disks as possible. The max policy directs the LVM to spread the logical volume over as many disks as are defined in the volume group, limited by the "upper bound" variable. Some users try to use this variation to form a cheap version of disk striping on systems below AIX 4.1. However, it must be stated that the inter-policy is a "recommendation" to the allocp binary (the partition allocation routine), not a strict requirement. In certain cases, depending on what is free on a disk, these allocation policies may not be achievable.

Intra-policy - There are five regions on a disk platter defined by the intra-policy: edge, inner-edge, middle, inner-middle, and center. This policy tells the LVM the preferred location of the logical volume on the disk platter. Depending on the value also provided for the inter-policy, this preference may or may not be satisfied by the LVM. Many users have different ideas as to which portion of the disk is considered the "best", so no recommendation is given in these notes.

Mirror write consistency check - As mentioned before, the mirror write consistency check tracks the last 62 distinct writes to physical partitions. If the user turns this off, they will shorten (although slightly) the path length involved in a disk write. However, the trade-off may be inconsistent mirrors if the system crashes during a write call.

Write verify - This is turned off by default when a logical volume is created. If this value is turned on for a logical volume, additional time is spent during writes, as the IOCC check is performed for each write to the disk platter.
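For illustration (the volume group name and partition count are hypothetical; -e and -a are the standard mklv range and position flags): "mklv -e x -a c datavg 10" would request a 10-partition logical volume spread over as many disks as possible (maximum inter-policy) with its partitions preferably in the center region of each platter (intra-policy).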


Physical Connections

Mirroring on different disks - The default for disk mirroring is that the copies should exist on different disks. This is for performance as well as data integrity. With copies residing on different disks, if one disk is extremely busy, a read request can be completed by the other copy residing on a less busy disk. Although it might seem the cost would be the same for writes, the section "Command tag queuing" shows that writing to two copies on the same disk is worse than writing to two copies on separate disks.

Mirroring across different adapters - Another method to improve disk throughput is to mirror the copies across adapters. This gives you a better chance of not only finding a copy on a disk that is least busy, but also of finding an adapter that is not as busy. The LVM does not realize, nor care, whether the two disks reside on the same adapter. If the copies are on the same adapter, the bottleneck is still that of getting your data through the flow of other data coming from other devices sharing the same adapter card. With multiple adapters, the throughput through the adapter channel should improve.

Command tag queuing - This is a feature only found on SCSI-2 devices. In SCSI-1, an adapter may get many requests but will only send out one command at a time. Thus, if the SCSI device driver received three requests for i/o, it would buffer the last two requests until the first one sent had completed; it then would pick the next one in line and issue that command. Thus, the target device only receives one command at a time. With command tag queuing on SCSI-2 devices, multiple commands may be sent to the same device at once. The two device drivers (disk and SCSI adapter) are capable of determining which command returned and what to do with that command. Thus, disk i/o throughput can be improved.


Physical Placement of Logical Partitions

One important ability of the LVM is to let the user dictate where on the disk platter the logical volume should be placed. This is done with the map file that can be used with the "mklv" and "mklvcopy" commands. This map file allows the user to assign a distinct physical partition number to a distinct logical partition number. Thus, people with different theories on the optimal layout for data partitions can customize their systems according to their personal preferences.

Performance considerations with Disk Striping

Disk striping was introduced in AIX 4.1. It is another term for the RAID 0 implementation in software. This functionality is based on the assumption that large amounts of data can be more efficiently retrieved if the request is broken up into smaller requests given to multiple disks. And if the multiple disks are on multiple adapters, the theory works even better, as mentioned in the previous sections on mirroring across different disks and adapters. In the previous sections, we described the efficiency gained for mirrors. In this case, the same efficiency is gained with data spread across disks and adapters, but without mirroring. Thus there is a saving in the write case, as compared to mirrors. But there is a slight loss in the read case, as compared to mirrors, because there is no longer more than one copy to read from if one disk is busier than another.

Performance summary

To sum up the previously mentioned ideas about mirroring: if you have a system that is mainly used for reads, mirroring gives you an advantage, because there is more than one version of the same data available to satisfy a read request. The drawback is that if you require just as many writes as reads, the system must wait for all the copies' writes to complete before the single write command is considered complete. Additionally, there are two types of mirroring, parallel and sequential. Parallel is the more efficient of the two, and is the default mirroring option unless otherwise specified by the user. In parallel mirroring, the "best" disk is chosen for a read request, and write requests are issued independently to each disk that holds a copy of the data. In sequential mirroring, the same disk is always used as the first disk to be read. Thus, all reads are guaranteed to be issued to the "primary" disk (there is no "primary" in parallel mirroring), and the writes must complete in a sequential order before the write is considered complete.


Physical disk layout (Power)

AIX 4.3.3 and AIX 5 IDs

This section explores the physical disk layout on the Power platform. There are three identifiers commonly used within LVM: the Physical Volume Identifier (PVID), the Volume Group Identifier (VGID), and the Logical Volume Identifier (LVID). The last two, VGID and LVID, are closely tied: the LVID is simply a dot "." and a minor number appended to the end of the VGID. The VGID is a combination of the machine's unique processor serial number (see uname -a) and the date the volume group was created. The implementation of LVM has always assumed that the VGID of a system was made up of two 32-bit words. Throughout the code, however, the VGID/LVID is represented with the system data type struct unique_id, which is made up of four 32-bit words; the LVM library, driver, and commands have always assumed or enforced the notion that the last two words, word 3 and word 4, of this structure are zeroes. AIX 5 is changed such that all four 32-bit words are used, for a total of 128 bits, or 32 hex digits: the leading bits are copied from the processor ID and the remaining bits are a millisecond-resolution time stamp from creation time (the exact split is shown in the examples that follow).
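On a running system these IDs can be inspected directly; a sketch using the low-level lquerypv command described later in the command overview (disk name illustrative):

    # Dump 16 bytes at offset 0x80 of the disk, where the PVID is stored
    # (compare the disk dump later in this section)
    lquerypv -h /dev/hdisk0 80 10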

AIX 4.3.3 ID layout (8-byte IDs, 16 hex digits):

  PVID   Byte8 .. Byte1
  VGID   Byte8 .. Byte5 = MSB 32 bits of the processor ID (e.g. 00 09 02 77)
         Byte4 .. Byte1 = 32-bit creation time stamp
  LVID   Byte9 .. Byte2 = the VGID, followed by "." and the minor number (X)

AIX 5 ID layout (16-byte IDs, 32 hex digits):

  PVID   Byte16 .. Byte1 (only the leading bytes are non-zero; see the examples)
  VGID   Byte16 .. Byte9 = 64-bit processor ID
         Byte8 .. Byte1  = 64-bit creation time stamp
  LVID   Byte17 .. Byte1 = the VGID, followed by "." and the minor number


Example IDs from AIX 4 and AIX 5L systems, showing how IDs are constructed from the processor ID

In AIX 5 the processor ID is 64 bits; the uname function cuts out bits 33 to 47, so its result is the first word plus the last 16 bits of the second word. The LVID and VGID combine the 64-bit processor ID and a 64-bit time stamp to form an ID. PVIDs are made of the MSB 32 bits of the processor ID and 32 bits from the time stamp.

Example from an AIX 5 Power system:

PVID hdisk0: 00071483229d06620000000000000000
PVID hdisk1: 00071483b50bbaee0000000000000000
LVID hd1:    0007148300004c00000000e19f7c5aa3.8
LVID hd2:    0007148300004c00000000e19f7c5aa3.5
LVID hd3:    0007148300004c00000000e19f7c5aa3.7
LVID hd4:    0007148300004c00000000e19f7c5aa3.4
VGID rootvg: 0007148300004c00000000e19f7c5aa3
VGID testvg: 0007148300004c00000000e1b50bc8ec
uname -a:    000714834C00

In an AIX 4 system all the IDs are made of the MSB 32 bits of the processor ID and a 32-bit time stamp.

Example from an AIX 4.3.3 Power system:

PVID hdisk0: 0009027724fdbd9f
PVID hdisk1: 0009027779fe61c6
LVID hd1:    0009027724fdc36d.8
LVID hd2:    0009027724fdc36d.5
LVID hd3:    0009027724fdc36d.7
LVID hd4:    0009027724fdc36d.4
VGID rootvg: 0009027724fdc36d
VGID datavg: 000902771db64c28
uname -a:    000902774C00


Physical volume with a logical volume testlv defined

The following example shows a disk dump from sector 0 of a Power system. "Uninitialized" marks data not written by the LVM; sections holding only 00's are cut out for clarity. The IDs are those listed in the previous section.

000000 ¦ C9 C2 D4 C1 00 00 00 00 00 00 00 00 00 00 00 00 - "IBMA" in EBCDIC
000010 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

000070 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000080 ¦ 00 07 14 83 B5 0B BA EE 00 00 00 00 00 00 00 00 - PVID hdisk1
000090 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

0001F0 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000200 ¦ -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --   Uninitialized

000400 ¦ 39 C7 F2 9F 14 87 93 46 00 00 00 00 00 00 00 00
000410 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

0005E0 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

[The remainder of this dump did not survive reproduction. The recoverable
annotations identify, in order: the struct lvm_rec (defined in lvmrec.h)
with its "_LVM" magic and the VGID of testvg; a DEFECT list; a second copy
of the lvm_rec structure, again carrying the testvg VGID; and a second
DEFECT area, separated by uninitialized regions.]


[Further dump regions are garbled; the recoverable annotations mark the
VGSA (with its beginning and ending time stamps) and the VGDA (whose header
carries a time stamp and the testvg VGID).]


[Continuation of the VGDA dump, garbled in reproduction; the recoverable
fragments include the PVID of hdisk1 inside the VGDA physical volume list
and, near the end, the VGSA/VGDA time stamps and the testvg VGID.]


21A5F0 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ¦......
21A600 ¦ 74 65 73 74 6C 76 00 00 00 00 00 00 00 00 00 00 ¦testlv
21A610 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ¦......

[The surrounding dump regions are uninitialized.]


[Dump of the start of the logical volume testlv, garbled in reproduction.
The recoverable strings show the logical volume control block (LVCB): the
"AIX LVCB" magic, the file system type jfs, the logical volume name testlv,
creation and update time stamps ("Tue Sep ..."), and the label None. The
remaining regions are uninitialized.]


lvm_rec structure from the file /usr/include/lvmrec.h

The structure lvm_rec is used by the LVM routines to define the disk layout:

struct lvm_rec {            /* describes the physical volume LVM record */
    __long32_t lvm_id;      /* LVM id field which identifies whether the
                               PV is a member of a volume group */
#define LVM_LVMID 0x5F4C564D /* LVM id field of ASCII "_LVM" */
    struct unique_id vg_id; /* the id of the volume group to which this
                               physical volume belongs */
    __long32_t lvmarea_len; /* the length of the LVM reserved area */
    __long32_t vgda_len;    /* length of the volume group descriptor area */
    daddr32_t  vgda_psn[2]; /* the physical sector numbers of the beginning
                               of the VGDA copies on this disk */
    daddr32_t  reloc_psn;   /* the physical sector number of the beginning
                               of a pool of blocks (located at the end of
                               the PV) which are reserved for the relocation
                               of bad blocks */
    __long32_t reloc_len;   /* the length in sectors of the pool of bad
                               block relocation blocks */
    short int  pv_num;      /* the physical volume number within the volume
                               group of this physical volume */
    short int  pp_size;     /* the size in bytes for the partition, expressed
                               as a power of 2 (the partition size is
                               2 to the power pp_size) */
    __long32_t vgsa_len;    /* length of the volume group status area */
    daddr32_t  vgsa_psn[2]; /* the physical sector numbers of the beginning
                               of the VGSA copies on this disk */
    short int  version;     /* the version number of this VGDA and VGSA */
    short int  vg_type;
    int        ltg_shift;
    char       res1[444];   /* reserved area */
};

If we search for the string "_LVM" we can locate the above structure in the previous disk dump and assign values to the fields of struct lvm_rec:


Variable                          VALUE
#define LVM_LVMID                 0x5F4C564D
struct unique_id vg_id            0007148300004C00000000E1B50BC8EC
__long32_t lvmarea_len            00001074
__long32_t vgda_len               00000832
daddr32_t vgda_psn[2]             00000088, 000008C2
daddr32_t reloc_psn               00867C2D
__long32_t reloc_len              00000100
short int pv_num                  0001
short int pp_size                 0018
__long32_t vgsa_len               00000008
daddr32_t vgsa_psn[2]             00000080, 000008BA
int ltg_shift                     0001
char res1[444]                    Uninitialized

Note how the on-disk layout follows from these values: the first VGSA copy starts at sector 0x80 (length 8 sectors), the first VGDA copy at 0x88 (length 0x832 sectors), and the second VGSA and VGDA copies follow at 0x8BA and 0x8C2.


VGSA structure

struct vgsa_area {
#ifdef _KERNEL
    struct timestruc32_t b_tmstamp;                  /* Beginning time stamp */
#else
    struct timestruc_t   b_tmstamp;
#endif
    uint  pv_missing[(MAXPVS + (NBPI - 1)) / NBPI];  /* Bit per PV */
    uchar stalepp[MAXPVS][VGSA_BT_PV];               /* Stale PP bits */
    short factor;                                    /* for pvs with > 1016 pps */
    char  pad2[10];                                  /* Padding */
#ifdef _KERNEL
    struct timestruc32_t e_tmstamp;                  /* Ending time stamp */
#else
    struct timestruc_t   e_tmstamp;
#endif
};

struct big_vgsa_area {
#ifdef _KERNEL
    struct timestruc32_t b_tmstamp;                  /* Beginning time stamp */
#else
    struct timestruc_t   b_tmstamp;
#endif
    char  b_tmbuf64bit[24];
    uint  pv_missing[(MAX_EVER_PV + (NBPI - 1)) / NBPI]; /* Bit per PV */
    uchar stalepp[MAX_EVER_PV][VGSA_BT_PV];          /* Stale PP bits */
    short factor;                                    /* for pvs with > 1016 pps */
    short version;                                   /* vgsa version */
    char  valid[4];                                  /* Validity string "LVM" */
    char  pad2[824];                                 /* Padding */
    char  e_tmbuf64bit[24];
#ifdef _KERNEL
    struct timestruc32_t e_tmstamp;                  /* Ending time stamp */
#else
    struct timestruc_t   e_tmstamp;
#endif
};


Physical disk layout (IA-64)

Introduction to AIX 5L on IA-64 and EFI partitioned disks

IA-64 systems have a different design than Power systems; some, if not all, IA-64 systems will use the Extensible Firmware Interface (EFI). EFI defines a new disk partitioning scheme to replace the legacy DOS partitioning support. When booting from a disk device, the EFI firmware utilizes one or more system partitions containing an EFI file system (FAT32) to locate EFI applications and drivers, including the OS boot loader. These applications and drivers provide ways to extend the firmware or assist the operating system during boot time or runtime. In addition, it is expected that operating systems will define partitions unique to the operating system. EFI applications will also have the capability to display, and potentially create, additional partitions before the OS is booted.

AIX traditionally has not supported partitioned disks, because AIX was the only OS running on RS/6000 systems. Therefore the entire disk is defined by an hdisk ODM object and a /dev/hdiskn special file, with a single major and minor number assigned to the physical disk. In AIX 4.3.3, when a disk becomes a physical volume (getting a PVID), an old-style MBR (master boot record), renamed the IPL control block, which contains the PVID, is written into the first sector of the disk.

The overall design for disk partitioning in AIX 5L on IA-64 is to introduce disk partitioning at the disk driver level. An hdisk ODM object will still refer to the physical disk, but multiple special files will be created and associated with the partitions on the disk. Besides the EFI system partitions, the AIX 5L on IA-64 disk configure method will recognize IA-64 physical volume partitions. AIX 5L on IA-64 supports a maximum of 4 partitions; of these, one partition can be a physical volume partition, and the other partitions are EFI system partitions. Therefore only one AIX PV, and thus one volume group, can be defined per physical disk. A new command, efdisk, acts as a partition manager.

Special files will be created for the following partition types:
• Entire physical disk n access (used by efdisk): /dev/hdiskn_all
• System partition index y on physical disk n: /dev/hdiskn_sy
• Physical volume partition on physical disk n: /dev/hdiskn
• Unknown partition index x on physical disk n: /dev/hdiskn_px


Creating new partitions on an IA-64 system

AIX 5L on IA-64 will partition disks under the following circumstances:
• Under the direction of the user/administrator, via the efdisk command
• During BOS install, after the designation of a "boot" disk (install targets)
• When adding a disk that is not yet a physical volume to a VG
• Under the direction of the "chdev -l hdiskx -a pv=yes" command

The disk system after a default installation

After installing AIX 5L on a system with one disk, the physical drive and the /dev special files can be listed:

    # lsdev -Cc disk
    hdisk0 Available 00-19-10 Other IDE Disk Drive

    /dev/hdisk0      - hdisk0, the AIX 5L PV partition
    /dev/hdisk0_all  - the entire disk, starting at block 0
    /dev/hdisk0_s0   - EFI system partition 0 on disk 0

The EFI system partition holds hardware information and EFI firmware data. The partition is DOS (FAT) formatted and can be accessed through the DOS utilities, as in this example:

    5L-IA64:/tmp> dosdir -D/dev/hdisk0_s0
    A.OUT
    BOOT.EFI
    Free space: 33155072 bytes


Creating partitions with efdisk

After creating four partitions we can list the start block numbers and lengths with the efdisk command:

    ------------------------------------------------------
    Partition Index:    0
    Partition Type:     Physical Volume
    StartingLBA:        1        (0x1)
    Number of blocks:   819200 blocks  (0xc8000)

    Partition Index:    1
    Partition Type:     System Partition
    StartingLBA:        819201   (0xc8001)
    Number of blocks:   409600 blocks  (0x64000)

    Partition Index:    2
    Partition Type:     System Partition
    StartingLBA:        1228801  (0x12c001)
    Number of blocks:   614400 blocks  (0x96000)

    Partition Index:    3
    Partition Type:     System Partition
    StartingLBA:        1843201  (0x1c2001)
    Number of blocks:   614400 blocks  (0x96000)


Disk layout on IA-64 systems

The following disk dump lists the data in hex format; the six leftmost digits are the byte offset from the physical start of the disk, and each line lists 16 bytes. The data was read on an IBM Power system with the same utility as in the previous examples; where byte swapping is mentioned, it is relative to what the data would have been on a disk connected to an AIX Power system.

000000 ¦ C1 D4 C2 C9 00 00 00 00 00 00 00 00 00 00 00 00 - AMBI in EBCDIC = IBMA byte swapped
000010 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

0001C0 ¦ FF FF 09 FF FF FF 01 00 00 00 00 80 0C 00 00 FF - start LBA = 0x1,      length = 0xc8000
0001D0 ¦ FF FF EF FF FF FF 01 80 0C 00 00 40 06 00 00 FF - start LBA = 0x0c8001, length = 0x064000
0001E0 ¦ FF FF EF FF FF FF 01 C0 12 00 00 60 09 00 00 FF - start LBA = 0x12c001, length = 0x096000
0001F0 ¦ FF FF EF FF FF FF 01 20 1C 00 00 60 09 00 55 AA - start LBA = 0x1c2001, length = 0x096000
000200 ¦ C1 D4 C2 C9 00 00 00 00 00 00 00 00 00 00 00 00 - AMBI in EBCDIC = IBMA byte swapped
000210 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 - 0x200 = start LBA 1 = first partition

[Garbled dump region; the recoverable annotations note that the lvm_rec
structure appears here with its magic byte-swapped ("MVL_" instead of
"_LVM"), offset by 0x200 compared to the lvm structure on a Power disk,
because the data in the partition is placed as it would be at the start
of a PV.]


[Further garbled dump regions; the recoverable annotations mark the DEFECT
lists (again offset compared to Power), the VGSA beginning and ending time
stamps, the VGDA start time stamp, and the VGID of the IA-64 volume group.
The closing reference list of the PVID, LVID and VGID values for this IA-64
system did not survive reproduction.]


LVM Passive Mirror Write Consistency (AIX 5L)

The previous Mirror Write Consistency Check (MWCC) algorithm has been in place since AIX 3.1. This original design has served the Logical Volume Manager (LVM) well, but has always slowed the performance of mirrored logical volumes that performed massive and varied writes. A new design is implemented in AIX 5 to supplement the original MWCC design.

AIX 4 MWCC algorithm

The AIX 4 MWCC method uses a table called the mwc table. This table is kept in memory as well as on the disk platter. The table has 62 entries and tracks the last 62 distinct Logical Track Group (LTG) writes. An LTG is 128 kilobytes. The mwc table is only concerned with writes, not reads. The algorithm can be expressed in pseudo-code:

    if (action is a write) {
        if (LTG to be written is already in the mwc table array in memory) {
            proceed and issue the write to the mirrors
            wait until all mirrored writes complete
            return to calling process
        } else {
            update the mwc table with this latest LTG number, overwriting
                the oldest LTG entry in the mwc table (in memory)
            write the memory mwc table to the edge of the platter of all
                disks in the volume group
            wait for the mwc table writes to complete - when the mwc table
                write of the disk that holds the LTG in question returns,
                this is considered write complete of the mwc table
            issue the parallel mirror writes to all the mirrors
            wait until all mirrored writes complete and return to calling process
        }
    } else {
        process the read
    }


MWCC usage for recovery

The reason for having MWCC is recovery from a crash while I/O is proceeding on a mirrored logical volume. By implication, this means that MWCC is ignored for non-mirrored logical volumes. A key phrase is data "in flight", which means that a write has been issued to a disk and the confirmation that the action is complete has not yet come back from the disk. Thus, there is no certainty that the data did in fact get written to the disk. MWCC tracks the last 62 write orders so that upon reboot, this table can be used to rewrite the last 62 mirror writes. It is more than likely that all the writes finished before the system crash; however, LVM goes ahead and visits each of the 62 distinct LTGs, reads one copy of the mirror, and writes it to the other mirror(s) that exist. Note that MWCC does not guarantee that the absolute latest write is made available to the user. MWCC just guarantees that the images on the mirrors are consistent (identical).

AIX 4 MWCC performance implications

The current MWCC algorithm has a penalty for heavy random writes: there is a performance sag associated with doing an extra write for each write you perform. A good example, taken from a customer, is a mail server with mirrored accounts. Thousands of users were constantly writing or deleting files from their mail accounts, so the mwc table was constantly being changed and written to disk. In addition to that overhead, if the mwc table has been dispatched to be written, new requests that come into the LVM work queue are held until the mwc table write returns, so that it can be updated and once more sent down to the disk platters.

Current AIX 4 MWCC workaround.

Currently, the only way customers can work around the performance penalty associated with MWCC is to turn the functionality off. But in order to ensure data consistency, they must then run syncvg -f immediately after a system crash and reboot to synchronize the data. Since there is no mwc table on the platter, there is no way to determine which LTGs need resyncing, so a forced resync of ALL partitions is required. Omitting this synchronization may leave inconsistent data.

AIX 5 LVM Passive Mirror Write Consistency Check

The MWCC implementation in AIX 5 provides a new passive algorithm, but only for big VGs. The reason for this is that space is needed for a dirty flag for each logical volume, and only the VGSA of big VGs provides this space.


AIX 5 Passive MWCC algorithm


The new MWCC algorithm sets a flag when the mirrored LV is opened in read-write mode, and the flag is not cleared until the last close of the device. The flag is then examined during subsequent boots. The algorithm implemented is:

1. The user opens a mirrored logical volume.
2. The LVM driver marks a bit in the VGSA which states that, for purposes of passive MWCC, the LV is "dirty".
3. Reads and writes occur to the mirrored LV with no (traditional) mwc table writes.
4. The machine crashes.
5. Upon reboot, the volume group automatically varies on. As part of this varyonvg, checks are made to see whether a dirty bit exists for each LV.
6. For each logical volume that is dirty, a "syncvg -f -l " is performed, regardless of whether or not the user wants to do this.

Advantage: The behavior of a mirrored write will be the same as that of a mirrored logical volume with no MWCC. Since crashes are very rare, the need for an MWCC resync is negligible. Thus, a mostly unnecessary write (the mwc table update) is avoided.

Disadvantage: After a crash, the entire logical volume is considered dirty, although only a few blocks may have changed. Until all the partitions have been resync'ed, the logical volume will always be considered dirty while it is open. Additionally, reads will be a bit slower, as a read-then-sync operation must be performed.

Commands affected by the Passive MWCC algorithm

The varyonvg command will inform the user that a background forced sync may be occurring as part of passive MWCC recovery. The syncvg command will inform the user that a non-forced sync on a logical volume with passive MWCC will result in a forced background sync. The lslv command has been altered so that its output shows whether passive MWCC is set and active.

To set passive sync:
• mklv -w p - use the passive MWCC algorithm
• chlv -w p - use the passive MWCC algorithm
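A sketch of how this could be exercised (volume and VG names are illustrative):

    # Create a two-copy mirrored LV that uses the passive MWCC algorithm
    mklv -y datalv -c 2 -w p datavg 10

    # Or switch an existing mirrored LV to passive MWCC, then verify with lslv
    chlv -w p datalv
    lslv datalv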

Changes in Kernel extensions due to Passive MWCC

Three functions are changed: hd_open, hd_close, and hd_ioctl.

hd_open: if the logical volume being opened is part of a big VG, is being opened for write, is mirrored, and the MWCC policy is passive, the lv_dirty_bit


representing the logical volume minor number is marked as dirty. Multiple settings of this may occur, as multiple opens result in multiple visits to hd_open.

hd_close: this function is called only when a logical volume is being closed for the last time. When this occurs, the function checks to see if the logical volume is part of a big VG, has more than one copy, has the MWCC policy set to passive, and does not have its passive_mwcc_recover flag set. If all these conditions are true, then the lv_dirty_bit of the logical volume is cleared and the logical volume mirrors are considered 100% consistent with each other.

hd_ioctl: this will return additional status and tell the user if the logical volume is currently marked as needing to undergo, or is actually undergoing, passive MWCC recovery (all reads result in a resync of the mirrors).

The function hd_mirread is called upon the completion, successful or otherwise, of a read of a mirrored logical volume. When entering this function, if the passive_mwcc_recover flag is set, the function will search for the other viable mirrors that were not read and copy the contents of the just-read mirror into those other mirrors: it first sets the mirrors to avoid with the pb_mirbad variable, then calls the function hd_fixup.

The function hd_kdeflvs, which is called at varyonvg time, looks to see if the volume group is mirrored, has the MWCC policy set to passive, and is a big volume group. If it is, then it checks the lv_dirty_bit of each logical volume in the VGSA. If the bit is set, the driver notes that it is going to be in passive MWCC recovery state by setting the passive_mwcc_recover flag to true.

Changes were made to allow hd_kextend to work properly with the new LV_ACTIVE_MWC definition.

Changes in hdpin.exp: export the call hd_sa_update so that hd_top can update the VGSA with the modified lv_dirty_bit as a result of hd_open or hd_close.


AIX 5 LVM Hot Spare Disk in a Volume Group

AIX 5 Hot Spare Disk function

• Automatic migration of failed disks for mirrored LVs
• Ability to create a spare disk pool for a VG

The hot spare function applies to mirrored LVs; non-mirrored LVs on a failing disk cannot be recovered, and therefore no attempt is made.

AIX 5 Hot Spare disk chpv command

chpv [-h Hotspare] ... existing flags ... PhysicalVolume

-h hotspare
Sets the sparing characteristics of the physical volume specified by the PhysicalVolume parameter, that is, whether the physical volume can be used as a hot spare, and the allocation permission for physical partitions on it. This flag has no meaning for non-mirrored logical volumes. The Hotspare variable can be either:
• y - Marks the disk as a hot spare disk within the VG it belongs to and prohibits the allocation of physical partitions on the physical volume. The disk must not have any partitions allocated to logical volumes to be successfully marked as a hot spare disk.
• n - Removes the disk from the hot spare pool of the volume group in which it resides and allows allocation of physical partitions on the physical volume.


AIX 5 Hot Spare disk chvg command

chvg [-s Sync] [-h Hotspare] ... existing flags ... VolumeGroup

-h hotspare
Sets the sparing characteristics for the volume group specified by the VolumeGroup parameter: either allows (yes) or prohibits (no) the automatic migration of failed disks. This flag has no meaning for non-mirrored logical volumes.
• y - Allows the automatic migration of failed disks. Uses one-for-one migration of partitions from one failed disk to one spare disk. The smallest disk in the volume group spare pool that is big enough for one-to-one migration will be used.
• Y - Allows the automatic migration of failed disks; potentially uses the entire pool of spare disks to migrate to, as opposed to a one-for-one migration of partitions to a single spare.
• n - Prohibits the automatic migration of failed disks. This is the default value for a volume group.
• r - Removes all disks from the hot spare pool for the volume group.

-s sync
Sets the synchronization characteristics for the volume group specified by the VolumeGroup parameter: either allows (yes) or prohibits (no) the automatic synchronization of stale partitions. This flag has no meaning for non-mirrored logical volumes.
• y - Attempts to automatically synchronize stale partitions.
• n - Prohibits automatic synchronization of stale partitions. This is the default for a volume group.

• lsvg -p will show the status of all physical volumes in the VG.
• lsvg will show the current state of sparing and synchronization.
• lspv will show whether a disk is a spare.
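Put together, a hot-spare setup might look like this (disk and VG names are illustrative):

    # Mark an empty disk as a hot spare in its volume group
    chpv -h y hdisk4

    # Allow one-for-one migration from a failed disk and automatic
    # resynchronization of stale partitions for the volume group
    chvg -h y -s y datavg

    # Check the result
    lsvg datavg
    lspv hdisk4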


LVM Hot Spot Management

Hot spot management provides tools to determine which logical partitions have high I/O traffic, and allows the migration of those logical partitions to other disks. The benefits of this facility are:
• Improved performance, by eliminating hot spots.
• It can also be used to migrate certain logical partitions for maintenance.

LVM Hot spot data collection

lvmstat { -l | -v } Name [ -e | -d ] [ -F ] [ -C ] [ -c Count ] [ -s ] [ Interval [ Iterations ] ]

The lvmstat command generates reports that can be used to change the logical volume configuration to better balance the input/output load between physical disks. By default, statistics collection is not enabled in the system; you must use the -e flag to enable this feature for the logical volume or volume group in question. Enabling statistics collection for a volume group enables it for all the logical volumes in that volume group. The first report generated by lvmstat provides statistics covering the time since the system was booted; each subsequent report covers the time since the previous report. All statistics are reported each time lvmstat runs. The report consists of a header row followed by a line of statistics for each logical partition or logical volume, depending on the flags specified.

Flags
• -c Count   Prints only the specified number of lines of statistics.
• -C         Causes the counters that keep track of iocnt, Kb_read and Kb_wrtn to be cleared for the specified logical volume/volume group.
• -d         Specifies that statistics collection should be disabled for the logical volume/volume group in question.
• -e         Specifies that statistics collection should be enabled for the logical volume/volume group in question.
• -F         Causes the statistics to be printed colon-separated.
• -l         Specifies that the Name specified is the name of a logical volume.
• -s         Suppresses the header from the subsequent reports when Interval is used.
• -v         Specifies that the Name specified is the name of a volume group.
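A typical collection session, using only the flags listed above, might look like this (volume group name illustrative):

    # Enable statistics collection for every logical volume in rootvg
    lvmstat -v rootvg -e

    # Report volume group statistics every 5 seconds, 3 iterations
    lvmstat -v rootvg 5 3

    # Disable collection when done
    lvmstat -v rootvg -d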


LVM Hot Spot lists

The lvmstat command is useful in determining whether a physical volume is becoming a hindrance to performance, by identifying the busiest physical partitions of a logical volume. The lvmstat command generates two types of reports: per-logical-partition statistics for a logical volume, and per-logical-volume statistics for a volume group. The reports have the following format:

# lvmstat -l hd3
Log_part  mirror#  iocnt  Kb_read  Kb_wrtn  Kbps
1         1        0      0        0        0.00
2         1        0      0        0        0.00
3         1        0      0        0        0.00

# lvmstat -v rootvg
Logical Volume  iocnt  Kb_read  Kb_wrtn  Kbps
hd2             1592   5620     880      0.05
hd9var          71     32       28       0.00
hd8             71     0        284      0.00
hd4             13     8        60       0.00
hd1             11     1        21       0.00

Migrating Hot Spots

migratelp LVname/LPartnumber[ /Copynumber ] DestPV[ /PPartNumber ]

The migratelp command moves the specified logical partition LPartnumber of the logical volume LVname to the DestPV physical volume. If the destination physical partition PPartNumber is specified, it will be used; otherwise a destination partition is selected using the intra-region policy of the logical volume. By default the first mirror copy of the logical partition in question is migrated. A value of 1, 2 or 3 can be specified for Copynumber to migrate a particular mirror copy. The migratelp command fails to migrate partitions of striped logical volumes.

Examples:


To move the first logical partition of logical volume lv00 to hdisk1, type:
    migratelp lv00/1 hdisk1
To move the second mirror copy of the third logical partition of logical volume hd2 to hdisk5, type:
    migratelp hd2/3/2 hdisk5


LVM split mirror

AIX 4.3.3: Splitting and reintegrating a mirror

For a long time it has been desirable to be able to make online backups; especially in installations with mirrored volumes it has been a requested feature to be able to split the mirror and use one side of it for online backups. It has been possible to do a manual split and later reintegration, but it has been rather complicated and therefore unsafe. In AIX 4.3.3 this feature has been made available with an easy command interface.

A mirrored LV can be divided with the chfs command. The example will split the LV mounted on /testfs; copy number 3 will be mounted at /backup:

    chfs -a splitcopy=/backup -a copy=3 /testfs

The LV is reintegrated in two steps:

    # umount /backup
    # rmfs /backup
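A complete online-backup sequence could therefore look like this (the tape device is illustrative, and any archiver could stand in for backup):

    chfs -a splitcopy=/backup -a copy=3 /testfs   # split copy 3, mounted read-only
    backup -0 -f /dev/rmt0 /backup                # back up the frozen copy
    umount /backup                                # then reintegrate the mirror
    rmfs /backup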


LVM Variable Logical Track Group (LTG)

AIX 5 introduces a variable LTG size to improve disk performance

Today the Logical Volume Manager (LVM) shipped with all versions of AIX has a constant maximum transfer size of 128K, also known within LVM as the Logical Track Group (LTG). All I/O within LVM must be on a Logical Track Group boundary. When AIX was first released, all disks supported 128K. Today many disks go beyond 128K, and the efficiency of many disks, such as RAID arrays, is impacted if the I/O is not a multiple of the stripe size; the stripe size is normally larger than 128K. The enhancements in AIX 5 allow a VG LTG size to be specified at VG creation time, and allow the VG LTG to be changed while the volume group is active but no logical volumes are open. The default LTG size is still 128K; other sizes must be requested by the user. mkvg and chvg will fail if the specified LTG is larger than the max_transfer size of the target disk(s), and extendvg will fail if the specified LTG is larger than the max_transfer size of the target disk(s). Changing the LTG size is not allowed for disks active in concurrent mode.

Variable LTG size and commands

The LTG now supports the following sizes:
• 128K - default value
• 256K
• 512K
• 1024K

Variable LTG commands (see the sketch below):
• mkvg -L <LTGsize> - create a new volume group with the given LTG size
• chvg -L <LTGsize> - change a volume group to the given LTG size
• lsvg - will display the LTG size
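For example (names illustrative; the size must not exceed the max_transfer size of the disks):

    # Create a volume group with a 256K LTG size
    mkvg -y datavg -L 256 hdisk2 hdisk3

    # Change the LTG size later, while the VG is active but no LVs are open
    chvg -L 512 datavg
    lsvg datavg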


LVM command overview

High level commands

• varyonvg - executable
• extendvg - shell script
• extendlv - shell script
• mkvg - shell script
• mklv - shell script
• lsvg - executable
• lspv - executable
• lslv - executable

Internal commands

• getlvodm - executable
• getvgname - executable
• putlvodm - executable
• synclvodm - executable
• allocp - executable
• mapread
• map_alloc
• migfix - executable

Low level commands

• lcreatevg - executable
• lmigratelv - executable
• lquerypv - executable
• lqueryvg - executable
• lextendlv - executable
• lreducelv - executable
• lquerylv - executable
• lqueryvgs - executable


LVM Problem Determination

The purpose of this section is to answer four questions:

• What is the root cause of the error?
• Can this problem be distilled to the simplest case?
• What has been lost and what is left behind?
• Is this situation repairable?

Because in most cases each LVM problem case is specific to a user and their environment, this section isn't a "how-to" section. Instead, it is mostly a checklist which will help the user gather the necessary information to rationally determine the root cause of the problem, and whether the problem can be fixed in the field rather than sent to Level 3 software support. And if the problem must be sent to Level 3, this suggests information that will speed the problem determination and solution given by Level 3.

Find out: What is the root cause of the error?

The first question to be asked is whether this problem is really in the LVM layer. The sections that detail how an I/O request is handed down from layer to layer might help clarify all the layers that must be considered. The most important initial determination is whether the problem is above the LVM layer, in the LVM layer, or below the LVM layer. For instance, an application program such as Oracle or HACMP/6000 that accesses the LVM directly might have a problem. If you can determine what actions these failing programs are attempting against the LVM, then try to recreate the action by hand using a method that is not based on those application programs. If your attempt by hand works, then the focus of the problem shifts "up" to the application program. Obviously, if it fails, then you have isolated the problem to the LVM layer (or below). Or, the problem could simply be corruption of the data needed by LVM: the programs are behaving correctly, but the data needed by LVM is corrupted, which causes LVM to behave strangely. An additional bonus for the field investigator is the fact that most high-level commands are shell scripts. Thus, someone familiar with shell programming can turn on the shell output and watch the execution of the shell commands to observe the failure point; this can add helpful information to the problem record. Finally, if there is corruption or loss of data required by LVM (such as a disk accidentally erased from a volume group), it helps to find the exact steps performed (or not performed) by the user, so that the investigator can deduce the state of the system and what useful LVM information is left behind.
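For example, since extendvg is a shell script, its execution can be watched directly (command arguments illustrative):

    # Run the script through the shell with tracing on; every command is
    # echoed as it executes, exposing the exact failure point
    sh -x /usr/sbin/extendvg datavg hdisk4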


Can this problem be distilled to the simplest case?

Many problem reports from the field to Level 3 concerning LVM are difficult to investigate because clarification is required (to determine the root cause of the problem), or because the problem is described in terms of a complex user configuration. If possible, the most basic action of the LVM is the one that should be investigated. This is not always possible, as some problems may only be exposed when running in a complex environment. However, whenever possible, one should try to distill the case down to how an action on a logical volume is causing the system to misbehave. And in that clarification, a non-LVM root cause may be discovered instead.

What has been lost and what has been left behind?

This type of question is typically asked of the system when some sort of accident has resulted in data corruption or loss of LVM required information. Given the state of the system before the corruption, the steps that most likely caused the corruption, and the current state of the machine, one can deduce what is left to work with. Sometimes one will receive conflicting information. This is because part of the ODM disagrees with part of the VGDA. The ODM is the one that is easily alterable (compared to the VGDA).

Is this situation repairable?

Sometimes you have enough information to know what is missing and what should be done to repair the system. However, the design of ODM, the system configurator, and LVM prevents the repair. By fixing one problem, another is spawned. And, one is caught in a deadlock situation that cannot be fixed unless one wrote very specific kernel code to repair the internal aspects of the LVM (most likely the VGDA). This is not a trivial solution, but it is possible. It is only through experience that a judgement can be made if recovery can be attempted.


Problem Recovery

• Warn the user of the consequences
• Gather all possible data
• Save off what can be saved
• Each case is different, so must be the solution

Warn the user of the consequences

Although this might seem a trivial step: when you attempt problem recovery, most of the time you must alter or destroy an important internal structure within the LVM (such as the VGDA). Once this is done, if the recovery attempt didn't work, the user's system is usually in worse shape than before the attempt. Many users will decline the recovery attempt once this warning is given. However, it is better to warn them ahead of time!

Gather all possible data

While the volume group is still partially accessible, gather all possible data about the current volume group. The VGDA will provide information about missing logical volumes, which will be important. Once the recovery procedure starts, important reference information such as that gathered from the VGDA will be lost for good. And if your information is incomplete, you may be stuck with nowhere to go.

Save off what can be saved

Before starting the recovery, make a copy of the files that can be restored in case something goes wrong. A good example would be the ODM database files that reside in /etc/objrepos. Sometimes the recovery steps involve deleting information from those databases; once that information is deleted, if one is unsure of its form, one cannot recreate the structures or values.
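A minimal sketch of such a safety copy (destination path illustrative):

    # Preserve the ODM database files before any destructive recovery step
    mkdir /tmp/odm.save
    cp -p /etc/objrepos/* /tmp/odm.save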

Each case is different, so must each solution be

Since each LVM problem is most likely going to be unique for that system, these notes cannot provide a list of steps one would take in a repair. Once again, the recovery steps must be based on individual experiences with LVM. The LVM lab exercise on recovery provides a glimpse of the complexity and information required to repair a system. However, this lab is just an example, not a template of how all fixes should be attempted.


Trace LVM commands with the trace command

Trace hook 105

Trace hook 105: HKWD KERN LVM. This event is recorded by the Logical Volume Manager for selected events.

LVM relocingblk bp=value pblock=value relblock=value
  Encountered relocated block
  • bp=value, buffer pointer
  • pblock=value, physical block number
  • relblock=value, relocated block number

LVM oldbadblk bp=value pblock=value state=value bflags
  Bad block waiting to be relocated
  • bp=value, buffer pointer
  • pblock=value, physical block number
  • state=value, state of the physical volume
  • bflags, buffer flags as defined in the sys/buf.h file

LVM badblkdone bp=value
  Block relocation complete
  • bp=value, buffer pointer

LVM newbadblk bp=value badblock=value error=value bflags
  New bad block found
  • bp=value, buffer pointer
  • badblock=value, block number of the bad block
  • error=value, system error number (the errno global variable)
  • bflags, buffer flags as defined in the sys/buf.h file

LVM swreloc bp=value status=value error=value retry=value
  Software relocating bad block
  • bp=value, buffer pointer
  • status=value, bad block directory entry status
  • error=value, system error number (the errno global variable)
  • retry=value, relocation entry count

LVM resyncpp bp=value bflags
  Resyncing logical partition mirrors
  • bp=value, buffer pointer
  • bflags, buffer flags as defined in the sys/buf.h file


LVM open device name flags=value
  Open
  • device name, name of the device
  • flags=value, open file mode

LVM close device name
  Close
  • device name, name of the device

LVM read device name ext=value
  Read
  • device name, name of the device
  • ext=value, extension parameters

LVM write device name ext=value
  Write
  • device name, name of the device
  • ext=value, extension parameters

LVM ioctl device name cmd=value arg=value
  ioctl
  • device name, name of the device
  • cmd=value, ioctl command
  • arg=value, ioctl arguments

Example of a trace -a -j105:

ID   ELAPSED_SEC    DELTA_MSEC   APPL SYSCALL KERNEL INTERRUPT
001  0.000000000      0.000000   TRACE ON channel 0  Mon Sep 18 21:52:50 2000
105  20.598330739     6.109275   LVM close: rloglv00
105  20.598415445     0.084706   LVM close: rlv00
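A session like the one above can be captured as follows (the mount and umount merely generate LVM open/close activity; the file system name is illustrative):

    trace -a -j 105           # start tracing asynchronously, hook 105 only
    umount /testfs
    mount /testfs
    trcstop                   # stop the trace
    trcrpt > /tmp/lvm105.rpt  # format the raw trace into a readable report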


Trace hook 10B

Trace hook 10B: HKWD KERN LVMSIMP. This event is recorded by the Logical Volume Manager for selected events.

LVM rblocked: bp=value
  Request blocked by conflict resolution
  • bp=value, buffer pointer

LVM pend: bp=value resid=value error=value bflags
  End of physical operation
  • bp=value, buffer pointer
  • resid=value, residual byte count
  • error=value, system error number (the errno global variable)
  • bflags, buffer flags as defined in the sys/buf.h file

LVM lstart: device name bp=value lblock=value bcount=value bflags opts: value
  Start of logical operation
  • device name, device name
  • bp=value, buffer pointer
  • lblock=value, logical block number
  • bcount=value, byte count
  • bflags, buffer flags as defined in the sys/buf.h file
  • opts: value, possible values: WRITEV, HWRELOC, UNSAFEREL, RORELOC,
    NO_MNC, MWC_RCV_OP, RESYNC_OP, AVOID_C1, AVOID_C2, AVOID_C3

Example of a trace -a -j10B:

ID   ELAPSED_SEC   DELTA_MSEC    APPL SYSCALL KERNEL INTERRUPT
001  0.000000000      0.000000   TRACE ON channel 0  Mon Sep 18 21:52:50 2000
10B  0.007512611      7.512611   LVM pend: pbp=F10000971615E580 resid=0000 error=0000 B_WRITE
10B  0.007523970      0.011359   LVM lend: rhd9var lbp=F10000971E17E1A0 resid=0000 error=0000 B_WRITE
10B  8.968758818   8961.234848   LVM lstart: rhd4 lbp=F100009...


LVM Library calls

List of Logical Volume Subroutines

The library of LVM subroutines is a main component of the Logical Volume Manager. LVM subroutines define and maintain the logical and physical volumes of a volume group. They are used by the system management commands to perform system management for the logical and physical volumes of a system. The programming interface for the library of LVM subroutines is available to anyone who wishes to provide alternatives to, or expand the function of, the system management commands for logical volumes.

Note: The LVM subroutines use the sysconfig system call, which requires root user authority, to query and update kernel data structures describing a volume group. You must have root user authority to use the services of the LVM subroutine library.

The following services are available:

• lvm_querylv - queries a logical volume and returns all pertinent information
• lvm_querypv - queries a physical volume and returns all pertinent information
• lvm_queryvg - queries a volume group and returns pertinent information
• lvm_queryvgs - queries the volume groups of the system and returns information for groups that are varied on-line


LVM logical volume device driver (LVDD)

The Logical Volume Device Driver (LVDD) is a pseudo-device driver that operates on logical volumes through the /dev/lvn special file. Like the physical disk device driver, this pseudo-device driver provides character and block entry points with compatible arguments. Each volume group has an entry in the kernel device switch table. Each entry contains entry points for the device driver and a pointer to the volume group data structure. The logical volumes of a volume group are distinguished by their minor device numbers.

Attention: Each logical volume has a control block located in the first 512 bytes. Data begins in the second 512-byte block. Care must be taken when reading and writing directly to the logical volume, because the control block is not protected from writes. If the control block is overwritten, commands that use it can no longer be used.

Character I/O requests are performed by issuing a read or write request on a /dev/rlvn character special file for a logical volume. The read or write is processed by the file system SVC handler, which calls the LVDD ddread or ddwrite entry point. The ddread or ddwrite entry point transforms the character request into a block request. This is done by building a buffer for the request and calling the LVDD ddstrategy entry point.

Block I/O requests are performed by issuing a read or write on a block special file /dev/lvn for a logical volume. These requests go through the SVC handler to the bread or bwrite block I/O kernel services. These services build buffers for the request and call the LVDD ddstrategy entry point. The LVDD ddstrategy entry point then translates the logical address to a physical address (handling bad block relocation and mirroring) and calls the appropriate physical disk device driver. On completion of the I/O, the physical disk device driver calls the iodone kernel service on the device interrupt level. This service then calls the LVDD I/O completion-handling routine. Once this is completed, the LVDD calls the iodone service again to notify the requester that the I/O is completed.

The LVDD is logically split into top and bottom halves. The top half contains the ddopen, ddclose, ddread, ddwrite, ddioctl, and ddconfig entry points. The bottom half contains the ddstrategy entry point, which contains the block read and write code. This split isolates the code that must run fully pinned and has no access to user process context. The bottom half of the device driver runs on interrupt levels and is not permitted to page fault. The top half runs in the context of a process address space and can page fault.
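The control block can be inspected safely from the character special file (logical volume name illustrative):

    # Read only the first 512-byte block, i.e. the LVCB, and display it
    dd if=/dev/rlv00 bs=512 count=1 2>/dev/null | od -c | head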


Disk Device Calls

scsidisk, the SCSI Disk Device Driver

This driver supports the small computer system interface (SCSI) and Fibre Channel Protocol for SCSI (FCP) fixed disk, CD-ROM (compact disk read only memory), and read/write optical (optical memory) devices.

Syntax: #include <...> [the header file names did not survive reproduction; /usr/include/sys/scsi.h is referenced below]

Device-Dependent Subroutines
Typical fixed disk, CD-ROM, and read/write optical drive operations are implemented using the open, close, read, write, and ioctl subroutines.

open and close subroutines: The openx subroutine is intended primarily for use by the diagnostic commands and utilities. Appropriate authority is required for execution. The ext parameter passed to the openx subroutine selects the operation to be used for the target device. The /usr/include/sys/scsi.h file defines possible values for the ext parameter.

rhdisk special file
Provides raw I/O access to the physical volume (fixed-disk) device driver. The rhdisk special file provides raw I/O access and control functions to physical-disk device drivers for physical disks. Raw I/O access is provided through the /dev/rhdisk0, /dev/rhdisk1, ..., character special files. Direct access to physical disks through block special files should be avoided: such access can impair performance and also cause data consistency problems between data in the block I/O buffer cache and data in system pages. The /dev/hdisk block special files are reserved for system use in managing file systems, paging devices and logical volumes. The r prefix on the special file name indicates that the drive is to be accessed as a raw device rather than a block device.


Disk Low-Level Device Calls (such as SCSI calls)

SCSI Adapter Device Driver

The SCSI adapter device driver has access to the physical disk (if it is a SCSI disk). The driver supports data transfers via read and write, and control commands via ioctl calls. The disk device driver uses the adapter device driver to access and control the physical storage device.

Description
The /dev/scsin and /dev/vscsin special files provide interfaces to allow SCSI device drivers to access SCSI devices. These files manage the adapter resources so that multiple SCSI device drivers can access devices on the same SCSI adapter simultaneously. The /dev/vscsin special file provides the interface for the SCSI-2 Fast/Wide Adapter/A and SCSI-2 Differential Fast/Wide Adapter/A, while the /dev/scsin special file provides the interface for the other SCSI adapters. SCSI adapters are accessed through the special files /dev/scsi0, /dev/scsi1, ..., and /dev/vscsi0, /dev/vscsi1, .... The /dev/scsin and /dev/vscsin special files provide access for both initiator and target mode device instances. The host adapter is an initiator for access to devices such as disks, tapes, and CD-ROMs. The adapter is a target when accessed from devices such as computer systems or other devices that can act as SCSI initiators. For further information, see the Kernel and Subsystems Technical Reference, Volume 2 and the Files Reference manual.
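The sketch below shows how a caller might ready a target through the adapter driver. SCIOSTART and SCIOSTOP are documented adapter ioctls, but the device name, the target ID/LUN values, and the packing of the ID and LUN into the ioctl argument are assumptions here; the authoritative encoding is in /usr/include/sys/scsi.h:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/scsi.h>
#include <unistd.h>

/* Sketch only: select a device at a given SCSI ID/LUN through the
 * adapter. The (id << 8) | lun packing is an assumption - check
 * sys/scsi.h for the exact argument encoding on your release. */
int main(void)
{
        int id = 4, lun = 0;                       /* placeholder target */
        int adapter = open("/dev/scsi0", O_RDWR);  /* placeholder adapter */

        if (adapter < 0) {
                perror("open");
                return 1;
        }
        if (ioctl(adapter, SCIOSTART, (id << 8) | lun) < 0)
                perror("SCIOSTART");
        /* ... a SCSI device driver would issue I/O here ... */
        ioctl(adapter, SCIOSTOP, (id << 8) | lun);
        close(adapter);
        return 0;
}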


Exercises

Examine the physical disk layout of a logical volume and a physical volume.

Use a tool such as edhx, hexit, dd, or another utility to look at a physical volume. Identify the PVID, the VGID, and the LVM structures. Hint: think about which device you should use to access these data. It may be easier to copy data from the drive to a file with the dd command:

# dd if=/dev/xxx of=/tmp/Myfile bs=1024k count=<n>

Use another device to look at the logical volume. Does the data match what you saw on the physical device?

Examine the impact of LVM Passive Mirror Write Consistency

This exercise looks at the performance impact of enabling and disabling MWC. To do this we need a reproducible write load. One way to get this is to write a C program to create the load. Remember that the file has to be really big, so that it exceeds the cache size, or you must force a sync to occur before terminating. Sample C code to write a big file:

#include <fcntl.h>
#include <unistd.h>

void writetstfile()
{
        char buffer[512];
        char *filename = "/test/a_large_file";
        register int i;
        int fildes;

        fildes = open(filename, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fildes < 0)
                return;
        /* Write enough 512-byte blocks to exceed the cache size;
         * adjust the count to suit the test system. */
        for (i = 0; i < 500000; i++)
                write(fildes, buffer, sizeof(buffer));
        fsync(fildes);  /* force a sync before terminating */
        close(fildes);
}

Try to unmount a filesystem, mount the filesystem again, create a file, and write data into the file to create some activity in the LVM trace file.
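One way to drive the comparison, sketched here on the assumption that the /test filesystem lives on a mirrored logical volume named testlv (on AIX 5L the chlv -w flag selects active (y), passive (p), or no (n) mirror write consistency; names and values are illustrative):

# umount /test
# chlv -w p testlv
# mount /test
# time ./writetst
# umount /test
# chlv -w n testlv
# mount /test
# time ./writetst

Compare the elapsed times of the two runs to see the MWC overhead.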


Unit 12. Enhanced Journaled File System

Objectives

After completing this unit, you should be able to:
• Explain the difference between the terms aggregate and fileset.
• Identify the various data structures that make up the JFS2 filesystem.
• Use the fsdb command to trace the various data structures that make up the logical and virtual file system.



J2 - Enhanced Journaled File System

Introduction

The Enhanced Journaled File System (JFS2) is an extent-based journaled file system. It is the default filesystem on IA-64 systems and is available on Power-based systems. Currently the default on Power systems is the Journaled File System (JFS).

Numbers

The following table lists some general information about JFS2:

Function: Block size
Value: 512 - 4096 bytes (configurable block size)

Function: Architectural maximum file size
Value: 4 petabytes

Function: Maximum file size tested
Value: 1 terabyte

Function: Maximum file system size
Value: 1 terabyte

Function: Number of inodes
Value: Dynamic, limited by disk space

Function: Directory organization
Value: B-tree


Aggregate

Introduction

The term aggregate is defined in this section. The layout of a JFS2 aggregate is described.

Definitions

JFS2 separates the notion of a disk space allocation pool, called an aggregate, from the notion of a mountable file system sub-tree, called a fileset. The rules that define aggregates and filesets in JFS2 are:
• There is exactly one aggregate per logical volume.
• There may be multiple filesets per aggregate.
• In the first release of AIX 5L, only one fileset per aggregate is supported.
• The meta-data has been designed to support multiple filesets, and this feature may be introduced in a future release of AIX 5L.
The terms aggregate and fileset in this document correspond to their DCE/DFS (Distributed Computing Environment Distributed File System) usage.

Aggregate block size

An aggregate has a fixed block size (number of bytes per block) that is defined at configuration time. The aggregate block size defines the smallest unit of space allocation supported on the aggregate. The block size cannot be altered, and must be no smaller than the physical block size (currently 512 bytes). Legal aggregate block sizes are:
• 512 bytes
• 1024 bytes
• 2048 bytes
• 4096 bytes
Do not confuse the aggregate block size with the logical volume block size, which defines the smallest unit of I/O.
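As an illustration, the aggregate block size is chosen when the filesystem is created. The following sketch assumes the AIX 5L crfs syntax, with myvg as a placeholder volume group and the size given in 512-byte blocks (32768 blocks = 16MB):

# crfs -v jfs2 -g myvg -m /test512 -a size=32768 -a agblksize=512
# mount /test512

Because the block size cannot be altered afterwards, it must be chosen up front to match the expected mix of file sizes.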


Aggregate -- continued

Aggregate layout

The following diagram and table detail the layout of the aggregate.

[Diagram: an aggregate with a 1KB aggregate block size. A reserved area occupies aggregate blocks 0-31 (32KB), followed by the primary aggregate superblock at block 32 and the aggregate inode table (32 inodes in a 16KB extent, with its control page, starting at block 36). The first extent of the Aggregate Inode Allocation Map (an IAG with its control section, working map, persistent map, and ixd section) starts at block 44, and the secondary aggregate superblock appears further in, around block 60. Aggregate inode #1 ("self") describes the inode table itself (size 8192, addr 36, length 8), aggregate inode #2 describes the block map, and aggregate inodes #16 and up describe the filesets, each holding xad entries (8 total) that address its data.]

Part: Reserved area
Function: A 32KB area at the front of the aggregate not used by JFS2. The first block is used by the LVM.

Part: Primary aggregate superblock
Function: The primary aggregate superblock (defined as a struct superblock) contains aggregate-wide information such as the size of the aggregate, the size of allocation groups, and the aggregate block size. The superblock is at a fixed location, which allows it to always be found without depending on any other information.

Part: Secondary aggregate superblock
Function: A direct copy of the primary aggregate superblock. The secondary aggregate superblock is used if the primary aggregate superblock is corrupted.


Aggregate -- continued

Aggregate layout (continued)

Part: Aggregate inode table
Function: Contains inodes that describe the aggregate-wide control structures. These inodes are described below.

Part: Secondary aggregate inode table
Function: Contains replicated inodes from the Aggregate Inode Table. Since the inodes in the Aggregate Inode Table are critical for finding any file system information, they are each replicated in the Secondary Aggregate Inode Table. The actual data for the inodes is not repeated, just the addressing structures used to find the data and the inode itself.

Part: Aggregate inode allocation map
Function: Describes the Aggregate Inode Table. It contains allocation state information on the aggregate inodes as well as their on-disk location.

Part: Secondary aggregate inode allocation map
Function: Describes the Secondary Aggregate Inode Table.

Part: Block allocation map
Function: Describes the control structures for allocating and freeing aggregate disk blocks within the aggregate. The Block Allocation Map maps one-to-one with the aggregate disk blocks.

Part: fsck working space
Function: Provides space for fsck to be able to track the aggregate block allocations. This space is necessary because, for a very large aggregate, there might not be enough memory to track this information in memory when fsck is run. The space is described by the superblock. One bit is needed for every aggregate block. The fsck working space always exists at the end of the aggregate.

Part: In-line log
Function: Provides space for logging the meta-data changes of the aggregate. The space is described by the superblock. The in-line log always exists following the fsck working space.


Aggregate -- continued

Aggregate Inodes

When the aggregate is initially created, the first inode extent is allocated; additional inode extents are allocated and de-allocated dynamically as needed. These aggregate inodes each describe certain aspects of the aggregate itself, as follows:

Inode #0: Reserved.

Inode #1: Called the "self" inode, this inode describes the aggregate disk blocks comprising the aggregate inode map. This is a circular representation, in that aggregate inode one is itself in the file that it describes. The obvious circular representation problem is handled by forcing at least the first aggregate inode extent to appear at a well-known location, namely 4K after the Primary Aggregate Superblock. Therefore, JFS2 can easily find aggregate inode one, and from there it can find the rest of the aggregate inode table by following the B+-tree in inode one.

Inode #2: Describes the Block Allocation Map.

Inode #3: Describes the In-line Log when mounted. This inode is allocated, but no data is saved to disk.

Inodes #4-15: Reserved for future extensions.

Inodes #16-: Starting at aggregate inode 16 there is one inode per fileset, the Fileset Allocation Map Inode. These inodes describe the control structures that represent each fileset. As additional filesets are added to the aggregate, the aggregate inode table itself may have to grow to accommodate additional fileset inodes.


Allocation Groups

Introduction

Allocation Groups (AG) divide the space on an aggregate into chunks, and allow JFS2 resource allocation policies to use well known methods for achieving good JFS2 I/O performance.

Allocation policies

When locating data on the disk, JFS2 will attempt to:
• Group disk blocks for related data and inodes close together.
• Distribute unrelated data throughout the aggregate.

Allocation Group Sizes

Allocation group sizes must be selected that yield allocation groups sufficiently large to provide for contiguous resource allocation over time. The allocation group size is stored in the aggregate superblock. The rules for setting the allocation group size are:
• The maximum number of allocation groups per aggregate is 128.
• The minimum size of an allocation group is 8192 aggregate blocks.
• The allocation group size must always be a power-of-2 multiple of the number of blocks described by one dmap page (that is, 1, 2, 4, 8, ... dmap pages).

Partial Allocation Group

An aggregate whose size is not a multiple of the allocation group size contains a partial allocation group - one not fully covered by disk blocks. This partial allocation group is treated as a complete allocation group, except that the non-existent disk blocks are marked as allocated in the Block Allocation Map.
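For example, with a 4096-byte aggregate block size, the minimum allocation group of 8192 aggregate blocks covers 32MB. A 100MB aggregate carved into 32MB allocation groups would then hold three full allocation groups plus one partial allocation group covering the final 4MB, with the blocks beyond the end of the aggregate in that last group marked as allocated in the Block Allocation Map.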


Filesets

Introduction

A fileset is a set of files and directories that form an independently mountable sub-tree, equivalent to a Unix file system file hierarchy. A fileset is completely contained within a single aggregate.

Layout

The following illustration and table detail the layout of a fileset.

[Diagram: the Fileset Inode Table and the Fileset Inode Allocation Map. The first extent of the allocation map (IAG 0, at block 244) records per-allocation-group free inode lists (inofree, extfree, numinos, numfree) and, in its ixd section, the location of the first inode extent (at block 248), which holds fileset inodes 0-31, including fileset inode #2, the root directory. A second allocation map extent (IAG 1) heads the IAG free list, and the second half of the fileset superblock information is stored nearby.]

Part: Fileset inode table
Function: Contains inodes describing the fileset-wide control structures. The Fileset Inode Table logically contains an array of inodes.

Part: Fileset inode allocation map
Function: Describes the Fileset Inode Table. The Fileset Inode Allocation Map contains allocation state information on the fileset inodes as well as their on-disk location.

Part: Inodes
Function: Objects. Every JFS2 object is represented by an inode, which contains the expected object-specific information such as time stamps and file type (regular vs. directory, etc.). Each inode also "contains" a B+-tree to record the allocation of extents. Note specifically that all JFS2 meta-data structures (except for the superblock) are represented as "files." By reusing the inode structure for this data, the data format (on-disk layout) becomes inherently extensible.


Filesets -- continued

Super Inode

The super inodes found in the Aggregate Inode Table (inode #16 and greater) describe the Fileset Inode Allocation Map and other fileset information. Since the Aggregate Inode Table is replicated, there is also a secondary version of each of these inodes, which points to the same data.

Inodes

When the fileset is initially created, the first inode extent is allocated; additional inode extents are allocated and de-allocated dynamically as needed. The inodes in a fileset are allocated as follows:

Fileset inode #0: Reserved.

Fileset inode #1: Holds additional fileset information that would not fit in the Fileset Allocation Map Inode in the Aggregate Inode Table.

Fileset inode #2: The root directory inode for the fileset.

Fileset inode #3: The ACL file for the fileset.

Fileset inodes #4-: Fileset inodes from four onwards are used by ordinary fileset objects: user files, directories, and symbolic links.


Extents

Introduction

Disk space in a JFS2 filesystem is allocated in a sequence of contiguous aggregate blocks called an extent.

Extent rules

An extent is:
• made up of a series of contiguous aggregate blocks.
• variable in size, and can range from 1 to 2^23 aggregate blocks.
• wholly contained within a single aggregate; large extents may span multiple allocation groups.
• indexed in a B+-tree.

Extent Allocation Descriptor

Extents are described by an xad structure. The two main values describing an extent are its length and its address; in an xad, both the length and the address are expressed in units of the aggregate block size. Details of the xad data structure are shown below.

struct xad {
        uint8   xad_flag;       /* flags */
        uint16  xad_reserved;   /* reserved */
        uint40  xad_offset;     /* logical offset */
        uint24  xad_length;     /* length in aggregate blocks */
        uint40  xad_address;    /* address in aggregate blocks */
};

Member: xad_flag
Description: Flags set on this extent. See /usr/include/j2/j2_xtree.h for a list of flags.

Member: xad_reserved
Description: Reserved for future use.

Member: xad_offset
Description: Extents are generally grouped together to form a larger group of disk blocks. The xad_offset describes the logical byte address this extent represents in the larger group.

Member: xad_length
Description: A 24-bit field containing the length of the extent in aggregate blocks. An extent can range in size from 1 to 2^24-1 aggregate blocks.

Member: xad_address
Description: A 40-bit field containing the address of the first block of the extent. The address is in units of aggregate blocks and is the block offset from the beginning of the aggregate.
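To put these field widths in perspective: the 24-bit xad_length caps a single extent at 2^24 - 1 aggregate blocks, which is just under 8GB with 512-byte aggregate blocks and just under 64GB with 4096-byte blocks; the 40-bit xad_address can address 2^40 aggregate blocks, which at 4096 bytes per block corresponds to the 4-petabyte architectural limit quoted at the start of this unit.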


Extents -- continued

Allocation Policy

In general, the allocation policy for JFS2 tries to maximize contiguous allocation by allocating a minimum number of extents, with each extent as large and contiguous as possible. This allows for larger I/O transfers, resulting in improved performance. However, in special cases this is not always possible. For example, a copy-on-write clone of a segment will cause a contiguous extent to be partitioned into a sequence of smaller contiguous extents. Another case is restriction of the extent size. For example, the extent size is restricted for compressed files, since the entire extent must be read into memory and decompressed; only a limited amount of memory is available, so there must be enough room for the decompressed extent.

Fragmentation

An extent based file system combined with a user-specified aggregate block size allows JFS2 to not have separate support for internal fragmentation. The user can configure the aggregate with a small aggregate block size (e.g., 512 bytes) to minimize internal fragmentation for aggregates with large numbers of small size files. A defragmentation utility will be provided to reduce external fragmentation which occurs from dynamic allocation/de-allocation of variable size extents. This allocation and de-allocation can result in disconnected variable size free extents all over the aggregate. The defragmentation utility will coalesce multiple small free extents into single larger extents.


Binary Trees of Extents

Introduction

Objects in JFS2 are stored in groups of extents arranged in binary trees. The concepts of binary trees are introduced in this section.

Trees

Binary trees consist of nodes arranged in a tree structure. Each node contains a header describing the node. A flag in the node header identifies the role of the node in the tree.

[Diagram: a root node (header flags=BT_ROOT) points to leaf nodes (flags=BT_LEAF) and to an internal node (flags=BT_INTERNAL), which in turn points to further leaf nodes. Each leaf node holds an array of extent descriptors (xad entries).]

This table describes the binary tree header flags.

Flag: BT_ROOT
Description: The root or top of the tree.

Flag: BT_LEAF
Description: The bottom of a branch of a tree. Leaf nodes point to the extents containing the object's data.

Flag: BT_INTERNAL
Description: An internal node points to two or more leaf nodes or other internal nodes.


Binary Trees of Extents -- continued

Why B+-trees

B+-trees are used in JFS2 to help performance by providing:
• fast reading and writing of extents - the most common operations.
• fast searching for a particular extent of a file.
• efficient appending or inserting of an extent in a file.
• efficient traversal of an entire B+-tree.

B+-tree index

There is one generic B+–tree index structure for all index objects in JFS2 except for directories. The data being indexed depends upon the object. The B+–tree is keyed by offset of the xad structure of the data being described by the tree. The entries are sorted by the offsets of the xad structures, each of which is an entry in a node of a B+–tree.
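Because the entries in a node are sorted by offset, a lookup can binary-search the xad array. The following sketch is illustrative only - it is not JFS2 source, and it uses simplified stand-in types rather than the real xad_t:

#include <stdint.h>

/* Stand-in for an xad entry: offset, length, and address are all in
 * units of aggregate blocks, as described earlier. */
typedef struct {
        uint64_t offset;
        uint32_t length;
        uint64_t address;
} xad_sketch_t;

/* Return the index of the last entry whose offset is <= target,
 * or -1 if target lies before the first extent. */
int xad_lookup(const xad_sketch_t *xad, int nentries, uint64_t target)
{
        int lo = 0, hi = nentries - 1, found = -1;

        while (lo <= hi) {
                int mid = (lo + hi) / 2;
                if (xad[mid].offset <= target) {
                        found = mid;   /* candidate; look further right */
                        lo = mid + 1;
                } else {
                        hi = mid - 1;
                }
        }
        return found;
}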

Root node header

The file j2_xtree.h describes the header for the root of the B+-tree in struct xtpage_t.

#define XTPAGEMAXSLOT   256

typedef union {
        struct xtheader {
                int64   next;           /* 8: */
                int64   prev;           /* 8: */
                uint8   flag;           /* 1: */
                uint8   rsrvd1;         /* 1: */
                int16   nextindex;      /* 2: next index = # of entries */
                int16   maxentry;       /* 2: max number of entries */
                int16   rsrvd2;         /* 2: */
                pxd_t   self;           /* 8: self */
        } header;                       /* (32) */
        xad_t   xad[XTPAGEMAXSLOT];     /* 16 * maxentry: xad array */
} xtpage_t;


Binary Trees of Extents -- continued

Leaf node header

The file j2_btree.h describes the header for an internal node or a leaf node in struct btpage_t.

typedef struct {
        int64   next;           /* 8: right sibling bn */
        int64   prev;           /* 8: left sibling bn */
        uint8   flag;           /* 1: */
        uint8   rsrvd[7];       /* 7: type specific */
        int64   self;           /* 8: self address */
        uint8   entry[4064];    /* 4064: */
} btpage_t;


Inodes

Overview

Every file on a JFS2 filesystem is described by an on-disk inode. The inode holds the root header for the extent binary tree. File attribute data and block allocation maps are also kept in the inode.

Inode Layout

The inode is a 512-byte structure, split into four 128-byte sections described here.

Section 1: POSIX attributes. This section describes the POSIX attributes of the JFS2 object, including the inode and fileset number, object type, object size, user id, group id, access time, modified time, created time, and more.

Section 2: This section contains several parts:
• descriptors for extended attributes
• block allocation maps
• inode allocation maps
• a header pointing to the data (B+-tree root, directory, in-line data)

Section 3: This section can contain one of the following:
• in-line file data, for very small files (up to 128 bytes)
• the first 8 xad structures describing the extents for this file

Section 4: This section extends section 3 by providing additional storage for more attributes, xad structures, or in-line data.


Inodes -- continued

Structure

The current definition of the on-disk inode structure is:

struct dinode {
        /*
         * I. base area (128 bytes)
         * define generic/POSIX attributes
         */
        ino64_t  di_number;   /* 8: inode number, aka file serial number */
        uint32   di_gen;      /* 4: inode generation number */
        uint32   di_fileset;  /* 4: fileset #, inode # of inode map file */
        uint32   di_inostamp; /* 4: stamp to show inode belongs to fileset */
        uint32   di_rsv1;     /* 4: */
        pxd_t    di_ixpxd;    /* 8: inode extent descriptor */
        int64    di_size;     /* 8: size */
        int64    di_nblocks;  /* 8: number of blocks allocated */
        uint32   di_uid;      /* 4: uid_t user id of owner */
        uint32   di_gid;      /* 4: gid_t group id of owner */
        int32    di_nlink;    /* 4: number of links to the object */
        uint32   di_mode;     /* 4: mode_t attribute, format and permission */
        j2time_t di_atime;    /* 16: time last data accessed */
        j2time_t di_ctime;    /* 16: time last status changed */
        j2time_t di_mtime;    /* 16: time last data modified */
        j2time_t di_otime;    /* 16: time created */

        /*
         * II. extension area (128 bytes)
         * extended attributes for file system (96)
         */
        ead_t    di_ea;       /* 16: ea descriptor */
        union {
                uint8 _data[80];
                /* block allocation map */
                struct {
                        struct bmap *__bmap;     /* incore bmap descriptor */
                } _bmap;
                /* inode allocation map (fileset inode 1st half) */
                struct {
                        uint32 _gengen;          /* di_gen generator */
                        struct inode *__ipimap2; /* replica */
                        struct inomap *__imap;   /* incore imap control */
                } _imap;
        } _data2;

        /*
         * B+-tree root header (32)
         * B+-tree root node header, or dtroot_t for directory,
         * or data extent descriptor for inline data
         */
        union {
                struct {
                        int32 _di_rsrvd[4];  /* 16: */
                        dxd_t _di_dxd;       /* 16: data extent descriptor */
                } _xd;
                int32   _di_btroot[8];       /* 32: xtpage_t or dtroot_t */
                ino64_t _di_parent;          /* 8: idotdot in dtroot_t */
        } _data2r;

        /*
         * III. type-dependent area (128 bytes)
         * B+-tree root node xad array or inline data
         */
        union {
                uint8 _data[128];   /* B+-tree root node/inline data area */
                struct {
                        uint8 _xad[128];
                } _file;
                /* device special file */
                struct {
                        dev64_t _rdev;  /* 8: dev_t device major and minor */
                } _specfile;
                /* symbolic link. link is stored in inode if its length
                 * is less than IDATASIZE. Otherwise stored like a
                 * regular file. */
                struct {
                        uint8 _fastsymlink[128];
                } _symlink;
        } _data3;

        /*
         * IV. type-dependent extension area (128 bytes)
         * user-defined attribute, or inline data continuation,
         * or B+-tree root node continuation
         */
        union {
                uint8 _data[128];
        } _data4;
};


Inodes -- continued

Allocation Policy

JFS2 allocates inodes dynamically, which provides the following advantages:
• Inode disk blocks can be placed at any disk address, which decouples the inode number from the location. This decoupling simplifies supporting aggregate and fileset reorganization to enable shrinking the aggregate. The inodes can be moved and still retain the same number, so there is no need to search the directory structure to update the inode numbers.
• There is no need to allocate "ten times as many inodes as you will ever need", as with filesystems that contain a fixed number of inodes, and thus filesystem space utilization is optimized. This is especially important with the larger inode size of 512 bytes in JFS2.
• File allocation for large files can consume multiple allocation groups and still be contiguous. Static allocation forces a gap containing the initially allocated inodes in each allocation group; with dynamic allocation, all the blocks contained in an allocation group can be used for data.

Dynamic inode allocation causes a number of problems, including:
• With static allocation the geometry of the file system implicitly describes the layout of inodes on disk; with dynamic allocation, separate mapping structures are required.
• The inode mapping structures are critical to JFS2 integrity. Due to the overhead involved in replicating these structures, we accept the risk of losing these maps. However, replicating the B+-tree structures allows us to find the maps.

Inode extents

Inodes are allocated dynamically by allocating inode extents, which are simply contiguous chunks of inodes on the disk. By definition, a JFS2 inode extent contains 32 inodes. With a 512-byte inode size, an inode extent therefore occupies 16KB on the disk.


Inodes -- continued

Inode initialization

When a new inode extent is allocated the extent is not initialized, but in order for fsck to be able to check if an inode is in-use, JFS2 will need some information in it. Once an inode in an extent is marked in-use its fileset number, inode number, inode stamp, and the inode allocation group block address are initialized. Thereafter, the link field will be sufficient to determine if the inode is currently in-use.

Inode Allocation Map

Dynamic inode allocation implies that there is no direct relationship between an inode number and the disk address of the inode. Therefore we must have a means of finding the inodes on disk. The Inode Allocation Map provides this function.

Inode Generation Numbers

Inode generation numbers are simply counters that increment each time an inode is reused. Network file system protocols such as NFS (implicitly) require them; they form part of the file identifier manipulated by VNOP_FID() and VFS_VGET(). The static-inode-allocation practice of storing a per-inode generation counter doesn't work with dynamic inode allocation, because when an inode becomes free its disk space may literally be reused for something other than an inode (e.g., the space may be reclaimed for ordinary file data storage). Therefore, in JFS2 there is simply one inode generation counter that is incremented on every inode allocation (rather than one counter per inode that would be incremented when that inode is reused). Although a fileset-wide generation counter will recycle faster than a per-inode generation counter, a simple calculation shows that the 32-bit value is still sufficient to meet NFS or DFS requirements.


File Data Storage

Overview

This section introduces the data structures used to describe where a file’s data is stored.

In-line data

If a file contains a small amount of data, the data may be stored in the inode itself. This is called in-line storage. The header found in the second section of the inode points to the data, which is stored in the third and fourth sections of the inode.

[Diagram: an inode whose header for in-line data points at the in-line data stored within the inode itself.]

Binary trees

When more storage is needed than can be provided in-line, the data must be placed in extents. The header in the inode now becomes the binary tree root header. If there are 8 or fewer extents for the file, then the xad structures describing the extents are contained in the inode. An inode containing 8 or fewer xad structures would look like:

[Diagram: the inode's B+-tree header is followed by 8 xad slots, three of which are in use and point directly at the data extents - offset 0, addr 68, length 4 (16KB of data); offset 4096, addr 84, length 12 (48KB of data); and offset 26624, addr 256, length 2 (8KB of data).]


File Data Storage -- continued

INLINEEA bit

Once the 8 xad structures in the inode are filled, an attempt is made to use the last quadrant of the inode for more xad structures. If the INLINEEA bit is set in the di_mode field of the inode, then the last quadrant of the inode is available for 8 more xad structures.

More extents

Once all of the available xad structures in the inode are used, the B+-tree must be split. 4K of disk space is allocated for a leaf node of the B+-tree, which is logically an array of xad entries with a header. The 8 xad entries are moved from the inode to the leaf node, and the header is initialized to point to the 9th entry as the first free entry. The first xad structure in the inode is updated to point to the newly allocated leaf node, and the inode header is updated to indicate that only one xad structure is now being used, and that it contains the pure root of a B+-tree. The offset for this new xad structure contains the offset of the first entry in the leaf node. The organization of the inode now looks like:

[Diagram: the inode's first xad now points at a 4K leaf node (addr 412) holding up to 254 xad entries; the leaf node's first entries point at the file's data extents (the 16KB, 48KB, and 8KB extents from the previous figure), and the remaining inode xad slots are unused.]


File Data Storage -- continued

Continuing to add extents

As new extents are added to the file, they continue to be added to the leaf node in the necessary order until the node fills. Once the node fills, a new 4K of disk space is allocated for another leaf node of the B+-tree, and the second xad structure in the inode is set to point to this newly allocated node. The node now looks like:

[Diagram: the inode's first two xad entries now point at two 4K leaf nodes (addr 412 and addr 560), each holding up to 254 xad entries; the leaf node entries point at the file's data extents.]


File Data Storage -- continued

Another split

As extents are added to the file, this behavior continues until all 8 xad structures in the inode contain leaf node xad structures, at which time another split of the B+-tree occurs. This split creates an internal node of the B+-tree, which is used purely to route the searches of the tree. An internal node looks exactly like a leaf node. 4K of disk space is allocated for the internal node of the B+-tree, the 8 xad entries for the leaf nodes are moved from the inode to the newly created internal node, and the internal node header is initialized to point to the 9th entry as the first free entry. The root of the B+-tree is then updated by making the inode's first xad structure point to the newly allocated internal node, and the header in the inode is updated to indicate that only 1 xad structure is now being used for the B+-tree. As extents continue to be added, additional leaf nodes are created to contain the xad structures for the extents, and these leaf nodes are added to the internal node. Once the first internal node is filled, a second internal node is allocated, and the inode's second xad structure is updated to point to the new internal node. This behavior continues until all 8 of the inode's xad structures contain internal nodes.

[Diagram: the inode's xad entries now point at internal nodes (addr 380 and addr 212), each holding up to 254 xad entries that point at leaf nodes (such as addr 412 and addr 560), whose entries in turn point at the file's data extents.]


fsdb Utility

Introduction

The fsdb command enables you to examine, alter, and debug a file system.

Starting fsdb

It is best to run fsdb against an unmounted filesystem. Use the following syntax to start fsdb:

fsdb <device>

For example:

# fsdb /dev/lv00
Aggregate Block Size: 512
>

Supported filesystems

fsdb supports both the JFS and JFS2 file systems. The commands available in fsdb are different depending on what filesystem type it is running against. The following explains how to use fsdb with a JFS2 file system.

Commands

The commands available in fsdb can be viewed with the help command, as shown here (the argument names in angle brackets follow the AIX fsdb documentation for JFS2 filesystems):

> help
Xpeek Commands
a[lter] <block> <offset> <hex string>
b[tree] <block> [<offset>]
dir[ectory] <inode number> [<fileset>]
d[isplay] [<block> [<offset> [<format> [<count>]]]]
dm[ap] [<block number>]
dt[ree] <inode number> [<fileset>]
h[elp] [<command>]
ia[g] [<IAG number>] [a | <fileset>]
i[node] [<inode number>] [a | <fileset>]
q[uit]
su[perblock] [p | s]


Exercise 1 - fsdb

Introduction

In this lab you will run the fsdb utility against a JFS2 filesystem that was created for you. The filesystem should not be mounted when running fsdb. The filesystem may be mounted to examine the files, just be sure to un-mount it before running fsdb.

Lab steps

Follow the steps in this table:

Step 1: Start fsdb on the logical volume /dev/lv00:
# fsdb /dev/lv00
What is the aggregate block size used in this filesystem?

Step 2: Type help to view the fsdb sub-commands. The commands you will be using in this lab are: inode, directory, and display.
What inode number represents the fileset root directory inode?
Display the root inode for the fileset. What command did you use?
Note: If you want to display the aggregate inodes instead of the fileset inodes, append an "a" to the command, i.e.: inode 2 a.

Step 3: Find the inode number of each file in the fileset using the directory command followed by the inode number of the root directory inode of the fileset. For example:
> dir 2
idotdot = 2
4 fileA
5 fileB
6 fileC
3 lost+found


Exercise 1 - fsdb -- continued

Using fsdb (continued)

In the next few steps you will locate and display fileA's data.

Step 4: Display the inode of fileA. What command did you use?
Use the inode you displayed to answer the following questions:
What is the file size of fileA?
How many disk blocks is fileA's data using?

Step 5: After the inode is displayed, a sub-menu of commands is shown. Type t to display the root binary tree header. Examine the flags in the header; what flags are set?

Step 6: Type <Enter> to walk down the xad structures in this node. How many xad structures are used for this file?

Step 7: The address field in the xad shows the aggregate block number of the first data block of fileA. Use the display command to display this block:
> d 12345
Did you find fileA's data?


Exercise 1 - fsdb -- continued

FileB and fileC

Use the commands and techniques you learned in the last section to examine fileB, fileC, and fileD. Answer the following questions about these files:

1. What inode numbers are used for fileB, fileC, and fileD?

2. How many xad structures are used to describe fileB’s data blocks?

3. How many xad structures are used to describe fileC’s data blocks?

4. Examine the inode for fileD. How big is this file (as shown in di_size)? How many aggregate blocks are being used by fileD? Are enough aggregate blocks allocated to store the entire file? Explain your answer.


Directory

Introduction

In addition to files, an inode can represent a directory. A directory is a journaled meta-data file in JFS2, composed of directory entries that indicate the files and sub-directories contained in the directory.

Directory entry

Stored in an array, directory entries link the names of the objects in the directory to an inode number. A directory entry has the following members:

Member: inumber
Description: Inode number.

Member: namelen
Description: Length of the name.

Member: name[30]
Description: File name, up to 30 characters.

Member: next
Description: If more than 30 characters are needed, additional entries are linked using the next pointer.


Directory -- continued

Root Header

In order to improve the performance of locating a specific directory entry, a binary tree sorted by name is used. As with files, the header section of a directory inode contains the binary tree root header. Each header describes an 8-element array of directory entries. The root header is defined by a dtroot_t structure contained in /usr/include/j2/j2_dtree.h:

typedef union {
        struct {
                ino64_t idotdot;    /* 8: parent inode number */
                int64   rsrvd1;     /* 8: */
                uint8   flag;       /* 1: */
                int8    nextindex;  /* 1: next free entry in stbl */
                int8    freecnt;    /* 1: free count */
                int8    freelist;   /* 1: freelist header */
                int32   rsrvd2;     /* 4: */
                int8    stbl[8];    /* 8: sorted entry index table */
        } header;                   /* (32) */
        dtslot_t slot[9];
} dtroot_t;

Member: idotdot
Description: Inode number of the parent directory.

Member: flag
Description: Indicates whether the node is an internal or leaf node, and whether it is the root of the binary tree.

Member: nextindex
Description: Last used slot in the directory entry slot array.

Member: freecnt
Description: Number of free slots in the directory entry array.

Member: freelist
Description: Slot number of the head of the free list.

Member: stbl[8]
Description: Indices to the directory entry slots that are currently in use. The entries are sorted alphabetically by name.

Member: slot[9]
Description: Array of directory entries, 8 entries; the header is stored in the first slot.

Leaf and internal node header

When more than 8 directory entries are needed, a leaf or internal node is added. The directory internal and leaf node headers are similar to the root node header, except that they describe up to 128 directory entries. The page header is defined by a dpage_t structure contained in /usr/include/j2/j2_dtree.h.

When more than 8 directory entries are needed a leaf or internal node is added. The directory internal and leaf node headers are similar to root node header except that up to 128 directory entries. The page header is defined by a dpage_t structure contained in /usr/include/j2/j2_dtree.h. Continued on next page

-28 of 36 AIX 5L Internals

Version 20001015 Course materials may not be reporduced in whole or in part without the prior writen permission of IBM.

© Copyright IBM Corp. 2000

Draft Version for review, Sunday, 15. October 2000, j2-1.fm

Guide

Directory -- continued

Directory slot array

The Directory Slot Array (stbl[]) is a sorted array of indices to the directory slots that are currently in use. The entries are sorted alphabetically by name. This limits the amount of shifting necessary when directory entries are added or deleted, since the array is much smaller than the entries themselves. A binary search can be used on this array to search for particular directory entries. In this example the directory entry table contains four files; the stbl table contains the slot numbers of the entries, ordering the entries alphabetically.

[Diagram: a directory entry table with entries "def", "abc", "xyz", and "hij" in slots 1-4, and stbl[8] = {2, 1, 4, 3, 0, 0, 0, 0} listing the slots in alphabetical order (abc, def, hij, xyz).]

. and ..

A directory does not contain specific entries for self (".") and parent (".."). Instead, these are represented in the inode itself. Self is the directory's own inode number, and the parent inode number is held in the "idotdot" field in the header.


Directory -- continued

Growing directory size

As the number of files in the directory grows, the directory tables must increase in size. This table describes the steps used:

Step 1: Initial directory entries are stored in the directory inode in-line data area.

Step 2: When the in-line data area of the directory inode becomes full, JFS2 allocates a leaf node the same size as the aggregate block size.

Step 3: When that initial leaf node becomes full and the leaf node is not yet 4K, double the current size. First attempt to double the extent in place; if there is not room to do this, a new extent must be allocated and the data from the old extent copied to the new extent. The directory slot array will only have been big enough to reference enough slots for the smaller page, so a new slot array has to be created. Use the slots from the beginning of the newly allocated space for the larger array, and copy the old array data to the new location. Update the header to point to this array and add the slots for the old array to the free list.

Step 4: If the leaf node again becomes full and is still not 4K, repeat step 3. Once the leaf node reaches 4K, allocate a new leaf node. Every leaf node after the initial one is allocated as 4K to start.

Step 5: When all entries in a leaf page are free, the page is removed from the B+-tree. When all the entries in the last leaf page are deleted, the directory shrinks back into the directory inode in-line data area.


Directory Examples

Introduction

This section demonstrates how the directory structures change over time.

Small Directories

Initial directory entries are stored in the directory inode in-line data area. Examine this example of a small directory, in which all the directory information fits into the in-line data area:

# ls -ai
69651 .
    2 ..
69652 foobar1
69653 foobar12
69654 foobar3
69655 longnamedfilewithover22charsinitsname

flag: BT_ROOT BT_LEAF  nextindex: 4  freecnt: 3  freelist: 6  idotdot: 2
stbl: {1,2,3,4,0,0,0,0}

slot 1: inumber: 69652  next: -1  namelen: 7   name: foobar1
slot 2: inumber: 69653  next: -1  namelen: 8   name: foobar12
slot 3: inumber: 69654  next: -1  namelen: 7   name: foobar2
slot 4: inumber: 69655  next: 5   namelen: 37  name: longnamedfilewithover2
slot 5: next: -1  cnt: 0  name: 2charsinitsname

Note: the file with a long name has its name split across two slots.


Directory Examples -- continued

Adding a file

An additional file called "afile" is created. The details for this file are added at the next free slot (slot 6). As this is now, alphabetically, the first file in the directory, the search table array (stbl[]) is re-organized so that slot 6 appears in the first entry.

# ls -ai
69651 .
    2 ..
69656 afile
69652 foobar1
69653 foobar2
69654 foobar3
69655 longnamedfilewithover22charsinitsname

flag: BT_ROOT BT_LEAF  nextindex: 5  freecnt: 2  freelist: 7  idotdot: 2
stbl: {6,1,2,3,4,0,0,0}

slot 1: inumber: 69652  next: -1  namelen: 7   name: foobar1
slot 2: inumber: 69653  next: -1  namelen: 8   name: foobar12
slot 3: inumber: 69654  next: -1  namelen: 7   name: foobar2
slot 4: inumber: 69655  next: 5   namelen: 37  name: longnamedfilewithover2
slot 5: next: -1  cnt: 0  name: 2charsinitsname
slot 6: inumber: 69656  next: -1  namelen: 5   name: afile


Directory Examples -- continued

Adding a leaf node

When the directory grows to where there are more entries than can be stored in the in-line data area of the inode, JFS2 allocates a leaf node the same size as the aggregate block size. The in-line entries are moved to the leaf node as illustrated.

[Diagram: the directory inode root (flag: BT_ROOT BT_INTERNAL, nextindex: 1, freecnt: 7, freelist: 2, idotdot: 2, stbl: {1,2,3,4,5,6,7,8}) holds a single internal entry (xd.len: 1, xd.addr2: 52, namelen: 0, name: file0) pointing at leaf node block 52 (flag: BT_LEAF, nextindex: 20, freecnt: 103, freelist: 25, maxslot: 128), whose slots hold the directory entries for file0 (inumber 5), file1 (inumber 6), file10 (inumber 15), ..., file18 (inumber 23), and file19 (inumber 24).]

Once the leaf is full, an internal node is added at the next free in-line data slot in the inode, which will contain the address of the next leaf node. Note: the internal node entry contains the name of the first file (in alphabetical order) for that leaf node.


Directory Examples -- continued

Adding an internal node

Once all the in-line slots have been filled by internal nodes, a separate node block is allocated, the entries from the in-line data slots are moved to this new node, and the first in-line data slot is updated with the address of the new internal node.

[Diagram: the directory inode root (flag: BT_ROOT BT_INTERNAL) points at second-level internal nodes such as block 118 (flag: BT_INTERNAL, maxslot: 128), whose entries point at leaf nodes such as block 52 (flag: BT_LEAF); the leaf node slots hold the directory entries for file0, file1, file10, and thousands of other files (file1017, file4845, file10036, file10052, file13833, file17723, ...).]

After many extra files have been added to the directory, two layers of internal nodes are required to reference all the files. Note that the internal node entries in the inode now contain the name of the alphabetically first entry referenced by each of the second-level internal nodes, and each entry in these references the name of the alphabetically first entry in each leaf node.


Exercise 2 - Directories

Introduction

In this exercise you will use the fsdb utility to examine directory inodes in a JFS2 filesystem.

Small directories

Run fsdb on the sample filesystem. Use the following steps to examine the directory node for /mnt/small.

Step 1: Find the inode for the directory small:
> dir 2

Step 2: Display the inode found in the last step:
> i <inode number>

Step 3: Using the t sub-command, display the directory node root header. Is this header a root, internal, or leaf header?

Step 4: Type <Enter> to display the directory entries. Repeat until all the entries are displayed. How many files are in the directory?

Step 5: Examine the directory slot array stbl[] (displayed in the header). What file name is associated with the first slot entry?

Step 6: Exit fsdb and mount the filesystem:
# mount /mnt

Step 7: Create the file /mnt/small/a:
# touch /mnt/small/a
Predict what the stbl[] table for directory small will look like now.

Step 8: Un-mount the filesystem, run fsdb, and check your prediction.


Exercise 2 - Directories -- continued

Larger directories

In this section you will examine the directory node structures for some larger directories.

Step 1: What is the inode for the directory called medium?

Step 2: Display the inode and look at the root tree header. The flags should indicate that this is an internal header. One entry should be found for each leaf node. Display the entries with the <Enter> key. How many leaf nodes are there?

Step 3: Use the down sub-command to display the first leaf node header. How many entries is this header currently describing? What is the maximum number of entries (files) that can be described by a single leaf node?

Step 4: Examine the big directory and answer the following questions: How many internal and leaf nodes are in big? How many files are in big?


Unit 13. Logical and Virtual File Systems

Objectives

After completing this unit, you should be able to:
• Identify the various components that make up the logical and virtual file systems.
• Use the debugger (kdb/iadb) to display these components.



General File System Interface

Introduction

This lesson covers the interface and services that AIX 5L provides to physical filesystems. The Logical File System (LFS), the Virtual File System (VFS), and the interface between these components and physical file systems are discussed in this lesson.

Supported file systems

Using the structure of the logical file system and the virtual file system, AIX 5L can support a number of different file system types transparently to application programs. These file systems reside below the LFS/VFS and operate relatively independently of each other. Currently AIX 5L supports the following physical filesystem implementations:
• Enhanced Journaled File System (JFS2)
• Journaled File System (JFS)
• Network File System (NFS)
• A CD-ROM file system which supports ISO-9660, High Sierra, and Rock Ridge formats.

Extensible

The LFS/VFS interface also provides a relatively easy means by which third party filesystem types can be added without any changes to the LFS.

Hierarchy

Access to files and directories by a process is controlled by the various layers in the AIX 5L kernel, as illustrated here:
• System calls
• Logical File System (LFS)
• Virtual File System (VFS)
• File System Implementation - support for the individual file system layout.
• Fault Handler - device page fault handler support in the VMM.
• Device Driver - actual device driver code to interface with the device. It is invoked by the page fault handler when the file system implementation code maps the opened file to kernel memory and reads the mapped memory. The LVM is the device driver for the JFS2 and Journaled filesystems.

[Diagram: System Call -> Logical File System -> Virtual File System -> File System Implementation -> Fault Handler -> Device Driver -> Device]


Internal data structures

This illustration shows the major data structures that will be discussed in this lesson. The illustration is repeated throughout the lesson, highlighting the areas being discussed.

[Diagram: the User File Descriptor Table (in the u-block) points into the System File Table (Logical File System); file table entries point to vnodes, which link to gnodes and to the vfs, vmount, gfs, vnodeops and vfsops structures (Virtual File System, Vnode-VFS Interface); the gnode points to the filesystem's inode (File System).]


Logical File System Overview

The Logical File System (LFS) provides a consistent programming interface to applications via the system call interface, with calls such as open(), close(), read() and write(). The LFS breaks down each system call into requests for the underlying file system implementations.

LFS Data Structures

The data structures discussed in this section are the System Open File Table and the User File Descriptor Table. The system open file table has one entry for each open file on the system. The user file descriptor table (one per process) contains an entry for each of the process's open files.

[The data structure illustration is repeated here, highlighting the User File Descriptor Table and the System File Table.]

Operations

The LFS provides a standard set of operations to support the system call interface; its routines manage the open file table entries and the per-process file descriptors. It provides:
• the User File Descriptor Table
• the System File Table
An open file table entry records the authorization of a process's access to a file system object. The LFS abstraction specifies the set of file system operations that an implementation must include in order to carry out logical file system requests. Physical file systems can differ in how they implement these predefined operations, but they must present a uniform interface to the LFS. It supports UNIX-like file system access semantics, but other non-UNIX file systems can also be supported.


User interface

A user can refer to an open file table entry through a file descriptor held in the thread's u-block, or by accessing the virtual memory to which the file was mapped. The file descriptor table entry is created when the file is initially opened via the open() system call, and remains until either the user closes the file via the close() system call, or the process terminates. The LFS is the level of the file system at which users can request file operations by using system calls such as open(), close(), read() and write(). For all these calls (except open()), the file descriptor number is passed as an argument to the call. The system calls implement services that are exported to users, and provide a consistent user-mode programming interface to the LFS that is independent of the underlying file system type. System calls that carry out file system requests:
• Map the user's parameters to a file system object. This requires that the system call component use the vnode (virtual node) component to follow the object's path name. In addition, the system call must resolve a file descriptor or establish implicit (mapped) references using the open file component.
• Verify that a requested operation is applicable to the type of the specified object.
• Dispatch a request to the file system implementation to perform operations.


User File Descriptor Table Description

The user file descriptor table is contained in the user area and is a per-process resource. Each entry references an open file, device, or socket from the process's perspective. The index into the table for a specific file is the value returned by the open() system call when the file is opened - the file descriptor.

Table Management

One or more slots of the file descriptor table are used for each open file. The file descriptor table can extend beyond the first page of the u-block, and is pageable. There is a fixed upper limit of 32,768 open file descriptors per process (defined as OPEN_MAX in /usr/include/sys/limits.h). This value is fixed and may not be changed.
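You can confirm the limit from user space. A minimal check - assuming a standard AIX compiler environment, and using the sysconf() interface alongside the header constant:

#include <stdio.h>
#include <unistd.h>
#include <sys/limits.h>   /* OPEN_MAX; pulled in by <limits.h> on most systems */

int main(void)
{
        /* Compile-time constant from the header... */
        printf("OPEN_MAX      = %d\n", OPEN_MAX);
        /* ...and the value the running system reports. */
        printf("sysconf value = %ld\n", sysconf(_SC_OPEN_MAX));
        return 0;
}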

User File Descriptor Table structure

The user file descriptor table consists of an array of user file descriptor structures, defined in /usr/include/sys/user.h as struct ufd:

struct ufd {
        struct file    *fp;
        unsigned short  flags;
        unsigned short  count;
#ifdef __64BIT_KERNEL
        unsigned int    reserved;
#endif /* __64BIT_KERNEL */
};
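A minimal sketch of the lookup this table supports - the standalone array and the absence of locking and flag checks are simplifications for illustration, not the real kernel code:

#include <stddef.h>

#define OPEN_MAX 32768

struct file;                      /* system file table entry (opaque here) */

struct ufd {                      /* one slot per file descriptor */
        struct file    *fp;       /* -> system file table entry */
        unsigned short  flags;
        unsigned short  count;
};

/* Hypothetical per-process descriptor table, illustrating the mapping
 * performed on every read()/write(): index by fd, follow fp. */
static struct ufd u_ufd[OPEN_MAX];

struct file *fd_to_fp(int fd)
{
        if (fd < 0 || fd >= OPEN_MAX)   /* descriptor out of range */
                return NULL;
        return u_ufd[fd].fp;            /* NULL if the slot is unused */
}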


System File Table Description

The system file table is a global resource, and is shared by all processes on the system. One unique entry is allocated for each unique open of a file, device, or socket in the system.

Table Management

The table is a large array that is only partially initialized; it grows on demand and is never shrunk. Freed entries are added back onto the free list (ffreelist). The table can contain a maximum of 1,000,000 entries; this limit is not configurable.

Table entries

The file table array consists of struct file data elements. Several key members of this data structure are described in this table:

Member    Description
f_count   A reference count detailing the current number of opens on the file. This value is incremented each time the file is opened, and decremented on each close(). Once the reference count is zero, the slot is considered free and may be re-used.
f_flag    Various flags, described in fcntl.h.
f_type    A type field describing the type of file:
          /* f_type values */
          #define DTYPE_VNODE   1   /* file */
          #define DTYPE_SOCKET  2   /* communications endpoint */
          #define DTYPE_GNODE   3   /* device */
          #define DTYPE_OTHER  -1   /* unknown */
f_offset  A read/write pointer.
f_data    Defined as f_up.f_uvnode; a pointer to another data structure representing the object, typically the vnode structure.
f_ops     A structure containing pointers to functions for the following file operations: rw (read/write), ioctl, select, close, fstat.
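The f_count and f_offset behavior is easy to observe from user space. In this illustrative example (any readable file works; /etc/hosts is just a convenient choice), dup() shares one system file table entry, so the offset is shared, while a second open() allocates an independent entry:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int fd1 = open("/etc/hosts", O_RDONLY);  /* first file table entry */
        int fd2 = dup(fd1);                      /* same entry, f_count bumped */
        int fd3 = open("/etc/hosts", O_RDONLY);  /* second, independent entry */

        lseek(fd1, 10, SEEK_SET);

        /* fd1 and fd2 share one f_offset; fd3 has its own. */
        printf("fd2 offset = %ld (shared)\n",      (long)lseek(fd2, 0, SEEK_CUR));
        printf("fd3 offset = %ld (independent)\n", (long)lseek(fd3, 0, SEEK_CUR));

        close(fd1); close(fd2); close(fd3);
        return 0;
}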


file structure

The file table structure is described in /usr/include/sys/file.h:

struct file {
        long            f_flag;           /* see fcntl.h */
        int             f_count;          /* reference count */
        short           f_options;        /* file flags not passed through vnode layer */
        short           f_type;           /* descriptor type */
        union {
                struct vnode *f_uvnode;   /* pointer to vnode structure */
                struct file  *f_unext;    /* next entry in freelist */
        } f_up;
        offset_t        f_offset;         /* read/write character pointer */
        off_t           f_dir_off;        /* BSD style directory offsets */
        union {
                struct ucred *f_cpcred;   /* process credentials at open() */
                struct file  *f_cpqmnext; /* next quick move chunk on free list */
        } f_cp;
        Simple_lock     f_lock;           /* file structure fields lock */
        Simple_lock     f_offset_lock;    /* file structure offset field lock */
        caddr_t         f_vinfo;          /* any info vfs needs */
        struct fileops {
                int     (*fo_rw)();
                int     (*fo_ioctl)();
                int     (*fo_select)();
                int     (*fo_close)();
                int     (*fo_fstat)();
        } *f_ops;
};


Virtual File System Overview

The Virtual File System (VFS) defines a standard set of operations on an entire file system. Operations performed by a process on a file or file system are mapped through the VFS to the file system below. In this way, the process need not know the specifics of different file systems (such as JFS, J2, NFS or CDROM).

Data Structures

The data structures within a virtual file system are:
• vnode - one per file
• gfs - one per filesystem type kernel extension
• vnodeops - one per filesystem type kernel extension
• vfsops - one per filesystem type kernel extension
• vfs - one per mounted filesystem
• vmount - one per mounted filesystem

[The data structure illustration is repeated here, highlighting the VFS structures.]

Functional sections

For the purpose of this lesson the VFS will be broken into three sections and described separately. These sections are:
• Vnode-VFS interface
• File and File System Operations
• The gnode


Vnode/vfs interface Overview

The interface between the logical file system and the underlying file system implementations is referred to as the vnode/vfs interface. This interface provides a logical boundary between generic objects understood at the LFS layer and the file system specific objects that the underlying file system implementation must manage such as inodes and super blocks. The LFS is relatively unaware of the underlying file system data structures since they can be radically different for the various file system types.

Data Structures

Vnodes and vfs structures are the primary data structures used to communicate through the interface (with help from the vmount):
• vnode - represents a file
• vfs - represents a mounted file system
• vmount - contains the specifics of the mount request

[The data structure illustration is repeated here, highlighting the vnode, vfs and vmount structures.]

History

The vnode and vfs structures were created by Sun Microsystems and have evolved into a de facto industry standard, thanks in part to NFS.


Vnodes Overview

The vnode provides a standard set of operations within the file system, and provides system calls with a mechanism for local name resolution. This allows the logical file system to access multiple file system implementations through a uniform name space.

Detail

Vnodes are the primary handles by which the operating system references files, and represent access to an object within a virtual file system. Each time an object (file) within a file system is located (even if it is not opened), a vnode for that object is located (if already in existence), or created, as are the vnodes for any directory that has to be searched to resolve the path to the object. As a file is created, a vnode is also created, and will be re-used for every subsequent reference made to the file by a path name. Every path name known to the logical file system can be associated with, at most, one file system object, and each file system object can have several names because it can be mounted in different locations. Symbolic links and hard links to an object always get the same vnode if accessed through the same mount point.

vnode Management

Vnodes are created by the vfs-specific code when needed, using the vn_get kernel service. Vnodes are deleted with the vn_free kernel service. Vnodes are created as the result of a path resolution.

structure

The vnode structure is defined in /usr/include/sys/vnode.h:

struct vnode {
        ushort         v_flag;
        ulong32int64   v_count;      /* the use count of this vnode */
        int            v_vfsgen;     /* generation number for the vfs */
        Simple_lock    v_lock;       /* lock on the structure */
        struct vfs    *v_vfsp;       /* pointer to the vfs of this vnode */
        struct vfs    *v_mvfsp;      /* pointer to vfs which was mounted over */
                                     /* this vnode; NULL if not mounted */
        struct gnode  *v_gnode;      /* ptr to implementation gnode */
        struct vnode  *v_next;       /* ptr to other vnodes that share same gnode */
        struct vnode  *v_vfsnext;    /* ptr to next vnode on list off of vfs */
        struct vnode  *v_vfsprev;    /* ptr to prev vnode on list off of vfs */
        union v_data {
                void         *_v_socket;   /* vnode associated data */
                struct vnode *_v_pfsvnode; /* vnode in pfs for spec */
        } _v_data;
        char          *v_audit;      /* ptr to audit object */
};


vfs and vmount Description

When a new file system is mounted, vfs and vmount structures are created. The vmount structure contains the specifics of the mount request, such as the object being mounted, and the stub over which it is being mounted. The vfs structure is the connecting structure which links the vnodes (representing files) with the vmount information, and with the gfs structure that helps define the operations that can be performed on the filesystem and its files.

vfs

The vfs structure provides, via the gfs structure, a path to the operations that can be performed on the filesystem and its files. Its key elements:

Element        Description
vfs_next       vfs structures form a linked list, with the first entry addressed by the rootvfs variable, which is private to the kernel.
vfs_gfs        Path back to the gfs structure and its file system specific subroutines, through the vfs_gfs pointer.
vfs_mntd       Points to the vnode within the file system which generally represents the root directory of the file system.
vfs_mntdover   Points to a vnode within another file system, also usually representing a directory, which indicates where the file system is mounted. In this sense, the vfs_mntd pointer corresponds to the object within the vmount structure referenced by the vfs_mdata pointer, and the vfs_mntdover pointer corresponds to the stub within the vmount structure referenced by the vfs_mdata pointer.
vfs_vnodes     Pointer to all vnodes for this file system.
vfs_mdata      Pointer to the vmount structure providing mount information for this filesystem.


vfs structure

The vfs structure is defined in /usr/include/sys/vfs.h:

struct vfs {
        struct vfs     *vfs_next;     /* vfs's are a linked list */
        struct gfs     *vfs_gfs;      /* ptr to gfs of vfs */
        struct vnode   *vfs_mntd;     /* pointer to mounted vnode */
        struct vnode   *vfs_mntdover; /* pointer to mounted-over vnode */
        struct vnode   *vfs_vnodes;   /* all vnodes in this vfs */
        int             vfs_count;    /* number of users of this vfs */
        caddr_t         vfs_data;     /* private data area pointer */
        unsigned int    vfs_number;   /* serial number to help distinguish between */
                                      /* different mounts of the same object */
        int             vfs_bsize;    /* native block size */
        short           vfs_rsvd1;    /* Reserved */
        unsigned short  vfs_rsvd2;    /* Reserved */
        struct vmount  *vfs_mdata;    /* record of mount arguments */
        Simple_lock     vfs_lock;     /* lock to serialize vnode list */
};

vfs Management

The mount helper creates the vmount structure, and calls the vmount subroutine. The vmount subroutine then creates the vfs structure, partially populates it, and invokes the file system dependent vfs_mount subroutine which completes the vfs structure, and performs any operations required internally by the particular file system implementation. There is one vfs structure for each file system currently mounted. New vfs structures are created with the vmount subroutine. This subroutine calls the vfs_mount subroutine found within the vfsops structure for the particular virtual file system type. The vfs entries are removed with the uvmount subroutine. This subroutine calls the vfs_umount subroutine from the vfsops structure for the virtual file system type.
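Conceptually, every mounted filesystem can be reached by walking this list from rootvfs, as kdb's vfs command does. The sketch below uses stand-in structures with the field names shown above; it is illustrative only, since rootvfs is private to the kernel, and the sample mounts in main() are invented:

#include <stdio.h>

/* Minimal stand-ins for the kernel structures described in this lesson. */
struct gfs { int gfs_type; char gfs_name[16]; };
struct vfs {
        struct vfs *vfs_next;   /* linked list of mounted filesystems */
        struct gfs *vfs_gfs;    /* filesystem type of this mount */
        int         vfs_count;  /* number of users of this vfs */
};

/* Walk the mounted-filesystem list: start at rootvfs, follow vfs_next. */
void print_mounts(const struct vfs *rootvfs)
{
        const struct vfs *v;

        for (v = rootvfs; v != NULL; v = v->vfs_next)
                printf("type %-8s users %d\n",
                       v->vfs_gfs->gfs_name, v->vfs_count);
}

int main(void)
{
        struct gfs jfs = { 3, "jfs" }, nfs = { 2, "nfs" };  /* made-up types */
        struct vfs m2  = { NULL, &nfs, 1 };
        struct vfs m1  = { &m2,  &jfs, 4 };

        print_mounts(&m1);
        return 0;
}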

vmount

The vmount structure contains the specifics of the mount request. It is defined in /usr/include/sys/vmount.h:

struct vmount {
        uint    vmt_revision;   /* I revision level, currently 1 */
        uint    vmt_length;     /* I total length of structure & data */
        fsid_t  vmt_fsid;       /* O id of file system */
        int     vmt_vfsnumber;  /* O unique mount id of file system */
        uint    vmt_time;       /* O time of mount */
        uint    vmt_timepad;    /* O (in future, time is 2 longs) */
        int     vmt_flags;      /* I general mount flags */
                                /* O MNT_REMOTE is output only */
        int     vmt_gfstype;    /* I type of gfs, see MNT_XXX above */
        struct vmt_data {
                short vmt_off;  /* I offset of data, word aligned */
                short vmt_size; /* I actual size of data in bytes */
        } vmt_data[VMT_LASTINDEX + 1];
};


File and Filesystem Operations Overview

Each file system type extension provides functions to perform operations on the filesystem and its files. Pointers to these functions are stored in the vfsops (filesystem operations) and vnodeops (file operations) structures.

Data Structures

The data structures discussed in this section are:
• gfs - holds pointers to the vnodeops and vfsops structures
• vnodeops - contains pointers to filesystem dependent operations on files (open, close, read, write, ...)
• vfsops - contains pointers to filesystem dependent operations on the filesystem (mount, umount, ...)

[The data structure illustration is repeated here, highlighting the gfs, vnodeops and vfsops structures.]


gfs Description

There is one gfs structure for each type of virtual file system currently installed on the machine. For each gfs entry, there may be any number of vfs entries.

Purpose

The operating system uses the gfs entries as an access point to the virtual file system functions on a type-by-type basis. There is no direct link from a gfs entry to all of the vfs entries of a particular gfs type. The file system code generally uses the gfs structure as a pointer to the vnodeops structure and the vfsops structure for a particular type of file system.

gfs management

The gfs structures are stored within a global array accessible only by the kernel. The gfs entries are inserted with the gfsadd() kernel service, and only one gfs entry of a given gfs_type can be inserted into the array. Generally, gfs entries are added by the CFG_INIT section of the configuration code of the file system kernel extension. The gfs entries are removed with the gfsdel() kernel service. This is usually done within the CFG_TERM section of the configuration code of the file system kernel extension.
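A sketch of how a filesystem extension's configuration entry point might register and unregister its gfs. The "myfs" names and type number are invented, and the entry point shape follows the conventional sysconfig style; treat it as an illustration under those assumptions rather than a definitive implementation:

#include <sys/types.h>
#include <sys/errno.h>
#include <sys/device.h>   /* CFG_INIT, CFG_TERM */
#include <sys/gfs.h>

#define MYFS_GFS_TYPE 16              /* hypothetical filesystem type number */
extern struct gfs myfs_gfs;           /* filled in with name, type, ops... */

int
myfs_config(dev_t devno, int cmd, struct uio *uiop)
{
        switch (cmd) {
        case CFG_INIT:                        /* extension load time */
                return gfsadd(MYFS_GFS_TYPE, &myfs_gfs);
        case CFG_TERM:                        /* extension unload time */
                return gfsdel(MYFS_GFS_TYPE);
        default:
                return EINVAL;
        }
}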

gfs structure

The gfs structure is defined in /usr/include/sys/gfs.h:

struct gfs {
        struct vfsops   *gfs_ops;
        struct vnodeops *gn_ops;
        int              gfs_type;      /* type of gfs (from vmount.h) */
        char             gfs_name[16];  /* name of vfs (eg. "jfs","nfs") */
        int            (*gfs_init)();   /* ( gfsp ) - if ! NULL, */
                                        /* called once to init gfs */
        int              gfs_flags;     /* flags for gfs capabilities */
        caddr_t          gfs_data;      /* gfs private config data */
        int            (*gfs_rinit)();
        int              gfs_hold;      /* count of mounts */
};


vnodeops Description

The vnodeops structure contains pointers to the filesystem dependent operations that can be performed on the vnode, such as link, mkdir, mknod, open, close and remove.

vnodeops management

There is one vnodeops structure per filesystem kernel extension loaded (i.e. one per unique filesystem type); it is initialized when the extension is loaded.

vnodeops structure

This structure is defined in /usr/include/sys/vnode.h. Due to the size of this structure, only a few lines are detailed below:

struct vnodeops {
        /* creation/naming/deletion */
        int (*vn_link)(struct vnode *, struct vnode *, char *, struct ucred *);
        int (*vn_mkdir)(struct vnode *, char *, int32long64_t, struct ucred *);
        int (*vn_mknod)(struct vnode *, caddr_t, int32long64_t, dev_t, struct ucred *);
        int (*vn_remove)(struct vnode *, struct vnode *, char *, struct ucred *);
        int (*vn_rename)(struct vnode *, struct vnode *, caddr_t, struct vnode *, struct vnode *, caddr_t, struct ucred *);
        int (*vn_rmdir)(struct vnode *, struct vnode *, char *, struct ucred *);

        /* lookup, file handle stuff */
        int (*vn_lookup)(struct vnode *, struct vnode **, char *, int32long64_t, struct vattr *, struct ucred *);
        int (*vn_fid)(struct vnode *, struct fileid *, struct ucred *);

        /* access to files */
        int (*vn_open)(struct vnode *, int32long64_t, ext_t, caddr_t *, struct ucred *);
        int (*vn_create)(struct vnode *, struct vnode **, int32long64_t, caddr_t, int32long64_t, caddr_t *, struct ucred *);
        ...
};


vfsops Description

The vfsops structure contains pointers to the filesystem dependent operations that can be performed on the vfs, such as mount, unmount or sync.

vfsops management

There is one vfsops structure per filesystem kernel extension loaded (i.e. one per unique filesystem type); it is initialized when the extension is loaded.

vfsops structure

This structure is defined in /usr/include/sys/vfs.h:

struct vfsops {
        /* mount a file system */
        int (*vfs_mount)(struct vfs *, struct ucred *);
        /* unmount a file system */
        int (*vfs_unmount)(struct vfs *, int, struct ucred *);
        /* get the root vnode of a file system */
        int (*vfs_root)(struct vfs *, struct vnode **, struct ucred *);
        /* get file system information */
        int (*vfs_statfs)(struct vfs *, struct statfs *, struct ucred *);
        /* sync all file systems of this type */
        int (*vfs_sync)();
        /* get a vnode matching a file id */
        int (*vfs_vget)(struct vfs *, struct vnode **, struct fileid *, struct ucred *);
        /* do specified command to file system */
        int (*vfs_cntl)(struct vfs *, int, caddr_t, size_t, struct ucred *);
        /* manage file system quotas */
        int (*vfs_quotactl)(struct vfs *, int, uid_t, caddr_t, struct ucred *);
};
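As an illustration, a hypothetical "myfs" extension could populate its vfsops with positional initializers matching the declaration order above. Every myfs_* function is an invented placeholder, not a real AIX routine:

#include <sys/types.h>
#include <sys/vfs.h>      /* struct vfs, struct vfsops */
#include <sys/vnode.h>    /* struct vnode */

static int myfs_mount(struct vfs *, struct ucred *);
static int myfs_unmount(struct vfs *, int, struct ucred *);
static int myfs_root(struct vfs *, struct vnode **, struct ucred *);
static int myfs_statfs(struct vfs *, struct statfs *, struct ucred *);
static int myfs_sync(void);
static int myfs_vget(struct vfs *, struct vnode **, struct fileid *, struct ucred *);
static int myfs_cntl(struct vfs *, int, caddr_t, size_t, struct ucred *);
static int myfs_quotactl(struct vfs *, int, uid_t, caddr_t, struct ucred *);

struct vfsops myfs_vfsops = {
        myfs_mount,       /* vfs_mount    */
        myfs_unmount,     /* vfs_unmount  */
        myfs_root,        /* vfs_root     */
        myfs_statfs,      /* vfs_statfs   */
        myfs_sync,        /* vfs_sync     */
        myfs_vget,        /* vfs_vget     */
        myfs_cntl,        /* vfs_cntl     */
        myfs_quotactl     /* vfs_quotactl */
};

The gfs_ops member of the extension's gfs structure would then point at myfs_vfsops, so the LFS can dispatch mount-level requests to it.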


The Gnode Introduction

A gnode represents an object in a file system implementation, and serves as the interface between the logical file system and the file system implementation. There is a one-to-one correspondence between a gnode and an object in a file system implementation.

Overview

Each filesystem implementation is responsible for allocating and destroying gnodes. Calls to the file system implementation serve as requests to perform an operation on a specific gnode. A gnode is needed, in addition to the file system inode, because some file system implementations may not include the concept of an inode. Thus the gnode structure substitutes for whatever structure the file system implementation may have used to uniquely identify a file system object. The logical file system relies on the file system implementation to provide valid data for the following fields in the gnode:
• gn_type - identifies the type of object represented by the gnode
• gn_ops - identifies the set of operations that can be performed on the object

Creation

A gnode refers directly to a file (regular, directory, special, and so on), and is usually embedded within a file system implementation-specific structure (such as an inode). Gnodes are created as needed by file system specific code, at the same time as the implementation-specific structures are created. This is normally immediately followed by a call to the vn_get kernel service to create a matching vnode, as in the sketch below. The gnode structure is usually deleted either when the file it refers to is deleted, or when the implementation-specific structure is being reused for another file.
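A sketch of that sequence for a hypothetical filesystem. Note that the myfs names are invented and the vn_get argument order shown here is an assumption, not the documented signature - consult the kernel services documentation before relying on it:

#include <sys/types.h>
#include <sys/vnode.h>    /* struct gnode, struct vnode, VREG */

extern struct vnodeops myfs_vnodeops;

/* Hypothetical in-core inode with the gnode embedded at the top,
 * as in the figure on the next page. */
struct myfs_inode {
        struct gnode i_gnode;
        /* ... filesystem-specific fields ... */
};

/* Pair a freshly built in-core inode with a vnode. */
int
myfs_attach_vnode(struct vfs *vfsp, struct myfs_inode *ip, struct vnode **vpp)
{
        ip->i_gnode.gn_type = VREG;            /* the LFS relies on gn_type */
        ip->i_gnode.gn_ops  = &myfs_vnodeops;  /* ...and on gn_ops */
        ip->i_gnode.gn_data = (caddr_t)ip;     /* back-pointer to the inode */

        return vn_get(vfsp, &ip->i_gnode, vpp);  /* create matching vnode */
}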


gnode and inode

The gnode is typically embedded in an in-core inode. The member gnode->gn_data points back to the start of the inode.

[Diagram: an in-core inode with the gnode embedded inside it; gnode->gn_data points to the start of the inode.]

Structure

The gnode structure is defined in /usr/include/sys/vnode.h:

struct gnode {
        enum vtype       gn_type;         /* type of object: VDIR,VREG etc */
        short            gn_flags;        /* attributes of object */
        ulong            gn_seg;          /* segment into which file is mapped */
        long32int64      gn_mwrcnt;       /* count of map for write */
        long32int64      gn_mrdcnt;       /* count of map for read */
        long32int64      gn_rdcnt;        /* total opens for read */
        long32int64      gn_wrcnt;        /* total opens for write */
        long32int64      gn_excnt;        /* total opens for exec */
        long32int64      gn_rshcnt;       /* total opens for read share */
        struct vnodeops *gn_ops;
        struct vnode    *gn_vnode;        /* ptr to list of vnodes per this gnode */
        dev_t            gn_rdev;         /* for devices, their "dev_t" */
        chan_t           gn_chan;         /* for devices, their "chan", minor's minor */
        Simple_lock      gn_reclk_lock;   /* lock for filocks list */
        int              gn_reclk_event;  /* event list for file locking */
        struct filock   *gn_filocks;      /* locked region list */
        caddr_t          gn_data;         /* ptr to private data (usually contiguous) */
};
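Since gn_data points back to the start of the private structure, recovering the inode from a gnode needs no pointer arithmetic. A hypothetical helper (the macro name is invented for illustration):

/* Given a gnode, recover the containing in-core inode. */
#define GNODE_TO_INODE(gnp)  ((struct inode *)((gnp)->gn_data))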


Exercise 1 Overview

This exercise will test your knowledge of the data structures of the LFS and VFS, and the relationships between them.

lab

Use the following list of terms to best complete the statements below:

File    vfs    File system    vnodeops    System File Table    vmount

1. A vnode represents a ______________.
2. A vfs represents a _____________.
3. The gfs contains pointers to the vfsops and the _____________.
4. The ___________ structure contains specifics about a mount request.
5. The ____________ has one entry for each open file on the system.

Answer the following two questions by completing this diagram as directed.

[Diagram to complete: the data structure illustration, with the blocks for the vnode, vmount and gfs structures left unlabeled.]

6. Label the blocks representing the vnode, vmount and gfs structures.
7. Draw a line representing the file pointer in the ufd to an entry in the system file table.


Lab Exercise 1 Overview

In the following exercise you will run a small C program that opens a file, initializes it by writing a few bytes to it, then pauses. The pause allows us to investigate the various LFS structures that are created by opening the file, using the appropriate system debugger.

The program

The C code for the example is:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

main()
{
        int fd;

        fd = open("foo", O_RDWR | O_CREAT, 0644);
        write(fd, "abcd", 4);
        close(fd);
        fd = open("foo", O_RDONLY);
        printf("fd = %d\n", fd);
        pause();
}

The close() then open() is required to ensure that the write is committed to disk, and hence that the inode is updated. Save this code to a file called t.c, and compile it using "make t".

Lab

Follow the steps in the table below.

Stage 1: Enter the C program from above, save it to a file called t.c, and compile it with the command:
$ make t

Stage 2: Execute the program created in the last step. It will print the file descriptor number of the file it creates, then pause.
$ ./t
fd = 3

Stage 3: From another shell on the same system, enter the system debugger (kdb or iadb).


Stage 4: Initially, we need to find the address of the file structure for the open file. We know that the file descriptor for our program is number 3, so we have to find the mapping between the file descriptor number and the file structure. This mapping is done from the file descriptor table in the uarea structure for the process. To find the uarea, find the slot number in the thread table that our "t" process occupies; the uarea slot number will be the same. For kdb, use the "th *" command to display all the threads. Page down through the entries until you find the correct entry:

(0)> th *
                 SLOT NAME      STATE     TID  PRI  RQ  CPUID  CL  WCHAN
pvthread+000000     0 swapper   SLEEP  000003  010   1             0
...
pvthread+001D00    55 t         SLEEP  003A39  03C   1             0
...

Stage 5: Now use the command "uarea" on this thread slot number to view the user area (which contains the file descriptor table), and page down through the output until you find the "File descriptor table":

(0)> u 55
...
File descriptor table at..F00000002FF3CEC0:
fd 0  fp..F100009600007430  count..00000000  flags. ALLOCATED
fd 1  fp..F100009600007430  count..00000000  flags. ALLOCATED
fd 2  fp..F100009600007430  count..00000000  flags. ALLOCATED
fd 3  fp..F100009600007700  count..00000000  flags. ALLOCATED
Rest of File Descriptor Table empty or paged out.
...


Stage 6: The file structure for file descriptor 3 is at address F100009600007700. Use the "file" command along with this address to display the contents of the structure (the DATA field of the entry points to its gnode):

(0)> file F100009600007700
gnode F10000971528A3F8

GNODE............ F10000971528A3F8  KERN_heap+528A3F8
gn_type....... 00000001   gn_flags...... 00000000
gn_seg........ 00000000000078AD
gn_mwrcnt..... 00000000   gn_mrdcnt..... 00000000
gn_rdcnt...... 00000001   gn_wrcnt...... 00000000
gn_excnt...... 00000000   gn_rshcnt..... 00000000
gn_ops........ 00000000003D7DC8  jfs_vops
gn_vnode...... F10000971528A380
gn_rdev....... 8000000A00000008  gn_chan....... 00000000
gn_reclk_event 00000000FFFFFFFF
gn_reclk_lock@ F10000971528A440  gn_reclk_lock. 0000000000000000
gn_filocks.... 0000000000000000  gn_data....... F10000971528A3D8
gn_type....... REG


Step 9: The inode address is contained in the gn_data field, in this case F10000971528A3D8. Use the kdb command "inode" to display this structure:

(0)> inode F10000971528A3D8
[Output: the in-core inode (device number, user counts, dquot, cluster and size information) and its dinode fields (di_gen, di_mode, di_nlink, di_acct, di_uid, di_gid, di_nblocks, di_acl, di_mtime, di_atime, di_ctime, di_size_hi, di_size_lo, di_rdaddr, di_vindirect, di_rindirect, di_privoffset, di_privflags), followed by the GNODE and VNODE displays.]


Step 10: The inode command displays the inode, gnode and vnode structures. The number member in the inode structure should contain the inode number (in hex) of the file foo. Verify that this inode number matches the inode number displayed by the command:

$ ls -lia foo

Don't forget to convert the inode number from hex to decimal.

Step 11: The dev field displays the major and minor number of the logical volume holding the filesystem. For example:
64-bit systems: 8000000A00000007 -> major=10 minor=7
32-bit systems: 000A0007 -> major=10 minor=7
Verify this number with the command:

$ ls -lia /dev/
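The conversion can also be scripted. The following small program extracts the major and minor numbers; the field split is an assumption inferred from the 64-bit example above, not a documented layout:

#include <stdio.h>

int main(void)
{
        /* The 64-bit dev value from the kdb example above. */
        unsigned long long dev = 0x8000000A00000008ULL;

        /* Assumed split: major in the low bits of the upper word,
         * minor in the low 32 bits. */
        unsigned int major = (unsigned int)((dev >> 32) & 0x7fffffff);
        unsigned int minor = (unsigned int)(dev & 0xffffffff);

        printf("major=%u minor=%u\n", major, minor);   /* major=10 minor=8 */
        return 0;
}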


Lab Exercise 2 Overview

The instructor will create a simple shell script that prints its process id and then pauses. Both the "ps" command and the process and thread table entries for this script will list the name of the program as the name of the shell executing it (e.g. "ksh").

Objective

To determine the name of the script that the instructor is running.

Tips

• Remember that the shell will have to open() the script prior to executing it.
• The command find . -inum xxx can be used to find the name of a file given the filesystem name and an inode number.


Unit 14. AIX 5L boot

Objectives

After completing this unit, you should be able to:
• List and locate boot components and their usage
• Understand the 3 phases of rc.boot
• Understand the contents and usage of a RAMFS
• Understand the ODM structure and the usage of ODM classes
• Create new boot images
• Debug boot problems


What is boot Definition

Boot is the process that begins when the computer is powered up and continues until the entries in /etc/inittab have been processed.

ROS process

System ROS (Read Only Storage) contains firmware, independent of the operating system, which initializes the hardware and loads AIX. All platforms except RS6K use an intermediate boot process:
• Softros (/usr/lib/boot/aixmon_chrp) for CHRP systems
• Softros (/usr/lib/boot/aixmon_rspc) for RSPC systems
• Boot loader (/usr/lib/boot/boot_elf) for IA-64 systems

AIX process

AIX begins execution after the system ROS firmware or the intermediate boot process finishes its execution:
• sets up firmware information
• kernel initialization
• RAM filesystem based configuration
• control is passed to files based in the permanent filesystem (this may be a disk or network filesystem)
• /etc/inittab entries are processed; this usually includes enabling the user login process


Various Types of boot Devices

AIX can boot from the following types of devices:
• hard disk boot
• CD-ROM boot
• tape boot (not supported on the IA-64 platform)
• network boot

Configuration

The boot process can use one of the following boot configurations:
• standalone
• diskless/dataless (not supported on the IA-64 platform)
• operating system installation/software maintenance
• diagnostics

Hard disk boot

The hard disk boot has the following characteristics:
• the boot image resides on the hard disk
• the RAM filesystem contains the files necessary for configuring the hard disk(s), and then accessing the filesystems that reside in the root volume group (rootvg)
• this is the most common system configuration
• these types of systems are also known as "standalone" systems
• these types of systems may also be booted into the diagnostics functions

CDROM boot

The CDROM boot may be used in the following situations:
• operating system installation
• diagnostics
• hard disk boot failure recovery/maintenance


Tape boot

The tape boot device can be used for:
• operating system installation
• hard disk boot failure recovery/maintenance
The tape boot device is usually used for creating bootable system backups. Tape boot is not supported on the IA-64 platform.

Network boot

The network boot can be used for the following purposes:
• boot and install the operating system - the operating system is installed on a hard disk with NIM, and subsequent boots are from the hard disk
• supported diskless/dataless configurations
• diagnostics
• hard disk boot failure recovery/maintenance
Centralized boot/filesystem servers offer convenient administration.


Systems types and Kernel images System Types

There are four basic hardware architecture types:
• RS6K - the "classic" IBM workstation
• RSPC - the PowerPC Reference Platform workstation
• CHRP - Common Hardware Reference Platform
• IA-64 - Intel IA-64 Platform

boot images types

There are three corresponding types of boot images:
• The RS6K uses a hardware ROS to build the IPL Control Block
• The RSPC and CHRP use a SOFTROS to build the IPL Control Block
• The IA-64 uses an EFI boot loader to build the IPL Control Block

kernel types

There are four types of kernels loaded:
• 32-bit Power UP (/unix -> /usr/lib/boot/unix_up)
• 32-bit Power MP (/unix -> /usr/lib/boot/unix_mp)
• 64-bit Power (/unix -> /usr/lib/boot/unix_64)
• 64-bit IA-64 (/unix -> /usr/lib/boot/unix_ia64)


RAMFS and prototype files Introduction

In order to successfully boot a system, the AIX kernel needs basic commands, configuration files, kernel extensions and device drivers so that it can configure a minimum environment. All the files needed are included in the RAMFS, which is built with the following command (the prototype file is described below):
mkfs -V jfs -p <prototype file>

prototype files description

A prototype file is a list of files and file descriptions needed to create a RAMFS. A prototype file entry has the following format:

<name> <type> <perm> 0 0 <param>

Where:
• <name> is the name of the file, directory, link or device as it will be written to the RAMFS
• <type> defines the type of the entry and can be:
  - d--- : a directory entry (this will change the relative path of the following entries)
  - l--- : a link (the target will be listed in the <param> parameter)
  - b--- : a block device (the <param> parameter will represent the major and minor numbers)
  - c--- : a character device (the <param> parameter will represent the major and minor numbers)
  - ---- : a file
• <perm> represents the file permissions in octal format
• <param> depends on the <type> as described above
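Purely as an illustration of this format - every name and major/minor pair below is invented - a prototype file fragment might look like:

dev        d--- 755 0 0
console    c--- 622 0 0 21,0
hdisk0     b--- 660 0 0 23,1
unix       l--- 777 0 0 /usr/lib/boot/unix_mp
etc        d--- 755 0 0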


prototype file types

Prototype files are divided into several groups according to their specific use:
• Prototype files located in /usr/lib/boot are the base prototypes used for a platform according to the boot device type, and come with the platform base system device fileset
• Prototype files located in /usr/lib/boot/network are specific to any general kind of network boot device, and come with the platform base system device fileset
• Prototype files located in /usr/lib/boot/protoext are used for any specific type of boot device, and come with the device-specific fileset


Boot Image Creation Introduction

In order to successfully boot from a device, the administrator will need to run commands that will create the boot structure.

bosboot command

The bosboot command is the most commonly used on AIX, because it manages all verification tasks and environment setup for the administrator. The administrator can also use the mkboot command, but should then take care of all these preliminary checks himself. The bosboot command is also used by other commands, like mksysb, or by the installp post-installation process when installing packages that need to build a new boot image.

bosboot process overview

The bosboot command will do the following:
• set up the execution environment
• parse command line arguments
• verify syntax and arguments
• point to platform specific files (like mkboot_chrp or aixmon_rspc)
• check for space needed in /tmp and, if needed, in the destination filesystem
• create a RAMFS if requested, using mkfs and proto files
• create a boot image and a boot record if requested, using the appropriate mkboot command
• copy the boot image and savebase to the boot device if requested
• clean up the execution environment


bosboot parameters

The most commonly used bosboot command is:
# bosboot -a -d /dev/hdisk0
For example, if you need to load and invoke the kernel debugger you can use:
# bosboot -a -I -d /dev/hdisk0
The following table lists the bosboot parameters that can be used:

argument            description
-a                  Create complete boot image and device.
-w file             Copy given boot image file to device.
-r file             Create ROS Emulation boot image.
-d device           Device for which to create the boot image.
-U                  Create uncompressed boot image.
-p proto            Use given proto file for RAM disk file system.
-k kernel           Use given kernel file for boot image.
-l lvdev            Target boot logical volume for boot image.
-b file             Use given file name for boot image name.
-D                  Load Low Level Debugger.
-I                  Load and Invoke Low Level Debugger.
-L                  Enable MP locks instrumentation (MP kernels).
-M norm|serv|both   Boot mode - normal or service.
-O offset           Boot image offset for CDROM file system.
-q                  Query size required for boot image.


AIX 5L Distributions Introduction

AIX 5L will be delivered in two separate distributions:
• One for Power systems
• One for Intel IA-64 systems

Power CDROM Distributions

The distribution CDROM that IBM provides to our customers has three boot images. There is a boot image for the RS6K computers, a second for the RSPC computers, and a third for CHRP (/ppc/chrp/bootfile.exe). The RS6K, RSPC, and CHRP UP computers can use the MP Kernel, which is the method implemented for distribution media that goes to our customers. In other words, when a customer receives boot/install media from IBM, there is no need to determine whether the system is UP or MP. This boot image is created using the MP kernel. The UP kernel is more efficient for uniprocessor systems, but the strategy of a single boot image for both hardware platform types lowers distribution cost, and is more convenient for our customers.

IA-64 CDROM Distributions


Checkpoint Introduction

Take a few minutes to answer the following questions. We will review the questions as a group when everyone has finished.

Quiz

1. What is the name of the file used as a SOFTROS on CHRP systems?

2. Does IA-64 support a 32-bit kernel?

3. What are the common functions of the ROS, SOFTROS and EFI boot loader?

4. List the 4 platforms supported by AIX 5L.

5. What is the purpose of the RAMFS?

6. How do you create a RAMFS?


Instructor Notes Purpose

Notes on Quiz and transition to the next section

Quiz responses

The responses for the Quiz are:
1. What is the name of the file used as a SOFTROS on CHRP systems?
   /usr/lib/boot/aixmon_chrp
2. Does IA-64 support a 32-bit kernel? No.
3. What are the common functions of the ROS, SOFTROS and EFI boot loader?
   Create the IPLCB and load the kernel.
4. List the 4 platforms supported by AIX 5L: RS6K, RSPC, CHRP, IA-64.
5. What is the purpose of the RAMFS?
   To provide basic commands, configuration files, kernel extensions and device drivers in order to bring up a minimum environment.
6. How do you create a RAMFS?
   Using mkfs and prototype files.

Transition Statement

Now we will describe:
• the Power-specific boot process, if this is a Power course
• the IA-64-specific boot process, if this is an IA-64 course


The Power Boot Mechanism Introduction

This section explains the boot mechanism used by Power family systems.

Boot overview

When the system is powered on, the ROS or the firmware looks for the boot record on the device indicated by the bootlist, to find the boot entry point. On RSPC and CHRP, the softros executes and, if needed, uncompresses the boot image using the bootexpand process. It then loads the kernel, which initializes itself. The kernel then calls init (in fact /usr/lib/boot/ssh at this stage). The ssh then calls rc.boot for PHASE I and PHASE II, which are specific to each boot device type. Then init executes rc.boot PHASE III and the remaining common code in rc.boot for disk and network boot devices.

Boot diagram

The following outlines the high-level boot process:
1. Execution of the system ROS or firmware; the boot record is read from the boot device.
2. On RSPC or CHRP systems: execution of the softros.
3. If the boot image is compressed: execution of bootexpand.
4. Kernel initialization; the kernel calls init (/usr/lib/boot/ssh).
5. init (ssh) calls rc.boot PHASE I & II.
6. init exits to newroot; init calls rc.boot PHASE III from inittab and processes the rest of the inittab entries.


Power boot disk layout Boot image overview

The following describes the layout of a Power boot disk. The boot record occupies the first block of the disk; the boot logical volume (hd5) contains the softros (CHRP and RSPC only), bootexpand, the compressed kernel, the compressed RAM filesystem and the base customized data. The VGDA and the rest of the disk follow.

[Diagram: boot disk with the bootrecord, hd5 (softros, bootexpand, compressed kernel, compressed RAM filesystem, base customized data), the VGDA and the rest of the boot disk.]

bootrecord

512 byte block containing size and location of the boot image. The boot record is the first block on a disk or cdrom and is therefore separated from the boot image. The boot image on a disk is placed in the boot logical volume which is a reserved contiguous area.

softros

The RSPC and CHRP platforms use a SOFTROS program (/usr/lib/boot/aixmon_rspc or /usr/lib/boot/aixmon_chrp) that performs system initialization for AIX that the hardware firmware in ROS does not provide, such as appending device information to the IPL control block.


bootexpand

Program to expand the compressed boot image; it is executed before control is passed to the kernel. Compression of the boot image is optional, but it is the default, since the compressed image is less than half the size of an uncompressed image and requires less time to load from the media.

kernel

The AIX 32-bit UP, 32-bit MP or 64-bit MP kernel, to which control passes after expansion by bootexpand. The kernel initializes itself and then passes control to the simple shell init (ssh) in the RAM filesystem.

RAM filesystem

Filesystem used during the boot process that contains programs and data for initializing devices and subsystems, in order to install AIX, execute diagnostics, or access and bring up the rest of AIX.

base customized data

Area of the hard disk boot logical volume containing user-configured ODM device configuration information that is used by the system configuration process.


AIX 5L Power boot record Introduction

On Power systems, the boot record is located at the beginning of the boot device and contains the following information:
• The IPL record
• The boot partition table used by CHRP and RSPC systems

IPL record description

The following table describes the content of the boot record:

size  offset  name              description
4     0       IPL_record_id     This physical volume contains a valid IPL record if and only if this field contains IPLRECID in EBCDIC ('IBMA').
20    4       reserved1
4     24      formatted_cap     Formatted capacity. The number of sectors available after formatting.
1     28      last_head         THIS IS DISKETTE INFORMATION. The number of heads minus 1.
1     29      last_sector       THIS IS DISKETTE INFORMATION. The number of sectors per track.
6     30      reserved2
4     36      boot_code_length  Boot code length in sectors. A 0 value implies no boot code present.
4     40      boot_code_offset  Boot code offset. Must be 0 if no boot code present, else contains the byte offset from the start of boot code to the first instruction.
4     44      boot_lv_start     Contains the PSN of the start of the BLV.
4     48      boot_prg_start    Boot code start. Must be 0 if no boot code present, else contains the PSN of the start of boot code.
4     52      boot_lv_length    BLV length in sectors.
4     56      boot_load_add     512 byte boundary load address for boot code.
1     60      boot_frag         0x1 => fragmentation allowed.


IPL record description continued

size  offset  name             description
1     61      boot_emulation   0x1 => ROS network emulation code.
2     62      reserved3
2     64      basecn_length    Number of sectors for base customization. Normal mode.
2     66      basecs_length    Number of sectors for base customization. Service mode.
4     68      basecn_start     Starting PSN value for base customization. Normal mode.
4     72      basecs_start     Starting PSN value for base customization. Service mode.
24    76      reserved4
4     100     ser_code_length  Service code length in sectors. A 0 value implies no service code present.
4     104     ser_code_offset  Service code offset. 0 if no service code is present, else contains the byte offset from the start of service code to the first instruction.
4     108     ser_lv_start     Contains the PSN of the start of the SLV.
4     112     ser_prg_start    Service code start. Must be 0 if service code is not present, else contains the PSN of the start of service code.
4     116     ser_lv_length    SLV length in sectors.
4     120     ser_load_add     512 byte boundary load address for service code.
1     124     ser_frag         Service code fragmentation flag. Must be 0 if no fragmentation allowed, else must be 0x01.
1     125     ser_emulation    ROS network emulation flag.
2     126     reserved5
8     128     pv_id            The unique identifier for this PV.
376   136     dummy            Includes the partition table.


boot partition table

The boot record contains 4 partition table entries starting at offset 0x1be. Each entry contains the following information:

size in bytes  name      description
1              boot_ind  Boot indicator
1              begin_h   Begin head
1              begin_s   Begin sector
1              begin_c   Begin cylinder
1              syst_ind  System indicator
1              end_h     End head
1              end_s     End sector
1              end_c     End cylinder
4              RBA       Relative block address, in little endian format
4              sectors   Number of sectors, in little endian format

The RS6K platform doesn't use a boot partition table. The four boot partition table entries are used for:
• CHRP boot images
• CHRP and first RSPC boot image
• CHRP and second RSPC boot image
• CHRP third RSPC boot image


AIX 5L Power boot record -- continued

Example

The following chart represents an AIX 5L boot record from a CHRP system. It was obtained using:

# od -Ax -x /dev/hdisk0 | pg

0000000  c9c2 d4c1 0000 0000 0000 0000 0000 0000    <- 'IBMA' (EBCDIC)
0000010  0000 0000 0000 0000 0000 0000 0000 0000
0000020  0000 0000 0000 2cc1 0000 0000 0000 1100    <- boot_code_len, boot_lv_start
0000030  0000 0000 0000 0000 0000 0000 0000 0000
0000040  0100 0100 0000 3cdc 0000 3cdc 0000 0000    <- base_cn_length, base_cs_length, base_cn_start, base_cs_start
0000050  0000 0000 0000 0000 0000 0000 0000 0000
0000060  0000 0000 0000 2cc1 0000 0000 0000 1100    <- serv_code_length, ser_lv_start
0000070  0000 0000 0000 0000 0000 0000 0000 0000
0000080  0007 1483 229d 0662 0000 0000 0000 0000    <- PVID
0000090  0000 0000 0000 0000 0000 0000 0000 0000

00001b0  0000 0000 0000 0000 0000 0000 0000 0000
00001c0  0000 0000 0000 0000 0000 0000 0000 80ff    <- boot_partition_table starts at 0x1be
00001d0  ffff 41ff ffff 1b11 0000 c12c 0000 00ff    <- RBA (1b11 0000) and sectors (c12c 0000)
00001e0  ffff 41ff ffff 0211 0000 1900 0000 80ff
00001f0  ffff 41ff ffff 1b11 0000 c12c 0000 55aa    <- BOOT_SIGNATURE (55aa at 0x1fe)
0000200  4182 000c 3880 0000 4800 000c 7c83 2378
0000210  7ca4 2b78 83c3 0098 7fde 1814 83de 0034
0000220  57de 063e 2c1e 0057 4182 0024 2c1e 0058
0000230  4182 001c 2c1e 0059 4182 0014 2c1e 0072
0000240  4182 000c 2c1e 0082 4082 0030 83c3 0288
0000250  7fde 1814 83de 006c 2c1e 0000 4182 001c
0000260  3fc0 8000 7fcf 01a4 3fc0 f000 83fe 10c0
0000270  67ff 0080 93fe 10c0 31ad ffd8 30c3 0080


Instructor Notes

Purpose

Notes on Power boot record

Little endian format

The RBA and sectors fields of the boot partition table are stored in little-endian format. To obtain the actual address, you need to swap the bytes as they are displayed by the od command.
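A minimal C sketch of this conversion, assuming a 16-byte partition table entry has already been read from the boot record (the entry bytes below are hypothetical, taken from the example dump earlier in this section):

#include <stdio.h>

/* assemble a 32-bit value from 4 little-endian bytes,
 * least-significant byte first */
static unsigned long le32(const unsigned char *p)
{
    return (unsigned long)p[0] | ((unsigned long)p[1] << 8) |
           ((unsigned long)p[2] << 16) | ((unsigned long)p[3] << 24);
}

int main(void)
{
    /* hypothetical partition table entry bytes from an od dump */
    unsigned char entry[16] = { 0x80, 0xff, 0xff, 0xff, 0x41, 0xff, 0xff, 0xff,
                                0x1b, 0x11, 0x00, 0x00, 0xc1, 0x2c, 0x00, 0x00 };

    printf("RBA     = 0x%lx\n", le32(entry + 8));   /* -> 0x111b  */
    printf("sectors = 0x%lx\n", le32(entry + 12));  /* -> 0x2cc1  */
    return 0;
}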


Power boot images structures

Introduction

Depending on the architecture, the boot image will not always contain the same elements, due to the needs of the ROS and Firmware specifications.

RS6K boot image

The rs6k platform doesn't need a softros emulation, so the boot image starts with the bootexpand program. bootexpand is loaded first to uncompress the kernel and the RAMFS.

RSPC boot image

On rspc, the aixmon_rspc softros is located at the beginning of the boot image, but the XCOFF format is replaced by a hints structure as defined in /usr/include/sys/boot.h. So an RSPC boot image contains the following sections:
• The hints structure
• The aixmon_rspc file, stripped of its XCOFF header and in fact starting at its entry point
• The bootexpand program
• The compressed kernel
• The compressed RAMFS
• The saved base customization

CHRP boot image

On chrp, the aixmon_chrp softros is located at the beginning of the boot image, but the XCOFF format is replaced by an ELF format. So a CHRP boot image contains:
• The ELF structure
• The aixmon_chrp file, stripped of its XCOFF header and in fact starting at its entry point
• The bootexpand program
• The compressed kernel
• The compressed RAMFS
• The saved base customization
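A quick way to check which header a boot image starts with is to dump its first bytes; the device name below is an assumption (hd5 as the boot logical volume):

# dd if=/dev/rhd5 bs=512 count=1 2>/dev/null | od -Ax -x | head

On a CHRP system the dump should begin with the ELF magic (7f45 4c46, i.e. "\177ELF"); on an RSPC system it should begin with the hints signature 0x4149584d ("AIXM" in ASCII).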


RSPC boot image hints header

introduction

On rspc systems, the aixmon XCOFF header is replaced by a hints structure. The aixmon_rspc file is copied to the boot image after the hints structure, starting at its entry point.

hints boot structure description

The following table represents the hints structure:

size  name                description
4     signature           Signature for boot program, 0x4149584d
4     resid_data_address  Address of residual data as determined by firmware
4     bss_offset          Address of bss section
4     bss_length          Length of bss section
4     jump_offset         Jump offset in boot image
4     load_exec_address   Address of boot loader as determined by firmware
4     header_size         Size of header
4     header_block_size   Offset to AIX boot image
4     image_length        Size of boot program
4     spare
4     res_mem_size        Reserved memory size
4     mode_control        Boot mode control, 0xDEAD0000 | mode_control
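Read as a C structure, the table corresponds to something like the following sketch. The field names mirror the table above; the authoritative definition is in /usr/include/sys/boot.h and may differ in detail:

/* Sketch of the RSPC hints header, reconstructed from the table above. */
struct hints {
    unsigned int signature;          /* 0x4149584d ("AIXM")                  */
    unsigned int resid_data_address; /* residual data, set by firmware      */
    unsigned int bss_offset;         /* address of bss section              */
    unsigned int bss_length;         /* length of bss section               */
    unsigned int jump_offset;        /* jump offset in boot image           */
    unsigned int load_exec_address;  /* boot loader address, set by firmware*/
    unsigned int header_size;        /* size of this header                 */
    unsigned int header_block_size;  /* offset to AIX boot image            */
    unsigned int image_length;       /* size of boot program                */
    unsigned int spare;
    unsigned int res_mem_size;       /* reserved memory size                */
    unsigned int mode_control;       /* 0xDEAD0000 | mode_control           */
};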



RSPC boot image hints header -- continued

RSPC boot image example

The following output represents the hints header as displayed by:

# dd if=<boot device> bs=512 skip=<RBA> count=1 | od -Ax -x

In the dump, the hints signature 0x4149584d ('AIXM'), the boot mode control word (0xDEAD....) and the aixmon entry point at the jump offset can be identified.


CHRP Boot image ELF structure

introduction

On chrp systems, the aixmon XCOFF header is replaced by an ELF header. The aixmon_chrp file is copied to the boot image after the ELF header, starting at its entry point.

ELF boot header description

The ELF boot header is made of:
• ELF header structure
• Note section description
• loader section 1 description
• loader section 2 description
• Note data description
• The boot loader parameters data

ELF header structure description

The following table describes the ELF header structure:

size  name         description
16    e_ident      ELF identification
2     e_type       object file type
2     e_machine    architecture
4     e_version    object file version
4     e_entry      entry point
4     e_phoff      prog hdr byte offset
4     e_shoff      section hdr byte offset
4     e_flags      processor specific flags
2     e_ehsize     ELF header size
2     e_phentsize  prog hdr table entry size
2     e_phnum      prog hdr table entry count
2     e_shentsize  section header size
2     e_shnum      section header entry count
2     e_shstrndx   sect name string tbl idx
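As a C structure this is the generic 32-bit ELF header layout (Elf32_Ehdr as found in <elf.h> on ELF-based systems); a sketch:

/* Sketch of the 32-bit ELF header from the table above. */
struct elf_hdr {
    unsigned char  e_ident[16];   /* ELF identification          */
    unsigned short e_type;        /* object file type            */
    unsigned short e_machine;     /* architecture                */
    unsigned int   e_version;     /* object file version         */
    unsigned int   e_entry;       /* entry point                 */
    unsigned int   e_phoff;       /* program header byte offset  */
    unsigned int   e_shoff;       /* section header byte offset  */
    unsigned int   e_flags;       /* processor-specific flags    */
    unsigned short e_ehsize;      /* ELF header size             */
    unsigned short e_phentsize;   /* program header entry size   */
    unsigned short e_phnum;       /* program header entry count  */
    unsigned short e_shentsize;   /* section header entry size   */
    unsigned short e_shnum;       /* section header entry count  */
    unsigned short e_shstrndx;    /* section name string tbl idx */
};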


CHRP boot image ELF structure -- continued

Note, load 1 and load 2 segment descriptions

The following table describes the structure used to format the note, loader 1 and loader 2 segments:

size  name      description
4     p_type    segment type
4     p_offset  offset to this segment
4     p_vaddr   virtual address of segment in memory
4     p_paddr   physical address of segment in memory
4     p_filesz  file image segment size
4     p_memsz   memory image segment size
4     p_flags   segment flags
4     p_align   segment alignment

Note data description

The following table represents the note data description structure:

size  name       description
4     namesz     size of name
4     descsz     size of descriptor
4     type       descriptor interpretation
8     name       the owner of this entry
4     real_mode  ISA env variable
4     real_base  ISA env variable
4     real_size  ISA env variable
4     virt_base  ISA env variable
4     virt_size  ISA env variable
4     load_base  ISA env variable


CHRP boot image ELF structure -- continued

Boot loader parameters description

The following table describes the boot loader structure:

size  name              description
4     timestamp         date when the boot image was created
4     bootimage_size    equivalent to the number of sectors for the BLV found in the boot record
4     boot_loader_size  size of the aixmon in bytes
4     inst_offset       jump offset in boot image
4     rmalloc_size      percent of memory for kernel heap
4     reserved1
4     reserved2
4     reserved3

example

Use the following command to display the ELF structure:

# dd if=<boot device> bs=512 skip=<RBA> count=1 | od -Ax -x

In the dump, the elf_hdr, note_phdr, load_phdr1, load_phdr2, note_data and BL_parms_data areas can be identified, followed by the aixmon entry point.


Exercise

Introduction

This exercise shows you how to locate the different parts of the boot image using the boot record.

Procedure

Use the following procedure to locate the main parts of the boot image:

Step 1: Locate the boot disk using:
        # bootinfo -b
Step 2: Determine the architecture of your system using:
        # bootinfo -p
Step 3: Find the boot record located at the beginning of the disk found in step 1 using:
        # dd if=<boot disk> bs=512 count=1 | od -Ax -x
Step 4: On RSPC or CHRP, locate the RBA and sectors values in the boot partition table from the output of step 3. On RS6K, locate boot_prg_start and boot_code_length in the record.
Step 5: Create a file using the offset and sector count found in step 4:
        # dd if=<boot disk> bs=512 skip=<RBA> count=<sectors> of=/tmp/myfile
Step 6: Using the what command, try to find out what is included in this file.
Step 7: What is missing from the what output? Why?
Step 8: Create a file using the offset found in step 4 plus the size of the boot loader:
        # dd if=<boot disk> bs=512 skip=<RBA + boot loader size> count=512 of=/tmp/myfile2
        What is myfile2?
Step 9: Using the results from step 3, locate the base customization sector start and length; use these values to create a new file:
        # dd if=<boot disk> bs=512 skip=<base_cn_start> count=<base_cn_length> of=/tmp/myfile3
Step 10: Create a directory dir1 and copy /etc/objrepos/* to dir1, then run:
        # /usr/lib/boot/restbase -o myfile3 -d dir1 -v
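A worked pass through steps 1 to 6, with hypothetical values (hdisk0 as the boot disk, RBA 0x111b = 4379 and sectors 0x2cc1 = 11457 taken from the earlier CHRP example dump):

# bootinfo -b
hdisk0
# bootinfo -p
chrp
# dd if=/dev/hdisk0 bs=512 count=1 | od -Ax -x        # boot record
# dd if=/dev/hdisk0 bs=512 skip=4379 count=11457 of=/tmp/myfile
# what /tmp/myfile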


Instructor Notes

Purpose

Notes on boot record and image exercise

Details

Step 6 should output something like:

07 1.3 src/rspc/usr/lib/boot/aixmon_chrp/cl_in_services.c, chrp_softros, rspc500, 0025A_500 10/22/98 14:25:39
04 1.32 src/rspc/usr/lib/boot/aixmon_chrp/aixmon_chrp.c, chrp_softros, rspc500, 0026A_500 6/16/00 12:43:25
09 1.2 src/rspc/usr/lib/boot/aixmon_chrp/printf.c, chrp_softros, rspc500, 0025A_500 1/13/99 10:38:02
08 1.40 src/rspc/usr/lib/boot/aixmon_chrp/iplcb_init.c, chrp_softros, rspc500, 0029A_500 7/17/00 14:07:11
39 1.5 src/rspc/usr/lib/boot/aixmon_chrp/numa_topo.c, chrp_softros, rspc500, 0028A_500 6/7/00 08:11:21
48 1.1 src/rspc/usr/lib/boot/aixmon_chrp/rtas_func.c, chrp_softros, rspc500, 0026A_500 6/16/00 13:04:32
65 1.21 src/bos/usr/sbin/bootexpand/expndkro.c, bosboot, bos500, 0025A_500 4/14/00 14:26:38

So it reflects the presence of the softros (aixmon_chrp) and the bootexpand code. The kernel and the RAMFS are missing here because they are stripped and therefore unreadable for the what command.

Step 8 should output something like:

# what /tmp/myfile2
/tmp/myfile2:
65 1.21 src/bos/usr/sbin/bootexpand/expndkro.c, bosboot, bos500, 0

After completing step 10, students should observe that the following files were updated by the restbase command. That confirms that myfile3 is actually the base customization area.

-rw-r--r--  1 root  system  32768 Aug 23 15:51 CuDvDr
-rw-r--r--  1 root  system   4096 Aug 23 15:51 CuPath
-rw-r--r--  1 root  system   4096 Aug 23 15:51 CuPath.vc
-rw-r--r--  1 root  system   4096 Aug 23 15:51 CuPathAt
-rw-r--r--  1 root  system   4096 Aug 23 15:51 CuPathAt.vc
-rw-r--r--  1 root  system  16384 Aug 23 15:51 CuAt
-rw-r--r--  1 root  system   8192 Aug 23 15:51 CuAt.vc
-rw-r--r--  1 root  system   4096 Aug 23 15:51 CuDep
-rw-r--r--  1 root  system  16384 Aug 23 15:51 CuVPD
-rw-r--r--  1 root  system  12288 Aug 23 15:51 CuDv


Power ROS and Softros

ROS

On RS6K platforms, the hardware ROS performs some basic hardware configuration and tests, and creates the IPL Control Block before transferring control to the kernel's entry point.

Softros

The RSPC and CHRP families of computers require a boot image with special software known as the SOFTROS, which provides functions that AIX requires but that are not provided by the hardware firmware. The SOFTROS performs some basic hardware configuration and tests, and also sets up some data structures to provide an environment for AIX that more closely resembles the environment provided by the RS6K system ROS. On CHRP systems the firmware device tree is also appended to the IPL Control Block. The SOFTROS then transfers control to the kernel's entry point.


IPLCB on Power

Definition

The IPLCB (Initial Program Load Control Block) defines the RAM resident interface between the IPL boot process and the operating system. The ROS or Softros initializes the IPLCB structure using interfaces to the firmware or to the ROS (on the RS6K platform). The kernel, when loaded, uses the IPLCB structure to initialize its runtime structures.

IPLCB Description

The IPLCB contains the following structures (described in /usr/include/sys/iplcb.h):
• IPLCB Directory: contains the IPLCB ID and pointers (offset and size) to the IPLCB data
• IPLCB Data such as:
  • processor information ('ipl -proc [cpu]')
  • memory region ('ipl -mem')
  • system information ('ipl -sys')
  • user information ('ipl -user')
  • NUMA information ('ipl -numa')


IPLCB on Power -- continued

IPLCB directory example on a CHRP system

The following screen output shows the IPLCB on a CHRP system, captured using the kdb iplcb -dir subcommand:

IPL directory [10000080]
ipl_control_block_id.........ROSIPL
ipl_cb_and_bit_map_offset...00000000    ipl_cb_and_bit_map_size....00008898
bit_map_offset..............000087A8    bit_map_size...............00000007
ipl_info_offset.............000002E8    ipl_info_size..............00000598
iocc_post_results_offset....00000000    iocc_post_results_size.....00000000
nio_dskt_post_results_offset00000000    nio_dskt_post_results_size.00000000
sjl_disk_post_results_offset00000000    sjl_disk_post_results_size.00000000
scsi_post_results_offset....00000000    scsi_post_results_size.....00000000
eth_post_results_offset.....00000000    eth_post_results_size......00000000
tok_post_results_offset.....00000000    tok_post_results_size......00000000
ser_post_results_offset.....00000000    ser_post_results_size......00000000
par_post_results_offset.....00000000    par_post_results_size......00000000
rsc_post_results_offset.....00000000    rsc_post_results_size......00000000
lega_post_results_offset....00000000    lega_post_results_size.....00000000
keybd_post_results_offset...00000000    keybd_post_results_size....00000000
ram_post_results_offset.....00000000    ram_post_results_size......00000000
sga_post_results_offset.....00000000    sga_post_results_size......00000000
fm2_post_results_offset.....00000000    fm2_post_results_size......00000000
net_boot_results_offset.....00000000    net_boot_results_size......00000000
csc_results_offset..........00000000    csc_results_size...........00000000
menu_results_offset.........00000000    menu_results_size..........00000000
console_results_offset......00000000    console_results_size.......00000000
diag_results_offset.........00000000    diag_results_size..........00000000
rom_scan_offset.............00000000    rom_scan_size..............00000000
sky_post_results_offset.....00000000    sky_post_results_size......00000000
global_offset...............00000000    global_size................00000000
mouse_offset................00000000    mouse_size.................00000000
vrs_offset..................00000000    vrs_size...................00000000
taur_post_results_offset....00000000    taur_post_results_size.....00000000
ent_post_results_offset.....00000000    ent_post_results_size......00000000
vrs40_offset................00000000    vrs40_size.................00000000
gpr_save_area1............@ 10000178
system_info_offset..........00000880    system_info_size...........0000009C
buc_info_offset.............0000091C    buc_info_size..............00000150
processor_info_offset.......00000A6C    processor_info_size........00000310
fm2_io_info_offset..........00000000    fm2_io_info_size...........00000000
processor_post_results_off..00000000    processor_post_results_size00000000
system_vpd_offset...........00000000    system_vpd_size............00000000
mem_data_offset.............00000000    mem_data_size..............00000000
l2_data_offset..............00000D7C    l2_data_size...............000000C0
fddi_post_results_offset....00000000    fddi_post_results_size.....00000000
golden_vpd_offset...........00000000    golden_vpd_size............00000000
nvram_cache_offset..........00000000    nvram_cache_size...........00000000
user_struct_offset..........00000000    user_struct_size...........00000000
residual_offset.............00000E3C    residual_size..............0000776C
numatopo_offset.............00000E3C    numatopo_size..............00000000


Guide

Checkpoint Introduction

Take a few minutes to answer the following questions. We will review the questions as a group when everyone has finished.

Quiz

1. Where is the softros located?
2. What are the four common parts of the boot image across Power platforms?
3. What is the difference between RSPC and CHRP at the very beginning of the boot image?
4. In which logical volume is the boot record located?
5. Who builds the IPLCB on the 3 Power platforms?
6. What is the difference between the RS6K and the other Power architectures in the boot record?


Instructor Notes Purpose

Notes on Quiz and transition to the next section

Quiz responses

The responses for the quiz are:
1. Where is the softros located?
   • After the header in the boot logical volume.
2. What are the four common parts of the boot image across Power platforms?
   • bootexpand
   • kernel
   • ramfs
   • saved base
3. What is the difference between RSPC and CHRP at the very beginning of the boot image?
   • RSPC uses a hints structure.
   • CHRP uses an ELF header.
4. In which logical volume is the boot record located?
   • None, the boot record is located at the very beginning of the disk.
5. Who builds the IPLCB?
   • ROS on RS6K.
   • Softros on CHRP and RSPC.
6. What is the difference between the RS6K and other Power platforms in the boot record?
   • The RS6K doesn't use the boot partition table.

Transition Statement

Now we will describe:
• the IA-64 specific boot process (if this is not a Power-only course)


The IA-64 Boot Mechanism

Introduction

This section explains the boot mechanism used by the IA-64 platform.

Definitions

EFI stands for Extensible Firmware Interface. EFI provides a standard interface between the hardware and the operating system on IA-64 platforms.

Boot overview

When the system is powered on, the EFI loads first. EFI loads a BIOS for the devices that need one. EFI then prompts to enter the setup for a timeout period. EFI then displays the EFI boot menu for another timeout period, after which it scans the bootlist to find a boot device. The EFI boot loader prompts for the boot loader menu and, after the timeout or on exit from the menu, initializes the IPL Control Block. It then locates and loads the kernel, which initializes itself. The kernel then calls init (in fact /usr/lib/boot/ssh at this stage). The ssh then calls rc.boot for Phase I and Phase II, which are specific to each boot device type. Then init executes rc.boot Phase III and the remaining common code in rc.boot for disk and network boot devices. If no boot device is found, EFI starts the EFI Shell on IA-64 platforms that support the EFI Shell.


The IA-64 Boot Mechanism -- continued

Boot diagram

The following diagram represents the high-level boot process overview:

1. Execution of EFI firmware; load needed BIOS; prompt for setup.
   - If a key is entered during the timeout, enter the setup menu.
2. EFI boot manager menu.
   - On request, enter the boot maintenance manager menu.
3. Scan the boot list.
   - If no valid boot device is found, start the EFI Shell.
4. AIX boot loader.
   - If a key is entered during the timeout, enter the AIX boot loader menu.
5. Kernel initialization; the kernel calls init (/usr/lib/boot/ssh).
6. ssh calls rc.boot Phases I and II.
7. init exits to newroot; init calls rc.boot Phase III from inittab and processes the remaining inittab entries.


IA-64 boot disk layout

Boot image overview

The following represents the overview of an AIX 5L on IA-64 boot disk (hdisk0_all):
• PMBR, EFI partition header and entries
• The physical volume partition (hdisk0), containing the VGDA, the boot logical volume hd5 (kernel, RAM filesystem, base customized data) and the rest of hdisk0
• The IA-64 system partition (hdisk0_s0), containing the EFI boot loader

On IA-64 platform, AIX 5L must be aware of EFI disk partitioning. During installation, two partitions will be created on the target disk (hdisk0_all) : • A Physical Volume partition (hdisk0 in the AIX environment) known as a block device in the EFI environment (blkXX). • An IA-64 System partition (hdisk0_s0 in the AIX environment) known as an IA-64 System partition in the EFI environment (fsXX)

kernel

On IA-64 platform the 64 bit kernel (unix_ia64) can be used as the kernel for either UP or MP systems. The kernel initializes itself and then passes control to the simple shell init (ssh) in the RAM filesystem.



RAM filesystem

The filesystem used during the boot process; it contains the programs and data for initializing devices and subsystems in order to install AIX, execute diagnostics, or access and bring up the rest of AIX.

base customized data

Area of the hard disk boot logical volume containing user configured ODM device configuration information that is used by the system configuration process.

EFI boot loader

The EFI boot loader resides in an IA-64 System Partition, physically located after the Physical Volume Partition by the installation process.


EFI boot manager and boot maintenance manager overview Introduction

At boot time, EFI will prompt for the EFI boot manager menu to be entered for a timeout period. The timeout period is customizable via the boot maintenance menu.

boot manager

At boot time, the boot manager will display the bootlist and prompt for a timeout period. If the timeout is reached, the boot manager scans the bootlist in the boot order to find a valid boot device. If a key is entered before the timeout period, the user will be able to:
• select a boot device from the list to boot for this session
• start the EFI Shell on platforms that support it
• enter the boot maintenance manager

boot maintenance manager menu

The boot maintenance manager menu will allow the administrator to : • boot from a file • add/delete boot options • change boot order • manage boot next setting • set autoboot timeout • select active console output devices (output,input and error) • do a cold reset.


EFI Shell Overview

Introduction

The EFI Shell allows you to configure the boot process used by the IA-64 platform. The main functions are to:
• locate and identify the different boot devices
• set environment variables
• use debugging sub commands
• boot from the selected boot device

EFI Shell startup example

The EFI Shell startup displays information about the current EFI level and the device mapping, as follows:

EFI version x.xx [xx.xx]
Build flags : EIF64 Running on Merced EFI_DEBUG
EFI IA-64 SDV/FDK (BIOS CallBacks) [Fri Mar 31 13:21:32 2000] - INTEL
Cache Enabled.
This image Main entry is at address 000000003F2BA000
Stack = 000000003F2B6FF0   BSP = 000000003F293000
INT Stack = 000000003F292FF0   INT BSP = 000000003F26F000
EFI Shell version x.xx [xx.xx]
Device mapping table
  fs0  : VenHw(Unknown Device:80)/HD(Part1,Sig0CBCBA54)
  blk0 : VenHw(Unknown Device:01)/HD
  blk1 : VenHw(Unknown Device:80)/HD
  blk2 : VenHw(Unknown Device:81)/HD
  blk3 : VenHw(Unknown Device:ff)/HD
  blk4 : VenHw(Unknown Device:80)/HD(Part1,Sig0CBCBA54)
  blk5 : VenHw(Unknown Device:80)/HD(Part2,Sig0CBCBA54)

EFI Shell sub commands

In the EFI Shell you will be able to use the following sub commands:

sub command                     description
help [internal command]         Display this help
guid [sname]                    Dump known guid ids
set [-d] [sname] [value]        Set/get environment variable
alias [-d] [sname] [value]      Set/get alias settings
dh [-p prot_id] | [handle]      Dump handle info
map [-dvr] [sname[:]] [handle]  Map shortname to device path
mount BlkDevice [sname[:]]      Mount a filesystem on a block device
cd [path]                       Updates the current directory
echo [[-on | -off] \ [text]     Echo text to stdout or toggle script echo
endfor                          Script-only: Delimiter for loop construct
pause                           Script-only: Prompt to quit or continue


EFI Shell Overview -- continued

EFI Shell sub commands continued

sub command                     description
ls [dir] [dir]...               Obtain directory listing
mkdir [dir] [dir]...            Make directory
if [not] condition then        Script-only: IF THEN construct
endif                           Script-only: Delimiter for IF THEN construct
goto label                      Script-only: Jump to label location in script
for var in                      Script-only: Loop construct
mode [row col]                  Set/get current text mode
cp file [file] ... dest         Copy files/directories
comp file1 file2                Compare two files
rm file/dir [file/dir]          Remove file/directories
memmap                          Dumps memory map
type [-a] [-u] file             Type file
dmpstore                        Dumps variable store
load driver_name                Loads a driver
ver                             Displays version info
err [level]                     Set or display error level
time [hh:mm:ss]                 Set or display time
date [mm/dd/yyyy]               Set or display date
stall microseconds              Delay for x microseconds
reset [/warm] [reset string]    Cold or warm reset
vol fs [Volume Label]           Set or display volume label
attrib [+/- rhs] [filename]     View/set file attributes
cls [background color]          Clear screen
dnlk device [Lba] [Blocks]      Hex dump of BlkIo devices
pci [bus dev] [func]            Display pci device(s) info
mm Address [Width] [;Type]      Memory modify: Mem, MMIO, IO, PCI
mem [Address] [size] [;MMIO]    Dump memory or memory-mapped IO


EFI Shell Overview -- continued

EFI Shell sub commands continued

sub command                     description
bcfg -?                         Configures boot driver & load options
edit [file name]
Edd30 [On|Off]                  Enable or disable EDD 3.0 device paths
unload [-nv]
EddDebug [blockdevicename]      Debug of EDD info from adapter card

EFI Shell examples

The following is an example of EFI Shell use:

Shell> pci
          Generic System Peripheral - Interrupt Controller
          Vendor 0x8086 Device 0x123D Program Interface 20
0 2 0 ==> Mass Storage Controller - SCSI Bus
          Vendor 0x1077 Device 0x1280 Program Interface 0
0 3 0 ==> PCI Bridge Device - ISA
          Vendor 0x8086 Device 0x7600 Program Interface 0
0 3 1 ==> Mass Storage Controller - IDE
          Vendor 0x8086 Device 0x7601 Program Interface 80
0 3 2 ==> Serial Bus Controller - USB
          Vendor 0x8086 Device 0x7602 Program Interface 0
0 3 3 ==> Serial Bus Controller - SMBUS
          Vendor 0x8086 Device 0x7603 Program Interface 0
. . .

Shell> fs0:
fs0:\> dir boot

iplcb -dir
Directory Information
ipl_control_block_id......................= IA64_IPL
ipl_cb_and_bit_map_offset.................= 0x0
ipl_cb_and_bit_map_size...................= 0x7F0
bit_map_offset............................= 0x448
bit_map_size..............................= 0x27
ipl_info_offset...........................= 0xD8
ipl_info_size.............................= 0x7C
system_info_offset........................= 0x3D8
system_info_size..........................= 0x50
processor_info_offset.....................= 0x250
processor_info_size.......................= 0x188
io_xapic_info_offset......................= 0x428
io_xapic_info_size........................= 0x18
handoff_info_offset.......................= 0x158
handoff_info_size.........................= 0xF0
platform_int_info_offset..................= 0x440
platform_int_size.........................= 0x8
residual_offset...........................= 0x0
residual_size.............................= 0x0


Checkpoint Introduction

Take a few minutes to answer the following questions. We will review the questions as a group when everyone has finished.

Quiz

1. In which partition is the AIX boot loader located?
2. What is the equivalent of the fs0 partition in the AIX environment?
3. In which partition is the IA-64 boot record located?
4. In which partition is the IA-64 boot image located?
5. Where is bootexpand located on IA-64?


Instructor Notes Purpose

Checkpoint for rc.boot results

Answers

1. In which partition is the AIX boot loader located?
   • The boot loader is located in fs0:
2. What is the equivalent of the fs0 partition in the AIX environment?
   • The equivalent is hdiskxx_s0.
3. In which partition is the IA-64 boot record located?
   • There is no boot record on IA-64.
4. In which partition is the IA-64 boot image located?
   • The boot image is located in hd5, which in fact resides in the rootvg PV partition of the disk (blk5 in our example).
5. Where is bootexpand located on IA-64?
   • There is no bootexpand on IA-64.


Hard Disk Boot process (rc.boot Phase I)

Introduction

The main goal here is to get the devices configured and the ODM initialized.

Hard disk Phase I diagram

The following chart represents the hard disk boot Phase I process:

1. Restore the base configuration from the boot disk (restbase).
2. If the restbase return code is non-zero, stop with LED 548; otherwise set LED 510.
3. Run configuration manager Phase I.
4. Run bootinfo -b to get the boot device.
5. Link the boot device to /dev/ipldevice.
6. Set LED 511 and exit 0.


Hard Disk Boot process (rc.boot Phase II)

Introduction

The main objective in hard disk boot Phase II is to vary on rootvg and mount the standard filesystems.

Hard disk Phase II diagram

The following chart represents the hard disk boot Phase II process:

1. Set LED 511 and run ipl_varyon -v.
2. If the ipl_varyon return code is non-zero, stop with LED 552, 554 or 556; otherwise set LED 517.
3. fsck and mount the AIX filesystems on /mnt.
4. Check for a dump in hd6; swapon hd6 if no dump is present, else run the savebase recovery procedure.
5. If the key is in the service position or a dump is present in hd6, execute the service procedure.
6. Otherwise copy /etc/vg and objrepos to disk, merge the devices, unmount the filesystems and remount them.
7. Set LED 553 and exit 0.


Hard Disk Boot process (rc.boot Phase III)

Introduction

The main objective in hard disk boot Phase III is to mount the runtime /tmp, sync rootvg and then fall through to the Phase III common process.

Hard disk Phase III diagram

The following chart represents the hard disk boot Phase III process:

1. fsck and mount /tmp.
2. syncvg rootvg.
3. Continue with the Phase III common code.


CDROM Boot process (rc.boot Phases I, II and III)

Introduction

The main objective of the CDROM boot process is to configure the devices needed for installation and maintenance procedures and to start the bi_main process.

CDROM boot phases I, II and III diagram

The following chart shows the CDROM boot phases I, II and III:

Phase 1: run configuration manager Phase I; set LED 517 and mount the CDROM SPOT; set LED 512 and recreate the RAMFS from the SPOT; set LED 510 and configure the remaining devices needed for install; set LED 511 and exit 0.
Phase 2: exec bi_main; exit 0.
Phase 3: exit 0.


Tape Boot process (rc.boot Phases I, II and III)

Introduction

The main objective of the Tape boot process is to configure the devices needed for installation and maintenance procedures and to start the bi_main process.

Tape boot phases I, II and III diagram

The following chart shows the Tape boot phases I, II and III:

Phase 1: set LED 510 and run configuration manager Phase I; set LED 512; change all tape devices' block sizes to 512; clean up the links; clean up and rebuild the ODM; exit 0.
Phase 2: run configuration manager Phase II; exec bi_main.
Phase 3: exit 0.


Network Boot process (rc.boot Phases I, II and III)

Introduction

The main objective of the Network boot process is to configure devices, configure additional network options (network address, mask and default route) and run the $RC_CONFIG script.

Network boot phases I, II and III diagram

The following chart shows the Network boot phases I, II and III:

Phase 1: set NIM debug if needed; run restbase; save the ATM data and clear the ODM; run configuration manager Phase I; if booting from atm0, configure the ATM pvc, svc and muxatmd, otherwise configure the native network boot device (ifconfig); if the return code is non-zero, stop with LED 607; tftp the miniroot; set the NIM environment; create /etc/hosts and the routes; NFS mount the SPOT; run $RC_CONFIG from the SPOT; exit 0.
Phase 2: set LED 600; set NIM debug if needed; set the NIM environment; run $RC_CONFIG; exit 0.
Phase 3: continue with the Phase III common code.


Common Boot process (rc.boot Phase III)

Introduction

The common Phase III boot code is run for disk and network boot only.

Common boot Phase III diagram

The following chart shows the common boot Phase III process:

1. Ensure 1024K of free space in /tmp.
2. Load the streams modules.
3. Fix the secondary dump device.
4. swapon hd6 if no dump is present, else run the savebase recovery procedure.
5. If the key is in the service position, run configuration manager Phase III and disable the controlling tty; otherwise clean the ODM for alternate disk install and run configuration manager Phase II.
6. Set up System Hang Detection; run graphical boot if needed; run savebase; clean unavailable ttys from inittab; sync the files to hard disk; run /etc/rc.B1 if it exists; start the syncd daemon; start the errdaemon daemon; clean /etc/locks and /etc/nologin; start the mirrord daemon; start the cfgchk daemon; run diagsrv if supported by the platform.
7. System initialization completed; exit 0.


Network boot $RC_CONFIG files

Introduction

As seen in the Network Boot Process (Phases I, II and III), these scripts are run by rc.boot in phases I and II when booting from a network device. The scripts are located in the /usr/lib/boot/network directory. They are loaded from the SPOT on the NIM server during the network boot process.

rc.config types

There are 3 types of rc.config files : • rc.bos_inst : Used to configure a system for AIX installation • rc.dd_boot : Used for network boot of diskless or dataless systems • rc.diag : Used for booting to diagnostics

rc.bos_inst

This script will:
• Phase I:
  • mount the resources listed in niminfo as ${NIM_MOUNTS}
  • enable NIM debug if needed
  • link the necessary methods from the SPOT
  • run the configuration manager
• Phase II:
  • set some TCP/IP parameters
  • enable diagnostics for pre-install diagnostics on the disks
  • execute bi_main


rc.dd_boot

This script will:
• Phase I:
  • remove the link from /lib to /usr/lib and populate /lib with hard links to /usr to ensure the use of the RAM libraries
  • mount the root directory
  • get the niminfo file
  • unconfigure the network services (ifconfig and routes)
  • run configuration manager phase I
  • reconfigure the network using the NIM information
  • mount /usr
  • activate the local or remote paging spaces
  • issue mergedev
  • unmount all remote filesystems
• Phase II:
  • mount the dd_boot type filesystems
  • clean up unused shared libraries
  • set the hostname

rc.diag

This script will:
• Phase I:
  • mount the resources listed in niminfo as ${NIM_MOUNTS}
  • enable NIM debug if needed
  • link the necessary methods from the SPOT
  • run the configuration manager
• Phase II:
  • configure the console
  • if a graphics console is used, configure gxme0 and rcm0
  • for RSPC and CHRP, start the errdaemon, sleep 2 seconds and stop it, to collect the errors since the last boot
  • execute the diag pretest before running diag


The init process

Introduction

init initializes and controls AIX processes. The boot process, when running from the RAM filesystem (Phases I and II), doesn't use the real init command but /usr/lib/boot/ssh. This strategy allows for more efficient use of the system resources during boot. The real init is found in /usr/sbin/init. It begins during the kernel newroot, which occurs at the end of Phase II of rc.boot. The real init uses the /etc/inittab file to start AIX processes and run the system environment initialization scripts.

/etc/inittab

Here is an example of the inittab file:

init:2:initdefault:
brc::sysinit:/sbin/rc.boot 3 >/dev/console 2>&1
powerfail::powerfail:/etc/rc.powerfail >/dev/console 2>&1 # Power Failure Detection
rc:2:wait:/etc/rc >/dev/console 2>&1
fbcheck:2:wait:/usr/sbin/fbcheck >/dev/console 2>&1 # Run /etc/firstboot
srcmstr:2:respawn:/usr/sbin/srcmstr # System Resource Controller
rctcpip:2:wait:/etc/rc.tcpip > /dev/console 2>&1 # Start TCP/IP daemons
rcnfs:2:wait:/etc/rc.nfs > /dev/console 2>&1 # Start NFS Daemons
cron:2:respawn:/usr/sbin/cron
cons:0123456789:respawn:/usr/sbin/getty /dev/console
writesrv:2:wait:/usr/bin/startsrc -swritesrv
uprintfd:2:respawn:/usr/sbin/uprintfd
shdaemon:2:off:/usr/sbin/shdaemon >/dev/console 2>&1 # High availability daemon
logsymp:2:once:/usr/lib/ras/logsymptom # for system dumps
lft:2:respawn:/usr/sbin/getty /dev/lft0


ODM Structure and usage

Introduction

The Object Data Manager is widely used in AIX to store and retrieve various system information. For this purpose, AIX defines a number of standard ODM classes. Any application can create and use its own ODM classes to manage its own information.

AIX information managed by ODM

AIX system data managed by ODM includes:
• Device configuration information
• Display information for SMIT (menus, selectors, and dialogs)
• Vital product data for installation and update procedures
• Diagnostics information
• System resource information
• RAS information

Devices ODM Classes

The Devices classes are used by the configuration manager, device drivers and AIX device related commands (lsdev, lsattr, lspv, lsvg ...). The following table lists the Devices ODM classes and their definitions:

Class          Definition
PdDv           Predefined Devices
PdCn           Predefined Connection
PdAt           Predefined Attribute
PdAtXtd        Extended Predefined Attribute
Config_Rules   Configuration Rules
CuDv           Customized Devices
CuDep          Customized Dependency
CuAt           Customized Attribute
CuDvDr         Customized Device Driver
CuVPD          Customized Vital Product Data
CuPart         EFI partitions
CuPath
CuPathAt


ODM Structure and usage -- continued

SWVPD ODM Classes

The SWVPD classes are used by fileset related commands like installp, instfix, lslpp and oslevel. SWVPD is divided in 3 parts:
• root : classes are in /etc/objrepos
• usr : classes are in /usr/lib/objrepos
• share : classes are located in /usr/share/lib/objrepos

The following table lists the Software Vital Product Data ODM classes and their definitions:

Class      Definition
lpp        The lpp object class contains information about the installed software products, including the current software product state.
inventory  The inventory object class contains information about the files associated with a software product.
history    The history object class contains historical information about the installation and updates of software products.
product    The product object class contains product information about the installation and updates of software products and their prerequisites.

SRC ODM Classes

The SRC classes are used by the srcmstr daemon and related commands: lssrc, startsrc, stopsrc and chssys. The following table lists the System Resource Controller ODM classes and their definitions:

Class       Definition
SRCsubsys   The subsystem object class contains the descriptors for all SRC subsystems. A subsystem must be configured in this class before it can be recognized by the SRC.
SRCsubsvr   An object must be configured in this class if a subsystem has subservers and the subsystem expects to receive subserver-related commands from the srcmstr daemon.
SRCnotify   This class provides a mechanism for the srcmstr daemon to invoke subsystem-provided routines when the failure of a subsystem is detected.
SRCextmeth


ODM Structure and usage -- continued

SMIT ODM Classes

The SMIT ODM classes are used by the smit and smitty commands. The following table lists the SMIT ODM classes and their definitions:

Use            Class        Definition
smit menu      sm_menu_opt  1 for title of screen, 1 for first item, 1 for second item, ..., 1 for last item
smit selector  sm_name_hdr  1 for title of screen and other attributes
smit selector  sm_cmd_opt   1 for entry field or pop-up list
smit dialog    sm_cmd_hdr   1 for title of screen and command string
smit dialog    sm_cmd_opt   1 for first entry field, 1 for second entry field, ..., 1 for last entry field

RAS ODM Classes

The RAS classes are used by the errdaemon, shdaemon, shconf and alog commands. The following table lists the RAS ODM classes and their definitions:

Class      Definition
errnotify  Used by the errlog notification process
SWservAt   Used by the error log, system dumps, System Hang Detection and alog


ODM Structure and usage -- continued

Diagnostics ODM Classes

The diagnostics classes are used by the diag command. The following table lists the Diagnostics ODM classes and their definitions:

Class        Definition
PDiagRes     Predefined Diagnostic Resource Object Class
PDiagAtt     Predefined Diagnostic Attribute Device Object Class
PDiagTask    Predefined Diagnostic Task Object Class
CDiagAtt     Customized Diagnostic Attribute Object Class
TMInput      Test Mode Input Object Class
MenuGoal     Menu Goal Object Class
FRUB         Fru Bucket Object Class
FRUs         Fru Reporting Object Class
DAVars       Diagnostic Application Variables Object Class
PDiagDev     Predefined Diagnostic Devices Object Class
DSMOptions   Diagnostic Supervisor Menu Options Object Class

ODM commands

The following table lists the ODM commands and their usage:

Command    Definition
odmadd     Adds objects to an object class. The odmadd command takes an ASCII stanza file as input and populates object classes with objects found in the stanza file.
odmchange  Changes specific objects in a specified object class.
odmcreate  Creates empty object classes. The odmcreate command takes an ASCII file describing object classes as input and produces C language .h and .c files to be used by the application accessing objects in those object classes.
odmdelete  Removes objects from an object class.
odmdrop    Removes an entire object class.
odmget     Retrieves objects from object classes and puts the object information into odmadd command format.
odmshow    Displays the description of an object class. The odmshow command takes an object class name as input and puts the object class information into odmcreate command format.
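For example, to display the customized device object for a disk (output abbreviated; the attribute values vary by system):

# export ODMDIR=/etc/objrepos
# odmget -q "name=hdisk0" CuDv

CuDv:
        name = "hdisk0"
        status = 1
        chgstatus = 2
        ...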


ODM Structure and Usage -- continued

ODM subroutines

The following table lists the ODM subroutines and their use:

subroutine        definition
odm_add_obj       Adds a new object to the object class.
odm_change_obj    Changes the contents of an object.
odm_close_class   Closes an object class.
odm_create_class  Creates an empty object class.
odm_err_msg       Retrieves a message string.
odm_free_list     Frees memory allocated for the odm_get_list subroutine.
odm_get_by_id     Retrieves an object by specifying its ID.
odm_get_first     Retrieves the first object that matches the specified criteria in an object class.
odm_get_list      Retrieves a list of objects that match the specified criteria in an object class.
odm_get_next      Retrieves the next object that matches the specified criteria in an object class.
odm_get_obj       Retrieves an object that matches the specified criteria from an object class.
odm_initialize    Initializes an ODM session.
odm_lock          Locks an object class or group of classes.
odm_mount_class   Retrieves the class symbol structure for the specified object class.
odm_open_class    Opens an object class.
odm_rm_by_id      Removes an object by specifying its ID.
odm_rm_obj        Removes all objects that match the specified criteria from the object class.
odm_run_method    Invokes a method for the specified object.
odm_rm_class      Removes an object class.
odm_set_path      Sets the default path for locating object classes.
odm_unlock        Unlocks an object class or group of classes.
odm_terminate     Ends an ODM session.
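A minimal sketch of an ODM client program built from these subroutines; it assumes the device class definitions from <sys/cfgodm.h> and linking with -lodm -lcfg (the exact headers and class symbols should be checked against the system's include files):

#include <stdio.h>
#include <odmi.h>          /* ODM subroutine declarations (assumed header) */
#include <sys/cfgodm.h>    /* device class definitions: CuDv, ...          */

int main(void)
{
    struct CuDv cudv;
    char *rc;

    odm_initialize();                     /* start the ODM session          */
    odm_set_path("/etc/objrepos");        /* customized classes live here   */

    /* fetch the first CuDv object matching the criteria string */
    rc = (char *)odm_get_first(CuDv_CLASS, "name=hdisk0", &cudv);
    if (rc != NULL && rc != (char *)-1)
        printf("hdisk0 status: %d\n", cudv.status);

    odm_terminate();                      /* end the ODM session            */
    return 0;
}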


ODM Structure and Usage -- continued

ODM paths

As the ODM classes can be found in 3 paths (root, usr and share), you must decide which path to use before running ODM commands or ODM subroutines. For ODM commands, set the path using:

# export ODMDIR=/usr/share/lib/objrepos

In a C program, use:

odm_set_path("/usr/lib/objrepos");


boot and installation logging facilities

Introduction

It can be useful to retrieve quickly the log files used during boot or installation to help solve problems. The alog command can be used to recover the system logs.

log types

The alog command is used by the installation and boot processes to log information or errors for the following topics:
• boot : log for the boot process
• bosinst : log used for the AIX installation process
• console : log used to store console messages
• nim : log used to store NIM messages
• dumpsymp : used to store dump symptom messages

alog command usage

The following alog commands may be used:
• alog -L : lists the alog log types defined in the ODM
• alog -o -t <log_type> : displays the log file related to the log type
• echo "Message xxx" | alog -t <log_type> : logs the message to the log file
• alog -L -t <log_type> : displays detailed information related to the log type definition (log file path, size and verbosity)
• alog -C -t <log_type> -w <verbosity> : changes the verbosity (0-9) for the log type
• alog -C -t <log_type> ... : changes the other attributes of the log type definition in the ODM
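For example, a quick session to inspect the boot log (the log type names are those listed above):

# alog -L
boot
bosinst
console
nim
dumpsymp
# alog -o -t boot | pg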


Debugging boot problems using IADB -- continued

boot debugging output example

The following example shows the beginning of what you can see on the native serial port when debugging the boot process:

MEDIEVAL DEBUGGER ENTERED
interrupt. IP->E00000000001D2F2
brkpoint()+2:   { .mfi
 0:     nop.m 0x100001 ;;
}
>CPU0> set dbgmsg=on
>CPU0> set exectrace=on
CPU0> go
/) +
+ bootinfo -t
BOOTYPE=3
+ [ 0 -ne 0 ]
+ [ -z 3 ]
+ unset pdev_to_ldev undolt


Packaging Changes

Introduction

The lpp packaging has been reviewed to reflect the need for platform dependent packages.

Package names

The package names have the following structure:
<package_name>.V.R.M.F.<platform>.bff
where:
• <package_name> is the name of the package to be installed
• V.R.M.F are the Version, Release, Modification and Fix levels of the package
• <platform> is the platform type for which the package was designed. The platform type can be one of:
  • I : for Intel IA-64 platforms
  • N : for Neutral packages that can be installed on all platforms
  • nothing : for Power specific packages

Packaging commands

The installp, bffcreate, inutoc and instfix commands are updated to reflect these changes. By default the packaging commands process only the packages related to the platform where the command is run. A -M flag has been added to these commands; it accepts the following sub options:
• I : to process Intel related packages
• R : to process Power related packages
• N : to process Neutral related packages
• A : to process all kinds of packages

installp options

The installp command will only accept the -M flag with -l or -L options.

bffcreate options

The bffcreate command will accept all -M sub options, to allow the transit of packages regardless of the current platform. This is needed for NIM operations.

instfix options

The instfix command, like the installp command, will only accept the -M flag when used in conjunction with the -T (list) flag.

The installp -L output will include platform information.
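For example (a sketch; the media device and target directory are assumptions):

# installp -l -M A -d /dev/cd0                              # list the packages on the media for all platforms
# bffcreate -M N -d /dev/cd0 -t /usr/sys/inst.images all    # copy only the Neutral packages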


inutoc command


The inutoc command will accept the -M flag.


Checkpoint Introduction

Take a few minutes to answer the following questions. We will review the questions as a group when everyone has finished.

Quiz

1. Who calls rc.boot?
2. What is common in Phase II of the tape, CDROM and network boot?
3. What is specific to rc.boot Phase III?
4. What will you need to do if you want to modify something in rc.boot Phase I or II?
5. Which phase and/or device in rc.boot is not supported on IA-64?
6. What is the usage of the ODM?
7. What is init in the first two phases of the boot?


Instructor Notes

Purpose



Answers

1. Who calls rc.boot?
   • init
2. What is common to phase II of tape, cdrom and network boots?
   • They exec bi_main (rc.bos_inst for network) to run the installation tasks
3. What is specific to rc.boot phase III?
   • rc.boot phase III is called by the actual init process after newroot.
4. What will you need to do if you want to modify something in rc.boot phase I or II?
   • You will need to run bosboot in order to copy your changed rc.boot to the RAMFS
5. Which phase and/or device in rc.boot is not supported on IA-64?
   • The tape boot device (this was covered in the map of the various types of boot)
6. What is the usage of the ODM?
   • To store and retrieve system information
7. What is init in the first two phases of the boot?
   • ssh


Unit 15. /proc Filesystem Support

This unit describes the /proc filesystem in the AIX 5L kernel.

What You Should Be Able to Do

After completing this unit, you should be able to:
• List the directories and files that are found in the /proc filesystem
• Describe the basic functionality of each file in the sub-directory tree for a specific process
• Create a simple C program to access the files belonging to another process


/proc Filesystem Support

Introduction

/proc is a file system that provides access to the state of each active process and Light Weight Process (LWP) in the system.

Platform

This lesson is platform-independent.

/proc filesystem

The contents of the /proc filesystem have the same appearance as any other files and directories in a Unix filesystem. Each top-level entry in the /proc directory is a sub-directory named by the decimal number corresponding to a process ID, and the owner of each is determined by the user ID of the process. Access to process state is provided by additional files contained within each sub-directory; this hierarchy is described more completely below. Except where otherwise specified, "/proc file" refers to a non-directory file within the hierarchy rooted at /proc.

Filesystem hierarchy

The directory structure of the /proc directory is described below; pid represents the process ID number and lwp# represents the light-weight process number.

File/Directory Name              Description
/proc                            directory - list of processes
/proc/pid                        directory for process pid
/proc/pid/status                 status of process pid
/proc/pid/ctl                    control file for process pid
/proc/pid/psinfo                 ps info for process pid
/proc/pid/as                     address space of process pid
/proc/pid/map                    address space map info for process pid
/proc/pid/object                 directory for objects for process pid
/proc/pid/sigact                 signal actions for process pid
/proc/pid/lwp/lwp#               directory for LWP lwp#
/proc/pid/lwp/lwp#/lwpstatus     status of LWP lwp#
/proc/pid/lwp/lwp#/lwpctl        control file for LWP lwp#
/proc/pid/lwp/lwp#/lwpsinfo      ps info for LWP lwp#


Accessing /proc files

Standard system call interfaces are used to access /proc files: open(2), close(2), read(2), and write(2). Most files describe process state and can only be opened for reading. An open for writing allows process control; a read-only open allows inspection, but not control.


Types of Files

Introduction

Listed below are descriptions of the files that are contained in the /proc filesystem hierarchy. These files are described in more detail on the following pages.

Filename          Mode     Function
as                rd/wr    Contains the address-space image of the process
ctl               wr       Allows change to the process state or behaviour
status            rd       Contains state information about the process
psinfo            rd       Information about the process needed by the ps(1) command
map               rd       Information about the virtual address map of the process
cred              rd       Describes the credentials associated with the process
sigact            rd       Describes the disposition of all signals associated with the process
object            N/A      A directory containing read-only files with names as they appear in the map file
lwp               N/A      A directory for LWPs
lwp#/lwpstatus    rd       State information for LWP lwp#
lwp#/lwpctl       wr       Allows change to the state or behaviour of LWP lwp#
lwp#/lwpsinfo     ??       Process info for LWP lwp#


Guide

The as File Introduction

The as file contains the address-space image of the process and can be opened for both reading and writing.

Accessing the file

lseek is used to position the file at the virtual address of interest and then the address space can be examined or changed through a read or write.
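As a minimal sketch of this pattern (the target pid and virtual address below are hypothetical, and error handling is kept to the essentials):

/* Sketch: read 4 bytes from another process's address space. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int word;
    int fd = open("/proc/1234/as", O_RDONLY);  /* O_RDWR would also allow writes */

    if (fd < 0)
        return 1;
    lseek(fd, 0x20000000L, SEEK_SET);          /* position at the virtual address */
    if (read(fd, &word, sizeof(word)) == sizeof(word))
        printf("word at 0x20000000 = 0x%08x\n", word);
    close(fd);
    return 0;
}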


The ctl File

Introduction

The ctl file is a write-only file to which structured messages are written directing the system to change some aspect of the process’s state or control its behavior in some way. The seek offset is not relevant when writing to this file.

Control messages

Individual LWPs also have associated lwpctl files. Process state changes are effected through control messages written either to the ctl file of the process or to a specific lwpctl file. All control messages consist of an int naming the specific operation, followed by additional data containing operands (if any). The effect of a control message is immediately reflected in the state of the process visible through the appropriate status and information files. Multiple control messages can be combined in a single write(2) to a control file, but no partial writes are permitted; that is, each control message (operation code plus operands) must be presented in its entirety to the write and not in pieces over several system calls.
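For illustration, a simple no-operand message such as PCSTOP could be sent like this. This is a sketch only; the pid is hypothetical, and the operation codes are assumed to be made available through <sys/procfs.h>:

/* Sketch: direct all LWPs of process 1234 to stop. */
#include <fcntl.h>
#include <unistd.h>
#include <sys/procfs.h>

int stop_process(void)
{
    int msg = PCSTOP;                            /* operation code, no operands */
    int fd = open("/proc/1234/ctl", O_WRONLY);   /* control files are write-only */

    if (fd < 0)
        return -1;
    write(fd, &msg, sizeof(msg));                /* whole message in one write */
    close(fd);
    return 0;
}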

Descriptions of control messages

Descriptions of the allowable control messages are included later in this unit.


The status File

Introduction

The status file contains state information about the process and one of its LWPs (chosen according to the rules described below).

File format

The file is formatted as a struct pstatus containing the following members:

long        pr_flags;     /* Flags */
ushort_t    pr_nlwp;      /* Total number of lwps in the process */
sigset_t    pr_sigpend;   /* Set of process pending signals */
vaddr_t     pr_brkbase;   /* Address of the process heap */
ulong_t     pr_brksize;   /* Size of the process heap, in bytes */
vaddr_t     pr_stkbase;   /* Address of the process stack */
ulong_t     pr_stksize;   /* Size of the process stack, in bytes */
pid_t       pr_pid;       /* Process id */
pid_t       pr_ppid;      /* Parent process id */
pid_t       pr_pgid;      /* Process group id */
pid_t       pr_sid;       /* Session id */
timestruc_t pr_utime;     /* Process user cpu time */
timestruc_t pr_stime;     /* Process system cpu time */
timestruc_t pr_cutime;    /* Sum of children's user times */
timestruc_t pr_cstime;    /* Sum of children's system times */
sigset_t    pr_sigtrace;  /* Mask of traced signals */
fltset_t    pr_flttrace;  /* Mask of traced faults */
sysset_t    pr_sysentry;  /* Mask of system calls traced on entry */
sysset_t    pr_sysexit;   /* Mask of system calls traced on exit */
lwpstatus_t pr_lwp;       /* "representative" LWP */
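Putting the earlier access rules together, the whole structure can be fetched in one read(2). A minimal sketch (the pid is hypothetical, and struct pstatus is assumed to come from <sys/procfs.h>):

/* Sketch: read the pstatus of process 1234 and print a few fields. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/procfs.h>

int main(void)
{
    struct pstatus st;
    int fd = open("/proc/1234/status", O_RDONLY);  /* read-only: inspection */

    if (fd < 0)
        return 1;
    if (read(fd, &st, sizeof(st)) == sizeof(st))
        printf("pid %d: %u LWPs, parent %d\n",
               (int)st.pr_pid, (unsigned)st.pr_nlwp, (int)st.pr_ppid);
    close(fd);
    return 0;
}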



Member description

Here is a description of members of the status file:

Member        Description
pr_flags      A bit mask holding flags (the flags are described below)
pr_nlwp       Total number of LWPs in the process
pr_brkbase    Virtual address of the process heap
pr_brksize    Size of the process heap in bytes. The address formed by the sum of these two values is the process break (see brk(2)).
pr_stkbase    Virtual address of the process stack
pr_stksize    Size of the process stack in bytes. Each LWP runs on a separate stack; the process stack is distinguished in that the operating system will grow it as necessary.
pr_pid        Process ID
pr_ppid       Parent process ID
pr_pgid       Process group ID
pr_sid        Session ID of the process
pr_utime      User CPU time consumed by the process, in seconds and nanoseconds
pr_stime      System CPU time consumed by the process, in seconds and nanoseconds
pr_cutime     Cumulative user CPU time consumed by the process's children, in seconds and nanoseconds
pr_cstime     Cumulative system CPU time consumed by the process's children, in seconds and nanoseconds
pr_sigtrace   Set of signals that are being traced (see PCSTRACE)
pr_flttrace   Set of hardware faults that are being traced (see PCSFAULT)
pr_sysentry   Set of system calls being traced on entry (see PCSENTRY)
pr_sysexit    Set of system calls being traced on exit (see PCSEXIT)
pr_lwp        If the process is not a zombie, pr_lwp contains an lwpstatus_t structure describing a representative LWP. The contents of this structure have the same meaning as if it were read from an lwpstatus file.


pr_flags

pr_flags is a bit-mask holding these flags:

Flag        Description
PR_ISSYS    System process (see PCSTOP)
PR_FORK     Has its inherit-on-fork flag set (see PCSET)
PR_RLC      Has its run-on-last-close flag set (see PCSET)
PR_KLC      Has its kill-on-last-close flag set (see PCSET)
PR_ASYNC    Has its asynchronous-stop flag set (see PCSET)

Multi-threaded applications

When the process has more than one LWP, its representative LWP is chosen by the /proc implementation. The chosen LWP is a stopped LWP only if all the process's LWPs are stopped, is stopped on an event of interest only if all the LWPs are so stopped, or is in a PR_REQUESTED stop only if there are no other events of interest to be found. The chosen LWP remains fixed as long as all the LWPs are stopped on events of interest and PCRUN is not applied to any of them. When applied to the process control file, every /proc control operation that must act on an LWP uses the same algorithm to choose which LWP to act on. Together with synchronous stopping (see PCSET), this enables an application to control a multiple-LWP process using only the process-level status and control files if it so chooses. More fine-grained control can be achieved using the LWP-specific files.


The psinfo File

Introduction

The psinfo file contains information about the process needed by the ps(1) command. If the process contains more than one LWP, a representative LWP (chosen according to the rules described for the status file) is used to derive the status information.

File format

The file is formatted as a struct psinfo containing the following members:

ulong_t         pr_flag;              /* process flags */
ulong_t         pr_nlwp;              /* number of LWPs in process */
uid_t           pr_uid;               /* real user id */
gid_t           pr_gid;               /* real group id */
pid_t           pr_pid;               /* unique process id */
pid_t           pr_ppid;              /* process id of parent */
pid_t           pr_pgid;              /* pid of process group leader */
pid_t           pr_sid;               /* session id */
caddr_t         pr_addr;              /* internal address of process */
long            pr_size;              /* size of process image in pages */
long            pr_rssize;            /* resident set size in pages */
timestruc_t     pr_start;             /* process start time, time since epoch */
timestruc_t     pr_time;              /* usr+sys cpu time for this process */
dev_t           pr_ttydev;            /* controlling tty device (or PRNODEV) */
char            pr_fname[PRFNSZ];     /* last component of exec()ed pathname */
char            pr_psargs[PRARGSZ];   /* initial characters of arg list */
struct lwpsinfo pr_lwp;               /* "representative" LWP */

Platform specific data

Some of the entries in psinfo, such as pr_flag and pr_addr, refer to internal kernel data structures and should not be expected to retain their meanings across different versions of the operating system. They have no meaning to a program and are only useful for manual interpretation by a user aware of the implementation details.

Zombies

psinfo is still accessible even after a process becomes a zombie.

Representative LWP

pr_lwp describes the representative LWP chosen as described under the pstatus file above. If the process is a zombie, pr_nlwp and pr_lwp.pr_lwpid are zero and the other fields of pr_lwp are undefined.


The map File

Introduction

The map file contains information about the virtual address map of the process. The file contains an array of prmap structures, each of which describes a contiguous virtual address region in the address space of the traced process.

File format

The prmap structure contains the following members:

caddr_t pr_vaddr;        /* Virtual address */
ulong_t pr_size;         /* Size of mapping in bytes */
char    pr_mapname[32];  /* Name in /proc/pid/object */
off_t   pr_off;          /* Offset into mapped object, if any */
long    pr_mflags;       /* Protection and attribute flags */
long    pr_filler[9];    /* For future use */

Member description

Members of the map file are described below:

Member        Description
pr_vaddr      Virtual address of the mapping within the traced process
pr_size       Size of the mapping in bytes
pr_mapname    If not an empty string, contains the name of a file in the object directory that can be opened for reading to yield a file descriptor for the object to which the virtual address is mapped
pr_off        Offset within the mapped object (if any) to which the virtual address is mapped
pr_mflags     Protection and attribute flags (see below)
pr_filler     For future use

pr_mflags is a bit-mask of protection and attribute flags:

Flag         Description
MA_READ      Mapping is readable by the traced process
MA_WRITE     Mapping is writable by the traced process
MA_EXEC      Mapping is executable by the traced process
MA_SHARED    Mapping changes are shared by the mapped object


Contiguous address space

A contiguous area of the address space having the same underlying mapped object may appear as multiple mappings because of varying read, write, execute, and shared attributes. The underlying mapped object does not change over the range of a single mapping. An I/O operation to a mapping marked MA_SHARED fails if applied at a virtual address not corresponding to a valid page in the underlying mapped object. Reads and writes to private mappings always succeed. Reads and writes to unmapped addresses always fail.
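To make the layout concrete, here is a sketch of walking the map file (the pid is hypothetical; struct prmap is assumed to come from <sys/procfs.h>, and the array is assumed to end when no further full entry can be read):

/* Sketch: print every mapping of process 1234. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/procfs.h>

int main(void)
{
    struct prmap m;
    int fd = open("/proc/1234/map", O_RDONLY);

    if (fd < 0)
        return 1;
    while (read(fd, &m, sizeof(m)) == sizeof(m) && m.pr_size != 0)
        printf("addr %p, %lu bytes, object \"%s\"\n",
               (void *)m.pr_vaddr, (unsigned long)m.pr_size, m.pr_mapname);
    close(fd);
    return 0;
}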


The cred File

Introduction

The cred file contains a description of the credentials associated with the process.

File format

The file is formatted as a struct prcred containing the following members:

uid_t  pr_euid;       /* Effective user id */
uid_t  pr_ruid;       /* Real user id */
uid_t  pr_suid;       /* Saved user id (from exec) */
gid_t  pr_egid;       /* Effective group id */
gid_t  pr_rgid;       /* Real group id */
gid_t  pr_sgid;       /* Saved group id (from exec) */
uint_t pr_ngroups;    /* Number of supplementary groups */
gid_t  pr_groups[1];  /* Array of supplementary groups */

The list of associated supplementary groups in pr_groups is of variable length; pr_ngroups specifies the number of groups.


The sigact File

Introduction

The sigact file contains an array of sigaction structures describing the current dispositions of all signals associated with the traced process. Signal numbers are displaced by 1 from array indexes, so that the action for signal number n appears in position n-1 of the array.
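A short sketch of the indexing rule (the pid is hypothetical, and NSIG is used here only as an upper bound on the number of entries; the file may contain fewer):

/* Sketch: check whether process 1234 ignores SIGINT. */
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    struct sigaction acts[NSIG];
    int fd = open("/proc/1234/sigact", O_RDONLY);

    if (fd < 0)
        return 1;
    if (read(fd, acts, sizeof(acts)) >= (ssize_t)(SIGINT * sizeof(struct sigaction)))
        /* the action for signal n lives at array index n-1 */
        printf("SIGINT is %s\n",
               acts[SIGINT - 1].sa_handler == SIG_IGN ? "ignored" : "not ignored");
    close(fd);
    return 0;
}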


The lwp/lwpctl File

Introduction

The lwpctl file is a write-only control file. The messages written to this file affect only the associated LWP rather than the process as a whole (where appropriate).


The lwp/lwpstatus File

Introduction

The lwp/lwpstatus file contains LWP-specific state information. This information is also present in the status file of the process for its representative LWP.

File format

The file is formatted as a struct lwpstatus containing the following members:

long       pr_flags;              /* Flags */
short      pr_why;                /* Reason for stop (if stopped) */
short      pr_what;               /* More detailed reason */
lwpid_t    pr_lwpid;              /* Specific LWP identifier */
short      pr_cursig;             /* Current signal */
siginfo_t  pr_info;               /* Info associated with signal or fault */
struct sigaction pr_action;       /* Signal action for current signal */
sigset_t   pr_lwppend;            /* Set of LWP pending signals */
stack_t    pr_altstack;           /* Alternate signal stack info */
short      pr_syscall;            /* System call number (if in syscall) */
short      pr_nsysarg;            /* Number of arguments to this syscall */
long       pr_sysarg[PRSYSARGS];  /* Arguments to this syscall */
char       pr_clname[PRCLSZ];     /* Scheduling class name */
ucontext_t pr_context;            /* LWP context */
pfamily_t  pr_family;             /* Processor family-specific information */



Member description

Here is a description of members of the lwpstatus file:

Member       Description
pr_flags     A bit mask holding flags (described below)
pr_why       Reason for the LWP stop (if stopped). Possible values are listed below.
pr_what      More detailed reason for the LWP stop. pr_why and pr_what together describe the reason for a stopped LWP.
pr_lwpid     Specific LWP identifier.
pr_cursig    Names the current signal; that is, the next signal to be delivered to the LWP.
pr_info      When the LWP is in a PR_SIGNALLED or PR_FAULTED stop, pr_info contains additional information pertinent to the particular signal or fault (see sys/siginfo.h).
pr_action    Contains signal action information about the current signal (see sigaction(2)). It is undefined if pr_cursig is zero.
pr_lwppend   Identifies any synchronously-generated or LWP-directed signals pending for the LWP. Does not include signals pending at the process level.
pr_altstack  Contains the alternate signal stack information for the LWP (see sigaltstack(2)).
pr_syscall   Number of the system call, if any, being executed by the LWP. It is nonzero if and only if the LWP is stopped on PR_SYSENTRY or PR_SYSEXIT or is asleep within a system call (PR_ASLEEP is set).
pr_nsysarg   If pr_syscall is non-zero, pr_nsysarg is the number of arguments to the system call.
pr_sysarg    Array of arguments to the system call.
pr_clname    Contains the name of the scheduling class of the LWP.
pr_context   Contains the user context of the LWP, as if it had called getcontext(2). If the LWP is not stopped, all context values are undefined.
pr_family    Contains the CPU-family-specific information about the LWP. Use of this field is not portable across different architectures.


pr_flags

pr_flags is a bit-mask holding these flags:

Flag         Description
PR_STOPPED   LWP is stopped
PR_ISTOP     LWP is stopped on an event of interest (see PCSTOP)
PR_DSTOP     LWP has a stop directive in effect (see PCSTOP)
PR_STEP      LWP has a single-step directive in effect
PR_ASLEEP    LWP is in an interruptible sleep within a system call
PR_PCINVAL   LWP program counter register does not point to a valid address

pr_why

Possible values of pr_why are:

Value                     Description
PR_REQUESTED              Shows that the stop occurred in response to a stop directive, normally because PCSTOP was applied or because another LWP stopped on an event of interest and the asynchronous-stop flag (see PCSET) was not set for the process. pr_what is unused in this case.
PR_SIGNALLED              Shows that the LWP stopped on receipt of a signal (see PCSTRACE); pr_what holds the signal number that caused the stop (for a newly-stopped LWP, the same value is in pr_cursig).
PR_FAULTED                Shows that the LWP stopped on incurring a hardware fault (see PCSFAULT); pr_what holds the fault number that caused the stop.
PR_SYSENTRY, PR_SYSEXIT   Show a stop on entry to or exit from a system call (see PCSENTRY and PCSEXIT); pr_what holds the system call number.
PR_JOBCONTROL             Shows that the LWP stopped because of the default action of a job control stop signal (see sigaction(2)); pr_what holds the stopping signal number.


The lwp/lwpsinfo File

Introduction

The lwp/lwpsinfo file contains information about the LWP needed by ps(1). This information also is present in the psinfo file of the process for its representative LWP if it has one.

File format

The file is formatted as a struct lwpsinfo containing the following members:

ulong_t       pr_flag;           /* LWP flags */
lwpid_t       pr_lwpid;          /* LWP id */
caddr_t       pr_addr;           /* internal address of LWP */
caddr_t       pr_wchan;          /* wait addr for sleeping LWP */
uchar_t       pr_stype;          /* synchronization event type */
uchar_t       pr_state;          /* numeric scheduling state */
char          pr_sname;          /* printable character representing pr_state */
uchar_t       pr_nice;           /* nice for cpu usage */
int           pr_pri;            /* priority, high value = high priority */
timestruc_t   pr_time;           /* usr+sys cpu time for this LWP */
char          pr_clname[8];      /* Scheduling class name */
char          pr_name[PRFNSZ];   /* name of system LWP */
processorid_t pr_onpro;          /* processor on which LWP is running */
processorid_t pr_bindpro;        /* processor to which LWP is bound */
processorid_t pr_exbindpro;      /* processor to which LWP is exbound */

Platform-specific data

Some of the entries in lwpsinfo, such as pr_flag, pr_addr, pr_state, pr_stype, pr_wchan, and pr_name, refer to internal kernel data structures and should not be expected to retain their meanings across different versions of the operating system. They have no meaning to a program and are only useful for manual interpretation by a user aware of the implementation details.


Control Messages

Introduction

Process state changes are effected through messages written to the ctl file of the process or to the lwpctl file of an individual LWP.

Sending control messages

All control messages consist of an int naming the specific operation followed by additional data containing operands (if any). Multiple control messages can be combined in a single write(2) to a control file, but no partial writes are permitted; that is, each control message (operation code plus operands) must be presented in its entirety to the write and not in pieces over several system calls.

ENOENT

Note that writing a message to a control file for a process or LWP that has exited elicits the error ENOENT.

List of messages

Here is a list of the allowable control messages:

Control Message   Description
PCSTOP            Directs LWPs to stop and waits for them to stop
PCDSTOP           Directs LWPs to stop without waiting for them to stop
PCWSTOP           Waits for LWPs to stop
PCRUN             Makes an LWP runnable again after a stop
PCSTRACE          Defines a set of signals to be traced in the process
PCSSIG            Sets the current signal and its associated signal information
PCKILL            Sends a signal to the process or LWP; sending SIGKILL ends it immediately
PCUNKILL          Deletes a signal from the set of pending signals
PCSHOLD           Sets the held signals for the specific or chosen LWP according to the operand sigset_t structure
PCSFAULT          Defines a set of hardware faults to be traced in the process


PCSTOP, PCDSTOP, and PCWSTOP

Introduction

There are three control messages that stop LWPs. They perform in different ways. They are:
• PCSTOP
• PCDSTOP
• PCWSTOP

PCSTOP

When applied to the process control file, directs all LWPs to stop and waits for them to stop. Completes when every LWP has stopped on an event of interest. When applied to an LWP control file, directs the specific LWP to stop and waits until it has stopped. Completes when the LWP stops on an event of interest, immediately if already so stopped.

PCDSTOP

When applied to the process control file, directs all LWPs to stop without waiting for them to stop. When applied to an LWP control file, directs the specific LWP to stop without waiting for it to stop.

PCWSTOP

When applied to the process control file, simply waits for all LWPs to stop. Completes when every LWP has stopped on an event of interest. When applied to an LWP control file, simply waits for the LWP to stop. Completes when the LWP stops on an event of interest, immediately if already so stopped.

Event of interest

An event of interest is either a PR_REQUESTED stop or a stop that has been specified in the process's tracing flags (set by PCSTRACE, PCSFAULT, PCSENTRY, and PCSEXIT). A PR_JOBCONTROL stop is specifically not an event of interest. (An LWP may stop twice because of a stop signal; first showing PR_SIGNALLED if the signal is traced and again showing PR_JOBCONTROL if the LWP is set running without clearing the signal.) If PCSTOP or PCDSTOP is applied to an LWP that is stopped, but not on an event of interest, the stop directive takes effect when the LWP is restarted by the competing mechanism; at that time the LWP enters a PR_REQUESTED stop before executing any user-level code.



Blocked control messages

A write of a control message that blocks is interruptible by a signal so that, for example, an alarm(2) can be set to avoid waiting forever for a process or LWP that may never stop on an event of interest. If PCSTOP is interrupted, the LWP stop directives remain in effect even though the write returns an error.

System process

A system process (indicated by the PR_ISSYS flag) never executes at user level, has no user-level address space visible through /proc, and cannot be stopped. Applying PCSTOP, PCDSTOP, or PCWSTOP to a system process or any of its LWPs elicits the error EBUSY.


PCRUN

Introduction

The control message PCRUN makes an LWP runnable again after a stop. The operand is a set of flags, contained in a ulong_t, describing optional additional actions.

Flag descriptions

Here is a description of the flags contained in the operand of PCRUN:

Flag        Description
PRCSIG      Clears the current signal, if any (see PCSSIG)
PRCFAULT    Clears the current fault, if any (see PCCFAULT)
PRSTEP      Directs the LWP to execute a single machine instruction. On completion of the instruction, a trace trap occurs. If FLTTRACE is being traced, the LWP stops; otherwise it is sent SIGTRAP. If SIGTRAP is being traced and not held, the LWP stops. When the LWP stops on an event of interest the single-step directive is cancelled, even if the stop occurs before the instruction is executed. This operation requires hardware and operating system support and may not be implemented on all processors.
PRSABORT    Is significant only if the LWP is in a PR_SYSENTRY stop or is marked PR_ASLEEP; it instructs the LWP to abort execution of the system call (see PCSENTRY, PCSEXIT).
PRSTOP      Directs the LWP to stop again as soon as possible after resuming execution (see PCSTOP). In particular, if the LWP is stopped on PR_SIGNALLED or PR_FAULTED, the next stop will show PR_REQUESTED, no other stop will have intervened, and the LWP will not have executed any user-level code.

Using PCRUN on an LWP

When applied to an LWP control file, PCRUN makes the specific LWP runnable. The operation fails (EBUSY) if the specific LWP is not stopped on an event of interest.
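The sketch below resumes a stopped LWP with a single-step directive. It is illustrative only: the pid and LWP number are hypothetical, the constants are assumed to come from <sys/procfs.h>, and a real implementation must ensure the message buffer matches the layout the kernel expects (no padding is assumed between the operation code and its operand here):

/* Sketch: write PCRUN with the PRSTEP flag to an lwpctl file. */
#include <fcntl.h>
#include <unistd.h>
#include <sys/procfs.h>

int single_step_lwp(void)
{
    /* ulong_t operand, declared here as unsigned long */
    struct { int cmd; unsigned long flags; } msg = { PCRUN, PRSTEP };
    int fd = open("/proc/1234/lwp/1/lwpctl", O_WRONLY);

    if (fd < 0)
        return -1;
    write(fd, &msg, sizeof(msg));  /* operation code plus operand, one write */
    close(fd);
    return 0;
}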


Using PCRUN on a process

When applied to the process control file an LWP is chosen for the operation as described for /proc/pid/status. The operation fails (EBUSY) if the chosen LWP is not stopped on an event of interest. If PRSTEP or PRSTOP were requested, the chosen LWP is made runnable; otherwise, the chosen LWP is marked PR_REQUESTED. If as a result all LWPs are in the PR_REQUESTED stop state, they are all made runnable. Once an LWP has been made runnable by PCRUN, it is no longer stopped on an event of interest even if, because of a competing mechanism, it remains stopped.


PCSTRACE

Introduction

PCSTRACE defines a set of signals to be traced in the process: the receipt of one of these signals by an LWP causes the LWP to stop. The set of signals is defined using an operand sigset_t contained in the control message.

SIGKILL

Receipt of SIGKILL cannot be traced; if specified, it is silently ignored.

Held signals

If a signal that is included in a held signal set of an LWP is sent to the LWP, the signal is not received and does not cause a stop until it is removed from the held signal set, either by the LWP itself or by setting the held signal set with PCSHOLD or the PRSHOLD option of PCRUN.
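A sketch of a PCSTRACE message follows (the pid is hypothetical, the operation code is assumed to come from <sys/procfs.h>, and the exact message layout, including any padding before the sigset_t operand, must match what the kernel expects):

/* Sketch: ask that receipt of SIGSEGV stop the LWPs of process 1234. */
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>
#include <sys/procfs.h>

int trace_sigsegv(void)
{
    struct { int cmd; sigset_t set; } msg;
    int fd = open("/proc/1234/ctl", O_WRONLY);

    if (fd < 0)
        return -1;
    msg.cmd = PCSTRACE;
    sigemptyset(&msg.set);         /* start from the empty set */
    sigaddset(&msg.set, SIGSEGV);  /* trace only SIGSEGV */
    write(fd, &msg, sizeof(msg));
    close(fd);
    return 0;
}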


PCSSIG

Introduction

The current signal and its associated signal information for the specific or chosen LWP are set according to the contents of the operand siginfo structure (see sys/siginfo.h). If the specified signal number is zero, the current signal is cleared. An error (EBUSY) is returned if the LWP is not stopped on an event of interest. The semantics of this operation are different from those of kill(2), _lwp_kill(2), or PCKILL in that the signal is delivered to the LWP immediately after execution is resumed (even if the signal is being held) and an additional PR_SIGNALLED stop does not intervene even if the signal is being traced. Setting the current signal to SIGKILL ends the process immediately.


PCKILL, PCUNKILL

Introduction

PCKILL

If applied to the process control file, a signal is sent to the process with semantics identical to those of kill(2). If applied to an LWP control file, a signal is sent to the LWP with semantics identical to those of _lwp_kill(2). The signal is named in an operand int contained in the message. Sending SIGKILL ends the process or LWP immediately.

PCUNKILL

A signal is deleted, that is, it is removed from the set of pending signals. If applied to the process control file, the signal is deleted from the process’s pending signals. If applied to an LWP control file, the signal is deleted from the LWP’s pending signals. The current signal (if any) is unaffected. The signal is named in an operand int in the control message. It is an error (EINVAL) to attempt to delete SIGKILL.


PCSHOLD

Introduction

PCSHOLD sets the held signals for the specific or chosen LWP (signals whose delivery will be delayed if sent to the LWP) according to the operand sigset_t structure. SIGKILL and SIGSTOP cannot be held; if specified, they are silently ignored.


PCSFAULT

Introduction

PCSFAULT defines a set of hardware faults to be traced in the process: on incurring one of these faults an LWP stops. The set is defined via the operand fltset_t structure.

Fault names

Some fault names may not occur on all processors; there may be processor-specific faults in addition to these. Fault names include the following:

Fault Name   Description
FLTILL       Illegal instruction
FLTPRIV      Privileged instruction
FLTBPT       Breakpoint trap
FLTTRACE     Trace trap
FLTACCESS    Memory access fault (bus error)
FLTBOUNDS    Memory bounds violation
FLTIOVF      Integer overflow
FLTIZDIV     Integer zero divide
FLTFPE       Floating-point exception
FLTSTACK     Unrecoverable stack fault
FLTPAGE      Recoverable page fault

When not traced, a fault normally results in the posting of a signal to the LWP that incurred the fault. If an LWP stops on a fault, the signal is posted to the LWP when execution is resumed unless the fault is cleared by PCCFAULT or by the PRCFAULT option of PCRUN. FLTPAGE is an exception; no signal is posted. There may be additional processor-specific faults like this.


pr_info field

The pr_info field in /proc/pid/status or in /proc/pid/lwp/lwp#/lwpstatus identifies the signal to be sent and contains machine-specific information about the fault.

The remaining control messages are summarized below and described on the following pages:

Control Message     Description
PCCFAULT            The current fault (if any) is cleared; the associated signal is not sent to the specific or chosen LWP.
PCSENTRY, PCSEXIT   Instruct the process's LWPs to stop on entry to or exit from specified system calls.
PCSET               Sets one or more modes of operation for the traced process.
PCRESET             Resets these modes. The modes to be set or reset are specified by flags in an operand long in the control message.
PCSREG              Sets the general registers for the specific or chosen LWP according to the operand gregset_t structure.
PCSFPREG            Sets the floating-point registers for the specific or chosen LWP according to the operand fpregset_t structure.
PCNICE              Sets the LWP's nice(2) priority.

PCCFAULT

The current fault (if any) is cleared; the associated signal is not sent to the specific or chosen LWP.


PCSENTRY, PCSEXIT

These control operations instruct the process’s LWPs to stop on entry to or exit from specified system calls. The set of system calls to be traced is defined via an operand sysset_t structure. When entry to a system call is being traced, an LWP stops after having begun the call to the system but before the system call arguments have been fetched from the LWP. When exit from a system call is being traced, an LWP stops on completion of the system call just before checking for signals and returning to user level. At this point all return values have been stored into the LWP’s registers. If an LWP is stopped on entry to a system call (PR_SYSENTRY) or when sleeping in an interruptible system call (PR_ASLEEP is set), it may be instructed to go directly to system call exit by specifying the PRSABORT flag in a PCRUN control message. Unless exit from the system call is being traced the LWP returns to user level showing error EINTR.

PCSET

PCSET sets one or more modes of operation for the traced process.


PCRESET

PCRESET resets these modes. The modes to be set or reset are specified by flags in an operand long in the control message. The flags are described below:

Flag                            Description
PR_FORK (inherit-on-fork)       When set, the tracing flags of the process are inherited by the child of a fork(2) or vfork(2). When reset, child processes start with all tracing flags cleared.
PR_RLC (run-on-last-close)      When set and the last writable /proc file descriptor referring to the traced process or any of its LWPs is closed, all the tracing flags of the process are cleared, any outstanding stop directives are canceled, and if any LWPs are stopped on events of interest, they are set running as though PCRUN had been applied to them. When reset, the process's tracing flags are retained and LWPs are not set running on last close.
PR_KLC (kill-on-last-close)     When set and the last writable /proc file descriptor referring to the traced process or any of its LWPs is closed, the process is exited with SIGKILL.
PR_ASYNC (asynchronous-stop)    When set, a stop on an event of interest by one LWP does not directly affect any other LWP in the process. When reset and an LWP stops on an event of interest other than PR_REQUESTED, all other LWPs in the process are directed to stop.




EINVAL

It is an error (EINVAL) to specify flags other than those described above or to apply these operations to a system process. The current modes are reported in the pr_flags field of /proc/pid/status.

PCSREG

PCSREG sets the general registers for the specific or chosen LWP according to the operand gregset_t structure. There may be machine-specific restrictions on the allowable set of changes. PCSREG fails (EBUSY) if the LWP is not stopped on an event of interest.

PCSFPREG

PCSFPREG sets the floating-point registers for the specific or chosen LWP according to the operand fpregset_t structure. An error (EINVAL) is returned if the system does not support floating-point operations (no floating-point hardware and the system does not emulate floating-point machine instructions). PCSFPREG fails (EBUSY) if the LWP is not stopped on an event of interest.

PCNICE

The traced (or chosen) LWP’s nice(2) priority is incremented by the amount contained in the operand int. Only the super-user may better an LWP’s priority in this way, but any user may make the priority worse. This operation is significant only when applied to an LWP in the time-sharing scheduling class.


Directories

Introduction

Object directory

The object directory contains read-only files with names as they appear in the entries of the map file, corresponding to objects mapped into the address space of the target process. Opening such a file yields a descriptor for the mapped file associated with a particular address-space region. The name a.out also appears in the directory as a synonym for the executable file associated with the ‘‘text’’ of the running process. The object directory makes it possible for a controlling process to get access to the object file and any shared libraries (and consequently the symbol tables)--in general, any mapped files--without having to know the specific path names of those files.
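For example, the executable of a target process could be opened without knowing its path name, using the documented a.out synonym (a sketch; the pid is hypothetical):

#include <fcntl.h>

/* Sketch: get a read-only descriptor for the target's executable. */
int open_target_text(void)
{
    return open("/proc/1234/object/a.out", O_RDONLY);
}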

lwp directory

The lwp directory contains entries each of which names an LWP within the containing process. These entries are directories containing additional files and are described earlier in this unit.


Code Example

Introduction

The following code is a simple example of how one process can use the /proc filesystem to access the state of another. Provided with a single argument (the id of a currently running process), it prints the name and arguments of the process from the psinfo structure.

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/procfs.h>   /* struct psinfo */

int main(int argc, char **argv)
{
    char fname[512];
    struct psinfo p;
    int fd;

    /* check for an argument */
    if (argc != 2)
        exit(1);
    sprintf(fname, "/proc/%s/psinfo", argv[1]);
    /* check that the process id is still running */
    if (access(fname, F_OK) < 0)
        exit(1);
    fd = open(fname, O_RDONLY);
    if (fd < 0 || read(fd, &p, sizeof(struct psinfo)) < 0)
        exit(1);
    printf("process pid %s: exec path/args: %s %s\n",
           argv[1], p.pr_fname, p.pr_psargs);
    close(fd);
    return 0;
}
