6.3. Booting Zones

Although the virtualization that zones provide is spread throughout the source code, the primary kernel implementation can be found in zone.c. As with many Solaris frameworks, a large block comment at the start of the file is very useful for understanding the lay of the land. Besides describing the data structures and locking strategy used for zones, it describes the states a zone can be in from the kernel's perspective and the points at which a zone may transition from one state to another. For brevity, only the states covered during a zone boot are listed here:

 *   Zone States:
 *
 *   The states in which a zone may be in and the transitions are as
 *   follows:
 *
 *   ZONE_IS_UNINITIALIZED: primordial state for a zone. The partially
 *   initialized zone is added to the list of active zones on the system but
 *   isn't accessible.
 *
 *   ZONE_IS_READY: zsched (the kernel dummy process for a zone) is
 *   ready.  The zone is made visible after the ZSD constructor callbacks are
 *   executed.  A zone remains in this state until it transitions into
 *   the ZONE_IS_BOOTING state as a result of a call to zone_boot().
 *
 *   ZONE_IS_BOOTING: in this shortlived-state, zsched attempts to start
 *   init.  Should that fail, the zone proceeds to the ZONE_IS_SHUTTING_DOWN
 *   state.
 *
 *   ZONE_IS_RUNNING: The zone is open for business: zsched has
 *   successfully started init.  A zone remains in this state until
 *   zone_shutdown() is called.

                                                        See os/zone.c


It is important to note that there are a number of zone states not represented here; those are for zones that do not (yet) have a kernel context. An example of such a state is for a zone that is in the process of being installed. These states are defined in libzonecfg.h.

One of the players in the zone boot dance is the zoneadmd process, which runs in the global zone and performs a number of critical tasks. Although much of the virtualization for a zone is implemented in the kernel, zoneadmd manages a great deal of a zone's infrastructure, as outlined in zoneadmd.c:

/*
 * zoneadmd manages zones; one zoneadmd process is launched for each
 * non-global zone on the system.  This daemon juggles four jobs:
 *
 * - Implement setup and teardown of the zone "virtual platform": mount and
 *   unmount filesystems; create and destroy network interfaces; communicate
 *   with devfsadmd to lay out devices for the zone; instantiate the zone
 *   console device; configure process runtime attributes such as resource
 *   controls, pool bindings, fine-grained privileges.
 *
 * - Launch the zone's init(1M) process.
 *
 * - Implement a door server; clients (like zoneadm) connect to the door
 *   server and request zone state changes.  The kernel is also a client of
 *   this door server.  A request to halt or reboot the zone which originates
 *   *inside* the zone results in a door upcall from the kernel into zoneadmd.
 *
 *   One minor problem is that messages emitted by zoneadmd need to be passed
 *   back to the zoneadm process making the request.  These messages need to
 *   be rendered in the client's locale; so, this is passed in as part of the
 *   request.  The exception is the kernel upcall to zoneadmd, in which case
 *   messages are syslog'd.
 *
 *   To make all of this work, the Makefile adds -a to xgettext to extract
 *   *all* strings, and an exclusion file (zoneadmd.xcl) is used to exclude
 *   those strings which do not need to be translated.
 *
 * - Act as a console server for zlogin -C processes; see comments in zcons.c
 *   for more information about the zone console architecture.
 *
 * DESIGN NOTES
 *
 * Restart:
 *   A chief design constraint of zoneadmd is that it should be restartable in
 *   the case that the administrator kills it off, or it suffers a fatal error,
 *   without the running zone being impacted; this is akin to being able to
 *   reboot the service processor of a server without affecting the OS
 *   instance.
 */
                                                        See zoneadmd.c


When a user wishes to boot a zone, zoneadm attempts to contact zoneadmd via a door that is used by all three components (zoneadm, zoneadmd, and the kernel) for a number of things, including coordinating zone state changes. If for some reason zoneadmd is not running, an attempt is made to start it. Once that has completed, zoneadm tells zoneadmd to boot the zone by supplying the appropriate zone_cmd_arg_t request via a door call. It is worth noting that the same door is used by zoneadmd to return messages to the user executing zoneadm, and also as a way for zoneadm to indicate to zoneadmd the locale of the user executing the boot command, so that messages can be localized appropriately.
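
To make the mechanics concrete, the following is a minimal sketch (not the actual zoneadm source) of how a client drives the zoneadmd door using the standard doors API. The zone_cmd_arg_t and zone_cmd_rval_t layouts and the Z_BOOT command value shown here are simplified stand-ins; the authoritative definitions live in zoneadmd.h.

#include <door.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <locale.h>

/* Simplified stand-ins for the structures defined in zoneadmd.h. */
typedef enum { Z_BOOT, Z_HALT, Z_REBOOT } zone_cmd_t;
typedef struct {
        zone_cmd_t cmd;
        char locale[64];        /* client locale, for message rendering */
} zone_cmd_arg_t;
typedef struct {
        int rval;               /* zero on success */
} zone_cmd_rval_t;

static int
request_zone_boot(const char *doorpath)
{
        zone_cmd_arg_t arg;
        zone_cmd_rval_t rval;
        door_arg_t darg;
        int fd, err;

        (void) memset(&arg, 0, sizeof (arg));
        arg.cmd = Z_BOOT;
        (void) strlcpy(arg.locale, setlocale(LC_ALL, NULL),
            sizeof (arg.locale));

        if ((fd = open(doorpath, O_RDONLY)) == -1)
                return (-1);

        darg.data_ptr = (char *)&arg;   /* request: command + locale */
        darg.data_size = sizeof (arg);
        darg.desc_ptr = NULL;           /* no descriptors passed */
        darg.desc_num = 0;
        darg.rbuf = (char *)&rval;      /* server writes its result here */
        darg.rsize = sizeof (rval);

        err = door_call(fd, &darg);
        (void) close(fd);
        /* door_call() may substitute its own reply buffer; use darg.rbuf. */
        return (err == 0 ? ((zone_cmd_rval_t *)darg.rbuf)->rval : -1);
}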

Looking at the door server that zoneadmd implements, there is some straightforward sanity checking of the argument passed via the door call, as well as use of the process privileges technology introduced in Solaris 10 (see Chapter 5):

if (door_ucred(&uc) != 0) {
        zerror(&logsys, B_TRUE, "door_ucred");
        goto out;
}
eset = ucred_getprivset(uc, PRIV_EFFECTIVE);
if (ucred_getzoneid(uc) != GLOBAL_ZONEID ||
    (eset != NULL ? !priv_ismember(eset, PRIV_SYS_CONFIG) :
    ucred_geteuid(uc) != 0)) {
        zerror(&logsys, B_FALSE, "insufficient privileges");
        goto out;
}
kernelcall = ucred_getpid(uc) == 0;

/*
 * This is safe because we only use a zlog_t throughout the
 * duration of a door call; i.e., by the time the pointer
 * might become invalid, the door call would be over.
 */
zlog.locale = kernelcall ? DEFAULT_LOCALE : zargp->locale;
                                                        See zoneadmd.c


Using door_ucred, the user credential can be checked to determine whether the request originated in the global zone,[1] whether the user making the request had sufficient privilege to do so,[2] and whether the request was the result of an upcall from the kernel. That last piece of information is used, among other things, to determine whether messages should be localized by localize_msg.

[1] This is a bit of defensive programming: unless the global zone administrator were to make the door in question available through the non-global zone's own file system, there would be no way for a privileged user in a non-global zone to actually access the door used by zoneadmd.

[2] zoneadm itself checks that the user attempting to boot a zone has the necessary privilege, but it is possible that some other privileged process in the global zone might have access to the door yet lack the necessary PRIV_SYS_CONFIG privilege.

It is within the door server implemented by zoneadmd that transitions from one state to another take place. There are two states from which a zone boot is permissible: installed and ready. From the installed state, zone_ready is used to create and bring up the zone's virtual platform, which consists of the zone's kernel context (created using zone_create) as well as the zone's specific file systems (including the root file system) and logical network interfaces. If a zone is supposed to be bound to a non-default resource pool, that binding also takes place as part of this state transition.
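
A condensed, illustrative sketch of that dispatch logic follows; this is not the verbatim zoneadmd source, and the helper signatures and the zargp->bootbuf field are simplified here:

/*
 * Sketch of the door server's boot dispatch: from "installed", the
 * virtual platform must be constructed first; from "ready", the zone
 * can be booted directly.
 */
switch (zstate) {
case ZONE_STATE_INSTALLED:
        if (zone_ready(zlogp) != 0) {   /* build the virtual platform */
                rval = -1;
                break;
        }
        /* FALLTHROUGH: the zone is now ready */
case ZONE_STATE_READY:
        rval = zone_bootup(zlogp, zargp->bootbuf);
        break;
default:
        zerror(zlogp, B_FALSE, "zone cannot be booted from its "
            "current state");
        rval = -1;
        break;
}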

When a zone's kernel context is created using zone_create, a zone_t structure is allocated and initialized, and the status of the zone is set to ZONE_IS_UNINITIALIZED. Some of this initialization sets up the security boundary that isolates processes running inside the zone. For example, the vnode_t of the zone's root file system, the zone's kernel credentials, and the privilege sets of the zone's future processes are all initialized here.
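
The following fragment gives a condensed picture of that early initialization. It is a sketch distilled from zone_create rather than verbatim source, and the ordering is simplified, though fields such as zone_rootvp, zone_kcred, and zone_privset do exist in the real zone_t:

zone = kmem_zalloc(sizeof (zone_t), KM_SLEEP);
zone->zone_id = zoneid;
zone->zone_status = ZONE_IS_UNINITIALIZED;

/*
 * Resolve and hold the vnode of the zone's root (zone_rootpath was
 * copied in earlier); the file system side of the security boundary
 * hangs off this vnode.
 */
error = lookupname(zone->zone_rootpath, UIO_SYSSPACE, FOLLOW,
    NULLVPP, &zone->zone_rootvp);

/* Kernel credential used for the zone's kernel threads. */
zone->zone_kcred = crdup(kcred);

/*
 * Privilege limit for the zone's future processes, copied in from
 * the userland caller (zone_privs is a zone_create argument).
 */
zone->zone_privset = kmem_alloc(sizeof (priv_set_t), KM_SLEEP);
error = copyin(zone_privs, zone->zone_privset, sizeof (priv_set_t));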

Before returning to zoneadmd, zone_create adds the primordial zone to a doubly linked list and two hash tables,[3] one hashed by zone name and the other by zone ID. These data structures are protected by the zonehash_lock mutex, which is dropped after the zone has been added. Finally, a new kernel process, zsched, is created; it is where kernel threads for this zone are parented. After calling newproc to create this kernel process, zone_create waits, using zone_status_wait, until the zsched kernel process has completed initializing the zone and has set its status to ZONE_IS_READY.
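
Condensed, the publication and handoff look roughly like this; error handling is omitted, the newproc arguments shown are simplified, and mod_hash is the kernel's generic hash table facility:

mutex_enter(&zonehash_lock);
list_insert_tail(&zone_active, zone);     /* doubly linked list of zones */
(void) mod_hash_insert(zonehashbyid,
    (mod_hash_key_t)(uintptr_t)zone->zone_id, (mod_hash_val_t)zone);
(void) mod_hash_insert(zonehashbyname,
    (mod_hash_key_t)zone->zone_name, (mod_hash_val_t)zone);
mutex_exit(&zonehash_lock);

/*
 * Create zsched (zarg carries the zone pointer and boot parameters),
 * then block until zsched finishes initializing the zone and moves it
 * to ZONE_IS_READY; zone_status_wait() sleeps on the zone's condition
 * variable.
 */
error = newproc(zsched, (caddr_t)&zarg, syscid, minclsyspri, NULL);
if (error == 0)
        zone_status_wait(zone, ZONE_IS_READY);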

[3] Both of these are worth examining in the Solaris source base.

Since initialization of the new process's user structure has not been completed, the first thing the new zsched process does is finish that initialization, along with reparenting itself to PID 1 (the global zone's init process). And since the future processes to be run within the new zone may be subject to resource controls, that initialization also takes place here, in the context of zsched.

After grabbing the zone_status_lock mutex in order to set the status to ZONE_IS_READY, zsched suspends itself, waiting for the zone's status to be changed to ZONE_IS_BOOTING.
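
The handoff pattern is brief enough to show in full; this is a condensed rendering of the relevant lines of zsched:

/*
 * Mark the zone ready; zone_status_set() must be called with
 * zone_status_lock held, and it cv_broadcast()s the zone's condition
 * variable, waking zone_create().
 */
mutex_enter(&zone_status_lock);
zone_status_set(zone, ZONE_IS_READY);
mutex_exit(&zone_status_lock);

/*
 * Park until zoneadmd (via the zone_boot system call) advances the
 * zone's status to ZONE_IS_BOOTING.  The _cpr variant cooperates
 * with checkpoint/resume while sleeping.
 */
zone_status_wait_cpr(zone, ZONE_IS_BOOTING, "zsched");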

Once the zone is in the ready state, zone_create returns control to zoneadmd, and the door server continues the boot process by calling zone_bootup. This initializes the zone's console device, mounts some of the standard Solaris file systems such as /proc and /etc/mnttab, and then uses the zone_boot system call to attempt to boot the zone.
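
As a sketch, the shape of zone_bootup is roughly the following. The setup_console and mount_early_filesystems helpers are hypothetical stand-ins for the console and file system setup described above, and the zone_boot wrapper's exact signature may differ between releases:

static int
zone_bootup(zlog_t *zlogp, const char *bootargs)
{
        zoneid_t zoneid;

        if ((zoneid = getzoneidbyname(zone_name)) == -1)
                return (-1);

        /* Hypothetical helpers for the work described in the text. */
        if (setup_console(zlogp) != 0 ||
            mount_early_filesystems(zlogp, zoneid) != 0)
                return (-1);

        /*
         * Hand off to the kernel.  zone_boot() does not return until
         * the zone reaches ZONE_IS_RUNNING or the boot fails.
         */
        if (zone_boot(zoneid, bootargs) == -1) {
                zerror(zlogp, B_TRUE, "unable to boot zone");
                return (-1);
        }
        return (0);
}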

As the comment that introduces zone_boot points out, most of the heavy lifting has already been done, either by zoneadmd or by the work the kernel has done through zone_create. At this point, zone_boot saves the requested boot arguments after grabbing the zonehash_lock mutex, and then grabs the zone_status_lock mutex in order to set the zone status to ZONE_IS_BOOTING. After dropping both locks, zone_boot suspends itself, waiting for the zone status to be set to ZONE_IS_RUNNING.
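
Condensed from the kernel's zone_boot (the bootargs copyin and the error paths are omitted):

mutex_enter(&zonehash_lock);
zone = zone_find_all_by_id(zoneid);       /* lookup with the lock held */
/* ... the requested boot arguments are saved in the zone_t here ... */

mutex_enter(&zone_status_lock);
zone_status_set(zone, ZONE_IS_BOOTING);   /* wakes zsched */
mutex_exit(&zone_status_lock);
mutex_exit(&zonehash_lock);

/*
 * Sleep until zsched has started init (or failed trying); the
 * outcome is recorded in zone_boot_err.
 */
zone_status_wait(zone, ZONE_IS_RUNNING);
err = zone->zone_boot_err;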

Since the zone's status has now been set to ZONE_IS_BOOTING, zsched continues where it left off when it suspended itself with its call to zone_status_wait_cpr. After checking that the current zone status is indeed ZONE_IS_BOOTING, a new kernel process is created to run init in the zone. This process calls zone_icode, which is analogous to the traditional icode function used to start init in the global zone and in traditional UNIX environments. After doing some zone-specific initialization, each of the icode functions ends up calling exec_init to actually exec the init process, after copying out the path to the executable, /sbin/init, and the boot arguments. If the exec is successful, zone_icode sets the zone's status to ZONE_IS_RUNNING, and in the process zone_boot picks up where it had been suspended. At this point, the value of zone_boot_err indicates whether the zone boot was successful and is used to set the errno value returned to zoneadmd.
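
The tail of that sequence, condensed (the exec_init signature shown is simplified):

/*
 * Exec /sbin/init inside the zone; the boot arguments were saved in
 * the zone_t earlier by zone_boot().
 */
error = exec_init("/sbin/init", zone->zone_bootargs);

mutex_enter(&zone_status_lock);
if (error == 0) {
        /* Wakes zone_boot(), which is blocked in zone_status_wait(). */
        zone_status_set(zone, ZONE_IS_RUNNING);
} else {
        /*
         * Record the failure; per the state comment at the top of
         * zone.c, the zone then heads to ZONE_IS_SHUTTING_DOWN, which
         * also wakes zone_boot().
         */
        zone->zone_boot_err = error;
}
mutex_exit(&zone_status_lock);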

There are two additional things to note about the zone's transition to the running state. First, audit_put_record is called to generate an event for the Solaris auditing system, so that it is known which user executed which command to boot the zone. In addition, an internal zoneadmd event is generated to indicate on the zone's console device that the zone is booting. This internal stream of events is sent by the door server to the zone console subsystem for all state transitions, so that the console user can see which state the zone is transitioning to.



