|
DataMuseum.dkPresents historical artifacts from the history of: CP/M |
This is an automatic "excavation" of a thematic subset of
See our Wiki for more about CP/M Excavated with: AutoArchaeologist - Free & Open Source Software. |
top - metrics - download
Length: 192640 (0x2f080) Types: TextFile Names: »D152«
└─⟦ae2411776⟧ Bits:30008864 Diskette med tekster der formodes at være 31-D-152…161 └─⟦this⟧ »D152«
8DESIGN OF A PORTABLE FILE SYSTEM5 8NUCLEUS5 Peter Mikkelsen Datalogisk Institut ved Københavns Universitet September 1980 A_B_S_T_R_A_C_T_: The report argues that the emergence of high-level sys- _Æ_æ_å_ tem programming languages makes it an attractive and practicable _?_>_å_ task to design a file system suited for use on even very diffe- rent computers. Objectives for such a design are presented, and a _Æ_æ_å_ file system nucleus is designed. This nucleus offers: hierarchi- ___å_ cal naming of files, simple authorization enforcement support, _?_>_å_ extendable facilities of resource management and integrity pro- tection. Only the most basic functions for retrieval and updating of the actual contents of files are offered. K_E_Y_W_O_R_D_S_: File system, file organization, naming of files, autho- rization enforcement. This report exists as an A/S Regnecentralen af 1979 report under the registration RCSL No 31-D 608. \f i T_A_B_L_E_ _O_F_ _C_O_N_T_E_N_T_S_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _P_A_G_E_ 1. INTRODUCTION .......................................... 1 1.1 Background ....................................... 1 1.2 Scope of the Project ............................. 2 1.3 This Report ...................................... 4 2. MODELS FOR STORAGE AND RETRIEVAL OF BACKING STORAGE DATA .................................................. 5 2.1 A Model for a Hierarchical File System ........... 6 2.1.1 Access Methods (AM) ....................... 8 2.1.2 Logical File System (LFS) ................. 8 2.1.3 Basic File System (BFS) ................... 8 2.1.4 File Organization Strategy Modules (FOSM) . 9 2.1.5 Device Strategy Modules (DSM) ............. 9 2.1.6 Input/Output Control System (IOCS) ........ 9 2.2 Functions of the File System: an Illustration .... 10 3. ENVIRONMENTAL CONSTRAINTS ON THE FILE SYSTEM .......... 12 3.1 Characteristics of Processor ..................... 12 3.1.1 Objectives ................................ 13 3.2 Characteristics of Backing Storage ............... 13 3.2.1 Objectives ................................ 14 3.3 Characteristics of Data .......................... 14 3.3.1 Objectives ................................ 18 3.4 Characteristics of Users ......................... 18 3.4.1 Objectives ................................ 19 3.5 Protection Against Errors ........................ 20 3.5.1 Objectives ................................ 21 4. DESIGN OF THE INDIVIDUAL LEVELS ....................... 22 5. BASIC FILE SYSTEM (BFS) ............................... 23 5.1 Objectives ....................................... 23 5.2 Design ........................................... 23 5.2.1 Organization of the Descriptors ........... 23 \f ii T_A_B_L_E_ _O_F_ _C_O_N_T_E_N_T_S_ _(_c_o_n_t_i_n_u_e_d_)_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _P_A_G_E_ 5.2.2 Redundancy Concerning Location of Descriptors ............................... 27 5.2.3 Contents of Descriptors ................... 30 5.3 Functions of the BFS-Level ....................... 31 5.4 Concluding Remarks ............................... 32 6. FILE ORGANIZATION STRATEGY MODULE (FOSM) .............. 33 6.1 Objectives ....................................... 33 6.2 Design ........................................... 34 6.2.1 Page Organization of Files ................ 34 6.2.2 Organization of Index Tables .............. 37 6.2.2.1 Structure of the Individual Indices .......................... 38 6.2.2.2 The Address Space of the Individual Indices ............... 39 6.2.2.3 Representation of Index Tables in Core ............................. 40 6.2.2.4 Single-Level or Multi-Level Indices .......................... 43 6.2.3 Dynamics of Files ......................... 43 6.2.4 Unreadability of Deallocated Pages ........ 45 6.3 Functions of the FOSM-Level ...................... 46 6.4 Concluding Remarks ............................... 47 7. DEVICE STRATEGY MODULES (DSM) ......................... 48 7.1 Objectives ....................................... 48 7.2 Design ........................................... 49 7.2.1 Sequencing Strategies ..................... 49 7.2.2 Allocation/Deallocation of Pages, Administration ............................ 49 7.2.3 Allocation/Deallocation of Pages, Strategies ................................ 52 7.2.4 Marking of Malfunctioning Pages ........... 54 7.3 Functions of the DSM-Level ....................... 56 7.4 Concluding Remarks ............................... 56 \f iii T_A_B_L_E_ _O_F_ _C_O_N_T_E_N_T_S_ _(_c_o_n_t_i_n_u_e_d_)_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _P_A_G_E_ 8. INPUT OUTPUT CONTROL SYSTEM (IOCS) .................... 57 8.1 Objectives ....................................... 57 8.2 Design ........................................... 57 8.3 Functions of the IOCS-Level ...................... 59 8.4 Concluding Remarks ............................... 59 9. ACCESS METHODS (AM) ................................... 60 10. LOGICAL FILE SYSTEM (LFS) ............................. 61 10.1 Objectives ....................................... 61 10.2 Design ........................................... 62 10.2.1 A Model for Naming ........................ 62 10.2.1.1 Special Objectives of Naming ..... 63 10.2.1.2 Names: An Abstract Model ......... 63 10.2.2 How Should Files be Named? ................ 66 10.2.2.1 Restrictions on the Digraph Structure ........................ 66 10.2.2.2 The Function NAME ................ 69 10.2.2.3 The Function DEFINECONTEXT ....... 70 10.2.2.4 Summary of Naming Restrictions ... 70 10.2.3 Structure of the Catalogs ................. 70 10.2.4 Authorization Enforcement ................. 72 10.2.5 Deleting of Files ......................... 75 10.3 Basic Functions of the LFS ....................... 79 10.4 Concluding Remarks ............................... 81 11. REMARKS ON IMPLEMENTATION ............................. 82 11.1 Structuring the File System ...................... 82 11.2 Transparancy of Hierarchical Levels .............. 83 12. A PRELIMINARY REFERENCE MANUAL ........................ 86 12.1 Introduction ..................................... 86 12.2 Components of the Backing Storage ................ 86 12.2.1 Volumes ................................... 86 12.2.2 Pages ..................................... 87 \f iv T_A_B_L_E_ _O_F_ _C_O_N_T_E_N_T_S_ _(_c_o_n_t_i_n_u_e_d_)_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _P_A_G_E_ 12.2.3 Files ..................................... 87 12.2.4 Catalogs .................................. 89 12.2.5 Volumedescriptions ........................ 90 12.3 Functions of the File System ..................... 91 12.3.1 Volume Handling ........................... 91 12.3.2 Catalog Handling .......................... 91 12.3.3 File Handling ............................. 92 12.3.4 Authorization Enforcement ................. 94 13. CONCLUSION ............................................ 97 A_P_P_E_N_D_I_X_: A. REFERENCES ............................................ 99 \f F_ 1_._ _ _ _ _ _ _ _ _I_N_T_R_O_D_U_C_T_I_O_N_ 1. The file system is an essential part of the operating system of any modern computer system. The importance of a well-suited file system has been growing -- and will no doubt continue to do so in the future -- concurrently with the need for, and development of still more sophisticated and potent backing storage devices. Distributed systems with configurations of several -- possibly very different-sized -- computers, have become increasingly prevalent. This has accentuated the need for compatible file systems, allowing easy communication of data between installa- tions -- for instance via exchange of backing storage media. The emergence of high-level system programming languages like Pascal (Jensen and Wirth 1975), Concurrent Pascal (Brinch Hansen 1977), Platon (Staunstrup et al. 1975), Beta (Kristensen et al. 1979), Modula (Wirth 1977), Ada (Wegner 1979), etc. has made it an attractive and practicable task to design a file system suited for use on even very different computer systems (yet with at least one thing in common: access to a compiler for the high level language in question). Such a portable file system has the additional advantage of reducing implementation expenses: it is less expensive to imple- ment o_n_e_ file system which can be used on a number of computer configurations, than implementing s_e_v_e_r_a_l_ individual file systems for the different configurations. To design such a file system (or at least the nucleus of such) is the goal of this project. 1_._1_ _ _ _ _ _ _ _B_a_c_k_g_r_o_u_n_d_ 1.1 The work of the present report was initiated in april 1980 in connection with a project at A/S Regnecentralen af 1979. \f The aim of that project was to design a file system for the RC700 microcomputer (based on the Zilog Z80 8-bit micro-processor). The file system were to be implemented in the high-level language Pascal80 (Staunstrup 1980), which is developed on basis of the language Platon (Staunstrup et al. 1975). It was soon decided that the use of Pascal80 for implementation could be utilized, allowing the file system to be used on other computer systems. It was therefore necessary to reconsider the design of the file system, as the premises were changed. The result of these re-con- siderations is presented in this report. 1_._2_ _ _ _ _ _ _ _S_c_o_p_e_ _o_f_ _t_h_e_ _P_r_o_j_e_c_t_ 1.2 Design of a file system for a range of computers might seem an impossible task to undertake. How could a single file system possibly encompass environments, ranging from single-user micro-computers with less than 64 Kb of core storage -- to a large-scale multi-user, multi-processor computer equipped with several kinds of backing storage devices? Well, it might be infeasible to create a g_e_n_e_r_a_l_ _p_u_r_p_o_s_e_ file system offering an assorted range of facilities, which would encompass the full range of configuration possibilities. That is the reason why the philosophy of this project is to con- struct a n_u_c_l_e_u_s_ for such a file system. That nucleus must offer a restricted -- but hopefully versatile -- f_u_n_d_a_m_e_n_t_ for superstructuring levels of higher abstraction and further facilities. The facilities may be elaborated diffe- rently to any extent, on the various computers. On the other hand, the nucleus must be able to perform as a file system in itself -- without built-on extra facilities. \f The notable point is that the proposed hierarchical philosophy ensures exchange of data between considerably differenced com- puter systems, since the lowest levels -- the nucleus -- are the same on any of the systems. With the above described design philosophy in mind, the defini- tions and scope of the file system presented in this report, could be stated as follows: - the file system should give access to the backing storage of a computer on an abstract level of f_i_l_e_s_. - a file is a named sequence of b_y_t_e_s_ presumably containing d_a_t_a_. The file is split into a number of fixed size p_a_g_e_s_. A file is described by, and identified through, various admin- istrative datastructures, a.o. the file d_e_s_c_r_i_p_t_o_r_ and one or more file n_a_m_e_s_ (the terms will be defined more precisely at a later point). - the file system offers only the most r_u_d_i_m_e_n_t_a_r_y_ _o_p_e_r_a_t_i_o_n_s_ to access and maintain the files. That is: it is not consi- dered part of the file system nucleus to offer acces to var- iable size records, as well as multibuffered i/o and simi- lars are considered extensions which higher levels of an ac- tual system should offer. - s_p_e_c_i_a_l_i_z_e_d_ _a_c_c_e_s_s_ _m_e_t_h_o_d_s_ (e.g. sequential access, indexed sequential access, hashed access, list-structured access etc.) -- are not considered part of the file system nucleus. - only r_a_n_d_o_m_ _a_c_c_e_s_s_ backing storage is considered. That is to say that devices like magnetic tape, paper tape, punched cards etc. is not considered in this report -- it is up to higher levels to implement uniform access ("generalized file system") even to such devices. \f - multi-user r_e_s_o_u_r_c_e_ _a_l_l_o_c_a_t_i_o_n_ _p_r_o_b_l_e_m_s_ and consequent scheduling problems are not considered part of the file system nucleus. 1_._3_ _ _ _ _ _ _ _T_h_i_s_ _R_e_p_o_r_t_ 1.3 The present report is written for persons who want a detailed knowledge of the considerations which led to the proposed file system design. The reader is supposed to be familiar with ordinary system pro- gramming techniques, and have a general knowledge of file systems -- at least from the application programmer>s point of view. Chapter 2 introduces a hierarchical model, modularizing the structure of file systems. The model offers the conceptual basis for the rest of the text. Chapter 3 investigates the environmental constraints on the want- ed file system; that is: it defines the demanded specifications of the system. The specifications are formulated as a set of objectives for the file system design. The individual modules/levels -- as demarcated in chapter 2 -- are discussed and designed in the chapters 4 to 10. Chapter 10 -- the Logical File System -- will probably draw most attention, since it is the decisions of this level that are of the most immediate interest to users of the file system. The functions of the Logical File System include the strategy for symbolic naming of files, and authorization enforcement procedures. Some remarks and recommendations for the actual implementation of the proposed file system are given in chapter 11. Finally -- before the concluding chapter 13 -- chapter 12 gives a "Preliminary Reference Manual" for the file system. The chapter presents an overview of the proposed file system. \f F_ 2_._ _ _ _ _ _ _ _ _M_O_D_E_L_S_ _F_O_R_ _S_T_O_R_A_G_E_ _A_N_D_ _R_E_T_R_I_E_V_A_L_ _O_F_ _B_A_C_K_I_N_G_ _S_T_O_R_A_G_E_ _D_A_T_A_ 2. A computer system consists of a number of hardware components with associated software. The components may conveniently be divided into two groups: - central processing unit and core storage - peripheral devices. The first group -- together with its associated software -- is supposed to manipulate data already present in the core storage. To introduce (input) new data in the core storage, or communicate data from the core storage to the external world (output), the second group -- peripheral devices -- is used. The software handling the transfer of data between the central and peripheral devices is called the i_n_p_u_t_/_o_u_t_p_u_t_ _s_y_s_t_e_m_ (I/O system). As stated in section 1.2 we will focus on the special class of peripheral devices known as "random access backing storage" de- vices, in this report. The subsystem of an I/O system, handling such devices is usually called the f_i_l_e_ _s_y_s_t_e_m_. The backing storage may be of considerable dimensions, not pos- sible to be encompassed in detail by the human mind. Therefore, to facilitate transfer of data to and from backing storage, it is necessary for the file system to maintain the administration of an abstract model of the backing storage. This model must order, group, relate, etc. the collection of data residing on backing storage, so that it enables storage and retrieval of any data- structure a given user may want to manipulate. Different models to this end may be considered. The classical approach is the notion of a file, consisting of a number of consecutive records (possibly of variable length). \f But other models have been proposed and implemented (IBM 1978). These models often have a close relationship to certain database designs, or the like. In this chapter we will look into only the classical approach. In section 2.1 a hierarchical file system model is described. This -- due to Madnick and Alsop (Madnick et al. 1969) -- has come to be considered the standard of the field. It would be an interesting point of further investigation to look into alternative and more exotic models for the storage and re- trieval of data on backing storage. However, it is not considered within the scope of the present project to elaborate further on this issue. 2_._1_ _ _ _ _ _ _ _A_ _M_o_d_e_l_ _f_o_r_ _a_ _H_i_e_r_a_r_c_h_i_c_a_l_ _F_i_l_e_ _S_y_s_t_e_m_ 2.1 The model presented in this section is primarily extracted from the papers of Madnick (Madnick 1969a, Madnick et al. 1969b). The reason for presenting the model is that it will furnish an excellent framework for discussion and analysis of the different parts of a file system. Furthermore, it will hopefully be a guide-line for the implementor of the file system, since it sug- gests a neatly structured and flexible implementation. The above mentioned papers describe a file system design model consisting of a hierarchy of six levels. Actual implementations of file systems may not necessarily conform exactly with these levels: levels may be combined, further subdivided, or even absent (or at least reduced to a trivial form). \f The six hierarchical levels to be described below are: - Access Methods (AM) - Logical File System (LFS) - Basic File System (BFS) - File Organization Strategy Modules (FOSM) - Device Strategy Modules (DSM) - Input/Output Control System (IOCS) (figur) Figure 2.1: Hierarchical file system. \f 2_._1_._1_ _ _ _ _ _A_c_c_e_s_s_ _M_e_t_h_o_d_s_ _(_A_M_)_ 2.1.1 This level offers an interface to the file system. The interface may include different file formatting functions: sequential files, random access files, fixed-length record files, variable- length record files, etc. It may include multiple-buffer facili- ties. It may hide differing physical characteristics of different backing storage devices. Obviously, a user may supply his own access method functions aug- menting the functions of this level. 2_._1_._2_ _ _ _ _ _L_o_g_i_c_a_l_ _F_i_l_e_ _S_y_s_t_e_m_ _(_L_F_S_)_ 2.1.2 This level transforms a user-supplied symbolic file name to an internal unique "f_i_l_e_ _i_d_e_n_t_i_f_i_e_r_". The tables of transformation data are usually called the c_a_t_a_l_o_g_(s). The symbolic filename will never enter any of the levels below this level. 2_._1_._3_ _ _ _ _ _B_a_s_i_c_ _F_i_l_e_ _S_y_s_t_e_m_ _(_B_F_S_)_ 2.1.3 The Basic File System finds and maintains -- for a given file identifier -- the corresponding "f_i_l_e_ _d_e_s_c_r_i_p_t_o_r_". The file de- scriptor contains the "system-associated" data, such as "length", "location", "reservation locks", etc. The essential functions of "opening" and "closing" a file are primarily performed in the BFS: when a file is opened, the BFS selects a free FOSM (File Organization Strategy Module), which it intializes with data from the file descriptor. Read and write operations can hereafter be directed immediately to the FOSM. "Close" releases the FOSM of the file. The maintenance of file descriptors is the responsibility of the BFS-level, too. \f 2_._1_._4_ _ _ _ _ _F_i_l_e_ _O_r_g_a_n_i_z_a_t_i_o_n_ _S_t_r_a_t_e_g_y_ _M_o_d_u_l_e_s_ _(_F_O_S_M_)_ 2.1.4 The user views a file as a collection of contiguous records. To optimize storage device performances, this may not be true in reality: the physical records of data may be scattered arbitrari- ly on the backing storage. When a FOSM receives an I/O-request concerning a given record with an associated logical address (i.e. the logical position within the file), it translates the logical address to an address of a physical record, and forwards the request to the appropriate Device Strategy Module. 2_._1_._5_ _ _ _ _ _D_e_v_i_c_e_ _S_t_r_a_t_e_g_y_ _M_o_d_u_l_e_s_ _(_D_S_M_)_ 2.1.5 Characteristics of access to records of different backing storage devices vary much. When several requests for access to records of a given device are queued up, it is the responsibility of the Device Strategy Module connected to the device, to produce an optimal sequence of re- quests. The strategy of sequencing obviously depends on charac- teristics such as latency and access time. Furthermore, the DSM administers the allocation and deallocation of records on the device. When a FOSM requests allocation of a number of records, the DSM will select these. 2_._1_._6_ _ _ _ _ _I_n_p_u_t_/_O_u_t_p_u_t_ _C_o_n_t_r_o_l_ _S_y_s_t_e_m_ _(_I_O_C_S_)_ 2.1.6 The IOCS consists of the collection of basic drivers to the back- ing storage devices. It is assumed that these drivers perform functions such as: queuing of i/o-requests, automatic error reco- very attempts in case of spurious errors, reasonable standard actions in case of irrepairable errors etc. \f 2_._2_ _ _ _ _ _ _ _F_u_n_c_t_i_o_n_s_ _o_f_ _t_h_e_ _F_i_l_e_ _S_y_s_t_e_m_:_ _a_n_ _I_l_l_u_s_t_r_a_t_i_o_n_ 2.2 The hierarchy of the file system design model described above, may be i_l_l_u_s_t_r_a_t_e_d_ by means of functions: each level corresponds to a function defining the operations of the level. If we disregard the AM-level -- which is irrelevant or trivial in this connection -- we may define following abstract functions: LFS: Name File-id BFS: File-id Descriptor FOSM: Descriptor Logical-pageno Logical-address Device-id DSM: Logical-address Device-id Physical-address IOCS: Physical-address Data The meaning of the defining sets are hopefully self-evident. The basic functions of a file system are naturally "read" and "write". As an example we will use the function READ, defined as follows: READ: Name Logical-pageno Data or buffer = READ (file, page) buffer Data, file Name, page Logical-pageno \f In fact -- as one naturally must expect -- READ may be defined by means of the functions of the file system: READ = LFS BFS FOSM DSM IOCS or (1) READ (file, page) = IOCS (DSM (FOSM (BFS (LFS (file)), page))) This formula illustrates the usefulness of a function OPEN: If a sequence of READ>s is performed on the one and same file, the variable "file" will be constant. From formula (1) we get: (2) READ (file, page) = IOCS (DSM (FOSM (k, page))) where k = BFS (LFS (file)), The function (2) obviously involves less computation than (1), as the two functions BFS and LFS are eliminated. Consequently, we may define a function OPEN: OPEN: Name Descriptor OPEN = LFS BFS which for a sequence of READ-operations is constant, and thus only needs computation once. \f F_ 3_._ _ _ _ _ _ _ _ _E_N_V_I_R_O_N_M_E_N_T_A_L_ _C_O_N_S_T_R_A_I_N_T_S_ _O_N_ _T_H_E_ _F_I_L_E_ _S_Y_S_T_E_M_ 3. In this chapter the environmental constraints on the file system will be investigated. The environment will be treated in four: characteristics of - processor (section 3.1) - backing storage (section 3.2) - data -- contents of files (section 3.3) - users (section 3.4) As the survival of stored data is of vital importance in a com- puter system, it is necessary to consider the protection of these data. This will be done in section 3.5. In each of these five subsections, the considerations of constraints will be expressed in a set of objectives for the design of the file system. These objectives -- and deduced further points -- are the basis for the detailed discussion of file system design in chapters 5 to 10. 3_._1_ _ _ _ _ _ _ _C_h_a_r_a_c_t_e_r_i_s_t_i_c_s_ _o_f_ _P_r_o_c_e_s_s_o_r_ 3.1 The file system is to be used on microcomputers as well as large- scale mainframe computers. This leads to considerations on core- size, speed of processing and so on. It is a necessity that the file system is able to perform well, both on a small (core-) size and slow microcomputer as well as on a larger and faster com- puter. To perform well -- relative to the overall performance of the computer -- this probably means that the core resident part of the file system should be able to vary in size (controlled by initialization and configuration parameters). \f 3_._1_._1_ _ _ _ _ _O_b_j_e_c_t_i_v_e_s_ 3.1.1 - The file system (implementation) should be structured so that its core storage demands are controlled by initialization/con- figuration parameters. 3_._2_ _ _ _ _ _ _ _C_h_a_r_a_c_t_e_r_i_s_t_i_c_s_ _o_f_ _B_a_c_k_i_n_g_ _S_t_o_r_a_g_e_ 3.2 The file system should -- ideally -- be for use on both slow (floppy-disks), medium fast (hard discs, drums, magnetic bubble memories), and very fast (semiconductor or core memory) media. In practice, this is hard to accomplish. Therefore some trade- offs must be made. In keeping with the current trends we will want to focus our attention on the disktype devices, primarily hard disk types. Magnetic bubble memory is not yet a serious alternative to hard disks. But prices will decrease dramatically within the next few years. Fortunately -- from the viewpoint of this design -- the access time characteristics of magnetic bubble memory resemble those of ordinary hard disks -- at least in principle: access time is divided in seek and latency time. Hence, hard disk access strategies are not necessarily irrelevant in connection with mag- netic bubble memory. It might be said that the fact that magnetic bubble memories are an order of magnitude faster than hard disks is just an added convenience. Figure 3.1: Backing storage characteristics, from (Hnatek 1979). \f Thus the prime problem is to balance the factors of slow floppy- disk-like devices against those of medium-fast devices. 3_._2_._1_ _ _ _ _ _O_b_j_e_c_t_i_v_e_s_ 3.2.1 - the number of (random) accesses to the backing storage must be minimal for often used facilities of the file system. - organization of logically connected parts of backing storage (e.g. sequential files) should be optimized relative to the physical characteristics of different media. The file system must consequently be able to encompass resulting d_i_f_f_e_r_e_n_t_ strategies of allocation and device handling. 3_._3_ _ _ _ _ _ _ _C_h_a_r_a_c_t_e_r_i_s_t_i_c_s_ _o_f_ _D_a_t_a_ 3.3 The file system must be able to handle few as well as a great number of files. Also it must be able to handle small as well as huge files. Actual computer installations have been investigated with respect to the size of files. The tables of fig. 3.2, 3.3, and 3.4 show the distribution of file sizes in some typical computer-environments. It is clearly shown that the majority of files are relative small (less than 10 Kb). Data may be accessed in a wide varity of ways. Therefore diffe- rent overlays of internal structures on the files (random access organizations, list- and chain organizations etc.) should be well supported by the file system. \f F_ Figure 3.2: File sizes. The installation is a rather big minicomputer used for system development at A/S Regnecentralen af 1979. \f F_ Figure 3.3: File sizes. The installation is a medium-sized minicomputer used for system development at A/S Regnecentralen af 1979. \f F_ Figure 3.4: File sizes. The installation is a very big minicomputer system used for both batch and on-line applications. The system is situated at a customer of A/S Regnecentra- len af 1979. \f Many computer installations comprise a limited number of devices for backing storage. Typically there may be a few disk drives, but a lot of volumes (kits) which may be mounted on these drives. This situation is obviously caused by the relative high expence of backing storage devices (compared to the expence of the rest of the computer system). Consequently, the files may be situated on a lot of volumes, and all the volumes of the system is not necessarily mounted at the same time. 3_._3_._1_ _ _ _ _ _O_b_j_e_c_t_i_v_e_s_ 3.3.1 - The file system must be able to handle few as well as a great number of files. - must be able to handle small as well as huge files. Especially small files must be handled efficiently. - the files may be located on a lot of volumes, not necessa- rily mounted concurrently (at the same time). - the files must be fully dynamic. Single files must be able to grow or shrink (at least at the tail of the file, but preferably anywhere: beginning, midst, or tail of the file). 3_._4_ _ _ _ _ _ _ _C_h_a_r_a_c_t_e_r_i_s_t_i_c_s_ _o_f_ _U_s_e_r_s_ 3.4 The file system may be situated in the context of single-user as well as multi-user systems. The file system must therefore be able to handle situations where a number of users access files on shared volumes of backing storage. This demands certain facili- ties of synchronization (exclusive and shared access etc.) and resource management. \f The multi-user situation also calls for more sophisticated naming facilities: users should be able to use filenames independent on other users choise of names, and files should be protected against accidental or malicious misuse by other users. Whatever protection rules are enforced, the environment is prone to update already existing data with false new data from time to time. Therefore it may be a facility of the file system automati- cally to dump (updated) data (backup). Admittedly, such facili- ties are on the borderline of what a file system -- in itself -- should be able to cope with. 3_._4_._1_ _ _ _ _ _O_b_j_e_c_t_i_v_e_s_ 3.4.1 - Users must be able to choose filenames independently, with- out concern for duplicate filenames (otherwise they will have to guess at names not previously chosen). - it must be possible to specify accessrights allowing private as well as public use, so that a number of users may share the same file. - synchronization of concurrent use of single files must be implemented. - the file system must deliver such relevant data, that implementation of an augmenting resource management system is possible. S_e_c_o_n_d_a_r_y_ _o_b_j_e_c_t_i_v_e_s_: - automatic backup procedures should be supported by the file system. \f 3_._5_ _ _ _ _ _ _ _P_r_o_t_e_c_t_i_o_n_ 3.5 In any computer system, data are in danger of being improperly e_x_p_o_s_e_d_ or d_a_m_a_g_e_d_ (Lampson 1974). Protections are a defense against such threats. Threats may be caused by a_c_c_i_d_e_n_t_ or m_a_l_i_c_e_. Malice tries to make use of any weakness in a protection system. Malice may have much more far-reaching consequences than acci- dents (unobtrusive modification is much worse than outright de- struction). Thus, protection against malice is harder to estab- lish and at the same time more important. Accidents may damage data. Of course it may expose data as well, but this is presumably less harmful if no malice is present. Ac- cidents may be caused by user mistakes, software errors in system programs (bugs) or hardware malfunctioning. Considering tolerancy against faults, there are two important factors: - r_e_d_u_n_d_a_n_c_y_: Errors are only detected through the presense of redundancy. If enough redundancy, faults may even be re- paired. So the amount of redundancy in the different organ- izations of the file system is an important design parame- ter. - p_h_y_s_i_c_a_l_ _l_o_c_a_t_i_o_n_ _o_f_ _f_a_u_l_t_ _s_e_n_s_i_t_i_v_e_ _d_a_t_a_: Errors are prone to hit backing storage "clustered" around a single physical location (e.g. disrupting one or a few adjacent sectors of a disk). Therefore sensitive data should be d_i_s_t_r_i_b_u_t_e_d_, li- miting the effect of local errors: one disrupted sector of a disk should not cause total disruption of the disk ("casca- ding" of errors). \f 3_._5_._1_ _ _ _ _ _O_b_j_e_c_t_i_v_e_s_ 3.5.1 - The administrative data of the file system are sensitive data, and should consequently be fault tolerant. This includes: 1) least possible "cascading" of minor errors. 2) regeneration of disrupted areas on basis of intact areas ("scavenging"). - authorization enforcement should be supported. - reservation and marking of malfunctioning areas (e.g. areas with persistent parity errors). \f F_ 4_._ _ _ _ _ _ _ _ _D_E_S_I_G_N_ _O_F_ _T_H_E_ _I_N_D_I_V_I_D_U_A_L_ _L_E_V_E_L_S_ 4. In the following chapters 5 to 10, the design of the individual levels of the model introduced in chapter 2 will be discussed. The discussion of the Access Methods (AM) level and the Logical File System (LFS) level will be postponed until after discussion of Basic File System (BFS) and lower levels. The latter levels must -- to a certain degree -- be independent of the type of en- vironment, since files generated in one milieu must be compatible with versions of the file system in another milieu. Each of the chapters 5 to 10 is divided in five: - short general description of the level in question - objectives - design - interface functions to the level - concluding remarks. The d_e_s_i_g_n_-sections consist of an iterative sequence of p_r_o_p_o_s_a_l_s_ for solution of a stated subproblem. Each proposal is clearly emphasized in the text. Examples of actual systems employing the proposal in question, is given immediately below each proposal. Following references to actual systems are used: Burroughs B5500 MCS (Ravn 1979) IBM OS/360 (Kristensen 1979) OS6 (Stoy and Strachey 1972) PDP SOLO (Brinch Hansen 1977) RC3600 (Olsen 1978) RC System 2 (Brinch Hansen 1971) RC System 3 (Aris et al. 1978) TITAN (Kristensen 1979). \f F_5_._ _ _ _ _ _ _ _ _B_A_S_I_C_ _F_I_L_E_ _S_Y_S_T_E_M_ _(_B_F_S_)_ 5. This module is concerned with maintenance of the file d_e_s_c_r_i_p_t_- o_r_s_. The desriptor provides the overall information about the file, such as "length" and "location". Data needed for controlling multiaccess to the file (shared and exclusive accessing - prevention of deadlocks, etc.) is also located in the file descriptor. 5_._1_ _ _ _ _ _ _ _O_b_j_e_c_t_i_v_e_s_ 5.1 - Must give all relevant information on the overall status of the files - the descriptors are essential for accessing of files. They must be fault tolerant. 5_._2_ _ _ _ _ _ _ _D_e_s_i_g_n_ 5.2 The main problems of the BFS-level are: (1) the organization of descriptors, (2) redundancy in locating descriptors, and (3) the actual contents of a descriptor. 5_._2_._1_ _ _ _ _ _O_r_g_a_n_i_z_a_t_i_o_n_ _o_f_ _t_h_e_ _D_e_s_c_r_i_p_t_o_r_s_ 5.2.1 The simplest scheme for organization of the file descriptors is to associate them directly with the symbolic file names. P_r_o_p_o_s_a_l_: File descriptors are put in the catalog(s) together with the symbolic filenames. Thus there are one de- scriptor for each name. (RC system 3, PDP SOLO, IBM OS/360). \f This organization scheme -- coupled with a reasonable organiza- tion of the catalog(s) (e.g. hashed) -- ensures relative quick retrieval of all relevant information on a single file. On the other hand search lengths may grow fast, as the size of single entries are rather big (there can only be few entries per page), so the fillgrade must be kept low. A more serious problem is that of multiple naming of single phy- sical files. If users are to have individual and different names for a file, this implies that they have individual descriptors for the file. This leads to inconsistency problems among the de- scriptors, as the information contained in a descriptor, are at- tributes to the d_a_t_a_ of the file (and not to the n_a_m_e_s_ of the file), and consequently should be the same for every name-entry. P_r_o_p_o_s_a_l_: There is a unique file descriptor for every file. The descriptors are put in a special system file. Conse- quently a file has a unique internal name (called the f_i_l_e_ _i_d_e_n_t_i_f_i_e_r_) which is defined by the index of the descriptor in the file of descriptors. (OS6, Burroughs B5500 MCS) In this case the catalog system maps the symbolic names into identifiers. No inconsistency problem exists, as descriptors are unique. Searching for a given name in a catalog can be fast, because many entries may be found on the same page (size of entries is small, consisting only of a symbolic name and an identifier). On the other hand it is slower to access the data of a given file, because it is necessary to access the catalog as well as the descriptor to find the physical address of the first file- page. \f Of course the slowdown would not be significant if the whole de- scriptorfile was held in core. This is not realistic as it would mean that filedescriptors for every accessible file in the system should exist in core, introducing severe restrictions on the number of files. Such a limitation would be antagonistic to the stated objectives of the file system. Another problem is the vulnerability of a descriptorfile: if the file is erroneously deleted or disrupted the whole associated volume of backing storage will become inaccessible. If the pages of the file are clustered physically on the backing storage (which normally is optimal for allocation of files) a hardware error disrupting even a minor part of the backing storage (e.g. a few sectors of a disk) may render the access to several files impossible. P_r_o_p_o_s_a_l_: File descriptors are part of the individual files (e.g. placed on the first page of any file). A file is unam- biguously identified (within a volume) by the address of its first page. This proposal combines the best features of the two former pro- posals: - the catalog system is totally independent on the Basic File System, because files can be identified uniquely by internal names (it is assumed that the address of a given file>s first page is fixed through the whole lifetime of the file). Alas, no inconsistency problem! - no superfluous backing storage accesses in connection with "opening" of a file are needed, as the first page of the file has to be read anyway (this page contains information about the location of the single pages of the file, as shall be seen in a following chapter. - descriptors are automatically distributed on the volume, there- by diminishing the risk of catastrophes when single pages de- velop hardware faults. \f Unfortunately, the assumption of fixed first pages of all files, is very inconvenient when one considers reorganization. Reorgani- zation obviously involves moving of files, and it is difficult -- if not impossible -- to find a reasonable procedure for reorga- nizing the files of a volume, and at the same time updating the references to first pages (file identifiers) in all of the cata- logs (no backward references from files to catalogs exist). P_r_o_p_o_s_a_l_: File descriptors are placed at the first page of any file. Addresses of such first pages ("descriptor pages") are put in a table, the "File Location Table", contained in a special system file. A file is identifi- ed unambiguously (within a volume), by an index to the table-entry referencing the first page of the file. This proposal resembles the last-but-one proposal. It ensures the independence of file identifiers and file location thus enabling efficient reorganization. Unfortunately it suffers from some of the same shortcomings as the last but one proposal: - opening of a file requires one extra access to find the address of the descriptor page in the proposed File Location Table. - the File Location Table is very vulnerable, since the data are clustered. As regards the first shortcoming: this is the obvious effect of creating independence between file identifiers and file locations (in conjunction with the demand of no backward references from files to their file names). Therefore this must regrettedly be accepted, in the good cause of flexibility and reorganization. Remark: Unlike the previous proposal a lot of table entries may be placed on a single table page (the entries contain only an address). If time used for opening of files is critical, it may be practicable, to let the whole (or a large part of the) table\f be resident in core. This should naturally be controlled by an initialization parameter to the file system. As regards the second shortcoming (vulnerability of table), this may be eliminated by the following scheme: - the File Location Table index of a file is stored in the descriptor when created (i.e. "pushed down" into the descriptor). Descriptors are distributed and therefore fault tolerant, seen as a whole. - the recovery program finds all the descriptors of the volume, thereby reconstructing the File Location Table. This leaves the problem of finding every descriptor on a volume without use of the File Location Table. That is the subject of the next subsection. 5_._2_._2_ _ _ _ _ _R_e_d_u_n_d_a_n_c_y_ _C_o_n_c_e_r_n_i_n_g_ _L_o_c_a_t_i_o_n_ _o_f_ _D_e_s_c_r_i_p_t_o_r_s_ 5.2.2 As pointed out in the previous subsection, it is essential for recovery after faults on the File Location Table, that the loca- tion of all file descriptors may be found, without use of the File Location Table. There are two different approaches to this: (1) A separate table, referencing all descriptors, exists. This could for instance be a bit table, referencing all the pages on which descriptors reside. Or it could simply be a copy of the File Location Table. Use of this method implies that the table must be maintained in parallel with the File Location Table. (2) The descriptor pages are marked in such a way that they can be identified by a recovery program ("scavenger"), scanning all the pages of the volume (e.g. Redell 1980). \f It does seem impractical to mark every page of a volume, this requiring a little of the data area of every page. The problem must be solved otherwise: The descriptor pages con- tain a special identification in the first few words. The chances of duplicating such an identification -- by acci- dent, or by maliciousness -- on a non-descriptor page, could be kept extremely low, by using a well-chosen and suffici- ently long identification. Since method (1) adds overhead to normal system operation, method (2) will be adapted in the final design. The problem of choosing a reasonable identification for descriptor pages is rather dependent on the actual implementation of the file system. The subject will therefore not be elaborated in detail here, but a few hints will be given: - the identification should not contain ASCII-characters (i.e. no low-valued bytes), since they are common in datafiles. - there must be no valid identification on unallocated pages: when a file is deleted, the identification on its descriptor page must be marked or destroyed. - the length must be chosen so that the probability of interpreting a wrong page as a descriptor page is extremely low. An example: Define p(xDd1Uu xDd2Uu ... xDdnUu) as the probability of meeting the byte-pattern xDd1Uu, xDd2Uu, ... xDdnUu (xDdiUu is a byte-value) in the first n bytes of any page on a given volume. The problem is to define an identification xDd1Uu, xDd2Uu ... xDdnUu, ensuring a sufficiently low p(xDd1UuxDd2Uu ... xDdnUu). \f If we -- for the sake of simplicity -- assume that all values of any byte on a data-page are equally probable we obtain following: (1) all byte values are independent, i.e.: p(xDd1UuxDd2Uu ... xDdnUu) = p(xDd1Uu) p(xDd2Uu) ... p(xDdnUu) (2) for any two byte-values xDd1Uu and xDd2Uu M_M_ _1_ _ p(xDd1Uu) = p(x) = 2Uu-8Dd = P_P_ 256 (3) p(xDd1UuxDd2Uu ... xDdnUu) = p(xDd1Uu) p(xDd2Uu) ... p(xDdnUu) = (p(xDd1Uu))UunDd = 2Uu-8nDd If we choose n = 4, then M_M_ _1_ _ p(idDd1Uu idDd2Uu idDd3Uu idDd4Uu) = 2Uu-32Dd = P_P_ 2 G. This means that the probability of misinterpreting one or more pages of a 250 Mb-disk (which contains approximately 500 K pages), is: M_M_ P_P_ If we choose n = 6 we obtain analogously: M_M_ P = 2Uu-48Dd 500K = 0.2 10Uu-6Dd% P_P_ 6 The above considerations serve to show that it is feasible to reconstruct the File Location Table by scanning all the pages of a backing storage volume -- if the descriptor page identification is carefully chosen. \f 5_._2_._3_ _ _ _ _ _C_o_n_t_e_n_t_s_ _o_f_ _D_e_s_c_r_i_p_t_o_r_s_ 5.2.3 The descriptor contains attributes to the d_a_t_a_ of the file, such as l_e_n_g_t_h_ and l_o_c_a_t_i_o_n_ of the file. Please note, that another set of attributes must be associated with the n_a_m_e_s_ of the file, defining use- and access-rights (for authorization enforcement). Superimposing levels to the file system may want to insert their own attributes in the descriptor. This may be done in a special user-accessible area of the descriptor -- the t_a_i_l_. The descriptor could furthermore contain information for con- trolling of multiaccess, such as: - open _count (number of times the file has been "opened" without later "closing") - update _count (number of current open connections in update-mode) - retrieve _count (number of current open connections in retrieve _mode) - exclusive _reserved (a flag: if set by a user opening the file, all other users are excluded from access to the file -- this may for instance be used in connection with writeaccess), and some sort of information concerning the last updating of the file (for backup purposes), e.g.: - time and date of last update. \f 5_._3_ _ _ _ _ _ _ _F_u_n_c_t_i_o_n_s_ _o_f_ _t_h_e_ _B_F_S_-_L_e_v_e_l_ 5.3 A reasonable set of functions, accessing the actual contents of files and manipulating descriptors, is: open (file _identifier, mode) close (file _identifier) get (file _identifier, buffer, page _no, page _count) put (file _identifier, buffer, page _no, page _count) create (descriptor _mask, file _identifier) delete (file _identifier); extend (file identifier, amount) truncate (file _identifier, amount) settail(file _identifier, tail) mount (volumename, device) dismount (device) The f_i_l_e_ _i_d_e_n_t_i_f_i_e_r_ is the internal name of the file as given by the catalog system. M_o_d_e_ indicates write- or read-access or both. Furthermore it specifies private (exclusive) or shared access. P_a_g_e_ _n_o_ is the logical number of the first page to be read/writ- ten. P_a_g_e_ _c_o_u_n_t_ denotes the number of pages from (or to) b_u_f_f_e_r_, to be transferred. D_e_s_c_r_i_p_t_o_r_ _m_a_s_k_ contains relevant data (attri- butes) for creation of a new file. A_m_o_u_n_t_ is the number of pages to allocate/deallocate. T_a_i_l_ contains the data to be put in the tail of the descriptor. It should be noted that the format of get- and put- functions facilitates multiple page transfers to and from backing storage. The interface would obviously be simpler if only offering single- page transfers: get/put (file _identifier, buffer, page _no). How- ever, multiple-page transfer is, in many cases, much more efficient: All the pages of a program may be loaded by issue of a single read-command to the controller of the backing storage (on\f a disk all the pages of a single surface of a cylinder may read in one revolution). This will be discussed in detail later. Mount and dismount are functions to manage the including and removal of backing storage volumes. V_o_l_u_m_e_n_a_m_e_ is a userdefined symbolic name of the volume to be mounted. A simple check is performed to ensure the correct identity of the mounted volume. D_e_v_i_c_e_ is an identification of the device on which the volume is, or shall be, mounted. 5_._4_ _ _ _ _ _ _ _C_o_n_c_l_u_d_i_n_g_ _R_e_m_a_r_k_s_ 5.4 All the stated objectives of section 5.1 are reasonably met. \f F_6_._ _ _ _ _ _ _ _ _F_I_L_E_ _O_R_G_A_N_I_Z_A_T_I_O_N_ _S_T_R_A_T_E_G_Y_ _M_O_D_U_L_E_ _(_F_O_S_M_)_ 6. This level defines the internal organization of a single file. It maintains this information in case of creation, deletion, extension, or "shrinking" of a file. 6_._1_ _ _ _ _ _ _ _O_b_j_e_c_t_i_v_e_s_ 6.1 Prime objectives: - dynamics: a file should be able to grow or shrink arbitrarily, at least at the tail of the file - the amount of allocated or deallocated pages must be a figure available to the user (for resource management purposes) - the time for access of single pages of the file must be pre- dictable within reasonable limits - sequential access (e.g. loading of programs) must be fast - random access must be reasonably fast - file allocation (and deletion) must be reasonably fast - when a page is deallocated (e.g. by deletion of a file), the data contained herein should never be seen by another user incidently allocating the same page later. Admittedly such recreation of -- possibly sensitive -- data is very random and data can furthermore always be effectively deleted by the user -- if so wanted -- by overwriting them before deallocation of the corresponding pages. However, in keeping with Murphy>s laws: if not forced by the system, users will not act reasonable! \f Secondary objectives: - a file should be able to grow or shrink anywhere: beginning, midst, or tail - files may be arbitrarily big, encompassing more volumes of backing storage. 6_._2_ _ _ _ _ _ _ _D_e_s_i_g_n_ 6.2 The design of the FOSM-level is treated in four subsections: - page organization of files - organization of index tables - dynamics of files - unreadability of deallocated pages 6_._2_._1_ _ _ _ _ _P_a_g_e_ _O_r_g_a_n_i_z_a_t_i_o_n_ _o_f_ _F_i_l_e_s_ 6.2.1 The three wellknown techniques of file organization are: - contiguous organization - chained organization - index table organization. P_r_o_p_o_s_a_l_: Files are organized contiguously. (RC System 2). This is a simple strategy: the file is identified by the address of its first page. The address of any page in the file is easily found by a simple computation. Many files can be open at the same time, as organization overhead in terms of core storage is low. Sequential accessing time is minimal (for the sake of simplicity it will be assumed in the following that latency time will be minimal when accessing sequential pages of the backing storage. This assumption may in fact not be true for all devices, but a cunning mapping from the logical pagenumbers of the device, into\f actual physical locations, may prove sufficient to fulfil the assumption as shall be seen later -- cf. subsection 7.2.3). Remark: The postulate that contiguous files minimize sequential access time is true only for single-user systems. As Madnick (Madnick 1969a) points out, the question is open to dispute when two sequential files are scanned concurrently on the same volume, since this will defeat the assumed sequential accessing! The notable drawback of the proposal is that file allocation (or deletion) is unacceptably slow: files must be compacted by copy- ing from time to time. Otherwise waste of space will be excessi- ve, since deleted files will create large holes of potentially unusable space. The solution to the problem is to scatter the individual pages of the file: P_r_o_p_o_s_a_l_: Files are organized by chaining of pages. Every page of a file contains a pointer to the next page. Again a relative simple strategy: the file is identified by the address of its first page, organization overhead is small in terms of core storage and file allocation, deletion, extension, and shrinking may be fast. The chief deficiency of the method is the use of variable time for accessing of random pages. This could be solved by holding the chain fields of the file in core storage, instead of distibuting them on the individual pages. P_r_o_p_o_s_a_l_: Files are organized by chaining of pages. Chains are held in core storage. (RC System 3) Access time is still variable but tolerably so (at least for files not excessively huge). However, organization overhead in\f terms of core storage is considerable. The only realistic way to employ the method is to hold a s_i_n_g_l_e_ chain table (perhaps per volume) in core, describing all files as well as free pages. If such a scheme is used many files may be open at the same time, since opening of yet another file will not add notably to the required core size. On the other hand the total number of backing storage pages con- nected to the system will be severely limited by the core size. This may be partially overcome by making the pagesize very big -- or better: address pages in "chunks" (called slices), making files consist of a number of slices, each referencing a fixed number of contiguous pages. This allows the chain table to span a greater number of pages using the same amount of core. Unfortuna- tely this method introduces an unpleasant dependence between the size of available core storage and organization of backing sto- rage (i.e. the size of the slices). Restricted core size implies big slices with consequent wasting of backing storage (in mean half a slice is wasted per file). Furthermore the choise of slice size may be difficult, because volumes could be exchanged between computer systems of different available core size. Another drawback is the sensitivity of the chaintable (which of course must be stored on backing storage as well) towards errors disrupting small parts of it. A ruined chaintable of a few pages, makes the whole volume of possibly several hundred thousand pages inaccessible. Finally, the total number of files is limited (at least by the number of chain fields in the chain table). This may be tolerable on many computer installations, but it is not tolerable in gen- eral, since the file system should be applicable on an assorted range of configurations. P_r_o_p_o_s_a_l_: Files are organized by index tables. Each file contains such a table on its first page. (PDP SOLO, RC3600, OS6) \f This method makes the core size overhead for file organization proportional to the number of opened files. Naturally it limits the number of opened files on a given installation, but the pro- blem is tolerable, since the number of concurrently opened files is prone to grow with the size of available core on different computer configurations (demands to the degree of "multi-using" naturally tends to grow with the available core size: small core- size micro-computers are usually expected to be single-user sys- tems, for instance). Extension and shortening of files is simple, assuming the needed free space is available in the index table. Allocation and deletion of files does not involve time consuming activities, as far as the FOSM-level is concerned. Sequential access time is a function of the allocation strategy employed when the file was created (and perhaps extended). The actual allocation of pages might very well be contiguous in many cases. Yet, the method i_s_ rather inefficient in terms of core storage. This means that the index tables should be as compressed as pos- sible. This is not naturally compatible with the objective of ar- bitrarily big file sizes. So, let us turn the problem of organi- zation of index tables. 6_._2_._2_ _ _ _ _ _O_r_g_a_n_i_z_a_t_i_o_n_ _o_f_ _I_n_d_e_x_ _T_a_b_l_e_s_ 6.2.2 The organization of an index table has several degrees of freedom: - structure of the individual indices, - the address space of the individual indices, - representation of index tables in core, - single-level or multi-level indices. \f 6_._2_._2_._1_ _ _ _S_t_r_u_c_t_u_r_e_ _o_f_ _t_h_e_ _I_n_d_i_v_i_d_u_a_l_ _I_n_d_i_c_e_s_ 6.2.2.1 P_r_o_p_o_s_a_l_: The individual indices reference chunks of pages. The chunks are of fixed size (e.g. one page). (PDP SOLO, OS6) Gives easy and fast indexing of a given page. Extension or shor- tening is simple (if there is room enough in the indextable) although it might involve moving of indices within the table. Unfortunately the method is grossly core-consuming, as the index- table of a file will grow proportionally to the file size. This core-consumption may be reduced by employing following method: P_r_o_p_o_s_a_l_: The individual indices reference chunks of pages. The chunks are of variable size (multiplum of slice size). (RC3600) In this case the number of indices is proportional to the degree of scattering of files: if a file is contiguously allocated, only one index is needed. In the worst case a file will be totally scattered, requiring double as much space as in the previous pro- posal (an index contains two words: address and length). However, if a huge file is so scattered that the latter proposal requires more space than the former, it might be advisable to reorganize the volume anyway. Reorganization should compress files (i.e. allocate them contiguously). Finding the address of a given page may involve some searching in the index table, but if the searching is conducted in core this is absolutely tolerable. The only notable drawback is that this organization does not al- low refinement in connection with the concept of "sparsely fil- led" files. Such files (e.g. large random accessed hash coded files where only a few of its pages actually contain data) could easily be implemented by the previous proposal (just placing zeroes (say) in the indices referring to empty pages not yet allocated). However, this is not considered a serious problem,\f since such file organizations also could be implemented on top of the latest proposed organization. This would just require a table (contained in the file) mapping the hashcoded page numbers into the actual (internal) page numbers. Since the problem was that of s_p_a_r_s_e_l_y_ filled files, this method would not introduce much overhead compared to the first one. All in all the method is preferable to the first proposed. 6_._2_._2_._2_ _ _ _T_h_e_ _A_d_d_r_e_s_s_ _S_p_a_c_e_ _o_f_ _t_h_e_ _I_n_d_i_v_i_d_u_a_l_ _I_n_d_i_c_e_s_ 6.2.2.2 When using computers with a word-length of 16 bits, there is a problem of obtaining sufficient address space for large devices. A large device may easily comprise several hundred thousand pages. One 16-bit word will only address up to 64 K pages. What to do? P_r_o_p_o_s_a_l_: A new abstract level is introduced: a u_n_i_t_. A unit comprises up to 64 K pages. A device may comprise one or several units. Each unit is an independent address space. This ensures the address space to be maximally 64 K. The trade- off is the added complexity in the interface to devices (you should not only distinguish devices but units too). Furthermore files are restricted to the size of 64 K pages. P_r_o_p_o_s_a_l_: A pageaddress consists of 32 bits. Thus the address space is 4 G pages. A 4 G range for pages is certainly large enough (if there are 512 bytes per page the maximum address space comprises 2 Tb = 2 x 10Uu12Dd bits), and that is larger than any currently available storage device). \f This method may work well with processors offering 32-bit arith- metic -- which will be the case of many of the coming micro-pro- cessors: Motorola MC68000, Zilog Z8000, Intel 8086, etc. Unfortu- nately it is not the case of many current computers. If no "more- word arithmetic" is supported, it will be very tedious to use 32-bit arithmetic. Considering the fast development of computers, it is chosen to solve the problems by brute force: All page-addresses consist of 32 bit, but no double-word arithmetic is applied in computers where this is not absolutely needed. In most cases an address space of 64 K pages i_s_ enough (at least for microcomputers) and if a larger address space is needed this will often be connected with larger computers (offering 32-bit arithmetic)... 6_._2_._2_._3_ _ _ _R_e_p_r_e_s_e_n_t_a_t_i_o_n_ _o_f_ _I_n_d_e_x_ _T_a_b_l_e_s_ _i_n_ _C_o_r_e_ 6.2.2.3 When a file has been opened it is natural to place some administrative data in core storage, easing access to the file. Amongst such data is the index table for the file. If this is n_o_t_ placed in core, an access to an arbitrary page of the file will involve two backing storage accesses -- first to retrieve information from the indextable concerning the location of the requested page, followed by the actual access. P_r_o_p_o_s_a_l_: The indextable of a file is placed in core storage whenever the file is opened (it is of course removed when the file is closed). (PDP SOLO) This is the natural thing to do. Access times for arbitrary pages of the file are kept minimal. Furthermore it fulfils the ever im- portant property of access time predictability. On the other hand a lot of files may be open at the same time, consuming a lot of core storage for their indextables. \f In matter of efficiency this may be acceptable on large com- puters. But it is certainly not attractive in connection with small microcomputers and the like. Again, as the prime objective of the file system is to span the whole range of computers, the consumption of core for indextables must be restricted. P_r_o_p_o_s_a_l_: Only part of an open file>s indextable is put in core. When a page not referenced in this part is requested, the whole indextable will be read from backing storage and a new part (according to some clever strategy) is put in core. (RC3600) This is in fact a method of "paging" the indextables. The nice thing about the method is that arbitrarily big and complex indextables (e.g. in several levels) could be allowed easily. Unfortunately the trade-offs are obvious: If the size of the core part of an indextable is smaller than the whole indextable for a given file, chances are - naturally - that an arbitrary access to the file, will reference a page "outside" this part (give a "page fault"). Thus the mean access time for the file will increase -- in fact proportional to the degree of fragmentation of the file. Furthermore predictability of access times will be lost (for the user, whom the actual organization of files is irrelevant). One could design various crafty strategies for the continous re- newal of the core part of index tables. Many such strategies are known from the litterature on program-execution involving paging- methods. It is also well-known that these techniques work relati- ve well as long as programs are executed sequentially. However, when execution involves jumps to arbitrary points in the program, there is a danger of severe decrease in execution speed -- so- called "trashing". \f The problem of "paging" the indextable of an open file are ana- logous. As long as accessing is sequential, everything may work relative well. But if access is random (which is often the case for datafiles) trashing may occur no matter how clever the paging strategy. This is obviously a dilemma between antagonistic objectives: in- dextables must be used (allowing scattering of files) -- and: the importance of fast access to the individual pages of an open file. The compromise may lie in the possibility of "compressing" files to a reasonable degree of scattering (i.e. organize the files in a limited number of chunks, thereby diminishing the number of in- dices in index tables). P_r_o_p_o_s_a_l_: Indextables are restricted to a limited number of chunks. Indextables of open files are core resident. When the size of indextables is restricted, they will -- occasio- nally -- show full, but this is not an irrepairable situation, as the backing storage could be reorganized. Admittedly it may cause some disadvantage to user, but this should be balanced by the ever minimal and always predictable access times. One may consider implementing automatic reorganization, initiated by the file system whenever needed. However, this will add unac- ceptably to the complexity of the file system nucleus. If it is essential to have a given volume reorganized continuously, a reasonable way would be to put the reorganization utility program in an individual low-priority process of its own. Of course this will require some careful synchronization between file system and reorganization process. The maximum number of indices in the index-table is hard to de- fine. The number 25 has rather arbitrarily been chosen, hopefully balancing the need for low consumption versus the realistic degree of fragmentation. \f Judging from the empirical figures of section 3.3, the choise is not too bad: If a slice size (i.e. minimal allocation unit) of 3 is chosen, 25 chunks will be able to reference at least 75 pages in the very w_o_r_s_t_ of cases (that is to say: no two chunks can be placed contiguously). The tables show that more than 80 percent of all files were of less size. 6_._2_._2_._4_ _ _ _S_i_n_g_l_e_-_L_e_v_e_l_ _o_r_ _M_u_l_t_i_-_L_e_v_e_l_ _I_n_d_i_c_e_s_ 6.2.2.4 The above chosen method for core representation of index tables inhibits any practical use of multilevel indexing. It should be noted however, that a sort of multilevel indexing system could be implemented o_n_ _t_o_p_ of the here described file system. The higher levels of indices might be represented by individual files con- taining references to other files. Since a high-level interface to the file system obviously may utilize such techniques, there is no urgent need for multi-level techniques in the file system "nucleus". 6_._2_._3_ _ _ _ _ _D_y_n_a_m_i_c_s_ _o_f_ _F_i_l_e_s_ 6.2.3 It was stated as an objective of this level that files should be able to grow or shrink -- at least at the tail -- but preferably anywhere in the file. Why demand this of a file system? Well, of course any possible problem of this nature might be sol- ved by simply copying the file into a new file with the wanted format. But copying of huge files is a time-consuming activity. Since appending new portions of data to a file is a frequent operation, it is not feasible to copy every time. Alas, e_x_t_e_n_d_i_n_g_ of files (at the tail) must be a facility of the file system. \f When a file is extended, we want to allocate the pages of the extension in as few chunks as possible -- all in line with the general philosophy of allocation. If a file is extended one page at a time, there will be no possible way to determine a reason- able strategy, as to which free pages to allocate. The alternative is to allocate all the needed pages at once. If the number of needed pages is not known at the time of extending, it may be profitable to allocate the maximum-number of pages possibly needed. This implies that it must be a facility of the file system to t_r_u_n_c_a_t_e_ a file (getting rid of the unwanted pages after an extension). These are the arguments for dynamics in the tail of a file. Of course they may be applied to the corresponding problem of dyna- mics anywhere in a file. However, one should balance the need for general dynamics against the added complexity to the system. It occurs that examples of natural applications using general dy- namics are: (1) text editors and (2) random access file organi- zations. As regards the text editors, arbitrary dynamics might make fast insertion of pieces of text possible. On the other hand it is hard to imagine a text editor system not preserving a copy of the old text for security reasons. So it seems that text editors are not really a strong argument for the general dynamics. Random access file organizations are probably a much better ar- gument. General dynamics may facilitate to make "cleaner" designs of such organizations, for instance hiding overflow problems. However, the present design will argue that problems c_a_n_ be sol- ved within the framework of tail dynamics, and the file system will consequently be restricted on this point. \f If the need for general dynamics becomes prevalent, the file sys- tem could be expanded with moderate effort to meet such demands. 6_._2_._4_ _ _ _ _ _U_n_r_e_a_d_a_b_i_l_i_t_y_ _o_f_ _D_e_a_l_l_o_c_a_t_e_d_ _P_a_g_e_s_ 6.2.4 One of the stated objectives is to protect deallocated pages from being read by arbitrary users, allocating the same page later. There are two possible solutions to this: (1) overwriting deallo- cated pages (e.g. by zeroes) or (2) administratively prohibiting any possible reading of such pages. The former solution is obviously the safer one: data of deallocated pages is p_h_y_s_i_c_a_l_l_y_ deleted, never to be accessed again. On the other hand it will be rather time consuming. Since the latter solution may be applied safely -- as shown below -- this will be used in the final design. P_r_o_p_o_s_a_l_: Reading of pages not yet written (in the context of the actual file) is prohibited. To fulfil this proposal we need two length-variables in the file descriptor: (1) the total number of pages allocated to the file and (2) the number of pages actually written onto the file. Let us first consider sequential reading and writing. In this case reading past the number of written pages should simply be prohibited. This allows the user to see his own written data, and nothing else. Admittedly things look a little different for random access files: When a random access file is created it must contain a given number of pages regardless of whether there are any data on these pages or not. However, this objection may be discarded, observing that a random access file of arbitrary size may be created by first creating and opening it for sequential writing, secondly write the wanted number of pages (e.g. writing zeroes),\f and finally -- after closing -- regard it as (and possible change its attributes to) a random access file. Thus unreadability of pages not yet written could be ensured by obeying these simple rules: - two length-variables are associated with each file: ALLOCATED-LENGTH and WRITTEN-LENGTH. - following assertions must hold: (1) WRITTEN-LENGTH <_ ALLOCATED-LENGTH (2) If page number R is read from the file: R <_ WRITTEN-LENGTH (3) If page number W is written onto the file: W <_ WRITTEN-LENGTH + 1. 6_._3_ _ _ _ _ _ _ _F_u_n_c_t_i_o_n_s_ _o_f_ _t_h_e_ _F_O_S_M_-_L_e_v_e_l_ 6.3 Functions on the level of FOSM are: open (descriptor, mode) close get (buffer, page _no, page _count) put (buffer, page _no, page _count) extend (amount) truncate (amount). The only function that requires explicit identification of the target file is "open". Reason is: it is assumed that open initia- lizes a (not-in-use) "FOS-Module" which then will represent the file. The rest of the functions will accordingly, in some way -- dependent on the implementation language -- refer to such a spec- ific module. The functions are thus "entries" (in the Concurrent Pascal (Brinch Hansen 1977) sense of the word) to any of the individual modules. \f D_e_s_c_r_i_p_t_o_r_ is the file descriptor found by the Basic File System. It is used for initialization of the "FOS-Module". M_o_d_e_, b_u_f_f_e_r_, p_a_g_e_ _n_o_, a_m_o_u_n_t_, and p_a_g_e_ _c_o_u_n_t_ is described in section 5.3. 6_._4_ _ _ _ _ _ _ _C_o_n_c_l_u_d_i_n_g_ _R_e_m_a_r_k_s_ 6.4 The design of the FOSM-level is essential to the performance of the whole file system. None of the p_r_i_m_a_r_y_ objectives of section 6.1 are neglected. However, both of the s_e_c_o_n_d_a_r_y_ objectives are chosen n_o_t_ to be met in this design: Files are n_o_t_ allowed to grow or shrink any- where, and files are n_o_t_ allowed to encompass more volumes of backing storage: As regards the reasons for the limited dynamics of files, this is extensively discussed in subsection 6.2.3. The limitation of files to single volumes is not considered really serious. Backing storage devices must be chosen to fulfil the demands of the applications using them! \f F_ 7_._ _ _ _ _ _ _ _ _D_E_V_I_C_E_ _S_T_R_A_T_E_G_Y_ _M_O_D_U_L_E_S_ _(_D_S_M_)_ 7. The modules on this level define and describe the (possibly different sorts of) devices on which backing storage is mounted. Since the physical characteristics of the various known devices may differ a lot, miscellaneous optimizing access strategies m_a_y_ be embedded in the modules. The possible strategies for a given device are split in two groups: (1) sequencing of i/o-requests and (2) allocation/deallo- cation of pages on the device. 7_._1_ _ _ _ _ _ _ _O_b_j_e_c_t_i_v_e_s_ 7.1 Prime Objectives: - when a file is created or extended with a number of pages, these pages should be allocated in as few chunks as possible (i.e. preferably contiguous). Reasons are: minimizes size of indextable; makes loading of programfiles fast. - allocation/deallocation should be reasonably fast. - pages developing faults must be marked, so that they are unallocatable. Secondary objectives: - it must be easy to embed new sequencing strategies (for i/o- requests) in the modules. This is primarily a requirement to the implementation structure of the file system. Reason is: it is not considered part of the file system nucleus to employ sequencing strategies, since these are not imperative to the function of the file system (a given device may be ac- cessed with or without sequencing strategies with equal results -- but presumably with differing efficiency). On the other hand, sequencing may improve efficiency considerably on some computer installations (e.g. in multiprogramming, multi-user environments), and should therefore be readily supported by the file system. \f 7_._2_ _ _ _ _ _ _ _D_e_s_i_g_n_ 7.2 7_._2_._1_ _ _ _ _ _S_e_q_u_e_n_c_i_n_g_ _S_t_r_a_t_e_g_i_e_s_ 7.2.1 Strategies for sequencing i/o-requests are -- as said above -- highly device dependent and are not an immediate part of the de- sign to be discussed. Please refer to the litterature, e.g. (Teorey et al. 1972) or (Wong 1980), for further discussion on different algorithms. 7_._2_._2_ _ _ _ _ _A_l_l_o_c_a_t_i_o_n_/_D_e_a_l_l_o_c_a_t_i_o_n_ _o_f_ _P_a_g_e_s_,_ _A_d_m_i_n_i_s_t_r_a_t_i_o_n_ 7.2.2 The DSM must contain datastructures and algorithms to administer the allocation and subsequent deallocation of backing storage pages. Various well-known techniques to do this job, exist. These will be presented below: P_r_o_p_o_s_a_l_: Free pages are c_h_a_i_n_e_d_ together. (RC System 3) This method has very little overhead in terms of core consump- tion. The only thing to store is the address of the next page to allocate (deallocated pages may be inserted before this, or they may be inserted elsewhere in the chain). The chainfields could be located on the individual pages, as the pages are not used for data. However, when allocation strategies are employed, trying to find a suitable c_h_u_n_k_ of contiguous pages, the method is not attrac- tive. Search times will be excessive. If the chainfields were to be placed in a chaintable in core sto- rage, the method would show distinct similarities with the method of chaining file pages. This method was not adopted. For analo- gous reasons the corresponding method will not be applied here either. \f P_r_o_p_o_s_a_l_: The free pages are organized as a f_i_l_e_, which grows or shrinks as pages are deallocated or allocated. This method might work very well if indextables were big enough allowing arbitrarily fragmentated files. The notable point of the method is that the tools (i.e. file system procedures) needed to maintain such a datastructure are already present; why, they are needed in the user interface to manipulate ordinary files. The method might be a little inefficient in terms of execution time. The indextable organization chosen in subsection 6.2.2.3 is not sufficient for this method. P_r_o_p_o_s_a_l_: The free pages are administered by a "bit table". In this table every bit represents a page. If the bit is set (to "1") this means that the corresponding page is free; if not set (set to "0") it is allocated. (PDP SOLO, RC3600) This gives an excellent survey of the free pages. It is easy to compute the size of individual "holes" of free pages, allowing efficient searching. The method may be rather core consuming. A typical-sized volume currently holds from 100.000 to 1.000.000 pages (if page size is 512 bytes). This will require a bit table of 6 to 63 kilobytes. This is of course not realistic. P_r_o_p_o_s_a_l_: The free pages are administered by a bit table. The bit table is a file itself and as such "p_a_g_e_a_b_l_e_". This gives a relative uniform approach to the handling of bit tables on different sorts of devices, large as small. On the other hand, allocation -- and especially deallocation -- may involve several "page faults" decreasing efficiency. \f P_r_o_p_o_s_a_l_: The free pages are administered by a bit table. Every bit references a fixed number of pages. This number constitutes the minimal unit of allocation (called a "s_l_i_c_e_"). (RC3600) This method utilizes the idea that pages may be allocated profit- ably in units of several pages. Reasons are: - most files are of size greater than one page. - files should as far as possible be allocated contiguously. Allocation in slices tends to help to this purpose. - extending of files involving allocation of pages will tend to be less frequent. Often there will be space enough in the last allocated slice for minor extensions. - "holes" of free space on the backing storage tend to be larger (measured in number of pages), because pages are deallocated in units of slice length too. Large holes are more useful for al- location purposes. The main draw-back of the method is the amount of wasted space on backing storage. This is proportional to the slice size, since the theoretical mean size of wasted space per file is half the slice length. The actual slice size should therefore be balanced according to the above pro et contras. The slice size need not be a universal constant, it could very well vary from volume to volume. In fact it might be feasible to let the slice size on large volumes tend to be large (as the waste of space here might be of relative little importance), and oppositely, let the slicesize of small volumes be small. Other factors may influence the choise of slice size, such as the ex- pected mean size of the files to be placed on the volume: the space wasted is relative little when files are huge -- a file of\f 1000 pages with a 10-page slicelength wastes a maximum of one percent of the occupied space; not so a file of 1 page (up to 90 percent waste). Even though this technique may be employed, it is not reasonable to restrict the size of the bittable to one page (or any other universally fixed size), because this will introduce an unwanted restriction on the number of files possible on any volume (the number of files would be limited by -- at least -- the number of bits in the bittable). The compromise is to apply both of the two last methods proposed, thereby getting a "pageable" table of free slices. P_r_o_p_o_s_a_l_: The free slices are administered by a "pageable" bit- table. If time consumption by allocation and deallocation is critical, one should choose a slice-size producing a bit table of size less than a page (eliminating "page faults") remembering the trade-off in terms of wasted space on backing storage. 7_._2_._3_ _ _ _ _ _A_l_l_o_c_a_t_i_o_n_/_D_e_a_l_l_o_c_a_t_i_o_n_ _o_f_ _P_a_g_e_s_,_ _S_t_r_a_t_e_g_i_e_s_ 7.2.3 The stated objective of allocation strategies (cf. section 7.1) was to allocate the requested number of pages in as few chunks as possible. This objective was under the assumption that contiguously located pages minimize access time when access is sequential. The postulate is not true for all devices. It is not all device- controllers that allow access to an arbitrary number of conti- guous pages (sectors, segments, etc.). Some demand every block accessed individually. This will defeat the postulated minimal access time. \f Take as an example a hard disk device: the rotational gap between two neighbouring pages may be too short to allow the system to start another access. If pages were placed contiguously in such a situation, a lot of disk revolutions would be lost when accessing sequentially. Therefore the pages should be interleaved in such a way that two consecutive pages on the same cylinder always are placed with another page parting them. This implies that all pages on (one side of) a cylinder could be accessed in two revo- lutions. The location of logically contiguous pages on two neighbouring cylinders may also be optimized (i.e. considering the rotational gap for cylinder shifts). The technique is refered to as "s_c_r_e_w_i_n_g_" of pages. The mapping between physical locations and logically contiguous locations could be placed in software as well as hardware ("soft screw" respectively "hard screw"). When placed in software this should preferably be a facility of the driver for the device. Pages written with the use of a "soft screw" will be practically impossible to read if not accessed via the same "soft screw" technique. If the "soft screw" is not pro- perly standardized, this may cause problems in connection with exchanging of media between different installations. It is simpler to use "hard screw" (if the physical charachteris- tics of the device permit it). This is done by the "formatting" the media, writing appropriate headers on the sectors of each cylinder. This technique makes the software invariant to the used screwing strategy. Other devices may use other and more refined optimization stra- tegies. We shall not elaborate further on this here, but it will be assumed that such techniques may be incorporated in the file system easily. \f All in all we may stick to the objective of allocation in as few chunks as possible, since this -- under reasonable conditions -- will tend to minimize access time for sequential accessing. The problem of allocation of pages -- by the means of a bit table -- is therefore reduced to the well-known problem of dynamic sto- rage allocation. Please refer to e.g. Knuth (Knuth 1969), Vol. 1, page 435 ff. for a more detailed discussion of strategies. In accordance with the conclusions of Knuth we will choose the "f_i_r_s_t_-_f_i_t_" method as the basis of allocation strategy. If no fit is found, the requested number of pages will be split into two approximately equal parts, and the algorithm will be used (recur- sively) on both of these parts. This modification to the first fit method is totally heuristic. It would be nice to try alterna- tive techniques in a simulation model before application. However, it is important to notice, that the allocation strategy may be changed at any time, since it has no effect whatsoever on the system>s ability to access already allocated pages. Thus, if a better strategy is found, this will be applied (different stra- tegies may even be used in different environments). 7_._2_._4_ _ _ _ _ _M_a_r_k_i_n_g_ _o_f_ _M_a_l_f_u_n_c_t_i_o_n_i_n_g_ _P_a_g_e_s_ 7.2.4 Pages, which develop persistent errors, should be marked in some way to ensure future unaccessibility. It is natural to use the Bit Map for this. Slices containing malfunctioning pages are considered "allocated", though not to any specific file. This is, however, more easily said than done. A malfunctioning page can only be detected during r_e_a_d_i_n_g_ of the page (if a write-operation is specified to an ordinary disk-con- troller, it will write the data unconditionally - no matter how defect the page may be). \f This means that malfunctioning pages typically are detected during reading of a file. In this situation we cannot just mark the relevant slice as "allocated" in the Bit Map, since this is the case already (it is allocated to the file we are reading)! What to do? P_r_o_p_o_s_a_l_: Whenever a file is d_e_l_e_t_e_d_, it will be check-read to discover any malfunctioning pages. All pages, but the malfunctioning ones, will be deallocated. Simple to implement, but it is unacceptable in terms of time-con- sumption. Consider deleting a file of 40.000 (say) pages -- patience will be required! P_r_o_p_o_s_a_l_: A separate bit table -- the Malfunctions Bit Table (MBT) -- is maintained. The table references all mal- functioning slices. Whenever a malfunctioning page is discovered the associated slice will be marked in MBT. Whenever a slice is deallocated, it is checked in the MBT that this slice is ok. If it is not ok, no bit will be set in the ordinary Bit Table of free slices. This method does not add notably to system overhead. Furthermore, it has the additional advantage of maintaining a complete table of the malfunctioning areas of the backing storage. Such a table may prove useful in a number of situations. Thus, this last method will be chosen. \f 7_._3_ _ _ _ _ _ _ _F_u_n_c_t_i_o_n_s_ _o_f_ _t_h_e_ _D_S_M_-_L_e_v_e_l_ 7.3 The set of functions of a single Device Strategy Module will look something like: get (buffer, page _addr, page _count) put (buffer, page _addr, page _count) mount (volume _description) dismount allocate (amount, chunk) deallocate (chunk) B_u_f_f_e_r_ and p_a_g_e_ _c_o_u_n_t_ are described in section 5.3. P_a_g_e_ _a_d_d_r_ is the address -- within the given volume -- of the first page to be transferred, or marked defect. V_o_l_u_m_e_ _d_e_s_c_r_i_p_t_i_o_n_ contains in- formation concerning the characteristics of the volume to be mounted; used for initialization of the module. A_m_o_u_n_t_ is the number of slices to be allocated by the "first-fit" method. C_h_u_n_k_ indicates location and length of the chunk allocated or (to be) allocated. 7_._4_ _ _ _ _ _ _ _C_o_n_c_l_u_d_i_n_g_ _R_e_m_a_r_k_s_ 7.4 The design of the DSM-level as described above, meets all of the stated objectives of section 7.1. One should notice that the DSM-level must be implemented with care, since several demands to the flexibility and modularity of the implementation have been stated. \f F_ 8_._ _ _ _ _ _ _ _ _I_N_P_U_T_ _O_U_T_P_U_T_ _C_O_N_T_R_O_L_ _S_Y_S_T_E_M_ _(_I_O_C_S_)_ 8. This level defines the actual software interface to the hardware of external devices. Furthermore it performs functions such as: error retrial, exeption handling, etc. Since the level is highly implementation and configuration de- pendent, it will not be elaborated in detail here. In fact only one single aspect will be treated: the size of pages (a page being the minimal quantum of bytes accessed from an arbitrary backing storage device, in one operation of the IOCS). The objectives and discussion is accordingly restricted to this single issue. 8_._1_ _ _ _ _ _ _ _O_b_j_e_c_t_i_v_e_s_ 8.1 Prime objectives: - the fixed page size should balance the factors of buffer claim (on core storage), and the transfer rate of data. Secondary objectives: - the page size should support all common formatting standards of backing storage devices (allowing use of pre-formatted media). 8_._2_ _ _ _ _ _ _ _D_e_s_i_g_n_ 8.2 The effective transfer rate (in bps) is higher for large p_a_g_e_ s_i_z_e_s_ than for small (as the administration overhead is approxi- mately constant). This fact points to large page sizes. Core storage is a critical resource. Large page sizes consume a lot of core for buffers. This points to small page sizes. \f Balancing these factors is hard because circumstances may be so different from installation to installation (core storage is cer- tainly a critical resource on a 8-bit microcomputer with a maxi- mum address-space of 64 Kb, as well as high transfer rates may be essential in a large multi-user system). We have -- rather arbitrarily, but hopefully balancing the above mentioned factors -- chosen a page size of 512 bytes. This page size supports the most common sector size standards for floppy disks. As "standard" for floppy disks (diskettes), the IBM formats are generally prevailing. The clasical IBM standards are: 128, 256, and 512 bytes per sector (ref.: IBM not dated). These may all be supported by the chosen page length of 512 bytes/sector, since more than one sector may constitute a page (e.g. 4 sectors of 128 bytes may be read into a 512 byte page- buffer). IBM has introduced another format of 1024 bytes/sector lately. This cannot -- without considerable difficulties -- be supported by a 512 byte page size! However, the format does not -- as of this time -- seem to be generally accepted, and predictions are, that it will be dropped or at least limited in use. Since the expences of a 1024 bytes pagesize format are considerable in respect to core consumption of pagebuffers, we choose not to be impeded by the small chances of a 1024 bytes/sector standard being accepted. \f 8_._3_ _ _ _ _ _ _ _F_u_n_c_t_i_o_n_s_ _o_f_ _t_h_e_ _I_O_C_S_-_L_e_v_e_l_ 8.3 As stated above the actual interface to the IOCS may be rather environment-dependent. A general outline for the functions may look like this: get (buffer, page _addr, page _count) put (buffer, page _addr, page _count) control (request) B_u_f_f_e_r_, p_a_g_e_ _a_d_d_r_ and p_a_g_e_ _c_o_u_n_t_ are described in section 7.3. R_e_q_u_e_s_t_ denotes the wanted control-function of the driver. 8_._4_ _ _ _ _ _ _ _C_o_n_c_l_u_d_i_n_g_ _R_e_m_a_r_k_s_ 8.4 The stated objectives of section 8.1 are met. \f F_9_._ _ _ _ _ _ _ _ _A_C_C_E_S_S_ _M_E_T_H_O_D_S_ _(_A_M_)_ 9. The Access Methods is the user>s interface to the file system. The facilities of this interface is highly dependent on applica- tions, environments, etc. It may comprise techniques such as: fixed or variable length record accessing, single or multibuf- fered accessing, random or sequential accessing, specialized database record accessing, etc. As stated in chapter 3, it is not considered part of this file system design to define or discuss facilities of the AM-level. It should be noted, however, that the file system has been de- signed to support developments of different problem-oriented access methods. If the AM-level had been incorporated into the file system nucleus, the user would have been constrained by a small set of limited access methods. \f F_ 1_0_._ _ _ _ _ _ _ _L_O_G_I_C_A_L_ _F_I_L_E_ _S_Y_S_T_E_M_ _(_L_F_S_)_ 10. The primary component of this module is the c_a_t_a_l_o_g_ _s_y_s_t_e_m_. This is responsible for the mapping of (userdefined) symbolic n_a_m_e_s_ on to the internal and unique file i_d_e_n_t_i_f_i_e_r_s_. Furthermore the mo- dule offers authorization enforcement facilities. In fact, the services and facilities of the LFS may be numerous. Since one of our main goals for the file system nucleus is to keep the size within limits, the actual LFS of this design will be kept at a minimum, only offering basic functions. These func- tions must be chosen with care, so that augmenting levels, imple- menting more refined and potent naming systems, are well suppor- ted. The following subsections investigate the properties of naming, and based on this analysis, decide on (1) the structure of cata- logs, (2) a simple authorization enforcement scheme, and (3) which basic functions the LFS should offer. Search-techniques in catalogs are not discussed in this report. 1_0_._1_ _ _ _ _ _ _O_b_j_e_c_t_i_v_e_s_ 10.1 Prime objectives: - the catalogs must be able to hold a very large amount of names (preferably no fixed limits on the number of names in a cata- log). - flexible, but easy-to-understand rules for protection of other user>s names. - sharing of files between users must be allowed (if so specified by the "owner" of the file). - a set of access-rights must be associated with each name, so that a simple authorization scheme can be enforced. \f - catalogs should be distributed on volumes (no single volume should hold the names of files on every other volume, since such a catalog could grow to enormous dimensions - and would be a very vulnerable point in case of errors - "if the mastercata- log goes, the whole system goes ..."). Secondary objectives: - catalogs on different volumes of backing storage (for instance disc-packs) should not be totally independent: inter-volume references are a practical tool. The danger of such references are that they may lead to repeated mounting of volumes just to continue a path in a chain of references. 1_0_._2_ _ _ _ _ _ _D_e_s_i_g_n_ 10.2 The discussion of the Logical File System is divided into follow- ing main items: - a model for naming - how should files be named? - how should the catalog system be structured? - how should a simple authorization enforcement scheme be implemented? - deleting of files 1_0_._2_._1_ _ _ _ _A_ _M_o_d_e_l_ _f_o_r_ _N_a_m_i_n_g_ 10.2.1 In this subsection we will consider filenames as such, i.e. the string of characters representing a symbolic reference to the data structure: a "file". We will not be bothered with reflecti- ons on how to organize the underlying data which represents the names. \f 1_0_._2_._1_._1_ _ _S_p_e_c_i_a_l_ _O_b_j_e_c_t_i_v_e_s_ _o_f_ _N_a_m_i_n_g_ 10.2.1.1 Following special objectives concerning the naming of files could be drawn (in accordance with the general objectives of the Logi- cal File System): - File names are chosen and used by humans: naming rules must allow names which are easily recognized and remembe- red by humans. - In a file system a file is considered an o_b_j_e_c_t_ of the class of files. It is natural to divide the class into groups of objects in different ways. A mechanism of divi- ding often used, is the hierarchical structuring (since it mirrors the common structure of labour-organization -- amongst human workers as well as tasks/projects). However, other structures of division may be used too. - The set of accessible filenames should not be static. Dif- ferent users may see different subsets (maybe even totally independent subsets). - Users seeing separate subsets of filenames must be able to name distinct files with the one and same name. User A may see a file "PIP" which has nothing to do with the file "PIP" of user B. - Different users with shared access to a single file must be able to name this file individually, i.e. a file may have several names. 1_0_._2_._1_._2_ _ _N_a_m_e_s_:_ _A_n_ _A_b_s_t_r_a_c_t_ _M_o_d_e_l_ 10.2.1.2 Now, let us try to express the notions of "naming" abstractly. First of all we define the words: "context", "partial filename", and "complete filename". \f The objective of allowing one name to represent several separate files means that the interpretation of a filename depends on the c_o_n_t_e_x_t_ in which it is used. This leads to the notion of "c_o_m_p_l_e_t_e_" and "p_a_r_t_i_a_l_" filenames. A complete filename is -- by definition -- unambiguous. Partial filenames may very well be ambiguous, but the context in which they are used will determine their corresponding complete file- names -- thereby giving them exact and unambiguous interpretati- ons. The construction of complete filenames may be a facility of the AM-level. It will automatically append some sort of context identification (prefix) to any user specified filename. This implies that a filename always is unambiguous (complete) in the interface between AM and LFS. The context of a user is defined and perhaps changed by some sort of supervisory system (possibly embedded in the AM-level). The properties of a naming system could be represented as a directed graph G. The context of any user is represented by a unique vertex of the graph G, called his "context-vertex". A user is said to have "access to" a vertex V if and only if there is an (not necessarily unique) oriented path (eDd1Uu, eDd2Uu, ---, eDdnUu) from V to his context-vertex CV (eDdiUu is the name of edge i in the digraph). The tuple (eDd1Uu, eDd2Uu, ---, eDdnUu) is called the "partial name" of vertex V as seen from CV. Vertices with in-degree equal to zero are called "files", all others are called "catalogs". This implies that a user with context-vertex of type "file" has access to n_o_ other vertices, whereas a user with context-vertex of type catalog may have access to a number of vertices -- in fact to the sub-digraph of which the context-vertex is a root. \f T_ (tegning) Figure 10.1: Example of a digraph structure. E_x_a_m_p_l_e_: User A with context-vertex C1 has a_c_c_e_s_s_ _t_o_ C1, F1, C3, F2, F3, and F4. User B with context-vertex C2 has access to C2, F4, F5, &_ and F6. M_M_m_ P_a_r_t_i_a_l_ _n_a_m_e_s_ of F1 as seen from C1 could be (e ), (e , P_P_p_ 2 1 M_M_m_ e ), (e , e , e ), etc. P_P_p_ 2 1 1 2 As F4 are accessible from both C1, and C2, partial names M_M_m_ of F4 could be e.g. (e ), (e ), (e , e ), etc. P_P_p_ 3 4 5 4 To interpret this abstract model in terms of the well-known noti- ons of filenames (as a string of characters) and context (as the state in which a user finds the file system, defining the subset of filenames accessible), we define two f_u_n_c_t_i_o_n_s_ on an arbitrary digraph D. First, define following sets: E = e e is an edge of D M_M_m_ n P = (e ,e ,...,e ) E (e ,e ,...e ) is an oriented path in D P_P_p_ 1 2 n 1 1 n V = v v is a vertex in D C = c c defines a "context" S = s s is a string of characters \f Then define the two functions: NAME : P S DEFINECONTEXT: C V The situation of this point is: Associated with every user of the file system is a context c (this context may be defined in various ways: by a user-password, by explicit command of the user, etc.). If we interpret the ver- tices of file-type as file i_d_e_n_t_i_f_i_e_r_s_ (i.e. for every file in the file system there exists one unique file-type vertex), we may consequently interpret the oriented paths finishing in the vertex defined by the value of D_E_F_I_N_E_C_O_N_T_E_X_T_(c) (and originating in a file-type vertex), as representations of the accessible file- names. The function N_A_M_E_ maps these paths into ordinary string filenames. 1_0_._2_._2_ _ _ _ _H_o_w_ _S_h_o_u_l_d_ _F_i_l_e_s_ _b_e_ _N_a_m_e_d_?_ 10.2.2 Now the problem of choosing a reasonable naming method corres- ponds to solving the following sub-problems. - definition of restrictions on the digraph structure - definition of the function NAME - definition of the function DEFINECONTEXT. 1_0_._2_._2_._1_ _ _R_e_s_t_r_i_c_t_i_o_n_s_ _o_n_ _t_h_e_ _D_i_g_r_a_p_h_ _S_t_r_u_c_t_u_r_e_ 10.2.2.1 P_r_o_p_o_s_a_l_: Only one vertex of catalog type exists. The rest of the vertices in the digraph are all of file-type. The di- graph forms an oriented tree. (RC system 2, PDP SOLO) \f T_ Figure 10.2: Global catalog. &_ This is the simplest imaginable naming system. Only a single non- trivial context-vertex exists, namely the catalog-type vertex. All users have shared access (at least at the level of naming) to all files. Of course this does not meet the objectives of the naming system. P_r_o_p_o_s_a_l_: The digraph forms an oriented tree with three levels. The root may be called the "Master File Directory" (MFD); rest of the catalog-type vertices may be called "User File Directories" or (UFD) "subcatalogs". (TITAN, RC3600). Figure 10.3: Two-level catalog structure. \f Different users may have different context-vertices (correspond- ing to the UFD>s) in this proposal. Naming is still very simple, but it does not allow multiple naming of a file. P_r_o_p_o_s_a_l_: The digraph forms an oriented tree with an arbitrary number of levels. (RC system 3) Figure 10.4: Tree-structure. As an example of this proposal we may take the RC system 3 naming system (used on RC4000 and RC8000): Slightly s_i_m_p_l_i_f_i_e_d_ we can describe the rules of accessibility and protection as follows: Every filename in the file system is connected with two integers, forming an interval (called the baseinterval of the filename). Every process accessing the file system is connected with a base-interval defining the accessible and protected filenames. A filename can be used unrestrictedly from a given process, if (and only if) the base-interval of the filename is completely covered by the base-interval of the process. \f Furthermore a filename can be accessed (for reading) if the base-interval of the filename completely covers the base-inter- val of the process. These two access rules obviously corresponds to t_w_o_ independent oriented trees of the sort proposed: an "update-tree" and a "read-only-tree". The notable drawback of the proposal is that it does not allow multiple naming of a file. This was a primary objective of the LFS. The problem may be overcome by slackening the restrictions, al- lowing the file-type vertices to have out-degree greater than one (this means that the digraph is not restricted to the oriented t_r_e_e_ organization). P_r_o_p_o_s_a_l_: The digraph is r_o_o_t_e_d_, but otherwise unrestricted. (OS6) This is the most liberal proposal possible. It offers completely optional choise of naming system for any augmenting level. The proposal will be adapted in the final design. 1_0_._2_._2_._2_ _ _T_h_e_ _F_u_n_c_t_i_o_n_ _N_A_M_E_ 10.2.2.2 We will make two restrictions on the function NAME: M_M_m_ - two a_l_m_o_s_t_ identical paths (e , e , e , x) and (e , e , e , y) P_P_p_ 1 2 3 1 2 3 should be named a_l_m_o_s_t_ identically (to be easily recognized by the humans). This could be fulfilled if: NAME (eDd1Uu, eDd2Uu) = NAME(eDd1Uu) d NAME(eDd2Uu) where d is some sort of character delimiter. Reasonable and often used choises for d are ".", "/", or ":". \f - to ensure unambiguiety in connection with the above restric- tion, we must define: NAME(eDd1Uu) 0=/5 NAME (eDd2Uu) if eDd1Uu and eDd2Uu both are edges initiating in the same vertex (no two names in a catalog may be identical). 1_0_._2_._2_._3_ _ _T_h_e_ _F_u_n_c_t_i_o_n_ _D_E_F_I_N_E_C_O_N_T_E_X_T_ 10.2.2.3 The function DEFINECONTEXT maps the abstract notion of a "user- context" into an actual vertex of the naming-digraph. The user-context may be defined by means of passwords, explicit command, jobnames, etc. This is definitely not a problem which the file system nucleus should deal with. It will be left to aug- menting levels. 1_0_._2_._2_._4_ _ _S_u_m_m_a_r_y_ _o_f_ _N_a_m_i_n_g_ _R_e_s_t_r_i_c_t_i_o_n_s_ 10.2.2.4 The naming of files may be represented by a d_i_g_r_a_p_h_. The vertices of the digraph are of f_i_l_e_ type if their in-degree equals zero, otherwise these are of c_a_t_a_l_o_g_ type. The digraph must be r_o_o_t_e_d_ and a_c_y_c_l_i_c_. Every edge of the digraph has a name (string of ASCII-characters). The names of the edges ending in a catalog- type vertex are the names c_o_n_t_a_i_n_e_d_ in this catalog. No two identical names may be contained in the same catalog. The c_o_m_- p_l_e_t_e_ name of a vertex (file or catalog), is the concatenated string of edge-names (delimited by a special character such as "."), so that the edges form a path from the vertex to the root. 1_0_._2_._3_ _ _ _ _S_t_r_u_c_t_u_r_e_ _o_f_ _t_h_e_ _C_a_t_a_l_o_g_s_ 10.2.3 This subsection deals with the actual implementation of the di- graph structure described in subsection 10.2.2.1. There are two types of vertices: files and catalogs. The imple- mentation of file-type vertices has been discussed in great\f detail in previous chapters. So it is the implementation of catalog-type vertices that remains. The problem is: how to represent catalog-type vertices by actual data-structures? P_r_o_p_o_s_a_l_: All catalog-information is collected in o_n_e_ file, a "Master-catalog". This proposal is obviously relative simple to implement. The file should be organized allowing random access. Searching may therefore be fast, since the organization scheme could be made to fit the usage precisely. The notable disadvantage of the proposal is, that the catalog information is not distributed on the individual volumes: The volume holding the Master-catalog must always be mounted! If there is a lot of files on the backing storage, the catalog could grow to enormous dimensions. This is naturally in conflict with the objectives of the LFS. P_r_o_p_o_s_a_l_: There is a single "volume-directory" file on each volume, containing the collection of all the catalogs of the volume. (RC System 3) This proposal fulfils the stated objectives. Still, the volume- catalog may grow, containing numerous entries, if the volume is big. It may be difficult to dimension the catalog optimally, because it is practically impossible to predict the actual number and distibution of entries. This may cause some degradation of performance (overflow problems etc.). P_r_o_p_o_s_a_l_: Every catalog-type vertex corresponds to an individual catalog file. (OS6) \f This is probably the "cleanest" design. Catalog files may be dimensioned and organized individually, according to the typicals of the files they reference. The danger of the method is that "long" filenames containing several concatenated simple names, may be rather time-consuming to interpret, since every individual catalog in the "path" of the filename must be (1) opened, (2) searched, and possibly (3) closed. However, this problem may be more or less remedied by opening frequently used catalogs permanently. The last proposal will be chosen for the final design. 1_0_._2_._4_ _ _ _ _A_u_t_h_o_r_i_z_a_t_i_o_n_ _E_n_f_o_r_c_e_m_e_n_t_ 10.2.4 Authorization enforcement is of great importance both in a multi-user environment and in a single-user environment, since it not only protects against maliciousness -- but also against accidental errors. Therefore it is a necessary and essential part of the LFS. On the other hand it should certainly not be sophisticated, adding too much complexity to the file system nucleus: Sophistication should be left to augmenting systems. The naming system described in the previous subsections partly controls the users access to files (a user can only access files which he can name). But this is not sufficient because it is an "all-or-nothing" sort of enforcement. It is necessary to control access to files in a more minute way. \f A reasonable set of possible authorizations for a user of a file (catalog or data-file) are: authorization to: - RETRIEVE data - UPDATE data - ADJUST length of file - DELETE file. The main problem of authorization enforcement is: how are author- izations granted to separate users of the same file? It would involve substantial additional administration overhead to the file system, if it were to maintain separate descriptions of the individual users (so-called user-profiles), defining the authority of each one. Such natural extensions will be left to augmenting systems. Instead we will use the naming system exclusively, for defining "users". If two users are to have different authority relative to a file, they must name the file individually, thus allowing dif- ferent authorization attributes to be associated with each names. In keeping with this convention, it is natural to define the o_w_n_e_r_-name of a file, as the name under which the file were ori- ginally created. If other names are linked to this name (i.e. re- ferencing the same file) they must be g_r_a_n_t_e_d_ authority by the owner-name. A very simple -- but effective -- scheme for granting of author- ities is: whenever a name A is linked to a name B, B will be granted authorities which are a subset of A>s. The authorities of a name may be changed if and only if the (not mentioned earlier) authority AUTHORIZE is contained within the authoriza- tion attributes of the name. \f E_x_a_m_p_l_e_: A file is created under the name of A with following authorities: A: RETRIEVE, UPDATE, ADJUST, DELETE, AUTHORIZE. That is: all authorities. Since the user of A wants to protect his file against deleting, he defines that only the authorities RETRIEVE, UPDATE, and ADJUST may be granted to linked names: A: R_E_T_R_I_E_V_E_ , U_P_D_A_T_E_, A_D_J_U_S_T_, DELETE, AUTHORIZE If a name B is linked to A, it inherits following authorities: B: RETRIEVE, UPDATE, ADJUST The user of B -- in his place -- wants to protect the file against any updating from subsequently linked names, and defi- nes only RETRIEVE as inheritable: B: R_E_T_R_I_E_V_E_, UPDATE, ADJUST A name C linked to B, inherits: C: RETRIEVE In case of compound filenames (i.e. concatenated names, e.g. "CAT1.CAT2.PIP") the set of authorities in effect is the i_n_t_e_r_s_e_c_t_i_o_n_ of the sets for each of the individual simple names. Thus, all the files of a catalog may be protected against upda- ting, if the UPDATE authorization is absent from the authority of the catalog name. In fact it will not even be possible to insert new, nor delete existing names in the catalog. \f E_x_a_m_p_l_e_: Consider a multi-user system in which every user has his own private subtree catalog structure. The authorizations associated with the root of this subtree are presumably unrestricted. To prevent interference between users, no user subtree may contain backward references to catalog levels beyond that of the subtree root. On the other hand, a number of files must be sharable between the users: utility programs, system programs, compilers, etc. This may easily be accomplished by the scheme outlined in figure 10.5: Figure 10.5: Example of multi-user directory structure. Authorizations are showed at the edges. It will not be possible for any user (who has not access to the system root) to update any file in the subtree of utilities. \f This scheme offers basic facilities for propagating authorities in the naming system. However, it lacks the ability to protect a user against own accidental misuse (d_e_f_e_n_s_i_v_e_ protection). If a user temporarily wants to write-protect (say) a file, he can only do so by removing some of his own authorities, possibly never to regain them. This is unacceptable, so we define another set of attributes, p_r_o_t_e_c_t_i_o_n_s_, exactly equal to the -- not to mistake for -- authorizations (viz. RETRIEVE, UPDATE, ADJUST, DELETE, AUTHORIZE). Protections may be altered at any time and are local to each name. 1_0_._2_._5_ _ _ _ _D_e_l_e_t_i_n_g_ _o_f_ _F_i_l_e_s_ 10.2.5 Deleting of files is a non-trivial problem in the chosen catalog- structure. The point is: A given file may be referenced by sev- eral names. If the file is requested to be deleted through one of the names, what should happen with the other names, referencing the same file. P_r_o_p_o_s_a_l_: The entry in question in the File Location Table is marked "deleted", and the actual file is deleted ef- fectively. Later references to the file will be stopped (and notified) by the deleted FLT-entry. The proposal solves the problem, but unfortunately the File Loca- tion Table will grow indefinitely, since no entry may ever be ef- fectively deleted and re-used. P_r_o_p_o_s_a_l_: A counter is inserted into every FLT-entry. The counter gives the number of undeleted names pointing at the en- try. When the counter is zero, the entry may be re-used for another file. This proposal seems allright at first look, but it does have a well-known, classical drawback (Knuth 1969). \f Deleting of a name is the same as deleting (at least) one edge in the corresponding digraph. If -- after deletion -- one or more vertices become isolated from the root of the original digraph, they are unaccessible, and should therefore be effectively dele- ted. Counting of references to a FLT-entry corresponds to count- ing of the o_u_t_-degree of the referenced file -- in terms of the digraph model of subsection 10.2.1.2. When the out-degree is zero, the vertex is unaccessible, and must be deleted together with all the edges that end in the vertex. One should think that by recursive application of the method, the unaccessible parts of the digraph will be deleted. Unfortunately that is not so in all cases. Consider the following example: Figure 10.6: Example of cyclic digraph. where C1 is the root. If the edge eDd1Uu is deleted C2 and C3 will become unaccessible, and should consequently be deleted. However, the above proposed "garbage collection" algorithm cannot do the job: both C2 and C3 have outdegree equal to one. The reason for this trouble is, that the algorithm does not work, if the "garbage" contains c_y_c_l_i_c_ references! It is extremely difficult to prohibit cyclic references without at the same time restricting the digraph structure unnecessarily (the digraph could for instance be restricted to a t_r_e_e_, ensuring \f acyclicity), since it is difficult to check a structure for cyclic references. The problem may be solved by either (1) restrictions on the digraph structure, (2) check of acyclicity on file-creation time, (3) check for cyclic references at file-deletion time, or (4) use of an individual garbage collection utility program checking the whole digraph structure. A_n_ _e_x_a_m_p_l_e_ of a possible method for acyclicity enforcement: A status variable called LEVEL is retained in every file of type catalog. This variable denotes the level (i.e. the l_o_n_g_e_s_t_ path of references from the root). LEVEL for MAINCAT: 0 CAT1: 1 CAT2: 2 CAT3: 3 Figure 10.7: Example The LEVEL variable is easily computed when a catalog is created (add one to the level of the mother-catalog). When a new reference between two existing catalogs is inserted (an entry "CATB" (say) -- referencing the catalog B -- is inserted in catalog A (say)), it should be checked that the LEVEL of the referenced catalog (catalog B) is greater than (and not equal to) the LEVEL of the "mother"- catalog (catalog A). If this condition is met no cyclic re- ferences can be introduced. \f This example suffers from adding unnecessary restrictions to the acyclicity restriction, too. Admittedly, every acyclic digraph may be constructed under the restrictions of the example, but it is very critical how the creations of inter-catalog references are sequenced. Consider Fig. 10.7: the reference eDd3Uu m_u_s_t_ be created before eDd4Uu. Instead of implementing either a simple, but unnecessary restric- tive, or a very complex acyclicity enforcement system, it is the decision of the present design to leave such to augmenting levels of software. That is, no acyclicity enforcement scheme is implemented in the design. This decision seems reasonable, remembering that the possible amount of garbage is restricted to a number of File Location Table entries. 1_0_._3_ _ _ _ _ _ _B_a_s_i_c_ _F_u_n_c_t_i_o_n_s_ _o_f_ _t_h_e_ _L_F_S_ 10.3 Since the LFS is augmenting the BFS-level in the hierarchical structure of the file system, the LFS-functions handling files resembles the corresponding functions of the BFS (cf. section 5.3). The difference is that the file identifier supplied in the func- tions of the BFS should be substituted by a symbolic filename in the functions of the LFS. The syntactical rules for naming of files -- as defined by the restrictions on the function NAME (cf. section 10.2 -- could be expressed like this in Backus-Naur notation: <name'::= <catalogname' <d' <simple name' <catalogname'::= <catalogname' <d' <simple name' <simple name'::= "algol-like identifier" <d'::= >.> >:> >/> \f However, it is not necessarily fortunate to express filenames as a string of concatenated catalog- and file-names, in the a_r_g_u_- m_e_n_t_s_ of LFS-functions. The recursive definition above suggests a way to express a filename in the functions of the LFS, namely: (catalog _identifier, simple _name). Instead of using the complete filename (and consequently interpreting it) every time a function is called (e.g. OPEN ("cat1.cat2.cat3.pip", ...)), one may find the file _identifier (the unique internal name, cf. subsection 5.2.1) for the "catalog-part" of the name string (that is "cat1.cat2.cat3" in the above example), and just stating the simple name ("pip") after this (OPEN (catid, "pip", ...)). This way to name files on the LFS-level offers potent facilities to augmenting levels. An augmenting naming system may for instan- ce maintain a table of catalog _identifiers with an entry for each user. Now the user only has to specify a simple name to reference a file -- the naming system automatically supplies the catalog _identifier of the user. The catalog _identifiers may be initialized by means of user-supplied passwords, thereby imple- menting a simple authorization enforcement scheme on u_s_e_r_s_. A reasonable set of functions on the LFS-level is: mount(volumename, device, cat _id) dismount(device) connect(cat _id, name, new _cat _id) disconnect(cat _id) lookup(cat _id, name, attributes, descriptor) open(cat _id, name, mode, file _id) close(file _id) get(file _id, buffer, first _page, count) put(file _id, buffer, first _page, count) \f create(cat _id, name, init _data) delete(cat _id, name) linkname(cat _id _1, newname, cat _id _2, oldname) rename(cat _id, oldname, newname) extend(cat _id, name, amount) truncate (cat _id, name, amount) set _tail(cat _id, name, tail) set _authority(cat _id, name, authorizations) set _inheritance(cat _id, name, inheritable _authorizations) set _protection(cat _id, name, protections) Please refer to chapter 6 for description of the functions. 1_0_._4_ _ _ _ _ _ _C_o_n_c_l_u_d_i_n_g_ _R_e_m_a_r_k_s_ 10.4 None of the objectives of section 10.1 are neglected. The chosen catalog structure is compatible with the file system of UNIX (Richie et al. 1974), in the sense that the interface of UNIX may be simulated, provided that a proper set of restrictions and supporting facilities are built on, in a superimposing level to the LFS. Owing to the increasing popularity of UNIX and related systems (Hanson 1980) this seems worthwhile to note if considering standardization on the level of access interface. \f F_1_1_._ _ _ _ _ _ _ _R_E_M_A_R_K_S_ _O_N_ _I_M_P_L_E_M_E_N_T_A_T_I_O_N_ 11. As stated in the introduction (section 1.1) the present report is part of a development-project at A/S Regnecentralen af 1979. The outcome of this project must be an actual implementation of the designed file system. However, it is not within the scope of this report to describe implementation in detail, but a general outline of the ideas and structure of a possible implementation will be given in this chapter. The intended implementation language is PASCAL80 (Staunstrup 1980). 1_1_._1_ _ _ _ _ _ _S_t_r_u_c_t_u_r_i_n_g_ _t_h_e_ _F_i_l_e_ _S_y_s_t_e_m_ 11.1 The individual levels of the design presented in chapter 4 are conseived as p_r_o_c_e_s_s_e_s_ (or incarnations of processes). Figure 11.1: Processes of the file system, nested. \f Figure 11.1 shows the processes of the system. The processes are -- in keeping with ordinary PASCAL80 principles -- conceived as n_e_s_t_e_d_ processes. However, it may be opportune -- in a not-so-distant future -- to distribute the file system over a number of loosely connected computers, each with its own private memory. Fortunately such possibilities are inherent in the structure of the implementation language PASCAL80, because all modularization of programs is done by dividing the code into separate processes (processes contain completely closed address spaces; the only information that can be imported or exported from a process must go through process p_a_r_a_m_e_t_e_r_s_ or through message s_e_m_a_p_h_o_r_e_s_). Consequently the organization of fig. 11.1 is preferably restruc- tured -- from the n_e_s_t_e_d_ organization to a "h_o_r_i_z_o_n_t_a_l_" organi- zation of processes. The horizontal organization of fig. 11.2 allows distribution of the individual processes over a number of connected computers. Figure 11.2: Processes of the file system, "horizontal" design. 1_1_._2_ _ _ _ _ _ _T_r_a_n_s_p_a_r_a_n_c_y_ _o_f_ _H_i_e_r_a_r_c_h_i_c_a_l_ _L_e_v_e_l_s_ 11.2 The hierarchical model of chapter 2 proved very convenient for the design of the file system. \f However, in the actual implementation the structure need not be too rigoristic. It is not at all efficient if a procedural call to one level simply results in an exactly identical call to the next level (i.e. the level is transparent with respect to the requested operation). An example of such unwanted overhead is the call of "get". As illustrated in section 2.2 this does not involve LFS or BFS (be- cause open performs the operations of these levels). It would indeed be ineffective if "get" were to be called through both LFS and BFS. The actual entry-points of the levels are illustrated in fig. 11.3. \f Figure 11.3: Interconnections in the file system. \f F_ 1_2_._ _ _ _ _ _ _ _A_ _P_R_E_L_I_M_I_N_A_R_Y_ _R_E_F_E_R_E_N_C_E_ _M_A_N_U_A_L_ 12. 1_2_._1_ _ _ _ _ _ _I_n_t_r_o_d_u_c_t_i_o_n_ 12.1 The present text describes a PASCAL80-implemented File System. The File System is intended to administer the backing storage of a computer system. This backing storage may consist of several -- possibly different -- backing storage devices, e.g. hard disks and floppy disks. It is the scope of this manual to give the reader an introductory overview of the components and functions of the file system. Con- sequently, descriptions of functions etc. are sketchy and by no means sufficient. The reader should refer to the future "System Programmes Referen- ce Manual" for detailed description of the File System. 1_2_._2_ _ _ _ _ _ _C_o_m_p_o_n_e_n_t_s_ _o_f_ _t_h_e_ _B_a_c_k_i_n_g_ _S_t_o_r_a_g_e_ 12.2 1_2_._2_._1_ _ _ _ _V_o_l_u_m_e_s_ 12.2.1 The backing storage of a computer system consists of one or more v_o_l_u_m_e_s_. Each volume resides on a d_e_v_i_c_e_. A given volume is not necessarily tied to one particular device: volumes may be ex- changed or replaced. Each volume has a v_o_l_u_m_e_n_a_m_e_, given to it by the initialization program formatting the volume. It is the responsibility of the user to ensure that volumenames are unique. A volume may comprise a maximum of 4 G (billion) pages. \f Figure 12.1: Example. Volume 2, 3, and 4 are mounted on the devices 2, 3, and 1 respectively. The volumes 1 and 5 are not mounted (dismounted). 1_2_._2_._2_ _ _ _ _P_a_g_e_s_ 12.2.2 A volume consists of a number of pages. One page contains 512 bytes (octets). Pages may be a_l_l_o_c_a_t_e_d_ or f_r_e_e_ (i.e. not-in-use). Free pages are not allocated individually: the smallest unit of allocation is a s_l_i_c_e_ (consisting of a number of pages). The slice-size may vary from volume to volume. The free slices of a volume are administered by the B_i_t_ _T_a_b_l_e_, in which every slice of the volume is represented by one bit (a slice is free if the corresponding bit is set). 1_2_._2_._3_ _ _ _ _F_i_l_e_s_ 12.2.3 Each volume contains a number of f_i_l_e_s_. A file is the basic facility for permanent storage of data. \f A file consists of a (logical) sequence of p_a_g_e_s_. The individual pages (or rather: slices) of a file need not necessarily be placed contiguously on the volume; they may be scattered in c_h_u_n_k_s_ of slices. Figure 12.2: Example The administrative data concerning the whereabouts of the chunks of a file are placed on the first page of the file. A file may comprise a maximum of 25 chunks of arbitrary size. A file may hold all the pages of a volume. Figure 12.3: File organization. \f The first page moreover contains the d_e_s_c_r_i_p_t_o_r_ of the file. The descriptor describes the status of the file in terms of length, reservation locks, etc. The first page of a file is called the d_e_s_c_r_i_p_t_o_r_ _p_a_g_e_ of the file. The locations of files are administered by the F_i_l_e_ _L_o_c_a_t_i_o_n_ T_a_b_l_e_ (FLT). This table contains the address of every file (i.e. the address of its first page). No two table entries point to the same file; that is: the table index of a given file is unambi- guous within the volume. Therefore the FLT-index of a file to- gether with the corresponding volumename is used as a unique in- ternal name for that file. The pair of data (volumename, FLT-index) is called a f_i_l_e_ _i_d_e_n_t_- i_f_i_e_r_. 1_2_._2_._4_ _ _ _ _C_a_t_a_l_o_g_s_ 12.2.4 Users must be able to associate symbolic n_a_m_e_s_ with files. Symbolic names -- together with the corresponding file identifiers -- are stored in special files: c_a_t_a_l_o_g_s_. There may be several catalogs on a single volume. A symbolic name may reference a catalog, since symbolic names represent file identifiers and catalogs are files themselves. This facilitates hierarchical (or rather graph-represented) naming systems. A single file identifier may be designated by several different names. A name may designate a file identifier on another volume. A special catalog exists on each volume: the Rootcatalog. This catalog is the "root" of the hierarchy of catalogs on that volume (i.e. every catalog may be reached, starting at the Rootcatalog).\f Whenever a new volume is mounted, the File System delivers a re- ference to the corresponding Rootcatalog. Figure 12.4: Catalog organization. 1_2_._2_._5_ _ _ _ _V_o_l_u_m_e_d_e_s_c_r_i_p_t_i_o_n_s_ 12.2.5 Every volume contains a special system file: the V_o_l_u_m_e_d_e_s_c_r_i_p_- t_i_o_n_. This file holds data concerning the status of the volume (slice size, volumename) etc. Furthermore it contains the Bit Table and the File Location Table. To be able to interpret the datastructures of an arbitrary vol- ume, the following convention regarding the Volumedescription file is made: the descriptor page of the Volumedescription file is always placed as the first page of the volume. \f This means that the Volumedescription file always can be read without knowledge of the exact locations of File Location Table, Rootcatalog, etc. It is the responsibility of the data in the Volume-description file to point such datastructures out. The volumedescription file contains an extra bit table -- besides the Bit Table described in subsection 12.2.2. This bit table is called the "Malfunctions Bit Table" (MBT), and it specifies all slices in which persistent errors have been detected. 1_2_._3_ _ _ _ _ _ _F_u_n_c_t_i_o_n_s_ _o_f_ _t_h_e_ _F_i_l_e_ _S_y_s_t_e_m_ 12.3 Below is listed the functions of the File System nucleus. The notation is casual but hopefully self-explanatory. Each of the calls to the file system may result in an error condition. The return from such a situation is for simplicity not represented. 1_2_._3_._1_ _ _ _ _V_o_l_u_m_e_ _H_a_n_d_l_i_n_g_ 12.3.1 Volumes may be mounted or dismounted: MOUNT (VOLUMENAME, DEVICE, CAT-ID) DISMOUNT (DEVICE) DEVICE specifies the device on which the volume should be mounted/dismounted. When a volume is mounted its name will be checked against VOLUMENAME. The Rootcatalog of the mounted volume will be referenced by CAT-ID upon return of MOUNT. 1_2_._3_._2_ _ _ _ _C_a_t_a_l_o_g_ _H_a_n_d_l_i_n_g_ 12.3.2 A catalog must be c_o_n_n_e_c_t_e_d_ ("opened") before it can be proces- sed. When a user connects a catalog, he will be supplied with a shorthand reference to this catalog. In fact this is what hap- pens in the function MOUNT of the previous subsection: when a volume is MOUNTed the Rootcatalog will automatically be connect- ed, and its shorthand-reference delivered in CAT-ID. \f A user only knows catalogs through symbolic names (which in their place are found in another catalog). Whenever a (simple) name is specified, it must be accompanied by a reference to the catalog in which to find the name. CONNECT(CAT-ID, NAME, NEW-CAT-ID) connects a new catalog. The user describes the catalog by a sym- bolic name in NAME and a catalog reference CAT-ID in which to find the name NAME. Upon return the catalog will be opened and referenced by NEW-CAT-ID. When the catalog has been processed it may be "closed" by means of: DISCONNECT(CAT-ID) A specific name NAME contained in the catalog referenced by CAT- ID may be found by means of: LOOKUP(CAT-ID, NAME, ATTRIBUTES, DESCRIPTOR) This function returns the attributes of the found name, and the descriptor of the file 1_2_._3_._3_ _ _ _ _F_i_l_e_ _H_a_n_d_l_i_n_g_ 12.3.3 When a file is to be processed it must be opened. After proces- sing it may be closed (to release tied-up resources). OPEN(CAT-ID, NAME, MODE, FILE-ID) CLOSE(FILE-ID) MODE denotes the wanted use of the file. It may be either RETRIE- VE, UPDATE, or both. Furthermore it must be specified whether PRIVATE or SHARED access is wanted. The file is designated by (CAT-ID, NAME). A shorthand reference to the file is returned in FILE-ID. \f The contents of the file referenced by FILE-ID may be processed by: GET(FILE-ID, BUFFER, FIRST-PAGE, COUNT) PUT(FILE-ID, BUFFER, FIRST-PAGE, COUNT) which transfer a number of pages. FIRST-PAGE denotes the first page to be transferred to or from the file, and COUNT the number of subsequent pages. Pages are put or taken from BUFFER. New files may be created: CREATE(CAT-ID, NAME, INIT-DATA) A file with the simple name NAME -- which is inserted into the catalog referenced by CAT-ID -- is created with initialization data (for attributes and descriptor) as in INIT-DATA. The file may be a catalog. A file is deleted by DELETE(CAT-ID, NAME) When DELETE is called, the designated file is marked for deletion. The file is not actually deleted, however, until all connections to the file has been CLOSEd. A catalog-file cannot be deleted, unless it is empty. All other names referencing the same file will -- after deletion of the file -- be referencing an e_m_p_t_y_ entry in the File Location Table. A new name may be linked to an already existing name by means of: LINKNAME(CAT-ID-1, NEWNAME, CAT-ID-2, OLDNAME) \f A new simple NEWNAME is inserted into the catalog designated by CAT-ID-1. The name will reference the same file as OLDNAME of the catalog designated by CAT-ID-2. A name may be changed (within a catalog) by means of: RENAME(CAT-ID, OLDNAME, NEWNAME) Finally a file may be extended or truncated: EXTEND(CAT-ID, NAME, AMOUNT) TRUNCATE(CAT-ID, NAME, AMOUNT) AMOUNT is the number of s_l_i_c_e_s_ to allocate/deallocate. Finally -- as a service function to augmenting file system levels -- following function allows arbitrary data to be written into the descriptor of a file: SET-TAIL(CAT-ID, NAME, TAIL) where TAIL contains the data to be inserted into the descriptor. The actual format and length of TAIL is implementation dependent. 1_2_._3_._4_ _ _ _ _A_u_t_h_o_r_i_z_a_t_i_o_n_ _E_n_f_o_r_c_e_m_e_n_t_ 12.3.4 The f_i_r_s_t_ _l_e_v_e_l_ of authorization enforcement is the naming sys- tem: users can only access the files they can name (if a user only knows one catalog reference, he can only access the subtree (or rather: subgraph) of files of which this catalog is a root). The s_e_c_o_n_d_ _l_e_v_e_l_ of authorization enforcement concerns the individual symbolic names of files: associated with each simple\f name is a set of authorization-attributes. The total set comprises: Authorization to: - RETRIEVE (data) - UPDATE (data) - ADJUST (length of file) - DELETE (file) - AUTHORIZE (i.e. redefine the set of authorizations) The set of authorizations for a given name defines the allowed use of the corresponding file. When a file (possibly catalog) "FILE" (say) is OPENed (CONNECTED), the set of authorizations associated with the file, is computed as the i_n_t_e_r_s_e_c_t_i_o_n_ of the two sets associated with (1) the name "FILE" and (2) the catalog in which "FILE" was found. If AUTHORIZE is contained in the set of authorizations for a name, the authorizations may be redefined by: SET-AUTHORITY(CAT-ID, NAME, AUTHORIZATIONS) When a new name is linked (by LINKNAME) to an existing name, it will inherit a s_u_b_s_e_t_ of the authorizations of the existing name. This subset of "inheritable" authorizations may for a given name be defined by: SET-INHERITANCE(CAT-ID, NAME, INHERITABLE-AUTHORIZATIONS) The t_h_i_r_d_ _l_e_v_e_l_ of authorization enforcement is totally local to each name. As a means to protect a name against accidental mis- use, a set of p_r_o_t_e_c_t_i_o_n_s_ is associated with each name. The set of protections comprises exactly the same elements as the total set of authorizations (viz. RETRIEVE, UPDATE, ADJUST, DELETE, AUTHORIZE). Whenever a function is performed via a given name, it is not only tested against the authorization of the name, but against the intersection of both authorizations and protections. \f The difference between authorizations and protections is twofold: - protections have no influence whatsoever on inheritance of authorizations. - protections may be redefined unconditionally at any time. Thus, protections are a t_e_m_p_o_r_a_r_y_ guard against a_c_c_i_d_e_n_t_s_. The protections of a given name is redefined by: SET-PROTECTION(CAT-ID, NAME, PROTECTIONS) \f F_ 1_3_._ _ _ _ _ _ _ _C_O_N_C_L_U_S_I_O_N_ 13. The report has shown a design of a file system nucleus offering basic (extendable) but potent facilities. The emergence of high-level system programming languages makes it an attractive and practicable task to design a file system suited for use on even very different computers. The advantages are two- fold: portability of files between different computer systems, and rationalization of implementation. It is the intention that the facilities of the file system nu- cleus should be extended by augmenting systems. Differing compu- ter systems may thus extend in different directions, but still preserve compatibility. The objectives of the presented file system are chosen according- ly. Special emphasis has been placed on the subjects of naming and support for authorization enforcement (i.e. the Logical File Sys- tem), providing the file system nucleus with such basic facili- ties, which augmenting systems may extend to a potent interface in a multi-user environment. It is worth noting that the (Logical) file system interface may be restricted (all within the general philosophy of the design) to simulate the interface of the UNIX file system (Ritchie et al. 1974). It is the opinion of some that UNIX -- or something closely resembling UNIX -- will be chosen as a standard for the environ- mental file system interface of the currently developed system programming language ADA. Since this language may come to be the general standard of the field, it seems reassuring that the pre- sent file system may be modelled to adhere to such standards. \f \f F_ A_._ _ _ _ _ _ _ _ _R_E_F_E_R_E_N_C_E_S_ A. Aris, T.A. and B. Tveden-Jørgensen: RC8000 Monitor, part 2. Reference Manual, RCSL No 31-D 477, A/S Regnecentralen, Copenhagen 1978 Brinch Hansen, P.: RC4000 Software Multiprogramming System, A/S Regnecentralen, RCSL No 55-D 140, Copenhagen, February 1971. Brinch Hansen, P.: The Architecture of Concurrent Programs, Englewood Cliffs NJ, Prentice-Hall, 1977 Hnatek, E. R.: Semiconductor Memory Update - Part 1, Computer Design p. 67, Dec. 1979 IBM Diskette, The. General Information Manual. GA 21-9182-2, not dated. IBM System/38. Technical Developments, IBM 1978 Jensen, K. and Niklaus Wirth: PASCAL, User Manual and Report, Springer-Verlag, New York 1975 Knuth, D. E.: The Art of Computer Programming, Addison-Wesley, 1969 Kristensen, B. B. et al.: BETA Language Proposal. DAIMI PB-98, June 1979 Kristensen, Karl Aage: Filsystemer. Speciale nr. 59 ved DIKU, 1979 Lampson, B.W.: Redundancy and Robustness in Memory Protection IFIP 74, p. 128-132, 1974 Madnick, S. E. and Alsop, J. W.: A Modular Approach to File System Design, SJCC 1969b \f Madnick, S. E.: Design Strategies for File Systems: A Working Model. Selected Papers from FILE 68, p. 117 Amsterdam 1969a Olsen, J.C.: RC3600 Catalog System, System Programmer>s Guide, RCSL No 43-GL 7915, A/S Regnecentralen, Copenhagen 1978 Ravn, A. P.: I/O in Operating Systems - Principles and Implementation (draft). DIKU 1980 Ravn, A. P.: Noter til kursus: Ind- og Udlæsning på Operativsystemniveau, DIKU 1979 Redell, D.D. et al.: Pilot: An Operating System for a Personal Computer. CACM, vol. 23, No 2, February 1980 Ritchie, D.M. et al.: The UNIX Time-Sharing System, vol. 17, No 7, July 1974 Staunstrup, Jørgen: Pascal80 Report. A/S Regnecentralen af 1979, RCSL No 52-AA 964, 1980 Staunstrup, J. og S. M. Sørensen: PLATON, A High Level Language for System Programming, RECAU, R-75-59, May 1975 Stoy, J.E. and C. Strachey: OS6 - An Experimental Operating System for a Small Computer, part 1: CACM, vol. 15, No 2, part 2: CACM, vol. 15, No 3, June 1972 Teorey, T.J. and T.B. Pinkerton: A Comparative Analysis of Disk Scheduling Policies. CACM, Vol. 15, No 3, pp. 177-184, March 1972 Wegner, P.: Programming with ADA: An Introduction by Means of Graduated Examples. SIGPLAN Notices vol. 14, No 12, December 1979 Wirth, N.: MODULA: a Language for Modular Multiprogramming, Software Practice and Experience 7, 1, pp. 3-55, 1977 Wong, C.K.: Minimizing Expected Head Movement in One-Dimensional and Two-Dimensional Mass Storage Systems. Computing Surveys, vol. 12, No 2, June 1980 \f \f «eof»