Monday, 6 August 2012

file Organisation:-

FILE ORGANISATION AND ITS TYPES
      Just as arrays, lists, trees and other data structures are used to implement data Organisation in main memory,.
      a number of strategies are used to support the Organisation of data in secondary memory.
      A file organisation is a technique to organise data in the secondary memory.
      File Organisation is a way of arranging the records in a file when the file is stored on the disk.
      Data files are organized so as to facilitate access to records and to ensure their efficient storage.
      . Selection of File Organisations is dependent on two factors as shown below:
1.    Typical DBMS applications need a small subset of the DB at any given time.
2.    When a portion of the data is needed it must be located on disk, copied to
     memory for processing and rewritten to disk if the data was modified.
      A DBMS supports several file Organisation techniques. The important task of the DBA is to choose a good Organisation for each file, based on its type of use.
      The particular organisation most suitable for any application will depend upon such factors as the kind of external storage available, types of queries allowed, number of keys, mode of retrieval and mode of update.
The Figure1 illustrates different file organisations based on an access key.
Figure 1: File Organisation techniques
 



1 Heap files (unordered file)
      Basically these files are unordered files.
      It is the simplest and most basic type. T
      heap files consist of randomly ordered records.
      The records will have no particular order.
      The operations we can perform on the records are insert, retrieve and delete.
      The features of the heap file or the pile file Organisation are:
1.    New records can be inserted in any empty space that can accommodate them.
2.    When old records are deleted, the occupied space becomes empty and available for any new insertion.
3.    If updated records grow; they may need to be relocated (moved) to a new empty space. This needs to keep a list of empty space.
Advantages of heap files
1. This is a simple file Organisation method.
2. Insertion is somehow efficient.
3. Good for bulk-loading data into a table.
4. Best if file scans are common or insertions are frequent.
Disadvantages of heap files
1. Retrieval requires a linear search and is inefficient.
2. Deletion can result in unused space/need for reorganisation.
2 Sequential File Organisations
      The most basic way to organise the collection of records in a file is to use sequential Organisation.
      Records of the file are stored in sequence by the primary key field values.
      They are accessible only in the order stored, i.e., in the primary key order.
      This kind of file Organisation works well for tasks which need to access nearly every record in a file, e.g., payroll.
      In a sequentially organised file records are written consecutively when the file is created and must be accessed consecutively when the file is later used for input              (Figure 2).



Acno
Name
City
101
John
Latur
102
Raman
Ahmedpur
103
Sagar
Latur

Figure 2: Structure of sequential file

      A sequential file maintains the records in the logical sequence of its primary key values.
      Sequential files are inefficient for random access, however, are suitable for sequential access.
      A sequential file can be stored on devices like magnetic tape that allow sequential access.
      On an average, to search a record in a sequential file would require to look into half of the records of the file.

 File Organisations in DBMS
      Updating a sequential file usually creates a new file so that the record sequence on primary key is maintained.
      The update operation first copies the records till the record after which update is required into the new file and then the updated record is put followed by the remainder of records.
      Thus method of updating a sequential file automatically creates a backup copy.
      Adding a record requires shifting of all records from the point of insertion to the end of file to create space for the new record.
      On the other hand deletion of a record requires a compression of the file space.
      However, simple queries are time consuming for large files.

Advantages of Sequential File Organisation
It is fast and efficient when dealing with large volumes of data that need to be processed periodically (batch system).

Disadvantages of sequential File Organisation
Requires that all new transactions be sorted into the proper sequence for sequential access processing.
• Locating, storing, modifying, deleting, or adding records in the file require rearranging the file.
• This method is too slow to handle applications requiring immediate updating or responses.

3 Indexed (Indexed Sequential) File Organisation
      It organises the file like a large dictionary, i.e., records are stored in order of the key but an index is kept which also permits a type of direct access.
      The records are stored sequentially by primary key values and there is an index built over the primary key field.
      The retrieval of a record from a sequential file, on average, requires access to half the records in the file, making such inquiries not only inefficient but very time consuming for large files.
      To improve the query response time of a sequential file, a type of indexing technique can be added.
      Thus, an index is a mechanism for faster search.
      An index can be small enough to be read into the main memory.
      A sequential (or sorted on primary keys) file that is indexed on its primary key is called an index sequential file.
      The index allows for random access to records, while the sequential storage of the records of the file provides easy access to the sequential records.
Fig. Indexed file organiaation
4.4.4 Hashed File Organisation
      Hashing is the most common form of purely random access to a file or database.
      It is also used to access columns that do not have an index as an optimisation technique.
      Hash functions calculate the address of the page in which the record is to be stored based on one or more fields in the record.
      The records in a hash file appear randomly distributed across the available space.
      It requires some hashing algorithm and the technique.
      Hashing Algorithm converts a primary key value into a record address. The most popular form of hashing is division hashing with chained overflow.
Advantages of Hashed file Organisation
1. Insertion or search on hash-key is fast.
2. Best if equality search is needed on hash-key.
Disadvantages of Hashed file Organisation
1. It is a complex file Organisation method.
2. Search is slow.
3. It suffers from disk space overhead.
4. Unbalanced buckets degrade performance.
5. Range search is slow.
Overview of indexes:-
Basic Concepts
·         Indexing mechanisms are used to speed up access to desired data.
·         Search Key - attribute to set of attributes used to look up records in a file.
·         An index file consists of records (called index entries) of the form
·         Index files are typically much smaller than the original file
·         Two basic kinds of indices:
o    Ordered indices:  search keys are stored in sorted order
o    Hash indices:  search keys are distributed uniformly across “buckets” using a “hash function”.
Index Evaluation Metrics:-
·         Access time
·         Insertion time
·         Deletion time
·         Space overhead
·         Access types supported efficiently.  E.g.,
o    records with a specified value in the attribute
o    or records with an attribute value falling in a specified range of values.
o    This strongly influences the choice of index, and depends on usage.
Ordered Indices:-
·         In an ordered index, index entries are stored sorted on the search key value.  E.g., author catalog in library.
·         Primary index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file.
o    Also called clustering index
o    The search key of a primary index is usually but not necessarily the primary key.
·         Secondary index: an index whose search key specifies an order different from the sequential order of the file.  Also called
non-clustering index.
·         Index-sequential file: ordered sequential file with a primary index.
Dense Index Files:-
·         Dense index — Index record appears for every search-key value in the file.
Sparse Index Files:-
·         Sparse Index:  contains index records for only some search-key values.
·         Only applicable when records are sequentially ordered on search-key
·         To locate a record with search-key value K we:
·         Find index record with largest search-key value < K
·         Search file sequentially starting at the record to which the index record points


Multilevel Index:-
·         If primary index does not fit in memory, access becomes expensive.
·         Solution: treat primary index kept on disk as a sequential file and construct a sparse index on it.
o    outer index – a sparse index of primary index
o    inner index – the primary index file
·         If even outer index is too large to fit in main memory, yet another level of index can be created, and so on.
·         Indices at all levels must be updated on insertion or deletion from the file.
Secondary Indices:-
·         Frequently, one wants to find all the records whose values in a certain field (which is not the search-key of the primary index) satisfy some condition.
o    Example 1: In the account relation stored sequentially by account number, we may want to find all accounts in a particular branch
o    Example 2: as above, but where we want to find all accounts with a specified balance or range of balances
·         We can have a secondary index with an index record for each search-key value
·         Index record points to a bucket that contains pointers to all the actual records with that particular search-key value.
·         Secondary indices have to be dense, since the file is not sorted by the search key.

Tree structured indexing:-
B+-Tree Index Files
B+-tree indices are an alternative to indexed-sequential files.
·         Disadvantage of indexed-sequential files
o    performance degrades as file grows, since many overflow blocks get created. 
o    Periodic reorganization of entire file is required.
·         Advantage of B+-tree index files: 
o    automatically reorganizes itself with small, local, changes, in the face of insertions and deletions. 
o    Reorganization of entire file is not required to maintain performance.
·         (Minor) disadvantage of B+-trees:
o    extra insertion and deletion overhead, space overhead.
·         Advantages of B+-trees outweigh disadvantages
o    B+-trees are used extensively


B+-Tree Node Structure
·         Typical node
o    Ki are the search-key values
o    Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes).
·         The search-keys in a node are ordered
o    K1 < K2 < K3 < . . . < Kn–1
·         Usually the size of a node is that of a block
·         Example of a B+-tree

B+-tree for account file (n = 3)
B+-Tree Index File
·         All paths from root to leaf are of the same length
·         Each node that is not a root or a leaf has between én/2ù and n children.
·         A leaf node has between é(n–1)/2ù and n–1 values
·         Special cases:
o    If the root is not a leaf, it has at least 2 children.
o    If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and (n–1) values.
Leaf Nodes in B+-Trees
Properties of a leaf node:
·         For i = 1, 2, . . ., n–1, pointer Pi either points to a file record with search-key value Ki, or to a bucket of pointers to file records, each record having search-key value KiOnly need bucket structure if search-key does not form a primary key.
·         If Li, Lj are leaf nodes and i < j, Li’s search-key values are less than Lj’s search-key values
·         Pn points to next leaf node in search-key order
Non-Leaf Nodes in B+-Trees
·         Non leaf nodes form a multi-level sparse index on the leaf nodes.  For a non-leaf node with m pointers:
o    All the search-keys in the subtree to which P1 points are less than K1
o    For 2 £ i £ n – 1, all the search-keys in the subtree to which Pi points have values greater than or equal to Ki–1 and less than Ki
o    All the search-keys in the subtree to which Pn points have values greater than or equal to Kn–1
Example of B+-tree
B+-tree for account file (n = 5)
·         Leaf nodes must have between 2 and 4 values
(
é(n–1)/2ù and n –1, with n = 5).
·         Non-leaf nodes other than root must have between 3 and 5 children (é(n/2ù and n with n =5).
·         Root must have at least 2 children.
§  Observations about B+-trees
·         Since the inter-node connections are done by pointers, “logically” close blocks need not be “physically” close.
·         The non-leaf levels of the B+-tree form a hierarchy of sparse indices.
·         The B+-tree contains a relatively small number of levels
§  Level below root has at least 2* én/2ù values
§  Next level has at least 2* én/2ù * én/2ù values
§  .. etc.
o    If there are K search-key values in the file, the tree height is no more than é logén/2ù(K)ù
o    thus searches can be conducted efficiently.
·         Insertions and deletions to the main file can be handled efficiently, as the index can be restructured in logarithmic time (as we shall see some details, and more in the book).



No comments:

Post a Comment