file Organisation:-
FILE ORGANISATION AND ITS TYPES
• Just as arrays, lists,
trees and other data structures are used to implement data Organisation in main
memory,.
• a number of strategies
are used to support the Organisation of data in secondary memory.
• A file organisation is a
technique to organise data in the secondary memory.
• File Organisation is a
way of arranging the records in a file when the file is stored on the disk.
• Data files are organized
so as to facilitate access to records and to ensure their efficient storage.
• . Selection of File
Organisations is dependent on two factors as shown below:
1. Typical DBMS
applications need a small subset of the DB at any given time.
2.
When a portion of the data is needed it must be located on disk,
copied to
memory for processing and rewritten to
disk if the data was modified.
• A DBMS supports several
file Organisation techniques. The important task of the DBA is to choose a good
Organisation for each file, based on its type of use.
• The particular
organisation most suitable for any application will depend upon such factors as
the kind of external storage available, types of queries allowed, number of
keys, mode of retrieval and mode of update.
The Figure1 illustrates different file organisations based on an access key.
Figure 1: File Organisation
techniques
1
Heap files (unordered file)
• Basically these files are unordered
files.
• It is the simplest and most basic
type. T
• heap files consist of randomly
ordered records.
• The records will have no particular
order.
• The operations we can perform on the
records are insert, retrieve and delete.
• The features of the heap file or the
pile file Organisation are:
1. New records can be inserted in any
empty space that can accommodate them.
2.
When
old records are deleted, the occupied space becomes empty and available for any
new insertion.
3.
If
updated records grow; they may need to be relocated (moved) to a new empty
space. This needs to keep a list of empty space.
Advantages
of heap files
1. This is a simple file Organisation method.
2. Insertion is somehow efficient.
3. Good for bulk-loading data into a table.
4. Best if file scans are common or insertions are
frequent.
Disadvantages
of heap files
1. Retrieval requires a linear search and is inefficient.
2. Deletion can result in unused space/need for
reorganisation.
2
Sequential File Organisations
• The most basic way to organise the
collection of records in a file is to use sequential Organisation.
• Records of the file are stored in
sequence by the primary key field values.
• They are accessible only in the
order stored, i.e., in the primary key order.
• This kind of file Organisation works
well for tasks which need to access nearly every record in a file, e.g.,
payroll.
• In a sequentially organised file
records are written consecutively when the file is created and must be accessed
consecutively when the file is later used for input (Figure 2).
Acno
|
Name
|
City
|
101
|
John
|
Latur
|
102
|
Raman
|
Ahmedpur
|
103
|
Sagar
|
Latur
|
Figure
2: Structure of sequential file
• A sequential file maintains the
records in the logical sequence of its primary key values.
• Sequential files are inefficient for
random access, however, are suitable for sequential access.
• A sequential file can be stored on
devices like magnetic tape that allow sequential access.
• On an average, to search a record in
a sequential file would require to look into half of the records of the file.
File Organisations in DBMS
• Updating a sequential file usually
creates a new file so that the record sequence on primary key is maintained.
• The update operation first copies
the records till the record after which update is required into the new file
and then the updated record is put followed by the remainder of records.
• Thus method of updating a sequential
file automatically creates a backup copy.
• Adding a record requires shifting of
all records from the point of insertion to the end of file to create space for
the new record.
• On the other hand deletion of a
record requires a compression of the file space.
• However, simple queries are time
consuming for large files.
Advantages
of Sequential File Organisation
• It
is fast and efficient when dealing with large volumes of data that need to be
processed periodically (batch system).
Disadvantages
of sequential File Organisation
• Requires
that all new transactions be sorted into the proper sequence for sequential
access processing.
•
Locating, storing, modifying, deleting, or adding records in the file require
rearranging the file.
• This method is too slow to handle applications
requiring immediate updating or responses.
3 Indexed (Indexed Sequential) File Organisation
• It organises the file like a large
dictionary, i.e., records are stored in order of the key but an index is kept
which also permits a type of direct access.
• The records are stored sequentially
by primary key values and there is an index built over the primary key field.
• The retrieval of a record from a
sequential file, on average, requires access to half the records in the file,
making such inquiries not only inefficient but very time consuming for large
files.
• To improve the query response time
of a sequential file, a type of indexing technique can be added.
• Thus, an index is a mechanism for
faster search.
• An index can be small enough to be
read into the main memory.
• A sequential (or sorted on primary
keys) file that is indexed on its primary key is called an index sequential
file.
• The index allows for random access
to records, while the sequential storage of the records of the file provides easy
access to the sequential records.
Fig. Indexed file organiaation
4.4.4
Hashed File Organisation
• Hashing is the most common form of
purely random access to a file or database.
• It is also used to access columns
that do not have an index as an optimisation technique.
• Hash functions calculate the address
of the page in which the record is to be stored based on one or more fields in
the record.
• The records in a hash file appear
randomly distributed across the available space.
• It requires some hashing algorithm
and the technique.
• Hashing Algorithm converts a primary
key value into a record address. The most popular form of hashing is division
hashing with chained overflow.
Advantages
of Hashed file Organisation
1. Insertion or search on hash-key is fast.
2. Best if equality search is needed on hash-key.
Disadvantages
of Hashed file Organisation
1. It is a complex file Organisation method.
2. Search is slow.
3. It suffers from disk space overhead.
4. Unbalanced buckets degrade performance.
5. Range search is slow.
Overview of
indexes:-
Basic
Concepts
·
Indexing mechanisms are used to
speed up access to desired data.
·
Search
Key - attribute to set of attributes used to look up
records in a file.
·
An index file consists of records (called index entries) of the form
·
Index files are typically much
smaller than the original file
·
Two basic kinds of indices:
o Ordered indices: search keys
are stored in sorted order
o Hash indices: search keys
are distributed uniformly across “buckets” using a “hash function”.
Index Evaluation Metrics:-
·
Access time
·
Insertion time
·
Deletion time
·
Space overhead
·
Access types supported
efficiently. E.g.,
o records with
a specified value in the attribute
o or records
with an attribute value falling in a specified range of values.
o This
strongly influences the choice of index, and depends on usage.
Ordered
Indices:-
·
In an ordered index, index
entries are stored sorted on the search key value. E.g., author catalog in library.
·
Primary index: in a sequentially
ordered file, the index whose search key specifies the sequential
order of the file.
o Also called clustering
index
o The search
key of a primary index is usually but not necessarily the primary key.
·
Secondary index: an
index whose search key specifies an order different from the sequential order
of the file. Also called
non-clustering index.
non-clustering index.
·
Index-sequential file:
ordered sequential file with a primary index.
Dense Index
Files:-
·
Dense index — Index record
appears for every search-key value in the file.
Sparse Index
Files:-
·
Sparse Index: contains index records for only some
search-key values.
·
Only applicable
when records are sequentially ordered on search-key
·
To locate a record with
search-key value K we:
·
Find index record with largest
search-key value < K
·
Search file sequentially starting
at the record to which the index record points
Multilevel
Index:-
·
If primary index does not fit in
memory, access becomes expensive.
·
Solution: treat primary index
kept on disk as a sequential file and construct a sparse index on it.
o outer index
– a sparse index of primary index
o inner index
– the primary index file
·
If even outer index is too large
to fit in main memory, yet another level of index can be created, and so on.
·
Indices at all levels must be
updated on insertion or deletion from the file.
Secondary Indices:-
·
Frequently, one wants to find all
the records whose values in a certain field (which is not the search-key of the
primary index) satisfy some condition.
o Example 1:
In the account relation stored sequentially by account number, we may
want to find all accounts in a particular branch
o Example 2:
as above, but where we want to find all accounts with a specified balance or
range of balances
·
We can have a secondary index
with an index record for each search-key value
·
Index record points to a bucket
that contains pointers to all the actual records with that particular
search-key value.
·
Secondary indices have to be
dense, since the file is not sorted by the search key.
Tree structured indexing:-
B+-Tree
Index Files
B+-tree
indices are an alternative to indexed-sequential files.
·
Disadvantage of
indexed-sequential files
o performance
degrades as file grows, since many overflow blocks get created.
o Periodic
reorganization of entire file is required.
·
Advantage of
B+-tree index files:
o automatically
reorganizes itself with small, local, changes, in the face of insertions and
deletions.
o Reorganization
of entire file is not required to maintain performance.
·
(Minor)
disadvantage of B+-trees:
o extra
insertion and deletion overhead, space overhead.
·
Advantages of B+-trees
outweigh disadvantages
o B+-trees
are used extensively
B+-Tree Node Structure
·
Typical node
o
Ki
are the search-key values
o
Pi
are pointers to children (for non-leaf nodes) or pointers to records or buckets
of records (for leaf nodes).
·
The
search-keys in a node are ordered
o
K1 < K2
< K3 < . . . < Kn–1
·
Usually
the size of a node is that of a block
·
Example of a B+-tree
B+-tree for account
file (n = 3)
B+-Tree Index
File
·
All
paths from root to leaf are of the same length
·
Each node
that is not a root or a leaf has between én/2ù and n children.
·
A leaf
node has between é(n–1)/2ù and n–1 values
·
Special
cases:
o
If the
root is not a leaf, it has at least 2 children.
o
If the
root is a leaf (that is, there are no other nodes in the tree), it can have
between 0 and (n–1) values.
Leaf Nodes in B+-Trees
Properties of a leaf node:
·
For i
= 1, 2, . . ., n–1, pointer Pi either points to a file
record with search-key value Ki, or to a bucket of pointers
to file records, each record having search-key value Ki. Only need bucket structure if search-key
does not form a primary key.
·
If Li,
Lj are leaf nodes and i < j, Li’s
search-key values are less than Lj’s search-key values
·
Pn points to next leaf node in search-key order
Non-Leaf Nodes in B+-Trees
·
Non
leaf nodes form a multi-level sparse index on the leaf nodes. For a non-leaf node with m pointers:
o
All the
search-keys in the subtree to which P1 points are less than K1
o
For 2 £ i £ n – 1, all the
search-keys in the subtree to which Pi points have values
greater than or equal to Ki–1 and less than Ki
o
All the
search-keys in the subtree to which Pn points have values
greater than or equal to Kn–1
Example of B+-tree
B+-tree for account
file (n = 5)
·
Leaf
nodes must have between 2 and 4 values
(é(n–1)/2ù and n –1, with n = 5).
(é(n–1)/2ù and n –1, with n = 5).
·
Non-leaf
nodes other than root must have between 3 and 5 children (é(n/2ù and n with n
=5).
·
Root
must have at least 2 children.
§ Observations about B+-trees
·
Since
the inter-node connections are done by pointers, “logically” close blocks need
not be “physically” close.
·
The
non-leaf levels of the B+-tree form a hierarchy of sparse indices.
·
The B+-tree
contains a relatively small number of levels
§
Level
below root has at least 2* én/2ù values
§
Next
level has at least 2* én/2ù * én/2ù values
§
.. etc.
o
If
there are K search-key values in the file, the tree height is no more
than é logén/2ù(K)ù
o
thus
searches can be conducted efficiently.
·
Insertions
and deletions to the main file can be handled efficiently, as the index can be
restructured in logarithmic time (as we shall see some details, and more in the
book).
No comments:
Post a Comment