Went Blog

OLAP 与 OLTP的区别

操作数据库系统与数据仓库的区别

由于大多数人都熟悉商用关系数据库系统，将数据仓库与之比较，就容易理解什么是数据仓库。联机操作数据库系统的主要任务是执行联机事务和查询处理。这种系统称为联机事务处理（OLTP）系统。它们涵盖了一个组织的大部分日常操作，如购买、库存、制造、银行、工资、注册、记帐等。另一方面，数据仓库系统在数据分析和决策方面为用户或“知识工人”提供服务。这种系统可以用不同的格式组织和提供数据，以便满足不同用户的形形色色需求。这种系统称为联机分析处理（OLAP）系统。

OLTP 和 OLAP 的主要区别概述如下。

用户和系统的面向性：OLTP 是面向顾客的，用于办事员、客户和信息技术专业人员的事务和查询处理。OLAP 是面向市场的，用于知识工人（包括经理、主管和分析人员）的数据分析。

数据内容：OLTP 系统管理当前数据。通常，这种数据太琐碎，难以用于决策。OLAP 系统管理大量历史数据，提供汇总和聚集机制，并在不同的粒度级别上存储和管理信息。这些特点使得数据容易用于见多识广的决策。

数据库设计：通常，OLTP 系统采用实体-联系（ER）模型和面向应用的数据库设计。而 OLAP 系统通常采用星型或雪花模型和面向主题的数据库设计。

视图：OLTP 系统主要关注一个企业或部门内部的当前数据，而不涉及历史数据或不同组织的数据。相比之下，由于组织的变化，OLAP 系统常常跨越数据库模式的多个版本。 OLAP 系统也处理来自于不同组织的信息，由多个数据存储集成的信息。由于数据量巨大，OLAP 数据也存放在多个存储介质上。

访问模式：OLTP 系统的访问主要由短的原子事务组成。这种系统需要并行控制和恢复机制。然而，对 OLAP 系统的访问大部分是只读操作（由于大部分数据仓库存放历史数据，而不是当前数据），尽管许多可能是复杂的查询。

OLTP 和 OLAP 的其它区别包括数据库大小、操作的频繁程度、性能度量等。

SQL-Transaction

SQL-Transaction Statements control transactions in database access. This subset of SQL is also called the Data Control Language for SQL (SQL DCL).

There are 2 SQL-Transaction Statements:

COMMIT Statement -- commit (make persistent) all changes for the current transaction
ROLLBACK Statement -- roll back (rescind) all changes for the current transaction

Transaction Overview

A database transaction is a larger unit that frames multiple SQL statements. A transaction ensures that the action of the framed statements is atomic with respect to recovery.

A SQL Modification Statement has limited effect. A given statement can only directly modify the contents of a single table (Referential Integrity effects may cause indirect modification of other tables.) The upshot is that operations which require modification of several tables must involve multiple modification statements. A classic example is a bank operation that transfers funds from one type of account to another, requiring updates to 2 tables. Transactions provide a way to group these multiple statements in one atomic unit.

In SQL92, there is no BEGIN TRANSACTION statement. A transaction begins with the execution of a SQL-Data statement when there is no current transaction. All subsequent SQL-Data statements until COMMIT or ROLLBACK become part of the transaction. Execution of a COMMIT Statement or ROLLBACK Statement completes the current transaction. A subsequent SQL-Data statement starts a new transaction.

In terms of direct effect on the database, it is the SQL Modification Statements that are the main consideration since they change data. The total set of changes to the database by the modification statements in a transaction are treated as an atomic unit through the actions of the transaction. The set of changes either:

Is made fully persistent in the database through the action of the COMMIT Statement, or
Has no persistent effect whatever on the database, through:

the action of the ROLLBACK Statement,
abnormal termination of the client requesting the transaction, or
abnormal termination of the transaction by the DBMS. This may be an action by the system (deadlock resolution) or by an administrative agent, or it may be an abnormal termination of the DBMS itself. In the latter case, the DBMS must roll back any active transactions during recovery.

The DBMS must ensure that the effect of a transaction is not partial. All changes in a transaction must be made persistent, or no changes from the transaction must be made persistent.

Transaction Isolation

In most cases, transactions are executed under a client connection to the DBMS. Multiple client connections can initiate transactions at the same time. This is known as concurrent transactions.

In the relational model, each transaction is completely isolated from other active transactions. After initiation, a transaction can only see changes to the database made by transactions committed prior to starting the new transaction. Changes made by concurrent transactions are not seen by SQL DML query and modification statements. This is known as full isolation or Serializable transactions.

SQL92 defines Serializable for transactions. However, fully serialized transactions can impact performance. For this reason, SQL92 allows additional isolation modes that reduce the isolation between concurrent transactions. SQL92 defines 3 other isolation modes, but support by existing DBMSs is often incomplete and doesn't always match the SQL92 modes. Check the documentation of your DBMS for more details.

Transaction isolation controls the visibility of changes between transactions in different sessions (connections). It determines if queries in one session can see changes made by a transaction in another session. There are 4 levels of transaction isolation. The level providing the greatest isolation from other transactions is Serializable.

At transaction isolation level Serializable, a transaction is fully isolated from changes made by other sessions. Queries issued under Serializable transactions cannot see any subsequent changes, committed or not, from other transactions. The effect is the same as if transactions were serial, that is, each transaction completing before another one is begun.

At the opposite end of the spectrum is Read Uncommitted. It is the lowest level of isolation. With Read Uncommitted, a session can read (query) subsequent changes made by other sessions, either committed or uncommitted. Read uncommitted transactions have the following characteristics:

Dirty Read -- a session can read rows changed by transactions in other sessions that have not been committed. If the other session then rolls back its transaction, subsequent reads of the same rows will find column values returned to previous values, deleted rows reappearing and rows inserted by the other transaction missing.
Non-repeatable Read -- a session can read a row in a transaction. Another session then changes the row (UPDATE or DELETE) and commits its transaction. If the first session subsequently re-reads the row in the same transaction, it will see the change.
Phantoms -- a session can read a set of rows in a transaction that satisfies a search condition (which might be all rows). Another session then generates a row (INSERT) that satisfies the search condition and commits its transaction. If the first session subsequently repeats the search in the same transaction, it will see the new row.

The other transaction levels -- Read Committed, Repeatable Read and Serializable, will not read uncommitted changes. Dirty reads are not possible. The next level above Read Uncommitted is Read Committed, and the next above that is Repeatable Read.

In Read Committed isolation level, Dirty Reads are not possible, but Non-repeatable Reads and Phantoms are possible. In Repeatable Read isolation level, Dirty Reads and Non-repeatable Reads are not possible but Phantoms are. In Serializable, Dirty Reads, Non-repeatable Reads, and Phantoms are not possible.

The isolation provided by each transaction isolation level is summarized below:

	Dirty Reads	Non-repeatable Reads	Phantoms
Read Uncommitted	Y	Y	Y
Read Committed	N	Y	Y
Repeatable Read	N	N	Y
Serializable	N	N	N

Note: SQL92 defines the SET TRANSACTION statement to set the transaction isolation level for a session, but most DBMSs support a function/method in the Client API as an alternative.

SQL-Schema Statements in Transactions

The 3rd type of SQL Statements - SQL-Schema Statements, may participate in the transaction mechanism. SQL-Schema statements can either be:

included in a transaction along with SQL-Data statements,
required to be in separate transactions, or
ignored by the transaction mechanism (can't be rolled back).

SQL92 leaves the choice up to the individual DBMS. It is implementation defined behavior.

COMMIT Statement

The COMMIT Statement terminates the current transaction and makes all changes under the transaction persistent. It commits the changes to the database. The COMMIT statement has the following general format:

COMMIT [WORK]

WORK is an optional keyword that does not change the semantics of COMMIT.

ROLLBACK Statement

The ROLLBACK Statement terminates the current transaction and rescinds all changes made under the transaction. It rolls back the changes to the database. The ROLLBACK statement has the following general format:

ROLLBACK [WORK]

WORK is an optional keyword that does not change the semantics of ROLLBACK.

SQL Tutorial Main Page.

Pasted from <http://www.firstsql.com/tutor5.htm>

Hash Join & Merge Join

Merge Join :

Oracle performs a join between two sets of row data using the merge

join algorithm. The inputs are two separate sets of row data. Output is

the results of the join. Oracle reads rows from both inputs in an

alternating fashion and merges together matching rows in order to

generate output. The two inputs are sorted on join column.

Hash Join :

Oracle performs a join between two sets of row data using hash join

algorithm. Input and Output same as Merge Join. Oracle reads all rows

from the second input and builds a hash structure (like has table in

java), before reading each row from the first input one at a time. For

each row from the first input, the hash structure is probed and matching

rows generate output.

Performance Tuning SQL Server Joins

One of the best ways to boost JOIN performance is to limit how many rows need to be JOINed. This is especially beneficial for the outer table in a JOIN. Only return absolutely only those rows needed to be JOINed, and no more. [6.5, 7.0, 2000, 2005] Updated 7-25-2005

*****

If you perform regular joins between two or more tables in your queries, performance will be optimized if each of the joined columns have their own indexes. This includes adding indexes to the columns in each table used to join the tables. Generally speaking, a clustered key is better than a non-clustered key for optimum JOIN performance. [6.5, 7.0, 2000, 2005] Updated 7-25-2005

*****

If you have two or more tables that are frequently joined together, then the columns used for the joins on all tables should have an appropriate index. If the columns used for the joins are not naturally compact, then considering adding surrogate keys to the tables that are compact in order to reduce the size of the keys, thus decreasing read I/O during the join process, increasing overall performance. [6.5, 7.0, 2000, 2005] Updated 7-25-2005

*****

JOIN performance has a lot to do with how many rows you can stuff in a data page. For example, let's say you want to JOIN two tables. Most likely, one of these two tables will be smaller than the other, and SQL Server will most likely select the smaller of the two tables to be the inner table of the JOIN. When this happens, SQL Server tries to put the relevant contents of this table into the buffer cache for faster performance. If there is not enough room to put all the relevant data into cache, then SQL Server will have to use additional resources in order to get data into and out of the cache as the JOIN is performed.

If all of the data can be cached, the performance of the JOIN will be faster than if it is not. This comes back to the original statement, that the number of rows in a table can affect JOIN performance. In other words, if a table has no wasted space, it is much more likely to get all of the relevant inner table data into cache, boosting speed. The moral to this story is to try to get as much data stuffed into a data page as possible. This can be done through the use of a high fillfactor, rebuilding indexes often to get rid of empty space, and to optimize datatypes and widths when creating columns in tables. [6.5, 7.0, 2000, 2005] Updated 7-25-2005

*****

Keep in mind that when you create foreign keys, an index is not automatically created at the same time. If you ever plan to join a table to the table with the foreign key, using the foreign key as the linking column, then you should consider adding an index to the foreign key column. An index on a foreign key column can substantially boost the performance of many joins. [6.5, 7.0, 2000, 2005] Updated 7-25-2005

*****

Avoid joining tables based on columns with few unique values. If columns used for joining aren’t mostly unique, then the SQL Server optimizer may not be able to use an existing index in order to speed up the join. Ideally, for best performance, joins should be done on columns that have unique indexes. [6.5, 7.0, 2000, 2005] Updated 7-25-2005

*****

For best join performance, the indexes on the columns being joined should ideally be numeric data types, not CHAR or VARCHAR, or other non-numeric data types. The overhead is lower and join performance is faster. [6.5, 7.0, 2000, 2005] Updated 7-25-2005

*****

For maximum performance when joining two or more tables, the indexes on the columns to be joined should have the same data type, and ideally, the same width.

This also means that you shouldn't mix non-Unicode and Unicode datatypes when using SQL Server 7.0 or later. (e.g. VARCHAR and NVARCHAR). If SQL Server has to implicitly convert the data types to perform the join, this not only slows the joining process, but it also could mean that SQL Server may not use available indexes, performing a table scan instead. [6.5, 7.0, 2000, 2005] Updated 7-25-2005

*****

When you create joins using Transact-SQL, you can choose between two different types of syntax: either ANSI or Microsoft. ANSI refers to the ANSI standard for writing joins, and Microsoft refers to the old Microsoft style of writing joins. For example:

ANSI JOIN Syntax

SELECT fname, lname, department

FROM names INNER JOIN departments ON names.employeeid = departments.employeeid

Former Microsoft JOIN Syntax

SELECT fname, lname, department

FROM names, departments

WHERE names.employeeid = departments.employeeid

If written correctly, either format will produce identical results. But that is a big if. The older Microsoft join syntax lends itself to mistakes because the syntax is a little less obvious. On the other hand, the ANSI syntax is very explicit and there is little chance you can make a mistake.

For example, I ran across a slow-performing query from an ERP program. After reviewing the code, which used the Microsoft JOIN syntax, I noticed that instead of creating a LEFT JOIN, the developer had accidentally created a CROSS JOIN instead. In this particular example, less than 10,000 rows should have resulted from the LEFT JOIN, but because a CROSS JOIN was used, over 11 million rows were returned instead. Then the developer used a SELECT DISTINCT to get rid of all the unnecessary rows created by the CROSS JOIN. As you can guess, this made for a very lengthy query. Unfortunately, all I could do was notify the vendor's support department about it, and they fixed their code.

The moral of this story is that you probably should be using the ANSI syntax, not the old Microsoft syntax. Besides reducing the odds of making silly mistakes, this code is more portable between database, and eventually, I imagine Microsoft will eventually stop supporting the old format, making the ANSI syntax the only option. [6.5, 7.0, 2000] Updated 11-1-2005

*****

If you have to regularly join four or more tables to get the recordset you need, consider denormalizing the tables so that the number of joined tables is reduced. Often, by adding one or two columns from one table to another, the number of joins can be reduced, boosting performance. [6.5, 7.0, 2000] Updated 11-1-2005

*****

If your join is slow, and currently includes hints, remove the hints to see if the optimizer can do a better job on the join optimization than you can. This is especially important if your application has been upgraded from version 6.5 to 7.0, or from 7.0 to 2000. [6.5, 7.0, 2000] Updated 11-1-2005

*****

One of the best ways to boost JOIN performance is to ensure that the JOINed tables include an appropriate WHERE clause to minimize the number of rows that need to be JOINed.

For example, I have seen many developers perform a simple JOIN on two tables, which is not all that unusual. The problem is that each table may contain over a million rows each. Instead of just JOINing the tables, appropriate restrictive clauses needed to be added to the WHERE clause of each table in order to reduce the total number of rows to be JOINed. This simple step can really boost the performance of a JOIN of two large tables. Updated 11-1-2005

*****

In the SELECT statement that creates your JOIN, don't use an * (asterisk) to return all of the columns in both tables. This is bad form for a couple of reasons. First, you should only return those columns you need, as the less data you return, the faster your query will run. It would be rare that you would need all of the columns in all of the tables you have joined. Second, you will be returning two of each column used in your JOIN condition, which ends up returning way more data that you need, and hurting performance.

Take a look at these two queries:

USE Northwind

SELECT *

FROM Orders

INNER JOIN [Order Details]

ON Orders.OrderID = [Order Details].OrderID

and

USE Northwind

SELECT Orders.OrderID, Orders.OrderDate,

[Order Details].UnitPrice, [Order Details].Quantity,

[Order Details].Discount

FROM Orders

INNER JOIN [Order Details]

ON Orders.OrderID = [Order Details].OrderID

Both of these queries perform essentially the same function. The problem with the first one is that it returns not only too many columns (they aren't all needed by the application), but the OrderID column is returned twice, which doesn't provide any useful benefits. Both of these problems contribute to unnecessary server overhead, hurting performance. The moral of this story is never to use the * in your joins. [6.5, 7.0, 2000] Updated 7-24-2006

*****

While high index selectivity is generally an important factor that the Query Optimizer uses to determine whether or not to use an index, there is one special case where indexes with low selectivity can be useful speeding up SQL Server. This is in the case of indexes on foreign keys. Whether an index on a foreign key has either high or low selectivity, an index on a foreign key can be used by the Query Optimizer to perform a merge join on the tables in question. A merge join occurs when a row from each table is taken and then they are compared to see if they match the specified join criteria. If the tables being joined have the appropriate indexes (no matter the selectivity), a merge join can be performed, which is often much faster than a join to a table with a foreign key that does not have an index. [7.0, 2000] Updated 7-24-2006

*****

For very large joins, consider placing the tables to be joined in separate physical files in the same filegroup. This allows SQL Server to spawn a separate thread for each file being accessed, boosting performance. [6.5, 7.0, 2000] Updated 7-24-2006

*****

Don't use CROSS JOINS, unless this is the only way to accomplish your goal. What some inexperienced developers do is to join two tables using a CROSS JOIN, and then they use either the DISTINCT or the GROUP BY clauses to "clean up" the mess they have created. This, as you might imagine, can be a huge waste of SQL Server resources. [6.5, 7.0, 2000] Updated 7-24-2006

*****

If you have the choice of using a JOIN or a subquery to perform the same task, generally the JOIN (often an OUTER JOIN) is faster. But this is not always the case. For example, if the returned data is going to be small, or if there are no indexes on the joined columns, then a subquery may indeed be faster.

The only way to really know for sure is to try both methods and then look at their query plans. If this operation is run often, you should seriously consider writing the code both ways, and then select the most efficient code. [6.5, 7.0, 2000] Updated 8-21-2006

*****

We have a query that contains two subselects containing an aggregate function (SUM, Count, etc.) in the SELECT part. The query was performing sluggishly. We were able to isolate the problem down to the aggregate function in the subselect.

To rectify the problem, we reorganized the query so that there was still an aggregate function in the SELECT part, but replaced the subselects with a series of JOINS. The query executed much faster.

So, if this holds true — developers, as a rule, should use JOINS in lieu of subselects when the subselect contains aggregate functions. [7.0, 2000] Tip provided by Silverscape Technologies, Inc (www.silverscape.net) Updated 8-21-2006

*****

If you have a query with many joins, one alternative to de-normalizing a table to boost performance is to use an Indexed View to pre-join the tables. An Indexed View, which is only available from SQL Server 2000 Enterprise Edition, allows you to create a view that is actually a physical object that has its own clustered index. Whenever a base table of the Indexed View is updated, the Indexed View is also updated. As you can imagine, this can potentially reduce INSERT, UPDATE, and DELETE performance on the base tables. You will have to perform tests, comparing the pros and cons of performance in order to determine whether or not using an Indexed View to avoid joins in query is worth the extra performance cost caused by using them. [2000] Updated 8-21-2006

*****

If you have a query that uses a LEFT OUTER JOIN, check it carefully to be sure that is the type of join you really want to use. As you may know, a LEFT OUTER JOIN is used to create a result set that includes all of the rows from the left table specified in the clause, not just the ones in which the joined columns match. In addition, when a row in the left table has no matching rows in the right table, the result set row contains NULL values for all the selected columns coming from the right table. If this is what you want, then use this type of join.

The problem is that in the real world, a LEFT OUTER JOIN is rarely needed, and many developers use them by mistake. While you may end up with the data you want, you may also end up with more than the data you want, which contributes to unnecessary overhead and poor performance. Because of this, always closely examine why you are using a LEFT OUTER JOIN in your queries, and only use them if they are exactly what you need. Otherwise, use a JOIN that is more appropriate to your needs. [6.5, 7.0, 2000] Updated 8-21-2006

*****

If you are having difficulty tuning the performance of a poorly performing query that has one or more JOINs, check to see if they query plan created by the query optimizer is using a hash join. When the query optimizer is asked to join two tables that don't have appropriate indexes, it will often perform a hash join.

A hash join is resource intensive (especially CPU and I/O) and can slow the performance of your join. If the query in question is run often, you should consider adding appropriate indexes. For example, if you are joining column1 in table1 to column5 in table2, then column1 in table1 and column5 in table2 need to have indexes.

Once indexes are added to the appropriate columns used in the joins in your query, the query optimizer will most likely be able to use these indexes, performing a nested-loop join instead of a hash join, and performance will improve. [7.0, 2000] Updated 8-21-2006

Pasted from <http://www.sql-server-performance.com/tuning_joins.asp>

Tips on Optimizing SQL Server Database Design

Bad logical database design results in bad physical database design, and generally results in poor database performance. So, if it is your responsibility to design a database from scratch, be sure you take the necessary time and effort to get the logical database design right. Once the logical design is right, then you also need to take the time to get the physical design right.
Both the logical and physical design must be right before you can expect to get good performance out of your database. If the logical design is not right before you begin the development of your application, it is too late after the application has been implemented to fix it. No amount of fast, expensive hardware can fix the poor performance caused by poor logical database design. [6.5, 7.0, 2000] Updated 6-27-2006
*****
One strong indicator of poor database design is the use of SELECT DISTINCT or UNION (not UNION ALL) in your queries. A properly designed database almost always negates the need for using either of these statements.
Besides this, using SELECT DISTINCT or UNION uses up a lot of resources, that could be better spent elsewhere speeding up your SQL Server. [7.0, 2000] Added 12-6-2005
*****
Following standard database normalization recommendations when designing OLTP databases can greatly maximize a database's performance. Here's why:
Helps to reduce the total amount of redundant data in the database. The less data there is, the less work SQL Server has to perform, speeding its performance.
Helps to reduce the use of NULLS in the database. The use of NULLs in a database can greatly reduce database performance, especially in WHERE clauses.
Helps to reduce the number of columns in tables, which means that more rows can fit on a single data page, which helps to boost SQL Server read performance.
Help to reduce the amount of Transact-SQL code that needs to be written to deal with non-normalized data. The less code there is, the less that has to run, speeding your application's performance.
Helps to maximize the use of clustered indexes, the most powerful and useful type of index available to SQL Server. The more data is separated into multiple tables because of normalization, the more clustered indexes become available to help speed up data access.
Helps to reduce the total number of indexes in your database. The less columns tables have, the less need there is for multiple indexes to retrieve it. And the fewer indexes there are, the less negative is the performance effect of INSERTs, UPDATES, and DELETES.
[6.5, 7.0, 2000] Updated 12-6-2005
*****
If normalizing your OLTP database forces you to create queries with many multiple joins (4 or more), you may want to consider denormalizing some of the tables in order to reduce the number of required joins. Denormalization is the process of selectively taking normalized tables and re-combining the data in them in order to reduce the number of joins needed them to produce the necessary query results. Sometimes the addition of a single column of redundant data to a table from another table can reduce a 4-way join into a 2-way join, significantly boosting performance by reducing the time it takes to perform the join.
While denormalization can boost join performance, it can also have negative effects. For example, by adding redundant data to tables, you risk the following problems:
More data means SQL Server has to read more data pages than otherwise needed, hurting performance.
Redundant data can lead to data anomalies and bad data.
In many cases, extra code will have to be written to keep redundant data in separate tables in synch, which adds to database overhead.
As you consider whether to denormalize a database to speed joins, be sure you first consider if you have the proper indexes on the tables to be joined. It is possible that your join performance problem is more of a problem with a lack of appropriate indexes that it is of joining too many tables.
Before you decide to denormalize a properly normalized database, be sure you thoroughly consider all of the implications and test performance both before and after you denormalize to see if your efforts have really bought you anything. [6.5, 7.0, 2000] Updated 12-6-2005
*****
When you are first designing your application, don't automatically assume you have to de-normalize your data. Your first goal in the design process should be to normalize your data. Next, you can test your design with realistic data and transactions. At this point, if you see that de-normalization will help, then by all means do so. But don't assume you need to de-normalize data until you can prove (through testing) that de-normalizing is the right thing to do. [6.5, 7.0, 2000] Updated 6-27-2006
*****
In SQL Server 6.5, when a null was stored in a fixed-length column, it is actually treated as a variable-length column. This was done to conserve space. But in SQL Server 7.0 and SQL 2000, this is no longer true. If a null is stored in a fixed-length column, then the entire column (even though it is null) still takes up the same amount of fixed space as established when the column was created. For example, a CHAR(100) field takes up 100 characters of storage, even if there is a null in the column.
This change has several implications. If you are upgrading a database from SQL Server 6.5 to 7.0 or 2000, then you may find that your database grows in size, sometimes a lot. The extra storage requirements require more data pages, which in turn reduces I/O performance. The same goes if you follow the same design style in your SQL Server 7.0 and SQL 2000 databases as you did in SQL Server 6.5.
To prevent the above "data explosion" problem, you will need to change your table design tactics. You will need to make your columns that hold many nulls variable-length instead of fixed-length, assuming the length of the column varies considerably. Of course, if the column length is not great, or if nulls to be stored are few, then using fixed-length data types will work fine. [6.5, 7.0, 2000] Updated 6-27-2006
*****
One option database designers have is to put all of their data for their application in a single large database, or to place their data in multiple smaller databases. Generally, if the data is highly similar and used the same way (e.g. the database is used only for OLTP, not OLAP), then it is better to use one large database rather than several smaller databases.
For example, let's say you have customers all over the world and you are designing an OLTP application. Except for the fact that your customers are spread throughout the world, the application works the same for all of them. Under this scenario, it would be more efficient to use one database instead of creating separate smaller databases for each country or region of the world. The biggest reason you would want one database instead of multiple databases is to take advantage of SQL Server's ability to cache data, such as the data used in lookup tables.
If you use multiple databases instead of a single database, then each database would have to cache the same lookup tables, over and over for each database. The same applies to common stored procedures used in each database. This would present an inefficient use of SQL Server's memory and could potentially reduce performance. [6.5, 7.0, 2000] Updated 6-27-2006
*****
If you are designing a database that potentially could be very large, holding millions or billions of rows, consider the option of horizontally partitioning your large tables. Horizontal partitioning divides what would typically be a single table into multiple tables, creating many smaller tables instead of a single, large table. The advantage of this is that is generally is much faster to query a single small table than a single large table.
For example, if you expect to add 10 million rows a year to a transaction table, after five years it will contain 50 million rows. In most cases, you may find that most queries (although not all) queries on the table will be for data from a single year. If this is the case, if you partition the table into a separate table for each year of transactions, then you can significantly reduce the overhead of the most common of queries.
Taking advantage of horizontal partitioning requires early planning as the various queries you develop for your application need to be designed to take advantage of your database's design. It is much easier to implement horizontal partitioning in the early design stage than it is to implement after your application has been put into production. [6.5, 7.0, 2000] Updated 6-27-2006
*****
When designing columns to store formatted data, such as Social Security numbers or phone numbers, you have two choices. You can choose to store the data with no formatting or with formatting. Each choice has its pros and cons. If you store the data with formatting, then your disk space requirements will be slightly higher than if you don't store the data with formatting. If you store the data without formatting, then you will have to add or remove formatting each time you INSERT or UPDATE a record, which adds to CPU and memory overhead on the server. In other words, the choice you make affects your server's load.
Which option you should choose in order to maximize performance is not always obvious, but many database designers choose to store formatted data because they feel the savings in CPU time and memory overhead (because formatting isn't performed on the fly) is more important than saving some space on the disk drive. You will have to closely examine your application to determine, in your particular case, which option will save your application the most overhead. [6.5, 7.0, 2000] Updated 7-24-2006
*****
If you expect your database to be accessed simultaneously by many different users, consider what would happen to your database's scalability and performance if you designed your database with just a few tables holding virtually all the data. For example, say you are designing a sales-based application that could potentially have over 10,000 new rows inserted into it every day. And also assume that as many as 1% of these rows would need to be updated or deleted daily. And also assume that as many as 100 people could be accessing the same table at the same time. Can you see the problem this could cause? As a rule of thumb, design your database to minimize potential user contention. This will both boost scalability and performance of your application. [6.5, 7.0, 2000] Updated 7-24-2006
*****
SQL Server 7.0 and 2000 data pages are 8K (8192 bytes) in size. Of the 8192 available bytes in each data page, only 8060 bytes are used to store a row. The rest is used for overhead.
So how does this information affect database design and the performance of your application? To optimize SQL Server performance, you should design your rows in such a way as to maximize the number of rows that can fit into a single data page. The more densely rows are stored in data pages, the less I/O SQL Server has to perform when reading data pages from disk, and the more rows it can squeeze into the buffer. The more rows that you can fit into the buffer increases the likelihood that the data you need is in the buffer and not on the disk, saving even more valuable I/O resources.
For example, if you have a row that was 4031 bytes long, then only one row could fit into a data page, and the remaining 4029 bytes left in the page would be empty. (Keep in mind that a row can be up to 8060 bytes, and half of this is 4030 bytes.) This is a great waste of space which can affect the I/O performance of the server.
If you run into a situation like this, try to redesign the row, if possible, so that the row is 4030 bytes or less. This way you would get two rows in each page and I/O performance would be greatly enhanced. This not only applies to this particular example. It can also apply to cases where you are trying to fit three, four, or more records efficiently into a single data page. [7.0, 2000] Updated 7-24-2006
*****
By default, TEXT, NTEXT, and IMAGE data is stored separately from the rest of the data in a table. The table itself (in the appropriate columns) contains a 16-bit pointer that points to separate data pages that contain the TEXT, NTEXT, or IMAGE data. The reason this is done is to enhance performance.
For example, if these datatypes hold large quantities of data, and they were stored in a table with the rest of the data, this would require SQL Server to work harder when accessing the table. The larger the table, the more I/O SQL Server has to perform in order to accomplish its tasks.
But if the TEXT, NTEXT, and IMAGE data is stored separately from the table, then SQL Server uses less I/O to access it, assuming that the TEXT, NTEXT, or IMAGE data is not being accessed in the current operation. On the other hand, if the TEXT, NTEXT, and IMAGE data does need to be accessed, SQL Server has to go to extra work to retrieve it because it has to locate the pointer in the table, and then find the necessary data from the data pages. Assuming the TEXT, NTEXT, and IMAGE data does not have to be accessed frequently, this is a good performance tradeoff.
But in some cases, storing these datatypes away from the table (in their own data pages) is not the most efficient method. This is especially true if the TEXT, NTEXT, or IMAGE data is not large and if the data is frequently accessed. In this case, it is often better to store this data in the table itself, instead of separate data pages. The reason for this is because it is faster for SQL Server to retrieve data directly from the table than retrieving it from separate data pages, as described above.
Starting with SQL Server 2000, you now have the option to store smaller TEXT, NTEXT, and IMAGE data directly in a table instead of in separate data pages. If you want to turn this option on, you will have to set the "text in row" table option to permit this. For example:
sp_tableoption 'tablename', 'text in row', 'on'
or
sp_tableoption 'tablename', 'text in row', 'size'
Where 'size' is a value ranging from 24 to 7,000 bytes.
Using this option in the right circumstances can greatly help your application's performance. On the other hand, if used incorrectly, it can hurt performance. [2000] Updated 7-24-2006
*****
If you have been using the "text in row" option in SQL Server 2000, but have now decided to no longer use it and to turn it off, SQL Server will automatically move any TEXT, NTEXT, and IMAGE data from regular data pages to a new area in the database where other similar data is stored, leaving only 16 byte pointers behind in the original data page. Depending on the amount of data that was stored using this "text in row" option in data pages, it may take a while for SQL Server to move all of the data. You may want to schedule such a task during a time when SQL Server is not too busy. [2000] Updated 7-24-2006
*****
Ideally, all tables should have a primary key in order to enforce entity integrity. This of course requires the use of a unique index. One method that is commonly used by database designers is to add an identity column to each of their tables in order to ensure uniqueness. In many cases, this is a good idea. But, if the rows in a table are naturally unique, then the use of an identity column is not needed. In fact, doing so can reduce the performance of your database because it may require the use of unnecessary joins, and because it uses up data pages unnecessarily, which requires SQL Server to perform more I/O to read your table.
So don't automatically add identify columns to your table. Only add them if there is no column (or columns if you want to create a composite primary key) in your table that can guarantee uniqueness. [6.5, 7.0, 2000] Updated 10-16-2006
*****
If you plan is to use replication as part of your SQL Server-based application, and your goal is high scalability and performance, then this factor should be considered when the database is first designed. The fact that you will be using replication, and the type of replication you plan to use, can directly affect how your databases should be designed. [6.5, 7.0, 2000] Updated 10-16-2006
*****
Don't reinvent the wheel when it comes to database design; in other words, take advantage of SQL Server's built-in features to enforce data integrity, instead of writing your own. These built-in integrity features use much less overhead and perform faster than the ones you can create.
Examples of the built-in SQL Server integrity features you should take advantage of include the CHECK, DEFAULT, NOT NULL, PRIMARY KEY, UNIQUE, and FOREIGN IDENTITY constraints. They also include IDENTITY and TIMESTAMP columns. You can also take advantage of the ability to create your own user-defined data types.
Not only are these built-in data integrity features faster than ones you create, they also enforce these rules completely and consistently. [6.5, 7.0, 2000] Updated 10-16-2006
*****
The Query Optimizer can use check constraints to help speed up query processing in some cases. For example, if you have a check constraint on a column of the integer data type that specifies that the lowest allowable number is 1,000, and that the highest allowable number is 10,000, the Query Optimizer can use this information to help it make better decisions when optimizing the query and selecting the correct index(es) to use.
Consider adding check constants to all columns where they are applicable. This not only helps to ensure "proper" data is entered into your tables, in can boost overall performance of some queries. [7.0, 2000] Updated 11-6-2006
*****
Generally, most SQL Server-related development is done on a "development" SQL Server box, then moved to a "testing" SQL Server box, and then is eventually moved to a "production" SQL Server box. One step you can take to help ensure that your new application will perform as expected once it goes into production is to perform regression testing and trend-analysis when moving any code (in the application or the database) from one box to another. Doing so helps to prevent unexpected performance surprises.
For example, you run a query on the development box and notice that it performs at sub-second speed. Because it is fast, you don't think that it needs performance tuning. But when you move it from the development box to the test box, you notice that the same query now takes 45 seconds. After a little research, you find out that the number of rows hit in each box in the tables is different. In one key table on the development server, you discover that is only has 5,000 rows, but the same key table on the test server has 1,600,100 rows. In addition, when you take a look at the query plan for the query in question, you discover that the query is performing a table scan on both servers. So right away, you realize that the query, which appeared to be fast on the development server, is only fast because it only has to scan 5,000 rows. Obviously, the query needs tuning (or an appropriate index added).
Problems such as the one described above can sneak up on you and hurt your application's performance. That's why it is very important to do performance regression testing and trend-analysis whenever you move code from one server to another during the development process. [6.5, 7.0, 2000] Updated 11-6-2006
*****
When you have a choice of using a constraint or a trigger to perform the same task, always choose the constraint. The same goes if you have the option of using either a constraint or a rule, or a constraint or a default. Constraints require less overhead than triggers, rules, and defaults, reducing the load on SQL Server, and boosting overall performance. [6.5, 7.0, 2000] Updated 6-27-2006
*****
Don't implement redundant integrity features in your database. For example, if you are using primary key and foreign key constraints to enforce referential integrity, don't add unnecessary overhead by also adding a trigger that performs the same function. The same goes for using both constraints and defaults or constraints and rules that perform redundant work. While this may sound obvious, it is not uncommon to find these kinds of problem in SQL Server databases. [6.5, 7.0, 2000] Updated 6-27-2006

Pasted from <http://www.sql-server-performance.com/database_design.asp>

订阅：博文 (Atom)

Went Blog

OLAP 与 OLTP的区别

SQL-Transaction

Hash Join & Merge Join

Performance Tuning SQL Server Joins

Tips on Optimizing SQL Server Database Design

博客归档