Amazon's Aurora database engine supports an architecture with a single copy of storage and one writer plus multiple readers. The architecture resembles Oracle RAC in that storage is shared, but only one instance can perform writes while the other instances are read-only. Compared with a traditional replication-based one-primary, many-readers setup, it saves storage and network bandwidth.
We can use PostgreSQL's hot standby mode to simulate this shared-storage, one-writer-many-readers architecture, but a few points need attention. A hot standby also writes to the database: during recovery it modifies the control file, data files and so on, and in this setup those writes are redundant. In addition, a lot of state is kept in memory, so the in-memory state has to be kept up to date as well.
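To see this write activity, one can watch the control file while the standby replays WAL. A minimal sketch, assuming the standby's data directory is /data01/pg_root and the PostgreSQL binaries are on PATH (both are assumptions, not from the original setup):

# Observe the control file changing while a hot standby replays WAL
PGDATA=/data01/pg_root                      # assumed standby data directory

md5sum $PGDATA/global/pg_control            # checksum before
pg_controldata $PGDATA | grep -E 'cluster state|Minimum recovery'

sleep 60                                    # let the standby replay some WAL

# if new WAL was replayed, the checksum and the minimum recovery ending
# location will both have moved, i.e. recovery itself wrote to shared files
md5sum $PGDATA/global/pg_control
pg_controldata $PGDATA | grep -E 'cluster state|Minimum recovery'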
Also note the following: pg_xlog, pg_log, pg_clog, pg_multixact, postgresql.conf, recovery.conf, postmaster.pid — these files and directories should exist as one copy per instance.

To ultimately implement a one-primary, multi-standby architecture, the PostgreSQL kernel needs to be modified so that:

1. postgresql.conf, recovery.conf, postmaster.pid and pg_control are kept per instance;
2. the hot standby does not perform actual recovery, but still refreshes its in-memory state (the current OID, XID and so on) and updates its own pg_control;
3. across the instances, OS dirty pages and database shared buffer dirty pages are synchronized from the primary to the standby nodes.

The simulation process
Without changing any code, starting multiple instances on the same host for testing runs into a few problems. (The problems, and the code changes that fix them, are described further below.)
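One of those problems shows up immediately: a second postmaster refuses to start against the same data directory because of the existing lock file. A sketch (the exact message wording is approximate):

# a postmaster is already running on port 1921 out of this $PGDATA
pg_ctl start -o "-c log_directory=pg_log1922 -c port=1922"
# fails roughly with:
#   FATAL:  lock file "postmaster.pid" already exists
#   HINT:  Is another postmaster (PID ...) running in data directory "..."?
# hence the demo below removes postmaster.pid before starting the read-only instance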
Primary instance configuration files:
# vi postgresql.conf

listen_addresses='0.0.0.0'
port=1921
max_connections=100
unix_socket_directories='.'
ssl=on
ssl_ciphers='EXPORT40'
shared_buffers=512MB
huge_pages=try
max_prepared_transactions=0
max_stack_depth=100kB
dynamic_shared_memory_type=posix
max_files_per_process=500
wal_level=logical
fsync=off
synchronous_commit=off
wal_sync_method=open_datasync
full_page_writes=off
wal_log_hints=off
wal_buffers=16MB
wal_writer_delay=10ms
checkpoint_segments=8
archive_mode=off
archive_command='/bin/date'
max_wal_senders=10
max_replication_slots=10
hot_standby=on
wal_receiver_status_interval=1s
hot_standby_feedback=on
enable_bitmapscan=on
enable_hashagg=on
enable_hashjoin=on
enable_indexscan=on
enable_material=on
enable_mergejoin=on
enable_nestloop=on
enable_seqscan=on
enable_sort=on
enable_tidscan=on
log_destination='csvlog'
logging_collector=on
log_directory='pg_log'
log_truncate_on_rotation=on
log_rotation_size=10MB
log_checkpoints=on
log_connections=on
log_disconnections=on
log_duration=off
log_error_verbosity=verbose
log_line_prefix='%i'
log_statement='none'
log_timezone='PRC'
autovacuum=on
log_autovacuum_min_duration=0
autovacuum_vacuum_scale_factor=0.0002
autovacuum_analyze_scale_factor=0.0001
datestyle='iso, mdy'
timezone='PRC'
lc_messages='C'
lc_monetary='C'
lc_numeric='C'
lc_time='C'
default_text_search_config='pg_catalog.english'

# vi recovery.done

recovery_target_timeline='latest'
standby_mode=on
primary_conninfo = 'host=127.0.0.1 port=1921 user=postgres keepalives_idle=60'

# vi pg_hba.conf

local replication postgres trust
host replication postgres 127.0.0.1/32 trust

Start the primary instance.
postgres@digoal-> pg_ctl start

Start the read-only instance. postmaster.pid must be deleted first; note that recent PostgreSQL versions added a patch that automatically shuts the server down if this file is deleted, so either avoid the newest PostgreSQL or remove that patch first.
postgres@digoal-> cd $PGDATA
postgres@digoal-> mv recovery.done recovery.conf
postgres@digoal-> rm -f postmaster.pid
postgres@digoal-> pg_ctl start -o "-c log_directory=pg_log1922 -c port=1922"

Check the current control file state: the read-only instance has modified the control file, consistent with the description above.
postgres@digoal-> pg_controldata |grep state
Database cluster state: in archive recovery

Connect to the primary instance, create a table and insert test data.
psql -p 1921
postgres=# create table test1(id int);
CREATE TABLE
postgres=# insert into test1 select generate_series(1,10);
INSERT 0 10

Query the inserted data on the read-only instance.
postgres@digoal-> psql -h 127.0.0.1 -p 1922
postgres=# select * from test1;
 id
----
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
(10 rows)

After the primary instance performs a checkpoint, the control file state changes back to the production state.
psql -p 1921
postgres=# checkpoint;
CHECKPOINT
postgres@digoal-> pg_controldata |grep state
Database cluster state: in production

But if a checkpoint is then executed on the read-only instance, the state changes back to the recovery state.
postgres@digoal-> psql -h 127.0.0.1 -p 1922
psql (9.4.4)
postgres=# checkpoint;
CHECKPOINT
postgres@digoal-> pg_controldata |grep state
Database cluster state: in archive recovery

Note that the example above has a problem: with streaming replication, XLOG records are copied from the primary over the network and written back at the corresponding offsets of XLOG files that have already been written. This is a problem because it can make the data seen on the primary inconsistent (for example, if a block was changed several times and the read-only instance's recovery overwrites it with an older version, the primary will then see the old version of that block; this is addressed later by forbidding the read-only instance from performing recovery).
On the other hand, we know that a PostgreSQL standby can read XLOG for recovery from three places (the stream, pg_xlog, and restore_command), so in a shared-storage environment there is no need to use streaming replication at all; reading directly from the pg_xlog directory is enough. Edit recovery.conf and comment out the following line:
# primary_conninfo = 'host=127.0.0.1 port=1921 user=postgres keepalives_idle=60'

Restart the read-only instance.
pg_ctl stop -m fast
postgres@digoal-> pg_ctl start -o "-c log_directory=pg_log1922 -c port=1922"

Re-test data consistency. On the primary instance:
postgres=# insert into test1 select generate_series(1,10);
INSERT 0 10
postgres=# insert into test1 select generate_series(1,10);
INSERT 0 10
postgres=# insert into test1 select generate_series(1,10);
INSERT 0 10
postgres=# insert into test1 select generate_series(1,10);
INSERT 0 10

On the read-only instance:
postgres=# select count(*) from test1;
 count
-------
    60
(1 row)

Problem analysis and resolution
So far, several problems remain unsolved:
1. The standby still has to perform recovery, and the writes generated by recovery grow with the number of read-only instances. Recovery does bring one benefit: it solves the dirty-page problem, so dirty pages in the primary's shared buffers do not need to be shipped to the read-only instances separately. But recovery also introduces a serious bug: replay may touch the same data page the primary is operating on at that moment, or replay may put a block back into an older state even though the primary has since updated it, leaving the block inconsistent. If the read-only instance is shut down at that point and the primary is shut down immediately afterwards, the block will be inconsistent when the database comes back up.
2. The standby still modifies the control file.
3. To start an instance under the same $PGDATA, postmaster.pid has to be deleted first.
4. To shut down an instance whose postmaster.pid has already been deleted, you have to find the pid of its postgres main process and send it signal 15, 2 or 3 (see the sketch just below, and pg_ctl's set_mode code after it for the mode-to-signal mapping).
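For example, the port-1922 instance could be stopped like this; only a sketch, and both the pgrep pattern and the choice of SIGINT (fast shutdown) are assumptions:

# postmaster.pid is gone, so locate the read-only instance's postmaster by its command line
PID=$(pgrep -f "postgres.*port=1922" | head -n 1)

# SIGTERM(15)=smart, SIGINT(2)=fast, SIGQUIT(3)=immediate, per pg_ctl's set_mode below
kill -s SIGINT "$PID"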
static void
set_mode(char *modeopt)
{
    if (strcmp(modeopt, "s") == 0 || strcmp(modeopt, "smart") == 0)
    {
        shutdown_mode = SMART_MODE;
        sig = SIGTERM;
    }
    else if (strcmp(modeopt, "f") == 0 || strcmp(modeopt, "fast") == 0)
    {
        shutdown_mode = FAST_MODE;
        sig = SIGINT;
    }
    else if (strcmp(modeopt, "i") == 0 || strcmp(modeopt, "immediate") == 0)
    {
        shutdown_mode = IMMEDIATE_MODE;
        sig = SIGQUIT;
    }
    else
    {
        write_stderr(_("%s: unrecognized shutdown mode \"%s\"\n"), progname, modeopt);
        do_advice();
        exit(1);
    }
}

When the primary removes a rel page, the read-only instance's replay reports an error that the rel page referenced by the xlog record does not exist. This too is a consequence of the read-only instance having to replay the log. It is very easy to reproduce: just drop a table.
2015-10-09 13:30:50.776 CST,,,2082,,561750ab.822,20,,2015-10-09 13:29:15 CST,1/0,0,WARNING,01000,"page 8 of relation base/151898/185251 does not exist",,,,,"xlog redo clean: rel 1663/151898/185251; blk 8 remxid 640632117",,,"report_invalid_page, xlogutils.c:67",""
2015-10-09 13:30:50.776 CST,,,2082,,561750ab.822,21,,2015-10-09 13:29:15 CST,1/0,0,PANIC,XX000,"WAL contains references to invalid pages",,,,,"xlog redo clean: rel 1663/151898/185251; blk 8 remxid 640632117",,,"log_invalid_page, xlogutils.c:91",""

This error can be bypassed for now by commenting out the block below, so that the demonstration can continue.
src/backend/access/transam/xlogutils.c

/* Log a reference to an invalid page */
static void
log_invalid_page(RelFileNode node, ForkNumber forkno, BlockNumber blkno,
                 bool present)
{
    //////

    /*
     * Once recovery has reached a consistent state, the invalid-page table
     * should be empty and remain so. If a reference to an invalid page is
     * found after consistency is reached, PANIC immediately. This might seem
     * aggressive, but it's better than letting the invalid reference linger
     * in the hash table until the end of recovery and PANIC there, which
     * might come only much later if this is a standby server.
     */
    //if (reachedConsistency)
    //{
    //    report_invalid_page(WARNING, node, forkno, blkno, present);
    //    elog(PANIC, "WAL contains references to invalid pages");
    //}

Because this example runs inside a single operating system, it does not hit the OS dirty page cache problem. In an environment with separate hosts, we would have to solve synchronization of the OS dirty page cache, or eliminate the dirty page cache altogether, for example with direct IO, or use a cluster file system such as GFS2. To productize this, at least the problems above must be solved.
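As a very rough illustration of eliminating (rather than synchronizing) the OS page cache on a reader host, one could drop the cache by hand. This is only a sketch of a blunt workaround, not the direct IO or cluster file system approach mentioned above; /proc/sys/vm/drop_caches is the standard Linux interface, but when and whether to run it would need real design:

# flush any dirty pages, then drop the clean OS page cache on the read-only host
sync
echo 1 > /proc/sys/vm/drop_caches     # 1 = page cache only; a crude stand-in for direct IO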
First, deal with the Aurora instance writing data files, writing the control file, and performing checkpoints.
Add a startup parameter that indicates whether this instance is an Aurora instance (i.e. a read-only instance).
# vi src/backend/utils/misc/guc.c

/******** option records follow ********/

static struct config_bool ConfigureNamesBool[] =
{
    {
        {"aurora", PGC_POSTMASTER, CONN_AUTH_SETTINGS,
            gettext_noop("Enables advertising the server via Bonjour."),
            NULL
        },
        &aurora,
        false,
        NULL, NULL, NULL
    },

Declare the new variable:
# vi src/include/postmaster/postmaster.h

extern bool aurora;

Prevent the Aurora instance from updating the control file:
# vi src/backend/access/transam/xlog.c

#include "postmaster/postmaster.h"

bool aurora;

void
UpdateControlFile(void)
{
    if (aurora)
        return;

Disable the bgwriter's work on the Aurora instance:
# vi src/backend/postmaster/bgwriter.c

#include "postmaster/postmaster.h"

bool aurora;

/*
 * Main entry point for bgwriter process
 *
 * This is invoked from AuxiliaryProcessMain, which has already created the
 * basic execution environment, but not enabled signals yet.
 */
void
BackgroundWriterMain(void)
{
    //////
    pg_usleep(1000000L);

    /*
     * If an exception is encountered, processing resumes here.
     *
     * See notes in postgres.c about the design of this coding.
     */
    if (!aurora && sigsetjmp(local_sigjmp_buf, 1) != 0)
    {
        //////

        /*
         * Do one cycle of dirty-buffer writing.
         */
        if (!aurora)
        {
            can_hibernate = BgBufferSync();
            //////
        }
        pg_usleep(1000000L);
    }
}

Disable the checkpointer's work on the Aurora instance:
# vi src/backend/postmaster/checkpointer.c

#include "postmaster/postmaster.h"

bool aurora;

//////

/*
 * Main entry point for checkpointer process
 *
 * This is invoked from AuxiliaryProcessMain, which has already created the
 * basic execution environment, but not enabled signals yet.
 */
void
CheckpointerMain(void)
{
    //////

    /*
     * Loop forever
     */
    for (;;)
    {
        bool        do_checkpoint = false;
        int         flags = 0;
        pg_time_t   now;
        int         elapsed_secs;
        int         cur_timeout;
        int         rc;

        pg_usleep(100000L);

        /* Clear any already-pending wakeups */
        if (!aurora)
            ResetLatch(&MyProc->procLatch);

        /*
         * Process any requests or signals received recently.
         */
        if (!aurora)
            AbsorbFsyncRequests();

        if (!aurora && got_SIGHUP)
        {
            got_SIGHUP = false;
            ProcessConfigFile(PGC_SIGHUP);

            /*
             * Checkpointer is the last process to shut down, so we ask it to
             * hold the keys for a range of other tasks required most of which
             * have nothing to do with checkpointing at all.
             *
             * For various reasons, some config values can change dynamically
             * so the primary copy of them is held in shared memory to make
             * sure all backends see the same value. We make Checkpointer
             * responsible for updating the shared memory copy if the
             * parameter setting changes because of SIGHUP.
             */
            UpdateSharedMemoryConfig();
        }
        if (!aurora && checkpoint_requested)
        {
            checkpoint_requested = false;
            do_checkpoint = true;
            BgWriterStats.m_requested_checkpoints++;
        }
        if (!aurora && shutdown_requested)
        {
            /*
             * From here on, elog(ERROR) should end with exit(1), not send
             * control back to the sigsetjmp block above
             */
            ExitOnAnyError = true;
            /* Close down the database */
            ShutdownXLOG(0, 0);
            /* Normal exit from the checkpointer is here */
            proc_exit(0);       /* done */
        }

        /*
         * Force a checkpoint if too much time has elapsed since the last one.
         * Note that we count a timed checkpoint in stats only when this
         * occurs without an external request, but we set the CAUSE_TIME flag
         * bit even if there is also an external request.
         */
        now = (pg_time_t) time(NULL);
        elapsed_secs = now - last_checkpoint_time;
        if (!aurora && elapsed_secs >= CheckPointTimeout)
        {
            if (!do_checkpoint)
                BgWriterStats.m_timed_checkpoints++;
            do_checkpoint = true;
            flags |= CHECKPOINT_CAUSE_TIME;
        }

        /*
         * Do a checkpoint if requested.
         */
        if (!aurora && do_checkpoint)
        {
            bool        ckpt_performed = false;
            bool        do_restartpoint;

            /* use volatile pointer to prevent code rearrangement */
            volatile CheckpointerShmemStruct *cps = CheckpointerShmem;

            /*
             * Check if we should perform a checkpoint or a restartpoint. As a
             * side-effect, RecoveryInProgress() initializes TimeLineID if
             * it's not set yet.
             */
            do_restartpoint = RecoveryInProgress();

            /*
             * Atomically fetch the request flags to figure out what kind of a
             * checkpoint we should perform, and increase the started-counter
             * to acknowledge that we've started a new checkpoint.
             */
            SpinLockAcquire(&cps->ckpt_lck);
            flags |= cps->ckpt_flags;
            cps->ckpt_flags = 0;
            cps->ckpt_started++;
            SpinLockRelease(&cps->ckpt_lck);

            /*
             * The end-of-recovery checkpoint is a real checkpoint that's
             * performed while we're still in recovery.
             */
            if (flags & CHECKPOINT_END_OF_RECOVERY)
                do_restartpoint = false;

            //////

            ckpt_active = false;
        }

        /* Check for archive_timeout and switch xlog files if necessary. */
        if (!aurora)
            CheckArchiveTimeout();

        /*
         * Send off activity statistics to the stats collector. (The reason
         * why we re-use bgwriter-related code for this is that the bgwriter
         * and checkpointer used to be just one process. It's probably not
         * worth the trouble to split the stats support into two independent
         * stats message types.)
         */
        if (!aurora)
            pgstat_send_bgwriter();

        /*
         * Sleep until we are signaled or it's time for another checkpoint or
         * xlog file switch.
         */
        now = (pg_time_t) time(NULL);
        elapsed_secs = now - last_checkpoint_time;
        if (elapsed_secs >= CheckPointTimeout)
            continue;           /* no sleep for us ... */
        cur_timeout = CheckPointTimeout - elapsed_secs;

        if (!aurora && XLogArchiveTimeout > 0 && !RecoveryInProgress())
        {
            elapsed_secs = now - last_xlog_switch_time;
            if (elapsed_secs >= XLogArchiveTimeout)
                continue;       /* no sleep for us ... */
            cur_timeout = Min(cur_timeout, XLogArchiveTimeout - elapsed_secs);
        }

        if (!aurora)
            rc = WaitLatch(&MyProc->procLatch,
                           WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                           cur_timeout * 1000L /* convert to ms */ );

        /*
         * Emergency bailout if postmaster has died. This is to avoid the
         * necessity for manual cleanup of all postmaster children.
         */
        if (rc & WL_POSTMASTER_DEATH)
            exit(1);
    }
}

//////

/* SIGINT: set flag to run a normal checkpoint right away */
static void
ReqCheckpointHandler(SIGNAL_ARGS)
{
    if (aurora)
        return;

    int         save_errno = errno;

    checkpoint_requested = true;
    if (MyProc)
        SetLatch(&MyProc->procLatch);
    errno = save_errno;
}

//////

/*
 * AbsorbFsyncRequests
 *      Retrieve queued fsync requests and pass them to local smgr.
 *
 * This is exported because it must be called during CreateCheckPoint;
 * we have to be sure we have accepted all pending requests just before
 * we start fsync'ing. Since CreateCheckPoint sometimes runs in
 * non-checkpointer processes, do nothing if not checkpointer.
 */
void
AbsorbFsyncRequests(void)
{
    CheckpointerRequest *requests = NULL;
    CheckpointerRequest *request;
    int         n;

    if (!AmCheckpointerProcess() || aurora)
        return;

    //////

Prevent the Aurora instance from manually running the CHECKPOINT command:
# vi src/backend/tcop/utility.c

#include "postmaster/postmaster.h"

bool aurora;

//////

void
standard_ProcessUtility(Node *parsetree,
                        const char *queryString,
                        ProcessUtilityContext context,
                        ParamListInfo params,
                        DestReceiver *dest,
                        char *completionTag)
{
    //////
        case T_CheckPointStmt:
            if (!superuser() || aurora)
                ereport(ERROR,
                        (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                         errmsg("must be superuser to do CHECKPOINT")));

After making the changes above, recompile; this is now close to a working demo. The Aurora instance no longer updates the control file, writes data files, or performs checkpoints, which is what we want. When starting the read-only instance, add the parameter aurora=true to indicate that it is an Aurora instance.
pg_ctl start -o "-c log_directory=pg_log1922 -c port=1922 -c aurora=true"

To turn this into a product, though, many more details would have to be worked out; this is only a demo. Good luck to the Alibaba Cloud RDS team!
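A quick sanity check of the demo behavior, assuming the ports used above and the error text produced by the patched utility.c:

# CHECKPOINT on the Aurora instance should now be rejected by the patched utility.c
psql -h 127.0.0.1 -p 1922 -c "checkpoint;"
# expected: ERROR:  must be superuser to do CHECKPOINT

# after a checkpoint on the primary, the shared control file should stay "in production"
psql -p 1921 -c "checkpoint;"
pg_controldata $PGDATA | grep state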
There is also a safer way to play this: a shared-storage multi-reader architecture that keeps two copies of the data. One copy is the primary instance's own storage, which it manages entirely by itself and which no other instance touches; the other copy belongs to a standby, and that copy serves as the shared storage used by the multiple read-only instances.
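Sketched with shell commands, just to make the layout concrete; the paths, host name, and the use of streaming replication to maintain the second copy are all assumptions:

# copy 1: the primary's private storage; only the primary ever writes here
pg_ctl -D /data01/pg_root_primary start

# copy 2: a standby keeps a second copy up to date on the shared volume
pg_basebackup -h primary_host -D /shared/pg_root_standby -X stream -P
pg_ctl -D /shared/pg_root_standby start        # the one instance applying WAL

# additional read-only ("Aurora") instances open the same shared copy, as in the demo above
pg_ctl -D /shared/pg_root_standby start -o "-c port=1923 -c aurora=true"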
References
https://aws.amazon.com/cn/rds/aurora/

src/backend/access/transam/xlog.c
/*
 * Open the WAL segment containing WAL position 'RecPtr'.
 *
 * The segment can be fetched via restore_command, or via walreceiver having
 * streamed the record, or it can already be present in pg_xlog. Checking
 * pg_xlog is mainly for crash recovery, but it will be polled in standby mode
 * too, in case someone copies a new segment directly to pg_xlog. That is not
 * documented or recommended, though.
 *
 * If 'fetching_ckpt' is true, we're fetching a checkpoint record, and should
 * prepare to read WAL starting from RedoStartLSN after this.
 *
 * 'RecPtr' might not point to the beginning of the record we're interested
 * in, it might also point to the page or segment header. In that case,
 * 'tliRecPtr' is the position of the WAL record we're interested in. It is
 * used to decide which timeline to stream the requested WAL from.
 *
 * If the record is not immediately available, the function returns false
 * if we're not in standby mode. In standby mode, waits for it to become
 * available.
 *
 * When the requested record becomes available, the function opens the file
 * containing it (if not open already), and returns true. When end of standby
 * mode is triggered by the user, and there is no more WAL available, returns
 * false.
 */
static bool
WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
                            bool fetching_ckpt, XLogRecPtr tliRecPtr)
{
    //////
    static pg_time_t last_fail_time = 0;
    pg_time_t   now;

    /*-------
     * Standby mode is implemented by a state machine:
     *
     * 1. Read from either archive or pg_xlog (XLOG_FROM_ARCHIVE), or just
     *    pg_xlog (XLOG_FROM_XLOG)
     * 2. Check trigger file
     * 3. Read from primary server via walreceiver (XLOG_FROM_STREAM)
     * 4. Rescan timelines
     * 5. Sleep 5 seconds, and loop back to 1.
     *
     * Failure to read from the current source advances the state machine to
     * the next state.
     *
     * 'currentSource' indicates the current state. There are no currentSource
     * values for "check trigger", "rescan timelines", and "sleep" states,
     * those actions are taken when reading from the previous source fails, as
     * part of advancing to the next state.
     *-------
     */