Tutorial: Migrating Data Cleanly in Python with the into Package

2020-02-23 00:26:42


Source: reposted

Submitted by: a site user

Motivation

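We spend a lot of time migrating data from common interchange formats, such as CSV, to efficient computation formats such as arrays, databases, or binary stores. Worse, many people never migrate their data to an efficient format at all, because they do not know how to (or cannot) manage the particular migration process their tools require.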

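Your choice of data format matters. It strongly affects performance (a 10x difference is a reasonable rule of thumb), and it determines who can easily use and understand your data.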

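When advocating for the Blaze project I often say, "Blaze can help you query your data in a variety of formats." That claim assumes you are able to get your data into the format you want in the first place.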

Enter the into project

The into function efficiently migrates data between formats. These formats include in-memory data structures such as:

list, set, tuple, Iterator, numpy.ndarray, pandas.DataFrame, dynd.array, and streaming sequences of any of the above.

They also include persistent data that lives outside the Python process, such as:

CSV, JSON, line-delimited JSON, and remote versions of the above

HDF5 (both standard and Pandas formatting), BColz, SAS, SQL databases (anything supported by SQLAlchemy), Mongo

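The into project migrates data efficiently between any pair of these formats by using a network of pairwise conversions (visualized toward the bottom of this post).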
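
The idea of a network of pairwise conversions can be illustrated with a small sketch. The code below is a toy and not into's actual implementation: the CONVERTERS registry and the find_path and convert functions are hypothetical names, and the four registered edges were chosen only for illustration. It treats each format as a node and each registered converter as an edge, then uses breadth-first search to find the shortest chain between two formats.

```python
from collections import deque

# Hypothetical registry: (source type, target type) -> converter function.
# These edges are illustrative only; they are not the conversions into registers.
CONVERTERS = {
    (tuple, list): list,
    (list, set): set,
    (list, tuple): tuple,
    (set, frozenset): frozenset,
}


def find_path(source_type, target_type):
    """Breadth-first search for the shortest chain of registered conversions."""
    queue = deque([(source_type, [])])
    seen = {source_type}
    while queue:
        current, path = queue.popleft()
        if current is target_type:
            return path
        for (src, dst), func in CONVERTERS.items():
            if src is current and dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [func]))
    raise TypeError(
        f"no conversion path from {source_type.__name__} to {target_type.__name__}"
    )


def convert(target_type, data):
    """Convert data to target_type by chaining pairwise conversions."""
    for step in find_path(type(data), target_type):
        data = step(data)
    return data


# No direct tuple -> frozenset converter is registered, so the data
# travels tuple -> list -> set -> frozenset.
result = convert(frozenset, (1, 2, 2, 3))
```

With only four pairwise converters registered, a tuple can still reach frozenset in three hops. Adding one new edge makes a new format reachable from every format already connected to it, which is why a network scales better than writing a dedicated converter for every pair.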

How to use it

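The into function takes two arguments, a source and a target, passed in the order into(target, source). It moves the data in source to target. The source and target can each take the following forms: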

Target    Source    Example
Object    Object    A particular DataFrame or list
String    String    'file.csv', 'postgresql://hostname::tablename'
Type                Like list or pd.DataFrame

So the following are all valid calls to into:

>>> into(list, df)  # create new list from Pandas DataFrame
>>> into([], df)  # append onto existing list
>>> into('myfile.json', df)  # Dump dataframe to line-delimited JSON
>>> into(Iterator, 'myfiles.*.csv')  # Stream through many CSV files
>>> into('postgresql://hostname::tablename', df)  # Migrate dataframe to Postgres
>>> into('postgresql://hostname::tablename', 'myfile.*.csv')  # Load CSVs to Postgres
>>> into('myfile.json', 'postgresql://hostname::tablename')  # Dump Postgres to JSON
>>> into(pd.DataFrame, 'mongodb://hostname/db::collection')  # Dump Mongo to DataFrame
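
To make the three target forms (type, object, string) concrete, here is a minimal sketch using only the standard library. It is an assumption-laden toy, not the into package: toy_into is a hypothetical function, it only understands lists of rows and local CSV paths, and it knows nothing about the URI strings shown above.

```python
import csv
import os
import tempfile


def toy_into(target, source):
    """Hypothetical miniature of the calling convention toy_into(target, source)."""
    # A string source is treated as a path to a CSV file and read into rows.
    if isinstance(source, str):
        with open(source, newline="") as f:
            rows = [tuple(row) for row in csv.reader(f)]
    else:
        rows = [tuple(row) for row in source]

    if isinstance(target, type):   # a type: build a new container
        return target(rows)
    if isinstance(target, str):    # a string: write the rows to that CSV path
        with open(target, "w", newline="") as f:
            csv.writer(f).writerows(rows)
        return target
    if isinstance(target, list):   # an object: append onto the existing list
        target.extend(rows)
        return target
    raise TypeError(f"unsupported target: {target!r}")


# Values are kept as strings because CSV carries no type information.
data = [("Alice", "100"), ("Bob", "200")]
path = os.path.join(tempfile.mkdtemp(), "accounts.csv")

toy_into(path, data)             # list -> CSV file
as_list = toy_into(list, path)   # CSV file -> new list
existing = [("Zed", "0")]
toy_into(existing, path)         # CSV file -> appended onto an existing list
```

The point of the sketch is the dispatch: one entry point inspects whether the target is a type, an existing object, or a string, and chooses "create", "append", or "write" accordingly.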

Note that into is a single function. We are used to doing this kind of work with assorted to_csv and from_sql methods spread across various types, whereas the into API is very small. Here is what you need in order to get started:
