!18191 Dataset operation API docs modification

Merge pull request !18191 from lizhenglong1992/code_docs_api
This commit is contained in:
i-robot 2021-06-11 14:33:47 +08:00 committed by Gitee
commit c3507a5017
2 changed files with 146 additions and 90 deletions


@ -222,7 +222,7 @@ class Dataset:
def create_ir_tree(self):
"""
Internal method to create an IR tree.
Internal method to build an IR tree.
Returns:
DatasetNode, the root node of the IR tree.
@ -240,7 +240,8 @@ class Dataset:
def close_pool(self):
"""
Close multiprocessing pool in dataset.
Close the multiprocessing pool in the dataset. If you are familiar with the multiprocessing library, you can regard this
as a destructor for the processing pool object.
"""
if hasattr(self, 'process_pool') and self.process_pool is not None:
self.process_pool.close()
@ -346,10 +347,10 @@ class Dataset:
they are full.
A length function is called on each row in the dataset. The row is then
bucketed based on its length and bucket_boundaries. When a bucket reaches its
bucketed based on its length and bucket boundaries. When a bucket reaches its
corresponding size specified in bucket_batch_sizes, the entire bucket will be
padded according to batch_info, and then batched. Each batch will be full,
except for maybe the last batch for each bucket.
padded according to batch_info, and then formed into a batch.
Each batch will be full, except for one special case: the last batch for each bucket may not be full.
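To make the bucketing behaviour above concrete, a minimal sketch follows. The generator, the "text" column name, and the pad shape [None] (assumed here to mean "pad to the longest row in the batch") are illustrative assumptions, not part of this change.

>>> import numpy as np
>>> import mindspore.dataset as ds
>>> def gen():
...     # variable-length rows with lengths 2, 4, 5 and 7
...     for n in (2, 4, 5, 7):
...         yield (np.arange(n, dtype=np.int32),)
>>> dataset = ds.GeneratorDataset(gen, column_names=["text"])
>>> # lengths < 4 fall into bucket 0, [4, 6) into bucket 1, >= 6 into bucket 2;
>>> # each bucket is padded with 0 and emitted once it holds 2 rows (or at the end of data)
>>> dataset = dataset.bucket_batch_by_length(column_names=["text"],
...                                          bucket_boundaries=[4, 6],
...                                          bucket_batch_sizes=[2, 2, 2],
...                                          pad_info={"text": ([None], 0)})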
Args:
column_names (list[str]): Columns passed to element_length_function.
@ -357,14 +358,14 @@ class Dataset:
of the buckets. Must be strictly increasing. If there are n boundaries,
n+1 buckets are created: One bucket for [0, bucket_boundaries[0]), one
bucket for [bucket_boundaries[i], bucket_boundaries[i+1]) for each
0<i<n, and one bucket for [bucket_boundaries[n-1], inf).
0<=i<n-1, and the last bucket for [bucket_boundaries[n-1], inf).
bucket_batch_sizes (list[int]): A list consisting of the batch sizes for
each bucket. Must contain len(bucket_boundaries)+1 elements.
element_length_function (Callable, optional): A function that takes in
len(column_names) arguments and returns an int. If no value is
provided, then len(column_names) must be 1, and the size of the first
M arguments, where M = len(column_names), and returns an integer. If no value is
provided, then M (i.e., len(column_names)) must be 1, and the size of the first
dimension of that column will be taken as the length (default=None).
pad_info (dict, optional): Represents how to batch each column. The key
pad_info (dict, optional): The information about how to pad each column. The key
corresponds to the column name, and the value must be a tuple of 2 elements.
The first element corresponds to the shape to pad to, and the second
element corresponds to the value to pad with. If a column is not
@ -421,16 +422,17 @@ class Dataset:
Note:
The order of using repeat and batch reflects the number of batches and per_batch_map.
It is recommended that the repeat operation be used after the batch operation.
It is recommended that the repeat operation be applied after the batch operation.
Args:
batch_size (int or function): The number of rows each batch is created with. An
int or callable which takes exactly 1 parameter, BatchInfo.
drop_remainder (bool, optional): Determines whether or not to drop the last
possibly incomplete batch (default=False). If True, and if there are less
int or callable object which takes exactly 1 parameter, BatchInfo.
drop_remainder (bool, optional): Determines whether or not to drop the last block
whose number of data rows is less than batch_size (default=False). If True, and if there are fewer
than batch_size rows available to make the last batch, then those rows will
be dropped and not propagated to the child node.
num_parallel_workers (int, optional): Number of workers to process the dataset in parallel (default=None).
num_parallel_workers (int, optional): Number of workers (threads) used to process the dataset in parallel
(default=None).
per_batch_map (callable, optional): Per batch map callable. A callable which takes
(list[Tensor], list[Tensor], ..., BatchInfo) as input parameters. Each list[Tensor] represents a batch
of Tensors on a given column. The number of lists should match with number of entries in input_columns.
@ -445,7 +447,7 @@ class Dataset:
columns of the last operation. (default=None, output columns will have the same
name as the input columns, i.e., the columns will be replaced).
column_order (Union[str, list[str]], optional): List of all the desired columns to propagate to
the child node. This list must be a subset of all the columns in the dataset after
the child node. This list must be a permutation of all the columns in the dataset after
all operations are applied. The order of the columns in each row propagated to the
child node follow the order they appear in this list. The parameter is mandatory
if the len(input_columns) != len(output_columns). (default=None, all columns
@ -453,8 +455,8 @@ class Dataset:
same).
pad_info (dict, optional): Whether to perform padding on selected columns. pad_info={"col1":([224,224],0)}
would pad column with name "col1" to a tensor of size [224,224] and fill the missing with 0.
python_multiprocessing (bool, optional): Parallelize Python function per_batch_map with multiple worker
processes. This option could be beneficial if the function is computational heavy (default=False).
python_multiprocessing (bool, optional): Parallelize the Python function per_batch_map with multiprocessing.
This option could be beneficial if the function is computationally heavy (default=False).
Returns:
BatchDataset, dataset batched.
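A brief sketch of the batching behaviour described above; the NumpySlicesDataset source and the "data" column are assumptions made for illustration.

>>> import mindspore.dataset as ds
>>> dataset = ds.NumpySlicesDataset({"data": list(range(10))}, shuffle=False)
>>> # 10 rows with batch_size=3 and drop_remainder=True -> 3 full batches; the 1-row remainder is dropped
>>> dataset = dataset.batch(batch_size=3, drop_remainder=True)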
@ -482,7 +484,7 @@ class Dataset:
@check_sync_wait
def sync_wait(self, condition_name, num_batch=1, callback=None):
"""
Add a blocking condition to the input Dataset.
Add a blocking condition to the input Dataset. A synchronization action will be applied.
Args:
condition_name (str): The condition name that is used to toggle sending next row.
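A rough sketch of the blocking behaviour, pairing sync_wait with a sync_update call in the consuming loop; the generator, condition name and batch sizes are assumptions made for illustration.

>>> import numpy as np
>>> import mindspore.dataset as ds
>>> def gen():
...     for i in range(20):
...         yield (np.array(i),)
>>> dataset = ds.GeneratorDataset(gen, column_names=["data"])
>>> dataset = dataset.sync_wait(condition_name="policy", num_batch=1)
>>> dataset = dataset.batch(batch_size=4)
>>> for item in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     # release the condition so the next batch can flow through the pipeline
...     dataset.sync_update(condition_name="policy")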
@ -530,7 +532,7 @@ class Dataset:
@check_shuffle
def shuffle(self, buffer_size):
"""
Randomly shuffles the rows of this dataset using the following algorithm:
Randomly shuffles the rows of this dataset using the following policy:
1. Make a shuffle buffer that contains the first buffer_size rows.
2. Randomly select an element from the shuffle buffer to be the next row
@ -538,7 +540,7 @@ class Dataset:
3. Get the next row (if any) from the parent node and put it in the shuffle buffer.
4. Repeat steps 2 and 3 until there are no more rows left in the shuffle buffer.
A seed can be provided to be used on the first epoch. In every subsequent
A random seed can be provided to be used on the first epoch. In every subsequent
epoch, the seed is changed to a new one, randomly generated value.
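A minimal sketch of the buffer-based shuffle described above; the global seed value and the source dataset are illustrative assumptions.

>>> import mindspore.dataset as ds
>>> ds.config.set_seed(58)   # seed used for the first epoch; later epochs are re-seeded automatically
>>> dataset = ds.NumpySlicesDataset({"data": list(range(8))}, shuffle=False)
>>> # a buffer_size smaller than the dataset size gives a partial (windowed) shuffle
>>> dataset = dataset.shuffle(buffer_size=4)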
Args:
@ -753,10 +755,10 @@ class Dataset:
@check_filter
def filter(self, predicate, input_columns=None, num_parallel_workers=None):
"""
Filter dataset by predicate.
Filter dataset by predicate.
Note:
If input_columns not provided or empty, all columns will be used.
If input_columns is not provided or is empty, all columns will be used.
Args:
predicate (callable): Python callable which returns a boolean value. If False then filter the element.
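A small sketch of the predicate semantics (rows for which the predicate returns False are dropped); the source dataset is an assumption made for illustration.

>>> import mindspore.dataset as ds
>>> dataset = ds.NumpySlicesDataset({"data": list(range(10))}, shuffle=False)
>>> # keep only even values; rows with odd values are filtered out
>>> dataset = dataset.filter(predicate=lambda data: data % 2 == 0, input_columns=["data"])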
@ -778,7 +780,7 @@ class Dataset:
@check_repeat
def repeat(self, count=None):
"""
Repeat this dataset count times. Repeat indefinitely if the count is None or -1.
Repeat this dataset N times, where N = count. Repeat indefinitely if the count is None or -1.
Note:
The order of using repeat and batch reflects the number of batches. It is recommended that
@ -788,7 +790,7 @@ class Dataset:
errors could occur since the amount of data is not the amount training requires.
Args:
count (int): Number of times the dataset is repeated (default=None).
count (int): Number of times the dataset is going to be repeated (default=None).
Returns:
RepeatDataset, dataset repeated.
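A brief sketch of the recommended batch-then-repeat ordering noted above; the source dataset and the counts are assumptions made for illustration.

>>> import mindspore.dataset as ds
>>> dataset = ds.NumpySlicesDataset({"data": list(range(4))}, shuffle=False)
>>> # 4 rows batched by 2 gives 2 batches per pass; repeating 3 times yields 6 batches in total
>>> dataset = dataset.batch(batch_size=2).repeat(count=3)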
@ -919,18 +921,19 @@ class Dataset:
Args:
sizes (Union[list[int], list[float]]): If a list of integers [s1, s2, , sn] is
provided, the dataset will be split into n datasets of size s1, size s2, , size sn
respectively. If the sum of all sizes does not equal the original dataset size, an
error will occur.
respectively. If the sum of all input sizes does not equal the original dataset size, an
error will be raised.
If a list of floats [f1, f2, , fn] is provided, all floats must be between 0 and 1
and must sum to 1, otherwise an error will occur. The dataset will be split into n
and must sum to 1, otherwise an error will be raised. The dataset will be split into n
Datasets of size round(f1*K), round(f2*K), , round(fn*K) where K is the size of the
original dataset.
If after rounding:
- Any size equals 0, an error will occur.
- The sum of split sizes < K, the difference will be added to the first split.
- The sum of split sizes > K, the difference will be removed from the first large
enough split such that it will have at least 1 row after removing the difference.
- The sum of split sizes < K, the difference of K - sum(round(fi * K)) will be added to the first
split.
- The sum of split sizes > K, the difference of sum(round(fi * K)) - K will be removed from the first
large enough split such that it will have at least 1 row after removing the difference.
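To make the rounding rules above concrete, a minimal sketch (the 10-row source and the 0.8/0.2 fractions are assumptions made for illustration): with K = 10, round(0.8*10) = 8 and round(0.2*10) = 2, so the sizes already sum to K and no adjustment is needed.

>>> import mindspore.dataset as ds
>>> dataset = ds.NumpySlicesDataset({"data": list(range(10))}, shuffle=False)
>>> # splits of 8 and 2 rows respectively
>>> train_dataset, eval_dataset = dataset.split([0.8, 0.2], randomize=True)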
randomize (bool, optional): Determines whether or not to split the data randomly (default=True).
If True, the data will be randomly split. Otherwise, each split will be created with
@ -990,7 +993,8 @@ class Dataset:
@check_zip_dataset
def zip(self, datasets):
"""
Zip the datasets in the input tuple of datasets. Columns in the input datasets must not have the same name.
Zip the datasets in the input tuple of datasets. Columns in the input datasets must have different
names.
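A minimal sketch of zipping two pipelines with disjoint column names; both sources are assumptions made for illustration.

>>> import mindspore.dataset as ds
>>> ds1 = ds.NumpySlicesDataset({"col_a": [1, 2, 3]}, shuffle=False)
>>> ds2 = ds.NumpySlicesDataset({"col_b": [4, 5, 6]}, shuffle=False)
>>> # each row of the zipped dataset carries both "col_a" and "col_b"
>>> zipped = ds1.zip(ds2)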
Args:
datasets (Union[tuple, class Dataset]): A tuple of datasets or a single class Dataset
@ -1014,7 +1018,8 @@ class Dataset:
@check_concat
def concat(self, datasets):
"""
Concatenate the datasets in the input list of datasets. The "+" operator is also supported to concatenate.
Concatenate the datasets in the input list of datasets.
The "+" operator is overloaded to supported to concatenate.
Note:
The column name, and rank and type of the column data must be the same in the input datasets.
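A minimal sketch of concatenation, including the overloaded "+" form; both sources are assumptions made for illustration, and their column name, rank and type match as required by the note above.

>>> import mindspore.dataset as ds
>>> ds1 = ds.NumpySlicesDataset({"col": [1, 2, 3]}, shuffle=False)
>>> ds2 = ds.NumpySlicesDataset({"col": [4, 5, 6]}, shuffle=False)
>>> combined = ds1.concat(ds2)   # 6 rows in total
>>> combined_alt = ds1 + ds2     # same result via the overloaded operator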
@ -1070,8 +1075,8 @@ class Dataset:
"""
Project certain columns in input dataset.
The specified columns will be selected from the dataset and passed down
the pipeline in the order specified. The other columns are discarded.
The specified columns will be selected from the dataset and passed into
the pipeline in the order specified. The other columns are discarded.
Args:
columns(Union[str, list[str]]): List of names of the columns to project.
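A minimal sketch of column projection; the three-column source is an assumption made for illustration.

>>> import mindspore.dataset as ds
>>> dataset = ds.NumpySlicesDataset({"col_a": [1, 2], "col_b": [3, 4], "col_c": [5, 6]}, shuffle=False)
>>> # only "col_c" and "col_a" are kept, in that order; "col_b" is discarded
>>> dataset = dataset.project(columns=["col_c", "col_a"])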
@ -1101,7 +1106,8 @@ class Dataset:
columns(Union[str, list[str]]): Column names to get words from.
freq_range(tuple[int]): A tuple of integers (min_frequency, max_frequency). Words within the frequency
range would be kept. 0 <= min_frequency <= max_frequency <= total_words. min_frequency/max_frequency
range will be stored.
Naturally, 0 <= min_frequency <= max_frequency <= total_words. min_frequency/max_frequency
can be set to default, which corresponds to 0/total_words respectively.
top_k(int): Number of words to be built into vocab. top_k most frequent words are
taken. The top_k is taken after freq_range. If not enough top_k, all words will be taken
@ -1169,10 +1175,11 @@ class Dataset:
vocab_size(int): Vocabulary size.
character_coverage(int): Percentage of characters covered by the model, must be between
0.98 and 1.0 Good defaults are: 0.9995 for languages with rich character sets like
Japanese or Chinese character sets, and 1.0 for other languages with small character sets.
Japanese or Chinese character sets, and 1.0 for other languages with small character sets
like English or Latin.
model_type(SentencePieceModel): Model type. Choose from unigram (default), bpe, char, or word.
The input sentence must be pretokenized when using word type.
params(dict): contains more optional parameters of sentencepiece library
params(dict): Any extra optional parameters for the sentencepiece library, depending on your raw data
Returns:
SentencePieceVocab, vocab built from the dataset.
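A hedged sketch of building a SentencePiece vocab from a text pipeline. The corpus file name and parameter values are illustrative assumptions, and the columns argument is assumed to precede the parameters documented above.

>>> import mindspore.dataset as ds
>>> from mindspore.dataset.text import SentencePieceModel
>>> corpus = ds.TextFileDataset("corpus.txt", shuffle=False)   # hypothetical corpus file
>>> vocab = corpus.build_sentencepiece_vocab(columns=["text"], vocab_size=5000,
...                                          character_coverage=0.9995,
...                                          model_type=SentencePieceModel.UNIGRAM,
...                                          params={})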
@ -1272,16 +1279,16 @@ class Dataset:
@check_device_send
def to_device(self, send_epoch_end=True, create_data_info_queue=False):
"""
Transfer data through CPU, GPU or Ascend devices.
Transfer data from CPU to GPU, Ascend, or other devices.
Args:
send_epoch_end (bool, optional): Whether to send end of sequence to device or not (default=True).
send_epoch_end (bool, optional): Whether to send the end of sequence to device or not (default=True).
create_data_info_queue (bool, optional): Whether to create queue which stores
types and shapes of data or not(default=False).
Note:
If device is Ascend, features of data will be transferred one by one. The limitation
of data transmission per time is 256M.
of data transmission at a time is 256M.
Returns:
TransferDataset, dataset for transferring.
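A hedged sketch of the transfer flow; it requires a configured Ascend/GPU device at runtime, and the source pipeline is an assumption made for illustration.

>>> import mindspore.dataset as ds
>>> dataset = ds.NumpySlicesDataset({"data": list(range(8))}, shuffle=False).batch(4)
>>> transfer = dataset.to_device()
>>> transfer.send()   # start pushing batches to the device queue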
@ -1297,7 +1304,7 @@ class Dataset:
Save the dynamic data processed by the dataset pipeline in common dataset format.
Supported dataset formats: 'mindrecord' only
Implicit type casting exists when saving data as 'mindrecord'. The table below shows how to do type casting.
Implicit type casting exists when saving data as 'mindrecord'. The following table shows how the type casting is performed.
.. list-table:: Implicit Type Casting when Saving as 'mindrecord'
:widths: 25 25 50
@ -1376,10 +1383,10 @@ class Dataset:
@check_tuple_iterator
def create_tuple_iterator(self, columns=None, num_epochs=-1, output_numpy=False, do_copy=True):
"""
Create an iterator over the dataset. The data retrieved will be a list of ndarrays of data.
Create an iterator over the dataset. The data retrieved will be a list of ndarrays.
To specify which columns to list and the order needed, use columns_list. If columns_list
is not provided, the order of the columns will not be changed.
is not provided, the order of the columns will remain unchanged.
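A minimal sketch of the tuple iterator, including column reordering via the columns argument; the two-column source is an assumption made for illustration.

>>> import mindspore.dataset as ds
>>> dataset = ds.NumpySlicesDataset({"col_a": [1, 2], "col_b": [3, 4]}, shuffle=False)
>>> for item in dataset.create_tuple_iterator(columns=["col_b", "col_a"], num_epochs=1, output_numpy=True):
...     # item is a list of ndarrays ordered as [col_b, col_a]
...     pass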
Args:
columns (list[str], optional): List of columns to be used to specify the order of columns
@ -1413,7 +1420,7 @@ class Dataset:
@check_dict_iterator
def create_dict_iterator(self, num_epochs=-1, output_numpy=False):
"""
Create an iterator over the dataset. The data retrieved will be a dictionary.
Create an iterator over the dataset. The data retrieved will be a dictionary datatype.
The order of the columns in the dictionary may not be the same as the original order.
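A minimal sketch of the dictionary iterator; the source dataset is an assumption made for illustration.

>>> import mindspore.dataset as ds
>>> dataset = ds.NumpySlicesDataset({"col_a": [1, 2], "col_b": [3, 4]}, shuffle=False)
>>> for item in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     # item maps column names to ndarrays, e.g. item["col_a"], item["col_b"]
...     pass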
@ -1510,7 +1517,7 @@ class Dataset:
def get_col_names(self):
"""
Get names of the columns in the dataset
Return the names of the columns in the dataset.
Returns:
list, list of column names in the dataset.
@ -1551,7 +1558,7 @@ class Dataset:
def get_dataset_size(self):
"""
Get the number of batches in an epoch.
Return the number of batches in an epoch.
Returns:
int, number of batches.
@ -1703,7 +1710,7 @@ class Dataset:
def get_batch_size(self):
"""
Get the size of a batch.
Return the size of a batch.
Returns:
int, the number of data in a batch.
@ -1717,7 +1724,7 @@ class Dataset:
def get_repeat_count(self):
"""
Get the replication times in RepeatDataset else 1.
Get the replication times in RepeatDataset (default is 1).
Returns:
int, the count of repeat.
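A short sketch exercising the getters documented above (get_col_names, get_dataset_size, get_batch_size, get_repeat_count); the pipeline itself is an assumption made for illustration.

>>> import mindspore.dataset as ds
>>> dataset = ds.NumpySlicesDataset({"data": list(range(10))}, shuffle=False)
>>> dataset = dataset.batch(batch_size=2).repeat(count=3)
>>> dataset.get_col_names()      # ['data']
>>> dataset.get_dataset_size()   # 15 batches (5 per pass, repeated 3 times)
>>> dataset.get_batch_size()     # 2
>>> dataset.get_repeat_count()   # 3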
@ -1731,7 +1738,7 @@ class Dataset:
def get_class_indexing(self):
"""
Get the class index.
Return the class index.
Returns:
dict, a str-to-int mapping from label name to index.
@ -1870,7 +1877,7 @@ class MappableDataset(SourceDataset):
def use_sampler(self, new_sampler):
"""
Will make the current dataset use the new_sampler provided.
Make the current dataset use the new_sampler provided.
Args:
new_sampler (Sampler): The sampler to use for the current dataset.
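A minimal sketch of swapping in a new sampler on a mappable dataset; the source and the sampler settings are assumptions made for illustration.

>>> import mindspore.dataset as ds
>>> dataset = ds.NumpySlicesDataset({"data": list(range(10))}, shuffle=False)
>>> # replace the dataset's current sampler; only the first 5 rows will be read, in order
>>> dataset.use_sampler(ds.SequentialSampler(num_samples=5))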


@ -202,62 +202,111 @@ class GraphData:
def get_all_neighbors(self, node_list, neighbor_type, output_format=OutputFormat.NORMAL):
"""
Get `neighbor_type` neighbors of the nodes in `node_list`.
The following tables are used to illustrate the definition of these output formats. A value of 1 indicates
that two nodes are connected, and 0 indicates that they are not connected.
.. list-table:: Adjacency Matrix
:widths: 20 20 20 20 20
:header-rows: 1
* -
- 0
- 1
- 2
- 3
* - 0
- 0
- 1
- 0
- 0
* - 1
- 0
- 0
- 1
- 0
* - 2
- 1
- 0
- 0
- 1
* - 3
- 1
- 0
- 0
- 0
.. list-table:: Normal Format
:widths: 20 20 20 20 20
:header-rows: 1
* - src
- 0
- 1
- 2
- 3
* - dst_0
- 1
- 2
- 0
- 1
* - dst_1
- -1
- -1
- 3
- -1
.. list-table:: COO Format
:widths: 20 20 20 20 20 20
:header-rows: 1
* - src
- 0
- 1
- 2
- 2
- 3
* - dst
- 1
- 2
- 0
- 3
- 1
.. list-table:: CSR Format
:widths: 40 20 20 20 20 20
:header-rows: 1
* - offsetTable
- 0
- 1
- 2
- 4
-
* - dstTable
- 1
- 2
- 0
- 3
- 1
Args:
node_list (Union[list, numpy.ndarray]): The given list of nodes.
neighbor_type (int): Specify the type of neighbor.
output_format (OutputFormat, optional): Output storage format (default=OutputFormat.NORMAL)
It can be any of [OutputFormat.NORMAL, OutputFormat.COO, OutputFormat.CSR].
It can be any of [OutputFormat.NORMAL, OutputFormat.COO, OutputFormat.CSR].
Returns:
For NORMAL format or COO format
numpy.ndarray, array of neighbors.
If CSR format is specified, two numpy.ndarrays will return.
The first is offset table, the second is neighbors
a numpy.ndarray representing the array of neighbors will be returned.
If CSR format is specified, two numpy.ndarrays will be returned:
the first one is the offset table, the second one is the neighbors.
Examples:
We try to use the following example to illustrate the definition of these formats. 1 represents connected
between two nodes, and 0 represents not connected.
Raw Data:
0 1 2 3
0 0 1 0 0
1 0 0 1 0
2 1 0 0 1
3 1 0 0 0
Normal format
>>> nodes = graph_dataset.get_all_nodes(node_type=1)
>>> neighbors = graph_dataset.get_all_neighbors(node_list=nodes, neighbor_type=2)
NORMAL:
dst_0 dst_1
0 1 -1
1 2 -1
2 0 3
3 1 -1
COO format
>>> nodes = graph_dataset.get_all_nodes(node_type=1)
>>> neighbors_coo = graph_dataset.get_all_neighbors(node_list=nodes, neighbor_type=2,
...                                             output_format=OutputFormat.COO)
COO:
src dst
0 1
1 2
2 0
2 3
3 1
CSR format
>>> nodes = graph_dataset.get_all_nodes(node_type=1)
>>> offset_table, neighbors_csr = graph_dataset.get_all_neighbors(node_list=nodes, neighbor_type=2,
...                                                               output_format=OutputFormat.CSR)
CSR:
offset table: 0 1 2 4
dst table:    1 2 0 3 1
Raises:
TypeError: If `node_list` is not list or ndarray.