diff --git a/advanced_simplification.md b/advanced_simplification.md index 2d3f2061..ac7ce54f 100644 --- a/advanced_simplification.md +++ b/advanced_simplification.md @@ -19,6 +19,9 @@ kernelspec: # _Advanced simplification_ % remove underscores in title when tutorial is complete or near-complete +:::{todo} +This tutorial is only partly complete, and a number of sections contain TODO items. +::: This is a companion to the basic {ref}`sec_simplification` tutorial. It focuses on details of `simplify` behavior that are useful when you need precise @@ -55,6 +58,8 @@ tables to be {meth}`sorted `). Simplifying tables in place is often useful for {ref}`forward-time simulations `. ::: +(sec_advanced_simplification_map_nodes)= + ## 1) Tracking node ID changes With default settings, simplification compacts tables and therefore reassigns node @@ -74,6 +79,8 @@ Note that when simplifying tables in-place using {meth}`TableCollection.simplify is always returned. To avoid compacting the node table, and leave node IDs unchanged, use `filter_nodes=False`. +(sec_advanced_simplification_map_nodes_reverse)= + ### Obtaining the reverse map Often you might want a reverse map, mapping the new node IDs to the old ones. Here's @@ -94,21 +101,52 @@ print("New sample ID 0", "maps to old ID", int(reverse_map[0])) ## 2) Keeping input roots :::{todo} -This is easy to illustrate, and useful for forward sims / census approaches +The `keep_input_roots=True` argument is easy to illustrate, and useful for +forward sims / census approaches. +::: + +## 3) Keeping ancestral individuals + +In some cases, a tree sequence might contain historical individuals which are associated +with nodes that are not samples, and you wish to retain information on individuals which +remain ancestral after simplifying.
For example, a forward-time simulation could +define individuals for all nodes in the past, including the +{ref}`pedigree links ` between parents and children, +and you may wish to retain the chain of individuals that define that portion of the pedigree +which is relevant to the genetic ancestry (see also discussion in the SLiM manual, and in +[SLiM issue #139](https://github.com/MesserLab/SLiM/issues/139)). + +To keep all the individuals associated with genetic ancestry, you can use +`keep_unary_in_individuals=True`. In particular, this means +that ancestral nodes which are not coalescent anywhere along the genome, +but which are associated with an individual, will be retained (and +so the referenced individuals will be retained too). + +:::{todo} +Should we have a demonstration here? {ref}`sec_tskit_forward_simulations` could be used to +create a simulator that saves pedigree information into each individual, and we could distill +some of the discussion from https://github.com/MesserLab/SLiM/issues/139 into an example +of storing a coherent pedigree. ::: -## 3) Setting sample flags +The `keep_unary_in_individuals` argument is a specific example of keeping some, but not all, +non-coalescent ancestry in the tree sequence. If you need to retain a known set of +non-coalescent nodes, it can be helpful to treat them as focal samples and use the +`update_sample_flags=False` option, as described next. + + +## 4) Setting sample flags Normally the nodes that are provided to the `simplify()` function are marked as sample nodes in the output (by setting the `NODE_IS_SAMPLE` flag), and other nodes have that flag unset. -If you provide the `update_sample_flags=False` option, all node flags are left unchanged. +If you provide the `update_sample_flags=False` argument, all node flags are left unchanged. Here are some cases where that can be useful.
### Parallel simplification One use for the `update_sample_flags=False` option combines it with `filter_nodes=False`, to ensure that the node table remains untouched during simplification. -This is primarily a use-case targetted at developers of forward simulators, and allows +This is primarily a use-case targeted at developers of forward simulators, and allows logically disjunct parts of the edge table to be simplified in parallel, without risking two parallel processes trying to alter the same data. @@ -220,24 +258,6 @@ d3arg = argviz.D3ARG.from_ts(ts=subset_arg) d3arg.draw(title=f"A full ARG, subset to {subset_arg.num_samples} samples"); ``` -## 4) Keeping individuals - -In some cases, a tree sequence might contain historical individuals which are associated -with nodes that are not samples, and you wish to retain information on individuals which are -ancestral to the sample nodes. For example a forward-time simulation could -define individuals for all nodes in the past, including the pedigree links between parents -and children (see also discussion in the SLiM manual, and at -https://github.com/MesserLab/SLiM/issues/139). - -To keep all the individuals associated with genetic ancestry, you can use -`keep_unary_in_individuals=True`. - -:::{todo} -Should we have a demonstration here? {ref}`sec_tskit_forward_simulations` could be used to -create a simulator that saves pedigree information into each individual, and we could distill -some of the discussion from https://github.com/MesserLab/SLiM/issues/139 into that. -::: - ## 5) reduce_to_site_topology :::{todo} diff --git a/simplification.md b/simplification.md index 10c099c6..348cc301 100644 --- a/simplification.md +++ b/simplification.md @@ -38,17 +38,24 @@ def create_notebook_data(): # Simplification The {meth}`~TreeSequence.simplify` method provides one of the most powerful ways to modify a -[tskit](https://tskit.dev) {class}`TreeSequence`. 
It removes and modifies edges to leave only the -ancestry of a provided set of focal nodes. By default it ensuring these focal nodes are marked as -samples and removes non-ancestral nodes and associated objects such as individuals and populations. -It is commonly used: +[tskit](https://tskit.dev) {class}`TreeSequence`. + +At a high level, simplification works as follows: it starts from a chosen set of focal nodes +and then traces their ancestry back through the tree sequence. Any nodes, edges, and mutations +(as well as individuals, populations, and sites) that are not needed to represent that ancestry +are discarded, and the remaining information is compacted into a new, equivalent tree sequence. +During this process, IDs of nodes and other objects may change. In particular, non-coalescent +nodes are usually removed, unless you ask to keep them. + +Simplification is commonly used: * In forward simulations, to remove lineages that have gone extinct * To create a smaller tree sequence focussed on a subset of samples * To remove redundant nodes and other tskit objects (e.g. unreferenced populations) -Other less common uses, such as retaining unary regions of coalescent nodes, and -simplification in parallel, are described in the {ref}`sec_advanced_simplification` tutorial. +Other less common uses, such as retaining all ancestral individuals, retaining unary +regions of coalescent nodes, and simplifying without touching the node table, +are described in the {ref}`sec_advanced_simplification` tutorial. ## A single tree example @@ -93,7 +100,7 @@ ts_simp2.draw_svg(**plot_params) Note that the example above also used another `filter_` argument, setting `filter_sites=False`, so that the first site, which has no mutations after simplification, is also retained (it is shown as a bare tick mark on the X axis, -around position 250). However, mutations above unused nodes are still deleted +around position 250). 
However, mutations above unused nodes are still deleted, so mutation IDs are not guaranteed to stay the same. To further reduce the size of the simplified tree sequence, simplification normally @@ -106,11 +113,11 @@ ts_simp3.draw_svg(**plot_params) ``` :::{note} -As modifying a tree sequence can change the IDs of nodes, sites, and other objects, -it can be useful to use {ref}`metadata `: -information that stays associated with tskit objects even when their IDs change. -When simplifying, it is also possible to keep track of node ID changes by using -the `map_nodes` parameter, as demonstrated later in this tutorial. +As modifying a tree sequence can change the IDs of nodes, sites, and other objects, it +can be useful to use {ref}`metadata `: information that stays +associated with tskit objects even when their IDs change. When simplifying, it is +also possible to keep track of node ID changes by using the `map_nodes` parameter; +see the {ref}`advanced simplification ` tutorial. ::: ## A larger simplification example diff --git a/viz.md b/viz.md index 9d28d961..2c806b92 100644 --- a/viz.md +++ b/viz.md @@ -746,7 +746,7 @@ css_string = ( # Override default node text position to be based at (0, 0) relative to the node pos # Note that the .tree specifier is needed to make this more specific than the default - # positioning which is targetted at ".lab.lft" and ".lab.rgt" + # positioning which is targeted at ".lab.lft" and ".lab.rgt" ".tree .node > .lab {transform: translate(0, 0); text-anchor: middle; font-size: 7pt}" # For leaf nodes, override the above positioning using a subsequent CSS style @@ -941,7 +941,7 @@ itself (and not its descendants) a slightly different specification is required, involving, the "`>`" symbol, or [child combinator](https://www.w3.org/TR/selectors-3/#child-combinators) (we have, in fact, used it in several previous examples).
The following plot shows the difference -when all decendant symbols are targetted, versus just the immediate child symbol: +when all descendant symbols are targeted, versus just the immediate child symbol: ```{code-cell} ipython3 node_style1 = ".n13 .sym {fill: yellow}" # All symbols under node 13 @@ -953,7 +953,7 @@ ts_small.draw_svg(y_axis=True, y_ticks=y_tick_pos, x_lim=x_limits, style=css_str Another example of modifying the style target is *negation*. This is needed, for example, to target nodes that are *not* leaves (i.e. internal nodes). One way to do this is to target *all* the node symbols first, then replace the style with a more specific -targetting of the leaf symbols only: +targeting of the leaf symbols only: ```{code-cell} ipython3 hide_internal_symlabs = ".node > .sym, .node > .lab {display: none}" @@ -1770,7 +1770,7 @@ def tanglegram( lft_node_map, lft = reorder_tree_nodes(lft, leaves) lft_rev_map = make_reverse_map(lft_node_map) - # Have to change the node labels, because even provided ones will be targetting the wrong IDs + # Have to change the node labels, because even provided ones will be targeting the wrong IDs lft_node_labels = {u: node_labels[v] for u, v in enumerate(lft_node_map) if v in node_labels} if order[1] is None: # We do not reorder the RH tree, so the node IDs should stay as-is