Merged
64 changes: 42 additions & 22 deletions advanced_simplification.md
@@ -19,6 +19,9 @@ kernelspec:
# _Advanced simplification_
% remove underscores in title when tutorial is complete or near-complete

:::{todo}
This tutorial is only partly complete, and a number of sections contain TODO items.
:::

This is a companion to the basic {ref}`sec_simplification` tutorial.
It focuses on details of `simplify` behavior that are useful when you need precise
@@ -55,6 +58,8 @@ tables to be {meth}`sorted <TableCollection.sort>`). Simplifying tables in place
is often useful for {ref}`forward-time simulations <sec_tskit_forward_simulations>`.
:::

(sec_advanced_simplification_map_nodes)=

## 1) Tracking node ID changes

With default settings, simplification compacts tables and therefore reassigns node
@@ -74,6 +79,8 @@ Note that when simplifying tables in-place using {meth}`TableCollection.simplify
is always returned. To avoid compacting the node table, and leave node IDs unchanged, use
`filter_nodes=False`.

(sec_advanced_simplification_map_nodes_reverse)=

### Obtaining the reverse map

Often you might want a reverse map, mapping the new node IDs to the old ones. Here's
@@ -94,21 +101,52 @@ print("New sample ID 0", "maps to old ID", int(reverse_map[0]))
## 2) Keeping input roots

:::{todo}
This is easy to illustrate, and useful for forward sims / census approaches
The `keep_input_roots=True` argument is easy to illustrate, and useful for
forward sims / census approaches.
:::

## 3) Keeping ancestral individuals

In some cases, a tree sequence contains historical individuals associated
with nodes that are not samples, and you want to retain information on the individuals
that remain ancestral after simplifying. For example, a forward-time simulation could
define individuals for all nodes in the past, including the
{ref}`pedigree links <msprime:sec_pedigrees_encoding>` between parents and children,
and you may wish to retain the chain of individuals that defines the portion of the
pedigree relevant to the genetic ancestry (see also the discussion in the SLiM manual and in
[SLiM issue #139](https://github.com/MesserLab/SLiM/issues/139)).

To keep all the individuals associated with genetic ancestry, you can use
`keep_unary_in_individuals=True`. In particular, this means
that ancestral nodes which are not coalescent anywhere along the genome,
but which are associated with an individual, will be retained (and
so the referenced individuals will be retained too).

:::{todo}
Should we have a demonstration here? {ref}`sec_tskit_forward_simulations` could be used to
create a simulator that saves pedigree information into each individual, and we could distill
some of the discussion from https://github.com/MesserLab/SLiM/issues/139 into an example
of storing a coherent pedigree.
:::

## 3) Setting sample flags
The `keep_unary_in_individuals` argument is a specific example of keeping some, but not all,
non-coalescent ancestry in the tree sequence. If you need to retain a known set of
non-coalescent nodes, it can be helpful to treat them as focal samples and use the
`update_sample_flags=False` option, as described next.


## 4) Setting sample flags

Normally the nodes that are provided to the `simplify()` function are marked as sample
nodes in the output (by setting the `NODE_IS_SAMPLE` flag), and other nodes have that flag unset.
If you provide the `update_sample_flags=False` option, all node flags are left unchanged.
If you provide the `update_sample_flags=False` argument, all node flags are left unchanged.
Here are some cases where that can be useful.

### Parallel simplification

One use for the `update_sample_flags=False` option combines it with `filter_nodes=False`,
to ensure that the node table remains untouched during simplification.
This is primarily a use-case targetted at developers of forward simulators, and allows
This is primarily a use-case targeted at developers of forward simulators, and allows
logically disjunct parts of the edge table to be simplified in parallel, without
risking two parallel processes trying to alter the same data.

@@ -220,24 +258,6 @@ d3arg = argviz.D3ARG.from_ts(ts=subset_arg)
d3arg.draw(title=f"A full ARG, subset to {subset_arg.num_samples} samples");
```

## 4) Keeping individuals

In some cases, a tree sequence might contain historical individuals which are associated
with nodes that are not samples, and you wish to retain information on individuals which are
ancestral to the sample nodes. For example a forward-time simulation could
define individuals for all nodes in the past, including the pedigree links between parents
and children (see also discussion in the SLiM manual, and at
https://github.com/MesserLab/SLiM/issues/139).

To keep all the individuals associated with genetic ancestry, you can use
`keep_unary_in_individuals=True`.

:::{todo}
Should we have a demonstration here? {ref}`sec_tskit_forward_simulations` could be used to
create a simulator that saves pedigree information into each individual, and we could distill
some of the discussion from https://github.com/MesserLab/SLiM/issues/139 into that.
:::

## 5) reduce_to_site_topology

:::{todo}
31 changes: 19 additions & 12 deletions simplification.md
@@ -38,17 +38,24 @@ def create_notebook_data():
# Simplification

The {meth}`~TreeSequence.simplify` method provides one of the most powerful ways to modify a
[tskit](https://tskit.dev) {class}`TreeSequence`. It removes and modifies edges to leave only the
ancestry of a provided set of focal nodes. By default it ensuring these focal nodes are marked as
samples and removes non-ancestral nodes and associated objects such as individuals and populations.
It is commonly used:
[tskit](https://tskit.dev) {class}`TreeSequence`.

At a high level, simplification works as follows: it starts from a chosen set of focal nodes
and then traces their ancestry back through the tree sequence. Any nodes, edges, and mutations
(as well as individuals, populations, and sites) that are not needed to represent that ancestry
are discarded, and the remaining information is compacted into a new, equivalent tree sequence.
During this process, IDs of nodes and other objects may change. In particular, non-coalescent
nodes are usually removed, unless you ask to keep them.

Simplification is commonly used:

* In forward simulations, to remove lineages that have gone extinct
* To create a smaller tree sequence focussed on a subset of samples
* To remove redundant nodes and other tskit objects (e.g. unreferenced populations)

Other less common uses, such as retaining unary regions of coalescent nodes, and
simplification in parallel, are described in the {ref}`sec_advanced_simplification` tutorial.
Other less common uses, such as retaining all ancestral individuals, retaining unary
regions of coalescent nodes, and simplifying without touching the node table,
are described in the {ref}`sec_advanced_simplification` tutorial.


## A single tree example
@@ -93,7 +100,7 @@ ts_simp2.draw_svg(**plot_params)
Note that the example above also used another `filter_` argument, setting
`filter_sites=False`, so that the first site, which has no mutations after
simplification, is also retained (it is shown as a bare tick mark on the X axis,
around position 250). However, mutations above unused nodes are still deleted
around position 250). However, mutations above unused nodes are still deleted,
so mutation IDs are not guaranteed to stay the same.

To further reduce the size of the simplified tree sequence, simplification normally
@@ -106,11 +113,11 @@ ts_simp3.draw_svg(**plot_params)
```

:::{note}
As modifying a tree sequence can change the IDs of nodes, sites, and other objects,
it can be useful to use {ref}`metadata <sec_tutorial_metadata>`:
information that stays associated with tskit objects even when their IDs change.
When simplifying, it is also possible to keep track of node ID changes by using
the `map_nodes` parameter, as demonstrated later in this tutorial.
As modifying a tree sequence can change the IDs of nodes, sites, and other objects, it
can be useful to use {ref}`metadata <sec_tutorial_metadata>`: information that stays
associated with tskit objects even when their IDs change. When simplifying, it is
also possible to keep track of node ID changes by using the `map_nodes` parameter;
see the {ref}`advanced simplification <sec_advanced_simplification_map_nodes>` tutorial.
:::

## A larger simplification example
8 changes: 4 additions & 4 deletions viz.md
@@ -746,7 +746,7 @@ css_string = (

# Override default node text position to be based at (0, 0) relative to the node pos
# Note that the .tree specifier is needed to make this more specific than the default
# positioning which is targetted at ".lab.lft" and ".lab.rgt"
# positioning which is targeted at ".lab.lft" and ".lab.rgt"
".tree .node > .lab {transform: translate(0, 0); text-anchor: middle; font-size: 7pt}"

# For leaf nodes, override the above positioning using a subsequent CSS style
@@ -941,7 +941,7 @@ itself (and not its descendants) a slightly different specification is required,
involving the "`>`" symbol, or
[child combinator](https://www.w3.org/TR/selectors-3/#child-combinators) (we have,
in fact, used it in several previous examples). The following plot shows the difference
when all decendant symbols are targetted, versus just the immediate child symbol:
when all descendant symbols are targeted, versus just the immediate child symbol:

```{code-cell} ipython3
node_style1 = ".n13 .sym {fill: yellow}" # All symbols under node 13
@@ -953,7 +953,7 @@ ts_small.draw_svg(y_axis=True, y_ticks=y_tick_pos, x_lim=x_limits, style=css_str
Another example of modifying the style target is *negation*. This is needed, for example,
to target nodes that are *not* leaves (i.e. internal nodes). One way to do this is to
target *all* the node symbols first, then replace the style with a more specific
targetting of the leaf symbols only:
targeting of the leaf symbols only:

```{code-cell} ipython3
hide_internal_symlabs = ".node > .sym, .node > .lab {display: none}"
@@ -1770,7 +1770,7 @@ def tanglegram(

lft_node_map, lft = reorder_tree_nodes(lft, leaves)
lft_rev_map = make_reverse_map(lft_node_map)
# Have to change the node labels, because even provided ones will be targetting the wrong IDs
# Have to change the node labels, because even provided ones will be targeting the wrong IDs
lft_node_labels = {u: node_labels[v] for u, v in enumerate(lft_node_map) if v in node_labels}
if order[1] is None:
# We do not reorder the RH tree, so the node IDs should stay as-is