2024-10-15

複雑な構造変異を検出するためのロングリードベースの手法 FindCSV

　構造変異は遺伝病や進化のメカニズムにおいて重要な役割を果たしている。過去10年間、単純な構造変異を検出するために広範な研究が行われ、確立された検出方法が開発された。しかし、最近の研究では、単純な構造変異に比べて複雑な構造変異が個体に与える影響が大きい可能性があることが浮き彫りになっている。にもかかわらず、この分野には複雑な構造変異に特化した正確な検出法がまだない。したがって、非常に効率的で正確な検出法を開発することが最も重要である。このニーズに応えるため、本著者らはディープラーニング技術とコンセンサス配列を活用し、ロングリード配列データを用いたSVの検出を強化するFindCSVと呼ばれる新規手法を提案する。既存の手法と比較して、FindCSVは複雑な構造変異や単純な構造変異の検出において優れた性能を示した。FindCSVは、実データおよびシミュレーションデータにおいて、複雑で単純な構造変異を妥当な精度で検出する新しい手法である。このプログラムのソースコードはhttps://github.com/nwpuzhengyan/FindCSVにある。

Snifflesは現在最も人気のあるSV検出手法である。DeBreakとcuteSVは非常に高い再現率と精度を持つことが示されている効率的な手法である。SVcnnは本著者らが以前単純なSVを検出するのに用いた手法である。SVisionは第3世代のシーケンスデータでCSVを検出するために特別に使用された手法である。～　

しかし現在のところcomplex SVs (CSVs)を正確に検出するための検出方法は不足している。実験の結果、CSV検出における2つの主要な課題が明らかになった： 1. CSVの正確な定義がないこと、2.CSVの多様な性質により、正しいブレークポイントを正確に特定し報告することが検出手法にとって困難であること。～　

~ FindCSVは前のステップで得られた領域を画像に変換する。この画像表現では、各行はリードに対応し、各列は参照ゲノム上の位置に対応する。変換プロセスでは、CIGAR文字列の各文字を特定のルールに従ってピクセルに変換する。CIGAR文字列の各文字をルールに従ってピクセルに変換することで、SV候補領域内のリードのバリエーションとアラインメントパターンを視覚的に表現する5色の画像を作成する。INSは参照ゲノム上の位置を占めないため、FindCSVは、挿入長に基づいてマッチ（M）の一部を挿入（I）に置き換える。置換処理の後、領域は画像に変換され、SV画像と呼ばれる。～　

FindCSVはSV候補領域のフィルタリングに畳み込みニューラルネットワークモデルを採用する。具体的には、FindCSVはフィルタリングモデルとしてLeNetモデルを選択する。LeNetモデルは3つの畳み込み層、2つのサブサンプリング層、2つの完全接続層から構成される。これらの層は入力データから特徴を抽出し、表現を学習するように設計されている。（以下略）

インストール

condaで環境を作り、pipでツールを導入した（GPU: RTX 3090）。

依存

1. python3
2. pysam
3. cigar
4. numpy
5. pyfaidx
6. copy
7. time
8. argparse
9. PIL
10. pytorch
11. torchvision
12. os
13. swalign

Github

git clone https://github.com/nwpuzhengyan/FindCSV.git
cd FindCSV/
mamba create -n findCSV python=3.11 -y
conda activate findCSV
pip install pysam cigar numpy pyfaidx TIME-python argparse pillow torch torchvision os-sys swalign pycopy-copy

> python FindCSV.py -h

$ python FindCSV.py -h

usage: FindCSV.py [-h] bam fasta

positional arguments:

bam bam file

fasta fasta file

options:

-h, --help show this help message and exit

実行方法

ロングードのマッピングのbamとfastaファイルを指定する（アライナーはNGMLR、Minimap、Minimap2が想定されている）。bam.baiファイルも必要。fastaファイルは解凍されている必要がある。

python FindCSV.py input.bam reference.fasta

出力例（ref. : A.thaliana、SRAのONTデータの１つをダウンロードし、minnimap2の-x map-ontでマッピングして使用）

> head -n 20 FindCSV_result.vcf

FindCSVはSV領域を画像に変換し、その画像をSV_into_imageフォルダに格納する。各画像の名前は染色体名とSVの開始位置と終了位置からなる。

SV_image/

1_14866598-14867705.png

各リードが画像の1行を占める。黄色と青のピクセルはマッチ、黒のピクセルは欠失、赤のピクセルは挿入、緑のピクセルはミスマッチを表す。

1_13993090-13993201

IGV

（少しずれているのはキャプチャの誤差による）

引用

FindCSV: a long-read based method for detecting complex structural variations

Yan Zheng & Xuequn Shang

BMC Bioinformatics volume 25, Article number: 315 (2024)

2024-10-11

大規模なゲノム配列セットのANI値を計算する LZ-ANI

2024 Preprint ANI 高速なツール all versus all sequence comarison

　LZ-ANIは、大規模なゲノム配列セットの平均ヌクレオチド同一性（ANI）を決定するための、高速でメモリ効率のよいツールである。このツールはLempel-Ziv構文解析を使用し、一致するヌクレオチドと不一致のヌクレオチドを高感度で識別し、ANIの正確な決定を可能にする。LZ-ANIの効率性は、簡素化されたインデルの処理モデルに由来し、最も高感度なBLASTn検索に匹敵する精度を維持しながら、アライメントベースのツール（BLASTn、MegaBLASTなど）よりもはるかに高速である。

LZ-ANIは、ウイルスゲノムの比較およびクラスタリングツールであるVclustの主要コンポーネントになっている。ウイルスゲノム用に最適化されているが、LZ-AANIのパラメータは、バクテリアや古細菌のような長いゲノム用にカスタマイズすることができる。

インストール

リリースページからビルド済みバイナリをダウンロードできる。

ビルド依存

C++17 compiler
NASM ("sudo apt install nasm")

Github

git clone --recurse-submodules https://github.com/refresh-bio/lz-ani
cd lz-ani && make -j

> ./lz-ani

$ ./lz-ani

lz-ani 1.2.0 (2024-10-09) by Sebastian Deorowicz, Adam Gudys

Tool for rapid determination of similarities among sets of DNA sequences

Usage:

lz-ani <mode> [options]

Modes:

all2all - all to all

Options - input specification:

--in-fasta <file_name> - FASTA file (for multisample-fasta mode)

--in-txt <file_name> - text file with FASTA file names

--in-dir <path> - directory with FASTA files

--multisample-fasta <bool> - multi sample FASTA input (default: true)

--flt-kmerdb <fn> <float> - filtering file (kmer-db output) and threshold

Options - output specification:

-o, --out <file_name> - output file name

--out-ids <file_name> - output file name for ids file (optional)

--out-alignment <file_name>- output file name for ids file (optional)

--out-in-percent <bool> - output in percent (default: false)

--out-type <type> - one of:

tsv - two tsv files with: results defined by --out-format and sequence ids (default)

single-txt - combined results in single txt file

--out-format <type> - comma-separated list of values:

query,reference,qidx,ridx,qlen,rlen,tani,gani,ani,qcov,rcov,len_ratio,nt_match,nt_mismatch,num_alns

you can include also meta-names:

complete=qidx,ridx,query,reference,tani,gani,ani,qcov,rcov,num_alns,len_ratio,qlen,rlen,nt_match,nt_mismatch

lite=qidx,ridx,tani,gani,ani,qcov,num_alns,len_ratio

standard=qidx,ridx,query,reference,tani,gani,ani,qcov,num_alns,len_ratio

(default: standard)

--out-filter <par> <float> - store only results with <par> (can be: tani, gani, ani, cov) at least <float>; can be used multiple times

Options - LZ-parsing-related:

-a, --mal <int> - min. anchor length (default: 11)

-s, --msl <int> - min. seed length (default: 7)

-r, --mrd <int> - max. dist. between approx. matches in reference (default: 40)

-q, --mqd <int> - max. dist. between approx. matches in query (default: 40)

-g, --reg <int> - min. considered region length (default: 35)

--aw <int> - approx. window length (default: 15)

--am <int> - max. no. of mismatches in approx. window (default: 7)

--ar <int> - min. length of run ending approx. extension (default: 3)

Options - other:

-t, --threads <int> - no of threads; 0 means auto-detect (default: 0)

-V, --verbose <int> - verbosity level (default: 1)

実行方法

LZ-ANIはデフォルトでANIの全対全比較を行うよう設計されている（"lz-ani all2all"）。したがって、入力配列のすべてのペアについて配列類似性が計算される。将来的にペアワイズ比較のモードも搭載する予定とレポジトリに書かれている。

比較したい（ウィルス）ゲノム配列全てを含むmulti-fastaファイルを指定する。

./lz-ani all2all --in-fasta example/multifasta.fna --out ani.tsv

--in-fasta FASTA file (for multisample-fasta mode)
--multisample-fasta multi sample FASTA input (default: true)
--out output file name

テストデータでは0,1秒以内に計算が終了した。

> head ani.tsv

出力フォーマットはレポジトリで説明されている（link）。7つの配列類似性尺度で計算され、最終的にtatal ANI (tANI)が報告される。

もしくはゲノムのfastaファイルを含むディレクトリかfastaファイルのリストを指定する。fastaファイルはgzip圧縮されていても使用できる。

./lz-ani all2all --in-dir example/fna --out ani.tsv

--in-txt text file with FASTA file names
--in-dir <path> — directory with FASTA files

out-filterオプションを使用すると出力をフィルタリングできる。Query coverage (aligned fraction) が0.85以上、ANI0.95以上。

./lz-ani all2all --in-fasta example/multifasta.fna --out ani.tsv --out-filter ani 0.95 --out-filter qcov 0.85

指定された基準を満たすゲノムペアのみ報告される。

論文より

LZ-ANIパラメータは置換、欠失、挿入、逆位、重複、転座を含む模擬変異を含む10,000組のファージゲノムに対して最適化された。

引用

Ultrafast and accurate sequence alignment and clustering of viral genomes

Andrzej Zielezinski, Adam Gudyś, Jakub Barylski, Krzysztof Siminski, Piotr Rozwalak, Bas E. Dutilh, Sebastian Deorowicz

bioRxiv, Posted July 02, 2024.

2024-10-10

比較ゲノミクスのための遺伝子座の可視化ツール LoVis4u

2024 bacteria Preprint 結果の視覚化 (visualization) ゲノム比較 (comparative genomics) plasmid phage gene cluster 高速なツール

　比較ゲノム解析では、ゲノムの遺伝子座のアラインメントを可視化することがよくある。PythonやRのライブラリからスタンドアローンのGUIまで、このタスクのためにいくつかのソフトウェアツールが利用可能であるが、高速で自動化された使用法と出版可能なベクター画像の作成を提供するツールが不足している。

ここでは、LoVis4uを紹介する。LoVis4uは、複数のゲノム遺伝子座を高度にカスタマイズ可能かつ高速に可視化するために設計されたコマンドラインツールとPython APIである。LoVis4uは、GenBankまたはGFFファイルからのアノテーションデータに基づいて、PDF形式のベクター画像を生成する。LoVis4uは、原核生物ゲノムのプラスミドやユーザー定義領域だけでなく、バクテリオファージの全ゲノムを可視化することができる。さらに、LoVis4uは、入力配列中のアクセサリー遺伝子やコア遺伝子を同定し、ハイライトするためのオプションのデータ処理ステップを提供する。

LoVis4uはPython3で実装されており、LinuxとMacOS上で動作する。コマンドラインインターフェースは、ほとんどの実用的なユースケースをカバーし、提供されるPython APIは、Pythonプログラム内での使用、外部ツールへの統合、追加のカスタマイズを可能にする。ソースコードはGitHubページで入手できる：github.com/art-egorov/lovis4u。例によるガイドを含む詳細なドキュメントは、ソフトウェアのホームページから入手できる： art-egorov.github.io/lovis4u。

https://art-egorov.github.io/lovis4u/

Gallery

https://art-egorov.github.io/lovis4u/Gallery/gallery/

インストール

ubuntu22.04にcondaで環境を作ってインストールした。また、WSLのubuntu22でもテストした。

OS: MacOS, Linux
Python version: >= 3.9

#PyPI ( link )
mamba create -n lovis4u python=3.11 -y
conda activate lovis4u
python3 -m pip install lovis4u

#Linuxマシンを使っている場合、インストール後に'lovis4u --linux` コマンドを実行して、Linuxのmmseqsバイナリに切り替える必要がある
lovis4u --linux

#mmseqsが入ってない場合は導入する
mamba install -c conda-forge -c bioconda mmseqs2 -y

> lovis4u -h

LoVis4u (version 0.0.9):

Home page and documentation: https://github.com/art-egorov/lovis4u

The Atkinson Lab 4U | AE

-------------------------------

COMMAND-LINE PARAMETERS

-------------------------------

[POST-INSTALL STEPS]

--data

Creates the 'lovis4u_data' folder in the current working directory.

The folder contains adjustable configuration files used by lovis4u

(e.g. config, palettes...)

--linux

Replaces the mmseqs path in the pre-made config file from the MacOS

version [default] to the Linux version.

--mac

Replaces the mmseqs path in the pre-made config file from the Linux

version [default] to the MacOS version.

-------------------------------

[MANDATORY ARGUMENTS]

-gff <folder>

Path to a folder containing extended gff files.

Each gff file should contain corresponding nucleotide sequence.

(designed to handle pharokka produced annotation files).

-gb <folder>

Path to a folder containing genbank files.

-------------------------------

[OPTIONAL ARGUMENTS | DATA PROCESSING]

-ufid, --use-filename-as-id

Use filename (wo extension) as track (contig) id instead

of the contig id written in the gff/gb file.

-laf, --locus-annotation-file <file path>

Path to the locus annotation table.

(See documentation for details)

-faf, --feature-annotation-file <file path>

Path to the feature annotation table.

(See documentation for details)

-mmseqs-off, --mmseqs-off

Deactivate mmseqs clustering of proteomes of loci.

-cl-owp, --cluster-only-window-proteins

Cluster only proteins that are overlapped with

the visualisation windows, not all.

-fv-off, --find-variable-off

Deactivate annotation of variable or conserved protein clusters.

-cl-off, --clust_loci-off

Deactivate defining locus order and using similarity based hierarchical

clustering of proteomes.

-oc, --one-cluster

Consider all sequences to be members of one cluster but use clustering

dendrogram to define the optimal order.

-reorient_loci, --reorient_loci

Auto re-orient loci (set new strands) if they are not matched.

(Function tries to maximise co-orientation of homologous features.)

-------------------------------

[OPTIONAL ARGUMENTS | LOCUS VISUALISATION]

-sgc-off, --set-group-colour-off

Deactivate auto-setting of feature fill and stroke colours.

(Pre-set colours specified in feature annotation table will be kept.)

-sgcf, --set-group-colour-for <feature_group1 [feature group2 ...]>

Space-separated list of feature groups for which colours should be set.

[default: variable, labeled]

-scc, --set-category-colour

Set category colour for features and plot category colour legend.

-cct, --category-colour-table <file path>

Path to the table with colour code for categories.

Default table can be found in lovis4u_data folder.

-lls, --locus-label-style <id|description|full>

Locus label style based on input sequence annotation.

-llp, --locus-label-position <left|bottom>

Locus label position on figure.

-safl, --show-all-feature-labels

Display all feature labels.

-sflf, --show-feature-label-for <feature_group1 [feature group2 ...]>

Space-separated list of feature groups for which label should be shown.

[default: variable, labeled]

-sfflf, --show-first-feature-label-for <feature_group1 [feature group2 ...]>

Space-separated list of feature group types for which label will be displayed

only for the first occurrence of feature homologues group.

[default: shell/core]

-ifl, --ignored-feature-labels <feature_label1 [feature_label2 ...]>

Space-separated list of feature names for which label won't be shown.

[default: hypothetical protein, unknown protein]

-sxa, --show-x-axis

Plot individual x-axis for each locus track.

-hix, --hide-x-axis

Do not plot individual x-axis for each locus track.

-dml, --draw-middle-line

Draw middle line for each locus.

-mm-per-nt, --mm-per-nt <float value>

Scale which defines given space for each nt cell on canvas.

[default: 0.05]

-fw, --figure-width <float value>

Output figure width in mm.

-------------------------------

[OPTIONAL ARGUMENTS | ADDITIONAL TRACKS]

-hl, --homology-links

Draw homology link track.

-slt, --scale-line-track

Draw scale line track.

-------------------------------

[OPTIONAL ARGUMENTS | OTHERS]

-o <name>

Output dir name. It will be created if it does not exist.

[default: lovis4u_{current_date}; e.g. uorf4u_2022_07_25-20_41]

--pdf-name <name>

Name of the output pdf file (will be saved in the output folder).

[default: lovis4u.pdf]

-c <standard|<file.cfg>

Path to a configuration file or name of a pre-made config file

[default: standard]

-------------------------------

[MISCELLANEOUS ARGUMENTS]

-h, --help

Show this help message and exit.

-v, --version

Show program version.

--debug

Provide detailed stack trace for debugging purposes.

--parsing-debug

Provide detailed stack trace for debugging purposes

for failed reading of gff/gb files.

-q, --quiet

Don't show progress messages.

テストラン

テスト用のデータをカレントにコピーするオプションが用意されている。

lovis4u --data
cd lovis4u_data/guide/

実行するにはGFFファイルのディレクトリを指定する。

cd lovis4u_data/guide/
lovis4u -gff gff_files/ -hl --set-category-colour -c A4p2

--set-category-colour Set category colour for features and plot category colour legend.
-c Path to a configuration file or name of a pre-made config file [default: standard]

結果はディレクトリに保存される。

lovis4u.pdf

レイアウトは便利なプリセットパラメータが準備されている。-cで指定する。

（gallaryより転載）

単一配列の視覚化。相同タンパク質群を異なる色でハイライトする。

cd lovis4u/lovis4u/lovis4u_data/guide/
lovis4u -gff single_gff_file/ -hl --set-category-colour -c A4p2 --set-group-colour-for conserved

-hl Draw homology link track.
--set-category-colour Set category colour for features and plot category colour legend.
-c Path to a configuration file or name of a pre-made config file [default: standard]
--set-group-colour-for Space-separated list of feature groups for which colours should be set. [default: variable, labeled]

78の大腸菌ファージのBASEL phage collection（Maffei et.al. PLOS Biology）

 lovis4u -gff BaselCollection/  -hl --set-category-colour -c A4p2 -fw 500

（途中まで）

その他

"--locus-annotation-file"でTSVを指定することでユーザーが定義した領域を視覚化できる
細菌ゲノム全体ではなく、特定のゲノム領域やプラスミド、ファージなどの視覚化向けに設計されている。

引用

LoVis4u: Locus Visualisation tool for comparative genomics

Artyom A. Egorov, Gemma C. Atkinson

bioRxiv, Posted September 14, 2024.

関連

高速かつ高感度なプロテイン検索ツール MMseqs2

2024-10-09

PCRプライマーをデザインする primers

PCR primer

レポジトリより

これはPCRプライマーを作成するための小さくて簡単なツールである。用途はDNAアセンブリである。Primer3の代わりにプライマーを選択する理由は以下の通り：

特徴：Gibson アセンブリーやGolden GateクローニングのようなDNAアセンブリーフローに特化している。プライマーの5'末端に配列を付加しながらプライマーを設計できる。
シンプルさ：小さくてシンプルなPython CLI/ライブラリで、依存関係は1つ（seqfold）。インストールが簡単で使いやすい。
インターフェース：Biopython Seqクラスのプライマーを受け入れて作成する。他のアプリケーションとの統合を容易にするためにJSONを出力する。
ライセンス: コピーレフトなGPL v2ライセンスではなく、寛容でビジネスフレンドリーなMITを採用。

インストール

Github

#PyPI
pip install primers

> primers -h

usage: primers [-h] [--version] {score,create} ...

Create or score PCR primers

positional arguments:

{score,create}

score score existing primers

create create primers to amplify a sequence

optional arguments:

-h, --help show this help message and exit

--version show program's version number and exit

> primers score -h

$ primers score -h

usage: primers score [-h] [-s SEQ] [-t SEQ] [-j | --json | --no-json] primer [primer ...]

positional arguments:

primer primer sequences

optional arguments:

-h, --help show this help message and exit

-s SEQ the sequence that was amplified

-t SEQ sequence to check for off-target binding sites

-j, --json, --no-json

write the primers to a JSON array

> primers create -h

$ primers create -h

usage: primers create [-h] [-f SEQ] [-fl INT INT] [-r SEQ] [-rl INT INT] [-t SEQ] [-j | --json | --no-json] SEQ

positional arguments:

SEQ create primers to amplify this sequence

optional arguments:

-h, --help show this help message and exit

-f SEQ additional sequence to add to FWD primer (5' to 3')

-fl INT INT space separated min-max range for the length to add from '-f' (5' to 3')

-r SEQ additional sequence to add to REV primer (5' to 3')

-rl INT INT space separated min-max range for the length to add from '-r' (5' to 3')

-t SEQ sequence to check for off-target binding sites

-j, --json, --no-json

write the primers to a JSON array

実行方法

primersは、長さ、tm、GC比、二次構造、オフターゲット結合を最適化しながらペアを選ぶ。最も単純なケースでは、増幅したい配列を渡すだけでデザインできる。

primers create CTACTAATAGCACACACGGGGACTAGCATCTATCTCAGCTACGATCAGCATC

出力例

設計するプライマーの5'末端を任意の配列にする（ターゲットにはない配列；例えば制限酵素でカットされる配列など）。

primers create CTACTAATAGCACACACGGGGACTAGCATCTATCTCAGCTACGATCAGCATC -f GGG -r CCC

-f additional sequence to add to FWD primer (5' to 3')
-r additional sequence to add to REV primer (5' to 3')

5'末端に付加する最適な部分配列を選択させたい場合、-fl or -rlでmin-maxサイズを指定する。

primers create CTACTAATAGCACACACGGGGACTAGCATCTATCTCAGCTACGATCAGCATC -fl 2 5 -f CCCCC

-fl INT INT space separated min-max range for the length to add from '-f' (5' to 3')
-rl INT INT space separated min-max range for the length to add from '-r' (5' to 3')

CLIでは固定のデザインパラメータが適用されますが、pythonのコマンドプロンプトではPrimer3の任意の設定を適用してプライマーを設計する事ができます。レポジトリで簡単な例が説明されています。

ほか、レポジトリより

通常、オフターゲット結合部位は避けるべきである。primersでは、オフターゲット結合部位とは、プライマーの3'末端の最後の10ベアペアに<= 1のミスマッチがある部位を指す。この定義は実験的に支持されている：Wu, J. H., Hong, P. Y., & Liu, W. T. (2009). Wu, J. H., Hong, P. Y., & Liu, W. T. (2009).
デフォルトでは、プライマーはcreate(seq)に渡されたseqパラメータ内のオフターゲットをチェックする。しかし、オプションのofftarget_check引数（CLIの場合は-t）に別の配列を渡すと、プライマーを別の配列に対してチェックすることができる。これはプラスミドのような大きなDNA配列の一部をPCRするときに便利である。
"--json"フラグは、JSON形式でプライマーを表示し、スコアリングの詳細を表示する。

引用

https://github.com/Lattice-Automation/primers

関連

2024-10-08

k-merの起源となる配列を見つける Back to sequences

k-mer 高速なツール fasta/fastqの操作 JOSS 2024

2024/10/09追記

　生のシーケンスデータの処理に特化したバイオインフォマティクスツールの大部分は、k-mersの概念を多用している。これにより、データの冗長性（ひいてはメモリの圧迫）を減らし、シーケンスエラーを破棄し、操作可能で容易に比較できる固定サイズのオブジェクトを処分することができる。欠点は、各k-merとそれが属する元の配列セットとの間のリンクが一般に失われることである。この文脈で考慮されるデータ量を考えると、この関連性を探し出すにはコストがかかる。この研究では、「back_to_sequences 」を紹介する。「back_to_sequences 」は、興味のあるk-merの集合にインデックスを付け、配列の集合をストリーミングし、インデックス付けされたk-merを少なくとも1つ含むものを抽出するように設計されたシンプルなツールである。さらに、配列中のk-merの出現数も提供する。back_to_sequencesは1ミリ秒あたり約200のショートリードをストリームし、数億のリード中のk-merを数分で検索できる。

Documentation

https://b2s-doc.readthedocs.io/en/latest/index.html

My first paper on @JOSS_TheOJ ! https://t.co/kbhzvWh0Xp
— Pierre Marijon 🏳️‍🌈 (@pierre_marijon) 2024年9月24日

Hey kmers' fans!
The "back_to_sequences" tool https://t.co/IgCdQbOOzd is now published in JOSS https://t.co/12CulElV0v.
Find sequences (reads, unitigs, genes) related to a set of kmers in large datasets, in a matter of seconds.
More to read in this thread: https://t.co/Eh7QI33y3w pic.twitter.com/OhK19xJEcF
— Pierre Peterlongo (@p_peterlongo) 2024年9月23日

インストール

ubuntu22.04LTSでテストした（rustc --version => rustc 1.78.0）。

To compile from source

Rust

Github

git clone https://github.com/pierrepeterlongo/back_to_sequences.git
cd back_to_sequences
RUSTFLAGS="-C target-cpu=native" cargo install --path .

> back_to_sequences -h

Back to sequences: find the origin of kmers

Usage: back_to_sequences [OPTIONS] --in-kmers <IN_KMERS>

Options:

--in-kmers <IN_KMERS>

Input fasta file containing the original kmers

Note: back_to_sequences considers the content as a set of kmers

This means that a kmer is considered only once,

even if it occurs multiple times in the file.

If the stranded option is not used (default), a kmer

and its reverse complement are considered as the same kmer.

--in-sequences <IN_SEQUENCES>

Input fasta or fastq [.gz] file containing the original sequences (eg. reads).

The stdin is used if not provided

(and if `--in_filelist` is not provided neither) [default: ]

--in-filelist <IN_FILELIST>

Input txt file containing in each line a path to a fasta or fastq [.gz] file

containing the original sequences (eg. reads).

Note1: if this option is used, the `--out_filelist` option must be used.

The number of lines in out_filelist must be the same as in_filelist

Note2: Incompatible with `--in_sequences` [default: ]

--out-sequences <OUT_SEQUENCES>

Output file containing the filtered original sequences (eg. reads).

It will be automatically in fasta or fastq format depending on the input file.

If not provided, only the in_kmers with their count is output [default: ]

--out-filelist <OUT_FILELIST>

Output txt file containing in each line a path to a fasta or fastq [.gz] file

that will contain the related output file from the input files list [default: ]

--out-kmers <OUT_KMERS>

If provided, output a text file containing the kmers that occur in the reads

with their

* number of occurrences

* their occurrence positions if the --output_kmer_positions option is used

Note: if `--in_filelist` is used the output counted kmers are

those occurring the last input file of that list [default: ]

--counted-kmer-threshold <COUNTED_KMER_THRESHOLD>

If out_kmers is provided, output only reference kmers whose number of occurrences

is at least equal to this value.

If out_kmers is not provided, this option is ignored [default: 0]

--output-kmer-positions

If out_kmers is provided, either only count their number of occurrences (default)

or output their occurrence positions (read_id, position, strand) if this option is used

--output-mapping-positions

If provided, output matching positions on sequences in the

out_sequence file(s)

-k, --kmer-size <KMER_SIZE>

Size of the kmers to index and search [default: 31]

-m, --min-threshold <MIN_THRESHOLD>

Output sequences are those whose ratio of indexed kmers is in ]min_threshold; max_threshold]

Minimal threshold of the ratio (%) of kmers that must be found in a sequence to keep it (default 0%).

Thus by default, if no kmer is found in a sequence, it is not output. [default: 0]

--max-threshold <MAX_THRESHOLD>

Output sequences are those whose ratio of indexed kmers is in ]min_threshold; max_threshold]

Maximal threshold of the ratio (%) of kmers that must be found in a sequence to keep it (default 100%).

Thus by default, there is no limitation on the maximal number of kmers found in a sequence. [default: 100]

--stranded

Used original kmer strand (else canonical kmers are considered)

--query-reverse

Query the reverse complement of reads. Useless without the --stranded option

--no-low-complexity

Do not index low complexity kmers (ie. with a Shannon entropy < 1.0)

-t, --threads <THREADS>

Number of threads

Note: if not provided, the number of threads is set to the number of logical cores [default: 0]

-h, --help

Print help

-V, --version

Print version

実行方法

探索するk-mer配列のfastaと探索対象のリード、出力として返すリード名、さらに任意でk-merカウント数を示したテキストを指定する。

back_to_sequences --in-kmers compacted_kmers.fasta --in-sequences reads.fasta --out-sequences filtered_reads.fasta  --out-kmers counted_kmers.txt

--in-kmers Input fasta file containing the original kmers
--in-sequences Input fasta or fastq [.gz] file containing the original sequences (eg. reads)
--out-sequences Output file containing the filtered original sequences (eg. reads).
--out-kmers If provided, output a text file containing the kmers that occur in the reads with their * number of occurrences or * their occurrence positions if the --output_kmer_positions option is used
-k Size of the kmers to index and search [default: 31]
-t Number of threads

出力例

filtered_reads.fastaは、入力にfastqフォーマットのリードを指定した時はfastqフォーマットとして出力される。

> head counted_kmers.txt

$ head counted_kmers.txt

CGCTCTATCTGAAACGCGGCGATACGATTTA 1

CGGAATTTGGAGAGAGTATTTTGGTCACGGC 1

ACCGACACGAAGCGACCCTCGCCCTGATCAT 2

CGCAACGTCGCGCGGATGACCGTGCTGCTCC 1

CTGCTCGATCTGTTCTTTGGTGACGGGCTTA 1

CCGCGCCAGCTTGATCGCCATAGTGCCAATC 1

ACATCGCCCACTTCGCGCAGACGAAAGCCCG 1

AACCTTCCGCAAAGGCAAGCGCCATACCGAC 1

ATTTTTCGGTCGCCATTGGCGCAAGCGCTGA 1

CGGCGGTACGTTCACATTCTCAGCGAGCTCC 1

compacted_kmers.fastaに50%以上、多くても70%以下のkmerが含まれる配列のみ出力するフィルターを指定。

back_to_sequences --in-kmers compacted_kmers.fasta --in-sequences reads.fasta --out-sequences filtered_reads.fasta  --out-kmers counted_kmers.txt --min-threshold 50 --max-threshold 70

--min-threshold Output sequences are those whose ratio of indexed kmers is in ]min_threshold
--max-threshold Output sequences are those whose ratio of indexed kmers is in ]min_threshold; max_threshold]

ほかにもstrand指定や逆相補の配列をターゲットにするオプションなどがある（Document参照）。

論文より

配列の集合から目的の一意なk-merを見つけるには、古典的なgrepや、The Platinum Searcher（PT）や The Silver Searcherのような最近のパターンマッチングツールを使うことができる。テストとして、MacBook、Apple M2 proを使用し、grep、The Platinum Searcher、The Silver Searcherを使用して、Q100Mデータセット（セクション2.1参照）の1つのk-merを照会した。grepは44秒を要し、ptは15秒、agは400秒後に停止した（result）。従って、単純に外挿すると、grepの場合は1台のコンピューターで100万個のk-merを検索するのに約500日かかり、back_to_sequencesを使った場合は5～6分である。

このプログラムを適切なk値で使うと、関心ある配列をシークエンスしたリードをfastq全部から濃縮することもできると思います（Mirabaitのように）。配列からk-merを発生させるにはjellyfishが使えます。jellyfishは過去に紹介しています（link）。

引用

Back to sequences: Find the origin of 𝑘-mers

Anthony Baire, Pierre Marijon, Francesco Andreace, Pierre Peterlongo

JOSS, Published: 23 September 2024

関連

2024-10-03

シークエンシングリードから直接分類学的プロファイリングを行う MetabuliのGUIアプリケーション（ノートPCでも動作）

GUIツール metagenome GTDB 生物種の推定 (taxonomic profiling)

MetabuliのGUIアプリがリリースされているので簡単に紹介します。

https://github.com/steineggerlab/Metabuli-App/releases/tag/v1.0.0

"これはMetabuli Appの最初のリリースで、これまでコマンドライン経由でのみ利用可能だったMetabuliメタゲノム分類ツールをデスクトップユーザーに提供する。このアプリにより、ユーザーは数回クリックするだけで分類学的プロファイリングを実行したり、過去のレポートをアップロードして素早く視覚化することができる。インタラクティブでカスタマイズ可能なサンキープロットとkronaチャートも含まれており、データの可視化と結果の探索が強化されている。また、特定の分類群に分類されたリードを抽出し、サンキー画像をダウンロードすることもできる。さらに、このアプリには過去10件のジョブ履歴が保存され、すばやくアクセスできるようになっている。"

Our metagenomic classifier Metabuli, which combines AA/DNA information, is now available as app on Windows, MacOS, and Linux. The app developed by @haysunny_hi allows classifying and analyzing metagenomic samples on a notebook! Read more about the app here https://t.co/pfim5rZabc pic.twitter.com/mAldCubH9R
— Martin Steinegger 🇺🇦 (@thesteinegger) October 1, 2024

インストール

M1 macstudioでテストした。

Platforms Supported

macOS (Universal .dmg)
Windows (.exe)
Linux (AppImage .AppImage)

Github

https://github.com/steineggerlab/Metabuli-App/releasesよりダウンロードする。

実行方法

上に貼られている動画で使い方は説明されています。ここでは流れだけ簡単に確認します。

ダウンロード後、Applicationsにコピーして実行する。

セキュリティの関係で開けないという警告が出る場合、システム環境設定＝＞セキュリティで”ここまま開く”を選ぶ。

起動した。シングルエンドとペアエンドのショートリード、ロングリードを分析できる。

初回はデータベースをダウンロードする必要がある（MetabuliのDBをダウンロードしているならそれも使える）。

ダウンロードできるDB。目的に合わせて選ぶ。

ここでは以前MetabuliのサイトからダウンロードしたGTDB R214を指定した。出力と最大RAM使用量も指定する。

計算にはしばらく時間がかかる（M1 macstudioだと200MBx2のペアエンドfastqで15秒ほど）。

ここでは”LOAD SAMPLE DATA”の結果を見てみる。

rootから種までアサインされたリードの比率が示されている。

Sankey plot

Configure

図はHTMLや画像として保存できる。

Krona plot

右端のextract readsボタンからはその分類群のfastqを取り出して書き出すことが出来る。

Metabuliはユーザー指定のランダムアクセスメモリー（RAM）制限内で動作し、ストレージに保存さえできれば、あらゆるサイズのデータベースを使用できる。コマンドライン版では、8ギガバイトのRAMを搭載したM1 MacBook airのようなノートブックでも動作することがアナウンスされている。
MetabuliはRAM上にDBをロードしない。たくさんサンプルがあっても、DBのロード時間を気にすることなく何度も実行できる（直接教えていただきました）。

引用

Metabuli: sensitive and specific metagenomic classification via joint analysis of amino acid and DNA

Jaebeom Kim & Martin Steinegger

Nature Methods, Published: 20 May 2024

関連

2024-10-02

minimap2インデックスに既知バリアント情報を組み込むことで、WGSでのSNVコールを改善する minimap2_index_modifier

human genome variant index mapping 2024 BMC Bioinformatics

　リファレンスゲノム配列に対するリードのアライメントは、次世代シーケンサー（NGS）技術によって得られたヒト全ゲノムシーケンスデータの解析における重要なステップの1つである。遺伝的変異の臨床的解釈の結果やゲノムワイド関連研究GWASの結果など、その後の解析ステップの質は、アライメントの結果としてリードの位置が正しく特定されるかどうかに依存している。ヒトNGS全ゲノムシーケンスデータの量は常に増加している。世界中で多くのヒトゲノム配列決定プロジェクトが行われており、その結果、配列決定されたヒトゲノムの遺伝子変異に関する大規模なデータベースが作成されている。このような既知の遺伝子変異に関する情報は、例えばゲノムグラフを作成することで、新しい個体について得られたシーケンスデータを解析する際のリードアライメント段階におけるアライメントの質を向上させるために利用することができる。リードを線形リファレンスゲノムにアライメントする既存の方法はアライメント速度が速いが、リードをゲノムグラフにアライメントする方法はゲノムの可変領域において精度が高い。線形リファレンス配列のインデックスに既知の遺伝子変異を考慮したリードアライメント法を開発することで、両手法の利点を組み合わせることができる。

　本論文では、minimap2_index_modifierツールを紹介する。このツールは、既知の一塩基バリアントとヒト集団特有の挿入/欠失（indel）を用いて、リファレンスゲノムの修正インデックスを構築することができる。modified minimap2インデックスの使用により、バイオインフォマティクスパイプラインを変更することなく、また計算オーバーヘッドを大幅に追加することなく、バリアントコールの品質が向上する。PrecisionFDA Truth Challenge V2のベンチマークデータ（GRCh38リニアリファレンス（GCA_000001405.15）にアライメントされたHG002ショートリードデータ、パラメータk = 27、w = 14）を用いて、Human Pangenome Reference Consortiumの遺伝的バリアントを用いてインデックスを修正すると、偽陰性の遺伝的バリアントの数が9500以上減少し、偽陽性の数が7000以上減少することが実証された。

~minimap2ツールの主な特徴の1つは、minimizersを使用することである。minimizersとは、与えられた配列内に位置する短い部分文字列（または、そうでなければk-measure）のことである。minimizers（最小化子）という用語は、アライメントの過程で比較する必要のある部分文字列の数を最小化するという考え方に由来する。minimizerは、与えられた配列ウィンドウ内の各k-merに与えられたハッシュ関数を適用し、ユニークな整数値を作成することで計算される。次に整数値をソートし、最小値を持つk-measureを最小化子として選択する（論文図2）。ミニマイザーを使用することで、アライメント処理が高速になる。

線形リファレンス配列にインデックスを付ける場合、ハッシュテーブルが作成され、キーは最小化子配列にハッシュ関数を適用した結果であり、値はゲノム参照配列の位置のリストである。

minimap2ツールのインデックスを変更する可能性は、ハッシュテーブルが線形リファレンス配列の所与の位置における最小化因子の数に制約を課さないという事実によって提供され、遺伝的バリアントに関する情報を追加しても、その後のアラインメントアルゴリズムには影響しない。言い換えれば、線形リファレンス配列インデックスは、ゲノムグラフと同様に、遺伝的バリアントの追加によって誘導される分岐の追加を可能にする（論文図3）。（以下省略）

インストール

依存

To compile from source, use this version of tools:

GCC/G++ 11.4.0+
HTSlib v1.17

Github

#HTSlibが必要（samtools解説）先にminimap2_index_modifierをcloneして、minimap2_index_modifierのrootで実行する
git clone --recurse-submodules https://github.com/samtools/htslib.git
cd htslib/
autoheader
autoreconf -i
./configure --prefix=/usr/local
make -j10
make install
cd ../

#本体(minimap2のfolk)*1（注;特別なoptionを使わない限り本家minimap2と同じものだが、パスを通す時は本家と混同しないように注意）
git clone https://github.com/ispras/minimap2_index_modifier.git
cd minimap2_index_modifier/
make -j10 

#docker image(hub)
docker pull egorguga/minimap2_index_modifier:2.24

> ./minimap2

Usage: minimap2 [options] <target.fa>|<target.idx> [query.fa] [...]

Options:

Indexing:

-H use homopolymer-compressed k-mer (preferrable for PacBio)

-k INT k-mer size (no larger than 28) [15]

-w INT minimizer window size [10]

-I NUM split index for every ~NUM input bases [4G]

-d FILE dump index to FILE

--vcf-file-with-variants FILE pass VCF FILE to modify index

--parse-haplotype parse haplotypes from VCF to generate real SNP combinations, otherwise use all.

Mapping:

-f FLOAT filter out top FLOAT fraction of repetitive minimizers [0.0002]

-g NUM stop chain enlongation if there are no minimizers in INT-bp [5000]

-G NUM max intron length (effective with -xsplice; changing -r) [200k]

-F NUM max fragment length (effective with -xsr or in the fragment mode) [800]

-r NUM[,NUM] chaining/alignment bandwidth and long-join bandwidth [500,20000]

-n INT minimal number of minimizers on a chain [3]

-m INT minimal chaining score (matching bases minus log gap penalty) [40]

-X skip self and dual mappings (for the all-vs-all mode)

-p FLOAT min secondary-to-primary score ratio [0.8]

-N INT retain at most INT secondary alignments [5]

Alignment:

-A INT matching score [2]

-B INT mismatch penalty (larger value for lower divergence) [4]

-O INT[,INT] gap open penalty [4,24]

-E INT[,INT] gap extension penalty; a k-long gap costs min{O1+k*E1,O2+k*E2} [2,1]

-z INT[,INT] Z-drop score and inversion Z-drop score [400,200]

-s INT minimal peak DP alignment score [80]

-u CHAR how to find GT-AG. f:transcript strand, b:both strands, n:don't match GT-AG [n]

Input/Output:

-a output in the SAM format (PAF by default)

-o FILE output alignments to FILE [stdout]

-L write CIGAR with >65535 ops at the CG tag

-R STR SAM read group line in a format like '@RG\tID:foo\tSM:bar'

-c output CIGAR in PAF

--cs[=STR] output the cs tag; STR is 'short' (if absent) or 'long' [none]

--MD output the MD tag

--eqx write =/X CIGAR operators

-Y use soft clipping for supplementary alignments

-t INT number of threads [3]

-K NUM minibatch size for mapping [500M]

--version show version number

Preset:

-x STR preset (always applied before other options; see minimap2.1 for details)

- map-pb/map-ont - PacBio CLR/Nanopore vs reference mapping

- map-hifi - PacBio HiFi reads vs reference mapping

- ava-pb/ava-ont - PacBio/Nanopore read overlap

- asm5/asm10/asm20 - asm-to-ref mapping, for ~0.1/1/5% sequence divergence

- splice/splice:hq - long-read/Pacbio-CCS spliced alignment

- sr - genomic short-read mapping

See `man ./minimap2.1' for detailed description of these and other advanced command-line options.

ビルド済みindex

２つのヒトゲノムアセンブリバージョンでのmodified indexが利用できる。

Nextcloud

テストラン

test/README.mdの通り進めます。

1、indexing

cd minimap2_index_modifier/tests/
#比較のため通常のindexとバリアントを反映させたindexを作る
#A normal
../minimap2 -ax sr -d test.mni test.fasta

#B modified index (このminimap2 folkでないと実行できない点に注意)
../minimap2 -ax sr -d test.modified.mni --vcf-file-with-variants test_long_chr1_not_the_same.vcf.gz test.fasta

=>diffコマンドで異なるか確認
diff test.mni test.modified.mni

２、mapping

../minimap2 -ax sr -t 4 test.modified.mni test.1.fq test.2.fq > mapped.sam

アラインメントフォーマットには変化はないが、バリアントを含む部位でのアラインメントが改善されている可能性がある。

レポジトリと論文より

入力 VCF ファイルにリファレンスからのvariant が含まれていない場合、修正されたインデックスは通常のものと同じになる。
既知の遺伝子変異を含むデータベースの例としては、dbSNP、gnomAD、1000 Genomes Project、Human Pangenome Reference Consortiumなどがある。これらのデータベースはすべて、線形リファレンス配列インデックスを修正するために使用できる。1000 Genomes Projectのデータセットには段階的な遺伝子型が含まれており、一緒に出現する遺伝子変異のみを考慮することができる。

パイプラインに影響を与えないため、最小限の手間でバリアントコールパイプラインを修正することが可能になります。気になるのは、図７のグラフで、わずかな差ではありますが、デフォルトindexのminimap2はrecallの指標でbwa mem2に負けていて、ラージコホートのバリアントを使ったmodified indexを使ったminimap2はbwa mem2を上回る点でしょうか。当時minimap2がショートリードマッピングでわずかに劣る可能性が指摘されてましたが、それが見えているのかと思いました

引用

Enhancing SNV identification in whole-genome sequencing data through the incorporation of known genetic variants into the minimap2 index

Egor Guguchkin, Artem Kasianov, Maksim Belenikin, Gaukhar Zobkova, Ekaterina Kosova, Vsevolod Makeev & Evgeny Karpulevich

BMC Bioinformatics volume 25, Article number: 238 (2024)

デフォルトでは"../htslib"にある HTSlibに対してビルドされるので、取ってきたminimap2_index_modifierの直下でHTSlibライブラリをビルドした。

関連

高速なロング/ショートリードアライナー minimap2

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

複雑な構造変異を検出するためのロングリードベースの手法 FindCSV

大規模なゲノム配列セットのANI値を計算する LZ-ANI

比較ゲノミクスのための遺伝子座の可視化ツール LoVis4u

PCRプライマーをデザインする primers

k-merの起源となる配列を見つける Back to sequences

シークエンシングリードから直接分類学的プロファイリングを行う MetabuliのGUIアプリケーション（ノートPCでも動作）

minimap2インデックスに既知バリアント情報を組み込むことで、WGSでのSNVコールを改善する minimap2_index_modifier