This problem is taken from the book Introduction to Algorithms, Third Edition By Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein:
In order to transform one source string of text x[1..m] to a target string y[1..n], we can perform various transformation operations. Our goal is, given x and y, to produce a series of transformations that change x to y. We use an array ́z—assumed to be large enough to hold all the characters it will need—to hold the intermediate results. Initially, z is empty, and at termination, we should have z[j] = y[j] for j = 1, 2,..., n. We maintain current indices i into x and j into z, and the operations are allowed to alter z and these indices. Initially, i = j = 1. We are required to examine every character in x during the transformation, which means that at the end of the sequence of transformation operations, we must have i = m + 1.The running time and space requirement of my solution is
We may choose from among six transformation operations:
Copy a character from x to z by setting ́z[j] = x[i] and then incrementing both i and j. This operation examines x[i].
Replace a character from x by another character c, by setting z[j] = c, and then incrementing both i and j . This operation examines x[i].
Delete a character from x by incrementing i but leaving j alone. This operation examines x[i].
Insert the character c into z by setting z[j] = c and then incrementing j , but leaving i alone. This operation examines no characters of x.
Twiddle (i.e., exchange) the next two characters by copying them from x to z but in the opposite order; we do so by setting z[j] = x[i + 1] and ́z[j + 1] = x[i] and then setting i = i + 2 and j = j + 2. This operation examines x[i] and x[i + 1].
Kill the remainder of x by setting i = m + 1. This operation examines all characters in x that have not yet been examined. This operation, if performed, must be the final operation.
a. Given two sequences x[1..m] and y[1..n] and set of transformation operation costs, the edit distance from x to y is the cost of the least expensive operation sequence that transforms x to y. Describe a dynamic-programming algorithm that finds the edit distance from x[1..m] to y[1..n] and prints an optimal operation sequence. Analyze the running time and space requirements of your algorithm
Optimal substructure of ED:
We define for and as the minimum edit distance to transform to without kill:
For and
For and :
For and :
For and :
Then let be the minimum edit distance with kill:
Optimal alignment
The edit-distance problem generalizes the problem of aligning two DNA sequences. There are several methods for measuring the similarity of two DNA sequences by aligning them. One such method to align two sequences and consists of inserting spaces at arbitrary locations in the two sequences (including at either end) so that the resulting sequences and have the same length but do not have a space in the same position (i.e., for no position are both and a space). Then we assign a “score” to each position. Position receives a score as follows:For the optimal alignment only the copy, replace, delete and insert operations are needed.
The score for the alignment is the sum of the scores of the individual positions. For example, given the sequences x = GATCGGCAT and y = CAATGTGAATC, one alignment is
- if and neither is a space,
- if and neither is a space,
- if either or is a space.
G ATCG GCAT
CAAT GTGAATC
- *++*+*+ -++*
A under a position indicates a score of for that position, a indicates a score of , and a indicates a score of , so that this alignment has a total score of
.
b. Explain how to cast the problem of finding an optimal alignment as an edit distance problem using a subset of the transformation operations copy, replace,delete, insert, twiddle, and kill.
- Copy is used for a position with a score of +.
- Replace is used for a position with a score of -.
- Insert and delete is used for a position with a score of * (For insert the white space will end up in x' and for delete in y').
So we just need to solve the edit distance problem with this subset of operations and the scores negated.
Code in Rust:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 | use std::ops::{Index, IndexMut}; #[derive(Debug)] struct Matrix<T> { vec: Vec<T>, rows: usize, columns: usize, } impl<T: Clone> Matrix<T> { fn new(default: T, r: usize, c: usize) -> Matrix<T> { Matrix { vec: vec![default; r * c], rows: r, columns: c, } } } impl<T> Matrix<T> { fn check(&self, row: usize, column: usize) { assert!(row < self.rows, "row out of range"); assert!(column < self.columns, "column out of range"); } } impl<T> IndexMut<(usize, usize)> for Matrix<T> { fn index_mut<'a>(&'a mut self, index: (usize, usize)) -> &'a mut T { let (r, c) = index; self.check(r, c); &mut self.vec[r * self.columns + c] } } impl<T> Index<(usize, usize)> for Matrix<T> { type Output = T; fn index<'a>(&'a self, index: (usize, usize)) -> &'a T { let (r, c) = index; self.check(r, c); &self.vec[r * self.columns + c] } } #[derive(Debug, Copy, Clone)] struct EditPosition { idx: usize, prev: char, current: char, } impl EditPosition { fn new(i: usize, p: char, c: char) -> EditPosition { EditPosition { idx: i, prev: p, current: c, } } fn valid_twiddle(&self, other: &EditPosition) -> bool { self.idx > 1 && other.idx > 1 && self.current == other.prev && self.prev == other.current } fn valid_copy(&self, other: &EditPosition) -> bool { self.current == other.current } } use std::collections::HashMap; #[derive(Debug, Copy, Clone, PartialEq, Eq, Hash)] enum EditOp { Copy, Replace, Delete, Insert, Twiddle, Kill, } impl EditOp { fn valid(&self, i: EditPosition, j: EditPosition) -> bool { use EditOp::*; match *self { Copy => i.valid_copy(&j), Twiddle => i.valid_twiddle(&j), _ => true, } } fn displacement(&self) -> (usize, usize) { use EditOp::*; match *self { Copy | Replace => (1, 1), Delete => (1, 0), Insert => (0, 1), Twiddle => (2, 2), Kill => (0, 0), } } fn dec(&self, d: (usize, usize)) -> (usize, usize) { let (i, j) = self.displacement(); (d.0 - i, d.1 - j) } fn align<F, T>(&self, mut from_itr: F, mut to_itr: T, new_from: &mut String, new_to: &mut String) where F: Iterator<Item = char>, T: Iterator<Item = char> { fn add_white_space<I>(a: &mut String, b: &mut String, mut a_itr: I) where I: Iterator<Item = char> { a.push(a_itr.next().unwrap()); b.push(' '); } use EditOp::*; match *self { Copy | Replace => { new_from.push(from_itr.next().unwrap()); new_to.push(to_itr.next().unwrap()); } Delete => add_white_space(new_from, new_to, from_itr), Insert => add_white_space(new_to, new_from, to_itr), _ => panic!("Operation is not suported"), } } } type CostMap = HashMap<EditOp, isize>; type EditTable = Matrix<(EditOp, isize)>; fn create_table(from_len: usize, to_len: usize, costs: &CostMap) -> EditTable { use EditOp::*; let i_cost = *(costs.get(&Insert) .expect("Insert most be in costs map")); let d_cost = *(costs.get(&Delete) .expect("Delete most be in costs map")); let mut table = EditTable::new((Kill, 0), from_len, to_len); for i in 1..from_len { table[(i, 0)] = (Delete, table[(i - 1, 0)].1 + d_cost); } for j in 1..to_len { table[(0, j)] = (Insert, table[(0, j - 1)].1 + i_cost); } table } fn edit_distance(from: &str, to: &str, costs: &CostMap) -> (isize, Operations) { let from_len = from.chars().count(); let to_len = to.chars().count(); let mut table = create_table(from_len + 1, to_len + 1, costs); let (mut ilast, mut jlast) = (' ', ' '); for (i, ichar) in from.chars().enumerate() { for (j, jchar) in to.chars().enumerate() { let ij = (i + 1, j + 1); let ipos = EditPosition::new(ij.0, ilast, ichar); let jpos = EditPosition::new(ij.1, jlast, jchar); table[ij] = costs.iter() .filter(|&(o, _)| o != &EditOp::Kill) .filter(|&(o, _)| o.valid(ipos, jpos)) .map(|(o, c)| (*o, table[o.dec(ij)].1 + c)) .min_by_key(|&(_, c)| c) .expect("no operation was valid"); jlast = jchar; } ilast = ichar; } add_kill(&table, costs, from_len, to_len) } fn add_kill(table: &EditTable, costs: &CostMap, from_len: usize, to_len: usize) -> (isize, Operations) { let no_kill_cost = table[(from_len, to_len)].1; let (i, cost, kill) = costs.get(&EditOp::Kill) .map_or((from_len, no_kill_cost, None), |kc| min_with_kill(&table, from_len, to_len, *kc)); (cost, build_operations(kill, &table, i, to_len)) } type Operations = Vec<EditOp>; fn min_with_kill(table: &EditTable, from_size: usize, to_size: usize, kill_cost: isize) -> (usize, isize, Option<EditOp>) { let no_kill_cost = table[(from_size, to_size)].1; (1..from_size) .map(|i| (i, table[(i, to_size)].1 + kill_cost)) .map(|(i, c)| (i, c, Some(EditOp::Kill))) .chain(Some((from_size, no_kill_cost, None)).into_iter()) .min_by_key(|&(_, cost, _)| cost) .unwrap() } fn build_operations(kill: Option<EditOp>, table: &EditTable, i: usize, j: usize) -> Operations { let itr = std::iter::repeat(()) .scan((i, j), |s, _| { let op = table[*s].0; *s = op.dec(*s); Some(op) }) .take_while(|op| op != &EditOp::Kill); let mut stack: Vec<_> = kill.into_iter() .chain(itr) .collect(); stack.reverse(); stack } fn optimal_alignment(from: &str, to: &str, cost_map: &CostMap) -> (String, String) { let (_, ops) = edit_distance(from, to, cost_map); let mut new_from = String::new(); let mut new_to = String::new(); let mut from_itr = from.chars(); let mut to_itr = to.chars(); for op in ops {; op.align(&mut from_itr, &mut to_itr, &mut new_from, &mut new_to); } (new_from, new_to) } fn main() { let mut cost_map = HashMap::new(); cost_map.insert(EditOp::Delete, 2); cost_map.insert(EditOp::Insert, 2); cost_map.insert(EditOp::Copy, -1); cost_map.insert(EditOp::Replace, 1); let from = "CBADDKDDBBTPPEP"; let to = "CGAEEDDKDDBBTEEFGKKK"; let (f, t) = optimal_alignment(from, to, &cost_map); println!("{}\n{}", f, t); } |
No comments:
Post a Comment