Tuesday, October 4, 2016

Edit distance


This problem is taken from the book Introduction to Algorithms, Third Edition By Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein:
In order to transform one source string of text x[1..m] to a target string y[1..n], we can perform various transformation operations. Our goal is, given x and y, to produce a series of transformations that change x to y. We use an array ́z—assumed to be large enough to hold all the characters it will need—to hold the intermediate results. Initially, z is empty, and at termination, we should have z[j] = y[j] for j = 1, 2,..., n. We maintain current indices i into x and j into z, and the operations are allowed to alter z and these indices. Initially, i = j = 1. We are required to examine every character in x during the transformation, which means that at the end of the sequence of transformation operations, we must have i = m + 1.
We may choose from among six transformation operations:
Copy a character from x to z by setting ́z[j] = x[i] and then incrementing both i and j. This operation examines x[i].
Replace a character from x by another character c, by setting z[j] = c, and then incrementing both i and j . This operation examines x[i].
Delete a character from x by incrementing i but leaving j alone. This operation examines x[i].
Insert the character c into z by setting z[j] = c and then incrementing j , but leaving i alone. This operation examines no characters of x.
Twiddle (i.e., exchange) the next two characters by copying them from x to z but in the opposite order; we do so by setting z[j] = x[i + 1] and ́z[j + 1] = x[i] and then setting i = i + 2 and j = j + 2. This operation examines x[i] and x[i + 1].
Kill the remainder of x by setting i = m + 1. This operation examines all characters in x that have not yet been examined. This operation, if performed, must be the final operation.
a. Given two sequences x[1..m] and y[1..n] and set of transformation operation costs, the edit distance from x to y is the cost of the least expensive operation sequence that transforms x to y. Describe a dynamic-programming algorithm that finds the edit distance from x[1..m] to y[1..n] and prints an optimal operation sequence. Analyze the running time and space requirements of your algorithm
The running time and space requirement of my solution is O ( n m + m + n ).

Optimal substructure of ED:


let x = ( α 1 , α 2 , α 3 , . . . , α m ) and y = ( β 1 , β 2 , β 3 , . . . , β n ) .
We define c ( i , j ) for 0 i m and 0 j n as the minimum edit distance to transform ( α 1 , . . . , α i ) to ( β 1 , . . . , β j ) without kill:

For 1 i m and 1 j n
c ( i , j ) = m i n { c ( i 1 , j ) + c o s t ( d e l e t e ) c ( i , j 1 ) + c o s t ( i n s e r t ) { c ( i 1 , j 1 ) + c o s t ( c o p y ) i f α i = β j o t h e r w i s e c ( i 1 , j 1 ) + c o s t ( r e p l a c e ) { c ( i 2 , j 2 ) + c o s t ( t w i d d l e ) i f i > 1 j > 1 α i = β j 1 α i 1 = β j o t h e r w i s e

For 1 i m and j=0:
c(i,j)=c(i-1,j) +cost(delete)

For 1 j n and i=0:
c(i,j)=c(i,j-1) +cost(insert)

For i=0 and j=0:
c(i,j)=0

Then let k ( i , j ) be the minimum edit distance with kill:

k ( i , j ) = m i n { c ( i , j ) m i n ( { c ( 1 , j ) , c ( 2 , j ) , . . . , c ( i 1 , j ) } ) + c o s t ( k i l l )

Optimal alignment


The edit-distance problem generalizes the problem of aligning two DNA sequences. There are several methods for measuring the similarity of two DNA sequences by aligning them. One such method to align two sequences x and y consists of inserting spaces at arbitrary locations in the two sequences (including at either end) so that the resulting sequences x and y have the same length but do not have a space in the same position (i.e., for no position j are both x[j] and y[j] a space). Then we assign a “score” to each position. Position j receives a score as follows:
  • +1 if x[j]=y[j] and neither is a space,
  • 1 if x[j]y[j] and neither is a space,
  • 2 if either x[j] or y[j] is a space.
The score for the alignment is the sum of the scores of the individual positions. For example, given the sequences x = GATCGGCAT and y = CAATGTGAATC, one alignment is
G ATCG GCAT
CAAT GTGAATC
- *++*+*+ -++*
A + under a position indicates a score of +1 for that position, a indicates a score of 1, and a indicates a score of 2, so that this alignment has a total score of
6·12·14·2=4.
b. Explain how to cast the problem of finding an optimal alignment as an edit distance problem using a subset of the transformation operations copy, replace,delete, insert, twiddle, and kill.
For the optimal alignment only the copy, replace, delete and insert operations are needed.
  • Copy is used for a position with a score of +.
  • Replace is used for a position with a score of -.
  • Insert and delete is used for a position with a score of * (For insert the white space will end up in x' and for delete in y').
So we just need to solve the edit distance problem with this subset of operations and the scores negated. 

Code in Rust:



  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
use std::ops::{Index, IndexMut};

#[derive(Debug)]
struct Matrix<T> {
    vec: Vec<T>,
    rows: usize,
    columns: usize,
}

impl<T: Clone> Matrix<T> {
    fn new(default: T, r: usize, c: usize) -> Matrix<T> {
        Matrix {
            vec: vec![default; r * c],
            rows: r,
            columns: c,
        }
    }
}

impl<T> Matrix<T> {
    fn check(&self, row: usize, column: usize) {
        assert!(row < self.rows, "row out of range");
        assert!(column < self.columns, "column out of range");
    }
}

impl<T> IndexMut<(usize, usize)> for Matrix<T> {
    fn index_mut<'a>(&'a mut self, index: (usize, usize)) -> &'a mut T {
        let (r, c) = index;
        self.check(r, c);
        &mut self.vec[r * self.columns + c]
    }
}

impl<T> Index<(usize, usize)> for Matrix<T> {
    type Output = T;
    fn index<'a>(&'a self, index: (usize, usize)) -> &'a T {
        let (r, c) = index;
        self.check(r, c);
        &self.vec[r * self.columns + c]
    }
}

#[derive(Debug, Copy, Clone)]
struct EditPosition {
    idx: usize,
    prev: char,
    current: char,
}

impl EditPosition {
    fn new(i: usize, p: char, c: char) -> EditPosition {
        EditPosition {
            idx: i,
            prev: p,
            current: c,
        }
    }
    fn valid_twiddle(&self, other: &EditPosition) -> bool {
        self.idx > 1 && other.idx > 1 && self.current == other.prev && self.prev == other.current
    }

    fn valid_copy(&self, other: &EditPosition) -> bool {
        self.current == other.current
    }
}

use std::collections::HashMap;

#[derive(Debug, Copy, Clone, PartialEq, Eq, Hash)]
enum EditOp {
    Copy,
    Replace,
    Delete,
    Insert,
    Twiddle,
    Kill,
}

impl EditOp {
    fn valid(&self, i: EditPosition, j: EditPosition) -> bool {
        use EditOp::*;
        match *self {
            Copy    => i.valid_copy(&j),
            Twiddle => i.valid_twiddle(&j),
            _ => true,
        }
    }

    fn displacement(&self) -> (usize, usize) {
        use EditOp::*;
        match *self {
            Copy | Replace => (1, 1),
            Delete         => (1, 0),
            Insert         => (0, 1),
            Twiddle        => (2, 2),
            Kill           => (0, 0),
        }
    }

    fn dec(&self, d: (usize, usize)) -> (usize, usize) {
        let (i, j) = self.displacement();
        (d.0 - i, d.1 - j)
    }

    fn align<F, T>(&self,
                   mut from_itr: F,
                   mut to_itr: T,
                   new_from: &mut String,
                   new_to: &mut String)
        where F: Iterator<Item = char>,
              T: Iterator<Item = char>
    {
        fn add_white_space<I>(a: &mut String, b: &mut String, mut a_itr: I)
            where I: Iterator<Item = char>
        {
            a.push(a_itr.next().unwrap());
            b.push(' ');
        }

        use EditOp::*;
        match *self {
            Copy | Replace => {
                new_from.push(from_itr.next().unwrap());
                new_to.push(to_itr.next().unwrap());
            }
            Delete => add_white_space(new_from, new_to, from_itr),
            Insert => add_white_space(new_to, new_from, to_itr),
            _ => panic!("Operation is not suported"),
        }
    }
}

type CostMap = HashMap<EditOp, isize>;
type EditTable = Matrix<(EditOp, isize)>;

fn create_table(from_len: usize, to_len: usize, costs: &CostMap) -> EditTable {
    use EditOp::*;
    let i_cost = *(costs.get(&Insert)
        .expect("Insert most be in costs map"));
    let d_cost = *(costs.get(&Delete)
        .expect("Delete most be in costs map"));


    let mut table = EditTable::new((Kill, 0), from_len, to_len);
    for i in 1..from_len {
        table[(i, 0)] = (Delete, table[(i - 1, 0)].1 + d_cost);
    }
    for j in 1..to_len {
        table[(0, j)] = (Insert, table[(0, j - 1)].1 + i_cost);
    }
    table
}


fn edit_distance(from: &str, to: &str, costs: &CostMap) -> (isize, Operations) {
    let from_len  = from.chars().count();
    let to_len    = to.chars().count();
    let mut table = create_table(from_len + 1, to_len + 1, costs);

    let (mut ilast, mut jlast) = (' ', ' ');
    for (i, ichar) in from.chars().enumerate() {
        for (j, jchar) in to.chars().enumerate() {
            let ij   = (i + 1, j + 1);
            let ipos = EditPosition::new(ij.0, ilast, ichar);
            let jpos = EditPosition::new(ij.1, jlast, jchar);

            table[ij] = costs.iter()
                .filter(|&(o, _)| o != &EditOp::Kill)
                .filter(|&(o, _)| o.valid(ipos, jpos))
                .map(|(o, c)| (*o, table[o.dec(ij)].1 + c))
                .min_by_key(|&(_, c)| c)
                .expect("no operation was valid");
            jlast = jchar;
        }
        ilast = ichar;
    }
    add_kill(&table, costs, from_len, to_len)
}

fn add_kill(table: &EditTable,
            costs: &CostMap,
            from_len: usize,
            to_len: usize)
            -> (isize, Operations) {
    let no_kill_cost    = table[(from_len, to_len)].1;
    let (i, cost, kill) = costs.get(&EditOp::Kill)
        .map_or((from_len, no_kill_cost, None),
                |kc| min_with_kill(&table, from_len, to_len, *kc));

    (cost, build_operations(kill, &table, i, to_len))
}

type Operations = Vec<EditOp>;

fn min_with_kill(table: &EditTable,
                 from_size: usize,
                 to_size: usize,
                 kill_cost: isize)
                 -> (usize, isize, Option<EditOp>) {
    let no_kill_cost = table[(from_size, to_size)].1;
    (1..from_size)
        .map(|i| (i, table[(i, to_size)].1 + kill_cost))
        .map(|(i, c)| (i, c, Some(EditOp::Kill)))
        .chain(Some((from_size, no_kill_cost, None)).into_iter())
        .min_by_key(|&(_, cost, _)| cost)
        .unwrap()
}

fn build_operations(kill: Option<EditOp>, table: &EditTable, i: usize, j: usize) -> Operations {
    let itr = std::iter::repeat(())
        .scan((i, j), |s, _| {
            let op = table[*s].0;
            *s     = op.dec(*s);
            Some(op)
        })
        .take_while(|op| op != &EditOp::Kill);

    let mut stack: Vec<_> = kill.into_iter()
        .chain(itr)
        .collect();
    stack.reverse();
    stack
}


fn optimal_alignment(from: &str, to: &str, cost_map: &CostMap) -> (String, String) {
    let (_, ops)     = edit_distance(from, to, cost_map);
    let mut new_from = String::new();
    let mut new_to   = String::new();
    let mut from_itr = from.chars();
    let mut to_itr   = to.chars();

    for op in ops {;
        op.align(&mut from_itr, 
                 &mut to_itr, 
                 &mut new_from, 
                 &mut new_to);
    }
    (new_from, new_to)
}

fn main() {
    let mut cost_map = HashMap::new();
    cost_map.insert(EditOp::Delete, 2);
    cost_map.insert(EditOp::Insert, 2);
    cost_map.insert(EditOp::Copy, -1);
    cost_map.insert(EditOp::Replace, 1);

    let from   = "CBADDKDDBBTPPEP";
    let to     = "CGAEEDDKDDBBTEEFGKKK";
    let (f, t) = optimal_alignment(from, to, &cost_map);
    println!("{}\n{}", f, t);
}

No comments:

Post a Comment